BioHackrXiv Preprints

Executing workflows in the cloud with WESkit

2023-02-21T00:00:00+00:00

With the exponential increase in genomic data, analyzing and processing large datasets has become a challenging task in healthcare. To address this issue, the Global Alliance for Genomics and Health (GA4GH) has proposed a set of community standards for enabling the adoption of FAIR principles for data, software, and infrastructure. These standards promote the concept of sending analysis and processing workflows to the data rather than transferring large datasets, thereby increasing efficiency and data security. In this paper, we present the outcomes of the ELIXIR Biohackathon 2021 project, where we worked on our software WESkit, which implements the GA4GH WES standard for running Snakemake and Nextflow workflows. During the hackathon, we implemented basic GA4GH TRS support, deployed a cloud platform, and added S3 support for downloading result files.

CiTO support for BioHackrXiv

2023-02-03T00:00:00+00:00

In this paper we present the work executed on BioHackrXiv during the international ELIXIR BioHackathon in Barcelona, Spain, 2021.

Addressing sex bias in biological databases worldwide

2023-02-02T00:00:00+00:00

Precision medicine aims at tailoring treatments to individual patient needs. In this context, artificial intelligence (AI)-based technologies are viewed as revolutionary since they have the capacity to identify key features that link genomic and phenotypic traits at the individual level. AI techniques therefore depend on the quantity and quality of patient data. When variables like sex, age, or race are missing from sample records, predictions can be biased because these variables are not considered when training the AI algorithm. To address this, the European Genome-phenome Archive (EGA) took action in 2018 and put into place a rule that requires data providers to declare the sex of donor samples uploaded into their repository, to improve data quality and prevent the spread of biased results. In this work we quantified biases in sex classification over time in human data from studies deposited in EGA and the database of Genotypes and Phenotypes (dbGaP), which represents the EGA’s equivalent in the USA. The main result is that the EGA policy is effective in reducing sex classification biases: significantly fewer samples are classified as unknown after 2018 in this repository than in dbGaP. Additionally, we qualitatively assessed public opinion on this issue. A survey addressed to users, creators, maintainers, and developers of biological databases revealed that specialized training and additional knowledge about diversity criteria are required. Based on our findings, we raise awareness of sample bias problems and provide a list of recommendations for enhancing biomedical research practices.

Mapping OHDSI OMOP Common Data Model and GA4GH Phenopackets for COVID-19 disease epidemics and analytics

2022-11-26T00:00:00+00:00

The COVID-19 crisis demonstrates a critical requirement for rapid and efficient sharing of data to facilitate the global response to this and future pandemics. Our project aims to enhance interoperability between health and research data by mapping Phenopackets and OMOP schemas, and to represent COVID-19 metadata using the FAIR principles to enable discovery, integration and analysis of genotypic and phenotypic data. Here, we present our outcomes after one week of BioHacking that brought together 17 participants (10 new to the project) from different countries (CH, US, and the EU) and continents.

Bioschemas data harvesting project report

2022-03-25T00:00:00+00:00

The promise of Bioschemas is that it makes consuming data from multiple resources more straightforward. However, this hypothesis has not yet been tested by conducting a large-scale harvest of deployed markup and making it available for others to reuse. Therefore, the goal of this hackathon project is to harvest a collection of Bioschemas markup from a number of different sites listed on the Bioschemas live deploys page using the Bioschemas Markup Scraper and Extractor (BMUSE). The harvested data will be made available for others and loaded into a triplestore to allow for further exploration.

DS Wizard Meets DAISY: A Romance Solving Data Protection Requirements in Data Management Planning

2021-12-16T00:00:00+00:00

This report summarises our activities and achievements in integrating the Data Stewardship Wizard (DSW) and Data Information System (DAISY) tools during the ELIXIR BioHackathon Europe 2021. As a data information system for GDPR compliance, DAISY is focused on a single goal – gathering all information required for GDPR accountability of biomedical research projects. On the other hand, DSW is very flexible and can be used beyond data management planning. We worked on the integration between both tools on two fronts. Firstly, we created a new Knowledge Model in DSW together with a document output template to be able to generate a data protection impact assessment (DPIA). Secondly, we introduced a new integration type between projects in DSW and DAISY that allows the querying of DAISY data upon document generation in DSW. Both of these independent activities brought successful results that were polished and published after the actual BioHackathon. Finally, we provide the related materials as an on-demand training course in the ELIXIR eLearning Platform.

Network analysis of specimen co-collection

2021-12-07T00:00:00+00:00

We took data on the collectors of specimens from natural history collections. Co-collectors of specimens were extracted from the data and a network of co-collection was constructed. This network was used to analyze the age and gender balance of collectors and how this has changed with time. Men outnumber women in the network, but women's participation increases with time, as does the number of all-female pairs of collectors. Most collector pairs have an age difference of less than 50 years, and it is suggested that co-collections above this age difference should be checked for errors. This project has proven the value of analyzing co-collection data, but also highlighted the many additional avenues for future research on this subject.
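
As a rough illustration of the co-collection idea described above, the following minimal sketch builds a weighted co-collection edge list and flags pairs whose age difference exceeds 50 years. It is not the authors' pipeline: the record layout, the collector names, the birth years, and the function names `co_collection_edges` and `suspicious_pairs` are all hypothetical, and the difference in birth years is assumed to stand in for the age difference.

```python
from collections import Counter
from itertools import combinations

# Hypothetical specimen records; real collection data would need name
# disambiguation before this step.
specimens = [
    {"id": "S1", "collectors": ["A. Smith", "B. Jones"]},
    {"id": "S2", "collectors": ["A. Smith", "B. Jones", "C. Lee"]},
    {"id": "S3", "collectors": ["C. Lee"]},
]

# Hypothetical birth years, used as a proxy for collector age.
birth_year = {"A. Smith": 1850, "B. Jones": 1910, "C. Lee": 1880}


def co_collection_edges(records):
    """Count how often each unordered pair of collectors shares a specimen."""
    edges = Counter()
    for rec in records:
        # Sort and deduplicate so each pair has one canonical key.
        for a, b in combinations(sorted(set(rec["collectors"])), 2):
            edges[(a, b)] += 1
    return edges


def suspicious_pairs(edges, birth_year, max_gap=50):
    """Return pairs whose birth years differ by more than max_gap years."""
    flagged = []
    for a, b in edges:
        if a in birth_year and b in birth_year:
            if abs(birth_year[a] - birth_year[b]) > max_gap:
                flagged.append((a, b))
    return sorted(flagged)


edges = co_collection_edges(specimens)
flagged = suspicious_pairs(edges, birth_year)
print(edges)
print(flagged)
```

With these made-up records, the Smith and Jones pair is 60 years apart and would be the one queued for manual checking. Pairs with a missing birth year are skipped rather than flagged.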