Meetings

Recent preprints

  • Comparison of orthology finding tools using plant genes

    Orthology finding tools are valuable for analyzing biological data from multiple species and predicting the functions of uncharacterized genes. Although several tools are available for this purpose, the characteristics of their results for plant genes have not been well compared. In this hackathon, we examined three tools (OMA, OrthoDB, and Ensembl Plants) by extracting ortholog pairs between Arabidopsis and soybean and analyzing each result, focusing on five plant genes with varying degrees of conservation. We observed that changes in the taxonomic ranges of OMA and OrthoDB affected ortholog detection, and the range of ortholog detection across the three tools was inconsistent, suggesting the importance of comparing multiple tools to obtain more accurate information on orthologs.
  • Streamlining data brokering from Research Data Management platforms to ELIXIR Repositories

    Mobilizing data from data producers to data deposition databases is an integral service that research data management (RDM) platforms could offer. However, brokering the heterogeneous mixture of scientific data requires systems that are compatible with the diverse (meta)data models of the different RDM platforms, and with the diverse submission routes of different domain- or technique-specific repositories. Existing tools for brokering research (meta)data in the life sciences are often technique- or domain-specific and aimed at only one deposition database at a time, which does not reflect the way scientific projects are often conducted. As a result, infrastructure providers or research laboratories have to invest resources in manual curation and mapping of (meta)data in order to help researchers deposit their outputs into specialized repositories. This BioHackathon 2022 project focused on designing and implementing a prototype of a data brokering system from ISA-JSON to multiple ELIXIR Deposition Databases, starting with the European Nucleotide Archive (ENA). Specifically, we started from an ISA-JSON file exported from the DataHub, a metadata management platform (an instance of the FAIRDOM-SEEK software) which uses the well-established ISA (Investigation Study Assay) framework to describe multi-omics metadata and link to the location of data files. During this project we performed a high-level mapping of the ISA-JSON schema to the ENA XML files necessary for metadata submission. We also described a flexible, sustainable and domain/technique-agnostic brokering strategy from ISA-JSON to multiple ELIXIR deposition databases and developed a prototype of an EBI multi-repositories converter tool.
  • Executing workflows in the cloud with WESkit

    With the exponential increase in genomic data, analyzing and processing large datasets has become a challenging task in healthcare. To address this issue, the Global Alliance for Genomics and Health (GA4GH) has proposed a set of community standards for enabling the adoption of FAIR principles for data, software, and infrastructure. These standards promote the concept of sending analysis and processing workflows to the data rather than transferring large datasets, thereby increasing efficiency and data security. In this paper, we present the outcomes of the ELIXIR Biohackathon 2021 project, where we worked on our software WESkit, which implements the GA4GH WES standard for running Snakemake and Nextflow workflows. During the hackathon, we implemented basic GA4GH TRS support, deployed a cloud platform, and added S3 support for downloading result files.
  • An evaluation of EDAM coverage in the Tools Ecosystem and prototype integration of Galaxy and WorkflowHub systems

    Here we report the results of a project started at the BioHackathon Europe 2022. Its goals were to cross-compare and analyze the metadata centralized in the Tools Ecosystem, and linked to the EDAM ontology, as well as to explore methods for connecting tools used in registered Galaxy workflows (i.e. WorkflowHub entries) to the annotations available in bio.tools.
  • Empowering the community with notebooks for bespoke microbiome analyses

    MGnify is EMBL-EBI’s metagenomics resource. MGnify’s recently launched Notebook Server provides an online JupyterLab environment for users to explore programmatic access to MGnify’s datasets using Python or R. Here, we report several developments to the Notebook Server completed during the BioHackathon Europe 2022. The developments range from establishing an instance of the notebooks on the Galaxy platform, to adding new notebooks and Jupyter UI extensions enabling more users to perform downstream analysis tasks on MGnify’s extensive metagenomics datasets.
  • CiTO support for BioHackrXiv

    In this paper we present the work executed on BioHackrXiv during the international ELIXIR BioHackathon in Barcelona, Spain, 2021.
  • Addressing sex bias in biological databases worldwide

    Precision medicine aims at tailoring treatments to individual patient needs. In this context, artificial intelligence (AI)-based technologies are viewed as revolutionary since they have the capacity to identify key features that link genomic and phenotypic traits at the individual level. AI techniques therefore depend on the quantity and quality of patient data. When variables like sex, age, or race are ignored in sample records, predictions can be biased because those variables are not considered in the training of the AI algorithm. To this end, the European Genome-phenome Archive (EGA) took action in 2018 and put into place a rule that requires data providers to declare the sex of donor samples uploaded into their repository to improve data quality and prevent the spread of biased results. In this work we quantified biases in sex classification over time in human data from studies deposited in EGA and the database of Genotypes and Phenotypes (dbGaP), which represents the EGA’s equivalent in the USA. The main result is that the EGA policy is effective in fighting sex classification biases because there are significantly fewer samples classified as unknown after 2018 in this repository than in dbGaP. Additionally, we qualitatively assessed public opinion on this issue. A survey addressed to users, creators, maintainers, and developers of biological databases revealed that specialized training and additional knowledge about diversity criteria are required. Based on our findings, we raise awareness of sample bias problems and provide a list of recommendations for enhancing biomedical research practices.