BioHackathon Europe, Paris, France, 2022

Preprints

  • Exploring the landscape of the genomic wastewater surveillance ecosystem: a roadmap towards standardization

    The landscape of genomic wastewater surveillance in the context of infectious disease monitoring is rapidly evolving, and this came into sharp focus during the COVID-19 pandemic. Here we highlight the significance of wastewater surveillance as a passive monitoring system complementary to clinical genomic surveillance activities. Emphasizing the need for coordination, standardization, and the development of a unified catalog of software tools and services, we aim to streamline the implementation of end-to-end genomic wastewater surveillance pipelines. Key considerations such as defining variants, understanding antimicrobial resistance, and assessing viral fitness within the framework of wastewater surveillance are explored, linking to examples of respective tools and existing pipelines. The challenges of wastewater data analysis, the need for specialized tools and bioinformatics workflows, and the significance of integrated pipelines are also discussed in detail. The article presents case studies, including the V-pipe integrated bioinformatics workflow and the integration of tools into the Galaxy platform, underscoring their role in enhancing data analysis efficiency and standardization within the field. Overall, the review highlights the critical importance of continued research efforts to advance understanding and implementation of bioinformatic approaches in wastewater surveillance for the effective monitoring and management of infectious diseases.
  • BioHackEU22 Report: Enhancing Research Data Management in Galaxy and Data Stewardship Wizard by utilising RO-Crates

    This report describes the integration of RO-Crates into Data Stewardship Wizard and Galaxy during the BioHackathon Europe 2022, aiming to improve data management and sharing in scientific research. By utilizing RO-Crates, researchers can easily create machine-readable metadata for their datasets, ensuring long-term discoverability, accessibility, and reusability. The seamless integration of RO-Crates in these platforms enhances collaboration between researchers and institutions, facilitating data sharing and reuse across projects and domains. Future efforts may focus on enhancing RO-Crate’s interoperability with other standards and platforms, as well as promoting wider adoption through outreach and education initiatives to meet the evolving needs of researchers and institutions in data stewardship.
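
    At its core, an RO-Crate is a directory whose ro-crate-metadata.json file describes its contents as JSON-LD. The following is a minimal sketch of that kind of document, assuming hypothetical file names and descriptive values; production code would typically use a library such as ro-crate-py rather than writing the JSON by hand:

    ```python
    import json

    # Minimal RO-Crate 1.1 metadata document: a descriptor entity, the root
    # dataset, and one data entity. All names, dates and files are illustrative.
    crate_metadata = {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {   # descriptor: links the metadata file to the crate root
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {   # root dataset: the research artefact being described
                "@id": "./",
                "@type": "Dataset",
                "name": "Example workflow run",  # hypothetical
                "datePublished": "2022-11-11",
                "hasPart": [{"@id": "results.csv"}],
            },
            {   # a data entity contained in the crate
                "@id": "results.csv",
                "@type": "File",
                "encodingFormat": "text/csv",
            },
        ],
    }

    with open("ro-crate-metadata.json", "w") as fh:
        json.dump(crate_metadata, fh, indent=2)
    ```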
  • Infrastructure for synthetic health data

    Machine learning (ML) methods are becoming ever more prevalent across all domains of life sciences. However, a key component of effective ML is the availability of large datasets that are diverse and representative. In the context of health systems, with significant heterogeneity of clinical phenotypes and diversity of healthcare systems, there exists a necessity to develop and refine unbiased and fair ML models. Synthetic data are increasingly being used to protect the patient’s right to privacy and overcome the paucity of annotated open-access medical data. Here, we present our proof of concept for the generation of synthetic health data and our proposed FAIR implementation of the generated synthetic datasets. The work was developed during and after the one-week-long BioHackathon Europe by 20 participants (10 new to the project) from different countries (NL, ES, LU, UK, GR, FL, DE, ...).
  • Bioinforming

    Short courses are an optimal format for informing and engaging young students in novel biology-related fields. Training schools, e.g. those lasting for five days, can provide enough content to introduce students to an extensive overview of bioinformatics and scientific career opportunities. In this work, we define a five-day training school format tailored to three target groups of young students: high school students, undergraduate students in biology-related fields and undergraduate students in computational fields. We structure the content and sessions around learning areas consisting of learning topics, detailing the dependencies between them. For each learning topic, we define learning outcomes and learning activities. Moreover, we conceptualize a teaching platform to manage FAIRyfied (Findable, Accessible, Interoperable, Reusable) training materials that anyone will be able to use to design a new training school in bioinformatics.
  • BioHackEU22 Report for Project 31: The What & How in data management: Improving connectivity between RDMkit and FAIR Cookbook

    This report describes the work completed during the ELIXIR Biohackathon 2022 for project 31: The What & How in data management: Improving connectivity between RDMkit and FAIR Cookbook. The project covered 3 subjects: the technical connectivity between the two primary resources, an editorial alignment and gap analysis of their content, and the creation of user journeys incorporating the wider ELIXIR Research Data Management (RDM) ecosystem.
  • Operator dashboard for controlling the NeIC Sensitive Data Archive

    Human genome and phenome data is classified as special categories of data under the EU GDPR legislation (Art. 9 GDPR). This requires special care to be taken when processing and reusing this data for research. To enable this in a compliant way, a federated approach was applied to the existing European Genome-phenome Archive (EGA, https://ega-archive.org/) (Freeberg et al., 2022), creating the Federated EGA (FEGA, https://ega-archive.github.io/FEGA-onboarding/#what-is-federated-ega) (EGA Consortium, n.d.) in 2022. The Nordic countries Norway, Finland and Sweden, together with Spain and Germany, represent the first federated partners. In the Nordics we have collaborated on our own implementation for our federated EGA nodes. We have done this under the umbrella of the Nordic e-Infrastructure Collaboration (NeIC, https://neic.no/) (NeIC, n.d.), where we have had three projects over the last 7 years: Tryggve1 (NeIC, 2014-2017), Tryggve2 (NeIC, 2017-2020) and now Heilsa (NeIC, 2021-2024). As we in the Nordics now move into production, both system administrators and helpdesk staff need to be able to control and inspect the system. We need to answer operational questions and identify errors in order to better manage the services and infrastructure. To standardize this workflow and make the system easier to use, we decided to build a Minimum Viable Product (MVP) for such an “Operator Dashboard” during the ELIXIR BioHackathon 2022.
  • Onboarding suite for Federated EGA nodes

    The European Genome-phenome Archive (EGA) (Freeberg et al., 2022) (also known as Central EGA - cEGA) is a service for permanent archiving and sharing of personally identifiable genetic and phenotypic data resulting from biomedical research projects. The Federated EGA (EGA Consortium, n.d.), consisting of the Central and Federated EGA nodes, will be a distributed network of repositories for sharing human -omics data and phenotypes. Each node of the federation is responsible for its own infrastructure and the connection to the Central EGA. Currently, the adoption and deployment of a new federated node is challenging due to the complexity of the project and the diversity of technological solutions used to ensure the secure archiving of the data and the transfer of information between the nodes. The goal of this project was to develop an onboarding suite consisting of simple scripts, supplemented by documentation, that would help newcomers to the EGA federation understand the main concepts in depth, while enabling them to get involved in the development of the technology as quickly as possible. At the same time, we aimed to identify existing technologies and standards across FEGA nodes that can be used as a reference for upcoming nodes.
  • Enabling profile updates through the Data Discovery Engine (DDE)

    Bioschemas is a grassroots community effort to improve FAIRness of resources in the Life sciences by defining specific Life Science metadata schemas and exposing that metadata from resources that have adopted it. Now that some initial types have been adopted directly into schema.org, an improved mechanism is required to reignite community engagement and encourage profile development. The current process for creating or updating Bioschemas profiles and types is technical and convoluted, which creates accessibility issues that can hamper community participation. As adoption of Bioschemas grows and more of the Life Science community considers contributing specific types and profiles, a more accessible creation/modification process is necessary to avoid a loss in engagement. To address this issue, and to drive further Bioschemas adoption, the community has exploited the Data Discovery Engine (DDE) for profile and type development. DDE provides a schema registry and user-friendly tools for creating and editing schemas. The goal of this project is to update existing Bioschemas community profiles in a targeted and crowd-sourced manner, to add new profiles as required, and to ensure the documentation is fit for purpose to enable further Bioschemas contributions at scale.
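
    For orientation, conforming to a Bioschemas profile ultimately means a resource publishing JSON-LD markup along these lines; the sketch below is illustrative (the dataset values are hypothetical), with the conformsTo pattern following Bioschemas publishing guidance:

    ```python
    import json

    # Hedged sketch of page-level JSON-LD declaring conformance to a Bioschemas
    # profile. The profile URL pattern follows bioschemas.org; all descriptive
    # values are hypothetical.
    markup = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "http://purl.org/dc/terms/conformsTo": {
            "@id": "https://bioschemas.org/profiles/Dataset/1.0-RELEASE"
        },
        "name": "Example protein annotations",  # hypothetical
        "description": "Annotations exposed for harvesting by aggregators.",
        "url": "https://example.org/dataset/1",
    }

    # This JSON would be embedded in the page inside a
    # <script type="application/ld+json"> element.
    print(json.dumps(markup, indent=2))
    ```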
  • Streamlining data brokering from Research Data Management platforms to ELIXIR Repositories

    Mobilizing data from data producers to data deposition databases is an integral service that research data management (RDM) platforms could offer. However, brokering the heterogeneous mixture of scientific data requires systems that are compatible with the diverse (meta)data models of the different RDM platforms and the diverse submission routes of different domain- or technique-specific repositories. Existing tools for brokering research (meta)data in life sciences are often technique- or domain-specific and aimed at only one specific deposition database at a time, which does not reflect the way scientific projects are often conducted. As a result, infrastructure providers or research laboratories have to invest resources in manual curation and mapping of (meta)data in order to help researchers deposit their outputs into specialized repositories. This BioHackathon 2022 project focused on designing and implementing a prototype of a data brokering system from ISA-JSON to multiple ELIXIR Deposition Databases, starting with the European Nucleotide Archive (ENA). Specifically, we started from an ISA-JSON file exported from the DataHub, a metadata management platform (an instance of the FAIRDOM-SEEK software) which uses the well-established ISA (Investigation Study Assay) framework to describe multi-omics metadata and link to the location of data files. During this project we performed a high-level mapping of the ISA-JSON schema to the ENA XML files necessary for metadata submission. We also described a flexible, sustainable and domain/technique-agnostic brokering strategy from ISA-JSON to multiple ELIXIR deposition databases and developed a prototype of an EBI multi-repository converter tool.
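
    The flavour of such a mapping can be sketched as follows. The ISA-JSON keys used here follow the public ISA model, and the ENA elements are only the skeleton of a study submission; the prototype's actual mapping rules are richer than this sketch:

    ```python
    import json
    import xml.etree.ElementTree as ET

    # Lift a study title and description out of an ISA-JSON export and emit a
    # minimal ENA study XML (STUDY_SET/STUDY/DESCRIPTOR skeleton).
    with open("investigation.json") as fh:
        isa = json.load(fh)

    study = isa["studies"][0]  # ISA-JSON nests studies under the investigation

    study_set = ET.Element("STUDY_SET")
    study_el = ET.SubElement(study_set, "STUDY", alias=study["identifier"])
    descriptor = ET.SubElement(study_el, "DESCRIPTOR")
    ET.SubElement(descriptor, "STUDY_TITLE").text = study["title"]
    ET.SubElement(descriptor, "STUDY_ABSTRACT").text = study["description"]

    ET.ElementTree(study_set).write(
        "study.xml", xml_declaration=True, encoding="UTF-8"
    )
    ```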
  • An evaluation of EDAM coverage in the Tools Ecosystem and prototype integration of Galaxy and WorkflowHub systems

    Here we report the results of a project started at the BioHackathon Europe 2022. Its goals were to cross-compare and analyze the metadata centralized in the Tools Ecosystem and linked to the EDAM ontology, and to explore methods for connecting tools used in registered Galaxy workflows (i.e. WorkflowHub entries) to the annotations available in bio.tools.
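
    As a hedged example of the kind of lookup such a connection relies on, EDAM annotations for a single tool can be retrieved from the public bio.tools API roughly as follows; the tool ID is just an example, and the response fields assume the biotoolsSchema JSON layout:

    ```python
    import requests

    # Fetch one bio.tools entry and pull out its EDAM topic and operation terms.
    resp = requests.get(
        "https://bio.tools/api/tool/jalview/?format=json", timeout=30
    )
    resp.raise_for_status()
    entry = resp.json()

    # EDAM topics sit at the entry level; operations are nested under functions.
    topics = [t["term"] for t in entry.get("topic", [])]
    operations = [
        op["term"]
        for function in entry.get("function", [])
        for op in function.get("operation", [])
    ]
    print(topics, operations)
    ```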
  • Empowering the community with notebooks for bespoke microbiome analyses

    MGnify is EMBL-EBI’s metagenomics resource. MGnify’s recently launched Notebook Server provides an online Jupyter Lab environment for users to explore programmatic access to MGnify’s datasets using Python or R. Here, we report several developments to the Notebook Server completed during the BioHackathon Europe 2022. The developments range from establishing an instance of the notebooks on the Galaxy platform, to adding new notebooks and Jupyter UI extensions enabling more users to perform downstream analysis tasks on MGnify’s extensive metagenomics datasets.
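
    Underneath the notebooks sits MGnify's public REST API; a minimal sketch of the programmatic access they wrap looks like the following, assuming the API's JSON:API payload conventions (the attribute name shown is an assumption):

    ```python
    import requests

    # List a handful of MGnify studies via the public REST API.
    BASE = "https://www.ebi.ac.uk/metagenomics/api/v1"

    resp = requests.get(f"{BASE}/studies", params={"page_size": 5}, timeout=30)
    resp.raise_for_status()

    for study in resp.json()["data"]:  # JSON:API formatted payload
        attrs = study["attributes"]
        print(study["id"], attrs.get("study-name"))  # attribute name assumed
    ```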
  • BioHackEU22 Report for Project 16: Make your own or favourite software available on your cluster with EasyBuild/EESSI

    EasyBuild is a community effort to develop a software build and installation framework that allows you to manage (scientific) software on High Performance Computing (HPC) systems in an efficient way. As its name suggests, EasyBuild makes software installation easy by automating builds, making previous builds reproducible, resolving dependencies, and retaining logs for traceability. It is also one of the components of the European Environment for Scientific Software Installations (EESSI), a collaboration between different European HPC sites and industry partners with the common goal of setting up a shared repository of scientific software installations that can be used on a variety of operating systems and computer architectures. It can be applied on a full-size HPC cluster, in a cloud environment, in a container or on a personal workstation. With the deluge of data in the genomics field (e.g., clinical data) and the concomitant development of new technologies, the number of data analysis software tools has exploded in recent years. The fields of bioinformatics and cheminformatics follow this same trend, with ever more developments to optimize and parallelize analyses. The bioinformatics field is now the main provider of new software in EasyBuild. The developers of those tools are not always professional developers, and therefore do not always follow best practices when releasing their software. As a result, many tools are complicated to install, making them ideal candidates for porting their installation to EasyBuild so that they become more easily accessible to end users. We propose to introduce users to EasyBuild and EESSI, and to port new software to EasyBuild/EESSI (e.g., the participant’s own or favourite software), thereby making it available and discoverable to the entire EasyBuild community. In parallel we would like to build bridges between EESSI and Galaxy to make the scientific software more accessible to researchers in the domain.
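
    For illustration, porting a tool to EasyBuild amounts to writing an easyconfig, a small Python-syntax recipe. Everything in this skeleton is hypothetical rather than a real repository entry:

    ```python
    # Skeletal easyconfig for a hypothetical bioinformatics tool.
    easyblock = 'ConfigureMake'

    name = 'ExampleTool'
    version = '1.0.0'

    homepage = 'https://example.org/exampletool'
    description = "Hypothetical sequence-analysis tool, used only to illustrate the recipe format."

    toolchain = {'name': 'GCC', 'version': '11.3.0'}

    source_urls = ['https://example.org/downloads/']
    sources = [SOURCE_TAR_GZ]  # template constant; expands to ExampleTool-1.0.0.tar.gz
    checksums = ['<sha256 of the tarball>']

    dependencies = [('zlib', '1.2.12')]

    # Post-install check: the build must have produced this binary.
    sanity_check_paths = {
        'files': ['bin/exampletool'],
        'dirs': [],
    }

    moduleclass = 'bio'
    ```

    Running `eb ExampleTool-1.0.0-GCC-11.3.0.eb --robot` would then resolve and build the dependency chain and expose the tool as an environment module.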
  • Validating Subtype Specific Oncology Drug Predictions

    There is an impressive number of data and code reproducibility initiatives, both within Europe and across the world. To motivate researchers to use this amazing infrastructure, we must show the translational research community that these initiatives are able to drive change in translational science. Here we demonstrate that, using public datasets, it is feasible to build a pipeline for proposing and validating driver-mutation- and subtype-specific colorectal cancer medications. Molecular, clinical and chemical name harmonization were all necessary, but the open data and code initiatives, though varied in their approaches, made this project possible.
  • BioHackEU22 Project 22: Plant data exchange and standard interoperability

    Status of discussions around data standard formats for plant sciences and their interoperability at the BioHackathon Europe 2022 in Paris.
  • An ETL pipeline to construct the Intrinsically Disordered Proteins Knowledge Graph (IDP-KG) using Bioschemas JSON-LD data dumps

    Schema.org and Bioschemas are lightweight vocabularies that aim at making the contents of web pages machine-readable so that software agents can consume that content and understand it in an actionable way. Due to the time needed to process each page, extracting markup by visiting each page of a site is not practical for huge sites. This approach imposes processing requirements on the publisher and the consumer. In February 2022, the Schema.org community proposed a method for exchanging markup from various pages as a DataFeed published at a recognized address. This would ease publisher and consumer processing requirements and accelerate data collection. In this work, we report on the implementation of a JSON-LD consumer ETL (Extract-Transform-Load) pipeline that enables data dumps to be ingested into knowledge graphs (KG). The pipeline loads scraped JSON-LD from three sources, converts it to RDF, applies SPARQL CONSTRUCT queries to map the source RDF to a unified Bioschemas-based model, and stores the resulting KG as a Turtle file. This work was conducted during the one-week BioHackathon Europe 2022 in Paris, France, under Project 23, titled “Publishing and Consuming Schema.org DataFeeds.”
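
    Condensed to its skeleton, the pipeline chains three rdflib steps, sketched below assuming rdflib 6+ (which parses JSON-LD natively); the CONSTRUCT query is a stand-in, not the project's actual Bioschemas mapping:

    ```python
    from rdflib import Graph

    # Placeholder mapping query: copies schema:Protein names through unchanged.
    MAPPING_QUERY = """
    PREFIX schema: <https://schema.org/>
    CONSTRUCT { ?protein a schema:Protein ; schema:name ?name . }
    WHERE     { ?protein a schema:Protein ; schema:name ?name . }
    """

    # Extract: load one scraped JSON-LD dump into an RDF graph.
    source = Graph().parse("scraped-dump.jsonld", format="json-ld")

    # Transform: a CONSTRUCT query result iterates as triples in rdflib.
    unified = Graph()
    for triple in source.query(MAPPING_QUERY):
        unified.add(triple)

    # Load: persist the unified knowledge graph as Turtle.
    unified.serialize(destination="idp-kg.ttl", format="turtle")
    ```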
  • Enhancement and Reusage of Biomedical Knowledge Graph Subsets

    Knowledge Graphs (KGs) such as Wikidata act as hubs of information from multiple domains and disciplines, crowdsourced by multiple stakeholders. The vast amount of available information makes it difficult for researchers to manage the entire KG, which is also continually being edited. It is necessary to develop tools that extract subsets for domains of interest. These subsets will help researchers to reduce costs and time, making data of interest more accessible. In the last two BioHackathons (BH20, BH21), we created prototypes to extract subsets, easily applicable to Wikidata, and mapped the different approaches used to tackle this problem. Building on those outcomes, we aim to enhance both the definition of subsets, using Entity Schemas based on Shape Expressions (ShEx), and their extraction algorithms, with a special focus on the biomedical domain. Our first aim is to develop complex subsetting patterns based on qualifiers and references to enhance the credibility of datasets. Our second aim is to establish a faster subsetting extraction platform applying new algorithms based on Apache Spark and new tools like a document-oriented DBMS platform.
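
    To give a flavour of such reference-aware patterns, the sketch below checks that an item's statements carry at least one reference. Both the ShEx shape and the PyShEx calls are illustrative assumptions (property and item IDs are example Wikidata ones, and the PyShEx API usage follows its documented pattern), not our final schemas:

    ```python
    from pyshex import ShExEvaluator  # assumption: PyShEx is installed

    # Example entity schema: an item must be an instance of gene (Q7187), and
    # any ortholog (P684) statement must carry at least one reference.
    SCHEMA = """
    PREFIX wd:   <http://www.wikidata.org/entity/>
    PREFIX wdt:  <http://www.wikidata.org/prop/direct/>
    PREFIX p:    <http://www.wikidata.org/prop/>
    PREFIX prov: <http://www.w3.org/ns/prov#>

    start = @<#gene>
    <#gene> {
      wdt:P31 [ wd:Q7187 ] ;                  # instance of: gene
      p:P684 { prov:wasDerivedFrom . + } *    # each statement needs a reference
    }
    """

    # Validate one focus node from a local RDF subset dump (file name and
    # focus IRI are placeholders).
    results = ShExEvaluator(
        rdf=open("subset.ttl").read(),
        schema=SCHEMA,
        focus="http://www.wikidata.org/entity/Q42",
    ).evaluate()

    for r in results:
        print(r.focus, "conforms" if r.result else r.reason)
    ```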
  • CWLD: Mapping colloquial wet lab language to ontologies

    The use of ontology terms can make data more FAIR and tractable by machines. However, the highly formalised terminology used by these ontology terms does not always match the colloquial language used by practitioners. This disparity can (a) make it difficult for practitioners to understand the language used in knowledge stored in ontologies; and (b) make it difficult to machine-interpret information written by practitioners in order to map it to ontologies. This problem is particularly relevant in the ELIXIR Microbial Biotechnology (MB) community: although the domain has adopted ontologies and data standards such as SO, SBO, GO, and SBOL for data representation, the tools developed often use ontology terms directly rather than the language used in the wet lab (i.e., by the people using the tools). At the BioHackathon 2022 in Paris, France, we initiated an effort to address this problem by (a) mining the internet for colloquial language used by biologists; (b) constructing a dictionary (CWLD: colloquial wet lab dictionary) of this language and its mappings to ontology terms; and (c) constructing a table of the occurrences of different terminology used in MB tools and resources. While initially developed to serve the MB community, we hope that the dictionary will serve as a helpful resource for anyone hoping to map colloquial wet lab language to ontology terms, e.g. for text mining applications.
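
    In spirit, the dictionary is a many-to-many mapping from colloquial phrases to candidate ontology terms. A toy slice might look like the following; the specific mappings shown are illustrative, not actual CWLD entries:

    ```python
    # Toy slice of a colloquial-phrase-to-ontology-term dictionary. Each entry
    # lists (ontology, term ID, term label) candidates; IDs below are examples
    # only and should not be taken as verified mappings.
    CWLD = {
        "knock out": [("SO", "SO:0000159", "deletion")],
        "miniprep": [("OBI", "OBI:0000000", "plasmid DNA extraction")],   # hypothetical ID
        "run a gel": [("OBI", "OBI:0000001", "gel electrophoresis")],     # hypothetical ID
    }

    def lookup(phrase: str):
        """Return candidate ontology mappings for a colloquial phrase."""
        return CWLD.get(phrase.lower().strip(), [])

    print(lookup("Knock out"))
    ```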