BioHackathon Europe, Barcelona, Spain, 2023

Preprints

  • Ontologies for single-cell experiments

    Research data management is becoming increasingly important in the scientific community. A critical challenge in this field is making research data FAIR (findable, accessible, interoperable and reusable; Wilkinson et al., 2016). Metadata plays a vital role in this challenge, as it allows researchers to accurately understand and recreate experiments. Various approaches are being taken to tackle this challenge, including the development of domain-overarching and domain-specific standards. In the different scientific communities, multiple general as well as domain-specific minimum information standards have been developed, such as MIAPPE (Ćwiek-Kupczyńska, 2016), the minimum information about a plant phenotyping experiment; MIAME (H. Brazma A., 2001), the minimum information about a microarray experiment; and MINSEQE (B. Brazma A., 2012), the minimum information about a high-throughput sequencing experiment. These standards are designed to describe specific types of experiments. Recently, a minimum information standard for single-cell experiments, minSCe (Minimum Information about a Single-Cell Experiment), has been introduced (Füllgrabe et al., 2020). However, it is not yet widely applied. Minimum information standards are an important part of the solution and should be built upon. In addition, the use of controlled vocabularies and ontology terms is essential. Ontology terms have a persistent identifier, an expressive name and a curated definition. Using these terms enables different researchers to understand and recreate annotated experiments. In this BioHackathon Europe project, we propose to expand biological, experimental and technical metadata schemas as well as ontologies for single-cell experiments across domains, with a focus on transcriptomics. This will facilitate the sharing and reuse of single-cell data and promote collaboration among researchers in different domains. Our goal is to improve data management practices and enhance the reproducibility of single-cell research.
  • How to improve the annotation of Galaxy resources? Outcomes of an online hackathon for improving the annotation of Galaxy resources for microbial data resources

    Galaxy hosts a vast array of tools, tutorials, and workflows, with the exact number of workflows remaining uncertain. To address the challenge of enhancing tool visibility within this expansive ecosystem, a pipeline called the Galaxy Tool Metadata Extractor was created during BioHackathon Europe 2023. This pipeline aggregates Galaxy tool suites from various sources, automatically extracts metadata such as bio.tools identifiers and EDAM ontology terms, and presents the information in an interactive table. Users can filter this table to find tools relevant to their research community. During development, it was noted that many tools lack EDAM annotations. The microGalaxy community started an effort to update 50+ microbial-related Galaxy tools, link them to their respective bio.tools entries, and collectively peer-review the results. However, that was far from enough to properly annotate all Galaxy tools and other types of Galaxy resources, such as training material and workflows, which are also not properly annotated using ontologies like EDAM. Annotating all of these resources would improve their findability and also allow resources covering similar topics to be aggregated and displayed together. To facilitate this work and provide a proof of concept for other communities, the microGalaxy community organized an online hackathon in April 2024. During this hackathon, 41 new bio.tools entries were created, 85 Galaxy tool suites were linked to bio.tools and EDAM terms, and 33 tutorials were annotated with EDAM terms. Some microbial-related Galaxy tools were improved, and new features and improvements were added to the Galaxy Tool Metadata Extractor. The hackathon was successful, with outcomes beyond the initial expectations.
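
    The filtering step described above can be illustrated with a minimal sketch. It is not the Galaxy Tool Metadata Extractor's code: the `ToolSuite` schema, the helper names and the example rows are all hypothetical, chosen only to show how a metadata table can be filtered by EDAM topic and how unannotated tools can be listed for curation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToolSuite:
    """One row of a tool metadata table (hypothetical, simplified schema)."""
    name: str
    biotools_id: Optional[str] = None
    edam_topics: List[str] = field(default_factory=list)


def filter_by_topic(tools, topic):
    """Return the tool suites annotated with the given EDAM topic label."""
    return [t for t in tools if topic in t.edam_topics]


def missing_edam(tools):
    """Return the tool suites that carry no EDAM annotation at all."""
    return [t for t in tools if not t.edam_topics]


# Invented example rows; a real table would come from the pipeline itself.
TOOLS = [
    ToolSuite("suite_a", "tool-a", ["Metagenomics", "Microbiology"]),
    ToolSuite("suite_b", "tool-b", ["Proteomics"]),
    ToolSuite("suite_c"),  # no bio.tools link and no EDAM terms yet
]

if __name__ == "__main__":
    print([t.name for t in filter_by_topic(TOOLS, "Metagenomics")])
    print([t.name for t in missing_edam(TOOLS)])
```

    A list such as the one returned by `missing_edam` is the kind of worklist an annotation hackathon could start from.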
  • Synergising ELIXIR resources for training in systems biology

    Systems biology (SB) is a new ELIXIR community that aims to utilise the ELIXIR ecosystem, such as the Training eSupport System (TeSS) and bio.tools, a registry of software tools and data resources for life sciences. One of the main initial objectives of the SB community is to create an SB-themed domain hosted by TeSS, encompassing SB-related ELIXIR services and events, in a fully automated way. Most content in TeSS is sourced through automated aggregation (“scraping”) of external sources containing resources marked up with semantic metadata, like Bioschemas. Currently, TeSS cannot recognize references to bio.tools identifiers from a Bioschemas-annotated resource, so the number of resources linked to bio.tools is relatively low. In this project, we will focus on selected SB disciplines from the priority areas of the ELIXIR SB community to integrate and cross-link related ELIXIR products: training events, training materials, computational and bioinformatics tools, databases and services from the bio.tools registry. This will be achieved using suitable ontologies identified by the SB community and by careful curation of SB-related materials. We aim to extend this work to other ELIXIR products such as lists of trainers, related ELIXIR Innovation and Industry events and publications. This will serve as a pilot project leading to broader integration with other SB disciplines, and will be of interest to several other ELIXIR communities.
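
    To make the missing link concrete, the sketch below shows one way bio.tools identifiers could be recognised in Bioschemas-style JSON-LD. It is an illustration only, not TeSS code: the example markup, the use of the `mentions` property and the URL pattern for bio.tools entries are assumptions made for this example.

```python
import json
import re

# Assumed URL shape of a bio.tools entry: https://bio.tools/<identifier>
BIOTOOLS_URL = re.compile(r"^https?://bio\.tools/([A-Za-z0-9_.\-]+)/?$")


def find_biotools_ids(node):
    """Recursively collect bio.tools identifiers from any string value."""
    ids = set()
    if isinstance(node, dict):
        for value in node.values():
            ids |= find_biotools_ids(value)
    elif isinstance(node, list):
        for item in node:
            ids |= find_biotools_ids(item)
    elif isinstance(node, str):
        match = BIOTOOLS_URL.match(node.strip())
        if match:
            ids.add(match.group(1))
    return ids


# Hypothetical markup for a training material; not taken from a real page.
EXAMPLE_JSONLD = """
{
  "@context": "https://schema.org/",
  "@type": "LearningResource",
  "name": "Hypothetical systems biology tutorial",
  "url": "https://example.org/tutorial",
  "mentions": [
    {"@type": "SoftwareApplication", "@id": "https://bio.tools/example_tool"},
    {"@type": "SoftwareApplication", "url": "https://bio.tools/another-tool"}
  ]
}
"""

if __name__ == "__main__":
    print(sorted(find_biotools_ids(json.loads(EXAMPLE_JSONLD))))
```

    Scanning every string value keeps the sketch independent of which Bioschemas property a provider uses to point at a tool, at the cost of also picking up bio.tools URLs that appear in unrelated fields.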
  • Enhancing the image analysis community in Galaxy

    Project 6 during the 2023 BioHackathon Europe in Barcelona focused on “Enhancing the image analysis community in Galaxy.” Despite Galaxy’s strong presence in genomics and proteomics, its image analysis tools and workflows are currently scattered. This project aimed to gather efforts in image analysis across fields to build a robust interdisciplinary community.
  • Secure data-out API - enabling encrypted htsget transactions

    The European Genome-phenome Archive (EGA) is a service for archiving and sharing personally identifiable genetic and phenotypic data, while the Genomic Data Infrastructure (GDI) project is enabling access to genomic and related phenotypic and clinical data across Europe. Both projects are focused on creating federated and secure infrastructure for researchers to archive and share data with the research community, to support further research. This project proposal focuses on the data access part of the infrastructure. The files are encrypted in the archives using the crypt4gh standard. Currently, there are data access processes in which the files are either decrypted on the server side and then transferred to the user, or re-encrypted server-side and provided to the user in an outbox. Htsget as a data access protocol also allows access to parts of files, but there are currently no production-level client tools that support access to encrypted data. Our goal is to create a client tool that can access encrypted data over the htsget protocol. It should also work with the GA4GH Passport and Visa standard so we can then enhance the security of our data access interfaces. We will also modify htsget-rs, a Rust htsget server, and crypt4gh-rust as required to support the aforementioned standards. Finally, there will be an effort to implement this feature in already existing tools, like samtools and IGV.
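
    As background for the client side, an htsget server answers a request with a JSON "ticket" listing the blocks a client must fetch and concatenate, either as inline `data:` URIs or as remote URLs with optional request headers. The sketch below only turns such a ticket into an ordered download plan; it performs no network access and no Crypt4GH decryption, and the example ticket, URL and token placeholder are invented for illustration.

```python
import base64
from urllib.parse import unquote_to_bytes


def decode_data_uri(uri):
    """Decode the payload of a data: URI (base64 or percent-encoded)."""
    header, _, payload = uri[len("data:"):].partition(",")
    if header.endswith(";base64"):
        return base64.b64decode(payload)
    return unquote_to_bytes(payload)


def plan_download(ticket):
    """Turn an htsget ticket into ordered ('inline' | 'remote') steps."""
    steps = []
    for entry in ticket["htsget"]["urls"]:
        url = entry["url"]
        if url.startswith("data:"):
            steps.append(("inline", decode_data_uri(url)))
        else:
            # The client must send these headers when fetching the block.
            steps.append(("remote", url, entry.get("headers", {})))
    return steps


# Invented ticket; a real one would be returned by an htsget server.
EXAMPLE_TICKET = {
    "htsget": {
        "format": "BAM",
        "urls": [
            {"url": "data:application/vnd.ga4gh.bam;base64,"
                    + base64.b64encode(b"HEADER").decode("ascii")},
            {"url": "https://example.org/data/1",
             "headers": {"Authorization": "Bearer <token>",
                         "Range": "bytes=0-1023"}},
        ],
    }
}

if __name__ == "__main__":
    for step in plan_download(EXAMPLE_TICKET):
        print(step[0], step[1] if step[0] == "remote" else len(step[1]))
```

    A client for encrypted data would additionally need to decrypt the concatenated blocks with a Crypt4GH implementation and the recipient's key, which is the part this project set out to address.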
  • How to increase the findability, visibility, and impact of Galaxy tools for your scientific community

    The scale and diversity of available software options in the Galaxy ecosystem can make domain- or community-specific discovery of software challenging. Here, we present a semi-automated and reusable pipeline for creating tailored interactive tables that list the identity and metadata (e.g. bio.tools, EDAM) available for Galaxy tools in a specific community (e.g. microGalaxy, imaging). In addition, we describe an annotation framework to improve the quality of the table contents, and training material to support the reuse of both the pipeline and table by additional communities. The sum of these contributions is expected to make it easier for Galaxy users to discover and understand the software within their research area, improve the annotation of these software resources, and allow other domains to enable equivalent discovery processes for their community. This work is the outcome of a BioHackathon Europe 2023 project.
  • BioHackEU23: FAIR Workflow Execution with WfExS and Workflow Run Crate

    FAIR Computational Workflows argues that workflows should be FAIR scholarly community research objects in their own right as a kind of FAIR Research Software. In this project we go one step further, and argue that workflow executions should also be published with sufficient traces and structured metadata. Workflow Run RO-Crate is a set of profiles of RO-Crate that capture workflow provenance in a lightweight FAIR data package based on existing standards, in order to support traceability, reproducibility and interoperable description of diverse computational analysis. This use of RO-Crate allows the contextualization of a computational workflow and its execution, e.g. relating to people, organisations, projects, funding, data sources and wider research questions and studies. We have implemented the profile in multiple workflow systems, including Galaxy, COMPSs, StreamFlow, WfExS, Sapporo and Autosubmit. The command line tool runcrate can convert from the precursor CWLProv and display or validate crates according to the profiles. The crates are compatible with ELIXIR’s WorkflowHub and support increasing levels of details, including documenting ad-hoc scripts without a workflow engine. WfExS is a workflow orchestrator designed for reproducible and secure workflow executions in isolated environments (like HPC). Every input, workflow and container being used in an execution must have either a public or permanent identifier, or at least a resolvable URI, so the execution scenario can be materialised. The execution scenario before and/or after the execution can be saved to RO-Crate. Here we bring together FAIR Computational Execution practitioners to mature and generalise this approach using Workflow Run Crate.
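
    The idea of packaging an execution as structured metadata can be sketched as follows. This is a simplified illustration, not the output of runcrate or of any of the workflow systems named above: it builds a bare RO-Crate 1.1 metadata document in which a `CreateAction` links a workflow file (`instrument`) to its inputs (`object`) and outputs (`result`). File names and the `#run-1` identifier are hypothetical, and the profile declarations and several properties a complete crate would carry (description, license, publication date) are deliberately omitted.

```python
import json


def build_run_crate(workflow_file, input_file, output_file, end_time):
    """Build a minimal RO-Crate metadata document describing one run."""
    graph = [
        {   # Metadata descriptor required by RO-Crate 1.1
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # Root dataset, pointing at its files and at the run
            "@id": "./",
            "@type": "Dataset",
            "name": "Example workflow run (illustrative only)",
            "hasPart": [{"@id": workflow_file}, {"@id": input_file},
                        {"@id": output_file}],
            "mentions": {"@id": "#run-1"},
        },
        {"@id": workflow_file,
         "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"]},
        {"@id": input_file, "@type": "File"},
        {"@id": output_file, "@type": "File"},
        {   # The execution itself
            "@id": "#run-1",
            "@type": "CreateAction",
            "instrument": {"@id": workflow_file},
            "object": [{"@id": input_file}],
            "result": [{"@id": output_file}],
            "endTime": end_time,
        },
    ]
    return {"@context": "https://w3id.org/ro/crate/1.1/context",
            "@graph": graph}


def dangling_references(crate):
    """Return local @id references that no entity in the graph defines."""
    defined = {entity["@id"] for entity in crate["@graph"]}
    referenced = set()

    def visit(value):
        if isinstance(value, dict):
            if set(value) == {"@id"}:
                referenced.add(value["@id"])
            else:
                for inner in value.values():
                    visit(inner)
        elif isinstance(value, list):
            for inner in value:
                visit(inner)

    for entity in crate["@graph"]:
        for key, value in entity.items():
            if key != "@id":
                visit(value)
    # External web references are not expected to be defined in the graph.
    return {ref for ref in referenced - defined
            if not ref.startswith(("http://", "https://"))}


if __name__ == "__main__":
    crate = build_run_crate("workflow.cwl", "input.txt", "output.txt",
                            "2023-11-10T12:00:00Z")
    print(json.dumps(crate, indent=2))
    print(dangling_references(crate))
```

    The small `dangling_references` check is only a sanity test for this sketch; validating a crate against the Workflow Run Crate profiles is the job of dedicated tooling.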
  • Building Towards a Machine-Actionable Software Management Plan: A BioHackathon Europe 2023 Report

    This report provides an overview of our activities and accomplishments concerning machine-actionable Software Management Plans (SMPs) and the Software Management Wizard (SMW) during the ELIXIR BioHackathon Europe 2023. ELIXIR acknowledges the critical role of effective software management in facilitating sustainable and reproducible research outcomes. The Software Best Practices group is actively committed to establishing a robust framework for SMP creation. In this project, our primary focus is on streamlining the SMP creation process for research software within ELIXIR. To achieve this, we are working on developing essential integrators and identifying and reviewing the relevant metadata schema. This effort is closely aligned with various related initiatives such as OpenEBench, FAIR4RS, RDA, maSMPs, among others. The outcomes of the BioHackathon project are now available for immediate use and can be further refined in the future based on community feedback and advancements in research software best practices.
  • Metadata handling for BioHackathon publications through BioHackrXiv

    This paper presents the work executed on BioHackrXiv during the international ELIXIR BioHackathon Europe in Paris, France, 2022. BioHackrXiv is a scholarly publication service for BioHackathons and codefests that target biology and the biomedical sciences in the spirit of pre-publishing platforms.
  • Genome Annotation and Other Post-Assembly Workflows for the Tree of Life

    Rapid advances in genome sequencing technologies have resulted in an explosion of reference-quality genome assemblies across the tree of life. While these resources will be invaluable towards goals of species and biodiversity conservation, their application is limited when they lack accurate annotations of their functional elements. The European Reference Genome Atlas (ERGA) is the European node of the Earth Biogenome Project (EBP) and aims to share resources and knowledge to create fully-annotated reference genomes. ERGA strives to do this in a distributed manner, bringing together researchers from across the world, with common goals and understandings. In the BioHackathon Europe 2023, we came together to construct and test tools, pipelines and workflows for annotating protein-coding regions in assembled genomes. We specifically aimed to evaluate (a) the performance in a wide variety of non-model organisms and (b) the “usability” of pipelines for newcomers to annotation. This work required installing and implementing tools in a number of computational environments and infrastructures, sharing both genomic resources and expertise between researchers from a range of institutes, and evaluating the performance of annotation workflows and the input data required to achieve a high-quality genome annotation. Here we present the results of over 20 researchers in 8 time-zones working towards a robust implementation of genome annotation workflows in eukaryotic organisms.
  • Improving Bioschemas creation and community adoption through process improvements, tool development, and advancing compliance to FAIR standards

    Scientists in many communities now produce large volumes of diverse datasets, which they need to combine to answer new scientific questions. To do so, these diverse computational resources first need to be found by search engines. Bioschemas provides a simple and lightweight mechanism to annotate online resources in a standardized way and expose key metadata. To improve the accessibility and value of Bioschemas to existing and emerging communities, we aim to develop an automated system to assess the adoption of Bioschemas, work with identified groups that have specific needs addressable by Bioschemas, address usability issues in the Bioschemas profile and type development process, and extend the reach of Bioschemas by making it available in a domain-agnostic manner.
  • Bioschemas Resource Index for Chem and Plants

    As part of the BioHackathon Europe 2023, we here report on the progress of the hacking team preparing a resource index and knowledge graph based on the JSON-LD Bioschemas markup from several resources in the life- and natural sciences, predominantly from the fields of plant- and (bio)chemistry research. This preliminary analysis will allow us to better understand how Bioschemas markup is currently used in these two communities, so we can take actions to improve guidelines and validation on the Bioschemas markup and the data providers side. The lessons learnt will be useful for other communities as well. The ultimate goal is facilitating and improving interoperability across resources.
  • BioHackEU23 report: Enabling FAIR Digital Objects with RO-Crate, Signposting and Bioschemas

    As part of the BioHackathon Europe 2023, we here report on the progress of hackathon project #15: “Enabling FAIR Digital Objects with RO-Crate, Signposting and Bioschemas”. We added Signposting to three existing resources, and made a Chrome browser extension to show Signposting headers. We added RO-Crate to two existing resources, and explored making a hybrid FDO using both a Handle PID Record and Signposting/RO-Crate approach.
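
    Signposting exposes typed links (for example `cite-as`, `describedby` and `item`) in the HTTP `Link` header of a landing page. As a rough illustration of what a tool reading those headers has to do, the sketch below groups the links of a header value by relation type. It is not the browser extension mentioned above, the header value is invented, and the parser is deliberately simplified (it assumes no `<` or `,` inside parameter values).

```python
import re
from collections import defaultdict

LINK = re.compile(r"<([^>]+)>([^<]*)")
PARAM = re.compile(r';\s*([\w-]+)\s*=\s*"?([^";,]+)"?')


def parse_signposting(link_header):
    """Group the links of an HTTP Link header value by relation type."""
    links = defaultdict(list)
    for href, raw_params in LINK.findall(link_header):
        params = {key.lower(): value.strip()
                  for key, value in PARAM.findall(raw_params)}
        # A single link may carry several space-separated relation types.
        for rel in params.pop("rel", "").split():
            links[rel].append({"href": href, **params})
    return dict(links)


# Invented header value for a hypothetical landing page.
EXAMPLE_HEADER = (
    '<https://doi.org/10.1234/example>; rel="cite-as", '
    '<https://example.org/record/1/metadata.json>; rel="describedby"; '
    'type="application/ld+json", '
    '<https://example.org/record/1/data.csv>; rel="item"; type="text/csv"'
)

if __name__ == "__main__":
    for rel, targets in parse_signposting(EXAMPLE_HEADER).items():
        print(rel, [t["href"] for t in targets])
```

    With the relations grouped this way, a client can follow `cite-as` to the persistent identifier and `describedby` to a metadata document such as an RO-Crate.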
  • Benchmarks for Bioinformatics Workflow Bake Offs

    This BioHackathon Project focused on establishing a “Great Bake Off of Bioinformatics Workflows” by developing workflow-level benchmarks for evaluating tools in computational tasks. Initially tested in proteomics, the project expanded to genomics and metabolomics. Collaborating with ELIXIR Implementation Studies, the team created rudimentary benchmarks, aiming for formalization before production use. The project consolidated efforts to produce a minimum set of workflow-specific benchmarks, aligning tools and workflow definitions. Short-term goals include drafting benchmarks with examples, while long-term plans involve implementing them in the Workflomics project and Proteomics Community ELIXIR Implementation Studies for community sharing.
  • BioHackEU23 report: Enabling continuous RDM using Annotated Research Contexts with RO-Crate profiles for ISA

    A prevailing paradigm in Research Data Management (RDM) is to publish research datasets in designated archives upon conclusion of a research process. However, it is beneficial to abandon the notion of final or static data artifacts and instead adopt a continuous approach towards working with research data, where data is constantly shared, versioned, and updated. This immutable yet evolving perspective allows for the application of existing technologies and processes from software engineering, such as continuous integration, release practices, and version management, which are backed by decades of experience and adaptable to RDM. To facilitate this, we propose the Annotated Research Context (ARC), a data and metadata layout convention based on the well-established ISA model for metadata annotation and implemented using Git repositories. ARCs are amenable to frequent, lightweight data management operations, such as (meta)data validation and transformation. The Omnipy Python library is designed to help develop stepwise validated (meta)data transformations as scalable data flows that can be incrementally designed, updated, and rerun as requirements or data evolve. To demonstrate the concept of continuous RDM, we will use Omnipy to define and orchestrate Git-backed CI/CD (Continuous Integration/Continuous Delivery) data flows to convert ISA metadata present in ARCs into validated RO-Crate representations adhering to the Bioschemas convention. An RO-Crate package combines the actual research data with its metadata description. Downstream, this allows semantic interpretation by Galaxy, e.g. for workflow execution, as well as machine-readable data access and data harvesting for search engines such as FAIDARE.
  • BioHackEU23 report: Extending interoperability of experimental data using modular queries across biomedical resources

    This report provides an overview of the significant accomplishments achieved during the ELIXIR Biohackathon 2023 under Project 17: “Extending interoperability of experimental data using modular queries across biomedical resources”. The project diligently addressed four key aspects: the expansion of data resources, the creation of knowledge graphs, advancements in data visualization, and the development of a use-case-driven pipeline. The collective efforts during the Biohackathon aimed to enhance the integration and accessibility of experimental data across diverse biomedical resources by developing a tool named BioDataFuse.
  • Rendering co-author graphs using linked-open-data from Wikidata

    Wikidata is the linked-open-data graph of the Wikimedia Foundation, with Wikipedia as its best-known sibling (Vrandečić, 2012). What Wikipedia is to text, Wikidata is to data. As in Wikipedia, linked data can be added by everyone, for everyone. This makes Wikidata a very rich source of data. A substantial part of the data on Wikidata is about scientific publications and the authors of these publications (Taraborelli et al., 2016). Scholia is a tool that uses this data to create a profile page for authors and publications (Nielsen et al., 2017). This report describes a workflow to create co-author graphs using the data from Scholia.
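
    The core of a co-author graph can be sketched independently of how the data is retrieved. The example below is not the workflow from the report: it assumes the publication-to-author lists have already been obtained (for instance from Wikidata), uses invented placeholder names, counts how often each pair of authors shares a publication, and emits the result as Graphviz DOT text as one possible rendering format.

```python
from collections import Counter
from itertools import combinations


def coauthor_edges(works):
    """Count shared publications for every unordered pair of authors."""
    edges = Counter()
    for authors in works.values():
        # sorted(set(...)) removes duplicates and fixes the pair order.
        for first, second in combinations(sorted(set(authors)), 2):
            edges[(first, second)] += 1
    return edges


def to_dot(edges):
    """Render the weighted edges as an undirected Graphviz DOT graph."""
    lines = ["graph coauthors {"]
    for (first, second), count in sorted(edges.items()):
        lines.append(f'  "{first}" -- "{second}" [label="{count}"];')
    lines.append("}")
    return "\n".join(lines)


# Invented data: publication identifier -> list of author names.
WORKS = {
    "work-1": ["Author A", "Author B", "Author C"],
    "work-2": ["Author A", "Author B"],
    "work-3": ["Author C"],
}

if __name__ == "__main__":
    print(to_dot(coauthor_edges(WORKS)))
```

    Single-author publications such as `work-3` contribute no edges, so authors who never co-publish would have to be added as isolated nodes if they should appear in the rendered graph.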