BioHackathon Europe, Barcelona, Spain, 2024

BioHackathon Europe is ELIXIR’s annual flagship event that brings together bioinformaticians from around the world to work collaboratively on various hacking projects. This year it was held in Barcelona, Spain, with 180 in-person participants and over 140 online, hacking through 30 bioinformatics projects.

Source: https://elixir-europe.org/news/biohack2024

Preprints

  • Software Quality Indicators: extraction, categorisation and recommendations from canonical sources

    Research software plays a central role in modern science, and its quality is increasingly recognized as essential for reproducibility, sustainability, and trust. Numerous initiatives have proposed indicators to guide quality assessment, yet these indicators are dispersed across domains and vary in scope, terminology, and practical use. This work presents a curated catalogue of software quality indicators tailored to the needs of research software. Developed during BioHackathon Europe 2024 and refined in collaboration with the ELIXIR Tools Platform and EVERSE project, the catalogue consolidates and structures indicators from a range of authoritative sources.
  • Leveraging RDF and CURIE metadata resolution with identifiers.org

    Identifiers.org provides two core services for CURIEs in life sciences. One is a registry of CURIE prefixes and URL locations that contains entries for the main life sciences datasets. The other is a resolver that allows for consistent data access, using registry information to redirect to current URLs for CURIE identifiers. For this work, we aimed to expand these services to facilitate the integration of CURIE-related metadata into different contexts. The first part exports the registry in RDF with a SPARQL server to allow queries on the dataset. Through these queries, RDF-based systems can link to registry metadata on different data collections, allowing, for example, services that hold identifiers.org URLs to collect metadata on the collections they reference. The second part expands the existing metadata resolver so that it can collect CURIE-related metadata from different metadata providers. While the previous resolver could only collect LDJSON notations from pages, it can now be expanded to collect from any metadata provider. For this work, we implement two proof-of-concept retrievers, one for EBI Search, a text search engine that allows for metadata acquisition, and one for TogoID, an ID mapping service for life sciences. Finally, we gather some future tasks for identifiers.org services.
  • BioHackEU24 report: Expanding FAIR database integration through elucidation and transformation of underlying graph schemas

    The BioDataFuse (BDF) project aims to enhance the interoperability of biomedical data through modular integration of data from diverse life sciences resources into context-specific knowledge graphs. This paper discusses the efforts made during BioHackathon Europe 2024 to improve the FAIR (Findable, Accessible, Interoperable, and Reusable) data integration process by clarifying and transforming graph schemas. We explored tools such as VoID-generator, RDF-config, and sheXer for data schema extraction and the integration of RDF Portal data into the BDF framework. By leveraging these tools, we automated the generation of SPARQL queries, created GraphQL endpoints, and enhanced BDF’s ability to integrate new databases. Additionally, we explored the potential of large language models (LLMs) for automated reasoning and data interpretation within the BDF ecosystem. This work lays the foundation for building more efficient and standardized data models, contributing to the seamless integration of multiple biomedical databases.
  • Reusable RDM Planning Environments for Trainings and Workshops: A BioHackathon Europe 2024 Report

    This report provides an overview of our activities and accomplishments related to the creation of reusable RDM (Research Data Management) Planning Environments for trainings and workshops conducted during the ELIXIR BioHackathon Europe 2024. ELIXIR recognizes the critical role of effective data management planning in enabling sustainable and reproducible research outcomes. This effectiveness is achieved through the use of appropriate Data Management Planning tools, such as the Data Stewardship Wizard. The Data Stewardship Wizard is used to conduct various trainings, each of which requires an instance populated with its own data. The goal of this project was to provide an easy and effective way to prepare “recipes” for the DSW Data Seeder.
  • Enhancing bio.tools by Semantic Literature Mining

    Mining mentions of software tools in scientific literature is important for resource discovery and analysis in bioinformatics. Despite advancements in deep-learning-based natural language processing techniques, accurately identifying software mentions remains challenging due to naming ambiguities, inconsistent citation practices, and homonyms. In this study, we developed methods to enhance the bio.tools registry through integration with Europe PMC. We systematically explored three distinct article-tool relationships: direct associations, citations of associated articles, and textual mentions without explicit citations. A hybrid approach combining rule-based heuristics and machine learning was evaluated at an F1-score of 74.4% in contextual software mention disambiguation tasks. We further demonstrated the potential for mining software co-mentions and co-citations from Europe PMC, constructing interactive networks in Cytoscape to visualize relationships between tools. Leveraging bio.tools metadata significantly improved disambiguation accuracy, including for tools with generic names. In the future, we will expand annotated datasets, handle software synonyms, and make bio.tools software mentions retrievable through the Europe PMC Annotations API to enrich bio.tools with usage data, making software more findable, including for recommendation systems.
  • BioHackEU24 report: Integrating Bioconductor packages with the ELIXIR Research Software Ecosystem using EDAM

    This project seeks to enhance the ELIXIR Research Software Ecosystem (RSEc) by increasing the findability, accessibility, interoperability, and reusability (FAIR principles) of Bioconductor’s extensive collection of over 2,000 bioinformatics packages. By aligning Bioconductor metadata with the EDAM ontology and integrating detailed package descriptions into the bio.tools registry, we aim to improve the discoverability and usability of bioinformatics analysis tools. Short-term goals include mapping Bioconductor’s biocViews controlled vocabulary to EDAM concepts, developing a set of manually annotated “gold standard” packages, and evaluating tools for automated EDAM concept suggestions. Long-term, we intend to expand EDAM coverage across Bioconductor, phase out biocViews, and implement automated synchronisation with bio.tools. This initiative fosters collaboration between Bioconductor and ELIXIR, establishing a foundation for sustainable software management in European bioinformatics. Key results from the ELIXIR BioHackathon 2024 week include substantial progress in mapping the biocViews vocabulary to EDAM concepts, initiating the curation of a reference set of packages with manual annotations, integrating Bioconductor metadata into the ELIXIR Research Software Ecosystem (RSEc) with automated updates, and prototyping a tool for automated EDAM concept suggestions. Together, these achievements establish a strong foundation for further integration and refinement.
  • An assessment of Croissant ML metadata descriptors for AI-ready datasets

    To advance the use of machine learning to address humanity’s grand challenges, such as the understanding of disease conditions and biodiversity loss in the Anthropocene, it is important to promote FAIR AI-ready datasets, since data scientists and bioinformaticians spend 80% of their time in data finding and preparation. Metadata descriptors for datasets are pivotal for the creation of machine learning models as they facilitate the definition of strategies for data discovery, feature selection, data cleaning, and data pre-processing. ML-ready datasets, whether by design or after pre-processing, can be enriched with metadata so they become FAIRer, i.e., autonomously discoverable and processable by machines (machine-actionable). Croissant ML is an extension of schema.org to better describe ML-ready datasets, released early 2024 and already adopted by some ML-model platforms such as Hugging Face (see Croissant ML viewer documentation) and OpenML. However, as commonly happens with metadata, there are some limitations to the amount of metadata that can be automatically extracted. How much Croissant metadata can be programmatically extracted from ML-ready datasets? And how could this automation be improved? In this project, we explored answers to these two questions.
  • Development of FAIR image analysis workflows and training in Galaxy

    Although image analysis tools are available within the Galaxy platform, they remain underutilised. During the 2023 BioHackathon Europe, our efforts focused on enhancing the image analysis community in Galaxy by cataloguing and annotating tools and facilitating community discussions to establish naming conventions that promote standardisation. These initial efforts, detailed in the project outcomes, laid the foundation for the ongoing expansion of Galaxy’s image analysis capabilities. Building on these achievements, this year’s work aimed to exploit and demonstrate the Galaxy platform’s full potential to address the needs of the image analysis community. This project involved developing FAIR (Findable, Accessible, Interoperable, and Reusable) image analysis workflows, creating tutorials for the Galaxy Training Network (GTN) to provide documentation, and fostering broader adoption and facilitating the application of these workflows across scientific domains.
  • BioHackEU24 report: ORCID and ROR identifiers in BioHackrXiv reports

    The first BioHackrXiv preprint was published in 2020, using a platform based on the idea of using Markdown, and just weeks ago, BioHackrXiv published their 100th preprint. Machine-readable metadata added to the Markdown includes the title, keywords, the author names, their affiliations, and details about the Biohackathon event the preprint is related to. The metadata in 2020 already supported listing the ORCID identifier of the authors, but this was not added to the author list in the generated PDF. This report describes two improvements of the platform: visualization of the ORCID identifiers in the preprint PDF and support for Research Organization Registry (ROR) identifiers of the affiliations.
  • BioHackEU24 report: Bioschemas for Mortals

    We report here on the progress of project #10: “Bioschemas for Mortals” from BioHackathon Europe 2024. The goal of this project is to reimagine, reframe and supplement the existing Bioschemas guidance available. We identified patterns of use, commonly undertaken tasks and user personas and roles. This information will be used to identify what is needed by less technical users, ultimately providing specific code examples that can be copy/pasted, documented examples for different web setups, customised guidance for different personas, and to address usability and content accessibility. We will also use the learnings from the Bioschemas hackathons to progress more quickly on the domain-agnostic schemas.sci site.
  • BioHackEU24 report: Creating user benefit from ARC-ISA RO-Crate machine-actionability & Increasing FAIRness of digital agrosystem resources by extending Bioschemas

    As part of the BioHackathon Europe 2024, we here report on the progress that both project 19 and project 24 have made during the event. For the purpose of this report we will present the abstract of both projects and then dive deeper on what work was done during the BioHackathon.
  • Enhancing multi-omic analyses through a federated microbiome analysis service

    Multi-omics datasets are an increasingly prevalent and necessary resource for achieving scientific advances in microbial ecosystem research. However, they present twin challenges to research infrastructures: firstly, the utility of multi-omics datasets relies entirely on interoperability of omics layers, i.e. on formalised data linking. Secondly, microbiome derived data typically lead to computationally expensive analyses, and so rely on the availability of high performance computing (HPC) or cloud infrastructures. These challenges can be better met by combining the resources of multiple groups, services and infrastructures. In this BioHackathon Europe 2024 project, we envisioned a “federated microbiome analysis service” and worked on three tracks of development towards it: mapping metagenomics metadata standards to Schema.org and Bioschemas terms, rendering Nextflow workflow executions as RO-Crates, and tooling for creating, viewing and interlinking human-readable RO-Crate previews.