Meetings

Recent preprints

  • Metadata for BioHackrXiv Markdown publications

    biohackrxiv.org is a scholarly publication service for BioHackathons and Codefests in which papers are generated from Markdown templates whose header is a YAML/JSON record that includes the title, authors, affiliations and tags. Many projects in BioHackathons are about using FAIR data. Because the current setup is lacking in the findable (F) and accessible (A) aspects of FAIR, for the ELIXIR BioHackathon 2020 we decided to add a service that provides a SPARQL endpoint for queries and some simple HTML output that can be embedded in a BioHackathon website.
  • Mapping OHDSI OMOP Common Data Model and GA4GH Phenopackets for COVID-19 disease epidemics and analytics

    The COVID-19 crisis demonstrates a critical requirement for rapid and efficient sharing of data to facilitate the global response to this and future pandemics. Our project aims to enhance interoperability between health and research data by mapping the Phenopackets and OMOP schemas, and by representing COVID-19 metadata using the FAIR principles to enable discovery, integration and analysis of genotypic and phenotypic data. Here, we present our outcomes after one week of BioHacking that brought together 17 participants (10 new to the project) from different countries (CH, US and in the EU) and continents.
  • An ETL pipeline to construct the Intrinsically Disordered Proteins Knowledge Graph (IDP-KG) using Bioschemas JSON-LD data dumps

    Schema.org and Bioschemas are lightweight vocabularies that aim at making the contents of web pages machine-readable so that software agents can consume that content and understand it in an actionable way. Due to the time needed to process each page, extracting markup by visiting each page of a site is not practical for huge sites. This approach imposes processing requirements on both the publisher and the consumer. In February 2022, the Schema.org community proposed a method for exchanging markup from various pages as a DataFeed published at a recognized address. This would ease publisher and consumer processing requirements and accelerate data collection. In this work, we report on the implementation of a JSON-LD consumer ETL (Extract-Transform-Load) pipeline that enables data dumps to be ingested into knowledge graphs (KGs). The pipeline loads scraped JSON-LD from the three sources, converts it to RDF, applies SPARQL CONSTRUCT queries to map the source RDF to a unified Bioschemas-based model, and stores the resulting KG as a Turtle file. This work was conducted during the one-week BioHackathon Europe 2022 in Paris, France, under Project 23, titled “Publishing and Consuming Schema.org DataFeeds.”
  • Enhancement and Reusage of Biomedical Knowledge Graph Subsets

    Knowledge Graphs (KGs) such as Wikidata act as hubs of information from multiple domains and disciplines, and are crowdsourced by multiple stakeholders. The vast amount of available information makes it difficult for researchers to manage the entire KG, which is also continually being edited. It is necessary to develop tools that extract subsets for domains of interest. These subsets will help researchers to reduce costs and time, making data of interest more accessible. In the last two BioHackathons (BH20, BH21), we created prototypes to extract subsets easily applicable to Wikidata, and defined a map of the different approaches used to tackle this problem. Building on those outcomes, we aim to enhance subsetting in both definitions, using entity schemas based on Shape Expressions (ShEx), and extraction algorithms, with a special focus on the biomedical domain. Our first aim is to develop complex subsetting patterns based on qualifiers and references to enhance the credibility of datasets. Our second aim is to establish a faster subset extraction platform by applying new algorithms based on Apache Spark and new tools such as a document-oriented DBMS platform.
  • CWLD: Mapping colloquial wet lab language to ontologies

    The use of ontology terms can make data more FAIR and tractable by machines. However, the highly formalised terminology used by these ontology terms does not always match the colloquial language used by practitioners. This disparity can (a) make it difficult for practitioners to understand the language used for knowledge stored in ontologies; and (b) make it difficult to machine-interpret information written by practitioners and map it to ontologies. This problem is particularly relevant in the ELIXIR Microbial Biotechnology (MB) community: although the domain has adopted ontologies and data standards such as SO, SBO, GO, and SBOL for data representation, the tools developed often use ontology terms directly rather than the language used in the wet lab (i.e. by the people using the tools). At the BioHackathon 2022 in Paris, France, we initiated an effort to address this problem by (a) mining the internet for colloquial language used by biologists; (b) constructing a dictionary (CWLD: colloquial wet lab dictionary) of this language and its mappings to ontology terms; and (c) constructing a table of the occurrences of different terminology used in MB tools and resources. While initially developed to serve the MB community, we hope that the dictionary will serve as a helpful resource for anyone hoping to map from colloquial wet lab language to ontology terms, e.g. for text mining applications.
  • GEM: Genome Editing Meta-database

    Genome editing is a widely used tool to create precise changes in a genome. However, no specialized database for genome editing is available. Therefore, we have been developing the Genome Editing Meta-database (GEM), which aims to collect an exhaustive dataset of metadata related to genome editing. Currently, GEM consists primarily of a subset of genome editing-related metadata from PubMed articles. Metadata is extracted from research articles describing experiments that use any of seven types of genome editing tools: CRISPR-Cas9, Transcription activator-like effector nuclease (TALEN), Zinc finger nuclease (ZFN), CRISPR-Cas12, CRISPR-Cas3, Base editor, and Prime editor. These tools are often used for knock-out or knock-in of genes to elucidate their biological functions. At the domestic version of the BioHackathon in 2022 (BH22.9), we discussed the datasets and the usage of GEM, and updated the GEM scripts on GitHub based on the discussion.
  • NBS Hack Week: Pilot study on ARRIVE guidelines E10 compliance

    The ARRIVE (Animal Research: Reporting of In Vivo Experiments) guidelines are reporting guidelines designed to improve transparency and facilitate critical assessment of experiments involving animal research. Although they represent essential reporting information for animal experiments, compliance with ARRIVE items is not commonly demanded by journals and is often lacking in animal studies. In this small pilot project, we evaluated compliance with the ARRIVE 2.0 Essential 10 items in 64 papers from 2018 and 2020 that either cited or did not cite the ARRIVE manuscripts. Papers that cited the ARRIVE guidelines had slightly higher reporting scores, but we did not detect an effect of the time period nor an interaction effect between ARRIVE versions 1.0 and 2.0. This work was conducted during the No-Budget Science Hack Week 2021 event, an extended hackathon to discuss and develop projects in metascience. In future work, this pilot can be expanded to better estimate the effects of the ARRIVE guidelines on reporting practices.
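
The extract, transform and load stages described in the IDP-KG preprint above can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: a real pipeline would use an RDF library and SPARQL CONSTRUCT queries, whereas here a plain predicate mapping stands in for the CONSTRUCT step, and the output is written as N-Triples lines (a syntax that is also valid Turtle). Every IRI under `http://example.org/`, the `PREDICATE_MAP` table, and all function names are hypothetical, and the input is assumed to be a list of nodes whose keys are already full IRIs.

```python
import json

# Hypothetical mapping from source predicates to a unified, Schema.org-style
# vocabulary. In the pipeline described above, SPARQL CONSTRUCT queries
# perform this step; a dictionary lookup is only a stand-in.
PREDICATE_MAP = {
    "http://example.org/src/label": "https://schema.org/name",
    "http://example.org/src/xref": "https://schema.org/sameAs",
}


def term(value):
    """Render a value as an N-Triples term: an IRI or a plain literal."""
    if isinstance(value, dict) and "@id" in value:
        return "<{}>".format(value["@id"])
    # json.dumps yields a double-quoted, escaped, ASCII-only string.
    return json.dumps(str(value))


def extract(dump):
    """Extract: flatten a JSON-LD-like dump (list of nodes) into triples."""
    triples = []
    for node in json.loads(dump):
        subject = node["@id"]
        for predicate, value in node.items():
            if predicate.startswith("@"):
                continue  # skip JSON-LD keywords such as @id and @type
            values = value if isinstance(value, list) else [value]
            for item in values:
                triples.append((subject, predicate, term(item)))
    return triples


def transform(triples):
    """Transform: keep mapped predicates, rewritten to the unified model."""
    return [(s, PREDICATE_MAP[p], o) for s, p, o in triples if p in PREDICATE_MAP]


def load(triples):
    """Load: serialise the triples as N-Triples text (also valid Turtle)."""
    return "".join("<{}> <{}> {} .\n".format(s, p, o) for s, p, o in triples)


def run_pipeline(dump):
    return load(transform(extract(dump)))


# Made-up sample data, for illustration only.
SAMPLE_DUMP = json.dumps([
    {
        "@id": "http://example.org/protein/P1",
        "http://example.org/src/label": "Example protein",
        "http://example.org/src/xref": {"@id": "http://example.org/other/P1"},
        "http://example.org/src/internal": "not mapped, so dropped",
    }
])

if __name__ == "__main__":
    print(run_pipeline(SAMPLE_DUMP), end="")
```

Running the script prints two triples for the sample node; the unmapped `internal` property is dropped, mirroring how a CONSTRUCT query emits only the patterns it names.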