DBCLS BioHackathon 2023, Kagawa, Japan, 2023

Preprints

  • SPARQL services for InterMine databases

    InterMine is an open source data warehouse system that can be used to create biological databases accessible via web query tools. Many public InterMine instances are currently deployed worldwide, and they share a core data model covering common biological entities. Besides the core data model, each InterMine instance typically has an extended data model to cover data specific to that particular deployment. The data is organised according to a graph-based data model but is stored in a relational database (Postgres). The goal of this project was to explore the possibility of translating InterMine data from relational form to graph form, using Resource Description Framework (RDF) as the exchange format. This could provide a route to exposing data from InterMine instances as RDF triples and thus make it possible to query the data using the SPARQL Protocol and RDF Query Language (SPARQL).
  • BioHackJP 2023 Report R1: Improving phenotype ontology interoperability

    Ontologies play a crucial role in data management. In the life sciences especially, they have been indispensable for decades, as the complexity of life science data requires rigor. Biomedical ontologies often undergo change and improvement; disease and phenotype ontologies, for example, develop constantly along with our scientific understanding. Ontology mapping plays a key role in bridging the gap between ontologies and annotated datasets, and thus in semantically enabling applications and datasets to retrieve insights and improve interoperability. To implement a sophisticated search supported by semantics, interoperability that addresses cross-disciplinary needs is crucial. In this paper we focus on different aspects of the interoperability of ontologies, especially in the phenotype and disease domain, and on how they could be improved. During BioHackJP 2023, a variety of approaches were discussed and evaluated. We report an overview of the results of each investigation, covering 1: Linguistic and Social Interoperability, 2: Technical and Structural Interoperability, 3: Ontology Alignments and Mappings, 4: Use of Large Language Models (LLMs), and 5: Model Mice Exploration, and we discuss future work to address these challenges.
  • BioHackJP 2023 Report R1: Mapping human genome variations to their mouse counterparts for identifying disease model mouse strains

    In disease model mouse strains used for human disease studies, information on genomic variations is essential for elucidating the relationship between haplotypes and disease susceptibility. To select a disease model mouse appropriately, it is crucial to identify mouse variants with the same effect as disease-causing variants in humans. In BioHackathon Japan 2023, we focused on nucleotide variants involved in amino acid substitutions. We developed an API that matches mouse variants from the MoG+ database to human variants within gene regions defined by HGNC identifiers or symbols. After the hackathon, we will map non-coding variants in addition to coding variants. The outcomes of our variant mapping will be presented as links connecting the comprehensive human variation database, TogoVar, and the model mouse genome database, MoG+.
  • Efforts to analyze pathways in non-model organisms

    In addition to functional annotation of genes, annotating genes to pathways is important in current molecular biology. However, annotating genes to pathway nodes requires pathway diagrams, so it is important to draw pathway diagrams in which genes and metabolites are assigned. Existing metabolic pathway databases focus on generic pathways, whereas secondary metabolism is what matters in organisms that produce useful substances. Moreover, these databases cannot accept third-party annotation of their data. A practical system for pathway analysis is therefore needed. Following on from the previous BioHackathon (BH23), we first discussed how to create a database of pathway information for non-model species at a domestic version of the BioHackathon called BH23.9, held in Shirahama, Wakayama, Japan (25-29 September 2023). We then gave a tutorial on how to write a pathway diagram using PathVisio, a free open-source tool for drawing, editing, and analyzing biological pathways. Finally, we tried to establish a system for converting text data to Graphical Pathway Markup Language (GPML), called txt2gpml. txt2gpml will drastically reduce the time and effort required to create pathway diagrams. After stimulating discussions at BH23 and BH23.9, we were able to clarify the current issues in pathway analysis for non-model organisms.
  • BioHackJP 2023 Report R3: Plant data integration for findability across multiple databases

    Plant research generates vast amounts of heterogeneous data, available in dispersed repositories. Accessing, integrating, and analyzing these datasets is therefore a challenge, owing to their low findability and to the variability of formats and standards. Several solutions, including data standards (MIAPPE, BrAPI) and portals (FAIDARE), are recommended by the ELIXIR plant community through the RDM Kit plant pages. BioHackathon Japan 2023 was an ideal event to promote those solutions to Japanese researchers and bioinformaticians, in order to increase the visibility of Japanese databases in the plant research data discovery portal FAIDARE and to explore the use of the Breeding API for knowledge graphs.
  • Redesign of the validation framework in LinkML

    LinkML is a data modeling language that can be used to describe the structure and semantics of data from a specific domain. As with any modeling language, there is a need for tools that support validation of data. LinkML provides a set of validation tools, but there is a growing need to adapt the tools for a broader audience. This report describes the effort to redesign the validation framework in LinkML to better support a wider range of validation scenarios and use cases.
  • Machine learning of transcriptome data treated with DNA base editor

    Base Editor, a technique that utilizes Cas9 nickase fused with deaminase to introduce single base substitutions, has significantly facilitated the creation of valuable genome variants in medical and agricultural fields. However, Base Editor is known to cause RNA off-target effects, which result in unintended substitutions in the transcriptome. It has been reported that such substitutions often occur in specific base motifs (ACW), but whether these motif mutations are dominant has not been investigated. In this study, we constructed a pipeline for analyzing RNA off-target effects, called the Pipeline for CRISPR-induced Transcriptome-wide Unintended RNA Editing (PiCTURE), and analyzed previously reported RNA-seq data. We found minor RNA off-target effects associated with the reported base motifs, and most were indistinguishable in motif analysis. Consequently, we trained a Large Language Model (LLM) specialized for DNA base sequences on RNA off-target sequences and developed a classifier for assessing the risk of RNA off-target effects based on the sequences. When the model’s estimations were applied to the RNA off-target data for BE4-rAPOBEC1 and BE4-RrA3F, satisfactory determination results were obtained. This study is the first to demonstrate the efficacy of machine learning approaches in determining RNA off-target effects caused by Base Editor and presents a predictive model for the safer use of Base Editor.
  • BioHackJP 2023 Report R3: Expand the pathway analysis environment to non-model organisms

    Despite decades of pathway database efforts and freely available pathway modeling tools, most researchers publish their biological pathway knowledge as static image figures made with general illustration tools. Prior to the BioHackathon, we had identified 103,009 pathway figures in the literature and performed optical character recognition (OCR) on them (Pathway Figure OCR; Hanspers et al., 2020). As an initial exploration, we extracted chemical names, disease terms, and human gene names. We knew, however, that many of the pathways represented biological processes and entities specific to plant, microbial, and numerous non-model organisms. To expand the pathway analysis environment to non-model organisms whose genomic and functional annotations are not organized in a central public database, we sought to expand the number of organism species included in the Pathway Figure OCR (PFOCR) database. Also, with the continuing goal of expanding the use of WikiPathways (Pico et al., 2008) and the practice of modeling pathway information as proper data models, we trained new users of PathVisio (Kutmon et al., 2015) and guided them through the process of publishing at WikiPathways.
  • BioHackJP 2023 Report R4: integration of glyco data with chemo-, geno-, lipid-omics and pathway data

    GlyTouCan is the international glycan repository which assigns unique accession numbers to glycans; it serves an important role in the interoperability of glycan-related databases and Web resources. GlyCosmos is a Web portal for glycoscience data, using semantic Web technologies to integrate heterogeneous data related to glycans. It currently contains information about glycogenes, glycoproteins, glycolipids, pathways, and diseases, in addition to providing various tools for glycan analysis. At BH23, we studied and developed the following: 1) Integration of the glycan data in GlyTouCan and PubChem: analyzing the glycan structures and the chemical representation of data in PubChem. 2) Integration of glycan data from GlyCosmos with UniProt. 3) Investigation of glycogene variants and phenotypes to integrate with GlyCosmos: investigating variants and phenotypes in the current life science database landscape. The glycogenes in GlyCosmos are managed using HGNC symbols and NCBI Gene IDs, so resources that could easily provide variants and phenotypes for a list of such genes would be the strongest candidates. Comprehensiveness and accuracy are also important factors. 4) Update of Glyco-tools: software used by the glycomics community. 5) Semantic inferencing to enhance the knowledge in GlyCosmos: organizing the ontologies used in GlyCosmos to enable inferencing, and incorporating inferencing rules to generate new knowledge from existing data.
  • RDF Data integration using Shape Expressions

    This paper reports on the activities carried out during the BioHackathon 2023 in Shodoshima, Japan, in a project on RDF data integration using Shape Expressions. It describes several approaches that were discussed for creating RDF data subsets, along with preliminary results from applying some of those technologies. It also describes work comparing RDF data modeling approaches such as ShEx, LinkML, and the YAML files from rdfconfig.
  • Evaluating Oxigraph Server as a triple store for small and medium-sized datasets

    With the growing complexity and volume of bioinformatics data, there is an increasing demand for efficient and versatile triplestore technologies. Modern programming languages such as Rust address limitations of traditional languages by emphasizing safety, performance, and developer experience. An example of this modern approach is Oxigraph, a Rust-based graph database that manages graph data capably and mainly targets single-node use cases. Although it began as a hobby project, Oxigraph delivers competitive performance on straightforward Online Transaction Processing (OLTP) workloads and shows considerable potential for future refinement. This study appraises the Oxigraph server in distinct use cases, going beyond typical SPARQL performance. The evaluation examines various operational aspects, including data loading, backup procedures, deployment strategies, maintenance protocols, and overall server usability. The evaluation used a subset of the PDB/RDF archive and the complete chem_comp/RDF archive, totalling around 0.5 billion triples.