DBCLS BioHackathon 2023, Kagawa, Japan

BH23JP

2023-09-14 - 2023-09-20
https://2023.biohackathon.org/

YAML instructions

biohackathon_name: "DBCLS BioHackathon 2023"
biohackathon_url: "https://2023.biohackathon.org/"
biohackathon_location: "Kagawa, Japan"

Preprints

Dec 23, 2025
https://doi.org/10.37044/osf.io/hw2fj_v1

Enhancement of the Interoperability of Trait Data on Genetic Resources between Japan and France

Japan’s National Agriculture and Food Research Organization initiated a collaborative research project with France’s National Research Institute for Agriculture, Food and Environment to evaluate wheat genetic resources and to identify materials with desirable traits using standardized criteria. This paper presents the current status of trait data standardization between the two organizations and outlines a direction for standardization. Trait data for genetic resources in Japan and France are managed using independently developed standards. The lack of mapping standards hinders data integration and interoperability. To support experts in the mapping process, we developed a tool that translates trait terms. A generative AI-based translation tool appears to be applicable for collecting relevant information to support mapping between trait terms, as well as translating newly submitted Japanese trait terms into English.

Apr 24, 2024
https://doi.org/10.37044/osf.io/dpnry

SPARQL services for InterMine databases

InterMine is an open source data warehouse system for creating biological databases that can be accessed via web query tools. Many public InterMine instances are currently deployed worldwide, and they share a core data model covering common biological entities. Besides the core data model, each InterMine instance typically has an extended data model to cover data specific to that deployment. The data are organised according to a graph-based data model but are stored in a relational database (Postgres). The goal of this project was to explore the possibility of translating InterMine data from relational form to graph form, using the Resource Description Framework (RDF) as the exchange format. This could provide a route to exposing data from InterMine instances as RDF triples, making it possible to query the data using the SPARQL Protocol and RDF Query Language (SPARQL).
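The relational-to-RDF idea described in this abstract can be illustrated with a minimal Python sketch. It is not the project's implementation: the `http://example.org/intermine/` namespace, the `row_to_triples` helper, and the sample `Gene` row are all hypothetical, and every non-key column is simply emitted as a plain string literal in N-Triples syntax.

```python
from urllib.parse import quote

# Hypothetical namespace; a real deployment would choose its own IRIs.
BASE = "http://example.org/intermine/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(value):
    """Escape a value for use inside a double-quoted N-Triples literal."""
    text = str(value)
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def row_to_triples(table, key_column, row):
    """Turn one relational row (a dict) into a list of N-Triples lines.

    The table name becomes the RDF class, the key column identifies the
    subject, and every other non-null column becomes a literal property.
    """
    subject = f"<{BASE}{quote(table)}/{quote(str(row[key_column]))}>"
    triples = [f"{subject} <{RDF_TYPE}> <{BASE}{quote(table)}> ."]
    for column, value in row.items():
        if column == key_column or value is None:
            continue  # skip the key itself and SQL NULLs
        predicate = f"<{BASE}{quote(table)}#{quote(column)}>"
        triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


# Illustrative row, as it might come back from a relational query.
gene_row = {"id": 1007, "symbol": "zen", "length": 1331, "description": None}
triples = row_to_triples("Gene", "id", gene_row)

if __name__ == "__main__":
    print("\n".join(triples))
```

Triples produced this way could be loaded into any triplestore and queried with SPARQL. A fuller mapping would also need typed literals and links between rows (foreign keys as object IRIs), which this sketch leaves out.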

Jan 24, 2024
https://doi.org/10.37044/osf.io/d27fw

BioHackJP 2023 Report R1: Improving phenotype ontology interoperability

Ontologies play a crucial role in data management; in the life sciences especially, they have been indispensable for decades because the complexity of the data demands rigor. Biomedical ontologies often undergo change and improvement; disease and phenotype ontologies, for example, develop constantly along with our scientific understanding. Ontology mapping plays a key role in bridging the gap between ontologies and annotated datasets, and thus in semantically enabling applications and datasets to retrieve insights and improve interoperability. To implement a sophisticated search supported by semantics, interoperability that addresses cross-disciplinary needs is crucial. In this paper we focus on different aspects of the interoperability of ontologies, especially in the phenotype and disease domain, and on how they could be improved. During BioHackJP 2023, a variety of approaches were discussed and evaluated. We report an overview of the results of each investigation: (1) Linguistic and Social Interoperability, (2) Technical and Structural Interoperability, (3) Ontology Alignments and Mappings, (4) Use of Large Language Models (LLMs), and (5) Model Mice Exploration. We also discuss future work to address these challenges.

Jan 20, 2024
https://doi.org/10.37044/osf.io/8kuzr

BioHackJP 2023 Report R1: Mapping human genome variations to their mouse counterparts for identifying disease model mouse strains

In disease model mouse strains used for human disease studies, information on genomic variations is essential for elucidating the relationship between haplotypes and disease susceptibility. To select a disease model mouse appropriately, it is crucial to identify mouse variants with the same effect as disease-causing variants in humans. In BioHackathon Japan 2023, we focused on nucleotide variants involved in amino acid substitutions. We developed an API that matches mouse variants from the MoG+ database to human variants within gene regions defined by HGNC identifiers or symbols. After the Hackathon, we will map non-coding variants in addition to coding variants. The outcomes of our variant mapping will be presented as links connecting the comprehensive human variation database, TogoVar, and the model mouse genome database, MoG+.

Oct 26, 2023
https://doi.org/10.37044/osf.io/spf3q

Efforts to analyze pathways in non-model organisms

In addition to functional annotation of genes, annotating genes to pathways is important in current molecular biology. However, annotating genes to pathway nodes requires pathway diagrams. It is therefore important to draw pathway diagrams in which genes and metabolites are assigned to nodes. Existing metabolic pathway databases focus on generic pathways, whereas secondary metabolism is emphasized in organisms producing useful substances. Moreover, these databases cannot accept third-party annotation of their data. A practical system for pathway analysis is therefore needed. Following on from the previous BioHackathon (BH23), we first discussed how to create a database of pathway information for non-model species at a domestic version of the BioHackathon called BH23.9, held in Shirahama, Wakayama, Japan (25-29 September 2023). We then gave a tutorial on how to draw a pathway diagram using PathVisio, a free, open-source tool for drawing, editing, and analyzing biological pathways. Finally, we tried to establish a system, called txt2gpml, for converting text data to Graphical Pathway Markup Language (GPML). txt2gpml will drastically reduce the time and effort required to create pathway diagrams. After stimulating discussions at BH23 and BH23.9, we were able to clarify the current issues in pathway analysis for non-model organisms.

Sep 14, 2023
https://doi.org/10.37044/osf.io/ghzcx

BioHackJP 2023 Report R3: Plant data integration for findability across multiple databases

Plant research generates vast amounts of heterogeneous data that are available in dispersed repositories. Accessing, integrating, and analyzing these datasets is therefore a challenge, caused by their low findability as well as by variability in formats and standards. Several solutions, including data standards (MIAPPE, BrAPI) and portals (FAIDARE), are recommended by the ELIXIR plant community through the RDM Kit plant pages. The BioHackathon Japan 2023 was an ideal event to introduce those solutions to Japanese researchers and bioinformaticians, in order to increase the visibility of Japanese databases in the plant research data discovery portal FAIDARE and to explore the use of the Breeding API (BrAPI) for knowledge graphs.

Jul 18, 2023
https://doi.org/10.37044/osf.io/8ukwz

Redesign of the validation framework in LinkML

LinkML is a data modeling language that can be used to describe the structure and semantics of data from a specific domain. As with any modeling language, there is a need for tools that support validation of data. LinkML provides a set of validation tools, but there is a growing need to adapt them for a broader audience. This report describes the effort to redesign the validation framework in LinkML to better support a wider range of validation scenarios and use cases.

Jul 13, 2023
https://doi.org/10.37044/osf.io/zytkj

Machine learning of transcriptome data treated with DNA base editor

Base Editor, a technique that utilizes Cas9 nickase fused with deaminase to introduce single base substitutions, has significantly facilitated the creation of valuable genome variants in medical and agricultural fields. However, Base Editor is known to cause RNA off-target effects, which result in unintended substitutions in the transcriptome. It has been reported that such substitutions often occur in specific base motifs (ACW), but whether these motif mutations are dominant has not been investigated. In this study, we constructed a pipeline for analyzing RNA off-target effects, called the Pipeline for CRISPR-induced Transcriptome-wide Unintended RNA Editing (PiCTURE), and analyzed previously reported RNA-seq data. We found minor RNA off-target effects associated with the reported base motifs, and most were indistinguishable in motif analysis. Consequently, we trained a Large Language Model (LLM) specialized for DNA base sequences on RNA off-target sequences and developed a classifier for assessing the risk of RNA off-target effects based on the sequences. When the model’s estimations were applied to the RNA off-target data for BE4-rAPOBEC1 and BE4-RrA3F, satisfactory determination results were obtained. This study is the first to demonstrate the efficacy of machine learning approaches in determining RNA off-target effects caused by Base Editor and presents a predictive model for the safer use of Base Editor.
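To make the ACW motif mentioned in this abstract concrete, here is a minimal Python sketch of a motif check. It is not part of PiCTURE. The helper names are hypothetical, W is read as the IUPAC code for A or T (U in RNA), and the sketch assumes that the edited base is the central C of the motif.

```python
import re

# ACW with W = A or T (IUPAC). The lookahead lets overlapping sites be counted.
ACW_SITE = re.compile(r"(?=AC[AT])")


def normalize(seq):
    """Upper-case the sequence and write RNA U as T so one pattern fits both."""
    return seq.upper().replace("U", "T")


def count_acw(seq):
    """Count ACW motif occurrences in a DNA or RNA sequence."""
    return len(ACW_SITE.findall(normalize(seq)))


def edited_c_in_acw(seq, pos):
    """Return True if the C at 0-based index `pos` is the middle base of ACW.

    Assumption: the substituted base is the central C of the motif.
    """
    seq = normalize(seq)
    if pos < 1 or pos + 1 >= len(seq):
        return False  # no room for both flanking bases
    return seq[pos - 1] == "A" and seq[pos] == "C" and seq[pos + 1] in "AT"


if __name__ == "__main__":
    example = "ACAACUGACG"  # illustrative sequence, not real data
    print(count_acw(example))
    print([i for i, base in enumerate(example) if edited_c_in_acw(example, i)])
```

A motif check like this looks at only three bases. According to the abstract, most observed off-target sites were indistinguishable in motif analysis, which motivated the sequence-based classifier.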

Jul 12, 2023
https://doi.org/10.37044/osf.io/4uskb

BioHackJP 2023 Report R3: Expand the pathway analysis environment to non-model organisms

Despite decades of pathway database efforts and freely available pathway modeling tools, most researchers publish their biological pathway knowledge as static image figures made with general illustration tools. Prior to the BioHackathon, we had identified 103,009 pathway figures in the literature and performed optical character recognition (OCR) on them (Pathway Figure OCR; Hanspers et al., 2020). As an initial exploration, we extracted chemical names, disease terms, and human gene names. We knew, however, that many of the pathways represented biological processes and entities specific to plants, microbes, and numerous non-model organisms. To expand the pathway analysis environment to non-model organisms whose genomic and functional annotations are not organized in a central public database, we sought to expand the number of species included in the Pathway Figure OCR (PFOCR) database. Also, with the continuing goal of expanding the use of WikiPathways (Pico et al., 2008) and the practice of modeling pathway information as proper data models, we trained new users of PathVisio (Kutmon et al., 2015) and guided them through the process of publishing at WikiPathways.

Jul 4, 2023
https://doi.org/10.37044/osf.io/mq54k

BioHackJP 2023 Report R4: integration of glyco data with chemo-, geno-, lipid-omics and pathway data

GlyTouCan is the international glycan repository which assigns unique accession numbers to glycans; it serves an important role in the interoperability of glycan-related databases and Web resources. GlyCosmos is a Web portal for glycoscience data, using semantic Web technologies to integrate heterogeneous data related to glycans. It currently contains information about glycogenes, glycoproteins, glycolipids, pathways, and diseases, in addition to providing various tools for glycan analysis. During BH23, we worked on the following: 1) Integration of the glycan data in GlyTouCan and PubChem: analyzing the glycan structures and the chemical representation of data in PubChem. 2) Integration of glycan data from GlyCosmos with UniProt. 3) Investigation of glycogene variants and phenotypes to integrate with GlyCosmos: investigating variants and phenotypes in the current life science database landscape. The glycogenes in GlyCosmos are managed using HGNC symbols and NCBI Gene IDs, so resources that can easily provide variants and phenotypes for a list of such genes are the strongest candidates. Comprehensiveness and accuracy are also important factors. 4) Update of Glyco-tools: software used by the glycomics community. 5) Semantic inferencing to enhance the knowledge in GlyCosmos: organizing the ontologies used in GlyCosmos to enable inferencing, and incorporating inferencing rules to generate new knowledge from existing data.

Jul 4, 2023
https://doi.org/10.37044/osf.io/md73k

RDF Data integration using Shape Expressions

This paper reports on the activities carried out during the BioHackathon 2023 in Shodoshima, Japan, in a project on RDF data integration using Shape Expressions. It describes several approaches that were discussed for creating RDF data subsets, along with preliminary results from applying some of those technologies. It also describes work comparing RDF data modeling approaches such as ShEx, LinkML, and the YAML files used by rdfconfig.

Jul 1, 2023
https://doi.org/10.37044/osf.io/yru4b

Evaluating Oxigraph Server as a triple store for small and medium-sized datasets

With the growing complexity and volume of bioinformatics data, there is an increasing demand for efficient and versatile triplestore technologies. Modern programming languages such as Rust address constraints identified in traditional languages, emphasizing safety, performance, and developer experience. One example of this modern approach is Oxigraph, a Rust-based graph database that manages graph data capably and primarily targets single-node use cases. Despite its origins as a hobby project, Oxigraph delivers competitive performance on straightforward Online Transaction Processing (OLTP) workloads and shows considerable potential for future refinement. This study presents a comprehensive appraisal of the Oxigraph server in distinct use cases, going beyond typical SPARQL performance. The evaluation examines various operational aspects, including data loading, backup procedures, deployment strategies, maintenance protocols, and overall server usability. The evaluation used a subset of PDB/RDF and the complete chem_comp/RDF archive, totalling around 0.5 billion triples.