DBCLS BioHackathon 2025, Mie, Japan

BH25JP

2025-09-14 - 2025-09-20
https://2025.biohackathon.org/

The BioHackathon in Japan, organized by the Database Center for Life Science (DBCLS), continues to prioritize the integrated use of databases in the life sciences, with a strong emphasis on interoperability, standardization, and the construction of FAIR knowledge graphs. In recent years, the development of tools and workflows for integrating heterogeneous biological and biomedical data—including multi-omics, imaging, clinical, and environmental data—has become increasingly central. This integration is key to enabling cross-domain analysis and improving reproducibility in data-driven research.

Source

Previous DBCLS BioHackathon preprints

YAML instructions

biohackathon_name: "DBCLS BioHackathon 2025"
biohackathon_url: "https://2025.biohackathon.org/"
biohackathon_location: "Mie, Japan"

Preprints

Jan 6, 2026
https://doi.org/10.37044/osf.io/m37f2_v1

QPX: Pathway analysis environment

Building on our work at DBCLS BioHackathon 2023 (BH23), where we introduced QPX and promoted pathway modeling with WikiPathways (Pico et al., 2008) using PathVisio (Kutmon et al., 2015), we focused on creating new pathway diagrams for diverse species and registering them in WikiPathways with functional annotations. In parallel, we deployed WikiPathways node data into Elasticsearch to enable fast and flexible search and integration of pathway information.

Dec 16, 2025
https://doi.org/10.37044/osf.io/8qeh5_v1

MCP server tools with RDF shapes

In this paper, we present our work during the Japan BioHackathon 2025 on implementing MCP servers supported by RDF data shapes to improve natural language interactions with large RDF datasets using SPARQL.
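As a rough illustration of how a data shape could support query construction, the sketch below turns a simplified shape description into a SPARQL SELECT query. The `SHAPE` dictionary, its IRIs, and the `build_select` helper are hypothetical and are not taken from the preprint; a real MCP server would expose something like this as a tool and read shapes from SHACL or ShEx rather than from a Python dict.

```python
from typing import Dict

# Hypothetical, simplified shape: a target class plus property name -> predicate IRI.
SHAPE: Dict = {
    "target_class": "http://example.org/Protein",
    "properties": {
        "label": "http://www.w3.org/2000/01/rdf-schema#label",
        "organism": "http://example.org/organism",
    },
}


def build_select(shape: Dict, limit: int = 10) -> str:
    """Build a SPARQL SELECT that uses only predicates declared in the shape."""
    variables = " ".join(f"?{name}" for name in shape["properties"])
    patterns = [f"  ?s a <{shape['target_class']}> ."]
    for name, predicate in shape["properties"].items():
        # OPTIONAL keeps subjects that lack one of the declared properties.
        patterns.append(f"  OPTIONAL {{ ?s <{predicate}> ?{name} . }}")
    body = "\n".join(patterns)
    return f"SELECT ?s {variables}\nWHERE {{\n{body}\n}}\nLIMIT {limit}"


if __name__ == "__main__":
    print(build_select(SHAPE, limit=5))
```

Restricting generated queries to predicates that a shape declares is one plausible way to keep a language model from guessing property names that do not exist in the dataset.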

Oct 24, 2025
https://doi.org/10.37044/osf.io/7s6da_v1

DBCLS BioHackathon 2025 report on the WikiBlitz

As part of the DBCLS BioHackathon 2025, we organized a WikiBlitz to improve biodiversity knowledge by integrating iNaturalist, GBIF, Wikidata, and Wikipedia. Participants identified local flora and fauna, filling gaps in multilingual Wikipedia articles. This report summarizes the methodology, results, and insights, illustrating the usefulness of combining citizen science with digital platforms to enrich ecological data and promote biodiversity awareness.

Oct 21, 2025
https://doi.org/10.37044/osf.io/4f763_v1

on2vec: Ontology Embeddings with Graph Neural Networks and Sentence Transformers

Ontologies provide structured vocabularies and relationships essential for organizing biological knowledge, yet their symbolic nature limits integration with modern machine learning methods. Leveraging recent advances in graph neural networks (GNNs) and transformer-based language models, we present on2vec, a toolkit developed during the DBCLS BioHackathon 2025 for generating vector embeddings from OWL ontologies. on2vec integrates structural information from ontology hierarchies with semantic features from textual annotations using HuggingFace Sentence Transformers, producing domain-aware embeddings suitable for downstream biomedical applications and ontology-based reasoning tasks.
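The abstract does not say how the structural and textual features are combined. As a minimal sketch, assuming simple concatenation of L2-normalized vectors (one common fusion choice, not necessarily the one on2vec uses), the idea looks like this. The function names and toy vectors are hypothetical; in practice the inputs would come from a GNN and a sentence-embedding model.

```python
import math
from typing import Dict, List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse(structural: Dict[str, Vector], textual: Dict[str, Vector]) -> Dict[str, Vector]:
    """Concatenate normalized structural and textual vectors per ontology term.

    Only terms present in both inputs are kept (an assumption of this sketch).
    """
    return {
        term: l2_normalize(structural[term]) + l2_normalize(textual[term])
        for term in structural.keys() & textual.keys()
    }


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


if __name__ == "__main__":
    structural = {"TERM:1": [3.0, 4.0], "TERM:2": [1.0, 0.0]}  # toy graph embeddings
    textual = {"TERM:1": [0.0, 2.0], "TERM:2": [1.0, 1.0]}     # toy text embeddings
    fused = fuse(structural, textual)
    print(cosine(fused["TERM:1"], fused["TERM:2"]))
```

Normalizing each part before concatenation keeps either modality from dominating the similarity score purely because of its vector magnitude.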

Oct 12, 2025
https://doi.org/10.37044/osf.io/pza7v_v1

AI in Practice: Insights from a Community Survey of Biohackathon Participants

Understanding the practical application of artificial intelligence (AI) in research is increasingly important as it becomes embedded in life sciences and bioinformatics. This paper reports on a multilingual survey, developed through community discussions at the 2025 BioHackathon in Japan and distributed through its networks, to capture current practices, successes, and challenges in AI adoption. The survey, offered in English, Japanese, and Thai, received 105 responses spanning diverse demographics, regions, and professional backgrounds. Findings reveal that most participants are frequent AI users, with tools like ChatGPT, Gemini, and Claude widely adopted and ChatGPT the most frequently named. AI is primarily used to assist or draft tasks in coding, research, and writing, while full task automation remains uncommon, reflecting a preference for AI as a collaborative aid rather than a replacement. Successes were noted in efficiency, coding support, and proposal writing, whereas challenges centered on accuracy and reliability. Institutional support emerged as a key factor: respondents in Japan, Thailand, and the private sector reported stronger support and higher satisfaction than English-speaking or academic counterparts. By documenting real-world practices and concerns, this survey provides a valuable community-driven resource to guide responsible AI development and foster international collaboration in bioinformatics.

Sep 30, 2025
https://doi.org/10.37044/osf.io/wj8bz_v1

Translating and Formalizing the MIRAGE Guidelines to a Prototype MIRAGE Ontology and DCAT3 Extension Vocabulary for Glycomics Data Management

The Minimum Information Required for A Glycomics Experiment (MIRAGE) guidelines have established comprehensive reporting standards for glycomics research, yet their implementation in semantic web technologies remains limited. We present the first comprehensive semantic formalization of MIRAGE guidelines through an integrated RDF ontology framework comprising the MIRAGE Ontology and MIRAGE-DCAT3 vocabulary. The MIRAGE Ontology models glycan structures, biological specimens, analytical instruments, and experimental processes with formal OWL semantics and SHACL validation constraints. The complementary MIRAGE-DCAT3 vocabulary extends W3C DCAT3 with glycomics-specific metadata properties for dataset cataloging and discovery. Our implementation addresses critical challenges in glycomics data interoperability through comprehensive mappings to established ontologies including GlycoRDF, PSI-MS, and DCTERMS. This semantic framework enables automated quality assessment, federated data querying, and enhanced reproducibility in glycomics research, supporting broader adoption of FAIR principles in the glycobiology community. The framework demonstrates comprehensive coverage of MIRAGE reporting requirements across multiple analytical platforms including mass spectrometry, liquid chromatography, capillary electrophoresis, NMR spectroscopy, and lectin microarray analysis.

Sep 30, 2025
https://doi.org/10.37044/osf.io/qd5sz_v1

DBCLS BioHackathon 2025 report: Creation and Publication Analytical Workflow of Creators' Interests

At the DBCLS BioHackathon 2025, we converted metatranscriptomic analytical shell scripts into Common Workflow Language (CWL) containerized with Docker. Sub-workflows were created for metagenomic assembly, read mapping, and gene annotation, and validated with test datasets. The workflows, released on GitHub and WorkflowHub, improve reproducibility and address issues of reusability and software environment dependency. We also evaluated CWL best practices from the perspective of life scientists, classifying them by difficulty, importance, and applicability to promote FAIR principles and software quality. In parallel, we established a benchmarking framework for pangenome-based structural variant (SV) calling using data from the Dai population. Graph-based references from the Human and Chinese Pangenome Consortia were compared with linear references using minimap2 and vg giraffe. Results showed improved alignment accuracy and variant detection with pangenomes, demonstrating their value for reducing mapping bias and enhancing SV discovery.

Sep 30, 2025
https://doi.org/10.37044/osf.io/9jau6_v1

A Standards-Compliant, Multi-Modal Platform for Offline Access to SRA Metadata

The SRAmetaDBB project, presented at BioHackathon Japan 2023, introduced an experimental JavaScript pipeline for creating SQLite databases from NCBI SRA (Sequence Read Archive) metadata dumps, with a vision for offline analysis and integration with Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems. While promising, the prototype faced significant challenges in performance, memory management, and production readiness when scaling to the full SRA dataset of over 45 million records. This paper presents SRAKE (SRA Knowledge Engine), a complete reimplementation in Go that not only addresses these limitations but extends the original vision with semantic search capabilities, quality control mechanisms, and multiple access interfaces. SRAKE achieves a 20-fold improvement in ingestion speed, maintains constant memory usage through zero-copy streaming, and provides standards-compliant interfaces following clig.dev guidelines. The platform introduces biomedical-specific semantic search using SapBERT embeddings via ONNX Runtime, implements comprehensive quality control thresholds for search results, and offers multiple access modalities including a CLI, REST API, MCP server for AI integration, and a simple web interface. Our development implementation demonstrates that SRAKE successfully transforms the experimental SRAmetaDBB concept into a production-ready platform that integrates seamlessly with modern AI workflows while maintaining the core vision of providing offline-capable, LLM-ready access to SRA metadata.
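To make the constant-memory ingestion idea concrete, here is a minimal Python sketch that streams records into SQLite in fixed-size batches. It is not SRAKE's Go implementation and does not reproduce its zero-copy technique; the `runs` table, the two-field record, and the `ingest` helper are hypothetical simplifications chosen only to show that memory use depends on the batch size rather than on the size of the dump.

```python
import sqlite3
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (accession, title): a hypothetical, simplified record


def batched(records: Iterable[Record], size: int) -> Iterator[List[Record]]:
    """Yield lists of at most `size` records without materializing the input."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def ingest(conn: sqlite3.Connection, records: Iterable[Record], batch_size: int = 1000) -> int:
    """Insert records batch by batch; only one batch is held in memory at a time."""
    conn.execute("CREATE TABLE IF NOT EXISTS runs (accession TEXT PRIMARY KEY, title TEXT)")
    total = 0
    for batch in batched(records, batch_size):
        with conn:  # commits one transaction per batch
            conn.executemany("INSERT OR REPLACE INTO runs VALUES (?, ?)", batch)
        total += len(batch)
    return total


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # A generator stands in for a parser that reads the metadata dump lazily.
    stream = ((f"SRR{i:07d}", f"run {i}") for i in range(2500))
    print(ingest(conn, stream, batch_size=1000))
```

Committing once per batch rather than once per row is also what usually dominates SQLite ingestion speed, so batch size is the main tuning knob in a design like this.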

Sep 30, 2025
https://doi.org/10.37044/osf.io/8kap3_v1

A Lightweight PURL Resolver for Linked Life Science Data

Knowledge graphs in the life sciences are increasingly published using the Resource Description Framework (RDF) and queried via SPARQL endpoints. While these technologies enable powerful data integration, the identifiers returned in SPARQL results often do not resolve to meaningful resources, leaving users with non-actionable links. To address this issue, we developed a lightweight Persistent Uniform Resource Locator (PURL) resolver during the BioHackathon Japan 2025. The resolver is implemented in PHP, chosen for its ubiquity on standard web servers and its compatibility with the EasyRDF library for RDF handling. It is easy to configure, requires minimal maintenance, and supports both database redirects and ontology term rendering with content negotiation for RDF serializations. The system is available as open-source software (https://github.com/JKoblitz/purl-resolver) and deployed at https://purl.dsmz.de, where it now resolves most identifiers from the DSMZ Digital Diversity SPARQL endpoint (https://sparql.dsmz.de). Database IRIs lead to the corresponding web interfaces, ontology IRIs from the DSMZ Digital Diversity Ontology render directly as term pages, and unmapped entities are delegated to database-side resolvers. This approach enhances the usability of knowledge graphs by ensuring that all identifiers remain actionable for both humans and machines.
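The three resolution paths described above (database redirect, ontology term rendering with content negotiation, and delegation of unmapped entities) can be sketched as a small decision function. This Python sketch is illustrative only: the actual resolver is written in PHP, and the prefixes, example.org URLs, status codes, and the `resolve` and `pick_format` helpers are assumptions rather than the project's configuration. For brevity it also ignores `q` weights in the Accept header.

```python
from typing import Tuple

# Hypothetical configuration; not the DSMZ mapping.
DATABASE_REDIRECTS = {"exampledb/": "https://db.example.org/entry/{id}"}
ONTOLOGY_PREFIX = "ontology/"
FALLBACK_RESOLVER = "https://resolver.example.org/{path}"
RDF_MEDIA_TYPES = {
    "text/turtle": "ttl",
    "application/rdf+xml": "rdf",
    "application/ld+json": "jsonld",
}


def pick_format(accept: str) -> str:
    """Return the first RDF serialization named in the Accept header, else 'html'."""
    for part in accept.split(","):
        media_type = part.split(";")[0].strip().lower()
        if media_type in RDF_MEDIA_TYPES:
            return RDF_MEDIA_TYPES[media_type]
    return "html"


def resolve(path: str, accept: str = "text/html") -> Tuple[int, str]:
    """Map a PURL path to (HTTP status, redirect target or rendering decision)."""
    path = path.lstrip("/")
    # 1. Database IRIs redirect to the corresponding web interface.
    for prefix, template in DATABASE_REDIRECTS.items():
        if path.startswith(prefix):
            return 303, template.format(id=path[len(prefix):])
    # 2. Ontology IRIs are rendered locally, as HTML or as an RDF serialization.
    if path.startswith(ONTOLOGY_PREFIX):
        term = path[len(ONTOLOGY_PREFIX):]
        return 200, f"render {term} as {pick_format(accept)}"
    # 3. Anything unmapped is delegated to another resolver.
    return 302, FALLBACK_RESOLVER.format(path=path)


if __name__ == "__main__":
    print(resolve("/exampledb/123"))
    print(resolve("/ontology/Term1", accept="text/turtle"))
    print(resolve("/unknown/xyz"))
```

Keeping the mapping in a plain prefix table is what makes a resolver of this kind cheap to maintain: adding a database means adding one entry, with no change to the resolution logic.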