DBCLS BioHackathon 2025, Mie, Japan, 2025

The BioHackathon in Japan, organized by the Database Center for Life Science (DBCLS), continues to prioritize the integrated use of databases in the life sciences, with a strong emphasis on interoperability, standardization, and the construction of FAIR knowledge graphs. In recent years, the development of tools and workflows for integrating heterogeneous biological and biomedical data—including multi-omics, imaging, clinical, and environmental data—has become increasingly central. This integration is key to enabling cross-domain analysis and improving reproducibility in data-driven research.
Previous DBCLS BioHackathon preprints
- DBCLS BioHackathon 2024, Fukushima, Japan, 2024
- DBCLS BioHackathon 2023, Kagawa, Japan, 2023
- NBDC/DBCLS BioHackathon 2022, Kochi, Japan
- DBCLS BioHackathon 2021, Aomori, Japan
- NBDC/DBCLS BioHackathon, Fukuoka, Japan, 2019
Preprints
- Translating and Formalizing the MIRAGE Guidelines to a Prototype MIRAGE Ontology and DCAT3 Extension Vocabulary for Glycomics Data Management
The Minimum Information Required for A Glycomics Experiment (MIRAGE) guidelines have established comprehensive reporting standards for glycomics research, yet their implementation in semantic web technologies remains limited. We present the first comprehensive semantic formalization of MIRAGE guidelines through an integrated RDF ontology framework comprising the MIRAGE Ontology and MIRAGE-DCAT3 vocabulary. The MIRAGE Ontology models glycan structures, biological specimens, analytical instruments, and experimental processes with formal OWL semantics and SHACL validation constraints. The complementary MIRAGE-DCAT3 vocabulary extends W3C DCAT3 with glycomics-specific metadata properties for dataset cataloging and discovery. Our implementation addresses critical challenges in glycomics data interoperability through comprehensive mappings to established ontologies including GlycoRDF, PSI-MS, and DCTERMS. This semantic framework enables automated quality assessment, federated data querying, and enhanced reproducibility in glycomics research, supporting broader adoption of FAIR principles in the glycobiology community. The framework demonstrates comprehensive coverage of MIRAGE reporting requirements across multiple analytical platforms including mass spectrometry, liquid chromatography, capillary electrophoresis, NMR spectroscopy, and lectin microarray analysis.
- DBCLS BioHackathon 2025 report: Creation and Publication Analytical Workflow of Creators' Interests
At the DBCLS BioHackathon 2025, we converted metatranscriptomic analytical shell scripts into Common Workflow Language (CWL) workflows containerized with Docker. Sub-workflows were created for metagenomic assembly, read mapping, and gene annotation, and validated with test datasets. The workflows, released on GitHub and WorkflowHub, improve reproducibility and address issues of reusability and software environment dependency. We also evaluated CWL best practices from the perspective of life scientists, classifying them by difficulty, importance, and applicability to promote FAIR principles and software quality. In parallel, we established a benchmarking framework for pangenome-based structural variant (SV) calling using data from the Dai population. Graph-based references from the Human and Chinese Pangenome Consortia were compared with linear references using minimap2 and vg giraffe. Results showed improved alignment accuracy and variant detection with pangenomes, demonstrating their value for reducing mapping bias and enhancing SV discovery.
- A Standards-Compliant, Multi-Modal Platform for Offline Access to SRA Metadata
The SRAmetaDBB project, presented at BioHackathon Japan 2023, introduced an experimental JavaScript pipeline for creating SQLite databases from NCBI SRA (Sequence Read Archive) metadata dumps, with a vision for offline analysis and integration with Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems. While promising, the prototype faced significant challenges in performance, memory management, and production readiness when scaling to the full SRA dataset of over 45 million records. This paper presents SRAKE (SRA Knowledge Engine), a complete reimplementation in Go that not only addresses these limitations but also extends the original vision with semantic search capabilities, quality control mechanisms, and multiple access interfaces. SRAKE achieves a 20-fold improvement in ingestion speed, maintains constant memory usage through zero-copy streaming, and provides standards-compliant interfaces following clig.dev guidelines. The platform introduces biomedical-specific semantic search using SapBERT embeddings via ONNX Runtime, implements comprehensive quality control thresholds for search results, and offers multiple access modalities including a CLI, REST API, MCP server for AI integration, and a simple web interface. Our development implementation demonstrates that SRAKE successfully transforms the experimental SRAmetaDBB concept into a production-ready platform that integrates seamlessly with modern AI workflows while maintaining the core vision of providing offline-capable, LLM-ready access to SRA metadata.
- A Lightweight PURL Resolver for Linked Life Science Data
Knowledge graphs in the life sciences are increasingly published using the Resource Description Framework (RDF) and queried via SPARQL endpoints. While these technologies enable powerful data integration, the identifiers returned in SPARQL results often do not resolve to meaningful resources, leaving users with non-actionable links. To address this issue, we developed a lightweight Persistent Uniform Resource Locator (PURL) resolver during the BioHackathon Japan 2025. The resolver is implemented in PHP, chosen for its ubiquity on standard web servers and its compatibility with the EasyRDF library for RDF handling. It is easy to configure, requires minimal maintenance, and supports both database redirects and ontology term rendering with content negotiation for RDF serializations. The system is available as open-source software (https://github.com/JKoblitz/purl-resolver) and deployed at https://purl.dsmz.de, where it now resolves most identifiers from the DSMZ Digital Diversity SPARQL endpoint (https://sparql.dsmz.de). Database IRIs lead to the corresponding web interfaces, ontology IRIs from the DSMZ Digital Diversity Ontology render directly as term pages, and unmapped entities are delegated to database-side resolvers. This approach enhances the usability of knowledge graphs by ensuring that all identifiers remain actionable for both humans and machines.
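The three resolution paths described in the abstract above (database redirect, ontology term rendering with content negotiation, and delegation of unmapped entities) can be illustrated with a minimal sketch. This is not the authors' PHP/EasyRDF implementation: it is a standard-library Python illustration, and every name in it (the routing tables, `resolve`, `preferred_type`, the `example.org` URLs, and the choice of a 302 status code) is a hypothetical assumption made for the example. The Accept-header handling is also simplified and ignores media-range specificity.

```python
import html
from typing import NamedTuple
from urllib.parse import quote

# Hypothetical routing tables; the real resolver's configuration is not
# described in the abstract, so these entries are placeholders.
BASE_IRI = "https://purl.example.org/"
DATABASE_TEMPLATES = {"exampledb": "https://db.example.org/entry/{id}"}
ONTOLOGY_TERMS = {"ontology/ExampleTerm": {"label": "Example term"}}
FALLBACK_TEMPLATE = "https://fallback.example.org/resolve/{path}"
OFFERED_TYPES = ("text/html", "text/turtle")  # HTML first, so it wins ties


class Response(NamedTuple):
    status: int
    headers: dict
    body: str


def preferred_type(accept, offered=OFFERED_TYPES):
    """Return the offered media type with the highest q-value in `accept`.

    Simplified negotiation: each offered type scores the maximum q of any
    matching media range; ties and non-matches fall back to offered[0].
    """
    scores = {t: 0.0 for t in offered}
    for part in accept.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_range = fields[0].lower()
        q = 1.0
        for param in fields[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        for t in offered:
            matches = media_range in (t, "*/*") or (
                media_range.endswith("/*") and t.startswith(media_range[:-1])
            )
            if matches:
                scores[t] = max(scores[t], q)
    return max(offered, key=lambda t: scores[t])


def resolve(path, accept="*/*"):
    """Map a PURL path and Accept header to a Response."""
    path = path.strip("/")

    # 1. Ontology terms are rendered directly, as HTML or Turtle.
    if path in ONTOLOGY_TERMS:
        label = ONTOLOGY_TERMS[path]["label"]
        ctype = preferred_type(accept)
        if ctype == "text/turtle":
            # Assumes labels contain no characters that need Turtle escaping.
            body = (
                f"<{BASE_IRI}{path}> "
                f'<http://www.w3.org/2000/01/rdf-schema#label> "{label}" .\n'
            )
        else:
            body = f"<h1>{html.escape(label)}</h1>"
        headers = {"Content-Type": f"{ctype}; charset=utf-8", "Vary": "Accept"}
        return Response(200, headers, body)

    # 2. Known database prefixes redirect to the database web interface.
    prefix, _, local_id = path.partition("/")
    if prefix in DATABASE_TEMPLATES and local_id:
        target = DATABASE_TEMPLATES[prefix].format(id=quote(local_id, safe=""))
        return Response(302, {"Location": target}, "")

    # 3. Everything else is delegated to a database-side resolver.
    return Response(302, {"Location": FALLBACK_TEMPLATE.format(path=quote(path))}, "")
```

In this sketch, `resolve("exampledb/123")` yields a redirect to the placeholder database page, while `resolve("ontology/ExampleTerm", "text/turtle")` returns a small Turtle document instead of the HTML term page. Listing HTML first among the offered types means that browsers, which typically send `*/*` with a lower q-value, still receive the human-readable page.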