Meetings

Recent preprints

  • Efforts to analyze pathways in non-model organisms

    In addition to functional annotation of genes, annotating genes to pathways is important in current molecular biology. However, pathway diagrams are required in order to annotate genes to their nodes. Therefore, it is important to draw pathway diagrams with genes and metabolites assigned. Existing metabolic pathway databases focus on generic pathways, while secondary metabolism is emphasized in organisms producing useful substances. Moreover, they cannot accept third-party annotation of those data. A practical system for pathway analysis is therefore needed. Following on from the previous BioHackathon (BH23), we first discussed how to create a database of pathway information for non-model species at a domestic version of the BioHackathon called BH23.9, held in Shirahama, Wakayama, Japan (25-29 September 2023). We then gave a tutorial on how to draw a pathway diagram using PathVisio, a free, open-source tool for drawing, editing, and analyzing biological pathways. Finally, we tried to establish a system for converting text data to Graphical Pathway Markup Language (GPML), called txt2gpml. txt2gpml will drastically reduce the time and effort required to create pathway diagrams. After stimulating discussions at BH23 and BH23.9, we were able to clarify the current issues in pathway analysis for non-model organisms.
  • BioHackJP 2023 Report R3: Plant data integration for findability across multiple databases

    Plant research generates vast amounts of heterogeneous data available in dispersed repositories. Accessing, integrating, and analyzing these datasets is therefore a challenge, caused by their low findability as well as variability in formats and standards. Several solutions, including data standards (MIAPPE, BrAPI) and portals (FAIDARE), are recommended by the ELIXIR plant community through the RDM Kit plant pages. The BioHackathon Japan 2023 was an ideal event to promote those solutions to Japanese researchers and bioinformaticians, in order to increase the visibility of Japanese databases in the plant research data discovery portal FAIDARE and to explore the use of the Breeding API for knowledge graphs.
  • BioHackEU22 Report: Enhancing Research Data Management in Galaxy and Data Stewardship Wizard by utilising RO-Crates

    This report describes the integration of RO-Crates into Data Stewardship Wizard and Galaxy during the BioHackathon Europe 2023, aiming to improve data management and sharing in scientific research. By utilizing RO-Crates, researchers can easily create machine-readable metadata for their datasets, ensuring long-term discoverability, accessibility, and reusability. The seamless integration of RO-Crates in these platforms enhances collaboration between researchers and institutions, facilitating data sharing and reuse across projects and domains. Future efforts may focus on enhancing RO-Crate’s interoperability with other standards and platforms, as well as promoting wider adoption through outreach and education initiatives to meet the evolving needs of researchers and institutions in data stewardship.
  • Infrastructure for synthetic health data

    Machine learning (ML) methods are becoming ever more prevalent across all domains of life sciences. However, a key component of effective ML is the availability of large datasets that are diverse and representative. In the context of health systems, with significant heterogeneity of clinical phenotypes and diversity of healthcare systems, there exists a necessity to develop and refine unbiased and fair ML models. Synthetic data are increasingly being used to protect the patient’s right to privacy and overcome the paucity of annotated open-access medical data. Here, we present our proof of concept for the generation of synthetic health data and our proposed FAIR implementation of the generated synthetic datasets. The work was developed during and after the one-week-long BioHackathon Europe by a total of 20 participants (10 new to the project) from different countries (NL, ES, LU, UK, GR, FL, DE, ...).
  • Redesign of the validation framework in LinkML

    LinkML is a data modeling language that can be used to describe the structure and semantics of data from a specific domain. As with any modeling language, there is a need for tools that support validation of data. LinkML provides a set of validation tools, but there is a growing need to adapt these tools for a broader audience. This report describes the effort to redesign the validation framework in LinkML to better support a wider range of validation scenarios and use cases.
  • Machine learning of transcriptome data treated with DNA base editor

    Base Editor, a technique that utilizes Cas9 nickase fused with deaminase to introduce single base substitutions, has significantly facilitated the creation of valuable genome variants in medical and agricultural fields. However, a phenomenon known as RNA off-target effects is recognized with Base Editor, resulting in unintended substitutions in the transcriptome. It has been reported that such substitutions often occur in specific base motifs (ACW), but whether these motif mutations are dominant has not been investigated. In this study, we constructed a pipeline for analyzing RNA off-target effects, called the Pipeline for CRISPR-induced Transcriptome-wide Unintended RNA Editing (PiCTURE), and analyzed previously reported RNA-seq data. We found minor RNA off-target effects associated with the reported base motifs, and most were indistinguishable in motif analysis. Consequently, we trained a Large Language Model (LLM) specialized for DNA base sequences on RNA off-target sequences and developed a classifier for assessing the risk of RNA off-target effects based on the sequences. When the model’s estimations were applied to the RNA off-target data for BE4-rAPOBEC1 and BE4-RrA3F, satisfactory determination results were obtained. This study is the first to demonstrate the efficacy of machine learning approaches in determining RNA off-target effects caused by Base Editor and presents a predictive model for the safer use of Base Editor.
  • BioHackJP 2023 Report R3: Expand the pathway analysis environment to non-model organisms

    Despite decades of pathway database efforts and freely available pathway modeling tools, most researchers publish their biological pathway knowledge as static image figures made with general illustration tools. Prior to the BioHackathon, we had identified 103,009 pathway figures in the literature and performed optical character recognition (OCR) (Pathway Figure OCR; Hanspers et al., 2020). As an initial exploration, we extracted chemical names, disease terms, and human gene names. We knew, however, that many of the pathways represented biological processes and entities specific to plant, microbial, and numerous non-model organisms. To expand the pathway analysis environment to non-model organisms whose genomic and functional annotations are not organized in a central public database, we sought to expand the number of organism species included in the Pathway Figure OCR (PFOCR) database. Also, with the continuing goal of expanding the use of WikiPathways (Pico et al., 2008) and the practice of modeling pathway information as proper data models, we trained new users of PathVisio (Kutmon et al., 2015) and guided them through the process of publishing at WikiPathways.