Meetings
BioHackSWAT4HCLS 2025
BioHackathon Europe 2024
3rd BioHackathon Germany
DBCLS BioHackathon 2024
ELIXIR INTOXICOM
Recent preprints
-
Machine learning of transcriptome data treated with DNA base editor
Base Editor, a technique that utilizes Cas9 nickase fused with deaminase to introduce single base substitutions, has significantly facilitated the creation of valuable genome variants in medical and agricultural fields. However, a phenomenon known as RNA off-target effects is recognized with Base Editor, resulting in unintended substitutions in the transcriptome. It has been reported that such substitutions often occur in specific base motifs (ACW), but whether these motif mutations are dominant has not been investigated. In this study, we constructed a pipeline for analyzing RNA off-target effects, called the Pipeline for CRISPR-induced Transcriptome-wide Unintended RNA Editing (PiCTURE), and analyzed previously reported RNA-seq data. We found minor RNA off-target effects associated with the reported base motifs, and most were indistinguishable in motif analysis. Consequently, we trained a Large Language Model (LLM) specialized for DNA base sequences on RNA off-target sequences and developed a classifier for assessing the risk of RNA off-target effects based on the sequences. When the model’s estimations were applied to the RNA off-target data for BE4-rAPOBEC1 and BE4-RrA3F, satisfactory determination results were obtained. This study is the first to demonstrate the efficacy of machine learning approaches in determining RNA off-target effects caused by Base Editor and presents a predictive model for the safer use of Base Editor. -
BioHackJP 2023 Report R3: Expand the pathway analysis environment to non-model organisms
Despite decades of pathway database efforts and freely available pathway modeling tools, most researchers publish their biological pathway knowledge as static image figures made with general illustration tools. Prior to the BioHackathon, we had identified 103,009 pathway figures in the literature and performed optical character recognition (OCR) on them (Pathway Figure OCR; Hanspers et al., 2020). As an initial exploration, we extracted chemical names, disease terms, and human gene names. We knew, however, that many of the pathways represented biological processes and entities specific to plant, microbial, and numerous non-model organisms. To expand the pathway analysis environment to non-model organisms whose genomic and functional annotations are not organized in a central public database, we sought to expand the number of organism species included in the Pathway Figure OCR (PFOCR) database. Also, with the continuing goal of expanding the use of WikiPathways (Pico et al., 2008) and the practice of modeling pathway information as proper data models, we trained new users of PathVisio (Kutmon et al., 2015) and guided them through the process of publishing at WikiPathways. -
BioHackJP 2023 Report R4: integration of glyco data with chemo-, geno-, lipid-omics and pathway data
GlyTouCan is the international glycan repository which assigns unique accession numbers to glycans; it serves an important role in the interoperability of glycan-related databases and Web resources. GlyCosmos is a Web portal for glycoscience data, using semantic Web technologies to integrate heterogeneous data related to glycans. It currently contains information about glycogenes, glycoproteins, glycolipids, pathways, and diseases, in addition to providing various tools for glycan analysis. At BH23, we studied and developed the following: 1) Integration of the glycan data in GlyTouCan and PubChem: analyzing the glycan structures and the chemical representation of data in PubChem. 2) Integration of glycan data from GlyCosmos with UniProt. 3) Investigation of glycogene variants and phenotypes to integrate with GlyCosmos: investigating variants and phenotypes in the current life science database landscape. The glycogenes in GlyCosmos are managed using HGNC symbols and NCBI Gene IDs, so resources that can easily provide variants and phenotypes for a list of such genes are the strongest candidates. Comprehensiveness and accuracy are also important factors. 4) Update of Glyco-tools: software used by the glycomics community. 5) Semantic inferencing to enhance the knowledge in GlyCosmos: organizing the ontologies used in GlyCosmos to enable inferencing, and incorporating inferencing rules to generate new knowledge from existing data. -
RDF Data integration using Shape Expressions
This paper reports on the activities carried out during the Biohackathon 2023 in Shodoshima, Japan, in a project on RDF data integration using Shape Expressions. It describes several approaches that were discussed for creating RDF data subsets and some preliminary results from applying some of those technologies. It also describes the work done comparing RDF data modeling approaches such as ShEx, LinkML, and the YAML files from rdfconfig. -
Evaluating Oxigraph Server as a triple store for small and medium-sized datasets
With the escalating complexity and volume of bioinformatics data, there is a growing demand for efficient and versatile triplestore technologies. Contemporary programming languages, such as Rust, provide solutions to the constraints identified in traditional languages, placing emphasis on safety, performance, and enhanced developer experience. An example of this modern approach is Oxigraph, a Rust-based graph database demonstrating proficient graph data management, predominantly targeting single-node use cases. Despite its genesis as a hobby project, Oxigraph yields competitive performance on straightforward Online Transaction Processing (OLTP) workloads and shows considerable potential for future refinement. This study focuses on a comprehensive appraisal of the Oxigraph server’s efficacy in distinct use cases, going beyond typical SPARQL performance. The evaluation examines various operational aspects, including data loading, backup procedures, deployment strategies, maintenance protocols, and overall server usability. The evaluation was conducted using a subset of the PDB/RDF archive and the complete chem_comp/RDF archive, totalling around 0.5 billion triples. -
Bioinforming
Optimal formats to inform and engage young students in novel biology-related fields are short courses. Training schools, e.g. those lasting for five days, can provide enough content to introduce students to an extensive overview of bioinformatics and scientific career opportunities. In this work, we define a five-day training school format tailored to three target groups of young students: high school students, undergraduate students in biology-related fields, and undergraduate students in computational fields. We structure the content and sessions around learning areas consisting of learning topics, detailing the dependencies between them. For each learning topic, we define learning outcomes and learning activities. Moreover, we conceptualize a teaching platform to manage FAIRyfied (Findable, Accessible, Interoperable, Reusable) training materials that anyone will be able to use to design a new training school in bioinformatics. -
BioHackEU22 Report for Project 31: The What & How in data management: Improving connectivity between RDMkit and FAIR Cookbook
This report describes the work completed during the ELIXIR Biohackathon 2022 for project 31: The What & How in data management: Improving connectivity between RDMkit and FAIR Cookbook. The project covered 3 subjects: the technical connectivity between the two primary resources, an editorial alignment and gap analysis of their content, and the creation of user journeys incorporating the wider ELIXIR Research Data Management (RDM) ecosystem.