DBCLS BioHackathon 2024, Fukushima, Japan, 2024
Preprints
BioHackJP24 report: Running a WikiBlitz
During BioHackathon 24 in Fukushima, we organized a WikiBlitz, a collaborative effort to integrate biodiversity observations from iNaturalist into the Wikimedia ecosystem. A WikiBlitz is inspired by the concept of a BioBlitz, in which participants document as many species as possible within a limited time frame; in a WikiBlitz, participants additionally contribute structured data to Wikidata, Wikimedia Commons, and Wikipedia. In this report, we describe the methodology and outcomes of the event, including the collection of 109 biodiversity observations and their subsequent verification and integration into Wikimedia platforms. We highlight the challenges and best practices for running a WikiBlitz, particularly around licensing and data quality, and demonstrate how tools such as iNaturalist2Commons and Wikidata queries can enhance the reuse of citizen science data. Finally, we provide a step-by-step tutorial to support future WikiBlitz events, ensuring broader participation and sustainable knowledge-sharing across platforms.
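As a minimal sketch of the kind of Wikidata query mentioned above (not taken from the report itself), the following Python snippet uses SPARQLWrapper to look up Wikidata items that carry an iNaturalist taxon ID (property P3151) against the public Wikidata endpoint; the query is purely illustrative.

```python
# Illustrative sketch: query Wikidata for taxa that have an iNaturalist taxon ID (P3151).
# Uses the public Wikidata SPARQL endpoint; the query is an example, not the one used at the event.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://query.wikidata.org/sparql"

query = """
SELECT ?taxon ?taxonLabel ?inatId WHERE {
  ?taxon wdt:P3151 ?inatId .          # iNaturalist taxon ID
  ?taxon wdt:P105 wd:Q7432 .          # taxon rank: species
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

sparql = SPARQLWrapper(ENDPOINT, agent="WikiBlitz-example/0.1")
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for row in results["results"]["bindings"]:
    print(row["taxonLabel"]["value"], row["inatId"]["value"])
```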
On the value of data
What is the value of a dataset? This is a key question for many data management decisions. It is a difficult question to answer, as “it depends” on the needs, the consumer, other data, and many other factors. This work sketches an approach to evaluating data that is based on trade-off decisions between different aspects of the data with respect to different usage scenarios. It has been developed as a “collaborative paper” (on the value of data, a collaborative experiment). The version reported here is the result of the discussions at the BioHackathon in Fukushima, 2024.
Exploring Bioinformatics in the Wild: Insights from Real LLM Conversations
The intersection of artificial intelligence (AI) and conversational data offers promising opportunities for advancing research in specialized fields such as biology and health sciences. The WildChat dataset, comprising over one million user-ChatGPT interactions, serves as a valuable resource for analyzing how advanced language models engage with complex topics. This work aims to explore how conversational AI models interpret and manage bioinformatics-related queries, assessing their effectiveness and identifying areas for improvement. By filtering and analyzing bioinformatics-related interactions within WildChat, the study highlights the current capabilities and limitations of these models, providing insights into their potential roles in supporting and enhancing research, education, and practical applications in bioinformatics and biology. Key findings include that GPT-3.5 Turbo can save both time and money while still providing satisfactory performance in handling bioinformatics-related queries, making it a cost-effective option for many applications. However, models like Llama 3 8B Instruct and Mistral 7B Instruct were found to underperform in comparison, struggling with the specialized vocabulary and nuanced contexts inherent in bioinformatics. Additionally, it was observed that Anthropic’s Claude model is notably harder to jailbreak, suggesting stronger safeguards against misuse, which is crucial for maintaining the integrity of conversational AI in sensitive domains. Expanding the scope of conversational datasets to include a broader range of detailed interactions is crucial for developing more robust, context-aware bioinformatics tools. This investigation not only underscores the strengths and weaknesses of current conversational AI systems but also offers a roadmap for future improvements, ultimately contributing to the evolving interface between AI technology and bioinformatics.
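The filtering step described above could be approximated as in the sketch below. This is a minimal example assuming the WildChat dataset is loaded from the Hugging Face Hub; the dataset identifier, record structure, and keyword list are assumptions for illustration, not the selection criteria actually used in the study.

```python
# Minimal sketch of keyword-based filtering of bioinformatics-related conversations.
# Dataset identifier and record structure are assumptions; adapt to the actual WildChat release.
from datasets import load_dataset

BIOINFO_KEYWORDS = {
    "bioinformatics", "blast", "fastq", "genome", "rna-seq",
    "alignment", "variant calling", "phylogenetic", "protein structure",
}

def is_bioinformatics(conversation) -> bool:
    """Return True if any turn in the conversation mentions a bioinformatics keyword."""
    text = " ".join(turn.get("content", "") for turn in conversation).lower()
    return any(keyword in text for keyword in BIOINFO_KEYWORDS)

# "allenai/WildChat-1M" is the assumed Hub identifier for the dataset.
wildchat = load_dataset("allenai/WildChat-1M", split="train")
bioinfo_subset = wildchat.filter(lambda rec: is_bioinformatics(rec["conversation"]))

print(f"{len(bioinfo_subset)} of {len(wildchat)} conversations look bioinformatics-related")
```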
BioHack24 report: Toward improving mechanisms for extracting RDF shapes from large inputs
RDF shapes have proven to be effective mechanisms for describing and validating RDF content. Typically, shapes are written by domain experts. However, writing and maintaining these shapes can be challenging when dealing with large and complex schemas. To address this issue, automatic shape extractors have been proposed. These tools are designed to analyze existing RDF content and generate shapes that conform with the underlying schemas. Nevertheless, extracting shapes from large datasets presents significant scalability challenges. In this document, we describe our work during the 2024 BioHackathon held in Fukushima, Japan, to tackle this problem. Our approach is based on slicing the input data, performing parallelized shape extraction processes, and merging the resulting partial outputs. By refining our software and methods, we successfully extracted shapes from a subset of UniProt, containing an estimated 15.9 billion triples.
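The slice–extract–merge strategy can be illustrated with a simplified sketch. The snippet below is not the group's actual implementation: it splits an N-Triples file into chunks, runs a placeholder per-chunk step in parallel, and merges the partial results; the real work used a dedicated shape extractor and far larger inputs.

```python
# Simplified illustration of the slice / parallel-extract / merge idea (not the actual tooling).
from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from itertools import islice

def chunks(path, size=1_000_000):
    """Yield lists of at most `size` N-Triples lines from a file."""
    with open(path) as fh:
        while True:
            block = list(islice(fh, size))
            if not block:
                break
            yield block

def extract_partial_shape_stats(lines):
    """Placeholder per-slice step: count predicate usage as a stand-in for shape extraction."""
    stats = Counter()
    for line in lines:
        parts = line.split(None, 2)
        if len(parts) == 3:
            stats[parts[1]] += 1   # predicate IRI
    return stats

def merge(partials):
    """Merge partial results from all slices into a single summary."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        merged = merge(pool.map(extract_partial_shape_stats, chunks("data.nt")))
    print(merged.most_common(10))
```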
BioHack24 report: Using discovered RDF schemes: a compilation of potential use cases for shapes reusage
RDF shapes are formal expressions of schema structures in RDF data. Their primary purpose is twofold: describing and validating RDF data. However, as machine-readable representations of the expected structures in a given data source, RDF shapes can be applied to various tasks that require automatic comprehension of data schemas. In this paper, we present our work conducted during the DBCLS BioHackathon 2024 in Fukushima, Japan, to harness the potential of RDF shapes. The identified and partially implemented use cases include the generation and validation of SPARQL queries, data and schema visualization, mappings to other formal syntaxes, and applications in data modeling scenarios.
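One of the listed use cases, generating SPARQL queries from shapes, can be sketched as follows. The shape representation and the generated query are purely illustrative: the BioHackathon work targets real RDF shape languages rather than the toy dictionary used here.

```python
# Toy illustration of deriving a SPARQL query from a shape-like description.
# The dictionary format below is an assumption for illustration, not a real shape syntax.
toy_shape = {
    "target_class": "http://purl.uniprot.org/core/Protein",
    "properties": [
        "http://purl.uniprot.org/core/mnemonic",
        "http://purl.uniprot.org/core/organism",
    ],
}

def shape_to_select_query(shape) -> str:
    """Build a SELECT query retrieving instances of the target class with the shaped properties."""
    vars_and_patterns = []
    for i, prop in enumerate(shape["properties"]):
        vars_and_patterns.append((f"?v{i}", f"?s <{prop}> ?v{i} ."))
    select_vars = " ".join(v for v, _ in vars_and_patterns)
    patterns = "\n  ".join(p for _, p in vars_and_patterns)
    return (
        f"SELECT ?s {select_vars} WHERE {{\n"
        f"  ?s a <{shape['target_class']}> .\n"
        f"  {patterns}\n"
        f"}} LIMIT 100"
    )

print(shape_to_select_query(toy_shape))
```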
Expanding data on cultivation media and microbial traits
The standardization and integration of cultivation media data are essential for advancing microbial research and enabling AI-based predictions of optimal growth conditions. This study addresses the challenges of data fragmentation by aligning terminologies and mapping ingredients between two prominent databases: MediaDive (DSMZ) and TogoMedium (DBCLS). We successfully linked 870 ingredients, expanded the Growth Media Ontology (GMO), and prepared data for media similarity calculations, thereby enhancing the interoperability of these resources. Additionally, we developed the first version of a BacDive RDF knowledge graph, incorporating mapping rules for 24 key entities and materializing the data in Turtle format to facilitate integration into broader knowledge networks. We also propose a novel process for the standardized registration of media recipes by depositors, ensuring that these recipes can be cited and shared consistently. Together, these efforts contribute to the creation of a more cohesive and accessible microbial data ecosystem, supporting future research and innovation.
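To make the kind of cross-database ingredient link concrete, the following rdflib sketch builds a tiny graph that maps a MediaDive ingredient to a GMO term and serializes it as Turtle. All URIs and identifiers here are invented placeholders; the actual mapping rules and namespaces used in the project may differ.

```python
# Hypothetical example of linking a MediaDive ingredient to a GMO term; URIs are placeholders.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDFS, SKOS

# Placeholder namespaces -- not the project's actual ones.
MEDIADIVE = Namespace("https://example.org/mediadive/ingredient/")
GMO = Namespace("https://example.org/gmo/")

g = Graph()
g.bind("skos", SKOS)

ingredient = MEDIADIVE["123"]          # hypothetical MediaDive ingredient ID
gmo_term = GMO["GMO_000045"]           # hypothetical GMO term ID

g.add((ingredient, RDFS.label, Literal("yeast extract", lang="en")))
g.add((ingredient, SKOS.exactMatch, gmo_term))

print(g.serialize(format="turtle"))
```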
Revisiting SRAmetadb.sqlite
The SRAmetadb.sqlite database, which compiles Sequence Read Archive (SRA) metadata into an offline SQLite format, has been a crucial resource for bioinformatics tools like the SRAdb R package and pysradb. Despite its utility, the database has not been regularly updated, with the last refresh occurring in late 2023. Moreover, no public tools exist to rebuild or update this database. This report introduces an open-source pipeline developed during the 2024 international BioHackathon, designed to generate and update a similar SRAmetadb.sqlite database from SRA metadata, addressing the gap left by the lack of recent updates. The SRAmetadb.sqlite database’s value extends beyond its original use cases, offering potential integration with other tools such as DuckDB and programmatic access from custom scripts. The proposed pipeline introduces features like the generation of metadata subsets, enabling researchers to focus on specific species. It also offers offline access to SRA metadata, significantly enhancing query speed and efficiency. This adaptability is particularly relevant as new use cases emerge, including applications in large language models (LLMs) and Retrieval-Augmented Generation (RAG). This pipeline prioritizes low resource usage and ease of maintenance. It is not intended as a direct replacement for the original SRAmetadb.sqlite but seeks to maintain compatibility while exploring the benefits of modern SQLite features. By providing this tool as an open-source resource, the project encourages community involvement to ensure its ongoing development and relevance in the evolving landscape of bioinformatics research.
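As a usage sketch, the snippet below queries such a database offline with Python's built-in sqlite3 module. The table and column names (an sra table with run_accession, study_accession, library_strategy, and scientific_name columns) follow the layout of the original SRAmetadb.sqlite, but should be treated as assumptions and checked against the schema produced by the pipeline.

```python
# Sketch of offline querying of an SRAmetadb-style SQLite file.
# Table/column names are assumptions based on the original SRAmetadb layout.
import sqlite3

conn = sqlite3.connect("SRAmetadb.sqlite")
conn.row_factory = sqlite3.Row

query = """
SELECT run_accession, study_accession, library_strategy
FROM sra
WHERE scientific_name = ?
  AND library_strategy = 'RNA-Seq'
LIMIT 20
"""

for row in conn.execute(query, ("Homo sapiens",)):
    print(row["run_accession"], row["study_accession"], row["library_strategy"])

conn.close()
```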
DBCLS BioHackathon 2024 Report for Project: Human Glycome Atlas
As part of BioHackathon 2024, we report here on our analysis of tools reviewed by this group to implement a new knowledgebase called TOHSA for the Human Glycome Atlas (HGA) Project. In particular, we focus on our experience integrating the QLever framework, a promising Semantic Web tool for handling “Triple Stores” and SPARQL technologies, in the scope of creating a reliable and performant Semantic Knowledge-Base (Infrastructure and Portal) for TOHSA. QLever highlights the ongoing relevance and potential of these technologies to deliver scalable and reliable solutions. It was encouraging to see that implementations and development of “Triple Store” and SPARQL technologies for the community are ongoing, and that increasingly useful, performant, scalable, and reliable open-source software is being produced. We also conducted a general review and comparison of relevant Semantic Web frameworks for our use case.
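Since QLever exposes a standard SPARQL endpoint over HTTP, querying such an instance from Python could look like the sketch below. The endpoint URL and the query are placeholders invented for illustration; the report does not describe concrete endpoints or schemas.

```python
# Placeholder sketch of querying a QLever SPARQL endpoint over HTTP.
# The endpoint URL and the query are invented for illustration.
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://localhost:7001"     # hypothetical local QLever instance

query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"

params = urllib.parse.urlencode({"query": query})
request = urllib.request.Request(
    f"{ENDPOINT}?{params}",
    headers={"Accept": "application/sparql-results+json"},
)
with urllib.request.urlopen(request) as response:
    results = json.load(response)

for binding in results["results"]["bindings"]:
    print(binding["s"]["value"], binding["p"]["value"], binding["o"]["value"])
```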
The Plant Breeding Ontology (PBO): towards an ontology for the plant breeding community
The need to standardize the language used within a community has been recognized as a major prerequisite for better integration of data as well as their further analysis. The plant breeding community makes use of a very specialized language, which has been evolving along with new technologies and the needs of its end users (e.g. farmers). This community is spread all over the world; therefore, a translation of the most commonly used terms has always been a key asset for accomplishing its objectives as well as those of its collaborators. Here, we present PBO (Plant Breeding Ontology), an ontology for the plant breeding community which captures more than 2200 entries, of which 80 represent the core terms. PBO has translations in 8 different languages: English (main language), Spanish, French, Dutch, German, Japanese, Catalan and Thai, as well as definitions, synonyms, derived terms and examples of their usage. PBO has been built partly manually and partly semi-automatically.
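To illustrate what a multilingual term entry of this kind can look like in RDF, the rdflib sketch below attaches labels in several of the listed languages to a single invented term; the URI, labels, and modeling choices are placeholders rather than actual PBO content.

```python
# Invented example of a multilingual ontology term entry; not actual PBO content.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDFS, SKOS

PBO = Namespace("https://example.org/pbo/")    # placeholder namespace

g = Graph()
g.bind("skos", SKOS)

term = PBO["0000001"]                          # hypothetical term: "cultivar"
g.add((term, RDFS.label, Literal("cultivar", lang="en")))
g.add((term, RDFS.label, Literal("cultivar", lang="es")))
g.add((term, RDFS.label, Literal("栽培品種", lang="ja")))
g.add((term, SKOS.definition, Literal("A plant variety produced by selective breeding.", lang="en")))

print(g.serialize(format="turtle"))
```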
DBCLS BioHackathon 2024 report: Everything about workflow and container
Workflow engines are now widely used for genome analysis workflows. On the other hand, there are still difficulties in building and executing such workflows in various respects. Examples of such difficulties include: how to develop workflows in workflow languages such as Common Workflow Language (CWL), Snakemake, Nextflow, and others; how to integrate workflows with containers such as Docker, Singularity, and Podman; and how to integrate workflows with job schedulers such as Slurm and GridEngine. Our group tackled these problems through the following activities. First, we cooperated with other groups to develop their workflows and to integrate them with containers. Second, we developed and improved workflow ecosystems to remove the barriers to developing and executing workflows; these ecosystems include workflow executors, specifications of workflow languages, and workflow-related tools. This paper reports what we did during the DBCLS BioHackathon 2024.
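As a minimal, generic illustration of combining a workflow language with a container (not an example from the report), the snippet below writes a small CWL CommandLineTool that pulls a Docker image and runs it with cwltool; the tool definition and container image are placeholders, and cwltool plus a Docker-compatible runtime must be installed.

```python
# Minimal, generic illustration of a CWL tool with a Docker requirement, run via cwltool.
# The tool definition and container image are placeholders, not taken from the report.
import pathlib
import subprocess

CWL_TOOL = """\
cwlVersion: v1.2
class: CommandLineTool
baseCommand: echo
requirements:
  DockerRequirement:
    dockerPull: ubuntu:22.04
inputs:
  message:
    type: string
    inputBinding:
      position: 1
outputs:
  out:
    type: stdout
stdout: output.txt
"""

pathlib.Path("echo-tool.cwl").write_text(CWL_TOOL)

# Requires cwltool and Docker (or a compatible runtime) to be installed.
subprocess.run(
    ["cwltool", "echo-tool.cwl", "--message", "Hello from BioHackathon"],
    check=True,
)
```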