Meetings

Recent preprints

  • On the value of data

    What is the value of a dataset? This is a key question for many data management decisions, and a difficult one to answer, since “it depends” on the needs, the consumer, other data, and many other factors. This work sketches an approach to evaluating data based on trade-off decisions between different aspects of the data, with respect to different usage scenarios. It has been developed as a “collaborative paper” (on the value of data, a collaborative experiment). The version reported here is the result of the discussions at the BioHackathon in Fukushima, 2024.
  • Exploring Bioinformatics in the Wild: Insights from Real LLM Conversations

    The intersection of artificial intelligence (AI) and conversational data offers promising opportunities for advancing research in specialized fields such as biology and health sciences. The WildChat dataset, comprising over one million user-ChatGPT interactions, serves as a valuable resource for analyzing how advanced language models engage with complex topics. This work aims to explore how conversational AI models interpret and manage bioinformatics-related queries, assessing their effectiveness and identifying areas for improvement. By filtering and analyzing bioinformatics-related interactions within WildChat, the study highlights the current capabilities and limitations of these models, providing insights into their potential roles in supporting and enhancing research, education, and practical applications in bioinformatics and biology. Key findings include that GPT-3.5 Turbo can save both time and money while still providing satisfactory performance in handling bioinformatics-related queries, making it a cost-effective option for many applications. However, models like Llama 3 8B Instruct and Mistral 7B Instruct were found to underperform in comparison, struggling with the specialized vocabulary and nuanced contexts inherent in bioinformatics. Additionally, it was observed that Anthropic’s Claude model is notably harder to jailbreak, suggesting stronger safeguards against misuse, which is crucial for maintaining the integrity of conversational AI in sensitive domains. Expanding the scope of conversational datasets to include a broader range of detailed interactions is crucial for developing more robust, context-aware bioinformatics tools. This investigation not only underscores the strengths and weaknesses of current conversational AI systems but also offers a roadmap for future improvements, ultimately contributing to the evolving interface between AI technology and bioinformatics.
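    A minimal sketch of the kind of keyword-based filtering step described above is given below; the Hugging Face dataset identifier, the record field names, and the keyword list are illustrative assumptions, not the study’s actual protocol.

      # Hypothetical sketch: select WildChat conversations that mention bioinformatics topics.
      # Dataset id, record schema, and keywords are assumptions for illustration only.
      from datasets import load_dataset

      KEYWORDS = {"bioinformatics", "blast", "fasta", "fastq", "rna-seq",
                  "sequence alignment", "variant calling", "genome assembly"}

      def is_bioinformatics(record):
          """True if any turn of the conversation contains one of the keywords."""
          text = " ".join(turn["content"].lower() for turn in record["conversation"])
          return any(keyword in text for keyword in KEYWORDS)

      wildchat = load_dataset("allenai/WildChat", split="train")   # assumed dataset id
      bioinfo = wildchat.filter(is_bioinformatics)
      print(f"{len(bioinfo)} of {len(wildchat)} conversations look bioinformatics-related")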
  • BioHack24 report: Toward improving mechanisms for extracting RDF shapes from large inputs

    RDF shapes have proven to be effective mechanisms for describing and validating RDF content. Typically, shapes are written by domain experts. However, writing and maintaining these shapes can be challenging when dealing with large and complex schemas. To address this issue, automatic shape extractors have been proposed. These tools are designed to analyze existing RDF content and generate shapes that conform with the underlying schemas. Nevertheless, extracting shapes from large datasets presents significant scalability challenges. In this document, we describe our work during the 2024 BioHackathon held in Fukushima, Japan, to tackle this problem. Our approach is based on slicing the input data, performing parallelized shape extraction processes, and merging the resulting partial outputs. By refining our software and methods, we successfully extracted shapes from a subset of UniProt, containing an estimated 15.9 billion triples.
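    The following is a conceptual Python sketch of the slice / parallel-extract / merge strategy described above; extract_shapes() and merge_shapes() are placeholders for the actual shape-extraction tooling, and the chunk size and file name are arbitrary assumptions.

      # Conceptual outline: slice the input, extract shapes per slice in parallel, merge results.
      # extract_shapes() and merge_shapes() are placeholders, not a real extractor API.
      import itertools
      from multiprocessing import Pool

      CHUNK_SIZE = 5_000_000  # triples per slice (arbitrary)

      def slices(ntriples_path, chunk_size=CHUNK_SIZE):
          """Yield successive slices of an N-Triples file as lists of lines."""
          with open(ntriples_path) as handle:
              while True:
                  chunk = list(itertools.islice(handle, chunk_size))
                  if not chunk:
                      break
                  yield chunk

      def extract_shapes(chunk):
          """Placeholder: run a shape extractor over one slice and return partial shapes."""
          raise NotImplementedError

      def merge_shapes(partial_results):
          """Placeholder: merge the partial shapes into one consolidated schema."""
          raise NotImplementedError

      if __name__ == "__main__":
          with Pool(processes=8) as pool:
              partials = list(pool.imap(extract_shapes, slices("uniprot_subset.nt")))
          shapes = merge_shapes(partials)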
  • BioHack24 report: Using discovered RDF schemes: a compilation of potential use cases for shapes reusage

    RDF shapes are formal expressions of schema structures in RDF data. Their primary purpose is twofold: describing and validating RDF data. However, as machine-readable representations of the expected structures in a given data source, RDF shapes can be applied to various tasks that require automatic comprehension of data schemas. In this paper, we present our work conducted during the DBCLS BioHackathon 2024 in Fukushima, Japan, to harness the potential of RDF shapes. The identified and partially implemented use cases include the generation and validation of SPARQL queries, data and schema visualization, mappings to other formal syntaxes, and applications in data modeling scenarios.
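    As an illustration of the first use case listed above (deriving SPARQL queries from discovered shapes), the short sketch below builds a SELECT query skeleton from a shape; the shape dictionary and the up: prefix are made-up examples rather than output of the actual tooling, and prefix declarations are omitted.

      # Illustrative only: turn a discovered shape into a SPARQL query that retrieves
      # every property the shape declares for its target class.
      shape = {
          "target_class": "up:Protein",                        # assumed example class
          "properties": ["up:organism", "up:sequence", "rdfs:label"],
      }

      def shape_to_sparql(shape):
          """Build a SELECT query covering all properties declared in the shape."""
          variables = [f"?v{i}" for i in range(len(shape["properties"]))]
          triple_patterns = "\n  ".join(
              f"?s {prop} {var} ." for prop, var in zip(shape["properties"], variables)
          )
          return (
              f"SELECT ?s {' '.join(variables)} WHERE {{\n"
              f"  ?s a {shape['target_class']} .\n"
              f"  {triple_patterns}\n"
              f"}}"
          )

      print(shape_to_sparql(shape))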
  • Publishing FAIR datasets from Bioimaging repositories

    We assessed the implementation of the FAIR principles in current bioimaging and clinical imaging data repositories. Additionally, to make the RDF triples exported from the Imaging Data Resource (IDR) discoverable, we explored the FAIR Data Point interface (Silva Santos et al., 2023) as a mechanism to expose machine-actionable metadata, and examined how it could be added to the IDR portal.
  • INTOXICOM Workshop Report: FAIRification of Toxicological Research Output: Leveraging ELIXIR Resources

    This report documents the first workshop of the ELIXIR Toxicology Community (Martens et al., 2023), held in Utrecht on May 28-29, 2024 (FAIRification of Toxicological Research Output: Leveraging ELIXIR Resources, 2024), as part of the INTOXICOM Implementation Study workshop series (Integrating the Toxicology Community into ELIXIR, 2024). The main topic of the meeting was the FAIRification of toxicological research outputs and exploring the potential role of ELIXIR resources in this process. A team of ten people from the ELIXIR Toxicology Community, including Marvin Martens, Penny Nymark, Iseult Lynch, Meike Bünger, Rob Stierum, Thomas Exner, Egon Willighagen, Ammar Ammar, Dominik Martinát, and Karel Berka, coordinated the event.
  • Unveiling ecological dynamics through simulation and visualization of biodiversity data cubes

    The gcube R package, developed during the B-Cubed hackathon (Hacking Biodiversity Data Cubes for Policy), provides a flexible framework for generating biodiversity data cubes from minimal input. The package assumes three consecutive steps: (1) the occurrence process, (2) the detection process, and (3) the grid designation process, supported by three main functions, respectively: simulate_occurrences(), sample_observations(), and grid_designation(). It allows for customisable spatial and temporal patterns, detection probabilities, and sampling biases. During the hackathon, collaboration was highly efficient thanks to thorough preparation, task division, and the use of a scrum board. Fourteen participants contributed 209 commits, resulting in a functional package with a pkgdown website, 67% code coverage, and successful R CMD checks. However, certain limitations were identified, such as the lack of spatiotemporal autocorrelation in the occurrence simulations, which affects the model’s realism. Future development will focus on improving spatiotemporal dynamics, adding comprehensive documentation and testing, and expanding functionality to support multi-species simulations. The package also aims to incorporate a virtual species workflow, linking the virtualspecies package to the gcube processes. Despite these challenges, gcube strikes a balance between usability and complexity, offering researchers a valuable tool for simulating biodiversity data cubes and assessing research questions under different parameter settings, such as the effect of spatial clustering on the occurrence-to-grid designation, or the effect of different patterns of missingness on data quality and on the robustness of derived biodiversity indicators.
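    As a language-agnostic illustration of the three-step process described above (gcube itself is an R package), the sketch below simulates occurrences, thins them with a detection probability, and aggregates the detections onto a grid; all numbers and the grid layout are arbitrary assumptions.

      # Conceptual illustration of the occurrence -> detection -> grid-designation steps.
      # Not the gcube API: values, grid, and random processes are arbitrary assumptions.
      import numpy as np

      rng = np.random.default_rng(42)

      # (1) Occurrence process: random occurrence points in a unit square.
      n_occurrences = 500
      occurrences = rng.uniform(0.0, 1.0, size=(n_occurrences, 2))

      # (2) Detection process: each occurrence is observed with a fixed detection probability.
      detection_probability = 0.6
      observed = occurrences[rng.random(n_occurrences) < detection_probability]

      # (3) Grid designation: count observations per cell of a 10 x 10 grid (the data cube).
      cube, _, _ = np.histogram2d(observed[:, 0], observed[:, 1],
                                  bins=10, range=[[0.0, 1.0], [0.0, 1.0]])
      print(cube.astype(int))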