Revisiting SRAmetadb.sqlite

The SRAmetadb.sqlite database, which compiles Sequence Read Archive (SRA) metadata into an offline SQLite format, has been a crucial resource for bioinformatics tools like the SRAdb R package and the pysradb. Despite its utility, the database has not been regularly updated, with the last refresh occurring in late 2023. Moreover, no public tools exist to rebuild or update this database. This report introduces an open-source pipeline developed during the 2024 international biohackaton, designed to generate and update a similar SRAmetadb.sqlite database from SRA metadata, addressing the gap left by the lack of recent updates.The SRAmetadb.sqlite database’s value extends beyond its original use cases, offering potential integration with other tools such as DuckDB and programmatically accessing from custom scripts. The proposed pipeline introduces features like the generation of metadata subsets, enabling researchers to focus on specific species. It also offers offline access to SRA metadata, significantly enhancing query speed and efficiency. This adaptability is particularly relevant as new use cases emerge, including applications in large language models (LLMs) and Retrieval-Augmented Generation (RAG).This pipeline prioritizes low resource usage and ease of maintenance. It is not intended as a direct replacement for the original SRAmetadb.sqlite but seeks to maintain compatibility while exploring the benefits of modern SQLite features. By providing this tool as an open-source resource, the project encourages community involvement to ensure its ongoing development and relevance in the evolving landscape of bioinformatics research.