BioHackathon Europe 2020 Online
Preprints
Metadata for BioHackrXiv Markdown publications
biohackrxiv.org is a scholarly publication service for BioHackathons and Codefests where papers are generated from Markdown templates whose header is a YAML/JSON record that includes the title, authors, affiliations and tags. Many projects in BioHackathons are about using FAIR data. Because the current setup is lacking in the findable (F) and accessible (A) of FAIR, for the ELIXIR BioHackathon 2020 we decided to add an additional service that provides a SPARQL endpoint for queries and some simple HTML output that can be embedded in a BioHackathon website.
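As an illustration, the kind of metadata header described above might look like the sketch below, parsed here with PyYAML. The field names and values are assumptions for illustration, not the official BioHackrXiv template schema.

```python
# Minimal sketch: parse a BioHackrXiv-style YAML front-matter header.
# The fields (title, authors, affiliations, tags) follow the description
# in the abstract; the official template's exact schema may differ.
import yaml  # pip install pyyaml

markdown = """\
---
title: 'Metadata for BioHackrXiv Markdown publications'
authors:
  - name: Jane Doe
    affiliation: 1
affiliations:
  - name: Example University
    index: 1
tags:
  - SPARQL
  - FAIR
---
# Introduction
...
"""

# Front matter sits between the first two '---' delimiters.
_, header, _body = markdown.split("---", 2)
record = yaml.safe_load(header)
print(record["title"], record["tags"])
```

A harvester that extracts such records into RDF is what would sit behind the SPARQL endpoint mentioned above.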
ELIXIR Software Management Plan for Life Sciences
Data Management Plans are now considered a key element of Open Science. They describe the data management life cycle for the data to be collected, processed and/or generated within the lifetime of a particular project or activity. A Software Management Plan (SMP) plays the same role, but for software. Beyond its management perspective, the main advantage of an SMP is that it both provides clear context to the software that is being developed and raises awareness. Although a few SMPs are already available, most of them require significant technical knowledge to be used effectively. ELIXIR has developed a low-barrier SMP, specifically tailored for life science researchers and aligned with the FAIR Research Software principles. Starting from the Four Recommendations for Open Source Software, the ELIXIR SMP was iteratively refined by surveying the practices of the community and incorporating the feedback received. Currently available as a survey, future plans for the ELIXIR SMP include a human- and machine-readable version that can be automatically queried and connected to relevant tools and metrics within the ELIXIR Tools ecosystem and beyond.
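To make the idea of a machine-readable SMP concrete, such a record could be serialized as JSON and checked programmatically. The schema below is purely hypothetical, invented for illustration; the real ELIXIR SMP format was still being designed.

```python
# Purely hypothetical sketch of a machine-readable SMP record; the actual
# ELIXIR SMP schema may differ in structure and field names.
import json

smp_json = """
{
  "software": "my-analysis-tool",
  "version_control": true,
  "license": "MIT",
  "repository": "https://example.org/my-analysis-tool",
  "testing": {"unit_tests": true, "continuous_integration": false}
}
"""

smp = json.loads(smp_json)

# A registry or funder could automatically evaluate compliance criteria.
checks = {
    "uses version control": smp["version_control"],
    "has an OSI-approved license": smp["license"] in {"MIT", "Apache-2.0", "GPL-3.0"},
    "runs continuous integration": smp["testing"]["continuous_integration"],
}
for criterion, ok in checks.items():
    print(f"{criterion}: {'yes' if ok else 'no'}")
```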
Measuring outcomes and impact from the BioHackathon Europe
One of the recurring questions when it comes to BioHackathons is how to measure their impact, especially when they are funded and/or supported by the public purse (e.g., research agencies, research infrastructures, grants). In order to do so, we first need to understand the outcomes of a BioHackathon, which can include software, code, publications, and new or strengthened collaborations, along with more intangible effects such as accelerated progress and professional and personal outcomes. In this manuscript, we report on three complementary approaches to assessing the outcomes of three BioHackathon Europe events: survey-based, publication-based and GitHub-based measures. We found that post-event surveys bring very useful insights into what participants feel they achieved during the hackathon, including progressing much faster on their hacking projects, broadening their professional network and improving their understanding of other technical fields and specialties. With regard to published outcomes, manual tracking of publications from specific servers is straightforward and useful for highlighting the scientific legacy of the event, though there is much scope to automate this via text mining. Finally, GitHub-based measures bring insights into software and data best practices (e.g., license usage) and into how hacking activities evolve over time (e.g., activity observed in GitHub repositories before, during and after the event). Altogether, these three approaches were found to provide insightful preliminary evidence of outcomes, thereby supporting the value of financing such large-scale events with public funds.
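As a sketch of what a GitHub-based measure can look like, the snippet below pulls license information and commit counts within an event window from the public GitHub REST API. The repository name and date window are placeholders, not the actual study set or methodology of the manuscript.

```python
# Minimal sketch of a GitHub-based measure: license usage and commit
# activity around an event window, via the public GitHub REST API.
# Unauthenticated requests are rate-limited; the repo list is a placeholder.
import requests

REPOS = ["elixir-europe/biohackathon-projects-2020"]  # example repository
WINDOW = ("2020-11-01T00:00:00Z", "2020-11-30T00:00:00Z")  # example month

for full_name in REPOS:
    repo = requests.get(f"https://api.github.com/repos/{full_name}").json()
    license_id = (repo.get("license") or {}).get("spdx_id", "none")

    commits = requests.get(
        f"https://api.github.com/repos/{full_name}/commits",
        params={"since": WINDOW[0], "until": WINDOW[1], "per_page": 100},
    ).json()

    print(f"{full_name}: license={license_id}, commits in window={len(commits)}")
```

Running the same query for windows before, during and after the event is what reveals the temporal pattern of hacking activity mentioned above.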
Exploiting Bioschemas Markup to Populate IDPcentral
One of the goals of the ELIXIR Intrinsically Disordered Protein (IDP) community is to create a registry called IDPcentral. The registry will aggregate data contained in the community's specialist data sources, such as DisProt, MobiDB, and the Protein Ensemble Database (PED), so that proteins that are known to be intrinsically disordered can be discovered, with summary details of the protein presented and the specialist source consulted for more detailed data. At the ELIXIR BioHackathon-Europe 2020, we aimed to investigate the feasibility of populating IDPcentral by harvesting the Bioschemas markup that has been deployed on the IDP community data sources. The benefit of using Bioschemas markup, which is embedded in the HTML web pages for each protein in the data source, is that a standard harvesting approach can be used for all data sources, rather than needing bespoke wrappers for each data source API. We expect to harvest the markup using the Bioschemas Markup Scraper and Extractor (BMUSE) tool that has been developed specifically for this purpose. The challenge, however, is that the sources contain overlapping information about proteins but use different identifiers for them. After the data has been harvested, it will need to be processed so that information about a particular protein, which will come from multiple sources, is consolidated into a single concept for the protein, with links back to where each piece of data originated. As well as populating the IDPcentral registry, we plan to consolidate the markup into a knowledge graph that can be queried to gain further insight into the IDPs.
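The harvesting principle can be illustrated with a few lines of Python: Bioschemas markup is embedded as JSON-LD in script elements of each protein page. This is only a sketch of the idea; BMUSE is the tool that does this robustly (including for pages that require JavaScript rendering), and the URL below is a placeholder rather than a real DisProt, MobiDB or PED page.

```python
# Illustrative sketch: extract Bioschemas JSON-LD markup from an HTML page.
# The URL is a placeholder; BMUSE performs this harvesting in practice.
import json
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

url = "https://example.org/protein/P12345"  # placeholder protein page
html = requests.get(url).text

soup = BeautifulSoup(html, "html.parser")
for script in soup.find_all("script", type="application/ld+json"):
    markup = json.loads(script.string)
    # A Bioschemas Protein profile carries @type and identifier properties
    # that can later be used to consolidate records across sources.
    print(markup.get("@type"), markup.get("identifier"))
```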
Knowledge graphs and Wikidata subsetting
Knowledge graphs have successfully been adopted by academia, government and industry to represent large-scale knowledge bases. Open and collaborative knowledge graphs such as Wikidata capture knowledge from different domains and harmonize it under a common format, making it easier for researchers to access the data while also supporting Open Science. Wikidata keeps getting bigger and better, which subsumes integration use cases. Having a large amount of data in a scopeless Wikidata offers some advantages, e.g., a unique access point and a common format, but also poses some challenges, e.g., performance. Regular Wikidata users are familiar with frequent timeouts of submitted queries. Due to its popularity, limits have been imposed to allow fair access for many. However, this suppresses many interesting and complex queries that require more computational power and resources. Replicating Wikidata on one's own infrastructure can be a solution, which also offers a snapshot of the contents of Wikidata at a given point in time. There is no need to replicate Wikidata in full; it is possible to work with subsets targeting, for instance, a particular domain. Creating those subsets has emerged as an alternative that reduces the amount and spectrum of data offered by Wikidata. Less data makes more complex queries possible while keeping compatibility with the whole of Wikidata, as the model is preserved. In this paper we report the tasks done as part of a Wikidata subsetting project during the Virtual BioHackathon Europe 2020 and SWAT4(HC)LS 2021, work that had already started at the NBDC/DBCLS BioHackathon 2019 in Japan, the SWAT4(HC)LS hackathon 2019, and the Virtual COVID-19 BioHackathon 2020. We describe some of the approaches we identified to create subsets, some subsets from the Life Sciences domain, and other use cases we also discussed.
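One simple subsetting approach can be sketched directly against the public Wikidata endpoint: a SPARQL CONSTRUCT query that extracts a small, domain-specific slice of triples, here items carrying a UniProt protein ID (property P352). This is only a toy illustration; the approaches discussed in the paper (e.g., Shape Expressions applied to full dumps) operate at a much larger scale.

```python
# Minimal sketch of Wikidata subsetting via SPARQL CONSTRUCT.
from SPARQLWrapper import SPARQLWrapper, TURTLE  # pip install sparqlwrapper

sparql = SPARQLWrapper(
    "https://query.wikidata.org/sparql",
    agent="subset-sketch/0.1 (example@example.org)",  # Wikidata asks for a UA
)
sparql.setQuery("""
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
CONSTRUCT { ?item wdt:P352 ?uniprot }
WHERE     { ?item wdt:P352 ?uniprot }
LIMIT 100
""")
sparql.setReturnFormat(TURTLE)

subset = sparql.query().convert()  # the extracted subset, as Turtle bytes
print(subset.decode("utf-8")[:500])
```

Because the extracted triples keep the Wikidata vocabulary, the subset remains compatible with queries written for the full graph, which is the key property noted above.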
SB4ER: an ELIXIR Service Bundle for Epidemic Response
Epidemic spread of new pathogens is quite a frequent event that affects not only humans but also animals and plants, specifically livestock and crops. In the last few years, many novel pathogenic viruses have threatened human life. Some were mutations of the traditional influenza viruses, and some were viruses that crossed the animal-human divide. In both cases, when a novel virus or bacterial strain for which there is no pre-existing immunity or vaccine emerges, there is the possibility of an epidemic or even a pandemic event, such as the one we are experiencing today with COVID-19. In this context, we defined an ELIXIR Service Bundle for Epidemic Response: a set of tools and workflows to facilitate and speed up the study of new pathogens, viruses or bacteria. The final goal of the bundle is to provide tools and resources to collect and analyse data on new pathogens (bacteria and viruses) and their relation to hosts (humans, animals, plants).
Connecting molecular sequences to their voucher specimens
When sequencing molecules from an organism it is standard practice to create voucher specimens. This ensures that the results are repeatable and that the identification of the organism can be verified. It also means that the sequence data can be linked to a whole host of other data related to the specimen, including traits, other sequences, environmental data, and geography. It is therefore critical that explicit, preferably machine-readable, links exist between voucher specimens and sequences. However, such links do not exist in the databases of the International Nucleotide Sequence Database Collaboration (INSDC). If it were possible to create permanent bidirectional links between specimens and sequences, it would not only make data more findable but would also open new avenues for research. At the BioHackathon we built a semi-automated workflow to take specimen data from the Meise Herbarium and search for references to those specimens in the European Nucleotide Archive (ENA). We achieved this by matching data elements of the specimen and sequence together and by adding a "human-in-the-loop" process whereby possible matches could be confirmed. Although we found that it was possible to discover and match sequences to their vouchers in our collection, we encountered many problems of data standardization, missing data and errors. These problems make the process unreliable and unsuitable for rediscovering all the possible links that exist. Ultimately, improved standards and training would remove the need for retrospective relinking of specimens with their sequences. Therefore, we make some tentative recommendations for how this could be achieved in the future.
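The core matching step can be sketched as a query against the ENA Portal API for sequence records whose voucher field mentions a herbarium barcode. The endpoint and field names below reflect the ENA Portal API as we understand it, and the barcode is a placeholder rather than a real Meise specimen; candidate hits would still go to a curator for confirmation.

```python
# Sketch of voucher-to-sequence matching against the ENA Portal API.
# Query syntax and field names are our best understanding of the API;
# the barcode is a placeholder, not a real Meise Herbarium specimen.
import requests

ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"
barcode = "BR0000123456789"  # placeholder herbarium barcode

resp = requests.get(ENA_SEARCH, params={
    "result": "sequence",
    "query": f'specimen_voucher="*{barcode}*"',
    "fields": "accession,specimen_voucher,scientific_name",
    "format": "tsv",
})

# Each line is a candidate match; a "human-in-the-loop" confirms or rejects.
for line in resp.text.splitlines()[1:]:
    print(line)
```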
Linking PubDictionaries with UniBioDicts to support Community Curation
One of the many challenges that biocurators face is the continuous evolution of ontologies and controlled vocabularies and their lack of coverage of biological concepts. To help biocurators annotate new information that cannot yet be covered with terms from authoritative resources, we produced an update of PubDictionaries: a resource of publicly editable, simple-structured dictionaries, accessible through a dedicated REST API. PubDictionaries was equipped with both an enhanced API and a new software client that connects it to the Unified Biological Dictionaries (UBDs) uniform data-exchange format. This client enables efficient search and retrieval of ad hoc created terms, and easy integration with tools that further support the curator's specific annotation tasks. A demo that combines the Visual Syntax Method (VSM) interface for general-purpose knowledge formalization with this new PubDictionaries-powered UBD client shows that it is now easy to incorporate user-created PubDictionaries terminologies into biocuration tools.
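A term lookup through such a REST API might look like the sketch below. The endpoint path, parameter names and response shape here are assumptions made for illustration; consult the PubDictionaries documentation at https://pubdictionaries.org for the actual API that the UBD client wraps.

```python
# Sketch of a dictionary term lookup; the endpoint and response shape
# are assumptions, not the verified PubDictionaries API contract.
import requests

resp = requests.get(
    "https://pubdictionaries.org/find_ids.json",  # assumed lookup endpoint
    params={"labels": "p53", "dictionaries": "my-community-dictionary"},
)
resp.raise_for_status()

# Assumed response shape: a mapping from each queried label to matching IDs.
for label, ids in resp.json().items():
    print(label, "->", ids)
```

Wrapping such calls behind the UBDs' uniform format is what lets annotation tools like the VSM interface treat PubDictionaries like any other dictionary source.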
Progress on Data Stewardship Wizard during BioHackathon Europe 2020
We used the Virtual BioHackathon Europe 2020 to work on a number of projects to improve the Data Stewardship Wizard: (a) we took the first steps towards an analysis of what is needed to make all questions and answers machine-actionable; (b) we worked on supporting the Horizon 2020 Data Management Plan template; (c) we added several new integrations, e.g. to ROR and Wikidata; (d) we drafted a plan for supporting multiple languages; and (e) we implemented many improvements to the knowledge model that had been suggested to us in the past. Shortly after the BioHackathon, the adapted knowledge model, the new integrations and the H2020 template were made available to all users of the wizard.
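The kind of lookup behind an integration such as ROR can be sketched with the public ROR API: a free-text organisation query returns candidate records whose persistent IDs could back-fill affiliation answers in a questionnaire. This is an illustration of the integration idea, not the Data Stewardship Wizard's actual implementation.

```python
# Sketch of an organisation lookup against the public ROR API; how the
# Data Stewardship Wizard wires such results into answers is not shown.
import requests

resp = requests.get("https://api.ror.org/organizations",
                    params={"query": "ELIXIR"})
resp.raise_for_status()

# Each item carries a persistent ROR ID plus the organisation's name.
for org in resp.json()["items"][:5]:
    print(org["id"], "-", org["name"])
```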