ASV portal: an interface to DNA-based biodiversity data in the Living Atlas – BMC Bioinformatics

The ASV portal provides options to submit and search for denoised metabarcoding data and associated occurrence records via intermediary ASV and BLAST databases (Fig. 1).

Data providers submit their data using a spreadsheet template based on the Darwin Core (DwC) standard for biodiversity data [14]. Specifically, the template corresponds to a DwC event core with associated contextual (‘extended Measurement or Facts’) and sequence-related (‘DNA derived data’ [15]) extensions. Each event is also associated with occurrences reported in ASV table format, i.e. as read counts given per sample (row) and ASV (column), rather than in the typical DwC occurrence format.
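The ASV table layout described above can be unpivoted into one occurrence record per (event, ASV) pair on import. The sketch below illustrates this transformation in Python; the field names and the `asv_table` structure are illustrative assumptions, not the actual template headers or portal code.

```python
# Illustrative ASV table: rows are sampling events, columns are ASVs,
# cells are read counts (structure and identifiers are made up here).
asv_table = {
    "sample-1": {"asv-a": 120, "asv-b": 0},
    "sample-2": {"asv-a": 15, "asv-b": 342},
}

def to_occurrences(table):
    """Yield one long-format occurrence per (event, ASV) pair with reads."""
    for event_id, counts in table.items():
        for asv_id, reads in counts.items():
            if reads > 0:  # zero counts are absences, not occurrences
                yield {
                    "eventID": event_id,
                    "taxonID": asv_id,
                    "organismQuantity": reads,
                    "organismQuantityType": "DNA sequence reads",
                }

occurrences = list(to_occurrences(asv_table))
```

Each yielded record corresponds to one row in a conventional DwC occurrence file, which is what the wide ASV table replaces in the submission template.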

Submitted data files are curated and imported into the ASV database by portal administrators. A standard taxonomic annotation is then applied to each ASV, using current versions of selected classification algorithms and reference databases. The database schema also allows for successive re-annotations, enabling improved taxonomic accuracy and resolution as reference databases develop. Each DwC occurrence is, however, also assigned a unique taxon ID, based on the MD5 checksum of the underlying ASV sequence. This ensures that identification is consistent between data providers, and unaffected by changes in the mapping of ASVs to different taxon concepts.
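The sequence-derived identifier scheme can be sketched in a few lines: hashing the (case-normalised) ASV sequence with MD5 yields the same taxon ID for any provider submitting the same sequence. The `ASV:` prefix and the normalisation step here are assumptions for illustration, not the portal's documented ID format.

```python
import hashlib

def asv_taxon_id(sequence: str) -> str:
    """Derive a stable taxon ID from the ASV sequence itself.

    Upper-casing first so that identical sequences always hash alike;
    the resulting ID is independent of taxonomic annotation, so later
    re-annotations do not change it.
    """
    digest = hashlib.md5(sequence.upper().encode("ascii")).hexdigest()
    return "ASV:" + digest

# The same sequence yields the same ID regardless of input case or provider:
id_1 = asv_taxon_id("ACGTACGT")
id_2 = asv_taxon_id("acgtacgt")
```

Because the ID is a pure function of the sequence, two providers independently observing the same ASV automatically share a taxon ID, and remapping an ASV to a different taxon concept leaves its occurrences untouched.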

Imported datasets are shared with GBIF and LA via the Integrated Publishing Toolkit (IPT) [16]. The ASV database schema includes linked DwC views that can be accessed and filtered to create a new data resource in the IPT. The portal administrator then invites the data provider to fill in dataset-level metadata in the IPT form, before the dataset is formally published and made available to LA users.


The ASV portal provides two options for finding ASVs and published LA records: BLAST or FILTER search. In the BLAST form, users can paste in FASTA sequences and set the minimum identity and query coverage of returned hits. Sequences are then aligned against a BLAST database that portal administrators rebuild when new data are imported into the ASV database. The FILTER form lets users select ASVs based on sequencing details (e.g. target gene) and taxonomy. Search results are presented in similar, paginated tables in which users can select specific ASV records. Users can download these directly, in Excel or delimited text format, or choose to explore associated occurrence records in the LA platform. An illustrated use case for ASV portal search is given in Fig. 2, and a video tutorial covering both data submission and searching is available on YouTube [17].
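The thresholding applied to BLAST hits can be illustrated with a small filter over tabular alignment results. Field names below follow common BLAST+ tabular output conventions (`pident`, `length`, `qlen`), but this is a sketch of the logic, not the portal's actual implementation, and the coverage formula is a simplification that ignores gap handling.

```python
def passes_thresholds(hit, min_identity=97.0, min_coverage=90.0):
    """Keep a hit only if it meets user-chosen identity and coverage minima.

    Coverage is approximated here as alignment length over query length,
    expressed as a percentage.
    """
    coverage = 100.0 * hit["length"] / hit["qlen"]
    return hit["pident"] >= min_identity and coverage >= min_coverage

# Hypothetical hits against two ASVs for a 253 bp query sequence:
hits = [
    {"asv": "asv-a", "pident": 99.1, "length": 250, "qlen": 253},
    {"asv": "asv-b", "pident": 95.0, "length": 253, "qlen": 253},
]
kept = [h["asv"] for h in hits if passes_thresholds(h)]
```

Here only the first hit survives: the second aligns over the full query but falls below the 97% identity floor.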

Fig. 2

Use case: searching for Acinetobacter sequences and occurrence records in the ASV portal. A user interested in finding denoised sequences and associated occurrence records of a specific taxon is presented with two search options on the start page of the ASV portal: BLAST and filter search (a). Filtering for ASVs derived from the 16S rRNA target gene in the genus Acinetobacter (b) produces a list of six ASVs, available for direct download (c). Alternatively, BLASTing against a known marker sequence from the targeted taxon (d) results in a corresponding list of ASVs (e). The user then opts to show associated occurrence records in the main atlas platform (f), where data can be visualised and analysed together with other species observations as well as environmental and contextual data layers



Future development

The ASV portal is currently an integral part of the Swedish LA instance [18], but given the rate at which sequence-based biodiversity data are being collected around the world, we envision that the LA community at large will benefit from our initiative to integrate this data source. We aim to keep the portal up to date, and we welcome user requests as well as contributions from biodiversity informatics programmers who want to join this open-source project. The application will likely need to be optimised to handle larger data volumes in the future, and planned developments include direct API access to the data through custom R and Python client libraries.