We are pleased to announce the 13th release of WormBase ParaSite, bringing new and updated genomes, data from RNASeq studies, and more useful annotation files for download.
We are releasing a PacBio assembly of the soybean cyst nematode, Heterodera glycines (PRJNA381081), and an alternative assembly of Clonorchis sinensis (PRJNA386618, the Korean strain).
Following Prabh et al. (2018), we present the genomes of many relatives of Pristionchus pacificus: Micoletzkya japonica, Parapristionchus giblindavisi, and Pristionchus arcanus, P. entomophagus, P. exspectatus, P. fissidentatus, P. japonicus, P. maxplancki, and P. mayeri.
We are also updating the annotation for Haemonchus contortus (PRJEB506) – over 3500 genes were revised or updated. The authors are still working on the annotation using Apollo, a collaborative gene curation platform: if you would like to lend them a hand and curate a few (or a few hundred) gene models, do write to us through the helpdesk.
We are steadily extending our capability in offering access to RNASeq data. In this release, apart from displaying read alignments in our genome browsers, we are also providing quantitative information for 46 species with data from over 120 studies.
Alignments and read counts per run were produced by our collaborators, the Gene Expression team at the EBI. We have collated and further processed their results based on our curation. The data available includes:
- read counts and transcripts per million reads (TPMs) for each gene in each ENA run
- TPMs in a condition – median value for all runs representing the same condition
- differential expression analysis between selected pairs of conditions
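To make the per-condition TPM concrete, here is a minimal Python sketch that collapses per-run TPMs for a single gene into one median value per condition. The run names, condition labels, and values are invented for illustration, and the function name is hypothetical; this is an illustration of the idea, not the production pipeline.

```python
from collections import defaultdict
from statistics import median


def median_tpm_per_condition(run_tpms, run_conditions):
    """Collapse per-run TPMs for one gene into a median TPM per condition.

    run_tpms:       {run accession: TPM of the gene in that run}
    run_conditions: {run accession: curated condition label}
    """
    by_condition = defaultdict(list)
    for run, tpm in run_tpms.items():
        by_condition[run_conditions[run]].append(tpm)
    return {cond: median(values) for cond, values in by_condition.items()}


# Hypothetical runs and values, for illustration only.
run_tpms = {"RUN_A": 10.0, "RUN_B": 14.0, "RUN_C": 12.0,
            "RUN_D": 3.0, "RUN_E": 5.0}
run_conditions = {"RUN_A": "adult", "RUN_B": "adult", "RUN_C": "adult",
                  "RUN_D": "egg", "RUN_E": "egg"}

result = median_tpm_per_condition(run_tpms, run_conditions)
print(result)
```

The median is less sensitive than the mean to a single outlying run, which matters when a condition is represented by only a handful of runs.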
RNASeq studies page
The data is presented on a new, separate page. For species with RNASeq data available, you can reach it through an icon on the species page.
Menus on the gene page
A subset of this data is also presented per gene, available as an option in the left-hand menu of the gene page.
We have divided studies into categories – those fitting into “Life cycle”, “Cell types”, and “Organism parts” are presented as TPMs per condition, and those we categorised as “Response to treatment” are displayed in a table of differential expression results.
Studies without a sufficiently focused experimental design or enough replicates are presented as a statistical summary in the “Other” section.
Changes to the comparative genomics pipeline
Our data on gene trees, orthologs, and paralogs is calculated by a pipeline developed by Ensembl’s Compara team. The newest version changes its approach to clustering by involving an HMM library, created to recognise protein families that are common across the tree of life. The proteins that did not get classified as part of a family are then clustered via all-to-all sequence matching, and the rest of the process – multiple alignment, tree building, estimating branch lengths – proceeds as before. A full description is available in the Ensembl documentation.
We feel it is important to point out that this change of methodology involves some inherent biases due to the uneven coverage by the HMMs. Given the training set (of the 708 species used to construct it, only 11 are nematodes, and just one is a flatworm), we trust the HMMs’ ability to recognise genes common to all life, to all animals, and to all nematodes, as well as genes that are represented in the genus Caenorhabditis.
This would lead us to expect such genes to be parts of larger families. Conversely, we might expect genes that are not well represented by an HMM library to be in families that are slightly smaller – for example, a gene family specific to tapeworms or root-knot nematodes would shrink if a few ambiguous genes were instead classified outside the family.
Here are some summary statistics:
| Species | Genes with … | Previous | Current | % change | % New | % Lost |
|---|---|---|---|---|---|---|
| C. elegans | orthologs in C. briggsae | 15581 | 15052 | -3.4 | 4.2 | 7.6 |
| C. elegans | orthologs in S. mansoni | 4231 | 5028 | +18.8 | 38.8 | 20 |
| S. mansoni | orthologs in S. japonicum | 7533 | 6753 | -10.4 | 5.5 | 15.9 |
| S. mansoni | orthologs in C. elegans | 3975 | 5006 | +25.9 | 39.6 | 13.7 |
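If you would like to compute similar statistics from the previous and current ortholog lists, the sketch below shows one way to do it in Python. It assumes that all three percentages are taken relative to the previous count; the table does not state this, but its rows are consistent with it, since in each row % change equals % New minus % Lost. The function name and the toy gene IDs are hypothetical.

```python
def ortholog_change(previous, current):
    """Compare two collections of gene IDs that have an ortholog in a target species.

    Returns (pct_change, pct_new, pct_lost), all relative to the size of
    the previous set (an assumption about how such a table is computed).
    """
    prev, curr = set(previous), set(current)
    n_prev = len(prev)
    pct_new = 100 * len(curr - prev) / n_prev   # genes that gained an ortholog
    pct_lost = 100 * len(prev - curr) / n_prev  # genes that lost their ortholog
    pct_change = 100 * (len(curr) - n_prev) / n_prev
    return pct_change, pct_new, pct_lost


# Toy example with made-up gene IDs: 10 genes before, 11 after.
previous = {f"g{i}" for i in range(1, 11)}                      # g1..g10
current = {f"g{i}" for i in range(1, 9)} | {"g11", "g12", "g13"}

pct_change, pct_new, pct_lost = ortholog_change(previous, current)
print(pct_change, pct_new, pct_lost)

# Net change for the first table row, from the counts alone.
first_row_change = round(100 * (15052 - 15581) / 15581, 1)
print(first_row_change)
```

Reporting gains and losses separately is useful because a small net change can hide a large turnover, as in the C. elegans to S. mansoni row.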
Overall, we are cautiously optimistic about the new method. We have made both the previous and the current results (lists of paralogs and orthologs) available through our FTP site.
More useful GFF3 files
WormBase has a long tradition of delivering its data in maximally convenient file formats. We have undertaken some effort this release to further improve this aspect of our service.
Working with genomic data on the level of FASTAs and GFFs can be very powerful, especially when running clade-scale analyses or retrieving bulk lists in ways that would be hard to do with BioMart or the REST API.
Genes and their structural models are currently annotated in the GFFs with:
- Systematic IDs
- Previous systematic IDs
- InterPro domains
The files also contain the coordinates of repetitive and low complexity elements, annotated with repeat type and class.
You can get the files from individual species pages. Bulk download is also possible, using a tool like wget:
wget -A '*.annotations.gff3.gz' --recursive --no-directories ftp://ftp.ebi.ac.uk/pub/databases/wormbase/parasite/releases/WBPS13/species
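Once the files are downloaded, a few lines of Python are enough to start exploring them. The sketch below relies only on the general GFF3 layout (nine tab-separated columns, with `key=value` pairs separated by `;` in the last one). The sample lines, IDs, and function names are made up for illustration, and the attribute keys actually used in our files may differ.

```python
import gzip
from collections import Counter
from urllib.parse import unquote


def parse_attributes(column9):
    """Parse the GFF3 attributes column into {key: [values]}."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            # Multiple values are comma-separated; split before unescaping.
            attrs[unquote(key)] = [unquote(v) for v in value.split(",")]
    return attrs


def read_features(lines):
    """Yield one dict per feature line, skipping comments and directives."""
    for line in lines:
        if line.startswith("##FASTA"):
            break  # anything after this directive is sequence, not features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        yield {
            "seqid": cols[0],
            "type": cols[2],
            "start": int(cols[3]),
            "end": int(cols[4]),
            "strand": cols[6],
            "attributes": parse_attributes(cols[8]),
        }


def count_feature_types(path):
    """Count features by type in a gzipped GFF3 file."""
    with gzip.open(path, "rt") as handle:
        return Counter(feature["type"] for feature in read_features(handle))


# Made-up sample lines, so the sketch runs without downloading anything.
sample = [
    "##gff-version 3\n",
    "scaffold1\texample\tgene\t100\t900\t.\t+\t.\tID=gene:g1;Name=g1\n",
    "scaffold1\texample\tmRNA\t100\t900\t.\t+\t.\tID=transcript:g1.t1;Parent=gene:g1\n",
]
features = list(read_features(sample))
print(features[0]["attributes"])
```

For a downloaded file, `count_feature_types("some_species.annotations.gff3.gz")` (the file name here is a placeholder) would give a quick overview of what the file contains.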
Corrections to names and sequences
A number of genomes that appeared in early WormBase ParaSite releases have also later been made available through archive sites. We made sure we link to the right entries, and had a look at the amount of divergence between us and NCBI / ENA. This has initiated a number of corrections and changes:
- Diphyllobothrium latum renamed to Dibothriocephalus latus, following the NCBI taxonomy entry
- Corrected BioProjects, following NCBI/ENA: O. ochengi – PRJEB1465, A. viteae – PRJEB1697, L. loa – PRJNA37757, P. exspectatus – PRJEB24288
- Updated the assembly for D. medinensis: removed three scaffolds that ENA has removed in the submission process
- Updated the assembly for S. papillosus: removed N’s from the end of one scaffold, and removed one repeat feature that was there
- Updated the assembly for S. stercoralis: removed four N’s from the beginning of a scaffold, and correspondingly shifted all the feature coordinates by four ( :-# )
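For the curious, the coordinate shift in the last item amounts to something like the sketch below: when bases are trimmed from the start of a scaffold, every feature on that scaffold moves down by the same amount. This is an illustration only, with a hypothetical function name and made-up feature lines, not the script we used.

```python
def shift_feature(line, seqid, trimmed):
    """Shift a GFF3 feature line after `trimmed` bases were removed
    from the start of scaffold `seqid`. Other lines are returned unchanged."""
    text = line.rstrip("\n")
    cols = text.split("\t")
    if text.startswith("#") or len(cols) != 9 or cols[0] != seqid:
        return text
    start, end = int(cols[3]) - trimmed, int(cols[4]) - trimmed
    if start < 1:
        # The feature overlapped the removed bases and needs manual attention.
        raise ValueError(f"feature would start before position 1: {text}")
    cols[3], cols[4] = str(start), str(end)
    return "\t".join(cols)


# Made-up feature lines, for illustration only.
original = "scaffold1\texample\tgene\t105\t204\t.\t+\t.\tID=gene:g1"
shifted = shift_feature(original, "scaffold1", 4)
untouched = shift_feature(original, "scaffold2", 4)
print(shifted)
```

Features on other scaffolds are left alone, and a feature that would end up starting before position 1 raises an error rather than being silently clipped.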
We prefer to rely on NCBI or ENA to provide reference copies of the sequence, and stay in sync with these sites. This is balanced against the principle that data that is worth accessing is almost always more valuable when it’s available sooner. Wherever we have diverged we are making an effort to notify you of this in the genome description on the species page.