We are pleased to announce the 13th release of WormBase ParaSite, bringing new and updated genomes, data from RNASeq studies, and more useful annotation files for download.
We’re pleased to announce that a new Wellcome Advanced Course in helminth bioinformatics is now open for applications. The course is aimed at Africa-based researchers at various levels. It will be a hands-on, practical introduction to bioinformatics for helminth researchers, covering:
- The use of public databases (including WormBase ParaSite) to explore gene and protein function
- Genome assembly
- Variant calling
- Differential gene expression
- Unix/Linux command line and some basic R
Dates and Deadlines
The course will be held at the West African Centre for Cell Biology of Infectious Pathogens (WACCBIP), University of Accra, Ghana, from 8th to 13th September 2019.
The course is free to attend for non-commercial applicants, and a number of bursaries are available to cover travel, accommodation and sustenance.
The deadline for applications is 9th May.
More details are available here: https://coursesandconferences.wellcomegenomecampus.org/our-events/helminth-bioinformatics-ghana-2019/
We are pleased to announce the 12th release of WormBase ParaSite, bringing new and updated genomes, and better handling of old identifiers and history.
New and updated genomes
There are also new clade IV nematode genomes: root-knot nematodes Meloidogyne graminicola (PRJNA411966) and Meloidogyne arenaria (PRJNA438575) (alternative assembly), and a bacteria-feeding Acrobeloides nanus (PRJEB26554).
The rest of the updates are:
- annotation update to Ancylostoma ceylanicum (PRJNA72583)
- updating WormBase core species B. malayi and C. briggsae to WS267 (with C. elegans and T. muris still corresponding to version WS265 due to our omission – we apologise!)
- about 100 updated gene models for Schistosoma mansoni (PRJEA36577)
- fixed typo in the name of Oscheius tipulae (PRJEB15512) (Oschieus -> Oscheius)
- renamed genes for Plectus sambesii (PRJNA390260), from names like “g2” to names like “PSAMB.scaffold2size251193.g730”
Data and tools
We have re-run our comparative genomics pipeline, constructing gene trees and finding orthologs and homologs. We have also re-run the newest InterProScan (5.30-69.0) and our cross-references pipeline for all our genomes.
We re-imported all public RNASeq data from our collaborators, and made a round of minor improvements to the displays. We now use information from PubMed to let you find data sets corresponding to a publication of interest.
Archived gene IDs
New sequencing technologies let labs construct better genome assemblies, bringing access to chromosome level assembly data even to relatively small research communities. We are excited to see this trend. Each genome update brings new evidence and potentially unlocks research into previously forbiddingly difficult biological questions.
At the same time, insights from work published using previous assemblies should stay accessible to the community, so there is a need to connect different assemblies with each other. As of this release, WormBase ParaSite will keep track of previous identifier versions at gene level and display annotation history. Authors of a genome update do not always provide a mapping to the previous version, so we developed a pipeline to match up identifiers between genome versions.
Overview of new functionality
For an overview of how this now works, consider Smp_340760, a Schistosoma mansoni gene. The gene model has been revised twice. It used to be called Smp_044010, but in the “Schisto_7.1” version of the annotation, which we published in WBPS11, the authors changed the gene structure enough that they assigned it a new identifier. In “Schisto_7.2”, which we publish now, the gene model was corrected slightly.
Searching by Smp_044010 now leads to a page explaining that the identifier was deprecated and redirecting to Smp_340760. Over there, the history is represented by a diagram:
The site also displays previous protein sequences of transcripts, to help you carry forward any conclusions based on the previous gene model – the less the sequence has changed, the more similar results will be for e.g. BLAST matches.
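As a rough picture of the redirect described above, here is a minimal Python sketch of following a deprecated identifier to its current one. The `SUCCESSOR` table and the `resolve` function are hypothetical names made up for illustration. They are not how the site is implemented, and only the Smp_044010 to Smp_340760 link comes from the example above.

```python
from typing import Dict, List

# Hypothetical lookup table: deprecated gene ID -> the ID that replaced it.
SUCCESSOR: Dict[str, str] = {"Smp_044010": "Smp_340760"}


def resolve(gene_id: str, successor: Dict[str, str]) -> List[str]:
    """Return the chain of IDs from the queried ID to the current one."""
    chain = [gene_id]
    seen = {gene_id}
    # Keep following replacements until the ID has no recorded successor.
    while chain[-1] in successor:
        next_id = successor[chain[-1]]
        if next_id in seen:
            raise ValueError(f"cycle in identifier history at {next_id}")
        chain.append(next_id)
        seen.add(next_id)
    return chain


print(resolve("Smp_044010", SUCCESSOR))
print(resolve("Smp_340760", SUCCESSOR))
```

The last element of the returned chain is the identifier to redirect to. A current identifier resolves to a chain containing only itself.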
ID mapping pipeline
We used authors’ mappings between annotation versions for updates of Schistosoma mansoni and Hymenolepis microstoma. Everywhere else we used an automated mapping pipeline, adapted from Ensembl gene build.
The pipeline runs the sequence matching tool exonerate, scoring matches of exons between the two assemblies and propagating the scores onto the transcript and gene level. The scores are then adjusted based on synteny: if a gene A is near gene B in the previous genome, A is mapped to A’ in the new genome, and there is a gene B’ near A’, the match of B to B’ is strengthened. Finally, best matches are iteratively taken out of the scoring, producing a list of pairs.
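To make the last two steps more concrete, here is a minimal Python sketch of a synteny adjustment followed by greedy best-match pairing. It is an illustration under stated assumptions, not the Ensembl-derived code we run. The function name `map_genes`, the `synteny_bonus` and `min_score` parameters and their default values are all hypothetical, the gene-level scores are assumed to be already computed, and "A is mapped to A’" is simplified to "A has a scored candidate A’".

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (old gene ID, new gene ID)


def map_genes(
    scores: Dict[Pair, float],
    old_neighbours: Dict[str, List[str]],
    new_neighbours: Dict[str, List[str]],
    synteny_bonus: float = 0.1,  # hypothetical value
    min_score: float = 0.5,      # hypothetical value
) -> List[Pair]:
    """Pair old and new gene IDs one-to-one, best adjusted score first."""
    # Synteny adjustment: strengthen (b, b2) for every neighbour a of b
    # that has a scored candidate a2 lying next to b2 in the new genome.
    adjusted = dict(scores)
    for (b, b2) in scores:
        for a in old_neighbours.get(b, []):
            for a2 in new_neighbours.get(b2, []):
                if (a, a2) in scores:
                    adjusted[(b, b2)] += synteny_bonus

    # Greedy pairing: take the best remaining match, then exclude both
    # of its genes from further consideration.
    pairs: List[Pair] = []
    used_old, used_new = set(), set()
    ranked = sorted(adjusted.items(), key=lambda kv: (-kv[1], kv[0]))
    for (old, new), score in ranked:
        if score < min_score:
            break  # everything after this point scores lower still
        if old in used_old or new in used_new:
            continue
        pairs.append((old, new))
        used_old.add(old)
        used_new.add(new)
    return pairs


scores = {("A", "A1"): 0.9, ("B", "B1"): 0.6, ("B", "B2"): 0.6, ("C", "C1"): 0.3}
old_neighbours = {"A": ["B"], "B": ["A"]}
new_neighbours = {"A1": ["B1"], "B1": ["A1"]}
print(map_genes(scores, old_neighbours, new_neighbours))
```

In this toy input B matches B1 and B2 equally well on sequence alone, and the synteny bonus from the A to A1 match tips the choice towards B1. The pair C to C1 falls below the threshold and is left unmapped, which is how a conservative threshold leaves genes without a past identifier.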
Results and benchmarking
We find the pipeline to be quite conservative even after we relaxed a few parameters around minimal match scores and similar values. Typically only between a third and two thirds of the genes in the updated genome have a related past identifier:
| Genome | Previous genes total | Previous version | Mapped | New genes total | Fraction mapped |
|---|---|---|---|---|---|
| Ancylostoma ceylanicum PRJNA72583 | 11783 | WBPS11 | 7564 | 15892 | 0.476 |
| Ascaris suum PRJNA62057 | 17974 | WBPS9 | 9468 | 15260 | 0.620 |
| Fasciola hepatica PRJEB25283 | 16806 | WBPS10 | 7564 | 22676 | 0.334 |
| Haemonchus contortus PRJEB506 | 19430 | WBPS10 | 11439 | 21869 | 0.523 |
| Meloidogyne incognita PRJEB8714 | 45351 | WBPS10 | 11977 | 19212 | 0.623 |
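The last column of the table is the number of mapped genes divided by the total number of genes in the new annotation, not in the previous one. A short Python check using the figures from the table (the function name `fraction_mapped` is just a label for this example):

```python
# (genome, mapped genes, total genes in the new annotation), from the table.
rows = [
    ("Ancylostoma ceylanicum PRJNA72583", 7564, 15892),
    ("Ascaris suum PRJNA62057", 9468, 15260),
    ("Fasciola hepatica PRJEB25283", 7564, 22676),
    ("Haemonchus contortus PRJEB506", 11439, 21869),
    ("Meloidogyne incognita PRJEB8714", 11977, 19212),
]


def fraction_mapped(mapped: int, new_total: int) -> float:
    """Share of genes in the new annotation that have a past identifier."""
    return round(mapped / new_total, 3)


for genome, mapped, new_total in rows:
    print(f"{genome}: {fraction_mapped(mapped, new_total):.3f}")
```

This is why Meloidogyne incognita shows a fraction of 0.623 even though fewer than a third of its 45351 previous genes were carried forward: the denominator is the 19212 genes of the new annotation.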
We also ran the automated pipeline on the S. mansoni WBPS10->WBPS11 update and compared the results to a manual mapping obtained by annotators tracking individual identifiers. Our pipeline carried forward 5165 genes that the authors considered to have no or only minor changes, and 1347 genes with larger changes, onto an annotation with 10172 genes. It missed 2584 genes that are present in the manual mapping but absent from the automatic one. It disagreed with the manual mapping in only 199 cases: some were genuinely wrong calls, and some were on par with the manual mapping, e.g. a mapping to a paralog.
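A comparison of this kind can be sketched in a few lines of Python, assuming each mapping is available as a dictionary from old identifier to new identifier. The function name `compare_mappings` and the toy identifiers are hypothetical, and the sketch is not the script we used for the numbers above.

```python
from typing import Dict


def compare_mappings(
    manual: Dict[str, str], automatic: Dict[str, str]
) -> Dict[str, int]:
    """Count agreements, misses and disagreements against a manual mapping."""
    agree = missed = disagree = 0
    for old_id, new_id in manual.items():
        if old_id not in automatic:
            missed += 1    # pipeline produced no mapping for this gene
        elif automatic[old_id] == new_id:
            agree += 1     # same target as the annotators chose
        else:
            disagree += 1  # mapped, but to a different gene
    return {"agree": agree, "missed": missed, "disagree": disagree}


# Toy identifiers, invented for this example.
manual = {"old1": "new1", "old2": "new2", "old3": "new3", "old4": "new4"}
automatic = {"old1": "new1", "old2": "new2", "old4": "new9"}
print(compare_mappings(manual, automatic))
```

A disagreement counted this way is not necessarily an error: as noted above, a mapping to a paralog can be as defensible as the manual choice, so the disagreeing cases still need to be inspected individually.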
Update: The task has not been finished yet and may take a couple more hours. We expect to complete it by 10pm UK time. Thank you for your cooperation.
Please note that we are going to perform server maintenance for the website on Tuesday 4th Dec 2018 from 2pm to 5pm (UK time). During this period, you will not be able to sign in or use tools on the website, including BLAST and VEP. We are sorry for the inconvenience this may cause.
We are pleased to announce the 11th release of WormBase ParaSite. We have added data, new functionality, and improved some of our per-release processes. Quite a few genomes are new or have been significantly updated.
Comments in ParaSite
By popular request we have added a comment-like space to our gene pages. You can mention your own results, point out an inconsistency, or make an observation about displayed data. We hope the comments will be of scientific content, even when taking a lighter form than communication through peer-reviewed journals.
New species and updates
We are updating three flatworm genomes:
The new assembly of S. mansoni is very complete and accurate, and the gene models were manually annotated over the course of several months.
This is the list of our new or updated nematode genomes:
We are also publishing a fix from the authors to their annotation of Ascaris suum and Parascaris univalens. The gene models submitted to us in the previous release suffered from a systematic error, which resulted in much shorter proteins. We regret the error.
Additionally, the release includes genomes from the WS265 release of WormBase, including the newest WormBase core species, Trichuris muris.
In total 17 genomes were added or changed, bringing the total to 148 genomes across 124 species.
We ran all our usual tertiary annotation pipelines that identify repeats, low complexity regions, non-coding RNAs, protein domains, predicted GO terms, and more, as well as our comparative genomics pipeline, where we expect an improvement especially around recently updated branches.
We have revisited our cross-references pipeline. This pipeline infers links between our genes and entities from other resources, which lets us show what is currently known about each gene and provide rich descriptions. We have updated these cross-references for all our genomes, and added new references to UniParc and RNAcentral. You can either find these references on the gene pages, or use BioMart to retrieve them in bulk.
We have configured our JBrowse track displays to include tracks with aligned reads produced by the RNASeq-er project. RNASeq-er processes all RNASeq datasets published in ENA, so there are many tracks to choose from: currently 11735, from 492 studies across 74 species.
Use case 1: discover public datasets
Browsing our displays could be an alternative to searching a primary source like the European Nucleotide Archive, benefiting from an additional filter: they include only the runs that RNASeq-er successfully aligned and passed through QC.
For more than half the species, the only RNASeq data available is what was produced while preparing the genome.
For other species, there have been additional studies: for example, see the JBrowse display of the tapeworm Echinococcus multilocularis. Apart from a few studies from the same lab in the United Kingdom where the genome was sequenced, the display contains runs sequenced in China and Japan as part of BioProjects PRJNA254535 and PRJDB3524.
The WormBase ParaSite species with the most RNASeq data is, unsurprisingly, C. elegans, with 6446 runs across 245 studies.
Use case 2: compare expression across tracks
Consider the gene Smp_169190 in Schistosoma mansoni. Lu et al (2018, preprint) compared expression in developmental stages, and found Smp_169190 to be differentially expressed in the cercarial life stage.
This link here takes you to two of the tracks used in Lu et al’s meta-analysis. These are ERR022872, showing RNA sequenced from cercaria, and ERR506086, with RNA sequenced from an adult worm. The two tracks differ dramatically in expression of the gene Smp_169190.
You can also see the expression in other life stages. A search for “cercaria” shows quite a few tracks that will probably be similar to ERR022872. A similar search for “miracidia” yields the track SRR922067 and an interesting result: miracidia don’t express Smp_169190, but there is high expression for two of the nearby TAL genes.