Announcing release 14

We are pleased to announce the 14th release of WormBase ParaSite, bringing a new S. mediterranea assembly, and 8 other new and updated genomes.


New and updated genomes

Platyhelminths

We are happy to announce these new genomes of flatworms:

There is also an annotation update for Mesocestoides corti (PRJEB510) created with recently sequenced RNASeq data. It supports 5076 new genes and 8367 revised structures of the previous AUGUSTUS-only annotation.

If flatworm genomics is relevant to your work, be sure to also visit PlanMine, run by the authors of the Schmidtea genome. It contains many different assemblies from a number of free-living flatworms, phylogenetic data, and more.

Please note that we are deprecating the assembly SmedGD_c1.3 for Schmidtea mediterranea (PRJNA12585), corresponding to Robb et al. (2007), and intend to remove it once we are confident that no new research is being based on this assembly. Do let us know if you rely on it, or if there are good reasons for us to keep both Schmidteas around.

Nematodes

There are two new clade IV genomes, potentially relevant to agricultural research: Ditylenchus dipsaci (PRJNA498219), a plant pest, and an updated genome of an entomopathogenic nematode Steinernema carpocapsae (PRJNA202318).

There are also two genomes of free-living clade V nematodes. First is the genome of Halicephalobus mephisto (PRJNA528747), an extremophile found in deep rock fracture water in several gold mines in South Africa. We also have a genome of Mesorhabditis belari (PRJEB30104), an animal exhibiting an interesting pattern of reproduction: the eggs only mature after being activated by the males, which nevertheless do not pass on any genetic material.

Finally, we update WormBase core genomes to the WormBase version WS271.

Comparative genomics: bringing smaller trees

We are removing a total of 7 genomes from our comparative genomics analysis, each of which has a clearly better alternative genome of the same species.

We are hoping that this will make our results more robust overall, and their interpretation easier.

If you still need the old results, ortholog and paralog files from the last release are available through our FTP site. Apart from the previous S. mediterranea (PRJNA12585), we do not plan to remove any other genomes from our portal.

RNASeq studies

Our collaborators, the Functional Genomics group at the European Bioinformatics Institute, continue to process all public RNASeq studies through their platform, RNASeq-er.

More studies

New data produced in recent months, along with more inclusive curation, helped us bring the total number of studies processed on our site to 201, across 48 different species. This includes 30 studies for S. mediterranea, now aligned to the new assembly.

The total number of studies RNASeq-er has data for is 639 as of July 2019. Apart from the 201 we have results for, there are also 301 unannotated C. elegans or P. pacificus studies, which we skipped to reduce the toil involved. The other 137 studies lack metadata or were consciously excluded, either because they did not have enough replicates, used a non-standard protocol like small miRNA-seq or Ribo-Seq, or because the authors asked us to suppress them.

Do contact us if you would like us to include a particular study, or if you have metadata that we are missing. It would be particularly helpful if you could let us know of any additional publications relating to our studies, as they are not always linked to archive records.

UI updates

There have been no major changes to this aspect of our service since we rolled it out in the last release: you can browse through a list of studies by following a link on the species page, and access per-gene results on the gene page through the tab on the left.

We have improved how expression data is organized within the JBrowse track selector, separating studies into categories. If you usually access the gene expression studies through the “Gene expression” tab, have a look at how the studies are organized in the track selector (example link, S. mansoni) – it provides an interesting alternative way of viewing the results.

Analysis updates

Differential expression result files should now be slightly more convenient, and we hope that you will be able to open the files without trouble in any popular spreadsheet software. We are also providing complete results for each contrast – useful if you want to apply your own filtering criteria.
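As a rough illustration of applying your own filtering criteria to the complete results of one contrast, here is a minimal Python sketch. It assumes the file is tab-separated with a header row; the column names (gene_id, log2FoldChange, padj), the thresholds, and the example rows are all hypothetical, so check them against the header of the file you actually download.

```python
import csv
import io

def filter_contrast(tsv_text, fc_col, padj_col, min_abs_log2fc=1.0, max_padj=0.05):
    """Keep rows whose fold change and adjusted p-value pass the thresholds."""
    kept = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        try:
            log2fc = float(row[fc_col])
            padj = float(row[padj_col])
        except (ValueError, TypeError):
            continue  # skip rows with missing values such as "NA"
        if abs(log2fc) >= min_abs_log2fc and padj <= max_padj:
            kept.append(row)
    return kept

# Made-up rows, only to show the shape of the input.
example = (
    "gene_id\tlog2FoldChange\tpadj\n"
    "gene_a\t2.5\t0.001\n"
    "gene_b\t0.3\t0.0001\n"
    "gene_c\t-1.8\tNA\n"
    "gene_d\t-3.0\t0.02\n"
)
kept = filter_contrast(example, fc_col="log2FoldChange", padj_col="padj")
kept_ids = [row["gene_id"] for row in kept]
print(kept_ids)
```

Passing the column names as arguments keeps the sketch usable even if the real headers differ from the ones assumed here.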

There is currently a slight non-uniformity in our count and TPM results, as RNASeq-er are switching between two quantification methods: within each study, aligned reads are quantified with either HTSeq (previous) or FeatureCounts (new).

Announcing WormBase ParaSite 13

We are pleased to announce the 13th release of WormBase ParaSite, bringing new and updated genomes, data from RNASeq studies, and more useful annotation files for download.


Announcing WormBase ParaSite 12

We are pleased to announce the 12th release of WormBase ParaSite, bringing new and updated genomes, and better handling of old identifiers and history.

New and updated genomes

The biggest updates of the release are probably two tapeworm genomes: Hymenolepis microstoma (PRJEB124), an update to a chromosome-level assembly, and a new genome Taenia multiceps (PRJNA307624).

There are also new clade IV nematode genomes: root-knot nematodes Meloidogyne graminicola (PRJNA411966) and Meloidogyne arenaria (PRJNA438575) (alternative assembly), and a bacteria-feeding Acrobeloides nanus (PRJEB26554).

The rest of the updates are:

Data and tools

We have re-run our comparative genomics pipeline, constructing gene trees and finding orthologs and homologs. We have also rerun the newest InterProScan (5.30-69.0) and our cross-references pipeline for all our genomes.

We re-imported all public RNASeq data from our collaborators, and did a round of minor improvements to the displays. We now use information from PubMed to let you find data sets corresponding to a publication of interest.

Archived gene IDs

New sequencing technologies let labs construct better genome assemblies, bringing access to chromosome level assembly data even to relatively small research communities. We are excited to see this trend. Each genome update brings new evidence and potentially unlocks research into previously forbiddingly difficult biological questions.

At the same time, insights gathered in work published using previous assemblies should stay accessible to the community, so there is a need to connect different assemblies with each other. As of this release, WormBase ParaSite will keep track of previous identifier versions at gene level and display annotation history. Authors of a genome update do not always provide a mapping to the previous version, so we developed a pipeline to match up identifiers between genome versions.

Overview of new functionality

For an overview of how this now works, consider Smp_340760, a Schistosoma mansoni gene. The gene model has been revised twice: it used to be called Smp_044010, but in the “Schisto_7.1” version of the annotation that we published in WBPS11 the authors changed the gene structure enough that they decided to assign it a new identifier, and in “Schisto_7.2”, which we publish now, the gene model was corrected slightly.

Searching by Smp_044010 now leads to a page explaining that the identifier was deprecated and redirecting to Smp_340760. Over there, the history is represented by a diagram:

[Diagram: identifier history of Smp_340760, Schistosoma mansoni (PRJEA36577)]

The site also displays previous protein sequences of transcripts, to help you carry forward any conclusions based on the previous gene model – the less the sequence has changed, the more similar results will be for e.g. BLAST matches.

ID mapping pipeline

We used the authors’ mappings between annotation versions for updates of Schistosoma mansoni and Hymenolepis microstoma. Everywhere else we used an automated mapping pipeline, adapted from the Ensembl gene build.

Pipeline description

The pipeline runs the sequence matching tool exonerate, scoring matches of exons between the two assemblies and propagating the scores onto the transcript and gene level. The scores are then adjusted based on synteny – if a gene A is near gene B in the previous genome, A is mapped to A’ in the new genome, and there is a gene B’ near A’, the match of B to B’ is strengthened. Finally, best matches are iteratively taken out of the scoring, producing a list of pairs.
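To make the last two steps more concrete, here is a toy Python sketch of a synteny adjustment followed by greedy pairing. It is not the actual pipeline: the gene-level scores stand in for whatever exonerate-derived scores the real pipeline uses, and the bonus, window and min_score parameters, as well as the gene names, are hypothetical.

```python
def neighbours(order, gene, window=1):
    """Genes within `window` positions of `gene` in an ordered gene list."""
    i = order.index(gene)
    return set(order[max(0, i - window):i]) | set(order[i + 1:i + 1 + window])

def adjust_for_synteny(scores, old_order, new_order, bonus=0.1):
    """Strengthen B -> B' when a neighbour A of B has its best match A' near B'."""
    best = {}
    for (old, new), score in scores.items():
        if old not in best or score > scores[(old, best[old])]:
            best[old] = new
    adjusted = dict(scores)
    for (old, new) in scores:
        near_new = neighbours(new_order, new)
        for a in neighbours(old_order, old):
            if best.get(a) in near_new:
                adjusted[(old, new)] += bonus
    return adjusted

def greedy_pairs(scores, min_score=0.5):
    """Repeatedly take the best remaining match and retire both genes."""
    pairs, used_old, used_new = [], set(), set()
    for (old, new), score in sorted(scores.items(), key=lambda kv: -kv[1]):
        if score < min_score:
            break
        if old in used_old or new in used_new:
            continue
        pairs.append((old, new))
        used_old.add(old)
        used_new.add(new)
    return pairs

# Made-up example: B matches B2 slightly better on sequence alone,
# but B1 sits next to A1, the best match of B's neighbour A.
old_order = ["A", "B"]
new_order = ["A1", "B1", "X", "B2"]
scores = {("A", "A1"): 0.9, ("B", "B1"): 0.6, ("B", "B2"): 0.65}

without_synteny = greedy_pairs(scores)
with_synteny = greedy_pairs(adjust_for_synteny(scores, old_order, new_order))
print(without_synteny)
print(with_synteny)
```

In this example the synteny bonus changes the outcome for B: on sequence score alone it would pair with B2, while with the adjustment it pairs with B1.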

Results and benchmarking

We find the pipeline to be quite conservative, even after we relaxed a few parameters around minimal match scores and similar values. Typically only between a third and two thirds of the genes in the updated genome have a related past identifier:

| Genome | Previous genes total | Previous version | Mapped | New genes total | Fraction mapped |
|---|---|---|---|---|---|
| Ancylostoma ceylanicum PRJNA72583 | 11783 | WBPS11 | 7564 | 15892 | 0.476 |
| Ascaris suum PRJNA62057 | 17974 | WBPS9 | 9468 | 15260 | 0.620 |
| Fasciola hepatica PRJEB25283 | 16806 | WBPS10 | 7564 | 22676 | 0.334 |
| Haemonchus contortus PRJEB506 | 19430 | WBPS10 | 11439 | 21869 | 0.523 |
| Meloidogyne incognita PRJEB8714 | 45351 | WBPS10 | 11977 | 19212 | 0.623 |
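
The "fraction mapped" figures are consistent with dividing the number of mapped genes by the total number of genes in the new annotation. A short Python sketch, using only the figures from the table, reproduces them:

```python
# (genome, mapped genes, total genes in the new annotation), taken from the table.
rows = [
    ("Ancylostoma ceylanicum PRJNA72583", 7564, 15892),
    ("Ascaris suum PRJNA62057", 9468, 15260),
    ("Fasciola hepatica PRJEB25283", 7564, 22676),
    ("Haemonchus contortus PRJEB506", 11439, 21869),
    ("Meloidogyne incognita PRJEB8714", 11977, 19212),
]

def fraction_mapped(mapped, new_total):
    """Share of genes in the new annotation that have a past identifier."""
    return round(mapped / new_total, 3)

fractions = {genome: fraction_mapped(mapped, total) for genome, mapped, total in rows}
for genome, fraction in fractions.items():
    print(f"{genome}\t{fraction:.3f}")
```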

We also ran the automated pipeline on the S. mansoni WBPS10->WBPS11 update, comparing the results to a manual mapping obtained by annotators tracking individual identifiers. Our pipeline carried forward 5165 genes that the authors considered to have no or minor changes, and 1347 genes with larger changes, onto an annotation with 10172 genes. It missed 2584 genes that are present in the manual mapping. It disagreed with the manual mapping in only 199 cases: some were genuinely wrong calls, and some were on par with the manual mapping, e.g. a mapping to a paralog gene.

On identifiers from external databases and past releases

(updated in WBPS13 to reflect archiving functionality and mention gene IDs in GFF dumps)

As sequencing projects progress and yield improved data, we occasionally update genomes on the site to provide the best resources available. An example of this is Schistosoma mansoni, released in WBPS11. The authors of the new S. mansoni annotation preserved identifiers where genes were unchanged, and otherwise assigned new identifiers. They kept a list of changes – updates, splits and merges – between the two versions, available here in full.

Annotation authors do not always try to preserve gene identifiers: in our other recent updates of H. contortus and F. hepatica, for example, the identifiers are all new. In such cases, we run a pipeline that tries to reconcile the two versions, and while the results are not even close in quality to a manually curated list of updates, they are still frequently helpful.

This list of gene updates is then integrated into WormBase ParaSite. Searching for an old gene should bring you to an archived ID page, showing the protein sequence of that gene and a link to a new ID if there is one. The data is also available in the GFFs, in the previous_gene_id attribute in column 9 of the gene feature.
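
If you prefer to work from the GFF dumps, the previous identifiers can be pulled out with a few lines of Python. This is a minimal sketch: it assumes standard GFF3 conventions (tab-separated columns, key=value attributes separated by semicolons, multiple values separated by commas), and the example line uses made-up coordinates, sequence name and source, so the exact layout of a real dump may differ.

```python
from urllib.parse import unquote

def previous_ids_by_gene(gff_lines):
    """Map each gene ID to the values of its previous_gene_id attribute."""
    result = {}
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        attrs = {}
        for item in cols[8].split(";"):
            if "=" in item:
                key, value = item.split("=", 1)
                attrs[key.strip()] = value.strip()
        if "previous_gene_id" in attrs:
            ids = [unquote(v) for v in attrs["previous_gene_id"].split(",")]
            result[attrs.get("ID", "")] = ids
    return result

# Hypothetical GFF3 line: only the two identifiers come from the text above.
example_gff = [
    "##gff-version 3\n",
    "chr_example\tsource_example\tgene\t100\t900\t.\t+\t.\t"
    "ID=Smp_340760;previous_gene_id=Smp_044010\n",
]
previous = previous_ids_by_gene(example_gff)
print(previous)
```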

The list is as complete as can be given the methods used to reach it – either authors tracking gene updates, or our method based on sequence similarity and homology – and sometimes our users find themselves with an old gene mentioned in a publication etc. that isn’t connected to a new identifier. We would like to describe a number of strategies that can be used to reach desired content in such cases, and go through a worked example.

Our search facility is best at suggesting and retrieving our gene identifiers e.g. WBGene00001135 and gene families e.g. hedgehog but will also retrieve a number of different identifiers from e.g. UniProt or GenBank. Do try it first, and let us know if the search is not returning something you would expect it to bring back.

FTP dumps from previous releases

Our Downloads section contains a link to our FTP site, with data for all previous releases of WormBase ParaSite. If all you require is e.g. a protein sequence for an old identifier, you are done – otherwise, you can use the sequence to search further.

BLAST

Given any sequence, you can retrieve a corresponding identifier by BLAST – perfect or near-perfect alignment across a large substring is almost always the right hit. Of course you might also find that your gene of interest has multiple copies, has been split in two in the new annotation, etc.
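
If you run BLAST locally on many sequences, the "near-perfect alignment across a large substring" rule of thumb can be applied automatically. The sketch below assumes BLAST+ tabular output with the default 12 columns (-outfmt 6) and a separately supplied dictionary of query lengths; the thresholds and the identifiers in the example are hypothetical.

```python
def confident_hits(blast_tsv_lines, query_lengths, min_identity=98.0, min_coverage=0.8):
    """Keep hits with high identity that cover most of the query sequence."""
    hits = []
    for line in blast_tsv_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        qseqid, sseqid = fields[0], fields[1]
        pident = float(fields[2])                   # percent identity
        qstart, qend = int(fields[6]), int(fields[7])
        coverage = (qend - qstart + 1) / query_lengths[qseqid]
        if pident >= min_identity and coverage >= min_coverage:
            hits.append((qseqid, sseqid, pident, round(coverage, 2)))
    return hits

# Made-up hits for one 260-residue query.
example_hits = [
    "old_protein\tgene_x\t100.00\t250\t0\t0\t1\t250\t1\t250\t1e-150\t500\n",
    "old_protein\tgene_y\t45.50\t80\t40\t2\t10\t89\t5\t84\t1e-05\t50\n",
]
kept_hits = confident_hits(example_hits, {"old_protein": 260})
print(kept_hits)
```

Hits that pass such a filter are still worth a manual look, for the reasons given above: multiple copies, split genes, and so on.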

Other websites 🙂

Going through the UniProt search frequently gives good results when searching for gene families, and you might want to try the BLAST service at ENA or UniProt as an alternative to our BLAST.
We aim for you to be able to retrieve all the data you need from WormBase ParaSite, but using multiple resources is pragmatic and frequently optimal, so with every release we reconcile our identifiers with a number of external sources. This process lets our gene pages link to corresponding pages in other resources in the External references section. To find a relevant WormBase ParaSite page given e.g. an INSDC or UniProt ID, you can use the previously described search facility.

Exonerate

Exonerate is a very fast tool for sequence alignment. It could come in handy if you need to achieve similar results to our online BLAST for a few hundred anonymous sequences. We hope you will not have to resort to it!

Example

Chalmers et al (2015) identified ten tegumental surface proteins in S. mansoni as important in understanding the host immune response to adult schistosomes. Two of the ten genes – Smp_081920 and Smp_158960 – are absent from the new annotation.

Smp_158960 turns up in search results under a different identifier – it is now Smp_345020. Looking in the External references table reveals that Smp_158960 is the only previous identifier for Smp_345020, and checking view all locations link reveals that Smp_345020 is the only gene annotated with Smp_158960 as a previous identifier, so the mapping is one to one. We can guess that authors modified the gene structure and decided to give the gene a new identifier.

Searching Smp_081920 doesn’t return anything in search, and it took me a bit of time to figure out what happened.
I knew I could get the sequence from the FTP site, but it saved me some effort to discover that the identifier is still hanging on in UniProt, which gave me the protein sequence.
I then used BLASTP, which gave me a full match with Smp_166350 over about 100bp: long enough for me to potentially trust but perhaps omitted by authors reconciling old and new identifiers.
I became certain that I was looking at the same gene after comparing the exon structure of Smp_166350 with the structure of Smp_081920, conveniently still online at our friends’ place, Ensembl Metazoa. The two are very similar, except Smp_166350 has three additional non-coding exons.

I was skeptical of these non-coding exons and wanted to see some expression evidence for them.
I opened the JBrowse display in ParaSite and enabled some tracks across a few studies and developmental stages, and my skepticism grew. I saw high peaks in three non-coding exons and no reads aligned to what is claimed to be the coding region in tracks for cercariae and juveniles. Then, in tracks for adult worms but also in a track for miracidia, I saw the opposite: reads in the coding region, and no expression for that UTR.

My conclusion from this data would be that the gene Smp_081920 is now Smp_166350, but the structure of Smp_166350 is not completely correct: the UTRs are spurious, and they probably make another gene instead. I have submitted this conclusion to the annotation authors – it will be corrected in future versions of the annotation.

Announcing WormBase ParaSite 11

We are pleased to announce the 11th release of WormBase ParaSite. We have added data, new functionality, and improved some of our per-release processes. Quite a few genomes are new or have been significantly updated.

 

Comments in ParaSite

By popular request we have added a comment-like space to our gene pages. You can mention your own results, point out an inconsistency, or make an observation about displayed data. We hope the comments will be of scientific content, even when taking a lighter form than communication through peer-reviewed journals.

New species and updates

We are updating three flatworm genomes:

The new assembly of S. mansoni is very complete and accurate, and the gene models were manually annotated over the course of several months.

This is the list of our new or updated nematode genomes:

Parasites:

Free-living:

We are also publishing a fix from the authors to their annotation of Ascaris suum and Parascaris univalens. The gene models submitted to us in the previous release suffered from a systematic error, which resulted in much shorter proteins. We regret the error.

Additionally, the release includes genomes from the WS265 release of WormBase, including the newest WormBase core species, Trichuris muris.

In total 17 genomes were added or changed, bringing the total to 148 genomes across 124 species.

Analyses

We ran all our usual tertiary annotation pipelines that identify repeats, low complexity regions, non-coding RNAs, protein domains, predicted GO terms, and more, as well as our comparative genomics pipeline, where we expect an improvement especially around recently updated branches.

We have revisited our cross-references pipeline. This pipeline lets us support discovery of what is currently known about each gene, and provide rich descriptions, through inferring links between our genes and entities from other resources. We have updated these cross-references for all our genomes, and added new references: to UniParc and RNAcentral. You can either find these references on the gene pages, or use BioMart to retrieve them in bulk.

RNASeq data

We have configured our JBrowse track displays to include tracks with aligned reads produced by the RNASeq-er project. RNASeq-er processes all RNASeq datasets published in ENA, so there are many tracks to choose from: currently 11735, from 492 studies across 74 species.

Use case 1 : discover public datasets

Browsing our displays could be an alternative to searching a primary source like the European Nucleotide Archive, benefiting from an additional filter: they include only the runs that RNASeq-er successfully aligned and passed through QC.

For more than half the species, the only RNASeq data available is what was produced while preparing the genome.

For other species, there have been additional studies, for example see the JBrowse display of tapeworm Echinococcus multilocularis. Apart from a few studies from the same lab in the United Kingdom where the genome was sequenced, the display contains runs sequenced in China and Japan as part of BioProjects PRJNA254535 and PRJDB3524.

The WormBase ParaSite species with most RNASeq data present is unsurprisingly C. elegans, with 6446 runs across 245 studies.

Use case 2 : compare expression across tracks

Consider the gene Smp_169190 in Schistosoma mansoni. Lu et al (2018, preprint) compared expression in developmental stages, and found Smp_169190 to be differentially expressed in the cercarial life stage.

This link here takes you to two of the tracks used in Lu et al’s meta-analysis. These are ERR022872, showing RNA sequenced from cercaria, and a track ERR506086, with RNA sequenced from an adult worm. The two tracks differ dramatically in expression of the gene Smp_169190.

You can also see the expression in other life stages. A search for “cercaria” shows quite a few tracks, which will probably be similar to ERR022872. A similar search for “miracidia” yields a track SRR922067 and an interesting result: miracidia don’t express Smp_169190, but there is high expression for two of the nearby TAL genes.

Genomes we don’t have in WormBase ParaSite

Work on version 11 of WormBase ParaSite is ongoing! We now have a list of assemblies of new species, and improved genomes of existing species, that we plan to publish. As well as including data submitted to us directly, we also surveyed the archives and have made an effort to include all assemblies of taxa Nematoda and Platyhelminthes available in NCBI and ENA that have been annotated with gene structures.

We seem to be doing well in our attempts to reflect the sequencing efforts of relatives of Caenorhabditis elegans – the order Rhabditida makes up 43 of our 100 Nematoda genomes. This good coverage seems to extend to other nematodes relevant to answering basic biology questions, to nematodes causing disease in humans and livestock, and to parasites of plants.

The situation is very different for our other phylum of interest, Platyhelminthes, which is more diverse, and has historically been harder to study due to complex life cycles of these animals. There isn’t a clear model organism like C. elegans for Nematoda for parasitologists to use to study Platyhelminthes, but many species, like the flukes F. hepatica and S. mansoni, are subject to active research.

In particular, phylum Platyhelminthes has a class Turbellaria of free-living marine worms, full of ancient, evolutionarily unique and remarkable organisms, like this Pseudobiceros bedfordi: a large colorful flatworm that feeds on ascidians and small crustaceans.

 

Some species of this class are studied due to their unusual properties. We are tracking research in the area, and currently have two genomes of Turbellaria – Macrostomum lignano and Schmidtea mediterranea.

M. lignano is a tiny worm living in the shore sands of the Adriatic Sea, and is used as a model organism in a number of evolutionary studies. We’ll be publishing an update to this genome, sequenced in 2017 by University Medical Center Groningen.

S. mediterranea is a freshwater species, important for research in stem cells and regeneration. ParaSite currently hosts the original published version of the Schmidtea genome, from 2014. Another group has since published a new and improved assembly. We are hopeful that the authors will annotate this version of the genome with gene structures soon, so that we can include it in ParaSite.

Do reach out to us if you have a potential submission, know of a published genome that could be included in ParaSite, or to tell us about ongoing research about Nematoda or Platyhelminthes that should be on our radar.