We are pleased to announce the 13th release of WormBase ParaSite, bringing new and updated genomes, data from RNASeq studies, and more useful annotation files for download.
We’re pleased to announce that a new Wellcome Advanced Course in helminth bioinformatics is now open for applications. The course is aimed at Africa-based researchers at various levels. It will be a hands-on, practical introduction to bioinformatics for helminth researchers, covering:
- The use of public databases (including WormBase ParaSite) to explore gene and protein function
- Genome assembly
- Variant calling
- Differential gene expression
- Unix/Linux command-line and some basic R
Dates and Deadlines
The course will be held at the West African Centre for Cell Biology of Infectious Pathogens (WACCBIP), University of Ghana, Accra, Ghana, from 8th to 13th September 2019.
The course is free to attend for non-commercial applicants, and a number of bursaries are available to cover travel, accommodation and sustenance.
The deadline for applications is 9th May.
More details are available here: https://coursesandconferences.wellcomegenomecampus.org/our-events/helminth-bioinformatics-ghana-2019/
We are pleased to announce the 12th release of WormBase ParaSite, bringing new and updated genomes, and better handling of old identifiers and history.
New and updated genomes
There are also new clade IV nematode genomes: root-knot nematodes Meloidogyne graminicola (PRJNA411966) and Meloidogyne arenaria (PRJNA438575) (alternative assembly), and a bacteria-feeding Acrobeloides nanus (PRJEB26554).
The rest of the updates are:
- annotation update to Ancylostoma ceylanicum (PRJNA72583)
- updating WormBase core species B. malayi and C. briggsae to WS267 (with C. elegans and T. muris still corresponding to version WS265 due to our omission – we apologise!)
- about 100 updated gene models for Schistosoma mansoni (PRJEA36577)
- fixed typo in the name of Oscheius tipulae (PRJEB15512) (Oschieus -> Oscheius)
- renamed genes for Plectus sambesii (PRJNA390260), from names like “g2” to names like “PSAMB.scaffold2size251193.g730”
Data and tools
We have re-run our comparative genomics pipeline, constructing gene trees and finding orthologs and homologs. We have also re-run the newest InterProScan (5.30-69.0) and our cross-references pipeline for all our genomes.
We re-imported all public RNASeq data from our collaborators, and made a round of minor improvements to the displays. We now use information from PubMed to let you find data sets corresponding to a publication of interest.
Archived gene IDs
New sequencing technologies let labs construct better genome assemblies, bringing access to chromosome level assembly data even to relatively small research communities. We are excited to see this trend. Each genome update brings new evidence and potentially unlocks research into previously forbiddingly difficult biological questions.
At the same time, insights gathered in work published using previous assemblies should stay accessible to the community, so there is a need to connect different assemblies with each other. As of this release, WormBase ParaSite will keep track of previous identifier versions at gene level and display annotation history. Authors of a genome update do not always provide a mapping to the previous version, so we developed a pipeline to match up identifiers between genome versions.
Overview of new functionality
For an overview of how this now works, consider Smp_340760, a Schistosoma mansoni gene. The gene model has been revised twice: it used to be called Smp_044010, but in the “Schisto_7.1” version of the annotation that we published in WBPS11 the authors changed the gene structure enough that they decided to assign it a new identifier, and in “Schisto_7.2”, which we publish now, the gene model was corrected slightly.
Searching by Smp_044010 now leads to a page explaining that the identifier was deprecated and redirecting to Smp_340760. There, the gene’s history is shown as a diagram.
The site also displays previous protein sequences of transcripts, to help you carry forward any conclusions based on the previous gene model – the less the sequence has changed, the more similar results will be for e.g. BLAST matches.
ID mapping pipeline
We used authors’ mappings between annotation versions for updates of Schistosoma mansoni and Hymenolepis microstoma. Everywhere else we used an automated mapping pipeline, adapted from Ensembl gene build.
The pipeline runs the sequence matching tool exonerate, scoring matches of exons between the two assemblies and propagating the scores onto the transcript and gene level. The scores are then adjusted based on synteny – if a gene A is near gene B in the previous genome, A is mapped to A’ in the new genome, and there is a gene B’ near A’, the match of B to B’ is strengthened. Finally, best matches are iteratively taken out of the scoring, producing a list of pairs.
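To make the last two steps more concrete, here is a minimal Python sketch of a synteny adjustment followed by iterative best-match selection. It is an illustration only, not the pipeline’s actual code: the function names, the `bonus` weight, the neighbour tables and the toy scores are all assumptions made up for this example, and the real pipeline works from exonerate exon scores rather than hand-written numbers.

```python
def synteny_adjust(scores, old_neighbours, new_neighbours, bonus=0.2):
    """Strengthen a match (b, b_new) when a neighbour of b matches a neighbour of b_new.

    scores: dict mapping (old_id, new_id) -> similarity score.
    old_neighbours / new_neighbours: dict mapping a gene to the set of genes near it.
    bonus: hypothetical weight for the synteny support.
    """
    adjusted = {}
    for (old_id, new_id), score in scores.items():
        support = 0.0
        for a in old_neighbours.get(old_id, ()):
            for a_new in new_neighbours.get(new_id, ()):
                # A neighbouring pair that also matches supports this pair.
                support += scores.get((a, a_new), 0.0)
        adjusted[(old_id, new_id)] = score + bonus * support
    return adjusted


def pick_best_pairs(scores, min_score=0.0):
    """Repeatedly take the best remaining match and remove both genes from play."""
    used_old, used_new, pairs = set(), set(), []
    for (old_id, new_id), score in sorted(
        scores.items(), key=lambda item: item[1], reverse=True
    ):
        if score < min_score:
            break
        if old_id in used_old or new_id in used_new:
            continue
        pairs.append((old_id, new_id))
        used_old.add(old_id)
        used_new.add(new_id)
    return pairs


# Toy data: B matches C_new slightly better on sequence alone,
# but B_new sits next to A_new, and A is next to B in the old genome.
toy_scores = {("A", "A_new"): 0.9, ("B", "B_new"): 0.5, ("B", "C_new"): 0.55}
toy_old_neighbours = {"A": {"B"}, "B": {"A"}}
toy_new_neighbours = {"A_new": {"B_new"}, "B_new": {"A_new"}, "C_new": set()}

without_synteny = pick_best_pairs(toy_scores)
with_synteny = pick_best_pairs(
    synteny_adjust(toy_scores, toy_old_neighbours, toy_new_neighbours)
)
print(without_synteny)
print(with_synteny)
```

In this toy case the synteny support flips the call for gene B from C_new to B_new, which is the effect described above.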
Results and benchmarking
We find the pipeline to be quite conservative even after we relaxed a few parameters around minimal match scores and similar values. Typically only between a third and two thirds of the genes in the updated genome have a related past identifier:
| Genome | previous genes total | previous version | mapped | new genes total | fraction mapped |
|---|---|---|---|---|---|
| Ancylostoma ceylanicum PRJNA72583 | 11783 | WBPS11 | 7564 | 15892 | 0.476 |
| Ascaris suum PRJNA62057 | 17974 | WBPS9 | 9468 | 15260 | 0.620 |
| Fasciola hepatica PRJEB25283 | 16806 | WBPS10 | 7564 | 22676 | 0.334 |
| Haemonchus contortus PRJEB506 | 19430 | WBPS10 | 11439 | 21869 | 0.523 |
| Meloidogyne incognita PRJEB8714 | 45351 | WBPS10 | 11977 | 19212 | 0.623 |
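Judging from the numbers, the “fraction mapped” column is the number of mapped genes divided by the number of genes in the new annotation. That reading is our inference from the table rather than a stated definition; the short Python check below recomputes the column under that assumption.

```python
# (genome, mapped, new genes total), copied from the table above.
rows = [
    ("Ancylostoma ceylanicum PRJNA72583", 7564, 15892),
    ("Ascaris suum PRJNA62057", 9468, 15260),
    ("Fasciola hepatica PRJEB25283", 7564, 22676),
    ("Haemonchus contortus PRJEB506", 11439, 21869),
    ("Meloidogyne incognita PRJEB8714", 11977, 19212),
]


def fraction_mapped(mapped, new_total):
    """Share of genes in the new annotation that have a related past identifier."""
    return round(mapped / new_total, 3)


fractions = {genome: fraction_mapped(mapped, total) for genome, mapped, total in rows}
for genome, fraction in fractions.items():
    print(f"{genome}\t{fraction:.3f}")
```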
We also ran the automated pipeline on the S. mansoni WBPS10->WBPS11 update, comparing the results to a manual mapping obtained by annotators tracking individual identifiers. Our pipeline carried forward 5165 genes that the authors considered to have no or only minor changes, and 1347 genes with larger changes, onto an annotation with 10172 genes. It missed 2584 genes that are present somewhere in the manual mapping. It disagreed with the manual mapping in only 199 cases: some were genuinely wrong calls, and some were on par with the manual mapping, being e.g. a mapping to a paralogous gene.
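A comparison of this kind can be sketched in a few lines of Python. The snippet below is a simplified illustration, not the script we used: it assumes both mappings are one-to-one dictionaries from old to new identifier (real mappings also contain splits and merges), and the gene names in the toy data are hypothetical.

```python
def compare_mappings(manual, automatic):
    """Count how an automatic old->new mapping fares against a manual one.

    agree:    the automatic mapping gives the same new identifier
    missed:   the old identifier is in the manual mapping only
    disagree: both map the old identifier, but to different new identifiers
    """
    counts = {"agree": 0, "missed": 0, "disagree": 0}
    for old_id, manual_new in manual.items():
        auto_new = automatic.get(old_id)
        if auto_new is None:
            counts["missed"] += 1
        elif auto_new == manual_new:
            counts["agree"] += 1
        else:
            counts["disagree"] += 1
    return counts


# Hypothetical identifiers, for illustration only.
toy_manual = {"old_1": "new_1", "old_2": "new_2", "old_3": "new_3", "old_4": "new_4"}
toy_automatic = {"old_1": "new_1", "old_2": "new_2", "old_3": "new_9"}

toy_counts = compare_mappings(toy_manual, toy_automatic)
print(toy_counts)
```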
Update: The task has not been finished yet and may take a couple more hours. We expect to complete it by 10pm UK time. Thank you for your cooperation.
Please note that we are going to perform server maintenance for the website on Tuesday 4th Dec 2018 from 2pm to 5pm (UK time). During this period, you will not be able to sign in or use tools on the website, including BLAST and VEP. We are sorry for the inconvenience this may cause.
(updated in WBPS13 to reflect archiving functionality and mention gene IDs in GFF dumps)
As sequencing projects progress and yield improved data, we occasionally update genomes on the site to provide the best resources available. An example of this is Schistosoma mansoni, released in WBPS11. The authors of the new S. mansoni annotation preserved identifiers where genes were unchanged, and otherwise assigned new identifiers. They kept a list of changes – updates, splits and merges – between the two versions, available here in full.
Annotation authors do not always try to preserve gene identifiers, and e.g. in our other recent updates of H. contortus and F. hepatica, the identifiers are all new. In such cases, we run a pipeline that tries to reconcile the two versions, and while the results are not even close in quality to a manually curated list of updates, they are still frequently helpful.
This list of gene updates is then integrated into WormBase ParaSite. Searching for an old gene should bring you to an archived ID page, showing the protein sequence of that gene and a link to a new ID if there is one. The data is also available in the GFFs, in the previous_gene_id attribute in column 9 of gene features.
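If you would rather work from the GFF files, the attribute can be pulled out with a short script. The Python sketch below assumes standard GFF3 layout (nine tab-separated columns, `key=value` pairs separated by `;` in column 9) and assumes that multiple previous identifiers, if present, are comma-separated. The example line is made up for illustration: the contig name, coordinates and the `gene:` prefix in the ID are assumptions, and only the pairing of Smp_158960 with Smp_345020 comes from the site.

```python
from urllib.parse import unquote


def previous_gene_ids(gff_lines):
    """Map each gene feature's ID to its list of previous gene identifiers."""
    result = {}
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != "gene":
            continue
        attributes = dict(
            field.split("=", 1) for field in columns[8].split(";") if "=" in field
        )
        if "previous_gene_id" in attributes:
            gene_id = attributes.get("ID", "")
            # GFF3 attribute values may be URL-escaped; multiple values use commas.
            result[gene_id] = [
                unquote(value) for value in attributes["previous_gene_id"].split(",")
            ]
    return result


# Illustrative line only: coordinates and contig name are not real data.
example_gff = [
    "##gff-version 3",
    "contig_1\tWormBase_imported\tgene\t100\t900\t.\t+\t.\t"
    "ID=gene:Smp_345020;previous_gene_id=Smp_158960",
    "contig_1\tWormBase_imported\tmRNA\t100\t900\t.\t+\t.\t"
    "ID=transcript:Smp_345020.1;Parent=gene:Smp_345020",
]

previous_ids = previous_gene_ids(example_gff)
print(previous_ids)
```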
The list is as complete as can be given the methods used to reach it – either authors tracking gene updates, or our method based on sequence similarity and homology – and sometimes our users find themselves with an old gene mentioned in a publication etc. that isn’t connected to a new identifier. We would like to describe a number of strategies that can be used to reach desired content in such cases, and go through a worked example.
Our search facility is best at suggesting and retrieving our gene identifiers, e.g. WBGene00001135, and gene families, e.g. hedgehog, but it will also retrieve a number of different identifiers from e.g. UniProt or GenBank. Do try it first, and let us know if the search is not returning something you would expect it to bring back.
FTP dumps from previous releases
Our Downloads section contains a link to our FTP site, with data for all previous releases of WormBase ParaSite. If all you require is e.g. a protein sequence for an old identifier, you are done – otherwise, you can use the sequence to search further.
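Once you have downloaded a protein FASTA file from an older release, pulling out one gene’s sequences takes only a few lines. The Python sketch below is a minimal example under stated assumptions: it takes the first word of each header as the sequence identifier and assumes transcript identifiers look like the gene identifier followed by a dot and a number. The header layout in the real dumps may differ, and the sequences in the toy data are invented.

```python
def read_fasta(lines):
    """Parse FASTA text into a dict of identifier -> sequence."""
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited word of the header as the identifier.
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def sequences_for_gene(records, gene_id):
    """Return records named after the gene itself or after its transcripts (gene_id.N)."""
    return {
        name: sequence
        for name, sequence in records.items()
        if name == gene_id or name.startswith(gene_id + ".")
    }


# Invented sequences, for illustration only.
toy_fasta = [
    ">old_gene_1.1 transcript=old_gene_1.1 gene=old_gene_1",
    "MKTAYIAKQR",
    "QISFVKSHFS",
    ">old_gene_10.1 transcript=old_gene_10.1 gene=old_gene_10",
    "MSSPLLE",
]

toy_hits = sequences_for_gene(read_fasta(toy_fasta), "old_gene_1")
print(toy_hits)
```

To read from a downloaded file instead of a list, pass an open file handle to `read_fasta`; the path depends on which release and species you fetched.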
Given any sequence, you can retrieve a corresponding identifier by BLAST – perfect or near-perfect alignment across a large substring is almost always the right hit. Of course you might also find that your gene of interest has multiple copies, has been split in two in the new annotation, etc.
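If you run BLAST locally and end up with many hits, the “perfect or near-perfect alignment across a large substring” rule of thumb can be applied automatically. The Python sketch below assumes NCBI BLAST+ tabular output with the default twelve columns (`-outfmt 6`) and a separately supplied table of query lengths. The thresholds and the hits in the toy data are made up for illustration and are not recommendations.

```python
def near_perfect_hits(blast_lines, query_lengths, min_identity=98.0, min_coverage=0.9):
    """Keep hits that are almost identical across most of the query.

    blast_lines: lines of BLAST+ tabular output with the default columns
        qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    query_lengths: dict mapping query identifier -> query sequence length
    """
    kept = []
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        identity = float(fields[2])
        query_start, query_end = int(fields[6]), int(fields[7])
        # Fraction of the query covered by this alignment.
        coverage = (query_end - query_start + 1) / query_lengths[query]
        if identity >= min_identity and coverage >= min_coverage:
            kept.append((query, subject, identity, round(coverage, 3)))
    return kept


# Hypothetical hits, for illustration only.
toy_blast = [
    "old_gene.1\tnew_gene_a.1\t100.0\t100\t0\t0\t1\t100\t1\t100\t1e-60\t200",
    "old_gene.1\tnew_gene_b.1\t45.0\t60\t33\t0\t10\t69\t5\t64\t1e-5\t40",
]
toy_lengths = {"old_gene.1": 105}

toy_kept = near_perfect_hits(toy_blast, toy_lengths)
print(toy_kept)
```

A query with more than one surviving hit is worth a closer look: it may be one of the multi-copy or split genes mentioned above.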
Other websites 🙂
Going through the UniProt search frequently gives good results when searching for gene families, and you might want to try the BLAST service at ENA or UniProt as an alternative to our BLAST.
We aim for you to be able to retrieve all the data you need from WormBase ParaSite, but using multiple resources is very pragmatic and frequently optimal, so with every release we reconcile our identifiers with a number of external sources. This process lets our gene pages link to corresponding pages in other resources in the External references section. To find a relevant WormBase ParaSite page given e.g. an INSDC or UniProt ID, you can use the previously described search facility.
Exonerate is a very fast tool for sequence alignment. It could come in handy if you need to achieve similar results to our online BLAST for a few hundred anonymous sequences. We hope you will not have to resort to it!
Chalmers et al. (2015) identified ten tegumental surface proteins in S. mansoni as important in understanding the host immune response to adult schistosomes. Two of the ten genes – Smp_081920 and Smp_158960 – are absent from the new annotation.
Smp_158960 turns up in search results under a different identifier – it is now Smp_345020. Looking in the External references table reveals that Smp_158960 is the only previous identifier for Smp_345020, and following the “view all locations” link reveals that Smp_345020 is the only gene annotated with Smp_158960 as a previous identifier, so the mapping is one to one. We can guess that the authors modified the gene structure and decided to give the gene a new identifier.
Searching for Smp_081920 returns nothing, and it took me a bit of time to figure out what happened.
I knew I could get the sequence from the FTP site, but it saved me some effort to discover that the identifier is still hanging on in UniProt, which gave me the protein sequence.
I then used BLASTP, which gave me a full match with Smp_166350 over about 100 residues: long enough for me to potentially trust, but perhaps omitted by the authors when reconciling old and new identifiers.
I became certain I was looking at the same gene after comparing the exon structure of Smp_166350 with the structure of Smp_081920, conveniently still online at our friends’ place, Ensembl Metazoa. The two are very similar, except that Smp_166350 has three additional non-coding exons.
I was skeptical of these non-coding exons and wanted to see some expression evidence for them.
I opened the JBrowse display in ParaSite and enabled some tracks across a few studies and developmental stages, and my skepticism grew. I saw high peaks in three non-coding exons and no reads aligned to what is claimed to be the coding region in tracks for cercariae and juveniles. Then, in tracks for adult worms but also in a track for miracidia, I saw the opposite: reads in the coding region, and no expression for that UTR.
My conclusion from this data would be that the gene Smp_081920 is now Smp_166350, but the structure of Smp_166350 is not completely correct: the UTRs are spurious, and they probably make another gene instead. I have submitted this conclusion to the annotation authors – it will be corrected in future versions of the annotation.
1st – 6th September, 2019: Hydra, Greece.
Registration: 13th Jan – 29th March 2019.
About the meeting
The study of helminth parasites continues to excite great interest across the suite of modern scientific themes. With a wealth of genome information and high-throughput technologies, new drug and vaccine development, and intricate host-parasite molecular interactions, we are witnessing a new era of research on these organisms and the diseases they cause. Parasitic Helminths: New Perspectives in Biology and Infection will be the 13th in a highly successful series now held every year on the beautiful island of Hydra, Greece. All major helminth research areas are covered, including new genomics of animal- and plant-parasitic nematodes, interfaces with free-living helminths such as C. elegans and planarians, developmental and molecular biology, genetics, neurobiology, pharmacology, immunology and vaccine research, all aimed at creating new strategies for control of these prevalent parasitic organisms and the diseases they cause.
Registration for 2019 will open on Monday January 14 and close on Friday March 29. For questions, please email us at firstname.lastname@example.org
Attendance is limited to 100 people, consistent with a discussion-orientated meeting in which every delegate is an active participant – early registration is encouraged.
Invited speakers 2019
Alejandro Sánchez Alvarado, USA
Small RNA Mini-Symposium
Amy Buck, Edinburgh, UK
Louisa Cochella, Vienna, Austria
Alison Elliott, Entebbe, Uganda
Carolina Escobar, Toledo, Spain
Elodie Ghedin, New York, USA
Nicola Harris, Melbourne, Australia
Karl Hoffmann, Aberystwyth, UK
P’ng Loke, New York, USA
Peter Sarkies, London, UK
William Sullivan, Santa Cruz, USA
Example program (2018)
The Meeting Program is organized primarily from abstracts submitted by participants, a small group of invited speakers, and an invited Keynote Speaker. Many abstracts are selected for short talks and there are two poster sessions. Ample time to interact with participants is available with free time in the afternoons to explore the island and head to the beach.
Travel to Hydra
Hydra is a small island in the Argo Saronic Gulf, lying off the east coast of the Peloponnese and easily reached by hydrofoil from the Athens port of Piraeus.
The meeting is held at the Bratsera Hotel. A wide range of accommodation is available within a 5-10 minute walk of the conference venue, including pensions for those on a tight budget.
Dick Davis, University of Colorado
Kleoniki Gounaris, Imperial College
Rick Maizels, University of Glasgow
Murray Selkirk, Imperial College