Ensembl TransView
Introduction
Ensembl 'TransView' provides detailed information about automatically annotated Ensembl, manually curated Vega, transient EST, as well as GENSCAN, SNAP or GENEID ab initio transcript model predictions. These pages include options for displaying the transcript sequence, its translation, and the effects of genetic variations on both transcript and translation sequences.
Supporting evidence for Ensembl transcript model predictions is displayed on a per exon basis on Ensembl 'ExonView' pages. For more information on the gene structure and basic properties of the encoded protein, see Ensembl 'ProteinView' pages.
Ensembl Transcript Report
Ensembl transcripts are annotated by the Ensembl automatic analysis pipeline (Curwen et al., 2004) from one of three sources: a GeneWise (Birney et al., 2004) model built from a species-specific or vertebrate protein; a set of aligned species-specific cDNAs followed by GenomeWise for ORF prediction; or GENSCAN exons supported by protein, cDNA and EST evidence. GeneWise models are further combined with available aligned cDNAs to annotate UTRs.
-
Transcript - From release 50 onwards, human transcript names are assigned as follows.
Many human transcripts have an HGNC symbol associated with them (from the HUGO Gene Nomenclature Committee). If it is an Ensembl transcript, the HGNC symbol is labelled HGNC (automatic). If it is a manually curated transcript from Vega/Havana, it is labelled HGNC (curated).
'Clone-based' identifiers apply to transcripts that cannot be associated with an HGNC symbol; they are assigned by either Ensembl or Vega, as above.
The list of gene name categories for human is as follows:
1) HGNC (automatic)
2) HGNC (curated)
3) Clone-based (ensembl)
4) Clone-based (vega)
The goal is to be more consistent with the naming of our manually curated imports from Vega/Havana. A new number follows each transcript name. If the number starts with '0' (001, 002, etc.), this is a manually curated transcript from Vega/Havana. If the number starts with '2' (200, 201, etc.), this is an automatically annotated transcript from Ensembl. This does not take the place of the Ensembl gene ID (ENSG...), which is stable from release to release and has not changed.
A 'known' transcript will have a sequence match in a public sequence database such as NCBI RefSeq, UniProt, or EMBL-Bank. The match must be for the same species; otherwise the transcript is considered to be 'novel' (for that species). This is true for both protein-coding transcripts and ncRNAs. ncRNAs predicted using Rfam are searched against the sequence databases for a match. In the case of miRNAs, to be 'known', they must have a match in miRBase (for the same species). See the 'Similarity Matches' section for transcript matches to entries in other databases. For species other than human, the name will come from a match in a database such as MGI, Swiss-Prot, etc.
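The numbering rule above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and the assumption that the number is separated from the symbol by a hyphen (for example 'BRCA2-001') is ours, not something stated on this page.

```python
import re

def annotation_source(transcript_name):
    """Guess the annotation source from the number that follows a transcript name.

    Assumes (hypothetically) a name of the form SYMBOL-NNN, e.g. 'BRCA2-001'.
    """
    match = re.search(r"-(\d+)$", transcript_name)
    if match is None:
        return "unknown"
    number = match.group(1)
    if number.startswith("0"):
        # 001, 002, ... : manually curated import from Vega/Havana
        return "Vega/Havana (manually curated)"
    if number.startswith("2"):
        # 200, 201, ... : automatically annotated by Ensembl
        return "Ensembl (automatically annotated)"
    return "unknown"

print(annotation_source("BRCA2-001"))
print(annotation_source("BRCA2-201"))
```

Names without a trailing number, or with a number starting with any other digit, are reported as 'unknown' rather than guessed.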
-
Ensembl Transcript ID - Ensembl stable transcript identifiers are mapped between releases. If a transcript model changes dramatically, the old stable identifier may be retired and a new one assigned. However, the Ensembl Archive tracks all stable identifiers and should provide mappings to the current gene predictions.
-
Transcript Information - This section lists the number of exons, the transcript length and the translation length for this particular transcript model. A link to Ensembl 'GeneView' refers to the Ensembl gene to which the transcript has been assigned.
-
Genomic Location
-
Top-level - The top-level sequence (e.g. chromosome) location on which this transcript has been annotated is indicated. A link to 'ContigView' zooms into the region corresponding to the transcript.
-
Sequence-level - The sequence-level (e.g. BAC clone) location on which the start of this transcript has been annotated is indicated. A transcript prediction may span more than one sequence-level entity. A link to 'ContigView' zooms into the region covered by the sequence-level entity.
In either case, green backgrounds in the 'Overview' and 'Detailed View' panels will highlight the transcript.
-
-
Description - The description of the transcript is taken from the UniProt/Swiss-Prot entry the predicted gene is mapped to or, if none is available, from an NCBI RefSeq or UniProt/TrEMBL entry. If a predicted gene has not been mapped to any such entry, no description will be given. However, if its translation is part of an Ensembl protein family cluster, the consensus annotation of that family may be informative. See the links to Ensembl 'FamilyView' in the 'Protein Family' section on Ensembl 'ProteinView' pages.
-
Prediction Method - A brief summary of the method used for the prediction of the gene set (Ensembl genes, EST genes) is given.
-
Similarity Matches - Ensembl maps gene, transcript and protein predictions to other databases of biological information. Links to UniProt/Swiss-Prot, NCBI RefSeq and UniProt/TrEMBL are made on the basis of sequence similarity and also display identity percentages, calculated on the protein level.
Target %id indicates the percentage of the Ensembl prediction matching the external database sequence.
Query %id indicates the percentage of the external database sequence matching the Ensembl prediction.
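As a rough illustration of how the two percentages differ, the sketch below computes both from a single alignment. The exact formula Ensembl uses is not given on this page; the sketch assumes that each value is the number of identical residues divided by the full length of the respective sequence, and the function and parameter names are hypothetical.

```python
def percent_identities(identical_residues, ensembl_length, external_length):
    """Return (target_id, query_id) as percentages.

    Assumption: both values share the same count of identical residues and
    differ only in which sequence length is used as the denominator.
    """
    # Target %id: share of the Ensembl prediction matching the external sequence
    target_id = 100.0 * identical_residues / ensembl_length
    # Query %id: share of the external sequence matching the Ensembl prediction
    query_id = 100.0 * identical_residues / external_length
    return target_id, query_id

# 90 identical residues, Ensembl protein of 100 aa, external protein of 120 aa
print(percent_identities(90, 100, 120))  # (90.0, 75.0)
```

A large gap between the two values usually means that one sequence is considerably longer than the other.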
Other links are inferred from these mappings, e.g. NCBI Gene via RefSeq; Reactome (formerly Genome Knowledge Base, GKB), EMBL and Protein IDs via UniProt/Swiss-Prot.
The [align] links lead to Ensembl 'AlignView' pages, which display an alignment of the Ensembl transcript prediction to the external database sequence.
Ensembl establishes mappings to microarray probe set identifiers by matching probe set sequences to Ensembl transcripts. Mapping is successful if the probe set sequences match the transcript or an additional 2 kb downstream sequence window immediately adjacent to the 3'-most exon. For more detailed information see the Microarray Probeset Mapping document.
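The 2 kb downstream window can be pictured with the following simplified sketch. It assumes exact substring matching on sequences already oriented 5' to 3' along the transcript, which is a deliberate simplification; the actual mapping procedure is described in the Microarray Probeset Mapping document, and all names here are hypothetical.

```python
def probe_matches(probe_seq, transcript_seq, downstream_seq, window=2000):
    """Return True if the probe matches the transcript or its downstream window.

    downstream_seq is the genomic sequence immediately 3' of the last exon;
    only its first `window` bases (2 kb by default) are searched.
    Assumption: an exact substring match stands in for the real alignment step.
    """
    target = transcript_seq + downstream_seq[:window]
    return probe_seq in target

# Toy example: the probe only occurs in the downstream sequence
print(probe_matches("ACGT", "TTTT", "GGACGTGG"))            # True
print(probe_matches("ACGT", "TTTT", "GGACGTGG", window=3))  # False
```

Concatenating the two sequences also lets a probe match across the boundary between the last exon and the downstream window.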
-
GO - Gene Ontology terms represent a controlled vocabulary developed by the Gene Ontology Consortium. Ensembl associates GO terms via UniProt mappings as outlined in the 'Similarity Matches' section above. Clicking on a particular identifier takes you to an Ensembl 'GOView' page displaying the context of the term in the ontology, as well as a list of other Ensembl genes mapped to the term. Clicking on the term itself will take you to a Gene Ontology search form. For more details on Ensembl Gene Ontology mapping, see the 'GOView' help page. Evidence codes are also shown and refer to the evidence used for the initial assignment of GO terms to UniProt records. More than one evidence code may be shown where an Ensembl protein has been mapped to more than one UniProt entry. The codes are:
- IC - Inferred by Curator
- IDA - Inferred from Direct Assay
- IEA - Inferred from Electronic Annotation
- IEP - Inferred from Expression Pattern
- IGI - Inferred from Genetic Interaction
- IMP - Inferred from Mutant Phenotype
- IPI - Inferred from Physical Interaction
- ISS - Inferred from Sequence or Structural Similarity
- NAS - Non-traceable Author Statement
- ND - No biological Data available
- RCA - Inferred from Reviewed Computational Analysis
- TAS - Traceable Author Statement
- NR - Not Recorded
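For scripted use, the evidence codes listed above can be held in a simple lookup table. The sketch below is an illustration only; the dictionary and function names are hypothetical and not part of any Ensembl interface.

```python
# Evidence codes exactly as listed in this document
GO_EVIDENCE_CODES = {
    "IC": "Inferred by Curator",
    "IDA": "Inferred from Direct Assay",
    "IEA": "Inferred from Electronic Annotation",
    "IEP": "Inferred from Expression Pattern",
    "IGI": "Inferred from Genetic Interaction",
    "IMP": "Inferred from Mutant Phenotype",
    "IPI": "Inferred from Physical Interaction",
    "ISS": "Inferred from Sequence or Structural Similarity",
    "NAS": "Non-traceable Author Statement",
    "ND": "No biological Data available",
    "RCA": "Inferred from Reviewed Computational Analysis",
    "TAS": "Traceable Author Statement",
    "NR": "Not Recorded",
}

def describe_evidence(code):
    """Return the description for a GO evidence code (case-insensitive)."""
    return GO_EVIDENCE_CODES.get(code.strip().upper(), "Unknown evidence code")

print(describe_evidence("IEA"))
```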
-
InterPro - Ensembl characterises conserved protein domains defined by the InterPro database on translation predictions of transcripts. There is also an option [View other Ensembl genes with this domain] leading to Ensembl 'DomainView' pages for each of the conserved protein domain predictions.
-
Protein Family - All Ensembl protein predictions from all Ensembl species and all UniProt entries are clustered into Ensembl protein families. A link to Ensembl 'FamilyView' provides lists of protein family members and annotation of their chromosomal positions on a karyotype ideogram.
For putative orthologues in other species, see also the 'Orthologue Prediction' section of Ensembl 'GeneView' or 'TransView' pages.
-
Transcript Structure - A diagram illustrates exon and intron structures of the Ensembl transcript model. Any untranslated regions of the transcript (5' and 3' UTRs) are displayed as coloured outlines, while the predicted coding regions (CDS) are shown in solid colour. The black arrows indicate the size of the genomic region spanned by the transcript model. The small red arrow indicates the orientation of the transcript relative to the genome sequence in standard notation.
-
Transcript Neighbourhood - A diagram shows the transcript, highlighted with a green background, in the context of other transcripts in the genomic region. Pointing to another transcript in the diagram displays a pop-up window with links to more information.
-
Transcript Sequence - A pull-down menu provides access to options for sequence display. A second pull-down menu turns on numbering of nucleotides and amino acids.
-
No markup (default setting) - This option displays the complete transcript sequence as plain text, suitable for 'cut and paste'. Sections of the transcript derived from different exons are shown in alternating black and blue colour.
-
Codons highlighted - Untranslated regions (5' and 3' UTRs) of the transcript are shown with a dark yellow background. Codons within the coding sequence are distinguished by alternating light yellow and clear background.
-
Codons/peptide sequence - UTRs and codons are shown as described above. In addition, the amino acid sequence of the coding sequence (CDS) is shown below the codons.
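The codon and peptide display can be reproduced outside the browser with a short script. The sketch below is not Ensembl code: it assumes the standard genetic code, 1-based inclusive CDS coordinates within the transcript sequence, and hypothetical function and variable names.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def codons_and_peptide(transcript_seq, cds_start, cds_end):
    """Split the CDS into codons and translate it.

    cds_start and cds_end are 1-based inclusive positions in the transcript;
    everything before cds_start is 5' UTR, everything after cds_end is 3' UTR.
    """
    cds = transcript_seq[cds_start - 1:cds_end].upper()
    usable = len(cds) - len(cds) % 3  # ignore an incomplete trailing codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # Codons containing ambiguity codes are translated as 'X'
    peptide = "".join(CODON_TABLE.get(codon, "X") for codon in codons)
    return codons, peptide

# Toy transcript: 5' UTR 'GG', CDS 'ATGGCTTAA', 3' UTR 'CC'
print(codons_and_peptide("GGATGGCTTAACC", 3, 11))
# (['ATG', 'GCT', 'TAA'], 'MA*')
```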
-
Codons/peptide/SNPs - As above, but with single nucleotide polymorphisms (SNPs) and other genetic sequence variations shown. A polymorphic nucleotide is shown with a green background if the alternative allele is synonymous and with a red background if it is non-synonymous. The ambiguity code appears above it. Hold your mouse over the nucleotide to display the alleles. Note that the SNP alleles are shown as appropriate to the transcript sequence and may therefore differ from the alleles seen in 'ContigView' and 'SNPView', which are taken directly from the dbSNP entry. For a non-synonymous SNP, the amino acid encoded by the reference sequence is shown in red. Hold your mouse over the amino acid to display the alternative residues. Note that the mouse-over displays of alternatives may not work with all web browsers.
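The distinction between synonymous and non-synonymous alleles can be illustrated as follows. This is a minimal sketch assuming the standard genetic code and a 1-based SNP position within the coding sequence. The complementing helper reflects our reading that alleles differ from dbSNP when the transcript lies on the reverse strand, which this page does not state explicitly. All names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def allele_on_transcript(allele, strand):
    """Complement a genomic allele for a reverse-strand (-1) transcript.

    Assumption: this is why displayed alleles may differ from dbSNP.
    """
    if strand == -1:
        return allele.upper().translate(str.maketrans("ACGT", "TGCA"))
    return allele.upper()

def classify_snp(cds, position, alt_base):
    """Classify a substitution at a 1-based CDS position.

    Returns 'synonymous' (green in the display) or
    'non-synonymous' (red in the display).
    """
    index = position - 1
    codon_start = index - index % 3
    ref_codon = cds[codon_start:codon_start + 3].upper()
    offset = index - codon_start
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    if CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]:
        return "synonymous"
    return "non-synonymous"

cds = "ATGGCTTAA"
print(classify_snp(cds, 6, "C"))  # GCT -> GCC, both alanine
print(classify_snp(cds, 4, "A"))  # GCT -> ACT, alanine -> threonine
```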
Vega Transcript Report
Vega transcripts (Ashurst et al., 2005) result from a set of manually annotated genes. Finished genomic sequence is analysed on a clone-by-clone basis using a combination of similarity searches against DNA and protein databases, as well as a series of ab initio gene predictions (GENSCAN, GENEWISE). Gene structures are annotated on the basis of human interpretation of the combined supporting evidence generated during sequence analysis. In parallel, experimental methods are being applied to extend incomplete gene structures and discover new genes. The latter is initiated by comparative analysis of the finished sequence with vertebrate datasets such as the Riken mouse cDNAs, mouse whole-genome shotgun data and Genoscope Tetraodon nigroviridis evolutionarily conserved regions (ecores). The Vega genes are split into two groups: the Vega Havana genes, annotated by the Havana group at the Wellcome Trust Sanger Institute, and the Vega External genes, annotated by groups other than Havana.
EST Transcript Report
Generally, Ensembl gene builds are based on experimental evidence from UniProt/Swiss-Prot, NCBI RefSeq and UniProt/TrEMBL sequence records (Curwen et al., 2004). For all well-characterised species with sufficient amounts of supporting evidence, Ensembl provides an additional EST gene build based solely on a purified Expressed Sequence Tag (EST) set (Eyras et al., 2004). By keeping the two gene builds separate, Ensembl accounts for the fact that the quality of EST information is heterogeneous. On the other hand, ESTs can provide useful hints for the existence of further transcripts, especially alternatively spliced ones. Gene builds for less well-characterised species are still based on supporting evidence from the public sequence databases. Since the amount of species-specific evidence for these species is generally too small to support a full gene build, these builds are mainly based on information from closely related species. Ensembl may also directly exploit EST evidence for its gene predictions in these cases. A separate EST gene set is not available in those cases.
Please note: Ensembl EST Gene IDs are not stable between builds. Rather than recording those identifiers, record the sequence itself. Sequence similarity searches against future EST gene sets will then allow identification of the then-current EST gene prediction.
More detailed information on Ensembl gene-builds and EST gene-builds is available from the publications section.
GENSCAN Transcript Report
Although 'GENSCAN Transcript Report' pages are visually very similar to 'Ensembl Transcript Report' pages, the important difference is that GENSCAN transcripts are based on an ab initio predictor. While all Ensembl predictions are based on experimental evidence, GENSCAN bases its transcript predictions only on the genome sequence. Most ab initio algorithms tend to over-predict genes, so not every GENSCAN transcript will reflect a biological transcript. GENSCAN transcripts can be useful for detecting additional transcripts or missing exons, but care is needed when using this information.
References:
Chris Burge and Samuel Karlin
Prediction of complete gene structures in human genomic DNA.
J Mol Biol. 1997 Apr 25; 268(1):78-94.
doi:10.1006/jmbi.1997.0951
Chris B. Burge
Modeling dependencies in pre-mRNA splicing signals.
In Salzberg, S., Searls, D. and Kasif, S., eds.
Computational Methods in Molecular Biology
Elsevier Science, Amsterdam, (1998) 127-163.
ISBN: 0444828753
SNAP Transcript Report
The Semi-HMM-based Nucleic Acid Parser (SNAP) is an ab initio gene prediction program, which predicts transcript models solely on the basis of the underlying genomic sequence and does not take any experimental evidence into account. 'SNAP Transcript Reports' are not available for all species, but SNAP performs better than GENSCAN in some species.
References:
Ian Korf
Semi-HMM-based Nucleic Acid Parser
unpublished
GENEID Transcript Report
GENEID is another ab initio gene prediction program, presently used only in the Tetraodon nigroviridis set provided by Genoscope.
The search box at the top of the page allows you to search for any identifier present in Ensembl. For detailed instructions see the Ensembl 'TextView' page.
