Ensembl ProteinView
Introduction
Ensembl 'ProteinView' provides detailed information about automatically annotated Ensembl, manually curated Vega, transient EST, as well as ab initio predicted GENSCAN, SNAP or GENEID protein model predictions. These pages include options for displaying the protein sequence, domain predictions and the effects of genetic variations on protein sequences.
Ensembl Protein Report
Ensembl proteins are annotated by the Ensembl automatic analysis pipeline (Curwen et al., 2004) from one of three sources: a GeneWise (Birney et al., 2004) model built from a species-specific or vertebrate protein; a set of aligned species-specific cDNAs followed by GenomeWise for ORF prediction; or GENSCAN exons supported by protein, cDNA and EST evidence. GeneWise models are further combined with available aligned cDNAs to annotate UTRs.
-
Peptide - Protein names are obtained by comparing the predicted proteins to entries in UniProt/Swiss-Prot, NCBI RefSeq and UniProt/TrEMBL. For further details see the 'Similarity Matches' section of 'Transcript Reports' on Ensembl 'GeneView' pages. A single name is selected for display, and the source is shown. Preferentially, a gene symbol approved by the species-specific nomenclature committee is chosen (e.g. the appropriate HUGO HGNC symbol for human). Failing that, a UniProt/Swiss-Prot identifier, an NCBI RefSeq accession number or a UniProt/TrEMBL accession number is assigned.
-
Ensembl Translation ID - Ensembl stable translation identifiers are mapped between releases. In case a translation model changes dramatically, the old stable identifier may be retired and a new one assigned. However, the Ensembl Archive tracks all stable identifiers and should provide mappings to the current gene predictions.
-
Translation Information - A link to Ensembl 'GeneView' refers to the Ensembl gene to which the translation has been assigned.
-
Genomic Location
-
Top-level - The top-level sequence (e.g. chromosome) location on which this translation has been annotated is indicated. A link to 'ContigView' zooms into the region corresponding to the translation.
-
Sequence-level - The sequence-level (e.g. BAC clone) location on which the start of this translation has been annotated is indicated. A translation prediction may span more than one sequence-level entity. A link to 'ContigView' zooms into the region covered by the sequence-level entity.
In either case green backgrounds in the 'Overview' and 'Detailed View' panels will highlight the transcript associated with this translation.
-
Description - The description of the translation is taken from the UniProt/Swiss-Prot entry the predicted gene is mapped to or, if none is available, from an NCBI RefSeq or UniProt/TrEMBL entry. If a predicted gene has not been mapped to any such entry, no description is given. If its translation is nevertheless part of an Ensembl protein family cluster, the consensus annotation of that family may be informative. See the links to Ensembl 'FamilyView' in the 'Protein Family' section on Ensembl 'ProteinView' pages.
-
Prediction Method - A brief summary of the method used for the prediction of the gene set (Ensembl genes, EST genes) is given.
-
InterPro - Ensembl characterises conserved protein domains defined by the InterPro database on translation predictions. There is also an option [View other Ensembl genes with this domain] leading to Ensembl 'DomainView' pages for each of the conserved protein domain predictions.
-
Protein Family - All Ensembl protein predictions from all Ensembl species and all UniProt entries are clustered into Ensembl protein families. A link to Ensembl 'FamilyView' provides lists of protein family members and annotation of their chromosomal positions on a karyotype ideogram.
For putative orthologues in other species, see also the 'Orthologue Prediction' section of Ensembl 'GeneView' or 'TransView' pages.
-
Protein Features - A diagram showing conserved protein domains from InterPro member databases (e.g. Pfam, PROSITE, ...), as well as coiled coil, low complexity, signal sequence and transmembrane regions. Pointing to a named domain pops up a menu with a link to the respective external database. The purple bar represents the protein, with alternating light and dark sections showing the regions derived from different exons. Pointing to a section pops up exon information.
The locations of any amino acid residues associated with single nucleotide polymorphisms (SNPs) are shown below the protein sequence representation. Synonymous SNPs are depicted in green, while non-synonymous SNPs are drawn in red. Point at a SNP location to display the SNP identifier together with a link to the corresponding Ensembl 'SNPView' page, which displays the position of the affected residue, any alternative amino acids, as well as the nucleotide polymorphism within the context of the codon.
A scale bar in amino acid residues and the SNP legend are at the bottom of the diagram.
Please note that if you follow a link to Ensembl 'SNPView' pages, the alleles and ambiguity codes shown there will reflect those of the NCBI dbSNP entry. Alleles on Ensembl 'GeneSNPView' pages might be complementary to Ensembl 'SNPView' and to dbSNP pages, depending on the chromosome strand the corresponding transcript maps to.
The predicted domains and their positions within the protein, and details of the SNPs, are also shown in tables at the bottom of the 'ProteinView' page.
-
Peptide Sequence - A pull-down menu provides access to options for sequence display. A second pull-down menu turns on numbering of amino acids.
-
No markup (default setting) - This option displays the complete peptide sequence as plain text, suitable for 'cut and paste'.
-
Exons highlighted - Sections of the protein derived from different exons are shown in alternating black and blue colour. Amino acid residues derived from a codon formed across exon boundaries are shown in red.
-
Exons/SNPs - As above, but with amino acid residues associated with single nucleotide polymorphisms (SNPs) highlighted. Synonymous SNPs are depicted in green, while non-synonymous SNPs are drawn in red. Small insertion/deletion polymorphisms are denoted in blue. Point at a SNP location to display the alternative amino acid for non-synonymous SNPs, or the codon sequence for synonymous SNPs, using the appropriate ambiguity codes for the affected nucleotide.
-
Peptide Stats - A short summary of physico-chemical protein properties such as isoelectric point, charge or molecular mass.
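The colouring rule of the 'Exons highlighted' option above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code Ensembl uses: the function name residue_colours and its input (the number of coding bases each exon contributes, in transcript order) are hypothetical.

```python
from typing import List


def residue_colours(exon_coding_lengths: List[int]) -> List[str]:
    """Assign a display colour to every complete codon of a coding sequence.

    exon_coding_lengths is a hypothetical input: the number of coding
    bases contributed by each exon, in transcript order.
    """
    # Record, for every coding base, the index of the exon it comes from.
    base_exon: List[int] = []
    for index, length in enumerate(exon_coding_lengths):
        base_exon.extend([index] * length)

    colours: List[str] = []
    complete = len(base_exon) - len(base_exon) % 3  # ignore a trailing partial codon
    for start in range(0, complete, 3):
        exons = set(base_exon[start:start + 3])
        if len(exons) > 1:
            # The codon is formed across an exon boundary.
            colours.append("red")
        else:
            # Alternate black and blue from one exon to the next.
            colours.append("black" if exons.pop() % 2 == 0 else "blue")
    return colours
```

For two exons contributing 4 and 5 coding bases, the second codon takes one base from the first exon and two from the second, so residue_colours([4, 5]) labels the three residues black, red and blue.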
ProteinDAS Report
'ProteinDAS' is a semantic extension to the DAS protocol that allows the exchange of annotations via gene or protein identifiers. Ensembl 'ProteinView' acts as a client for this server, using the data as additional annotation for Ensembl predicted proteins. Further 'ProteinDAS' developments will allow users to display their own gene annotations.
This implementation currently includes:
-
UniProt/Swiss-Prot literature references linking to the NCBI PubMed literature database.
-
Superfamily - HMM library and genome assignments server.
-
InterPro - a database of protein families, domains and functional sites.
Domains
A table lists the predicted protein domains shown in the 'Protein Structure' diagram, together with their coordinates within the protein sequence. A link is provided to the appropriate external database entry.
Other Protein Features
A table lists protein sequence coordinates for coiled coil, low complexity, signal sequence and transmembrane regions as annotated in the 'Protein Structure' diagram.
-
Coiled Coil Regions - The Ensembl analysis and annotation pipeline uses the ncoils program implemented by R.B. Russell and A.N. Lupas for coiled-coil domain characterisation and annotation. Rob Russell's group at the EMBL Heidelberg provides a public service.
Lupas A, Van Dyke M and Stock J.
Predicting coiled coils from protein sequences.
Science. 1991 May 24;252(5010):1162-1164.
[PubMed]
-
Low-Complexity Regions - Low complexity regions are annotated with the SEG program.
Wootton, J. C. and S. Federhen
Statistics of local complexity in amino acid sequences and sequence databases.
Computers in Chemistry 1993; 17:149-163.
doi:10.1016/0097-8485(93)85006-X
Wootton, J. C. and S. Federhen.
Analysis of compositionally biased regions in sequence databases.
Methods in Enzymology 1996; 266: 554-571.
doi:10.1016/S0076-6879(96)66035-2
-
Signal Sequence Regions - Signal sequences are characterised with SignalP.
Nielsen H, Engelbrecht J, Brunak S, von Heijne G.
Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites.
Protein Eng. 1997 Jan;10(1):1-6.
[Abstract] [Full Text PDF]
Nielsen H, Krogh A.
Prediction of signal peptides and signal anchors by a hidden Markov model.
In J. Glasgow, T. Littlejohn, F. Major, R. Lathrop, D. Sankoff, and C. Sensen, editors
Proceedings of the Sixth International Conference on Intelligent Systems for Molecular Biology, pages 122-130, Menlo Park, CA, 1998. AAAI Press.
[PubMed]
Bendtsen JD, Nielsen H, von Heijne G, Brunak S.
Improved prediction of signal peptides: SignalP 3.0.
J Mol Biol. 2004 Jul 16;340(4):783-795.
doi:10.1016/j.jmb.2004.05.028
-
Transmembrane Regions - Ensembl uses TMHMM for the annotation of transmembrane helices.
A. Krogh, B. Larsson, G. von Heijne, and E. L. L. Sonnhammer.
Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes.
Journal of Molecular Biology, 305(3):567-580, January 2001.
doi:10.1006/jmbi.2000.4315
E. L. L. Sonnhammer, G. von Heijne, and A. Krogh.
A hidden Markov model for predicting transmembrane helices in protein sequences.
In J. Glasgow, T. Littlejohn, F. Major, R. Lathrop, D. Sankoff, and C. Sensen, editors
Proceedings of the Sixth International Conference on Intelligent Systems for Molecular Biology, pages 175-182, Menlo Park, CA, 1998. AAAI Press.
[PubMed]
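To give a feel for what "low complexity" means in the list above, the sketch below scores fixed-length windows of a peptide by the Shannon entropy of their amino acid composition. It is a deliberately simplified illustration and not the SEG algorithm itself; the function names, the window size and the threshold are assumptions chosen for the example.

```python
import math
from collections import Counter
from typing import List, Tuple


def window_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def low_complexity_windows(
    sequence: str, size: int = 12, threshold: float = 2.2
) -> List[Tuple[int, int]]:
    """Return 1-based, inclusive (start, end) coordinates of every window
    whose entropy is at or below the threshold.

    size and threshold are illustrative values, not Ensembl settings.
    """
    hits: List[Tuple[int, int]] = []
    for offset in range(len(sequence) - size + 1):
        if window_entropy(sequence[offset:offset + size]) <= threshold:
            hits.append((offset + 1, offset + size))
    return hits
```

A window made of a single repeated residue has an entropy of 0 bits and is reported, while a window of twelve different residues has an entropy of about 3.58 bits and is not. A real implementation would also merge overlapping windows into regions.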
SNP information
A table lists the genetic variations within the coding sequence, as shown in the 'Protein Structure' diagram, together with their coordinates within the protein sequence and the dbSNP ID. Information about SNP alleles (as appropriate to the strand of the transcript) and the peptide effects of non-synonymous SNPs is also shown.
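The distinction between synonymous and non-synonymous SNPs can be illustrated with a short Python sketch. It assumes the standard genetic code and that both the codon and the alternative allele are given on the transcript strand; the function name classify_snp is hypothetical and this is not Ensembl's own implementation.

```python
from itertools import product
from typing import Tuple

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(codon: str, position: int, alt_base: str) -> Tuple[str, str, str]:
    """Classify a single-base change within a codon.

    codon:    reference codon on the transcript strand
    position: 0-based position of the SNP within the codon
    alt_base: alternative allele on the transcript strand
    Returns (effect, reference amino acid, alternative amino acid).
    """
    alt_codon = codon[:position] + alt_base + codon[position + 1:]
    ref_aa = CODON_TABLE[codon]
    alt_aa = CODON_TABLE[alt_codon]
    effect = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return effect, ref_aa, alt_aa
```

For the codon GAA (glutamate), changing the third base to G gives GAG, which still encodes glutamate, so the SNP is synonymous. Changing the first base to A gives AAA (lysine), so that SNP is non-synonymous.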
Please note that if you follow a link to Ensembl 'SNPView' pages, the alleles and ambiguity codes shown there will reflect those of the NCBI dbSNP entry. Alleles on Ensembl 'GeneSNPView' pages might be complementary to Ensembl 'SNPView' and to dbSNP pages, depending on the chromosome strand the corresponding transcript maps to.
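The strand dependence described in this note can be sketched as follows. The example assumes that alleles are written as a slash-separated string relative to the forward chromosome strand and that the transcript strand is given as 1 or -1; the function name and both conventions are assumptions made for illustration.

```python
# Complement table for nucleotides and IUPAC ambiguity codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def alleles_on_transcript_strand(alleles: str, transcript_strand: int) -> str:
    """Express forward-strand alleles relative to the transcript strand.

    alleles:           slash-separated alleles, e.g. "A/G" (assumed format)
    transcript_strand: 1 for forward, -1 for reverse (assumed convention)
    """
    if transcript_strand == 1:
        return alleles
    # Reverse-complement each allele; "-" (a deletion) is left unchanged.
    return "/".join(
        allele.translate(COMPLEMENT)[::-1] for allele in alleles.split("/")
    )
```

An A/G SNP on the forward strand therefore reads T/C for a transcript on the reverse strand, and the ambiguity code R (A or G) becomes Y (C or T).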
Vega Protein Report
Vega proteins (Ashurst et al., 2005) result from a set of manually annotated genes. Finished genomic sequence is analysed on a clone-by-clone basis using a combination of similarity searches against DNA and protein databases, as well as a series of ab initio gene predictions (GENSCAN, GENEWISE). Gene structures are annotated on the basis of human interpretation of the combined supportive evidence generated during sequence analysis. In parallel, experimental methods are being applied to extend incomplete gene structures and discover new genes. The latter is initiated by comparative analysis of the finished sequence with vertebrate datasets such as the Riken mouse cDNAs, mouse whole-genome shotgun data and Genoscope Tetraodon nigroviridis evolutionarily conserved regions (ecores). The Vega genes are split into two groups: the Vega Havana genes, annotated by the Havana group at the Wellcome Trust Sanger Institute, and the Vega External genes, annotated by groups other than Havana.
EST Protein Report
Generally, Ensembl gene-builds are based on experimental evidence from UniProt/Swiss-Prot, NCBI RefSeq and UniProt/TrEMBL sequence records (Curwen et al., 2004). For all well-established species with sufficient amounts of supporting evidence, Ensembl provides an additional EST gene-build based solely on a purified Expressed Sequence Tag set (Eyras et al., 2004). By keeping both gene-builds separate, Ensembl accounts for the fact that the quality of EST information is quite heterogeneous. On the other hand, ESTs can provide useful hints for the existence of further transcripts, especially alternatively spliced ones. Gene-builds for less characterised species are still based on supporting evidence from the public sequence databases. Since the amount of species-specific evidence for these species is generally too small to support a full gene-build, these builds are mainly based on information from closely related species. Ensembl may also directly exploit EST evidence for its gene predictions in these cases. A separate EST gene set is not available in those cases.
More detailed information on Ensembl gene-builds and EST gene-builds is available from the publications section.
GENSCAN Protein Report
Although 'GENSCAN Protein Report' pages are visually very similar to 'Ensembl Protein Report' pages, the major difference is that GENSCAN transcripts and proteins are based on an ab initio predictor. While all Ensembl predictions are based on experimental evidence, GENSCAN bases its transcript and protein predictions only on the genome sequence. Most ab initio algorithms tend to over-predict genes, so not every GENSCAN protein necessarily reflects a biological protein. While GENSCAN can be useful for detecting additional transcripts and protein products, care is needed when exploiting this information.
Chris Burge and Samuel Karlin
Prediction of complete gene structures in human genomic DNA.
J Mol Biol. 1997 Apr 25; 268(1):78-94.
doi:10.1006/jmbi.1997.0951
Chris B. Burge
Modeling dependencies in pre-mRNA splicing signals.
In Salzberg, S., Searls, D. and Kasif, S., eds.
Computational Methods in Molecular Biology
Elsevier Science, Amsterdam, (1998) 127-163.
ISBN: 0444828753
SNAP Protein Report
The Semi-HMM-based Nucleic Acid Parser (SNAP) is an ab initio gene prediction program, which predicts transcript models solely on the basis of the underlying genomic sequence and does not take any experimental evidence into account. 'SNAP Protein Reports' are not available for all species, but SNAP performs better than GENSCAN in some species.
Ian Korf
Semi-HMM-based Nucleic Acid Parser
unpublished
GENEID Protein Report
GENEID is another ab initio gene prediction program, presently used only in the Tetraodon nigroviridis set provided by Genoscope.
The search box at the top of the page allows you to search for any identifier present in Ensembl. For detailed instructions see the Ensembl 'TextView' page.
