AtSubP: the Arabidopsis Subcellular Localization Prediction Server   

BENCHMARK DATA
 
  • Five-fold cross-validation test : The learning dataset contains 3214 Arabidopsis thaliana proteins classified into the 7 subcellular localizations under study (Chloroplast, Cytoplasm, Golgi apparatus, Mitochondrion, Extracellular, Nucleus, Plasma membrane) according to the annotation available in Swiss-Prot (release 57.9). Both the Swiss-Prot IDs and the sequences are provided in FASTA format. No protein has >30% sequence identity to any other protein in the same subset (subcellular location). See the text of the paper for further explanation.
    Click Chloroplast (601 sequences), Cytoplasm (220 sequences), Golgi apparatus (106 sequences), Mitochondrion (391 sequences), Extracellular (452 sequences), Nucleus (1197 sequences), and Plasma membrane (247 sequences) to download the training/testing dataset for each localization.
  • Independent test-I : This dataset contains 357 new Arabidopsis sequences that were never used in the original training/testing process when developing the prediction classifiers. About 10% of the sequences from each localization were kept separate from the above training dataset. Further, no sequence in this set has >30% sequence identity with any sequence in the actual learning dataset. Click here to download this combined dataset-I, divided into the 7 localizations under study.
  • Independent test-II : This dataset contains 84 experimentally verified sequences downloaded from two sources, the SUBA and eSLDB databases. Only the sequences common to both sources and shown to be in the same location by both GFP and mass spectrometry techniques were considered. Click here to download the combined dataset-II containing these 84 new experimentally verified Arabidopsis-specific proteins.
  • All Plant Dataset : We also developed a corresponding prediction classifier on the combined protein sequences from all plants and compared it with the Arabidopsis-specific classifier using the same encoding schemes. A total of 17,708 plant proteins with subcellular information available were downloaded from release 57.9 of Swiss-Prot and further processed (see the text of the paper for details). Click here to download the final 30% redundancy-reduced 'All-Plant' dataset, which contains 6,183 sequences divided into the 7 localizations under study.
  • ARABIDOPSIS PROTEOME ANNOTATION : Finally, our best classifier was run on the complete Arabidopsis thaliana proteome (27,379 sequences in total, from TAIR release 9) to generate highly reliable predictions for each of the 7 localizations. For increased confidence, we applied a series of threshold cutoffs (>0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 and 1.0) to the final predicted score. The higher the cutoff, the more reliable the prediction.

    (i) Complete proteome : The complete predictions at an SVM score cutoff of 0.0 are provided here. If all seven location-specific models give a negative score, the final prediction is "Unknown". Download predictions for all the 27,379 protein sequences from TAIR 9.

    (ii) High confidence predictions : For wider applicability and increased prediction confidence, we also provide here the complete list of TAIR IDs (in decreasing order of prediction confidence) of the top scoring proteins (>1.0 cutoff) in each class, along with their corresponding 'Swiss-Prot' and 'TAIR' annotations and the PSI-BLAST hit information, if available. Click Chloroplast (607 sequences), Cytoplasm (1046 sequences), Golgi apparatus (83 sequences), Mitochondrion (732 sequences), Extracellular (883 sequences), Nucleus (4120 sequences), and Plasma membrane (511 sequences) to download these top scoring predictions in Excel format.
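
The "Unknown" rule and the score cutoffs described above can be illustrated with a minimal Python sketch. This is not the server's actual code. It assumes that each protein receives one SVM score per location and that the final prediction is the highest-scoring location, which is a common one-vs-rest convention but is not stated on this page. The function name `predict_location`, the dictionary layout, and the handling of a score exactly equal to the cutoff (treated as "Unknown") are all hypothetical choices made for illustration.

```python
from typing import Dict

# The 7 localizations under study, as listed on this page.
LOCATIONS = (
    "Chloroplast", "Cytoplasm", "Golgi apparatus", "Mitochondrion",
    "Extracellular", "Nucleus", "Plasma membrane",
)


def predict_location(scores: Dict[str, float], cutoff: float = 0.0) -> str:
    """Return the top-scoring location, or "Unknown" if it does not pass the cutoff.

    Assumption: `scores` maps each location to the score of its model,
    and the final prediction is the location with the highest score.
    """
    best = max(scores, key=scores.get)
    # With cutoff=0.0 this yields "Unknown" when no score is positive.
    return best if scores[best] > cutoff else "Unknown"


# Made-up scores for illustration only (not real server output).
example = dict(zip(LOCATIONS, (1.3, -0.2, -0.9, 0.4, -1.1, 0.1, -0.5)))
print(predict_location(example))               # default cutoff of 0.0
print(predict_location(example, cutoff=1.0))   # high-confidence cutoff
print(predict_location(example, cutoff=1.5))   # stricter than any score
```

Raising `cutoff` trades coverage for reliability: fewer proteins receive a location, but those that do pass a stricter score threshold.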

 

 
© 2012 by The Samuel Roberts Noble Foundation, Inc.