This tool was developed as part of a broader study investigating the advantages of "organism-specific" predictors over "general" ones for predicting protein subcellular localization. To demonstrate these advantages, we performed a systematic study in Arabidopsis thaliana and created an integrative Support Vector Machine-based localization predictor called 'AtSubP'. It combines diverse protein features: amino acid composition, sequence-order effects, terminal information, PSSM and similarity search-based PSI-BLAST information. When used to predict seven compartments (Chloroplast, Cytoplasm, Golgi apparatus, Mitochondrion, Extracellular, Nucleus, Plasma membrane) in a 5-fold cross-validation test, our best hybrid classifier achieves an overall accuracy of 90.95%, with high precision and MCC values of 90.86% and 0.89, respectively. Benchmarking AtSubP on two independent datasets, one from Swiss-Prot and another containing GFP- and mass spectrometry-determined proteins from the SUBA and eSLDB databases, shows a significant improvement in prediction accuracy for the 'species-specific' AtSubP over widely used 'general' tools such as TargetP, LOCtree, PA-SUB, MultiLoc, WoLF PSORT and Plant-PLoc, as well as over our newly created 'All-Plant' method. As a further rigorous test, applying the Arabidopsis-specific classifier to six eukaryotic organisms it was not trained on (Rice, Soybean, Human, Yeast, Fruit fly, Worm) yields markedly inferior predictions.
This comprehensive case study therefore strongly suggests that individual organisms carry 'species-specific' sorting patterns or signals that are missed in the training process when 'general' prediction methods are developed. We suggest actively developing similar genome-specific systems for other organisms to speed up their proteome annotation, rather than relying on general prediction tools. Five diverse prediction modules based on various features of a protein sequence have been implemented as a dynamic web server, 'AtSubP', giving users a wider choice of features to extract from their query protein sequences: simple amino acid composition, sequence-order based dipeptide composition, terminal-based information, Position Specific Scoring Matrix (PSSM) and similarity-based PSI-BLAST, along with our best performing hybrid classifier.
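To make the two simplest feature types concrete, the following is a minimal Python sketch of amino acid composition (a 20-dimensional vector) and dipeptide composition (a 400-dimensional vector) computed from a single sequence. The function names, the decision to skip non-standard residues, and the normalization by the number of counted residues or pairs are assumptions made for illustration; they are not taken from the AtSubP implementation.

```python
from itertools import product

# The 20 standard amino acids, in a fixed order so feature indices are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# All 400 ordered residue pairs, e.g. "AA", "AC", ..., "YY".
DIPEPTIDES = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """Return the fraction of each standard residue in seq (20 values).

    Assumption: residues outside the 20 standard letters (e.g. X, B, Z)
    are ignored rather than counted.
    """
    residues = [aa for aa in seq.upper() if aa in AMINO_ACIDS]
    total = len(residues)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(aa) / total for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return the fraction of each adjacent residue pair in seq (400 values).

    Pairs containing a non-standard residue are skipped (an assumption).
    """
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[p] / total for p in DIPEPTIDES]
```

Vectors of this kind could be concatenated and passed to an SVM; how AtSubP scales or combines them in its hybrid classifier is not specified here.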
Currently, the TAIR community uses TargetP to annotate the complete Arabidopsis proteome (ftp://ftp.arabidopsis.org/home/tair/Proteins/Properties/TargetP_analysis.tair10). However, in our independent comparison on various test datasets, 'AtSubP' shows better accuracy and wider location coverage than 'TargetP'. We therefore believe that 'AtSubP' can serve as a better complement for accurately annotating the Arabidopsis thaliana proteome. The complete list of subcellular predictions generated by 'AtSubP' is available under the 'Datasets' section.
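For readers comparing the accuracy and MCC figures quoted on this page, the sketch below shows the standard definitions computed from a multi-class confusion matrix: overall accuracy, and the Matthews correlation coefficient for one compartment treated one-vs-rest. The function names and the row/column convention (rows are true classes, columns are predicted classes) are assumptions for illustration; whether AtSubP reports a per-class average or another multi-class variant of MCC is not stated here.

```python
import math


def overall_accuracy(confusion):
    """Fraction of all proteins assigned to their true compartment."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total if total else 0.0


def one_vs_rest_mcc(confusion, k):
    """Matthews correlation coefficient for class k against all others.

    confusion[i][j] is the number of proteins of true class i
    predicted as class j (assumed convention).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                       # class k, predicted elsewhere
    fp = sum(confusion[i][k] for i in range(n)) - tp  # other classes, predicted as k
    tn = total - tp - fn - fp
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By common convention, MCC is taken as 0 when the denominator vanishes.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

With a seven-compartment predictor, `confusion` would be a 7x7 matrix and `one_vs_rest_mcc` would be called once per compartment.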