Automated Proteogenomic Analysis of Pelagibacter ubique.

Sequences obtained from GenBank (CP000084)


Peptide Identification

Number of genomic elements = 1
Name Length
gi|71061822|gb|CP000084.1| Candidatus Pelagibacter ubique HTCC1062, complete genome 1308759

Number of Spectra: 471879

                         Peptide   Spectra
IDs on Actual Database   7095      20155
IDs on Decoy Database    339       380
False Discovery Rate     0.048     0.019

The false discovery rates above are computed from the database search against a six-frame translation of the genome and a decoy database of the same size. Counting only the hits in coding regions and using a decoy database the size of the proteome (the more common practice), the false discovery rates are lower: 0.003 at the spectrum level and 0.008 at the peptide level.
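The rates in the table are consistent with dividing the number of decoy hits by the number of hits on the actual database. The report does not state its formula, so the Python sketch below assumes that simple ratio; the function name is illustrative.

```python
def false_discovery_rate(decoy_hits: int, target_hits: int) -> float:
    """Estimate the FDR as decoy hits / target hits (assumed estimator)."""
    if target_hits <= 0:
        raise ValueError("target_hits must be positive")
    return decoy_hits / target_hits


# Counts taken from the table above (six-frame search, equal-size decoy).
peptide_fdr = false_discovery_rate(339, 7095)
spectrum_fdr = false_discovery_rate(380, 20155)

print(f"peptide level:  {peptide_fdr:.3f}")
print(f"spectrum level: {spectrum_fdr:.3f}")
```

Rounded to three decimals, both values agree with the reported 0.048 and 0.019.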

Access List of Identified Peptides



Protein Identification

Number of annotated ORFs = 1354
Number of proteins confirmed with 2 or more peptides = 696 | Access List
Number of proteins having only 1 peptide = 217 | Access List


Histogram of the number of unique peptides per protein. The X axis shows the number of unique peptides per protein and the Y axis shows the frequency (number of proteins). The bins are of size 2, and the plot is truncated at 100 (some proteins may have more peptides).



Histogram of the percent coverage of the proteins by the identified peptides. The X axis shows the coverage in percent and the Y axis shows the frequency (number of proteins). The bins are 10 percentage points wide.
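A minimal Python sketch of how percent coverage and its histogram bin could be computed for one protein. The function names, the 1-based inclusive peptide spans, and the choice to place 100% coverage in the last bin are assumptions made for illustration, not details taken from the report.

```python
def percent_coverage(protein_length, peptide_spans):
    """Percent of residues covered by at least one peptide.

    peptide_spans: (start, end) pairs, 1-based and inclusive (assumed).
    """
    covered = [False] * protein_length
    for start, end in peptide_spans:
        # Clip each span to the protein before marking residues.
        for pos in range(max(start, 1), min(end, protein_length) + 1):
            covered[pos - 1] = True
    return 100.0 * sum(covered) / protein_length


def coverage_bin(percent, bin_size=10):
    """Lower edge of the bin; 100% is placed in the last bin (assumed)."""
    return min(int(percent // bin_size) * bin_size, 100 - bin_size)


# Hypothetical protein of 200 residues with three identified peptides,
# two of which overlap.
example_coverage = percent_coverage(200, [(1, 20), (15, 40), (101, 110)])
example_bin = coverage_bin(example_coverage)
print(example_coverage, example_bin)
```

Overlapping peptides are counted once because coverage is tracked per residue rather than by summing peptide lengths.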





Improving Genome Annotation

Number of peptides within genes: 6740
Number of peptides crossing gene boundaries: 9
Number of peptides outside genes: 276
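The three categories above can be expressed as an interval comparison between each peptide and the annotated genes. The Python sketch below assumes 1-based inclusive genomic coordinates and that the genes passed in are already restricted to the peptide's strand and frame; the function name and the example coordinates are hypothetical.

```python
def classify_peptide(pep_start, pep_end, genes):
    """Return 'within', 'crossing' or 'outside' for a peptide's genomic span.

    genes: list of (start, end) pairs, 1-based inclusive (assumed), already
    filtered to the peptide's strand and reading frame.
    """
    overlaps = False
    for gene_start, gene_end in genes:
        if gene_start <= pep_start and pep_end <= gene_end:
            return "within"      # fully inside one gene
        if pep_start <= gene_end and gene_start <= pep_end:
            overlaps = True      # partial overlap with this gene
    return "crossing" if overlaps else "outside"


# Hypothetical gene annotations and peptides.
genes = [(100, 400), (600, 900)]
labels = [
    classify_peptide(150, 180, genes),
    classify_peptide(390, 420, genes),
    classify_peptide(450, 480, genes),
]
print(labels)
```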

Newly Identified Protein Coding Segments:

Coordinates                  Length (aa)  #Peptides  #Spectra  PossibleExplanation
Seq1:60565-60786 (+1)        74           2          2         Unknown
Seq1:502441-502575 (+1)      45           1          1         N_Term_Extension of SAR11_0513
Seq1:716194-716298 (+1)      35           3          3         N_Term_Extension of SAR11_0733
Seq1:969505-969516 (+1)      4            1          1         N_Term_Extension of SAR11_0987
Seq1:602993-603019 (+2)      9            1          1         N_Term_Extension of SAR11_0617
Seq1:767552-767593 (+2)      14           1          2         N_Term_Extension of SAR11_0794
Seq1:1213909-1214088 (-1)    60           2          2         Unknown
Seq1:666799-666840 (-1)      14           2          2         N_Term_Extension of SAR11_0680
Seq1:569329-569352 (-1)      8            1          1         N_Term_Extension of SAR11_0579
Seq1:461389-461430 (-1)      14           3          23        Unknown
Seq1:291691-291729 (-1)      13           2          4         N_Term_Extension of SAR11_0291
Seq1:934935-935117 (-2)      61           2          2         Unknown
Seq1:361353-361394 (-2)      14           2          2         N_Term_Extension of SAR11_0369
Seq1:274863-274892 (-2)      10           2          2         N_Term_Extension of SAR11_0272
Seq1:1530-1535 (-2)          2            1          5         N_Term_Extension of SAR11_0001
Seq1:1292837-1292923 (-3)    29           3          9         N_Term_Extension of SAR11_1357
Seq1:411572-411589 (-3)      6            1          1         N_Term_Extension of SAR11_0423

Each segment represents a putative coding region that is not contained within known ORFs. The number of peptides reported in this table includes only the peptides that lie outside the known ORFs. A value of 1 means that there must be another peptide nearby within a known ORF, since we require at least two peptides for any new segment.
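The lengths listed for the segments are consistent with the nucleotide span of each coordinate range divided by three, that is, a length in amino acids. The Python sketch below parses a coordinate string and derives that length; the regular expression, function name, and returned field names are illustrative assumptions.

```python
import re

# Matches strings such as "Seq1:60565-60786 (+1)".
SEGMENT_RE = re.compile(
    r"^(?P<seq>\w+):(?P<start>\d+)-(?P<end>\d+) \((?P<strand>[+-])(?P<frame>[123])\)$"
)


def parse_segment(text):
    """Parse a segment coordinate string and derive its length in residues."""
    match = SEGMENT_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognised segment: {text!r}")
    start, end = int(match["start"]), int(match["end"])
    nucleotides = end - start + 1  # coordinates are treated as inclusive
    return {
        "seq": match["seq"],
        "start": start,
        "end": end,
        "strand": match["strand"],
        "frame": int(match["frame"]),
        "length_aa": nucleotides // 3,
    }


first = parse_segment("Seq1:60565-60786 (+1)")
shortest = parse_segment("Seq1:1530-1535 (-2)")
print(first["length_aa"], shortest["length_aa"])
```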




Analyzing Non-tryptic Peptides

Number of peptides with both ends non-tryptic = 211
Number of peptides with one end non-tryptic = 1689
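A minimal Python sketch of how a peptide's ends could be classified. It assumes the simplest trypsin rule (cleavage after K or R, with no proline exception) and treats the protein termini as valid ends; the report does not state which rule it applies, and the protein sequence used here is hypothetical.

```python
def count_non_tryptic_ends(protein, start, end):
    """Count non-tryptic ends (0, 1 or 2) of the peptide protein[start:end].

    start and end are 0-based, end-exclusive indices. Simplified rule
    (assumed): an end is tryptic if it follows K/R or is a protein terminus.
    """
    n_term_ok = start == 0 or protein[start - 1] in "KR"
    c_term_ok = end == len(protein) or protein[end - 1] in "KR"
    return (not n_term_ok) + (not c_term_ok)


protein = "MKTAYIAKQRQISFVK"  # hypothetical sequence
fully_tryptic = count_non_tryptic_ends(protein, 2, 8)   # TAYIAK
semi_tryptic = count_non_tryptic_ends(protein, 3, 8)    # AYIAK
non_tryptic = count_non_tryptic_ends(protein, 3, 7)     # AYIA
print(fully_tryptic, semi_tryptic, non_tryptic)
```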

Number of contained peptides = 1052
- Peptides that are contained at their N-terminal side = 760
- Peptides that are contained at their C-terminal side = 349

Number of non-Contained peptides (i.e. peptides that are not a subsequence of other peptides) = 848
Number of non-Covered peptides (non-Contained and located within confirmed proteins) = 593
Number of non-Covered N-termini (note that one terminus of a non-Covered peptide may be covered; here we look for cases where the N-terminus is not covered) = 276
Number of non-Covered N-termini with no upstream coverage in the protein = 114
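The contained and non-contained counts add up to the total number of non-tryptic peptides (1052 + 848 = 1900 = 211 + 1689). The Python sketch below shows one way to make that split, assuming "contained" means the peptide sequence is a substring of another identified peptide; the function name and the peptide sequences are hypothetical.

```python
def split_contained(peptides):
    """Split peptides into those contained in another peptide and the rest."""
    unique = sorted(set(peptides))
    contained, non_contained = [], []
    for pep in unique:
        # A peptide is contained if it is a proper substring of another one.
        if any(pep != other and pep in other for other in unique):
            contained.append(pep)
        else:
            non_contained.append(pep)
    return contained, non_contained


contained, non_contained = split_contained(
    ["AYIAK", "TAYIAK", "QISFVK", "ISFV"]
)
print(contained, non_contained)
```

This quadratic comparison is adequate for a few thousand peptides; a larger set would benefit from an index over the protein coordinates.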

The plot gives the distribution of residue distances from a non-tryptic endpoint to the endpoint of the containing tryptic peptide. The X axis shows the distance and the Y axis shows the frequency.



Distribution of the N-termini of all non-covered peptides, and of those that also have no upstream coverage. The X axis shows the distance of the peptide from the N-terminus of the protein, and the Y axis shows the frequency.





N-Terminal Methionine Cleavage


71 proteins were found with N-terminal methionine cleavage


Efficiency of N-terminal methionine cleavage for each amino acid at the second position, as observed in this proteomic analysis and in the in vitro analysis of E. coli [Hirel et al. 1989]. The amino acids are listed in increasing order of side-chain size. Amino acids that occur fewer than 10 times at the second position are shown with an asterisk.



AminoAcid FrequencyCut FrequencyNoCut CleavageEfficiency HirelEtAlEfficiency
G 1 0 1 0.971
A 12 0 1 0.958
P 5 1 0.833333333333333 0.882
S 40 1 0.975609756097561 0.84
T 13 0 1 0.897
V 0 1 0 0.837
C 0 0 NA 0.71
N 0 14 0 0.164
D 0 6 0 0.161
L 0 4 0 0.163
I 0 6 0 0.184
H 0 0 NA 0
Q 0 6 0 0
E 0 8 0 0
F 0 5 0 0
M 0 0 NA 0
K 0 13 0 0
Y 0 0 NA 0
W 0 0 NA 0
R 0 1 0 0

The table reports the N-terminal methionine cleavage efficiency for different amino acids at the second position. For each amino acid at the second position, the numbers of proteins with and without an N-terminal methionine cleavage are reported and compared with the efficiency values reported in [Hirel et al. 1989].
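The CleavageEfficiency column is consistent with FrequencyCut / (FrequencyCut + FrequencyNoCut), with NA where an amino acid was never observed at the second position. A short Python sketch under that assumption, using the counts from the table (the variable and function names are illustrative):

```python
# (FrequencyCut, FrequencyNoCut) per second-position amino acid, from the table.
COUNTS = {
    "G": (1, 0), "A": (12, 0), "P": (5, 1), "S": (40, 1), "T": (13, 0),
    "V": (0, 1), "C": (0, 0), "N": (0, 14), "D": (0, 6), "L": (0, 4),
    "I": (0, 6), "H": (0, 0), "Q": (0, 6), "E": (0, 8), "F": (0, 5),
    "M": (0, 0), "K": (0, 13), "Y": (0, 0), "W": (0, 0), "R": (0, 1),
}


def cleavage_efficiency(cut, no_cut):
    """Fraction of proteins with the initiator Met removed; None if unobserved."""
    total = cut + no_cut
    return None if total == 0 else cut / total


efficiency = {aa: cleavage_efficiency(*pair) for aa, pair in COUNTS.items()}
total_cleaved = sum(cut for cut, _ in COUNTS.values())

for aa, value in efficiency.items():
    print(aa, "NA" if value is None else f"{value:.3f}")
print("proteins with cleavage:", total_cleaved)
```

The FrequencyCut column sums to 71, matching the number of proteins reported with N-terminal methionine cleavage.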





Signal Peptides


Venn diagram of all signal peptide predictions on confirmed proteins.



Sequence logo for the amino acid sequence motif of all signal peptides identified by our analysis. Position -1 corresponds to the last residue of the signal peptide.



Number of signal peptide predictions by SignalP and PrediSi that we rejected because peptides were observed upstream of the predicted signal cleavage site.




Unrestricted Modifications Search

Identified 2385 unique modification sites in the organism at a 5% false discovery rate. These sites are supported by 2675 peptides in the forward database (127 in the shuffled database). Note that two or more different peptide species may correspond to the same modification site.


AminoAcid ModificationMass NumPeptides PeptidesWithUnmodifiedVersions PossibleAnnotation
+M 16 1247 743 oxidation M
+W 32 172 121 double oxidation W
+M 32 82 44 double oxidation M
+S 28 77 57 formaldehyde S
+C -57 73 3 C-57
+T 28 73 58 formaldehyde T
+W 4 69 58 Oxidation (Trp to kynurenine)
+K 28 68 53 dimethylation
+N -17 37 26 succinimide formation
+K 43 31 16 carbamylation K
+W 16 30 29 oxidation W
+Q -17 30 20 pyroglutamate formation
+K 14 21 13 methylation
+T -18 17 10 dehydration
+M 44 14 11 ?
+Y 14 14 14 ?
+W 48 13 10 ?
+M 19 11 9 ?
+W 13 10 10 ?
+M -48 9 6 M+16-64
+N 28 9 9 ?
+D -18 9 3 dehydration
+R 55 9 5 R+55
+M 48 8 4 ?
+I 28 8 8 ?
+S 102 8 1 phosphorylation + sodium
+S -18 8 3 dehydration
+E 28 8 8 ?
+G 22 8 1 Sodium adduct
+G 28 8 7 ?
+E -18 7 3 dehydration
+S 128 7 3 lysine adduct
+L 38 6 0 Potassium adduct
+V -17 6 5 ?
+S 80 6 1 phosphorylation
+W 12 5 5 Formaldehyde adduct
+K 48 5 4 ?
+W 20 5 5 ?
+V 28 4 3 ?
+A 28 4 4 ?
+L 28 4 4 ?
+C 77 4 1 ?
+W 85 4 4 ?
+I 128 4 4 lysine adduct
+A 128 4 4 lysine adduct
+C -43 4 0 methylation
+L 128 3 1 lysine adduct
+I -17 3 2 ?
+H 57 3 0 CAM
+R -17 3 2 ?
+T 63 3 2 ?
+G 48 3 0 ?
+D 53 3 3 ?
+L -17 3 2 ?
+T 44 3 3 ?
+K 19 3 3 ?
+Q 28 3 1 ?
+W -116 3 0 ?
+P 57 3 2 NT+CAM
+W 14 3 3 ?
+W 64 3 3 ?
+Y 28 3 3 ?
+S 42 3 0 acetylation
+M 47 3 2 ?
+M 17 3 1 ?
+N 22 3 2 Sodium adduct
+D 28 3 3 ?
+S 38 3 3 Potassium adduct
+T 64 3 0 ?
+Y -10 2 1 ?
+K 16 2 1 ?
+L 19 2 1 ?
+D -10 2 0 ?
+D 38 2 0 Potassium adduct
+T 128 2 0 lysine adduct
+K -14 2 1 ?
+K 85 2 2 ?
+Y 15 2 2 ?
+I 247 2 0 ?
+W 36 2 2 ?
+I -10 2 0 ?
+K 57 2 0 NT+CAM
+D 125 2 2 ?
+I 3 2 1 ?
+D 22 2 2 Sodium adduct
+R 38 2 1 Potassium adduct
+M 15 2 1 ?
+S 44 2 2 ?
+Y 141 2 1 ?
+T 117 2 0 ?
+F 28 2 2 ?
+E 48 2 1 ?
+R 28 2 2 ?
+K 27 2 2 ?
+D 42 2 2 acetylation
+N 48 2 2 ?
+G 128 2 0 lysine adduct
+D 210 2 1 ?
+V 22 2 1 Sodium adduct
+V -39 2 0 ?
+M 3 2 2 ?
+T 57 2 1 NT+CAM
+E 131 2 0 ?
+D -17 2 1 ?
+N 219 2 0 ?
+T -36 2 2 ?
+Q 128 2 1 lysine adduct
+E 16 2 2 ?
+I 19 2 2 ?
+K -17 2 1 ?
+L 3 1 1 ?
+V 156 1 1 ?
+R 159 1 1 ?
+W -4 1 1 ?
+G 38 1 1 Potassium adduct
+G 66 1 0 ?
+D 48 1 1 ?
+K -59 1 0 ?
+Y 128 1 1 lysine adduct
+A 12 1 0 ?
+D 239 1 1 ?
+G 89 1 0 ?
+N 43 1 0 carbamylation
+V 3 1 1 ?
+V 43 1 1 carbamylation
+T 70 1 1 ?
+N 242 1 0 ?
+S 62 1 0 ?
+Q 57 1 1 NT+CAM
+F -67 1 0 ?
+Y -76 1 0 ?
+C 110 1 0 ?
+L 43 1 1 carbamylation
+K 47 1 1 ?
+N 66 1 0 ?
+T 62 1 0 ?
+E 26 1 1 ?
+I 192 1 1 ?
+V 109 1 0 ?
+E 60 1 1 ?
+R 125 1 1 ?
+N 159 1 0 ?
+F 3 1 1 ?
+D -52 1 1 ?
+Q 184 1 1 ?
+M 18 1 1 ?
+D 143 1 1 ?
+D 140 1 0 ?
+E 57 1 1 NT+CAM
+L -3 1 1 ?
+N -36 1 0 ?
+E -39 1 0 ?
+Y 242 1 1 ?
+M 20 1 1 ?
+L -55 1 0 ?
+D 71 1 1 ?
+S 172 1 0 ?
+I 229 1 1 ?
+C -9 1 0 ?
+L 53 1 1 ?
+T -24 1 0 ?
+G 110 1 0 ?
+Y 53 1 1 ?
+R 127 1 1 ?
+T 159 1 0 ?
+R 204 1 0 ?
+S -20 1 1 ?
+A 191 1 0 ?
+Q 212 1 0 ?
+A 198 1 0 ?
+C 182 1 0 ?
+N 49 1 0 ?
+W -84 1 0 ?
+Y -44 1 1 ?
+K 156 1 1 ?
+L 198 1 1 ?
+W -112 1 0 ?
+L 57 1 1 NT+CAM
+V 42 1 0 acetylation
+M 213 1 1 ?
+P 19 1 1 ?
+C 218 1 0 ?
+T 16 1 0 ?
+D 118 1 1 ?
+C 246 1 0 ?
+S 14 1 1 ?
+I -43 1 0 ?
+F 22 1 1 Sodium adduct
+V 127 1 1 ?
+F 47 1 0 ?
+N 31 1 1 ?
+E 14 1 1 ?
+M 131 1 1 ?
+M 33 1 0 ?
+G 19 1 1 ?
+S 57 1 0 NT+CAM
+I 22 1 0 Sodium adduct
+E 38 1 1 Potassium adduct
+T 149 1 0 ?
+K 44 1 1 ?
+V 89 1 0 ?
+G 53 1 1 ?
+Y 100 1 1 ?
+K 82 1 1 ?
+E -44 1 1 ?
+M 42 1 1 acetylation
+I 90 1 0 ?
+N 143 1 0 ?
+W 3 1 1 ?
+I 109 1 0 ?
+E 127 1 0 ?
+W 49 1 1 ?
+T -38 1 0 ?
+W 30 1 1 ?
+V 38 1 1 Potassium adduct
+N 3 1 1 ?
+F 87 1 0 ?
+E 22 1 1 Sodium adduct
+E 228 1 0 ?
+E 42 1 0 acetylation
+K 175 1 0 ?
+W -52 1 0 ?
+S 162 1 0 ?
+D 142 1 1 ?
+D 141 1 1 ?
+T 245 1 0 ?
+N -43 1 0 ?
+K 22 1 1 Sodium adduct
+H 28 1 1 ?
+L 184 1 1 ?
+L -36 1 0 ?
+F -72 1 0 ?
+Y -79 1 1 ?
+T 164 1 1 ?
+K -44 1 1 ?
+N 27 1 1 ?
+G 40 1 0 ?
+Y 175 1 0 ?
+C 57 1 0 NT+CAM
+Y 80 1 0 phosphorylation
+A 19 1 1 ?
+W 56 1 1 ?
+C -3 1 0 ?
+S -16 1 1 ?
+L -18 1 1 ?
+K -15 1 0 ?
+D 69 1 1 ?
+W 236 1 0 ?
+V 48 1 1 ?
+G 43 1 1 carbamylation
+H 14 1 1 methylation
+V 13 1 0 ?
+M 143 1 1 ?
+L 71 1 0 ?
+D 19 1 1 ?
+Q 179 1 0 ?
+F 43 1 1 carbamylation
+D -31 1 0 ?
+V 57 1 1 NT+CAM
+E 147 1 0 ?
+G 54 1 0 ?
+P 38 1 0 Potassium adduct
+I 53 1 1 ?
+Y 38 1 1 Potassium adduct
+Q 38 1 0 Potassium adduct
+T 17 1 1 ?
+Q 14 1 1 ?
+T 169 1 1 ?
+M 49 1 0 ?
+K 245 1 0 ?
+P 50 1 1 ?
+L -38 1 0 ?
+Y 132 1 1 ?
+Q 42 1 1 acetylation
+M -66 1 1 ?
+A 43 1 1 carbamylation
+K 18 1 0 ?
+F -17 1 1 ?
+C -59 1 0 ?
+E -70 1 0 ?
+D 64 1 0 ?
+C -40 1 0 ?
+M 30 1 1 ?
+D 117 1 1 ?
+T 198 1 1 ?
+N 154 1 1 ?
+T 38 1 0 Potassium adduct
+L 17 1 0 ?
+F 38 1 1 Potassium adduct
+T 196 1 1 ?
+T -17 1 0 ?
+T 26 1 1 ?
+M 99 1 0 ?
+Y 184 1 1 ?
+I 57 1 1 NT+CAM
+A 127 1 1 ?
+D 128 1 1 lysine adduct
+M 26 1 1 ?
+G 3 1 0 ?
+M 127 1 0 ?
+L 125 1 1 ?
+V 44 1 1 ?
+R 249 1 1 ?
+C 125 1 0 ?
+L 94 1 0 ?
+N 175 1 0 ?
+K 139 1 1 ?
+C 176 1 0 ?
+M 64 1 1 ?
+E 39 1 1 ?
+S -25 1 0 ?
+N 35 1 1 ?
+Y 43 1 1 carbamylation
+L 34 1 1 ?
+W -23 1 0 ?
+T 22 1 1 Sodium adduct
+V 47 1 0 ?
+Y -36 1 0 ?
+K 147 1 0 ?
+Y -49 1 0 ?
+V 18 1 0 ?
+L -46 1 0 ?
+P 225 1 1 ?
+A 13 1 1 ?
+P 51 1 1 ?
+L 224 1 1 ?
+I 239 1 0 ?
+K 214 1 0 ?
+N 241 1 1 ?
+I 43 1 0 carbamylation
+I -53 1 0 ?
+M -32 1 1 ?
+V -22 1 1 ?
+N 46 1 0 ?
+P 121 1 0 ?
+M 175 1 0 ?
+W 44 1 1 ?
+C -89 1 0 ?
+E 129 1 0 ?
+M 38 1 0 Potassium adduct
+V -32 1 0 ?
+T 43 1 1 carbamylation
+K -56 1 0 ?
+V 172 1 1 ?
+Y 16 1 1 ?
+M -41 1 0 ?
+C 159 1 0 ?
+K -18 1 1 ?
+V 26 1 0 ?
+A 212 1 0 ?
+A 139 1 1 ?
+T 184 1 1 ?
+C 38 1 0 Potassium adduct
+L 182 1 1 ?
+E 50 1 0 ?
+I 34 1 1 ?
+D 56 1 1 ?
+Y 187 1 0 ?
+D 240 1 0 ?
+K 188 1 0 ?
+M 43 1 1 carbamylation
+E 19 1 1 ?
+F 185 1 0 ?
+Y -71 1 1 ?
+K -29 1 0 ?
+A 71 1 0 ?
+A 42 1 0 acetylation
+C -82 1 0 ?
+C 109 1 1 ?
+Y 71 1 1 ?
+K -71 1 0 ?
+G 57 1 0 NT+CAM
+W 47 1 0 ?
+N 62 1 0 ?

Table of observed modifications in the proteins.