ucsd bioinformatics
nitin gupta














what's my contribution?










Journal papers

N. Gupta and P.A. Pevzner. Peptide versus protein identifications. A strike against the two peptide rule. Submitted.

S. Kim, N. Gupta, N. Bandeira and P.A. Pevzner.  Spectral Dictionaries: Integrating De Novo Peptide Sequencing with Database Search of Tandem Mass Spectra. To appear in Molecular and Cellular Proteomics.

N. Gupta, J. Benhamida, V. Bhargava, D. Goodman, E. Kain, I. Kerman, N. Nguyen, N. Ollikainen, J. Rodriguez, J. Wang, M.S. Lipton, M. Romine, V. Bafna, R.D. Smith and P.A. Pevzner (2008). Comparative Proteogenomics: Combining Mass Spectrometry and Comparative Genomics to Analyze Multiple Genomes. Genome Research. 18:1133-1142.
[Abstract] [Full Text] [Pubmed]
* Included in Research Highlights of Nature Reviews Genetics, 6:418, 2008  [link]
* I presented this research in a talk at ASMS 2008 in Denver on June 5, 2008.
* Press releases on undergraduate involvement: UCSD | Science Daily | HHMI

S. Kim, N. Gupta and P.A. Pevzner (2008). The Partition Function of Tandem Mass Spectra: a New Approach to Peptide Identifications. Journal of Proteome Research. 7(8): 3354 - 3363.

J. Rodriguez, N. Gupta, R.D. Smith and P.A. Pevzner (2008). Does trypsin cut before Proline? Journal of Proteome Research. 7(1):300-5.
[Abstract] [Full Text] [Pubmed]
* Noted as one of the 20 most accessed articles in the first quarter of 2008.

N. Gupta, S. Tanner, N. Jaitly, J.N. Adkins, M. Lipton, R. Edwards, M. Romine, A. Osterman, V. Bafna, R.D. Smith and P.A. Pevzner (2007).  Whole proteome analysis of post-translational modifications: applications of mass-spectrometry for proteogenomic annotation. Genome Research. 17(9):1362-77.
[Abstract] [Full Text] [Pubmed]
* Research highlight at Pacific Northwest National Labs [link].

K. Gaurav, N. Gupta and R. Sowdhamini (2005). "FASSM: Enhanced Function Association in whole genome analysis using Sequence and Structural Motifs". In Silico Biology 5, 0040.
[Abstract] [Full Text] [Pubmed]

N. Gupta, N. Mangal and S. Biswas (2005). "Evolution and similarity evaluation of protein structures in contact map space". Proteins: Structure, Function and Bioinformatics, 59(2):196-204.
[Abstract] [Full Text] [Pubmed]

A. Bhaduri, G. Pugalenthi, N. Gupta and R. Sowdhamini (2004). "iMOT: an interactive package for the selection of spatially interacting motifs". Nucleic Acids Research,  32, W602-W605.
[Abstract] [Full Text] [PubMed]

N. Gupta and A. Irback (2004). Coupled folding-binding versus docking: A lattice model study. Journal of Chemical Physics, 120, 3983-3989.
[Abstract] [Full Text] [Pubmed]


Conference papers

B. Dost, T. Shlomi, N. Gupta, V. Bafna, and Roded Sharan, "QNet: A tool for querying biological networks", RECOMB 2007.
[Abstract] [Full Text]
* Also published in Lecture Notes in Bioinformatics, 4453, p. 1 ff.

N. Gupta, N. Mangal, K. Tiwari and P. Mitra. "Mining quantitative association rules in protein sequences". Proceedings of the third Australasian Data Mining Conference 2004 (AusDM'04), Cairns, Australia.
[Abstract]
* Also published in Lecture Notes in Computer Science, Volume 3755 / 2006, pp. 273 - 281.

N. Gupta and V. K. Agrawal. "Two Criterion Optimization in state assignment for synchronous finite state machines using NSGA-II". Proceedings of the International Conference on Adaptive and Natural Computing Algorithms, 2005 (ICANNGA'05), Coimbra, Portugal.
[Abstract]


Patents

S. Kim, N. Gupta and P.A. Pevzner. Method for identifying peptides using tandem mass spectra by dynamically determining the number of peptide reconstructions required. Pending.

Abstracts


K. Gaurav, N. Gupta and R. Sowdhamini. "FASSM: Enhanced Function Association in whole genome analysis using Sequence and Structural Motifs". In Silico Biology 5, 0040 (2005).

We present an algorithm to detect remote homology, which arises through circular permutation and discontinuous domains. It is also helpful in detecting small domain proteins that are characterized by few conserved residues. The input to the algorithm is a set of multiply aligned protein sequence profiles. This method, coded as FASSM, examines the sequence conservation and positions of protein family signatures or motifs for the annotation of protein sequences and to facilitate the analysis of their domains. The overall coverage of FASSM is 93% in comparison to other validation tools like HMM and IMPALA. The method is especially useful for difficult relationships such as discontinuous domains during whole-genome surveys and is demonstrated to perform accurate family associations at sequence identities as low as 15%.



N. Gupta, N. Mangal and S. Biswas (2005). "Evolution and similarity evaluation of protein structures in contact map space".  Proteins: Structure, Function and Bioinformatics, 59(2):196-204.

Prediction of fold from the amino-acid sequence of a protein has been an active area of research in the past few years, but the limited accuracy of existing techniques emphasizes the need to develop newer approaches to tackle this task. In this study, we use contact map prediction as an intermediate step in fold prediction from sequence. A contact map is a reduced graph-theoretic representation of a protein that models the local and global inter-residue contacts in the structure. We start with a population of random contact maps for the protein sequence and "evolve" the population to a "high-feasibility" configuration using a genetic algorithm. A neural network is employed to assess the feasibility of contact maps based on four physically relevant properties. We also introduce five parameters, based on algebraic graph theory and physical considerations, that can be used to judge the structural similarity between proteins through contact maps. To predict the fold of a given amino acid sequence, we predict a contact map that will sufficiently approximate the structure of the corresponding protein. Then we assess the similarity of this contact map with the representative contact map of each fold; the fold that corresponds to the closest match is our predicted fold for the input sequence. We have found that our feasibility measure is able to differentiate between feasible and infeasible contact maps. Further, this novel approach is able to predict the folds from sequences significantly better than a random predictor.
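To make the notion of a contact map concrete, here is a minimal sketch of how one can be derived from known coordinates. It is only an illustration, not the prediction pipeline of the paper: the function name `contact_map`, the single point per residue, and the 8.0 distance cutoff are all assumptions chosen for the example.

```python
import math

def contact_map(coords, threshold=8.0):
    """Return a symmetric 0/1 matrix with 1 where two residues lie within `threshold`.

    `coords` holds one (x, y, z) point per residue; the cutoff is an assumed value.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= threshold:
                cmap[i][j] = cmap[j][i] = 1
    return cmap

# Toy example: four residues on a straight line, 5 units apart,
# so only chain neighbours fall within the cutoff.
toy_coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
toy_map = contact_map(toy_coords)
```

The matrix can equally be read as the adjacency matrix of a graph on residues, which is what makes graph-theoretic similarity measures applicable.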



A. Bhaduri, G. Pugalenthi, N. Gupta and R. Sowdhamini (2004). "iMOT: an interactive package for the selection of spatially interacting motifs". Nucleic Acids Research, 32, W602-W605.

Functional selection and three-dimensional structural constraints of proteins relate to the retention of significant sequence similarity between proteins of similar fold and function despite poor overall sequence identity and evolutionary pressures. We report the availability of ‘iMOT’ (interacting MOTif) server, an interactive package for the automatic identification of spatially interacting motifs among distantly related proteins sharing similar folds and possessing common ancestral lineage. Spatial interactions between conserved stretches of a protein are evaluated by calculations of pseudo-potentials that describe the strength of interactions. Such an evaluation permits the automatic identification of highly interacting conserved regions of a protein. Interacting motifs have been shown to be useful in searching for distant homologues and establishing remote homologies among the largely unassigned sequences in genome databases. Information on such motifs should also be of value in protein folding, modelling and engineering experiments.
The iMOT server can be accessed from http://www.ncbs.res.in/~faculty/mini/imot/iMOTserver.html.



N. Gupta and A. Irback (2004). "Coupled folding-binding versus docking: A lattice model study". Journal of Chemical Physics, 120, 3983-3989.

Using a simple hydrophobic/polar protein model, we perform a Monte Carlo study of the thermodynamics and kinetics of binding to a target structure for two closely related sequences, one of which has a unique folded state while the other is unstructured. We obtain significant differences in their binding behavior. The stable sequence has rigid docking as its preferred binding mode, while the unstructured chain tends to first attach to the target and then fold. The free-energy profiles associated with these two binding modes are compared.
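As a rough illustration of a hydrophobic/polar lattice model, the sketch below scores a chain on a 2D square lattice and gives the standard Metropolis acceptance probability. The energy function (minus one per non-bonded H-H lattice contact), the 2D lattice, and the names `hp_energy` and `acceptance_probability` are assumptions for the example; the model used in the paper may differ in its details.

```python
import math

def hp_energy(sequence, positions):
    """Energy of an HP chain: -1 for each pair of H residues that are
    lattice neighbours but not adjacent along the chain (assumed scoring)."""
    energy = 0
    for i in range(len(sequence)):
        for j in range(i + 2, len(sequence)):
            if sequence[i] == "H" and sequence[j] == "H":
                (x1, y1), (x2, y2) = positions[i], positions[j]
                if abs(x1 - x2) + abs(y1 - y2) == 1:
                    energy -= 1
    return energy

def acceptance_probability(delta_e, temperature):
    """Metropolis rule: always accept downhill moves, uphill with exp(-dE/T)."""
    return min(1.0, math.exp(-delta_e / temperature))

toy_sequence = "HPPH"
folded = [(0, 0), (1, 0), (1, 1), (0, 1)]    # chain bent into a square
extended = [(0, 0), (1, 0), (2, 0), (3, 0)]  # chain stretched out
```

In this toy case the bent conformation brings the two H residues together and is lower in energy than the stretched one, so a Monte Carlo move from `extended` to `folded` would always be accepted.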



N. Gupta, N. Mangal, K. Tiwari and P. Mitra. "Mining quantitative association rules in protein sequences". Proceedings of the third Australasian Data Mining Conference 2004 (AusDM'04), Cairns, Australia.

A lot of research has gone into understanding the composition and nature of proteins, yet much remains to be understood. It is now generally believed that amino acid sequences of proteins are not random, and thus the patterns of amino acids that we observe in protein sequences are non-random. In this study, we try to decipher the nature of associations between different amino acids that are present in a protein. This very basic analysis can provide some insight into the co-occurrence of certain amino acids in a protein. Such association rules are desirable for enhancing our understanding of protein composition. They have the potential to give some clue regarding global interactions among particular sets of amino acids occurring in proteins. The presence of strong non-trivial associations further suggests evidence for non-randomness of protein sequences.
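The two standard measures behind an association rule, support and confidence, can be sketched as follows. This is a simplified Boolean version (an amino acid is either present in a sequence or not), whereas the paper mines quantitative rules; the function name `rule_stats` and the toy sequences are made up for illustration.

```python
def rule_stats(sequences, antecedent, consequent):
    """Support and confidence of the rule 'antecedent present -> consequent present'.

    support    = fraction of all sequences containing both residues
    confidence = fraction of antecedent-containing sequences that also contain the consequent
    """
    with_antecedent = [s for s in sequences if antecedent in s]
    with_both = [s for s in with_antecedent if consequent in s]
    support = len(with_both) / len(sequences)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence

# Hypothetical toy data, not real protein sequences.
toy_sequences = ["ACDW", "ACGG", "CCKL", "GGHH"]
support, confidence = rule_stats(toy_sequences, "A", "C")
```

A rule is usually reported only when both values exceed chosen thresholds; a quantitative variant would replace the presence test with a test on composition ranges.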



N. Gupta and V. K. Agrawal. "Two Criterion Optimization in state assignment for synchronous finite state machines using NSGA-II". Proceedings of the International Conference on Adaptive and Natural Computing Algorithms, 2005 (ICANNGA'05), Coimbra, Portugal.

This project aims to find the best state assignment for implementing synchronous sequential circuits, which are also represented as finite state machines. This problem, commonly known as the State Assignment Problem (S.A.P.), has been studied extensively because of its importance in reducing the cost of implementation. Previous work on this problem assumes that the number of bits used for state assignment is given beforehand. Thus the problem has been treated as a single-objective problem, with the only objective being to reduce the cumulative cost of transitions between connected states.

In this work, we add another dimension to this optimization problem by introducing a second objective of minimizing the number of bits used for assignment. This is desirable to reduce the complexity and cost of the circuit. The second objective conflicts with the first objective and thus the optimal solution requires a tradeoff between the two. We have used different EMO methods to tackle this problem. The results show that our NSGA-II based approach, with some modifications to constraint handling, gives better results and running time than NSGA. We also gain some insights about the shape of the efficient frontier.
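The tradeoff between the two objectives can be illustrated with a small Pareto-dominance sketch. It is not NSGA-II itself, only the non-dominated filtering idea that such methods build on; the names `dominates` and `pareto_front` and the candidate values are hypothetical.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` in every objective and strictly
    better in at least one (both objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical candidates as (transition cost, number of bits).
candidates = [(10, 3), (8, 4), (12, 3), (8, 5), (7, 6)]
front = pareto_front(candidates)
```

The surviving points form the efficient frontier: lowering the transition cost further is only possible by spending more bits, and vice versa.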



B. Dost, T. Shlomi, N. Gupta, V. Bafna, and Roded Sharan, "QNet: A tool for querying biological networks", RECOMB 2007. Also published in Lecture Notes in Bioinformatics, 4453, p. 1 ff.

Molecular interaction databases can be used to study the evolution of molecular pathways across species. Querying such pathways is a challenging computational problem, and recent efforts have been limited to simple queries (paths), or simple networks (forests). In this paper, we significantly extend the class of pathways that can be efficiently queried to the case of trees, and graphs of bounded treewidth. Our algorithm allows the identification of non-exact (homeomorphic) matches, exploiting the color coding technique of Alon et al. We implement a tool for tree queries, called QNet, and test its retrieval properties in simulations and on real network data. We show that QNet searches queries with up to 9 proteins in seconds on current networks, and outperforms sequence-based searches. We also use QNet to perform the first large scale cross-species comparison of protein complexes, by querying known yeast complexes against a fly protein interaction network. This comparison points to strong conservation between the two species, and underscores the importance of our tool in mining protein interaction networks.
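The color coding idea can be sketched on the simplest query type, a path with k vertices. This is a simplified illustration and not the QNet implementation, which handles tree queries and homeomorphic matches; the function names and the toy graphs are assumptions. Vertices are colored at random with k colors, and dynamic programming over color sets looks for a path whose vertices all have distinct colors, which guarantees that the path is simple.

```python
import random

def has_colorful_path(adj, coloring, k):
    """True if some path on k vertices uses k distinct colors.

    paths[v] holds the color sets of colorful paths that end at v.
    """
    paths = {v: {frozenset([coloring[v]])} for v in adj}
    for _ in range(k - 1):
        extended = {v: set() for v in adj}
        for u in adj:
            for colors in paths[u]:
                for v in adj[u]:
                    if coloring[v] not in colors:
                        extended[v].add(colors | {coloring[v]})
        paths = extended
    return any(paths[v] for v in adj)

def has_simple_path(adj, k, trials=200, seed=0):
    """Randomized test: repeat random colorings; never reports a false positive."""
    rng = random.Random(seed)
    for _ in range(trials):
        coloring = {v: rng.randrange(k) for v in adj}
        if has_colorful_path(adj, coloring, k):
            return True
    return False

# Toy undirected graphs as adjacency lists.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
star = {"hub": ["x", "y", "z"], "x": ["hub"], "y": ["hub"], "z": ["hub"]}
```

A single coloring can miss an existing path, so the number of trials controls the chance of a false negative; a reported match is always a genuine simple path.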



N. Gupta, S. Tanner, N. Jaitly, J.N. Adkins, M. Lipton, R. Edwards, M. Romine, A. Osterman, V. Bafna, R.D. Smith and P.A. Pevzner (2007). Whole proteome analysis of post-translational modifications: applications of mass-spectrometry for proteogenomic annotation. Genome Research. 17(9):1362-77.

While bacterial genome annotations have significantly improved in recent years, techniques for bacterial proteome annotation (including post-translational chemical modifications, signal peptides, proteolytic events, etc.) are still in their infancy. At the same time, the number of sequenced bacterial genomes is rising sharply, far outpacing our ability to validate the predicted genes, let alone annotate bacterial proteomes. In this study, we use tandem mass spectrometry (MS/MS) to annotate the proteome of Shewanella oneidensis MR-1, an important microbe for bioremediation. In particular, we provide the first comprehensive map of post-translational modifications in a bacterial genome, including a large number of chemical modifications, signal peptide cleavages, and cleavages of N-terminal methionine residues. We also detect multiple genes that were missed or assigned incorrect start positions by gene prediction programs, and suggest corrections to improve the gene annotation. This study demonstrates that complementing every genome sequencing project by an MS/MS project would significantly improve both genome and proteome annotations for a reasonable cost.



J. Rodriguez, N. Gupta, R.D. Smith and P.A. Pevzner (2008). Does trypsin cut before Proline? Journal of Proteome Research. 7(1):300-5.

Trypsin, which has high substrate specificity, is the most commonly used enzyme for protein digestion in mass spectrometry. Many peptide identification algorithms incorporate these specificity rules as filtering criteria. A generally accepted "Keil rule" is that trypsin cleaves next to arginine or lysine, but not before proline. Since this rule was derived two decades ago based on a small number of experimentally confirmed cleavages, we decided to re-examine it using 14.5 million tandem spectra (a two orders of magnitude increase in the number of observed tryptic cleavages). Our analysis revealed a surprisingly large number of cleavages before proline. We examine several hypotheses to explain these cleavages and argue that trypsin specificity rules used in peptide identification algorithms should be modified to "legitimatize" cleavages before proline. Our approach can be applied to analyzing any protease and we further argue that specificity rules for other enzymes should also be re-evaluated based on statistical evidence derived from large MS/MS datasets.
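The effect of the Keil rule on an in-silico digest can be sketched as follows. This is an illustrative toy, not the analysis code of the paper: the function names, the `keil_rule` flag, and the example sequence are assumptions, and missed cleavages are ignored.

```python
def tryptic_cleavage_sites(protein, keil_rule=True):
    """Indices i such that trypsin is assumed to cut between protein[i] and protein[i + 1].

    Cuts follow K or R; with keil_rule=True, cuts before P are suppressed.
    """
    sites = []
    for i in range(len(protein) - 1):
        if protein[i] in "KR":
            if keil_rule and protein[i + 1] == "P":
                continue
            sites.append(i)
    return sites

def digest(protein, keil_rule=True):
    """Split the protein into peptides at every allowed cleavage site."""
    peptides, start = [], 0
    for site in tryptic_cleavage_sites(protein, keil_rule):
        peptides.append(protein[start:site + 1])
        start = site + 1
    peptides.append(protein[start:])
    return peptides

toy_protein = "AAKPGGRTTK"  # hypothetical sequence with one K-P junction
with_rule = digest(toy_protein, keil_rule=True)
without_rule = digest(toy_protein, keil_rule=False)
```

With the rule enabled, the peptide starting at proline is never generated, so a search engine that filters on this rule cannot identify it even if the spectrum is present.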




N. Gupta, J. Benhamida, V. Bhargava, D. Goodman, E. Kain, I. Kerman, N. Nguyen, N. Ollikainen, J. Rodriguez, J. Wang, M.S. Lipton, M. Romine, V. Bafna, R.D. Smith and P.A. Pevzner (2008). Comparative Proteogenomics: Combining Mass Spectrometry and Comparative Genomics to Analyze Multiple Genomes. Genome Research. 18:1133-1142.

Recent proliferation of low-cost DNA sequencing techniques will soon lead to an explosive growth in the number of sequenced genomes and will turn manual annotations into a luxury. Mass spectrometry recently emerged as a valuable technique for proteogenomic annotations that improves on the state-of-the-art in predicting genes and other features. However, previous proteogenomic approaches were limited to a single genome and did not take advantage of analyzing mass spectrometry data from multiple genomes at once. We show that such a comparative proteogenomics approach (like comparative genomics) allows one to address the problems that remained beyond the reach of the traditional "single proteome" approach in mass spectrometry. In particular, we show how comparative proteogenomics addresses the notoriously difficult problem of "one-hit-wonders" in proteomics, improves on the existing gene prediction tools in genomics, and allows identification of rare post-translational modifications. We therefore argue that complementing DNA sequencing projects by comparative proteogenomics projects can be a viable approach to improve both genomic and proteomic annotations.













Copyright © 2004-2006 www.ngupta.com