Workshop on Data Analysis and Data Mining in Proteomics - IMS

Workshop on
Data Analysis and Data Mining in Proteomics
(9 - 12 May 2005)

~ Abstracts ~

Statistical quality assurance in mass spectrometry for proteomics
Paul Eilers, Leiden University Medical Center, Netherlands

Like many other high-throughput techniques, mass spectrometry has been adopted by biologists and medical doctors for classification of tumours. Strong claims about its potential have been made in high-profile journals. However, papers have also been published that challenge the technical quality of published reports.

At the Leiden University Medical Center we also expect a lot from MS, but we felt that quality assurance should come first and take a central place. Good laboratory practice and frequent calibration were used to guarantee instrumental stability. To quantify biological variability of blood serum samples from a baseline group of healthy persons, several experimental conditions were varied systematically: 1) time of day of sampling, 2) storage temperature, 3) freeze-thaw cycles, 4) repetition of measurements on different days.

Pre-processing of the samples involved binning, noise filtering, baseline correction and normalization. Initial analysis involved “warping” to quantify drift in the mass scale and principal components analysis to quantify repeatability. In later stages ANOVA was applied to quantify components of variability. I will report on methods, implementation and results.
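The four pre-processing steps named above can be sketched as follows. This is a minimal illustration only: the abstract does not say which filter, baseline estimator or normalization was used, so the moving average, the rolling-minimum baseline, the total-intensity scaling and all function names are assumptions.

```python
def bin_spectrum(mz, intensity, width):
    """Sum intensities into fixed-width m/z bins (assumed binning scheme)."""
    lo = min(mz)
    n_bins = int((max(mz) - lo) / width) + 1
    bins = [0.0] * n_bins
    for m, i in zip(mz, intensity):
        bins[int((m - lo) / width)] += i
    return bins


def smooth(values, half_window):
    """Moving-average noise filter; the window is truncated at the edges."""
    out = []
    for k in range(len(values)):
        window = values[max(0, k - half_window): k + half_window + 1]
        out.append(sum(window) / len(window))
    return out


def subtract_baseline(values, half_window):
    """Estimate the baseline as a rolling minimum and subtract it."""
    out = []
    for k in range(len(values)):
        window = values[max(0, k - half_window): k + half_window + 1]
        out.append(values[k] - min(window))
    return out


def normalize(values):
    """Scale intensities so that they sum to 1 (total-intensity scaling)."""
    total = sum(values)
    return [v / total for v in values] if total > 0 else list(values)


def preprocess(mz, intensity, width=1.0, half_window=2):
    """Chain the steps in the order given in the abstract."""
    binned = bin_spectrum(mz, intensity, width)
    filtered = smooth(binned, half_window)
    corrected = subtract_baseline(filtered, half_window)
    return normalize(corrected)


if __name__ == "__main__":
    profile = preprocess([1000.0, 1000.4, 1001.2, 1003.9], [5, 5, 20, 2])
    print(profile)
```

Keeping each step as a separate function makes it easy to swap in a different filter or baseline estimator and to check the effect of each step on repeatability.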

This is joint work with Mirre de Noo, André Deelder and Rob Tollenaar (all LUMC).

Computational tools for standardized analysis of MS/MS data
Andrew Keller, Institute for Systems Biology, USA

High-throughput proteomics studies seek to infer peptide and protein identifications based upon the analysis of thousands of collected MS/MS spectra. Despite the growing popularity of such methods, there is as yet no accepted standardized way of analyzing data and deriving conclusions, which makes it difficult for researchers to compare and share data with one another. The problem is compounded by the variety of mass spectrometer types used to generate MS/MS spectra, and of search engines used to assign peptides to such spectra.

Our group has sought to create a free open source analysis pipeline for the identification and quantification of peptides and proteins based upon MS/MS spectra, with the goal of facilitating interpretation and comparison of results. Two components of the pipeline are of particular importance for a standardized analysis: PeptideProphet validates search results by computing accurate probabilities that each result in a dataset is correct based upon search scores and peptide properties; ProteinProphet groups together peptides according to their corresponding protein(s) in the database and combines evidence together to compute accurate probabilities that each protein is present in the original sample.

In order to enable uniform analysis of data generated by various mass spectrometers, each with its own proprietary raw data format, and assigned peptides using various search engines such as SEQUEST and Mascot, we have designed three standard XML data formats upon which our pipeline is based. The first, mzXML, is for raw mass spectral data to which output from any type of instrument can be converted. The second, pepXML, is for search result data to which output from any search engine can be transformed. The third, protXML, is for protein identifications based upon peptides assigned to MS/MS spectra. Once data is converted to these standard formats, subsequent analyses, such as quantitation at the peptide and protein levels, can be performed consistently. Only analyses dealing directly with search result scores, such as validation by PeptideProphet, must be adapted for each search engine.

A standard analysis pipeline facilitates exchange and interpretation of MS/MS data. XML file formats make it easy to share data with others for viewing and analysis. Probabilities computed by PeptideProphet and ProteinProphet are accurate measures of the confidence of identifications, and thus enable the sensitivity and false positive error rate of datasets to be predicted. These can serve as objective criteria by which results from different researchers are compared.
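To illustrate why accurate probabilities allow error rates and sensitivity to be predicted, here is a minimal sketch. It assumes only that each identification carries a probability of being correct. The function names are hypothetical, and the independence-based combination in `combine_independent` is a simplification for illustration, not a description of how ProteinProphet itself combines evidence.

```python
def predicted_error_and_sensitivity(probabilities, threshold):
    """Predict the false positive rate and sensitivity of a probability cutoff.

    If the probabilities are accurate, the expected number of correct results
    in any subset is the sum of their probabilities.
    """
    accepted = [p for p in probabilities if p >= threshold]
    expected_correct_total = sum(probabilities)
    expected_correct_accepted = sum(accepted)
    if accepted:
        error_rate = sum(1.0 - p for p in accepted) / len(accepted)
    else:
        error_rate = 0.0
    if expected_correct_total > 0:
        sensitivity = expected_correct_accepted / expected_correct_total
    else:
        sensitivity = 0.0
    return error_rate, sensitivity


def combine_independent(peptide_probabilities):
    """Probability that at least one peptide assignment is correct,
    assuming (for illustration only) that the assignments are independent."""
    all_wrong = 1.0
    for p in peptide_probabilities:
        all_wrong *= 1.0 - p
    return 1.0 - all_wrong


if __name__ == "__main__":
    probs = [0.9, 0.8, 0.5, 0.1]
    print(predicted_error_and_sensitivity(probs, threshold=0.5))
    print(combine_independent([0.5, 0.5]))
```

Sweeping `threshold` over a dataset gives a predicted trade-off between error rate and sensitivity, which is the kind of objective criterion the abstract refers to.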

This work was supported in whole or in part with federal funds from the National Heart, Lung, and Blood Institute, NIH, under contract HV-28179.

Two-dimensional probability model for peptide matching using tandem mass spectra and protein databases
Rovshan Sadygov, Thermo Electron Corporation

Tandem mass spectrometry followed by database search is a powerful tool in proteomics. Proteins in complex mixtures are identified from tandem mass spectra of their peptides and amino acid sequence databases. Peptide identification is often the first step in studies of protein-protein interactions, protein localization, protein quantification and relative expression. Therefore, the accuracy of spectrum assignment is very important. The focus of this presentation is on a two-dimensional probability model for peptide identification. First, separate probability models are developed for two of the parameters that most affect the quality of peptide identification: the shared peaks count and the sum of the product ion abundances. The model for the shared peaks count is derived from observations of product ion matches to fragment ions from protein databases. The intensity-based model uses sum statistics to derive the probability of protein assignment based on the product ion abundances. The probabilities are translated into canonical coordinates to derive a single significance value for a peptide match. The talk will compare the approach with other database search algorithms.

Proteomics, why, how and when
Peter Roepstorff, University of Southern Denmark

Advances in DNA sequencing and the rapidly increasing amount of genome sequence data becoming available have changed the scope of protein analysis. Databases now provide the sequences of more than 500,000 proteins, most of which are based on genome sequencing, and this number is rapidly increasing. However, the information content of genome sequences is not sufficient to understand the living organism, because most proteins are processed or otherwise modified after translation. Therefore, studies are needed at the protein level, and the next level after genomics is proteomics, defined as the analysis of the complete protein complement expressed by a genome or by a cell or tissue type (Wilkins M.R. et al. (1996) Bio/Technology, 14, 61-65).

Mass spectrometry (MS) is one of the most sensitive analytical techniques that can generate structural information on proteins, and the type of information generated by mass spectrometric analysis is ideal for querying sequence databases. Therefore, mass spectrometry has become a key analytical tool in proteomics. Two strategies dominate in proteomics: one based on separation of the proteins by 2D-PAGE prior to protein identification by mass spectrometry, the other based on proteolytic digestion of all the proteins in a sample followed by separation and sequencing of the resulting peptides by multidimensional LC-MS. A number of intermediate strategies are also used. The different strategies will be described and their strengths and limitations evaluated for use at different levels in proteomics, which include protein identification and assignment of post-translational modifications. A number of recently developed concepts that allow modification-specific proteome analysis will also be described. Finally, examples of disease-related studies that have taken advantage of proteomics will be mentioned.

Applying probability based protein identification to large data sets
David Creasy, Matrix Science, UK

In probability-based scoring, we compute the probability that the observed match between the experimental data and mass values calculated from a candidate protein or peptide sequence is a random event. The "correct" match, which is not a random event, has a very low probability. The strengths and weaknesses of this technique will be discussed and comparisons with BLAST searches will be made. Generic methods for testing algorithms will be discussed and compared.

In particular, we have investigated the use of this approach with large data sets consisting of tens of thousands of spectra. The number of false positive peptide matches will be shown to be within the expected values. Techniques for interpreting and comparing the results of these searches will be discussed, along with potential pitfalls.
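The idea of scoring a match by the probability that it is a random event can be sketched as follows. The binomial model of chance peak matches, the parameter names and the example numbers are assumptions made for illustration, since the abstract does not specify the probability model. The conversion to a score as -10 log10(P) is the convention commonly used to report such probabilities.

```python
import math


def random_match_probability(n_peaks, n_matched, p_chance):
    """Probability of matching at least n_matched of n_peaks purely by chance,
    assuming each peak matches independently with probability p_chance."""
    return sum(
        math.comb(n_peaks, k) * p_chance ** k * (1.0 - p_chance) ** (n_peaks - k)
        for k in range(n_matched, n_peaks + 1)
    )


def probability_to_score(probability):
    """Report a small probability on a convenient scale: -10 * log10(P)."""
    return -10.0 * math.log10(probability)


def expected_false_positives(n_spectra, significance):
    """Number of random matches expected to pass a significance threshold
    when every spectrum is searched once."""
    return n_spectra * significance


if __name__ == "__main__":
    p = random_match_probability(n_peaks=20, n_matched=12, p_chance=0.1)
    print(p, probability_to_score(p))
    print(expected_false_positives(n_spectra=10000, significance=0.05))
```

With tens of thousands of spectra, even a strict per-spectrum threshold yields a predictable number of chance matches, which is why the observed false positive count can be compared against an expected value.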

Algorithms and score functions used in PEAKS de novo sequencing software
Bin Ma, University of Western Ontario

De novo sequencing from MS/MS data is the best way to identify the peptides of novel proteins. Because of the importance of de novo sequencing, many software programs, free or commercial, are available. Recently, the PEAKS software has drawn much attention. It uses novel algorithms to compute the peptide candidates and sophisticated scoring functions to evaluate them. The software can also deal with variable post-translational modifications, and has some useful features such as positional confidence scores for individual amino acids of the computed peptides. In this talk, the basic design of the algorithms and score functions of PEAKS will be introduced. Some recent developments of the software will also be discussed.

Mining motifs from protein interaction data
See-Kiong Ng, Institute for Infocomm Research, Singapore

Discovering short conserved amino acid sequence patterns, or motifs, associated with protein functions or interactions is useful for guiding biological studies for the discovery of new drugs. However, finding biologically significant protein motifs remains a challenging task. Current methods typically require manual grouping of the protein sequences as a pre-processing step, and the quality of the motifs discovered depends greatly on how adequately the provided protein sequences are clustered. With the advent of high-throughput protein interaction detection methods, genome-wide protein interactions are now available for analysis. In this work, we demonstrate how the inherent functional associations between interacting proteins can be exploited for clustering protein sequences to automatically discover novel biologically significant motifs.

Binding motif pairs from interacting protein groups
Limsoon Wong, Institute for Infocomm Research & National University of Singapore

Protein-protein interaction is intrinsic to most functional processes in the cell, and the binding sites are essential to the understanding of protein-protein interactions. In our research, a binding site is modeled as a binding motif pair to emphasize the correlation between a pair of binding motifs. Inspired by the fact that a protein can interact with many proteins, we propose the concept of interacting protein groups to discover binding motif pairs. A pair of interacting protein groups consists of two protein groups such that every protein from one group interacts with all proteins in the other group, indicating a kind of "full interaction" between the two protein sets. As an interacting protein group may share a common binding motif, we can obtain binding motif pairs by examining pairs of interacting protein groups. Identifying pairs of interacting protein groups is a challenging problem given a large collection of protein interactions. Through a careful problem transformation, the problem is solved efficiently using algorithms for mining frequent patterns, a problem extensively studied in data mining. The motif (or motifs) of each interacting protein group is then derived by applying traditional motif discovery algorithms to the sequence data of the protein group. We found 16372 binding motif pairs from a yeast protein interaction dataset, represented in the form of blocks. Comparing the motifs in the pairs with the BLOCKS and PRINTS databases, we found that each block could be mapped to an average of 4.0 correlated blocks in these two databases. The mapped blocks occur in 2472 out of a total of 6794 protein groups in these two databases. Comparing the 16372 motif pairs with a putative domain-domain interaction database (Interdom), we found 1508 matches, of which 320 pairs can be mapped to high-confidence domain-domain interactions, and 194 of the 320 pairs can be mapped to interactions confirmed by complex data.
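The definition of a pair of interacting protein groups can be made concrete with a small sketch. It treats each protein's set of interaction partners like a transaction in frequent pattern mining: the common partners of a group form the opposite group. The brute-force enumeration below is exponential and is only meant to illustrate the definition on toy data; it is not the efficient mining algorithm referred to in the abstract, and the function name and `min_size` parameter are hypothetical.

```python
from itertools import combinations


def interacting_group_pairs(interactions, min_size=2):
    """Return pairs of protein groups in which every protein of one group
    interacts with every protein of the other (toy brute-force search)."""
    neighbours = {}
    for a, b in interactions:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    found = set()
    proteins = sorted(neighbours)
    for size in range(min_size, len(proteins) + 1):
        for group in combinations(proteins, size):
            # Proteins that interact with every member of the candidate group.
            common = set.intersection(*(neighbours[p] for p in group))
            if len(common) < min_size:
                continue
            # Enlarge the candidate group to all proteins that interact with
            # every member of `common`, so that the pair is maximal.
            closed = set.intersection(*(neighbours[p] for p in common))
            found.add(frozenset([frozenset(closed), frozenset(common)]))
    return found


if __name__ == "__main__":
    toy = [("A", "X"), ("A", "Y"), ("B", "X"), ("B", "Y"), ("C", "X")]
    for pair in interacting_group_pairs(toy):
        print([sorted(group) for group in pair])
```

On the toy data the only pair with at least two proteins on each side is {A, B} and {X, Y}; protein C is excluded because it does not interact with Y.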

Atomistic computer simulations: an essential toolkit in modern proteomics
Chandra Verma, Bioinformatics Institute, Singapore

Computer simulations of biomolecules and in particular proteins have made tremendous advances in contributing towards understanding biology at the molecular level. They now regularly complement experiments and often provide unique details of biomolecular processes at an unprecedented level. Indeed they have often served to shed new light on existing paradigms. Examples will be given of the use of some of these methods to construct comprehensive models that relate biomolecular dynamics to the important processes which underpin biological function.

Serum biomarker discovery of liver diseases using SELDI-TOF and bioinformatics
Eastwood Leung, Genome Institute of Singapore

A reliable serum biomarker for early detection of cirrhosis and hepatocellular carcinoma (HCC) is still lacking. We used surface-enhanced laser desorption and ionization (SELDI) technology together with machine learning algorithms to establish a pipeline for the discovery of novel low molecular weight serum biomarkers. In a rat cirrhosis model, serum proteins were allowed to bind to a weak cation exchanger surface. Protein profiles were generated by washing off non-specifically bound proteins from the surface, followed by laser desorption and ionization. A support vector machine algorithm was used to analyze the protein profiles generated. A panel of selected signature markers resulted in higher than 90% specificity and sensitivity in classification of test samples. The significant marker was purified and identified using on-chip digestion and sequencing on a MALDI-TOF/TOF platform. A copper(II) (Cu²⁺) ion surface was used for serum biomarker discovery in human HCC. A panel of selected protein peaks was differentially expressed specifically in HCC, but not in normal, cirrhosis, colon carcinoma, or nasopharyngeal carcinoma samples. Again, the specificity and sensitivity of class prediction of test samples were higher than 89%. Several selected protein peaks were purified and identified using off-line column chromatography, SDS-PAGE, and peptide sequencing by tandem mass spectrometry. These markers were further validated by Western blotting analysis. The data preprocessing procedures and a comparison of the performance of different machine learning algorithms will also be discussed.
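For reference, the sensitivity and specificity quoted above are computed from test-set predictions as in the following sketch. The labels and example data are invented for illustration and are not results from the study.

```python
def sensitivity_specificity(true_labels, predicted_labels, positive="HCC"):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = fn = tn = fp = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == positive:
            if predicted == positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    # Hypothetical test-set labels, for illustration only.
    truth = ["HCC", "HCC", "HCC", "other", "other"]
    predicted = ["HCC", "HCC", "other", "other", "HCC"]
    print(sensitivity_specificity(truth, predicted))
```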

Predicting functional family of novel enzymes irrespective of sequence similarity: a statistical learning approach
Yu Zhong Chen, National University of Singapore

For proteins having no sequence homolog of known function, their function is difficult to assign on the basis of sequence similarity. The same problem arises for homologous proteins of different functions. It is desirable to explore methods that are not based on sequence similarity. One approach is to assign functional family. A statistical learning method, support vector machines (SVM), has been used by several groups for predicting protein functional family irrespective of sequence similarity. These studies showed that SVM prediction accuracy is at a useful level, particularly for distantly related proteins and homologous proteins of different functions. Here SVM is tested for functional family assignment of two groups of enzymes. One consists of 41 enzymes without a homolog of known function from PSI-BLAST search of protein databases. The other contains 20 pairs of homologous enzymes of different families. SVM correctly assigns 78% of the enzymes in group 1 and 70% of the enzyme pairs in group 2, suggesting that it is potentially useful for facilitating functional study of novel proteins.

Systematic proteome analysis of breast cancer cell lines
Keli Ou, Agenica Research Pte Ltd, Singapore

Breast cancer is the commonest cancer in Asian women. In Singapore, about three women are diagnosed with breast cancer each day, and this number is increasing significantly, at an average of 3% annually. In this study, we aim to identify potential breast tumor biomarkers by comparing the protein expression profiles of different breast cell lines. A large-scale, high-throughput proteomic platform was employed for the project, which included two-dimensional gel electrophoresis and MALDI-TOF MS for protein characterization. Proteomic analysis of the 3 cell lines indicated significant differences between the normal (CRL) and breast cancer cell lines (MCF-7, HCC 38). An integrated proteomic and genomic approach showed that the expression levels of the majority of proteins were consistent with the corresponding gene expression levels derived from RNA microarrays. However, some inconsistent results were also found. The bioinformatics challenges in the study will be discussed.

Proteomic investigation of colorectal cancer
Qingsong Lin, National University of Singapore

Colorectal cancer (CRC) is the second leading killer cancer worldwide and has become the most common cancer in Singapore. CRC is among the best characterized cancers with regard to genetic progression. Changes in gene expression profiles have also been widely investigated at the mRNA level. However, changes at the protein level are less well studied. Recent developments in proteomic technologies allow us to examine the global expression profile of proteins in action, and have been widely applied to the study of disease processes. The present study aims to detect changes in protein profiles that could be associated with the process of colon tumorigenesis, in order to discover biomarkers for diagnosis and potential therapeutic targets. Two-Dimensional Gel Electrophoresis (2-D GE) and Isotope-coded Affinity Tag (ICAT) were applied to detect protein profile differences between cancerous and adjacent normal tissues. Issues regarding proteomic data analysis, protein quantitation and bioinformatics data mining will be discussed.

BIND: the Biomolecular Interaction Network Database
Susan Moore, Blueprint Asia, Singapore

Recent estimates suggest that there are hundreds of thousands of published, experimentally demonstrated biomolecular interactions. Rapid retrieval and meaningful large-scale analysis of these data depend on cataloguing these interactions in a computer-readable format. To this end, the Blueprint Initiative has embarked on a project to create and maintain BIND (the Biomolecular Interaction Network Database; www.bind.ca), a freely available resource. A long-term goal for Blueprint is to use BIND and associated tools to permit simulation of living cells. This talk will focus on the current status of the BIND project and discuss its possible use in facilitating research.

DTSeq: decision tree based De Novo peptide sequencing
Wing-Kin Sung, National University of Singapore

De novo peptide sequencing is an important and well-studied problem which requires further improvement. One major hurdle to further improvement is how to model the intensities of the peaks based on the chemical and physical properties of the peptide. In this talk, we model the intensity with the help of a probabilistic decision tree. Together with a PEAKS-like dynamic programming algorithm, a new algorithm, DTSeq, is proposed to perform de novo peptide sequencing. Experimental results show that DTSeq has better accuracy than some of the best-known de novo peptide sequencing software.
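The general flavour of dynamic programming for de novo sequencing can be sketched on a simplified problem. This is not the DTSeq or PEAKS algorithm: it assumes the spectrum has already been reduced to candidate prefix masses, each with a score (where an intensity model such as a decision tree would plug in), and it uses only a small subset of the standard monoisotopic residue masses. All names are hypothetical.

```python
# Monoisotopic residue masses (Da) for a small subset of amino acids.
RESIDUE_MASS = {
    "G": 57.02146,
    "A": 71.03711,
    "S": 87.03203,
    "V": 99.06841,
    "L": 113.08406,
}


def best_sequence(prefix_peaks, tolerance=0.01):
    """Find the highest-scoring residue chain from the lowest to the highest
    prefix mass.

    prefix_peaks: list of (mass, score) pairs. The list must contain the
    start (mass 0.0) and the end (total residue mass of the peptide).
    Returns (total_score, sequence); the score is -inf if no chain exists.
    """
    peaks = sorted(prefix_peaks)
    unreachable = float("-inf")
    best = [(unreachable, "")] * len(peaks)
    best[0] = (peaks[0][1], "")
    for j in range(1, len(peaks)):
        for i in range(j):
            if best[i][0] == unreachable:
                continue
            gap = peaks[j][0] - peaks[i][0]
            for residue, mass in RESIDUE_MASS.items():
                if abs(gap - mass) <= tolerance:
                    candidate = (best[i][0] + peaks[j][1], best[i][1] + residue)
                    if candidate[0] > best[j][0]:
                        best[j] = candidate
    return best[-1]


if __name__ == "__main__":
    # Prefix masses of the peptide G-A-S plus one spurious low-scoring peak.
    peaks = [(0.0, 0.0), (57.02146, 1.0), (99.06841, 0.1),
             (128.05857, 1.0), (215.0906, 1.0)]
    print(best_sequence(peaks))
```

In this example the spurious peak at 99.06841 cannot be extended to the final mass, so the best chain is "GAS" with a total score of 3.0. A realistic implementation would also have to handle both ion series, missing peaks and the full residue table.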

Proteome analysis of separated male and female gametocytes reveals novel sex specific Plasmodium biology
Shahid Khan, Leiden University Medical Centre, Netherlands

Gametocytes, the precursor cells of malaria parasite gametes, circulate in the blood and are responsible for transmission from host to mosquito vector. The individual proteomes of male and female gametocytes were analyzed using mass spectrometry, following separation by flow sorting of transgenic parasites expressing Green Fluorescent Protein in a sex-specific manner. Promoter tagging in transgenic parasites confirmed the designation of gametocyte specificity of the proteins. The male proteome contained 36% (236 of 650) male-specific proteins and the female proteome 19% (101 of 541) female-specific proteins, but they share only 69 proteins, emphasizing the diverged features of the sexes. Of all the malaria life cycle stages analyzed, the male gametocyte has the most distinct proteome, containing many proteins involved in flagellar-based motility and rapid genome replication. By identification of gender-specific protein kinases and phosphatases, and using targeted gene disruption of two kinases, new sex-specific regulatory pathways were defined.

SPLASH: Systematic Proteomics Laboratory Analysis and Storage Hub
Siaw Ling Lo, National University of Singapore

Proteomics is a rapidly expanding field generating a tremendously large amount of data annually. The increasing difficulty of unifying the data format, due to the use of different platforms, equipment and laboratory documentation systems, greatly hinders experimental data verification, exchange and comparison. To address this issue, it is essential to establish standard formats for every aspect of proteomics. One recently published data model is the Proteomics Experiment Data Repository (PEDRo) [1]. Based on this model, with some customizations, the SPLASH database system has been developed to provide proteomics researchers with a common platform to store, manage, search and analyze their data. Here we report the implementation of SPLASH, covering all three modules: data maintenance, data search and data mining. The data maintenance module consists of experimental data entry and update, and uploading of experiment results in batch mode (such as gel image annotation and mass spectrometry results). The data search module provides a means to search the database and allows viewing of protein details or differential expression display by clicking on a 2D GE image. The data mining module offers tools to help researchers make biological sense of the high-throughput data, including Gene Ontology (GO) analyses, KEGG biochemical pathway analyses, and statistical analyses for sample sets. These features make SPLASH a practical and powerful tool for the proteomics research community.

[1] Chris F Taylor, Norman W Paton, Kevin L Garwood, Paul D Kirby, David A Stead, Zhikang Yin, Eric W Deutsch, Laura Selway, Janet Walker, Isabel Riba-Garcia, Shabaz Mohammed, Michael J Deery, Julie A Howard, Tom Dunkley, Ruedi Aebersold, Douglas B Kell, Kathryn S Lilley, Peter Roepstorff, John R Yates III, Andy Brass, Alistair J P Brown, Phil Cash, Simon J Gaskell, Simon J Hubbard, and Stephen G Oliver (2003). A systematic approach to modeling, capturing, and disseminating proteomics experimental data. Nature Biotechnology, March 2003, 247-254.