Workshop on Data Analysis and Data Mining in Proteomics
(9 - 12 May 2005)

~ Abstracts ~

Statistical quality assurance in mass spectrometry for proteomics
Paul Eilers, Leiden University Medical Center, Netherlands
 Like many other high-throughput techniques, mass 
					spectrometry has been adopted by biologists and medical 
					doctors for classification of tumours. Strong claims, in 
					high-profile journals, have been made about its potential. 
However, papers have also been published that challenge the
technical quality of published reports.

At the Leiden University Medical Center we also expect a
					lot from MS, but it was felt that quality assurance should 
					take a central place and come first. Good laboratory 
					practice and frequent calibration were used to guarantee 
					instrumental stability. To quantify biological variability 
					of blood serum samples from a baseline group of healthy 
					persons, several experimental conditions were varied 
					systematically: 1) time-of-day when sampling, 2) storage 
					temperature, 3) freeze-thaw cycles, 4) repeating 
measurements on different days.

Pre-processing of the samples involved binning, noise
					filtering, baseline correction and normalization. Initial 
					analysis involved “warping” to quantify drift in the mass 
					scale and principal components analysis to quantify 
					repeatability. In later stages ANOVA was applied to quantify 
					components of variability. I will report on methods, 
implementation and results.

This is joint work with Mirre de Noo, André Deelder and
					Rob Tollenaar (all LUMC). 
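
The pre-processing steps named in this abstract (binning, baseline correction, normalization) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the LUMC implementation: the bin width, the moving-minimum baseline estimate, and the total-intensity normalization are all choices made here for simplicity, and noise filtering and warping are left out.

```python
def bin_spectrum(mz, intensity, bin_width):
    """Sum intensities into fixed-width m/z bins (bin width is an assumed parameter)."""
    start = min(mz)
    n_bins = int((max(mz) - start) / bin_width) + 1
    binned = [0.0] * n_bins
    for m, i in zip(mz, intensity):
        binned[int((m - start) / bin_width)] += i
    return binned


def subtract_baseline(values, half_window):
    """Subtract a moving-minimum baseline estimate (one simple choice among many)."""
    out = []
    for k in range(len(values)):
        lo = max(0, k - half_window)
        hi = min(len(values), k + half_window + 1)
        out.append(values[k] - min(values[lo:hi]))
    return out


def normalize_total(values):
    """Scale the profile so that its intensities sum to 1."""
    total = sum(values)
    return [v / total for v in values] if total > 0 else list(values)


def preprocess(mz, intensity, bin_width=1.0, half_window=2):
    """Chain the three steps: binning, baseline correction, normalization."""
    binned = bin_spectrum(mz, intensity, bin_width)
    corrected = subtract_baseline(binned, half_window)
    return normalize_total(corrected)


# Toy spectrum (hypothetical): a constant offset of 5 with one peak at m/z 1003.
mz = [1000.0 + 0.5 * k for k in range(12)]
intensity = [5.0] * 12
intensity[6] = 25.0
profile = preprocess(mz, intensity)
```

In this toy case the constant offset is removed entirely and the single peak carries all of the normalized intensity.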

Computational tools for standardized analysis of MS/MS data
Andrew Keller, Institute for Systems Biology, USA
 High throughput proteomics studies seek to infer peptide 
					and protein identifications based upon the analysis of 
					thousands of collected MS/MS spectra. Despite the growing 
					popularity of such methods, there is as yet no accepted 
					standardized way of analyzing data and deriving conclusions, 
					thus making it difficult for researchers to compare and 
share data with one another. The problem is compounded by the
variety of mass spectrometer types used to
generate MS/MS spectra, and of search engines used to assign
peptides to such spectra.

Our group has sought to create a free, open-source
					analysis pipeline for the identification and quantification 
					of peptides and proteins based upon MS/MS spectra, with the 
					goal of facilitating interpretation and comparison of 
					results. Two components of the pipeline are of particular 
					importance for a standardized analysis: PeptideProphet 
					validates search results by computing accurate probabilities 
					that each result in a dataset is correct based upon search 
					scores and peptide properties; ProteinProphet groups 
					together peptides according to their corresponding protein(s) 
					in the database and combines evidence together to compute 
					accurate probabilities that each protein is present in the 
original sample.

To enable uniform analysis of data generated by
various mass spectrometers, each with its own proprietary
raw data format, and searched with various
engines such as SEQUEST and Mascot, we have designed three
standard XML data formats upon which our pipeline is based.
					The first, mzXML, is for raw mass spectral data to which 
					output from any type of instrument can be converted. The 
					second, pepXML, is for search result data to which output 
					from any search engine can be transformed. The third, 
					protXML, is for protein identifications based upon peptides 
					assigned to MS/MS spectra. Once data is converted to these 
					standard formats, subsequent analyses, such as quantitation 
					at the peptide and protein levels, can be performed 
					consistently. Only analyses dealing directly with search 
					result scores, such as validation by PeptideProphet, must be 
adapted for each search engine.

A standard analysis pipeline facilitates exchange and
					interpretation of MS/MS data. XML file formats make it easy 
					to share data with others for viewing and analysis. 
					Probabilities computed by PeptideProphet and ProteinProphet 
					are accurate measures of the confidence of identifications, 
					and thus enable the sensitivity and false positive error 
					rate of datasets to be predicted. These can serve as 
					objective criteria by which results from different 
researchers are compared.

This work was supported in whole or in part with federal
					funds from the National Heart, Lung, and Blood Institute, 
					NIH, under contract HV-28179.  
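
If each identification carries an accurate probability of being correct, the error rate and sensitivity at a given cutoff can be predicted by summing those probabilities. The sketch below illustrates that arithmetic only; the function name and the example probabilities are hypothetical and are not part of the pipeline's software.

```python
def predicted_rates(probabilities, min_probability):
    """Predict error rate and sensitivity for results accepted at a probability cutoff.

    Assumes each probability is an accurate estimate that the result is correct.
    """
    accepted = [p for p in probabilities if p >= min_probability]
    expected_correct_total = sum(probabilities)
    expected_correct_accepted = sum(accepted)
    # Each accepted result contributes (1 - p) expected wrong identifications.
    expected_wrong_accepted = len(accepted) - expected_correct_accepted
    error_rate = expected_wrong_accepted / len(accepted) if accepted else 0.0
    sensitivity = (expected_correct_accepted / expected_correct_total
                   if expected_correct_total > 0 else 0.0)
    return error_rate, sensitivity


# Hypothetical probabilities for six search results (not real data).
probs = [0.99, 0.95, 0.90, 0.60, 0.20, 0.05]
error_rate, sensitivity = predicted_rates(probs, min_probability=0.9)
print(f"predicted error rate: {error_rate:.3f}, sensitivity: {sensitivity:.3f}")
```

Raising the cutoff lowers the predicted error rate at the cost of sensitivity, which is the trade-off that makes these two numbers usable as objective comparison criteria.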

Two-dimensional probability model for peptide matching using tandem mass spectra and protein databases
Rovshan Sadygov, Thermo Electron Corporation
 Tandem mass spectrometry followed by database search is a 
					powerful tool of proteomics. Proteins of complex mixtures 
					are identified from tandem mass spectra of their peptides 
and amino acid sequence databases. Peptide identification is
often the first step in studies of protein-protein
interactions, protein localization, and protein
quantification and relative expression. The
accuracy of the spectrum assignment is therefore very important. The
focus of this presentation is on a two-dimensional
probability model for peptide identification. First,
separate probability models are developed for two of the
parameters that most affect the quality of peptide identification:
the shared peaks count and the sum of the
product ion abundances. The model for the shared peaks count
is derived from observations of product ion matches
to fragment ions from protein databases. The intensity-based
model uses sum statistics to derive the probability of protein
assignment based on the product ion abundances. The
probabilities are translated into canonical coordinates to
derive a single significance value for a peptide match. The
talk will compare the approach with other
database search algorithms.

Proteomics, why, how and when
Peter Roepstorff, University of Southern Denmark
 The advances in DNA-sequencing and rapidly increasing 
					amount of genome sequence data becoming available have 
					changed the scope of protein analysis. Databases now provide 
					the sequence of more than 500,000 proteins, most of which 
					are based on genome sequencing, and this number is rapidly 
					increasing. However, the information content in genome 
					sequences is not sufficient to understand the living 
					organism because most proteins are processed or otherwise 
modified after translation. Studies are therefore needed at
the protein level. The next level after genomics is
proteomics, defined as the analysis of the complete protein
complement expressed by a genome or by a cell or tissue type
(Wilkins M.R. et al. (1996) Bio/Technology, 14,
61-65).

Mass spectrometry (MS) is one of the most sensitive
analytical techniques that can generate structural
information on proteins, and the type of information
generated by mass spectrometric analysis is ideal for
querying sequence databases. Mass spectrometry has therefore
become a key analytical tool in proteomics. Two strategies
dominate in proteomics: one based on separation of the
proteins by 2D-PAGE prior to protein identification by mass
					spectrometry, the other based on proteolytic digestion of 
					all the proteins in a sample followed by separation and 
					sequencing of the resulting peptides by multidimensional 
					LC-MS. A number of intermediate strategies are also used. 
					The different strategies will be described and their 
					strengths and limitations evaluated for the use on different 
					levels in proteomics, which include protein identification 
					and assignment of post translational modifications. A number 
of recently developed concepts that allow modification-specific
proteome analysis will also be described. Finally,
examples of disease-related studies that have taken
advantage of proteomics will be mentioned.

Applying probability based protein identification to large data sets
David Creasy, Matrix Science, UK
 In probability based scoring, we compute the probability 
					that the observed match between the experimental data and 
					mass values calculated from a candidate protein or peptide 
					sequence is a random event. The "correct" match, which is 
					not a random event, has a very low probability. The 
					strengths and weaknesses of this technique will be discussed 
and comparisons with BLAST searches will be made. Generic
methods for testing algorithms will be discussed and
compared.

In particular, we have investigated the use of this
					approach with large data sets consisting of tens of 
					thousands of spectra. The number of false positive peptide 
					matches will be shown to be within the expected values. 
					Techniques for interpreting and comparing the results of 
					these searches will be discussed, along with potential 
					pitfalls. 
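
A small sketch of the arithmetic behind probability-based scoring follows. The abstract does not give a formula, so two assumptions are made here: the probability P that a match is random is reported as a score of -10*log10(P), and the significance threshold is corrected for the number of candidate sequences by dividing the significance level by that number. Function names and example numbers are hypothetical.

```python
import math


def score_from_probability(p_random):
    """Convert the probability that a match is random into a -10*log10(P) score."""
    return -10.0 * math.log10(p_random)


def score_threshold(alpha, n_candidates):
    """Score needed so that a random match among n_candidates candidates is
    expected with frequency below alpha (a Bonferroni-style correction, assumed)."""
    return score_from_probability(alpha / n_candidates)


def expected_false_positives(n_spectra, alpha):
    """Rough upper estimate of random matches passing the threshold over many spectra."""
    return n_spectra * alpha


threshold = score_threshold(alpha=0.05, n_candidates=1000)
upper = expected_false_positives(n_spectra=20000, alpha=0.05)
```

With tens of thousands of spectra, even a 5% significance level allows a large absolute number of random matches, which is why the observed count of false positives needs to be checked against the expected value.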

Algorithms and score functions used in PEAKS de novo sequencing software
Bin Ma, University of Western Ontario
 De novo sequencing from MS/MS data is the best way to
identify the peptides of novel proteins.
Because of the importance of de novo sequencing, many
software programs, free or commercial, are available.
Recently, the PEAKS software has drawn much attention. It
uses novel algorithms to compute peptide candidates and sophisticated
scoring functions to evaluate them. The
software can also deal with variable
post-translational modifications, and offers features
such as positional confidence scores for individual amino
acids of the computed peptides. In this talk, the basic
design of the algorithms and score functions of PEAKS will
be introduced. Some recent developments of the software will
also be discussed.

Mining motifs from protein interaction data
See-Kiong Ng, Institute for Infocomm Research, Singapore
 Discovering short conserved amino acid sequence
patterns, or motifs, associated with protein functions or
interactions is useful for guiding biological studies for
the discovery of new drugs. However, finding biologically
significant protein motifs remains a challenging task.
Current methods typically require manual grouping of the
protein sequences for pre-processing, so the quality of the motifs
discovered depends greatly on how well
the provided protein sequences are clustered. With the advent of high
					throughput protein interaction detection methods, 
					genome-wide protein interactions are now available for 
					analysis. In this work, we demonstrate how the inherent 
					functional associations between interacting proteins can be 
					exploited for clustering protein sequences to automatically 
					discover novel biologically significant motifs. 

Binding motif pairs from interacting protein groups
Limsoon Wong, Institute for Infocomm Research & National University of Singapore
 Protein-protein interaction is intrinsic to most
functional processes in the cell, and the binding sites are
essential to the understanding of protein-protein
interactions. A binding site is modeled as a binding motif
pair in our research to emphasize the correlation between a
pair of binding motifs. Inspired by the fact that a protein
can interact with many proteins, we propose the concept of
interacting protein groups to discover binding motif pairs.
A pair of interacting protein groups consists of two protein
groups such that every protein in one group interacts with all
proteins in the other group, indicating a kind of
"full interaction" between the two protein sets. As an
					interacting protein group may share a common binding motif, 
					we can get binding motif pairs by examining pairs of 
					interacting protein groups. The identification of pairs of 
					interacting protein groups is a challenging problem given a 
					large collection of protein interactions. By a careful and 
					sophisticated problem transformation, the problem is 
					efficiently solved by using algorithms for mining frequent 
					patterns, a problem extensively studied in data mining. The 
					motif (or motifs) of each interacting protein group is then 
					derived by applying traditional motif discovery algorithms 
					on the sequence data of the protein group. We found 16372 
					binding motif pairs from a yeast protein interaction 
					dataset, represented in the form of blocks. Comparing the 
					motifs in the pairs with the BLOCKS and PRINTS databases, we 
found that each block could be mapped to an average of 4.0
correlated blocks in these two databases. The mapped blocks
occur in 2472 of the 6794 protein groups in these two
databases. Comparing the 16372 motif pairs with a putative
domain-domain interaction database (Interdom), we found
1508 matches, of which 320 pairs can be mapped to
high-confidence domain-domain interactions, and 194 of
the 320 pairs can be mapped to interactions confirmed by
complex data.
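
The "full interaction" definition in this abstract corresponds to a complete bipartite subgraph of the interaction network. The sketch below enumerates such pairs by brute force on a tiny, made-up network; it is not the authors' algorithm, which relies on frequent pattern mining for efficiency. The closure step (taking everything that interacts with all of one group) is analogous to computing a closed pattern when each protein's list of partners is treated as a transaction.

```python
from itertools import combinations


def common_neighbours(group, neighbours):
    """Proteins that interact with every protein in `group`."""
    sets = [neighbours[p] for p in group]
    return set.intersection(*sets) if sets else set()


def interacting_group_pairs(interactions, min_size=2):
    """Enumerate maximal pairs (A, B) where every protein in A interacts with
    every protein in B. Brute force over seed sets; fine only for tiny inputs."""
    neighbours = {}
    for a, b in interactions:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    found = set()
    proteins = sorted(neighbours)
    for size in range(min_size, len(proteins) + 1):
        for seed in combinations(proteins, size):
            group_b = common_neighbours(seed, neighbours)
            if len(group_b) < min_size:
                continue
            # Close the seed: everything that interacts with all of group_b.
            group_a = common_neighbours(group_b, neighbours)
            found.add(frozenset([frozenset(group_a), frozenset(group_b)]))
    return found


# Hypothetical interactions (protein names are placeholders, not real data).
edges = [("P1", "Q1"), ("P1", "Q2"), ("P2", "Q1"), ("P2", "Q2"),
         ("P3", "Q1")]
pairs = interacting_group_pairs(edges)
```

Here P3 interacts only with Q1, so it is excluded, and the only pair of groups with at least two proteins on each side is {P1, P2} with {Q1, Q2}. Motif discovery would then be run separately on the sequences of each group.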

Atomistic computer simulations: an essential toolkit in modern proteomics
Chandra Verma, Bioinformatics Institute, Singapore
 Computer simulations of biomolecules and in particular 
					proteins have made tremendous advances in contributing 
					towards understanding biology at the molecular level. They 
					now regularly complement experiments and often provide 
					unique details of biomolecular processes at an unprecedented 
					level. Indeed they have often served to shed new light on 
					existing paradigms. Examples will be given of the use of 
					some of these methods to construct comprehensive models that 
					relate biomolecular dynamics to the important processes 
					which underpin biological function. 

Serum biomarker discovery of liver diseases using SELDI-TOF and bioinformatics
Eastwood Leung, Genome Institute of Singapore
 Reliable serum biomarkers for early detection of cirrhosis
and hepatocellular carcinoma (HCC) are still lacking. We used
surface-enhanced laser desorption and ionization (SELDI)
technology together with machine learning algorithms to
establish a pipeline for discovery of novel low molecular
weight serum biomarkers. In a rat cirrhosis model, serum
proteins were allowed to bind to a weak cation exchanger
surface. Protein profiles were generated after washing off
non-specifically bound proteins from the surface, followed by
laser desorption and ionization. A support vector
machine algorithm was used to analyze the protein profiles
generated. A panel of selected signature markers resulted in
higher than 90% specificity and sensitivity in
classification of test samples. The significant marker was
purified and identified using on-chip digestion and
sequencing on a MALDI-TOF/TOF platform. A copper(II) (Cu2+)
ion surface was used in serum biomarker discovery for human
HCC. A panel of selected protein peaks was differentially
expressed specifically in HCC but not in normal, cirrhosis,
colon carcinoma, or nasopharyngeal carcinoma samples. Again, the
specificity and sensitivity of class prediction of test
samples were higher than 89%. Several selected protein peaks
					were purified and identified by using off-line column 
					chromatography, SDS-PAGE, and peptide sequencing using 
					tandem mass spectrometry. These markers were further 
					validated by Western blotting analysis. The preprocessing 
					procedures of data and comparison of performance of 
					different machine learning algorithms will also be 
					discussed.  
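
For reference, the sensitivity and specificity figures quoted in this abstract are computed from a classifier's predictions on held-out test samples. The sketch below shows that calculation only; the labels are invented for illustration and do not come from the study, and the function name is hypothetical.

```python
def sensitivity_specificity(true_labels, predicted_labels, positive):
    """Return (sensitivity, specificity), treating `positive` as the disease class."""
    tp = fn = tn = fp = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == positive:
            if predicted == positive:
                tp += 1  # diseased sample correctly flagged
            else:
                fn += 1  # diseased sample missed
        else:
            if predicted == positive:
                fp += 1  # non-diseased sample wrongly flagged
            else:
                tn += 1  # non-diseased sample correctly cleared
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Hypothetical test-set labels (not data from the study).
truth = ["HCC"] * 10 + ["control"] * 10
predicted = ["HCC"] * 9 + ["control"] + ["control"] * 9 + ["HCC"]
sens, spec = sensitivity_specificity(truth, predicted, positive="HCC")
```

Both values must be reported on samples that were not used to train the classifier, otherwise they overstate performance.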

Predicting functional family of novel enzymes irrespective of sequence similarity: a statistical learning approach
Yu Zhong Chen, National University of Singapore
 For proteins having no sequence homolog of known 
					function, their function is difficult to assign on the basis 
					of sequence similarity. The same problem arises for 
					homologous proteins of different functions. It is desirable 
					to explore methods that are not based on sequence 
					similarity. One approach is to assign functional family. A 
					statistical learning method, support vector machines (SVM), 
					has been used by several groups for predicting protein 
					functional family irrespective of sequence similarity. These 
					studies showed that SVM prediction accuracy is at a useful 
					level, particularly for distantly related proteins and 
					homologous proteins of different functions. Here SVM is 
					tested for functional family assignment of two groups of 
					enzymes. One consists of 41 enzymes without a homolog of 
					known function from PSI-BLAST search of protein databases. 
					The other contains 20 pairs of homologous enzymes of 
					different families. SVM correctly assigns 78% of the enzymes 
					in group 1 and 70% of the enzyme pairs in group 2, 
					suggesting that it is potentially useful for facilitating 
					functional study of novel proteins.  

Systematic proteome analysis of breast cancer cell lines
Keli Ou, Agenica Research Pte Ltd, Singapore
 Breast cancer is the most common cancer in Asian women. In
Singapore, about three women are diagnosed with breast
cancer each day, and this number is increasing significantly,
at an average of 3% annually. In this study, we aim to
identify potential breast tumor biomarkers by comparing
the protein expression profiles of different breast cell
lines. A large-scale and high-throughput proteomic platform
was employed for the project, which included two-dimensional
gel electrophoresis and MALDI-TOF MS for protein
characterization. Proteomic analysis of the 3 cell lines
indicated significant differences between the normal (CRL)
and breast cancer cell lines (MCF-7, HCC 38). An integrated
proteomic and genomic approach showed that the expression levels
of the majority of proteins were consistent with the
corresponding gene expression levels derived from RNA
microarrays. However, there were also some inconsistent
					results. The challenges of bioinformatics in the study will 
					be discussed. 

Proteomic investigation of colorectal cancer
Qingsong Lin, National University of Singapore
 Colorectal cancer (CRC) is the second leading cause of cancer
death worldwide and has become the most common cancer in
Singapore. CRC is among the best characterized cancers with
regard to genetic progression. Changes in gene expression
profiles have also been widely investigated at the mRNA
level. However, changes at the protein level are less well
studied. Recent developments in proteomic technologies allow
us to examine the global expression profile of proteins in
action, and have been widely applied to the study of
disease processes. The present study aims to detect changes
					of protein profiles that could be associated with the 
					process of colon tumorigenesis, in order to discover 
					biomarkers for diagnosis, and potential therapeutic targets. 
					Two-Dimensional Gel Electrophoresis (2-D GE) and 
					Isotope-coded Affinity Tag (ICAT) were applied to detect 
					protein profile differences between cancerous and adjacent 
					normal tissues. Issues regarding proteomic data analysis, 
					protein quantitation and bioinformatics data mining will be 
					discussed.  

BIND: the Biomolecular Interaction Network Database
Susan Moore, Blueprint Asia, Singapore
 Recent estimates suggest that there are hundreds of 
					thousands of published, experimentally demonstrated 
					biomolecular interactions. Rapid retrieval and meaningful 
					large-scale analysis of this data depends on cataloguing 
					these interactions in a computer-readable format. To this 
					end, the Blueprint Initiative has embarked on a project to 
					create and maintain BIND (the Biomolecular Interaction 
					Network Database; www.bind.ca), a freely available resource. 
					A long-term goal for Blueprint is to use BIND and associated 
					tools to permit simulation of living cells. This talk will 
					focus on the current status of the BIND project and discuss 
					its possible use in facilitating research. 

DTSeq: decision tree based De Novo peptide sequencing
Wing-Kin Sung, National University of Singapore
 De novo peptide sequencing is an important and
well-studied problem that requires further improvement. One
major hurdle to further improvement is how to model the
intensities of the peaks based on the chemical and
physical properties of the peptide. In this talk, we
model the intensity with the help of a probabilistic
decision tree. Together with a PEAK-like dynamic programming
algorithm, a new algorithm, DTSeq, is proposed to perform de
novo peptide sequencing. Experimental results show that
DTSeq has better accuracy than some of the best-known
de novo peptide sequencing software.

Proteome analysis of separated male and female gametocytes reveals novel sex specific Plasmodium biology
Shahid Khan, Leiden University Medical Centre, Netherlands
 Gametocytes, the precursor cells of malaria parasite 
					gametes, circulate in the blood and are responsible for 
					transmission from host to mosquito vector. The individual 
					proteomes of male and female gametocytes were analyzed using 
					mass spectrometry, following separation by flow sorting of 
					transgenic parasites expressing Green Fluorescent Protein, 
					in a sex-specific manner. Promoter tagging in transgenic 
					parasites confirmed the designation of gametocyte 
specificity of the proteins. The male proteome contained 36%
(236 of 650) male-specific proteins and the female proteome 19% (101
of 541) female-specific proteins, but they shared only 69
proteins, emphasizing the diverged features of the sexes. Of
					all the malaria life cycle stages analyzed, the male 
					gametocyte has the most distinct proteome containing many 
					proteins involved in flagellar-based motility and rapid 
genome replication. By identifying gender-specific
protein kinases and phosphatases and using targeted gene
disruption of two kinases, new sex-specific regulatory
pathways were defined.

SPLASH: Systematic Proteomics Laboratory Analysis and Storage Hub
Siaw Ling Lo, National University of Singapore
 Proteomics is a rapidly expanding field that generates
tremendously large amounts of data annually. The increasing
difficulty of unifying data formats, due to the use of
different platforms, equipment and laboratory documentation
systems, greatly hinders experimental data verification,
exchange and comparison. To address this issue, it is
essential to establish standard formats for every aspect of
proteomics. One of the recently published data models is the
Proteomics Experiment Data Repository (PEDRo) [1]. Based on
this model with some customizations, the SPLASH database system
has been developed to provide proteomics researchers with a
common platform to store, manage, search and analyze their
data. Here we report the implementation of SPLASH, covering
all three modules: data maintenance, data
search and data mining. Data maintenance consists of
experimental data entry and update, and uploading of
experiment results in batch mode (such as gel image
annotation and mass spectrometry results). The data search
module provides a means to search the database and allows
viewing of protein details or differential expression
display by clicking on a 2D GE image. The data mining module
offers tools to help researchers make biological sense of
the high-throughput data, including Gene Ontology (GO)
analyses, KEGG biochemical pathway analyses, and statistical
analyses for sample sets. These features make SPLASH a
practical and powerful tool for the proteomics
research community.

[1] Chris F Taylor, Norman W Paton, Kevin L Garwood, Paul
					D Kirby, David A Stead, Zhikang Yin, Eric W Deutsch, Laura 
					Selway, Janet Walker, Isabel Riba–Garcia, Shabaz Mohammed, 
					Michael J Deery, Julie A Howard, Tom Dunkley, Ruedi 
					Aebersold, Douglas B Kell, Kathryn S Lilley, Peter 
					Roepstorff, John R Yates III, Andy Brass, Alistair J P 
					Brown, Phil Cash, Simon J Gaskell, Simon J Hubbard, and 
Stephen G Oliver (2003). A systematic approach to modelling,
capturing and disseminating proteomics experimental data.
Nature Biotechnology, March 2003, 247-254