Workshop on
Data Analysis and Data Mining in Proteomics
(9 - 12 May 2005)
~ Abstracts ~
Statistical quality assurance in mass spectrometry for proteomics
Paul Eilers, Leiden University Medical Center,
Netherlands
Like many other high-throughput techniques, mass
spectrometry has been adopted by biologists and medical
doctors for classification of tumours. Strong claims, in
high-profile journals, have been made about its potential.
But papers have also been published that challenge the
technical quality of published reports.
At the Leiden University Medical Center we also expect a
lot from MS, but it was felt that quality assurance should
take a central place and come first. Good laboratory
practice and frequent calibration were used to guarantee
instrumental stability. To quantify biological variability
of blood serum samples from a baseline group of healthy
persons, several experimental conditions were varied
systematically: 1) time-of-day when sampling, 2) storage
temperature, 3) freeze-thaw cycles, 4) repeating
measurements on different days.
Pre-processing of the samples involved binning, noise
filtering, baseline correction and normalization. Initial
analysis involved “warping” to quantify drift in the mass
scale and principal components analysis to quantify
repeatability. In later stages ANOVA was applied to quantify
components of variability. I will report on methods,
implementation and results.
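Two of the pre-processing steps named above, binning and normalization, can be illustrated with a minimal sketch. The function names, the fixed bin width and the choice of total-intensity normalization are assumptions made for illustration; the abstract does not say how these steps were implemented.

```python
from typing import List, Sequence


def bin_spectrum(mz: Sequence[float], intensity: Sequence[float],
                 mz_min: float, mz_max: float, bin_width: float) -> List[float]:
    """Sum intensities into fixed-width m/z bins covering [mz_min, mz_max)."""
    n_bins = int((mz_max - mz_min) / bin_width)
    binned = [0.0] * n_bins
    for m, i in zip(mz, intensity):
        if mz_min <= m < mz_max:
            # Guard against a rounding error pushing the index past the end.
            idx = min(int((m - mz_min) / bin_width), n_bins - 1)
            binned[idx] += i
    return binned


def normalize_total(binned: Sequence[float]) -> List[float]:
    """Scale a binned spectrum so that its intensities sum to 1."""
    total = sum(binned)
    if total == 0:
        return list(binned)
    return [v / total for v in binned]


# Hypothetical toy spectrum: four peaks between m/z 1000 and 1004.
spectrum_mz = [1000.2, 1000.7, 1002.5, 1003.9]
spectrum_intensity = [1.0, 3.0, 4.0, 2.0]
binned = bin_spectrum(spectrum_mz, spectrum_intensity, 1000.0, 1004.0, 1.0)
normalized = normalize_total(binned)
```

Binning puts all spectra on a common m/z grid, which is what makes later steps such as principal components analysis across samples possible.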
This is joint work with Mirre de Noo, André Deelder and
Rob Tollenaar (all LUMC).
Computational tools for
standardized analysis of MS/MS data
Andrew Keller, Institute for Systems Biology, USA
High throughput proteomics studies seek to infer peptide
and protein identifications based upon the analysis of
thousands of collected MS/MS spectra. Despite the growing
popularity of such methods, there is as yet no accepted
standardized way of analyzing data and deriving conclusions,
thus making it difficult for researchers to compare and
share data with one another. This is compounded by the
variety of different mass spectrometer types used to
generate MS/MS spectra, and search engines used to assign
peptides to such spectra.
Our group has sought to create a free open source
analysis pipeline for the identification and quantification
of peptides and proteins based upon MS/MS spectra, with the
goal of facilitating interpretation and comparison of
results. Two components of the pipeline are of particular
importance for a standardized analysis: PeptideProphet
validates search results by computing accurate probabilities
that each result in a dataset is correct based upon search
scores and peptide properties; ProteinProphet groups
together peptides according to their corresponding protein(s)
in the database and combines evidence together to compute
accurate probabilities that each protein is present in the
original sample.
In order to enable uniform analysis of data generated by
various mass spectrometers, each with its own proprietary
raw data format, and assigned peptides using various search
engines such as SEQUEST and Mascot, we have designed three
standard XML data formats upon which our pipeline is based.
The first, mzXML, is for raw mass spectral data to which
output from any type of instrument can be converted. The
second, pepXML, is for search result data to which output
from any search engine can be transformed. The third,
protXML, is for protein identifications based upon peptides
assigned to MS/MS spectra. Once data is converted to these
standard formats, subsequent analyses, such as quantitation
at the peptide and protein levels, can be performed
consistently. Only analyses dealing directly with search
result scores, such as validation by PeptideProphet, must be
adapted for each search engine.
A standard analysis pipeline facilitates exchange and
interpretation of MS/MS data. XML file formats make it easy
to share data with others for viewing and analysis.
Probabilities computed by PeptideProphet and ProteinProphet
are accurate measures of the confidence of identifications,
and thus enable the sensitivity and false positive error
rate of datasets to be predicted. These can serve as
objective criteria by which results from different
researchers are compared.
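A minimal sketch of how accurate probabilities allow error rates to be predicted: if each probability is the chance that an identification is correct, the expected number of correct identifications in any subset is the sum of its probabilities. The function name and the example values are hypothetical, and this is not code from PeptideProphet or ProteinProphet.

```python
from typing import Sequence, Tuple


def predicted_error_rates(probabilities: Sequence[float],
                          threshold: float) -> Tuple[float, float]:
    """Return (sensitivity, false positive error rate) predicted for the
    identifications accepted at a minimum probability threshold.

    Assumes each probability is an accurate estimate that the
    corresponding identification is correct.
    """
    accepted = [p for p in probabilities if p >= threshold]
    total_correct = sum(probabilities)   # expected correct results in the dataset
    if not accepted:
        return 0.0, 0.0
    expected_correct = sum(accepted)     # expected correct results among accepted
    sensitivity = expected_correct / total_correct if total_correct > 0 else 0.0
    false_positive_rate = (len(accepted) - expected_correct) / len(accepted)
    return sensitivity, false_positive_rate


# Hypothetical probabilities for four search results.
example = [1.0, 0.5, 0.5, 0.0]
loose = predicted_error_rates(example, 0.5)
strict = predicted_error_rates(example, 0.9)
```

Raising the threshold lowers the predicted false positive error rate at the cost of sensitivity, so researchers can choose and report the trade-off explicitly.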
This work was supported in whole or in part with federal
funds from the National Heart, Lung, and Blood Institute,
NIH, under contract HV-28179.
Two-dimensional probability model
for peptide matching using tandem mass spectra and protein
databases
Rovshan Sadygov, Thermo Electron Corporation
Tandem mass spectrometry followed by database search is a
powerful tool of proteomics. Proteins of complex mixtures
are identified from tandem mass spectra of their peptides
and amino acid sequence databases. Peptide identification is
often the first step in studies of protein-protein
interactions, protein localization, protein
quantification and relative expression. Therefore, the
accuracy of spectrum assignment is very important. The
focus of this presentation is on a two-dimensional
probability model for peptide identification. First,
separate probability models are developed for two of the
parameters that most affect the quality of peptide
identification: the shared peaks count and the sum of the
product ion abundances. The model for the shared peaks count
is derived from the observations of the product ion matches
to fragment ions from protein databases. The intensity-based
model uses sum statistics to derive the probability of protein
assignment based on the product ion abundances. The
probabilities are translated into canonical coordinates to
derive a single significance value for a peptide match. The
talk will present the comparison of the approach to other
database search algorithms.
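The two quantities that the model is built on can be computed as in the following sketch. It only extracts the shared peaks count and the summed abundance of matched product ions; it does not implement the probability models themselves, which the abstract does not specify. The function name, the mass tolerance and the example values are assumptions.

```python
from typing import List, Sequence, Tuple


def match_features(observed: Sequence[Tuple[float, float]],
                   theoretical_mz: Sequence[float],
                   tolerance: float = 0.5) -> Tuple[int, float]:
    """Return (shared_peaks_count, matched_intensity_sum).

    observed       -- (m/z, intensity) pairs from the tandem mass spectrum
    theoretical_mz -- fragment ion m/z values predicted for a candidate peptide
    tolerance      -- maximum |m/z difference| for a match (assumed value)
    """
    shared = 0
    matched_intensity = 0.0
    for fragment in theoretical_mz:
        candidates: List[float] = [
            inten for mz, inten in observed if abs(mz - fragment) <= tolerance
        ]
        if candidates:
            shared += 1
            # Simplification: take the most intense peak within tolerance.
            matched_intensity += max(candidates)
    return shared, matched_intensity


# Hypothetical spectrum and candidate fragment masses.
peaks = [(100.1, 10.0), (200.0, 5.0), (200.3, 7.0), (350.0, 2.0)]
fragments = [100.0, 200.2, 300.0]
features = match_features(peaks, fragments)
```

Each candidate peptide from the database yields one such pair of values, which is the two-dimensional input that a combined significance measure would work from.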
Proteomics, why, how and when
Peter Roepstorff, University of Southern Denmark
The advances in DNA-sequencing and rapidly increasing
amount of genome sequence data becoming available have
changed the scope of protein analysis. Databases now provide
the sequence of more than 500,000 proteins, most of which
are based on genome sequencing, and this number is rapidly
increasing. However, the information content in genome
sequences is not sufficient to understand the living
organism because most proteins are processed or otherwise
modified after translation. Therefore, studies are needed at
the protein level, and the next level after genomics is
proteomics, defined as the analysis of the complete protein
complement expressed by a genome or by a cell or tissue type
(Wilkins M.R. et al. (1996) Bio/Technology, 14,
61-65).
Mass spectrometry (MS) is one of the most sensitive
analytical techniques that can generate structural
information on proteins, and the type of information
generated by mass spectrometric analysis is ideal for
querying sequence databases. Therefore, mass spectrometry has
become a key analytical tool in proteomics. Two strategies
dominate in proteomics: one based on separation of the
proteins by 2D-PAGE prior to protein identification by mass
spectrometry, the other based on proteolytic digestion of
all the proteins in a sample followed by separation and
sequencing of the resulting peptides by multidimensional
LC-MS. A number of intermediate strategies are also used.
The different strategies will be described and their
strengths and limitations evaluated for use at different
levels in proteomics, which include protein identification
and assignment of post-translational modifications. A number
of recently developed concepts that allow modification-specific
proteome analysis will also be described. Finally,
examples of disease-related studies that have taken
advantage of proteomics will be mentioned.
Applying probability based protein
identification to large data sets
David Creasy, Matrix Science, UK
In probability based scoring, we compute the probability
that the observed match between the experimental data and
mass values calculated from a candidate protein or peptide
sequence is a random event. The "correct" match, which is
not a random event, has a very low probability. The
strengths and weaknesses of this technique will be discussed
and comparisons with BLAST searches will be made. Generic
methods for testing algorithms will be discussed and
compared.
In particular, we have investigated the use of this
approach with large data sets consisting of tens of
thousands of spectra. The number of false positive peptide
matches will be shown to be within the expected values.
Techniques for interpreting and comparing the results of
these searches will be discussed, along with potential
pitfalls.
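Two pieces of arithmetic behind probability based scoring can be sketched as follows. The first assumes the common convention of reporting a score as -10 log10(P), where P is the probability that the match is a random event; the abstract does not state the convention, so treat it as an assumption. The second is the expected number of random matches when many spectra are each tested at the same significance threshold. Function names and example values are hypothetical.

```python
import math


def probability_to_score(p_random: float) -> float:
    """Convert the probability that a match is random into a score,
    assuming the -10 * log10(P) convention."""
    if not 0.0 < p_random <= 1.0:
        raise ValueError("p_random must be in (0, 1]")
    return -10.0 * math.log10(p_random)


def expected_false_positives(n_spectra: int, alpha: float) -> float:
    """Expected number of random matches that pass a significance
    threshold alpha when n_spectra spectra are searched independently."""
    return n_spectra * alpha


# A match with a 1-in-1000 chance of being random.
score = probability_to_score(1e-3)
# Twenty thousand spectra searched at a 5% significance threshold.
expected = expected_false_positives(20000, 0.05)
```

With tens of thousands of spectra, even a strict per-spectrum threshold yields a substantial absolute number of random matches, which is why the observed count is compared against the expected value.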
Algorithms and score functions used
in PEAKS de novo sequencing software
Bin Ma, University of Western Ontario
De novo sequencing from MS/MS data is the best way to
identify the peptides of novel proteins.
Because of the importance of de novo sequencing, many
software programs, free or commercial, are available.
Recently, the PEAKS software has drawn much attention. It
uses novel algorithms to compute peptide candidates and
sophisticated scoring functions to evaluate them. The
software can also deal with variable
post-translational modifications, and has some useful features
such as positional confidence scores for individual amino
acids of the computed peptides. In this talk, the basic
design of the algorithms and score functions of PEAKS will
be introduced. Some recent developments in the software will
also be discussed.
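The idea underlying de novo sequencing can be shown with a toy sketch: the mass differences between consecutive fragment ions of one series correspond to amino acid residue masses. This is not the PEAKS algorithm. It assumes a complete, noise-free ion ladder, which real spectra do not provide, and that gap is what the algorithms and scoring functions mentioned above have to handle. The function name, the tolerance and the reduced residue table are assumptions.

```python
from typing import Dict, Sequence

# Monoisotopic residue masses (Da) for a subset of amino acids.
# Leucine and isoleucine have the same mass; "L" stands for both here.
RESIDUE_MASS: Dict[str, float] = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "K": 128.09496, "E": 129.04259, "F": 147.06841,
}


def read_ladder(ion_masses: Sequence[float], tolerance: float = 0.02) -> str:
    """Read a sequence from consecutive fragment ion masses of one series.

    Each mass difference is matched against the residue table; a
    difference with no unique match is reported as '?'.
    """
    sequence = []
    for lower, upper in zip(ion_masses, ion_masses[1:]):
        delta = upper - lower
        matches = [aa for aa, mass in RESIDUE_MASS.items()
                   if abs(delta - mass) <= tolerance]
        sequence.append(matches[0] if len(matches) == 1 else "?")
    return "".join(sequence)


# Build a hypothetical ladder for the peptide fragment G-A-V-L.
ladder = [0.0]
for residue in "GAVL":
    ladder.append(ladder[-1] + RESIDUE_MASS[residue])
decoded = read_ladder(ladder)
```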
Mining motifs from protein
interaction data
See-Kiong Ng, Institute for Infocomm Research, Singapore
Discovering short conserved amino acid sequence
patterns (motifs) associated with protein functions or
interactions is useful for guiding biological studies for
the discovery of new drugs. However, finding biologically
significant protein motifs remains a challenging task.
Current methods typically require the manual grouping of the
protein sequences for pre-processing, so the quality of the
motifs discovered depends greatly on how well
the protein sequences provided are clustered. With the advent
of high-throughput protein interaction detection methods,
genome-wide protein interactions are now available for
analysis. In this work, we demonstrate how the inherent
functional associations between interacting proteins can be
exploited for clustering protein sequences to automatically
discover novel biologically significant motifs.
Binding motif pairs from
interacting protein groups
Limsoon Wong, Institute for Infocomm Research & National
University of Singapore
Protein-protein interaction is intrinsic to most
functional processes in the cell, and the binding sites are
essential to the understanding of protein-protein
interactions. A binding site is modeled as a binding motif
pair in our research to emphasize the correlation between a
pair of binding motifs. Inspired by the fact that a protein
can interact with many proteins, we propose the concept of
interacting protein groups to discover binding motif pairs.
A pair of interacting protein groups consists of two protein
groups such that every protein in one group interacts with all
proteins in the other group, indicating a kind of
"full interaction" between the two protein sets. As an
interacting protein group may share a common binding motif,
we can get binding motif pairs by examining pairs of
interacting protein groups. The identification of pairs of
interacting protein groups is a challenging problem given a
large collection of protein interactions. By a careful and
sophisticated problem transformation, the problem is
efficiently solved by using algorithms for mining frequent
patterns, a problem extensively studied in data mining. The
motif (or motifs) of each interacting protein group is then
derived by applying traditional motif discovery algorithms
on the sequence data of the protein group. We found 16372
binding motif pairs from a yeast protein interaction
dataset, represented in the form of blocks. Comparing the
motifs in the pairs with the BLOCKS and PRINTS databases, we
found that each block could be mapped to an average of 4.0
correlated blocks in these two databases. The mapped blocks
occur in 2472 out of a total of 6794 protein groups in these two
databases. Comparing the 16372 motif pairs with a putative
domain-domain interaction database (Interdom), we found
1508 matches, of which 320 pairs can be mapped to
high-confidence domain-domain interactions, and 194 of
the 320 pairs can be mapped to interactions confirmed by
complex data.
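The definition of a pair of interacting protein groups can be made concrete with a small sketch. It only checks the "full interaction" condition and computes the proteins that interact with every member of a group; it does not reproduce the frequent-pattern mining used in the work. The function names and the toy interaction data are hypothetical.

```python
from typing import Dict, Iterable, Set


def common_partners(group: Iterable[str],
                    interactions: Dict[str, Set[str]]) -> Set[str]:
    """Return the proteins that interact with every member of `group`."""
    partner_sets = [interactions.get(protein, set()) for protein in group]
    if not partner_sets:
        return set()
    return set.intersection(*partner_sets)


def is_full_interaction(group_a: Iterable[str], group_b: Iterable[str],
                        interactions: Dict[str, Set[str]]) -> bool:
    """True if every protein in group_a interacts with every protein in group_b."""
    group_b = list(group_b)
    return all(b in interactions.get(a, set())
               for a in group_a for b in group_b)


# Hypothetical symmetric interaction data: protein -> set of partners.
toy_interactions = {
    "A": {"X", "Y"},
    "B": {"X", "Y", "Z"},
    "X": {"A", "B"},
    "Y": {"A", "B"},
    "Z": {"B"},
}
partners = common_partners(["A", "B"], toy_interactions)
full = is_full_interaction(["A", "B"], ["X", "Y"], toy_interactions)
partial = is_full_interaction(["A", "B"], ["X", "Z"], toy_interactions)
```

Here the groups {A, B} and {X, Y} form a pair of interacting protein groups, whereas {A, B} and {X, Z} do not, because A does not interact with Z.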
Atomistic computer simulations: an
essential toolkit in modern proteomics
Chandra Verma, Bioinformatics Institute, Singapore
Computer simulations of biomolecules and in particular
proteins have made tremendous advances in contributing
towards understanding biology at the molecular level. They
now regularly complement experiments and often provide
unique details of biomolecular processes at an unprecedented
level. Indeed they have often served to shed new light on
existing paradigms. Examples will be given of the use of
some of these methods to construct comprehensive models that
relate biomolecular dynamics to the important processes
which underpin biological function.
Serum biomarker discovery of liver
diseases using SELDI-TOF and bioinformatics
Eastwood Leung, Genome Institute of Singapore
A reliable serum biomarker for early detection of cirrhosis
and hepatocellular carcinoma (HCC) is still lacking. We used
surface-enhanced laser desorption and ionization (SELDI)
technology together with machine learning algorithms to
establish a pipeline for discovery of novel low molecular
weight serum biomarkers. In a rat cirrhosis model, serum
proteins were allowed to bind onto a weak cation exchanger
surface. Protein profiles were generated after washing off
non-specifically bound proteins from the surface, followed by
a laser desorption and ionization process. A support vector
machine algorithm was used to analyze the protein profiles
generated. A panel of selected signature markers resulted in
higher than 90% specificity and sensitivity in
classification of test samples. The significant marker was
purified and identified using on-chip digestion and
sequencing on a MALDI-TOF/TOF platform. A copper(II) (Cu2+)
ion surface was used in serum biomarker discovery for human
HCC. A panel of selected protein peaks was differentially
expressed specifically in HCC but not in normal, cirrhosis,
colon carcinoma, or nasopharyngeal carcinoma samples. Again, the
specificity and sensitivity of class prediction of test
samples were higher than 89%. Several selected protein peaks
were purified and identified by using off-line column
chromatography, SDS-PAGE, and peptide sequencing using
tandem mass spectrometry. These markers were further
validated by Western blotting analysis. The preprocessing
procedures of data and comparison of performance of
different machine learning algorithms will also be
discussed.
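The specificity and sensitivity figures quoted above are computed from the predicted and true classes of the test samples. A minimal sketch of the calculation follows; the function name, the class labels and the example data are hypothetical and are not results from the study.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(true_labels: Sequence[str],
                            predicted_labels: Sequence[str],
                            positive: str = "HCC") -> Tuple[float, float]:
    """Return (sensitivity, specificity) for one class treated as positive.

    sensitivity = true positives / all truly positive samples
    specificity = true negatives / all truly negative samples
    """
    tp = fn = tn = fp = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == positive:
            if predicted == positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Hypothetical test set of eight serum samples.
truth = ["HCC", "HCC", "HCC", "HCC", "normal", "normal", "cirrhosis", "normal"]
predicted = ["HCC", "HCC", "HCC", "normal", "normal", "HCC", "cirrhosis", "normal"]
sens, spec = sensitivity_specificity(truth, predicted)
```

Any sample not labelled with the positive class counts as negative, which matches the setting above in which HCC is distinguished from normal, cirrhosis and other carcinoma samples.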
Predicting functional family of
novel enzymes irrespective of sequence similarity: a
statistical learning approach
Yu Zhong Chen, National University of Singapore
For proteins having no sequence homolog of known
function, their function is difficult to assign on the basis
of sequence similarity. The same problem arises for
homologous proteins of different functions. It is desirable
to explore methods that are not based on sequence
similarity. One approach is to assign functional family. A
statistical learning method, support vector machines (SVM),
has been used by several groups for predicting protein
functional family irrespective of sequence similarity. These
studies showed that SVM prediction accuracy is at a useful
level, particularly for distantly related proteins and
homologous proteins of different functions. Here SVM is
tested for functional family assignment of two groups of
enzymes. One consists of 41 enzymes without a homolog of
known function from PSI-BLAST search of protein databases.
The other contains 20 pairs of homologous enzymes of
different families. SVM correctly assigns 78% of the enzymes
in group 1 and 70% of the enzyme pairs in group 2,
suggesting that it is potentially useful for facilitating
functional study of novel proteins.
Systematic proteome analysis of
breast cancer cell lines
Keli Ou, Agenica Research Pte Ltd, Singapore
Breast cancer is the most common cancer in Asian women. In
Singapore, about three women are diagnosed with breast
cancer each day and this number is increasing significantly
at an average of 3% annually. In this study, we aim to
identify potential breast tumor biomarkers by comparing
the protein expression profiles among different breast cell
lines. A large-scale and high-throughput proteomic platform
was employed for the project, which included two-dimensional
gel electrophoresis and MALDI-TOF MS for protein
characterization. Proteomics analysis of the 3 cell lines
indicated significant differences between the normal (CRL)
and breast cancer cell lines (MCF-7, HCC 38). An integrated
proteomic and genomic approach showed that the expression
levels of the majority of proteins were consistent with the
corresponding gene expression levels derived from RNA
microarrays. However, some results were inconsistent.
The challenges of bioinformatics in the study will
be discussed.
Proteomic investigation of
colorectal cancer
Qingsong Lin, National University of Singapore
Colorectal cancer (CRC) is the second leading cancer
killer worldwide and has become the most common cancer in
Singapore. CRC is among the best characterized cancers with
regard to genetic progression. Changes in gene expression
profiles have also been widely investigated at the mRNA
level. However, changes at the protein level are less well
studied. Recent developments in proteomic technologies allow
us to examine the global expression profile of proteins in
action, and these technologies have been widely applied to the
study of disease processes. The present study aims to detect changes
of protein profiles that could be associated with the
process of colon tumorigenesis, in order to discover
biomarkers for diagnosis, and potential therapeutic targets.
Two-Dimensional Gel Electrophoresis (2-D GE) and
Isotope-coded Affinity Tag (ICAT) were applied to detect
protein profile differences between cancerous and adjacent
normal tissues. Issues regarding proteomic data analysis,
protein quantitation and bioinformatics data mining will be
discussed.
BIND: the Biomolecular Interaction
Network Database
Susan Moore, Blueprint Asia, Singapore
Recent estimates suggest that there are hundreds of
thousands of published, experimentally demonstrated
biomolecular interactions. Rapid retrieval and meaningful
large-scale analysis of this data depend on cataloguing
these interactions in a computer-readable format. To this
end, the Blueprint Initiative has embarked on a project to
create and maintain BIND (the Biomolecular Interaction
Network Database; www.bind.ca), a freely available resource.
A long-term goal for Blueprint is to use BIND and associated
tools to permit simulation of living cells. This talk will
focus on the current status of the BIND project and discuss
its possible use in facilitating research.
DTSeq: decision tree based De Novo
peptide sequencing
Wing-Kin Sung, National University of Singapore
De novo peptide sequencing is an important and
well-studied problem that requires further improvement. One
major hurdle to further improvement is how to model the
intensities of the peaks based on the chemical and
physical properties of the peptide. In this talk, we
model the intensity with the help of a probabilistic
decision tree. Together with a PEAKS-like dynamic programming
algorithm, a new algorithm, DTSeq, is proposed to perform de
novo peptide sequencing. Experimental results show that
DTSeq has better accuracy than some of the best-known
de novo peptide sequencing software.
Proteome analysis of separated male
and female gametocytes reveals novel sex specific Plasmodium
biology
Shahid Khan, Leiden University Medical
Centre, Netherlands
Gametocytes, the precursor cells of malaria parasite
gametes, circulate in the blood and are responsible for
transmission from host to mosquito vector. The individual
proteomes of male and female gametocytes were analyzed using
mass spectrometry, following separation by flow sorting of
transgenic parasites expressing Green Fluorescent Protein
in a sex-specific manner. Promoter tagging in transgenic
parasites confirmed the designation of gametocyte
specificity of the proteins. The male proteome contained 36%
(236 of 650) male-specific proteins and the female proteome 19%
(101 of 541) female-specific proteins, but they shared only 69
proteins, emphasizing the diverged features of the sexes. Of
all the malaria life cycle stages analyzed, the male
gametocyte has the most distinct proteome, containing many
proteins involved in flagellar-based motility and rapid
genome replication. By identifying sex-specific
protein kinases and phosphatases and using targeted gene
disruption of two kinases, new sex-specific regulatory
pathways were defined.
SPLASH: Systematic Proteomics
Laboratory Analysis and Storage Hub
Siaw Ling Lo, National University of Singapore
Proteomics is a rapidly expanding field generating
a tremendously large amount of data annually. The increasing
difficulty of unifying data formats, due to the use of
different platforms, equipment and laboratory documentation
systems, greatly hinders experimental data verification,
exchange and comparison. To address this issue, it is
essential to establish standard formats for every aspect of
proteomics. One recently published data model is the
Proteomics Experiment Data Repository (PEDRo) [1]. Based on
this model with some customizations, the SPLASH database system
has been developed to provide proteomics researchers with a
common platform to store, manage, search and analyze their
data. Here we report the implementation of SPLASH, covering
all three modules: data maintenance, data
search and data mining. Data maintenance consists of
experimental data entry and update, and uploading of
experiment results in batch mode (such as gel image
annotation and mass spectrometry results). The data search
module provides a means to search the database and allows
viewing of protein details or differential expression
display by clicking on a 2D GE image. The data mining module
offers tools to help researchers make biological sense of
the high-throughput data, including Gene Ontology (GO)
analyses, KEGG biochemical pathway analyses, and statistical
analyses for sample sets. These features make SPLASH a
practical and highly powerful tool for the proteomics
research community.
[1] Chris F Taylor, Norman W Paton, Kevin L Garwood, Paul
D Kirby, David A Stead, Zhikang Yin, Eric W Deutsch, Laura
Selway, Janet Walker, Isabel Riba-Garcia, Shabaz Mohammed,
Michael J Deery, Julie A Howard, Tom Dunkley, Ruedi
Aebersold, Douglas B Kell, Kathryn S Lilley, Peter
Roepstorff, John R Yates III, Andy Brass, Alistair J P
Brown, Phil Cash, Simon J Gaskell, Simon J Hubbard, and
Stephen G Oliver (2003). A systematic approach to modelling,
capturing and disseminating proteomics experimental data.
Nature Biotechnology, March 2003, 247-254.