STATISTICAL METHODS IN MICROARRAY ANALYSIS
(2 - 31 January 2004)
Abstracts
Considerations on sample
classification and gene selection with microarray data using
machine learning approaches
Xuegong Zhang, MOE Key Lab of Bioinformatics/ Department of
Automation, Tsinghua University, China
With the advance of microarray techniques, high expectations
have been placed on achieving better sample classification (such as
the classification of disease vs. normal, or of subtypes of a
cancer) at the molecular level with microarray data. The combination
of high dimensionality (typically thousands of genes) and
small sample sizes (typically tens or hundreds of cases) makes
this task very challenging. The complexity of the diseases,
the poor understanding of the underlying biology and the
imperfection of the data make the problem even harder. With
examples on the classification of lymph-node metastasis status
and ER status of breast cancers, and on the well-known leukemia
data sets, this talk will introduce an SVM-based strategy for
sample classification and gene selection (named R-SVM) and the
observations made in the experiments. The emphasis of the
talk, however, will be on general views of the task of sample
classification and gene selection, on some possible pitfalls
therein, and on considerations regarding the overall strategy.
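As a rough illustration of recursive SVM-based gene selection, the following Python sketch retrains a linear SVM on progressively smaller gene sets, ranking genes by squared weight. The ranking criterion and elimination schedule here are assumptions in the spirit of SVM-RFE; R-SVM's own ranking criterion differs in detail.

    import numpy as np
    from sklearn.svm import SVC

    def recursive_gene_selection(X, y, keep_fractions=(0.5, 0.25, 0.1)):
        """X: samples x genes expression matrix; y: class labels."""
        genes = np.arange(X.shape[1])
        for frac in keep_fractions:
            clf = SVC(kernel="linear").fit(X[:, genes], y)
            w = clf.coef_.ravel()            # one weight per surviving gene
            order = np.argsort(-w ** 2)      # rank genes by squared weight
            genes = genes[order[: max(1, int(frac * X.shape[1]))]]
        return genes                         # indices of the selected genes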
Normalization for cDNA microarray
experiments having many differentially expressed genes
I-Shou Chang, National Health Research Institute, Taiwan
This talk discusses two normalization methods for cDNA
microarray data in which a substantial proportion of genes
differ in expression between the two mRNA samples, or there is
no symmetry in the expression levels of up/down-regulated
genes. The first method concerns the situation in which there
are no control DNA sequences on the slide. The first step of this
approach is to perform global normalization based on dye-swap
experiments, and then use a statistical criterion to select a
set of (almost) constantly expressed genes. Based on this set,
intensity-dependent normalization is carried out using a local
regression method. The usefulness of this method is clearly
demonstrated in simulation studies and in the analysis of real
data sets. In particular, it is shown in the simulation
studies that this method identifies genes with a lower false
positive rate and a lower false negative rate than a commonly
used method, when a large number of genes are turned up or
down. The second method concerns the situation in which there are
control sequences on the slide. Calibration curves relating
fluorescence signal intensities to gene expression levels
are considered in the context of Bayesian isotonic regression,
which makes use of smooth priors on Bernstein polynomials and
Markov Chain Monte Carlo methods to study the isotonic
regression problem. The second method is applied to identify
early onset genes in the study of transcriptional profiling of
Autographa californica multiple nucleopolyhedrovirus.
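A minimal Python sketch of the first method's pipeline may help: a global normalization step, selection of an (almost) constantly expressed gene set, and intensity-dependent correction by local regression. The global step is shown as simple median centring rather than the dye-swap calculation, and the invariant-set rule (smallest absolute log-ratio) is a stand-in assumption for the statistical criterion described in the talk.

    import numpy as np
    from statsmodels.nonparametric.smoothers_lowess import lowess

    def normalize(M, A, invariant_frac=0.2):
        """M: log-ratios; A: average log-intensities, one value per spot."""
        M = M - np.median(M)                 # crude global normalization step
        idx = np.argsort(np.abs(M))[: int(invariant_frac * len(M))]
        fit = lowess(M[idx], A[idx], frac=0.4, return_sorted=True)
        trend = np.interp(A, fit[:, 0], fit[:, 1])   # trend at every spot
        return M - trend                     # intensity-corrected log-ratios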
A new web-based mouse phenotype
analysis system (MPHASYS) to integrate molecular and
pathophysiological end points of aging
Jan Vijg, University of Texas Health Science Center at San
Antonio, USA
Progress in the science of aging is largely driven by the
use of model systems, ranging from yeast and nematodes to
mice. Mouse models in particular are highly suitable for studying the
complexities of aging, in view of their short evolutionary
distance to humans, their comparable genome size, and the recently
emerged opportunities for altering genetic pathways considered
to be critically important in determining aging phenotypes and
life span. The study of such mutants, however, has been
hampered by the lack of objective standards embedded in new
information science and data management technologies for the
comparative analysis of pathophysiological characteristics
over their life span. The severity of this problem,
exacerbated by the rapid increase in the number of mouse
mutants, is increased by orders of magnitude by the emergence
of tools for global molecular characterization, such as RNA
and protein profiling using microarrays. Hence, new databases
and integrated tool sets are needed to bring together
biological information obtained at the molecular, cellular,
organ and system level of the various mouse models, in order
to understand the functional interactions that are at the
basis of a genetic alteration. Here we describe the creation
and use of MPHASYS, a new, web-based mouse phenotype analysis
system. MPHASYS includes: (1) a pathology ontology, describing
clinical observations, gross pathology, anatomy and
histopathology; (2) an objective pathology data entry system;
and (3) transparent query and data analysis systems that can
interact with currently available standards for molecular
database systems, such as microarrays. This “federated mouse
database” provides a solid basis for downstream data analysis
of the occurrence of adverse biological effects in cohorts of
aging mice. As an example of the use of MPHASYS, a comparative
analysis is presented of pathophysiological and gene
expressional endpoints, related to aging, in cohorts of mutant
mice with defects in genome maintenance systems.
Spot shape modelling and saturated
pixels in microarrays
Mats Rudemo, Chalmers University of Technology, Sweden
To study genes expressed at low levels in microarray
experiments, it is useful to increase the gain in scanning.
However, a large gain may cause some pixels for highly
expressed genes to become saturated.
Techniques for adjustment of highly expressed signal
intensities are given by Wit and McClure (2003) based on a
small set of available spot summaries such as spot mean, spot
median and spot variance. As mentioned by Wit and McClure it
should be possible to get more accurate adjustments when all
pixel values are available. In the present project we study
spatial statistical models for pixel values which should
enable such adjustments.
A convenient modelling approach is to transform the data to
be approximately Gaussian, with a mean value
function determined by gene intensities and spot shapes and a
corresponding covariance function. For such models censored
pixel values can be optimally estimated. We study different
types of transformations, spot shapes and covariance
functions. The transformations include logarithmic and power
transforms with an offset and the inverse hyperbolic sine
transform of Huber et al. (2002) and Durbin et al. (2002). The
spot shapes include three types suggested in Wierling et al.
(2002): (i) an isotropic 2D Gaussian distribution, (ii) a
crater spot distribution consisting of a difference between
two scaled isotropic 2D Gaussian distributions and (iii) a
plateau spot distribution. An additional model with a
polynomial-hyperbolic spot shape is introduced which gives a
considerably improved performance for the dataset studied.
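As a small illustration, the Python sketch below writes out two of the ingredients named above: the inverse hyperbolic sine transform and the crater spot shape as a difference of two scaled isotropic 2D Gaussians. All parameter values are illustrative assumptions.

    import numpy as np

    def asinh_transform(x, a=0.0, b=1.0):
        # inverse hyperbolic sine transform; ~ log(2*b*x) for large x
        return np.arcsinh(a + b * x)

    def crater_spot(x, y, x0=0.0, y0=0.0, s1=3.0, s2=1.5, c1=1.0, c2=0.6):
        """Difference of two isotropic 2D Gaussians centred at (x0, y0)."""
        r2 = (x - x0) ** 2 + (y - y0) ** 2
        g = lambda s: np.exp(-r2 / (2 * s ** 2)) / (2 * np.pi * s ** 2)
        return c1 * g(s1) - c2 * g(s2)   # c2 < c1, s2 < s1 gives a central dip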
The models are applied to the analysis of a dataset
obtained with a specially designed 50-mer oligonucleotide
microarray. Here 452 selected genes in transgenic Arabidopsis
plants are compared to the corresponding genes in wild-type
plants. Data include scans with different gains ranging from
no saturation to heavy saturation. This is joint work with
Claus Ekstrøm, Charlotte Kristensen and Søren Bak, Copenhagen.
Error modeling, data transformation
and robust calibration for microarray data
Anja von Heydebreck, Max Planck Institute for Molecular
Genetics, Germany
Microarray gene expression measurements are affected by a
number of variable experimental conditions, e.g. in sample
preparation, labelling, and hybridisation, which lead to
systematic and stochastic variation in the data. Normalization
tries to correct for the systematic experimental variation,
whereas error models are used to represent the remaining
stochastic variation. In replicate microarray data, one often
observes a systematic intensity-dependence of the variances of
log-transformed intensities. Thus the significance of a
measured fold change depends on the intensity level at which
it was observed.
We show how a variance-stabilizing transformation can be
derived from a simple error model for microarray data. For
large intensities, the transformation coincides with the usual
logarithmic transformation, such that expression differences
can still be interpreted in terms of fold changes. For small
intensities, the transformation diminishes the fluctuation of
the intensities that is usually visible in log-transformed
data.
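A minimal sketch of such a transformation, assuming the arsinh form h(x) = arsinh((x - a)/b) that arises from an additive-plus-multiplicative error model; in practice a and b are estimated from the data rather than fixed as here. The printout illustrates that differences converge to log fold changes at high intensities.

    import numpy as np

    def vst(x, a=50.0, b=20.0):
        # variance-stabilizing transform; ~ log(x) + const for large x
        return np.arcsinh((x - a) / b)

    x = np.array([1e2, 1e4, 1e6])
    print(vst(10 * x) - vst(x))   # approaches log(10) ~= 2.303 for large x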
Using a parametric statistical model, we simultaneously
estimate the parameters of the transformations for variance
stabilization and normalization. A robust estimation technique
is used in order to avoid a bias due to differentially
expressed genes. In applications to benchmark datasets, this
approach compares favorably to other normalization algorithms.
A comparison of microarray platforms
Darlene Goldstein, Swiss Institute for Experimental Cancer
Research, Switzerland
Several different platforms are available for quantifying
gene expression. My talk will introduce the technologies and
present results of a study comparing some of these, including
commercial arrays from Affymetrix and Agilent, in-house
spotted cDNA arrays, and MPSS methodology. Advantages and
disadvantages of the platforms will be discussed, as well as
assessments of reproducibility within and agreement between
technologies. Comparisons with results of quantitative PCR are
currently in progress and will also be reported if available.
GPmerge – a computing program for
cDNA microarray raw data processing
Jinming Li, Nanyang Technological University, Singapore
Microarray experiments generate millions of data points.
But these data are useful only when biologically meaningful
information can be extracted. This involves many facets of
data processing, statistical analysis, and data visualization.
Novel computation tools and reliable data processing
procedures are essential for the meaningful and accurate
interpretation of microarray data. However, current computing
tools are inefficient and sometimes even unreliable, so the
development of novel algorithms and approaches for microarray
data processing remains a challenge for bioinformaticians. We
will introduce GPmerge, a computer program we developed to
process raw microarray data generated by image analysis
software such as Axon GenePix Pro or QuantArray.
It is widely accepted that any single microarray output is
subject to substantial variability. By pooling data from
replicates, we can provide a more reliable classification of
gene expression, and designing experiments with replication
will greatly reduce misclassification rates. The development
of GPmerge reflects our efforts toward pooling replicated data
sets generated by image analysis software, so that a user can
exploit the overall information provided by these replicated
microarray slides.
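The following Python fragment sketches the kind of replicate pooling described: flagged spots are dropped and per-gene log-ratios are combined across replicate slides. The column names are assumptions, not the actual GenePix or QuantArray field names or GPmerge's interface.

    import pandas as pd

    def merge_replicates(slides):
        """slides: list of DataFrames with columns 'gene', 'log_ratio', 'flag'."""
        stacked = pd.concat(slides, ignore_index=True)
        good = stacked[stacked["flag"] == 0]      # keep unflagged spots only
        return good.groupby("gene")["log_ratio"].agg(["mean", "std", "count"])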
Building genetic networks in
gene expression patterns
Eric Fung Siu Leung, University of Hong Kong
Building genetic regulatory networks from time series data
of gene expression patterns is an important and useful topic
in bioinformatics. Probabilistic Boolean Networks (PBNs) were
recently developed as a model of gene regulatory networks.
PBNs are able to cope with uncertainty, incorporate rule-based
dependencies between genes, and discover the relative
sensitivity of genes in their interactions with other genes.
However, they require prior knowledge on the nature of the
predictors, i.e., the existence of the set of predictors, and
are unlikely to be used in practice because of the huge number
of possible predictors, each with its corresponding probability. In
this paper we propose a multivariate Markov chain model to
model the dynamics of a genetic network for gene expression
patterns. One of the contributions here is to preserve the
strength of PBNs and reduce the complexity of the networks. We
propose a multivariate Markov chain model whose numbers of
states and parameters are linear in the number of genes in the
model. We also develop an efficient estimation
method for the model parameters. Numerical examples with
applications to yeast data are given to illustrate the
effectiveness of the model.
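A direct transcription of the model's transition step may clarify the construction: the state distribution of gene i at time t+1 is a convex combination, with weights lam[i, j], of one-step transitions from every gene j. The sketch assumes each P[i][j] is a column-stochastic matrix and that each row of lam is non-negative and sums to one; all values are illustrative.

    import numpy as np

    def step(x, P, lam):
        """x: list of state-probability vectors, one per gene."""
        n = len(x)
        return [sum(lam[i, j] * P[i][j] @ x[j] for j in range(n))
                for i in range(n)]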
Bayesian hierarchical
modelling of gene expression data
Sylvia Richardson, Imperial College, UK
We show how Bayesian hierarchical modelling strategies can
be usefully applied to gene expression data for signal
extraction and the analysis of differential expression,
carrying out joint estimation of model parameters in a full
Bayesian framework using MCMC techniques.
For signal extraction from Affymetrix GeneChip data at the
gene-probe level, the proposed models use both perfect match
(PM) and mismatch (MM) intensities. They include background
and cross-hybridization terms and allow for part of the MM
intensity being the true signal. At the gene level, we pool
information across probe sets and over repeated measurements,
to obtain gene expression measurements under given conditions.
For investigating differential gene expression, we propose
a flexible method for choosing a list of genes for further
investigation based on a model of the sources of variability
of the experimental set-up. We give empirical evidence that
expression-level dependent array effects are frequently
needed, and explore different non-linear functions as part of
a model-based approach to normalisation. The model includes
gene-specific variances but imposes some necessary shrinkage
through a hierarchical structure. Model criticism via
posterior predictive checks is discussed. To choose a list of
genes, we propose to combine various criteria (for instance,
fold change and overall expression) into a single indicator
variable for each gene. The posterior distribution of these
variables is then used to pick the list of genes, thereby
taking into account uncertainty in parameter estimates.
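A minimal sketch of the indicator-variable idea, assuming the MCMC output is available as arrays of posterior samples: a gene is flagged in a sample when both criteria are met, and the gene list is chosen by thresholding the posterior probability of the flag. The particular criteria and thresholds here are illustrative assumptions.

    import numpy as np

    def gene_list(fold, expr, delta=1.0, a0=5.0, cut=0.8):
        """fold, expr: MCMC samples, shape (n_samples, n_genes)."""
        flag = (np.abs(fold) > delta) & (expr > a0)  # indicator per sample/gene
        prob = flag.mean(axis=0)                     # posterior probability
        return np.where(prob > cut)[0], prob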
Lewin, A., Richardson, S., Marshall, C., Glazier, A. and
Aitman, T. (2003) Bayesian Modelling of Differential Gene
Expression, submitted for publication, available at
http://155.198.41.240/projects/bgx/BGX-papers.html
A practical projected clustering
algorithm for gene expression profiles
Kevin Yip, The University of Hong Kong
In gene expression data, clusters can be found in subspaces
in which a set of related genes have similar expression
patterns in a set of samples. Traditional clustering
algorithms may fail to identify such clusters as the
expression patterns of the cluster members are not similar in
the full input space. A number of algorithms have been
proposed to identify clusters in subspaces, but most of them
require the input of some parameter values that are hard for
users to determine. In this talk, I will introduce a new
projected clustering algorithm that dynamically adjusts its
internal thresholds in response to the clustering status. This
allows the algorithm to avoid using any hard-to-determine
parameters, which simplifies the analysis of complex gene
expression data. I will present the experimental results on
some synthetic and real datasets to show that the algorithm is
able to identify projected clusters that make both statistical
and biological sense.
Hidden Markov modelling of genomic
interactions
Ernst Wit, University of Glasgow, UK
Microarray technology has made the simultaneous measurement
of gene transcription a routine activity. Whereas gene
transcription is only one stage in the complex genomic process
of living organisms, it gives a fascinating insight into one
aspect of this activity across the whole genome. Gene
regulation is a complex biological process, which involves
gene-gene and gene-protein interactions. An operator region,
to which the enzyme polymerase can bind to start
transcription, precedes the gene sequence. Such local features
regulating transcription pose the question of whether there
might be local spatial gene interactions.
We define a Hidden Markov Model (HMM) to relate the
observed expression levels to hidden states "Up", "Down" and
"Same" for a time-series gene expression dataset. A Potts
Model is identified to describe the interactions between
neighboring states. A typical problem in these types of model
is the estimation of the hidden parameters because of the
intractability of the normalizing constant. Recent work by
Pettitt et al. (2002) provides a way to avoid using
pseudolikelihood and to solve this issue for a wide class of
HMMs.
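To fix ideas, the following sketch writes down the unnormalized log posterior of such a hidden Potts chain with Gaussian emissions; the emission means, sigma and the interaction strength beta are illustrative assumptions. The missing normalizing constant of the Potts prior is precisely the intractability mentioned above.

    MU = {"Up": 1.0, "Down": -1.0, "Same": 0.0}   # assumed emission means

    def unnormalized_logpost(z, y, beta=1.0, sigma=0.5):
        """z: list of hidden states; y: observed expression levels."""
        emit = sum(-0.5 * ((yi - MU[zi]) / sigma) ** 2
                   for zi, yi in zip(z, y))
        potts = beta * sum(z[i] == z[i + 1] for i in range(len(z) - 1))
        return emit + potts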
This is joint work with N. Friel (University of Glasgow).
Hierarchical Bayesian modelling
of multiple array experiments
Annibale Biggeri, University of Florence, Italy
We propose a Hierarchical Bayesian model for the
simultaneous analysis of replicated gene expression profiles
in a standard reference design. Gene expression is modelled as
the sum, on an appropriate scale, of fixed terms representing
different sources of variation (e.g. pin effect or dye
effect for normalization, and one or more parameters to
describe the effect of an experimental factor). Treatment
effects are represented by a set of parameters whose prior
distribution is a mixture of three components. This
corresponds to the definition of a discrete latent variable
with three possible states (labelling a given gene as
under-expressed, over-expressed or not differentially
expressed with respect to the reference sample). The Bayesian
approach uses all the information collected to make inference
and allows one to estimate the posterior probability of each
single gene being differentially expressed. All the sources
of variation are modelled in a common and consistent
framework, avoiding the need for multiple distinct steps of
analysis (for example, normalization and testing).
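As an illustration of the three-component mixture on the treatment effect, the sketch below computes the posterior probabilities (responsibilities) of the three latent states for a vector of effects; the component distributions and weights are illustrative assumptions, and in the full model they are estimated jointly with everything else.

    import numpy as np
    from scipy.stats import norm

    def responsibilities(d, w=(0.8, 0.1, 0.1)):
        """d: array of treatment effects, one per gene."""
        comp = np.array([norm.pdf(d, 0.0, 0.1),   # not differentially expressed
                         norm.pdf(d, -1.5, 0.5),  # under-expressed
                         norm.pdf(d, 1.5, 0.5)])  # over-expressed
        comp = comp * np.array(w)[:, None]
        return comp / comp.sum(axis=0)            # columns sum to one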
We applied the model to cDNA microarray experiments designed
to evaluate signal variation related to hybridization
temperature in Saccharomyces cerevisiae, diet effects in rat
carcinogenesis experiments, and human gene expression profiles
from patients affected by dysplastic disease.
More on the analysis of time-course
microarray data
Terry Speed, UC Berkeley, USA and WEHI, Australia
Time course microarray data sometimes involve
autocorrelation; that is, there are biological reasons for
expecting expression measurements at one time to be correlated
with expression measurements at nearby times. Sometimes time
course microarray data are replicated. And frequently, time
course cDNA microarray data are collected with one channel
being a common reference mRNA source.
In this talk I discuss replicated time course cDNA
microarray data with autocorrelation, measured relative to a
common reference. The question I focus on is this: given such
replicated time course data, one series for each spot on the
slide, how can we best determine which genes have constant and
which genes have varying (relative) expression levels across
the times?
One common approach in longitudinal data problems like this
is to carry out F-tests, treating the times as levels of a
one-way classification. This approach ignores any
autocorrelation that may be present, and simply compares
between-time to within-time variation. An alternative approach is
to treat this as a multivariate problem, seeking a likelihood
ratio test of the null hypothesis that a (vector) mean is
constant, against the alternative that it is not constant. A
difficulty here is that we typically have more times than we
do replicates, so estimating a covariance matrix for the
replicate series is problematic, but a variety of solutions to
this difficulty suggest themselves. A third strategy is to
extend a familiar empirical Bayes approach to the scalar
version of the same problem.
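As a concrete rendering of the first approach, the sketch below computes, for one gene, the one-way F-test that treats the times as levels of a classification, comparing between-time to within-time variation while ignoring any autocorrelation.

    from scipy.stats import f_oneway

    def time_course_F(X):
        """X: replicate series for one gene, shape (n_replicates, n_times)."""
        groups = [X[:, t] for t in range(X.shape[1])]   # one group per time
        return f_oneway(*groups)                        # F statistic, p-value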
In the talk, which represents work in progress carried out
jointly with Yu Chuan Tai of the Program in Biostatistics at
UC Berkeley, I'll explain these approaches and compare them on
a data set with the characteristics described.
Directed indices for exploring
gene expression data
Charles Kooperberg, Fred Hutchinson Cancer Research Center
Expression studies with clinical outcome data are becoming
available for analysis. An important goal is to identify genes
or clusters of genes where expression is related to patient
outcome. While clustering methods are useful data exploration
tools, they do not directly allow one to relate the expression
data to clinical outcome. Alternatively, methods which rank
genes based on their univariate significance do not
incorporate gene function or relationships to genes that have
been previously identified. We consider a gene index technique
that generalizes methods that rank genes by their univariate
associations to patient outcome. Genes are ordered based on
simultaneously linking their expression both to patient
outcome and to a specific gene of interest. The technique can
also be used to suggest profiles or means of bundles of gene
expression related to patient outcome. The methods are
illustrated on a gene expression data set based on patients
with Diffuse Large Cell Lymphoma.
This is joint work with Michael LeBlanc.
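A minimal sketch of a directed index of this kind, assuming a numeric outcome: each gene is scored both on its correlation with the outcome and on its correlation with a chosen gene of interest, and the two scores are combined. The simple sum used here is an assumption; the talk's indices combine the two ingredients differently.

    import numpy as np

    def directed_index(X, outcome, target):
        """X: samples x genes; outcome: numeric outcome; target: gene index."""
        Xc = (X - X.mean(0)) / X.std(0)
        yc = (outcome - outcome.mean()) / outcome.std()
        assoc = Xc.T @ yc / len(yc)            # correlation with outcome
        link = Xc.T @ Xc[:, target] / len(yc)  # correlation with chosen gene
        return assoc + link                    # rank genes by combined index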
The analysis of proteomics
spectra from serum samples
Keith Baggerly, M.D. Anderson Cancer Center
Just as microarrays allow us to measure the relative RNA
expression levels of thousands of genes at once, mass
spectrometry profiles can provide quick summaries of the
expression levels of hundreds of proteins. Using spectra
derived from easily available biological samples such as serum
or urine, we hope to identify proteins linked with a
difference of interest such as the presence or absence of
cancer. In this talk, we will briefly introduce two of the
more common mass spectrometry techniques, matrix-assisted
laser desorption and ionization/time of flight (MALDI-TOF) and
surface-enhanced laser desorption and ionization/time of
flight (SELDI-TOF). We then describe two case studies, one
using each of the above techniques. While we do uncover some
structure of interest, aspects of the data clearly illustrate
the need for careful experimental design, data cleaning, and
data preprocessing to ensure that the structure found is due
to biology. Time permitting, we will then discuss further
examples using data collected at MD Anderson, in some cases
illustrating that these lessons have been learned.
Analyzing data from a splice
array experiment
Jean Yee Hwa Yang, University of California, San Francisco
Splice-specific microarrays provide a basis to investigate
the effect of mutations and other factors on splicing events
in the creation of mature mRNA. This talk will illustrate
various statistical designs and analysis issues from a study
aimed at detecting differential gene expression between
selected spliceosome mutants. The data feature an unbalanced,
nested design with a minimal degree of replication. I will begin
the talk with a brief overview of the splice array technology
and discuss potential methods for synthesizing results from
various approaches. The design of these arrays also provides a
platform for comparing the performance of different
normalization methods.
Unsupervised determination of
gene significance in time-course microarray data
Radha Krishna Murthy Karuturi, Genome Institute of Singapore
Motivation: In time-course data, the recurrence of a
significant portion of a gene's temporal induction-repression
expression pattern among other genes is an indication of its
non-randomness. The significance of the portion that matches
between two gene profiles can be derived using binomial
analysis or a variant of it. Considering the
induction-repression pattern alone is both meaningful and
robust, since related genes induced or repressed in a given
period may not show exactly the same shape of induction or
repression. Further, microarray measurements are of low
quality, which may make the expression patterns of related
genes less similar. Based on these observations we developed an
approach called friendly neighbors (FNs). In this approach,
the significance score of a gene is the number of genes in the
same experiment that share its induction-repression pattern
more than a certain threshold.
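A minimal sketch of the FNs score, under the assumption that the induction-repression pattern is the sign of consecutive differences and that agreement is measured as the fraction of matching intervals; the binomial scoring of matches mentioned above could replace the fixed threshold used here.

    import numpy as np

    def fn_scores(X, threshold=0.8):
        """X: genes x times expression matrix."""
        sign = np.sign(np.diff(X, axis=1))            # induction-repression
        agree = (sign[:, None, :] == sign[None, :, :]).mean(axis=2)
        np.fill_diagonal(agree, 0.0)                  # exclude self-matches
        return (agree > threshold).sum(axis=1)        # friendly neighbors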
Results: The FNs approach has been applied to discover
putative estrogen target genes, to detect cell-cycle-regulated
genes in S. cerevisiae, and to elicit the modes of expression
of immune genes in SARS-infected samples. The new approach
performed better than the paired t-test and simple
expression-level-based filtering methods for estrogen target
gene discovery, and it did well on cell-cycle-regulated gene
discovery in the absence of task-specific knowledge. Using the
new approach we discovered trends in the SARS-infected sample
data that might not be elicited by typical hierarchical
clustering.
Practical use of Bayesian mixture
model for comparative microarray analyses in clinical oncology
Philippe Broët, Institut Curie and INSERM, France
Recent developments in transcriptome-oriented
biotechnologies have made possible the comparative analysis of
thousands of mRNA expression levels in parallel. Typically, these
data consist of the measurement of gene expression under
various experimental or biological conditions that can
potentially provide information on the complex transcriptional
activity for the biological system under study. In parallel to
the rapid development of these technologies, research into
ways of identifying gene expression changes in microarray
experiments while guarding against false conclusions has become
an active area. Up to now, statistical procedures have mostly
relied on the multiple testing framework in order to control
false positive conclusions. In this framework, two quantities
have been considered: the familywise error rate (FWER) and the
false discovery rate (FDR). This latter criterion is now
widely used for microarray analyses since it controls an error
quantity that is relevant and leads to more powerful
procedures than those relying on the FWER. In this spirit,
important work has been done for estimating the FDR or the
pFDR in a non-parametric mixture approach. However, a drawback
of these latter procedures is that they only focus on
protecting against false positive conclusions. In the
exploratory and screening context of most microarray data
analysis, investigators may however be seriously concerned
that such methods do not account for false negative
discoveries and lead to discarding too large a proportion of
meaningful experimental information. Since in many cases
complex biological pathways are of interest, it is difficult
to envisage exploratory strategies which only protect against
false-positive without controlling for false-negative results.
As a large variation in gene expression does not necessarily
translate into a major role in the biological process under
study, and vice versa, genes with small variation in expression
should not be discarded by a blind selection process. This is
especially true for genome-wide microarray experiments which
are followed by large-scale RT-PCR or custom microarrays
focusing on specific pathways.
In this presentation, we will consider the problem of
detecting differentially expressed genes in multiclass
response microarray experiments and of providing false
discovery rate estimates for a defined subset of genes, to
help the investigator in the gene selection process.
Multiclass response (MCR) experiments correspond to a
situation where there are more than two groups to be compared.
Although this situation is frequently encountered in
biomedical microarray studies, it has received less attention
than the classical two class comparison problem. For this
purpose, we propose a mixture model-based approach on a
modified F-statistic that allows one to identify gene expression
change profiles for MCR experiments. This new approach is
based on a fully Bayesian mixture model that extends previous
work on two class comparison in microarray experiments. We
illustrate the performance in estimating false discovery and
non-discovery rates using simulated microarray data sets. The
usefulness of this new approach will be illustrated on real
data from a breast cancer study.
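The practical payoff can be sketched in a few lines: if the mixture model yields, for each gene, a posterior probability p of being differentially expressed, then for any user-defined gene list the estimated false discovery rate is the average of 1 - p over the selected genes, and the estimated false non-discovery rate is the average of p over the discarded ones. This is a generic property of such mixture posteriors; how the talk reports the estimates is an assumption here.

    import numpy as np

    def error_rates(p, selected):
        """p: posterior prob. of differential expression, one per gene."""
        sel = np.zeros(len(p), dtype=bool)
        sel[selected] = True
        fdr = (1 - p[sel]).mean() if sel.any() else 0.0
        fnr = p[~sel].mean() if (~sel).any() else 0.0
        return fdr, fnr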
Understanding array CGH data
Jane Fridlyand, The Jain Lab, UCSF Cancer Center, USA
The development of solid tumors is associated with
acquisition of complex genetic alterations, indicating that
failures in the mechanisms that maintain the integrity of the
genome contribute to tumor evolution. Thus, one expects that
the particular types of genomic derangement seen in tumors
reflect underlying failures in maintenance of genetic
stability, as well as selection for changes that provide
growth advantage. In order to investigate genomic alterations
we are using microarray-based comparative genomic
hybridization (array CGH). The computational task is to map
and characterize the number and types of copy number
alterations present in the tumors, and so define copy number
phenotypes as well as to associate them with known biological
markers.
We discuss general analytical and visualization approaches
applicable to array CGH data. We also use an unsupervised
Hidden Markov Model approach to exploit the spatial coherence
between nearby clones. The clones are partitioned into states
that represent the underlying copy number of the group of
clones. The method is demonstrated on primary melanoma data
and on two cell line datasets, one of which has known copy
number alterations. The biological conclusions drawn
from the analyses are discussed.
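To illustrate the idea of partitioning clones into copy-number states, here is a small Viterbi decoder for a three-state Gaussian HMM over clones in genome order; the fixed state means, noise level and sticky transition matrix are illustrative assumptions, whereas the approach in the talk fits such quantities in an unsupervised way.

    import numpy as np

    MEANS, SIGMA = np.array([-0.5, 0.0, 0.5]), 0.2   # loss / normal / gain
    logA = np.log(np.full((3, 3), 0.05) + 0.85 * np.eye(3))

    def viterbi(y):
        """y: log2 ratios of clones in genome order."""
        logB = -0.5 * ((y[:, None] - MEANS) / SIGMA) ** 2
        score, back = logB[0] + np.log(1 / 3), []
        for t in range(1, len(y)):
            cand = score[:, None] + logA       # best path into each state
            back.append(cand.argmax(0))
            score = cand.max(0) + logB[t]
        path = [score.argmax()]
        for b in reversed(back):
            path.append(b[path[-1]])
        return np.array(path[::-1])            # one state per clone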
Biological and practical issues
Mark Reimers, Karolinska Institute, Stockholm
Individual differences: some genes are more variable
between individuals than others. Certain classes of genes are
quite often regulated by large factors in a few individuals in
comparison to the majority.
Scales for Analysis: what are the benefits and drawbacks of
transforming the scale? The variability of most measures
increases with the signal. Hence some sort of concave
transforming function is often used, most commonly the
logarithm. However, in some cases this seems to actually hurt
the analysis, as when the treatment down-regulates genes and
these become more variable.
Experimental consistency: the details of experiment setup
make a huge difference to the results; often these differences
can be detected at an early stage of the analysis.
Spatial effects on chips: although the idea is to have
massively parallel measures, often the hybridization reaction
proceeds differently on different regions of the chip.
Sometimes this can be normalized.
Affymetrix low-level analysis
Mark Reimers, Karolinska Institute, Stockholm
The probe sets used by Affymetrix contain between 11 and 20
probes for each gene. Sometimes different probes map to
different splice variants, but the aim has been to have a
probe set that consistently matches one splice variant.
However the values of signal strength across samples differ
greatly for probes in a single probe set, although the
patterns are often similar. This suggests the use of a linear
model to fit the probe affinities and the gene abundance
estimates simultaneously. Several authors have presented such
a measure. We’ll look at the two leading measures: dChip and
RMA.
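The linear model in question can be made concrete with a short sketch: log2 PM intensity is modelled as probe affinity plus chip (expression) effect plus error, fitted robustly here by median polish, which is the summarization used by RMA; dChip fits a related multiplicative model by least squares.

    import numpy as np

    def median_polish(logPM, n_iter=10):
        """logPM: probes x chips matrix for one probe set."""
        res = logPM.copy()
        row = np.zeros(logPM.shape[0])   # probe affinities
        col = np.zeros(logPM.shape[1])   # chip (expression) effects
        for _ in range(n_iter):
            r = np.median(res, axis=1); row += r; res -= r[:, None]
            c = np.median(res, axis=0); col += c; res -= c[None, :]
        return col, row, res             # col: expression across chips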
Practical issues in Affymetrix
analysis
Mark Reimers, Karolinska Institute, Stockholm
The multi-chip methods for Affymetrix analysis seem better
in principle and seem often to do better in practice. However
there are differences between them; some experiments have
shown fairly consistent differences; others show interesting
but unexplained systematic differences. Comparisons with
spotted arrays applied to the same test samples suggest some
models are consistently better for certain purposes, but no
model is uniformly best. Most multi-chip models assume that
probes behave consistently. Some data sets suggest
considerable variation in probe performance; this may reflect
differences in non-specific hybridization between tissues.
Improvement of DNA microarray data
analysis and better interpretation of microarray results
Henry Yang He, Bioinformatics Institute, Singapore
cDNA/oligo microarrays provide simple and economical ways
to explore gene expression patterns on a genomic scale, and
are used by an increasing number of biologists. In comparison
to conventional methods, microarray technology can be used for
guided gene discovery, meaning that microarray data are used
to select a handful of genes out of the whole genome. This
selection process involves two classification stages: 1)
classification of genes as differentially or
non-differentially expressed, and 2) classification of genes as
biomarkers or non-marker genes. Although microarray technology
is still in its infancy, many computational methods have
evolved; thus, the question arises of how to choose proper
microarray data analysis methods. We address this question
by developing validation methods for each analysis step. This
talk will highlight what kind of microarray experiments are
needed to obtain useful and reliable information, and will
also give suggestions on how to choose analysis algorithms
and software suites for proper microarray data analysis.
Integration of gene expression and
protein activity data to estimate the structure of a metabolic
pathway
Marek Kimmel, Rice University, USA
The NF-κB transcription factor and its signaling pathway play a
major role in triggering the immune response in humans. Its
regulation involves at least two feedback loops, which can be
modeled by means of ordinary differential equations. A
deterministic model involves two-compartment kinetics of the
activators IκB kinase (IKK) and NF-κB, the inhibitors A20 and
IκBα, and their complexes. In resting cells the
unphosphorylated IκBα binds to NF-κB and sequesters it in an
inactive form in the cytoplasm. In response to extracellular
signals such as TNF or IL-1, IKK is transformed from its
neutral form (IKKn) into its active form (IKKa), a form capable
of phosphorylating IκBα and leading to IκBα degradation.
Degradation of IκBα releases the main activator NF-κB, which
then enters the nucleus and triggers transcription of the
inhibitors and numerous other genes. The newly synthesized IκBα
leads NF-κB out of the nucleus and sequesters it in the
cytoplasm, while A20 inhibits IKK by easing its transformation
into the inactive form (IKKi), a form different from IKKn and
no longer capable of phosphorylating IκBα.
After parameter fitting, the proposed model is able to
properly reproduce the time behavior of all variables for which
data are now available: nuclear NF-κB, cytoplasmic IκBα, A20
and IκBα mRNA transcripts, and IKK catalytic activity in both
wild-type and A20-deficient cells. The model allows detailed
analysis of the kinetics of the involved proteins and their
complexes and gives predictions of the possible responses of
the whole kinetics to a change in the level of a given
activator or inhibitor. However, the NF-κB transcription factor
acts by attaching to one or two sites in the promoter region of
a gene, and this attachment is random and followed by random
detachment. We build a stochastic model which allows simulating
this process: in each particular cell, the effect of the
extracellular signal leads to non-vanishing oscillations,
which, at the population level, cancel due to phase shifts.
This unexpected effect leads to testable predictions, which we
are trying to verify using single-cell observations.
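For illustration only, here is a deliberately reduced two-variable caricature of the core negative feedback loop (the model described above has many more species and two compartments): nuclear NF-κB drives synthesis of its inhibitor, and the inhibitor removes NF-κB from the nucleus. All rate constants are invented for the sketch.

    import numpy as np
    from scipy.integrate import odeint

    def feedback(state, t, k_in=1.0, k_out=2.0, k_syn=1.0, k_deg=0.5):
        nfkb, ikba = state
        d_nfkb = k_in * (1 - nfkb) - k_out * ikba * nfkb  # import vs. export
        d_ikba = k_syn * nfkb - k_deg * ikba              # NF-kB-driven synthesis
        return [d_nfkb, d_ikba]

    t = np.linspace(0, 50, 500)
    traj = odeint(feedback, [0.0, 0.0], t)   # relaxation of the feedback loop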
Spearman’s footrule as a measure of
cDNA microarray reproducibility
Byung Soo Kim, Yonsei University, Korea
Replication is a crucial aspect of microarray
experiments, due to various sources of error that persist
even after systematic effects are removed. It has been confirmed
that replication in microarray studies is not equivalent to
duplication, and hence is not a waste of scientific
resources. Replication and reproducibility are among the most
important issues for microarray applications in genomics.
However, little attention has been paid to the assessment of
reproducibility among replicates. Here we use Spearman’s
footrule to develop a new measure of the reproducibility of
cDNA microarrays, based on how consistently a gene’s
relative rank is maintained in two replicates. The
reproducibility measure, termed index.R, has an R2-type
operational interpretation. The index.R assesses
reproducibility at the initial stage of microarray data
analysis, even before normalization is done. We first define
three layers of replicates: biological, technical and
hybridizational replicates, which refer to different
biological units, different mRNAs from the same tissue, and
different cDNAs from the same mRNA, respectively. As the
replicate layer moves down to a lower level, the experiment
has fewer sources of error and is thus expected to be
more reproducible. To validate the method we applied the
index.R to two sets of controlled cDNA microarray experiments,
each of which had two or three layers of replicates. The
index.R showed a uniform increase as the layer of the
replicates moved into a more homogeneous environment. We also
noted that the index.R had a larger jump size than Pearson’s
correlation or Spearman’s rank correlation for each replicate-layer
move, and therefore has greater expandability as a
measure on [0,1] than these two other measures.
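A sketch of a footrule-based measure of this kind: D is Spearman's footrule between the intensity ranks of the two replicates, and dividing by its expectation (n^2 - 1)/3 under independent rankings yields an R2-type statistic, near 1 for reproducible replicates and near 0 for unrelated ones. Whether index.R uses exactly this normalization is an assumption; the rank-consistency idea is from the talk.

    import numpy as np
    from scipy.stats import rankdata

    def footrule_index(x1, x2):
        """x1, x2: spot intensities of the same genes in two replicates."""
        r1, r2 = rankdata(x1), rankdata(x2)
        D = np.abs(r1 - r2).sum()            # Spearman's footrule
        n = len(x1)
        return 1 - 3 * D / (n * n - 1)       # ~1: reproducible, ~0: random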
Statistical study of inter-lab and
inter-platform agreement of DNA microarray data
Lei Liu, University of Illinois at Urbana-Champaign
As the gene expression profile data from DNA microarrays
accumulate rapidly, a natural need of comparing data across
different data sets arises. Unlike DNA sequence comparison,
comparison of microarray data can be quite challenging due to
the complexity of the data. Different laboratories may adopt
different technology platforms. How reliably can we compare
data from different labs and different platforms? To address
this question, we conducted a statistical study of inter-lab
and inter-platform agreement of microarray data from the same
type of experiment, using the intra-class correlation, the
kappa statistic, and Pearson correlation. The platforms involved
include Affymetrix GeneChip, custom cDNA arrays, and custom
oligo arrays. We investigated the consistency of replicates,
agreement by pairwise comparison, two-fold change agreement,
and overall agreement. We also discuss the effects of data
filtering and of the duplication of genes on the arrays.
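The three measures can be sketched for a pair of platforms profiling the same samples. The discretization into two-fold down / unchanged / two-fold up calls for the kappa statistic, and the one-way form of the intra-class correlation, are assumptions about details the abstract leaves open.

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.metrics import cohen_kappa_score

    def agreement(x, y):
        """x, y: log2 expression values for the same genes on two platforms."""
        r, _ = pearsonr(x, y)
        m, d = (x + y) / 2, x - y
        msb = 2 * np.var(m, ddof=1)        # between-gene mean square
        msw = np.mean(d ** 2) / 2          # within-gene (between-platform) MS
        icc = (msb - msw) / (msb + msw)    # one-way intra-class correlation
        calls = lambda v: np.digitize(v, [-1.0, 1.0])   # two-fold bins
        kappa = cohen_kappa_score(calls(x), calls(y))
        return r, icc, kappa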
Strategic design and meta-analysis
of expression genomic experiments
Edison Liu, Genome Institute of Singapore
DNA microarrays make possible the rapid and comprehensive
assessment of the transcriptional activity of a cell, and as
such, have proven valuable in assessing the molecular logic of
biological processes and human diseases. With the focus on the
post hoc statistical analysis of data, attention to the design
of the array experiments, to the strategic convergence of
results, and to quality control measures may be limited. Our
premise is that optimal analysis requires an accounting and
control of the many sources of variance within the system, the
structuring of experiments to optimally answer specific
questions, the ability to make sense of the results through
intelligent database interrogation, and then the finality of
data validation. We will describe the sources and impact of
technical and analytical error, offer solutions to circumvent
these problems, and discuss experiment-appropriate design and
validation through experimental and database interrogations.
Specific mention will be made of strategic design whereby
convergence of the results from a series of experiments using
different systems can be used to uncover fundamental
biological truths.
Common parameters in parallel
regressions: extracting information from within-array
replicate spots
Gordon Smyth, Walter and Eliza Hall Institute of Medical
Research, Australia
Spotted microarrays are printed robotically from DNA
plates. Very often the robot is programmed to print more than
one spot from each DNA well on each array resulting in
within-array replicate spots for each gene. Within-array
replicate spots are heavily correlated through spatial
proximity on the same array and hence the usual approach is to
average the results of the replicate spots before undertaking
further analysis. This talk shows that substantial information
about gene variability and hence differential expression can
be extracted from the within-array replicates by analysing the
replicates individually using a pooled correlation estimator.
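The following sketch conveys the spirit of the pooled estimator: per-gene correlations between the two duplicate spots are computed across arrays and pooled on Fisher's z scale, giving a single correlation that can then be used when combining duplicates gene by gene (their average has variance proportional to (1 + rho)/2). This moment-style version is a simplification of the estimator in the talk.

    import numpy as np

    def pooled_duplicate_correlation(M1, M2):
        """M1, M2: genes x arrays log-ratios of the two duplicate spots."""
        r = np.array([np.corrcoef(a, b)[0, 1] for a, b in zip(M1, M2)])
        z = np.arctanh(np.clip(r, -0.99, 0.99))   # Fisher's z transform
        return np.tanh(z.mean())                  # single pooled correlation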
Recognizing and dealing with
common problems in proteomic mass spectrometry
Keith Baggerly, M.D. Anderson Cancer Center
What types of problems are commonly encountered in
proteomics? How should we design an experiment in this
context? Why do peaks change shape with mass? How do we define
"a peak"?
In this talk the speaker will present some answers to the
above questions, with illustrations drawn from case studies
encountered at MD Anderson. This talk will focus primarily on
MALDI and SELDI as described in Friday's lecture, but time
permitting the speaker will touch briefly on one or two other
modalities being explored.
Smoothing applications in
microarray analysis
Paul Eilers, Leiden University Medical Centre, Netherlands
In many areas of microarray analysis smoothing can be
applied fruitfully. I will present the application of
penalized likelihood (P-splines) to the following (a minimal
sketch of the smoother appears after the list):
- Trend correction in MA plots, to improve normalization.
P-splines can be modified to make an extremely fast
smoother.
- Improved presentation of scatterplots. The many dots in
scatterplots can hide the patterns and they make display and
printing slow. Fast smoothing of a two-dimensional histogram
and color-coded display can help.
- Modelling and correction of spatial trends and pin
effects in background and signal estimates. Tensor products
of B-splines and spatial penalties give an effective 2-D
smoother.
- Presentation and analysis of time series data. Smooth
trends improve displays and help to define distance measures
between curves.
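A minimal sketch of the penalized least-squares smoother underlying these applications, in the special case of one coefficient per observation (the Whittaker smoother): it solves (I + lambda * D'D) z = y with a difference penalty matrix D. With a B-spline basis B the same computation becomes (B'B + lambda * D'D) a = B'y, which is the P-spline form; lambda controls the smoothness.

    import numpy as np

    def whittaker(y, lam=100.0, d=2):
        """Penalized least-squares smoothing of a series y."""
        n = len(y)
        D = np.diff(np.eye(n), n=d, axis=0)   # d-th order difference matrix
        return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)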