Workshop on Genomics
(14 - 17 Nov 2005)
An algorithm for choosing significant
PCA components on expression microarrays
I-ping Tu, Institute of Statistical Science, Academia Sinica,
Taiwan
PCA (Principal Component Analysis) is one of the oldest and
best known statistical tools for multivariate analysis. Even in
earlier times, PCA was applied fruitfully by physicists in
solving the motion of rigid bodies in classical mechanics. In
the example of the motion of a top, the first principal
component (with the largest eigenvalue) of the moment of inertia
tensor is the direction of the axis around which the top can spin
stably.
Adopting this insight, we propose an algorithm based on
robustness properties to choose statistically significant
components of PCA. We will use a microarray data set to
demonstrate this algorithm.
A Bayes regression approach to
array-CGH data
I-Shou Chang, National Health Research Institute, Taiwan
This paper develops a Bayes regression approach for the
analysis of array-CGH data by utilizing not only the
underlying spatial structure of the genomic alterations but
also the observation that the noise associated with the
ratio of the fluorescence intensities is larger when the
intensities get smaller. We show that this Bayes regression
approach is particularly suitable for the analysis of cDNA
microarray-CGH data, which are generally noisier than those
using genomic clones. A simulation study and a real data
analysis are included to illustrate this approach.
Superiority of spaced seeds for
genomic sequence comparison
Kwok Pui Choi, National University of Singapore
Homology search, or local alignment, finds similar
segments between two DNA or protein sequences. It is the
most fundamental task in bioinformatics. In index-based
homology search program design, as exemplified by BLAST,
spaced seeds are observed to be more sensitive than
consecutive seeds. However, it is challenging to elucidate
the mechanism that confers power to spaced seeds. This talk
presents our recent work towards this open problem.
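As a minimal illustration of the seed idea (not the speaker's analysis), the Python sketch below reports where an ungapped alignment of two equal-length sequences contains a seed hit. A seed is written as a string of 1s (positions that must match) and 0s (positions that are ignored). The function name seed_hits, the example sequences, and the seed 1101 are assumptions made for this example.

```python
def seed_hits(a, b, pattern):
    """Return the start offsets at which sequences a and b agree on every
    '1' position of the seed pattern; '0' positions are don't-care."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    # Offsets inside the seed window that are required to match.
    required = [k for k, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    return [
        i
        for i in range(len(a) - span + 1)
        if all(a[i + k] == b[i + k] for k in required)
    ]


# Hypothetical example pair; position-wise matches are 1 1 0 1 1 1 0 1.
a = "ACGTTGCA"
b = "ACTTTGAA"
print(seed_hits(a, b, "111"))   # consecutive seed of weight 3
print(seed_hits(a, b, "1101"))  # spaced seed of weight 3
```

For these two sequences the consecutive seed 111 hits only at offset 3, while the spaced seed 1101 of the same weight hits at offsets 0 and 4. A single example of this kind does not by itself demonstrate the general sensitivity advantage discussed in the talk.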
Detection of genes for ordinal
traits in nuclear families and a unified approach for
association studies
Heping Zhang, Yale University
There is growing interest in genome-wide association
analysis using single-nucleotide polymorphisms (SNPs),
because traditional linkage studies are not as powerful in
identifying genes for common, complex diseases. A variety of
tests for linkage disequilibrium have been developed and
examined for binary and quantitative traits. However, since
many human conditions and diseases are measured on an
ordinal scale, methods need to be developed to investigate
the association of genes and ordinal traits. Thus, in the
current study we propose and derive a score test statistic
that identifies genes that are associated with ordinal
traits when gametic disequilibrium between a marker and
trait loci exists. Through simulation, the performance of
this new test is examined for both ordinal and
quantitative traits. The proposed statistic not only
accommodates ordinal traits and has superior power for
ordinal traits, but also has power similar to that of existing
tests when the trait is quantitative. Therefore, our proposed
statistic has the potential to serve as a unified approach
to identifying genes that are associated with any trait,
regardless of how the trait is measured.
The effect of missing information
on gene mapping
Benjamin Yakir, The Hebrew University of Jerusalem and
National University of Singapore
Many of the commonly used techniques for gene mapping are
formulated in terms of unobservable quantities. Examples
include identity-by-descent relations in human linkage
analysis, haplotypes in association studies, and the
population origin of an allele in admixture mapping. In all
these cases the quantities need to be inferred from the
observed genotypes. In this talk we will discuss some of
the issues involved in statistical inference in the context of
missing information and try to identify the major factors
that affect statistical power.
Phylogeny via an EM algorithm based on
a general nucleotide substitution model
Von Bing Yap, National University of Singapore
DNA sequences are routinely used to reconstruct
phylogenies, i.e., evolutionary relationships among
organisms. An increasingly popular approach is to lean on a
Markov nucleotide substitution model, and do maximum
likelihood or Bayesian inference. Most models involve rather
restrictive constraints, for example, time reversibility and
that all branch transition matrices are generated by the
same rate matrix, in order to reduce the number of model
parameters. In this talk, the most general Markov model,
which includes the usual models as special cases, will be
discussed. It turns out that the new optimisation problems
are comparatively much easier. Indeed, a simple EM algorithm
can be used to do maximum likelihood estimation, and to
solve the mathematical problem of determining the model
parameters, given a joint distribution of the leaf states on
a given phylogeny. Some argument/evidence will be given for
the view that the large number of parameters in the general
model may not hurt phylogeny reconstruction.
Model selection in irregular
problems: applications to gene mapping and CGH
David Siegmund, Stanford University and National University
of Singapore
I discuss two methods of model selection for
change-point-like problems arising in genetic linkage analysis. The first
is a method that selects the model with the smallest
p-value, while the second is a modification of the Bayes
Information Criterion (BIC). The methods are compared
theoretically and on examples from the literature. For these
examples, the methods are roughly comparable although the
p-value based method is somewhat more liberal in selecting a
high dimensional model. The BIC for a standard change-point
formulation with applications to comparative genomic
hybridization (CGH) is also discussed. This is joint
research with N. Zhang.
References
- Bogdan, M., Doerge, R. and Ghosh, J. K. (2004). How to
modify Schwarz Bayesian information criterion to locate
multiple interacting quantitative trait loci, Genetics
167, 989-999.
- Broman, K. and Speed, T. (2002). A model selection
approach for the identification of quantitative trait loci
in experimental crosses, J. R. Statist. Soc. B 64,
Part 4, 1-16.
- Olshen, A., Venkatraman, E., Lucito, R. and Wigler, M.
(2004). Circular binary segmentation for the analysis of
array based DNA copy number data, Biostatistics 5,
557-572.
- Sen, S. and Churchill, G. A. (2001). A statistical framework
for quantitative trait mapping, Genetics 159,
371-387.
- Siegmund, D. (2004). Model selection in irregular
problems: applications to mapping QTLs, Biometrika
91, 785-800.
- Zhang, N. and Siegmund, D. (2005). A modified Bayes
information criterion with applications to the analysis of
comparative genomic hybridization data, submitted for
publication.
DNPTrapper: an assembly editing
tool for finishing of complex repeat regions
Erik Arner, Karolinska Institute, Sweden
The emergence of high-throughput methods for genome
sequencing, in combination with increased computer power and
better algorithms for sequence assembly, has yielded a
plethora of genomes accessible for analysis. However,
complicated parts of sequenced genomes tend to be left
unfinished to a large extent. This is due to a lack of
proper tools specifically designed to resolve complex
regions, including repeated regions and cases where homologs
show a high degree of polymorphism. Virtually all genomes
sequenced have complex regions to some extent. In many
cases, the complex regions encountered have biological
function. Examples are repeated surface antigen genes in the
parasite Trypanosoma cruzi, or repeated splice leader
sequences in its relative Leishmania major. If the goal is
to obtain a better understanding of the biology of these
organisms, the complex, repeated regions need to be
resolved.
Previously described methods for finishing complicated
genomes include the use of mate pairs and defined nucleotide
positions, DNPs, that represent single base differences
between repeat copies. We propose that a combination of
these two approaches will constitute a powerful tool for
automatically resolving nearly identical repeats. We
illustrate this principle in DNPTrapper, an assembly editing
and visualization tool specifically designed for finishing
of complex regions. DNPTrapper makes it possible to perform
DNP and mate pair analysis on assemblies and subsequently
resolve repeats in a semi-automatic fashion using the
combined information. The program offers flexibility in the
editing choices available, allowing for testing of
alternative solutions to the problem at hand. We describe
how DNPTrapper is being put to use for resolving repeated
regions in T. cruzi and L. major, but the program is
applicable to finishing of complex regions in any organism.
DNPTrapper relies on Open Source software for the
graphical user interface and the underlying database.
DNPTrapper itself will shortly be released under an Open
Source license. The program is designed to be easy to
extend, with a flexible plug-in system and a well-documented
API. This makes it straightforward to add features that can
be visualized, support for new file formats, and new
algorithms.
Characterization of the maximal
score of optimal pairwise local alignments
Nancy Zhang, Stanford University
This problem is inspired by the comparison of protein and
DNA sequences. We ask the question: For which scoring
functions does the optimal local alignment score grow
logarithmically with sequence length? We define the concept
of "Local Optimality" and use it to prove a sufficient
condition on the scoring parameters for logarithmic growth
of the optimal score for gapped alignments. "Local
Optimality" refers to the fact that in an optimal alignment,
any local changes around gaps should not increase the
overall score. We use numerical studies to compare our local
optimality based result to previous results and also draw
some theoretical connections. This gives a new theoretical
proof that some commonly used scoring functions are in the
logarithmic region, and provides a more accurate large
deviations rate for the p-value of the optimal score.
Asymptotics of the local
alignment score for non-affine gap penalties
Hock Peng Chan, National University of Singapore
The computation of local alignment scores in DNA or
protein sequences takes into account penalties due to gaps
in the alignments. Though affine gap penalties are in
widespread use due to their computational ease, empirical
evidence and the underlying biological mechanism have
supported the consideration of non-affine penalties that are
small compared to the length of the gaps. We provide here
asymptotics of the growth rate of the local alignment scores
and determine the types of non-affine gap penalties that are
statistically useful.
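To make the role of the gap penalty concrete, here is a minimal Python sketch of local alignment scoring in which the penalty is an arbitrary function of gap length, so affine and non-affine choices can be swapped in. It uses the textbook general-gap dynamic programme, which takes cubic time, and is not the method of the talk. The match and mismatch scores, the function names, and the logarithmic penalty are assumptions chosen only for illustration.

```python
import math


def local_alignment_score(x, y, gap, match=1.0, mismatch=-1.0):
    """Best local alignment score of x and y, where gap(k) is the penalty
    for a gap of length k (general-gap recursion, cubic time)."""
    n, m = len(x), len(y)
    # H[i][j] = best score of a local alignment ending at x[i-1], y[j-1].
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            score = max(0.0, H[i - 1][j - 1] + s)
            for k in range(1, i + 1):  # gap of length k ending in x
                score = max(score, H[i - k][j] - gap(k))
            for k in range(1, j + 1):  # gap of length k ending in y
                score = max(score, H[i][j - k] - gap(k))
            H[i][j] = score
            best = max(best, score)
    return best


def affine_gap(k):
    # Hypothetical affine penalty: opening cost 2, extension cost 0.5.
    return 2.0 + 0.5 * k


def log_gap(k):
    # Hypothetical non-affine penalty growing logarithmically in gap length.
    return 2.0 + math.log(k)


x, y = "AAAATTTT", "AAAACCTTTT"
print(local_alignment_score(x, y, affine_gap))
print(local_alignment_score(x, y, log_gap))
```

In this toy pair the best alignment matches all eight letters of x and skips the two extra letters of y in a single gap, so the score is 8 minus the penalty for a gap of length 2 under either penalty function.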
Chromosome rearrangements in
evolution and cancer
Guillaume Bourque, Genome Institute of Singapore
In recent years, impressive sequencing and comparative
mapping endeavors have made available numerous detailed
whole-genome sequences and maps. One of the stated goals of
these projects is to better our understanding of evolution
through comparative analyses. Our main focus is the
comparison of the relative order of conserved segments and
the recovery of a rearrangement scenario that best explains
the observed architectures. We will summarize our
contributions to one such analysis involving 8 mammalian
genomes (3 sequenced and 5 with dense Radiation-Hybrid
maps).
Closely related to the chromosomal rearrangements observed in
evolution are the chromosomal aberrations found in cancer.
Although full scale sequencing of these tumor genomes would
provide great insights into the disease, the costs remain
prohibitive. Nevertheless, a few alternative approaches can
mine some of the unique features of these aberrant genomes.
We will present one such approach that uses a novel
Pair-End-Tags (ditag) sequencing technology. By carefully
classifying the different types of ditags, we will show that
we can identify rearrangement breakpoints in the cancer
genome.
Genetic factors influencing
Tb susceptibility
Mark Seielstad, Harvard University and Genome Institute
of Singapore
One third of humanity is infected by Mycobacterium
tuberculosis and more than two million people die from
the infection each year. And yet, despite this awful toll,
only a tenth of the infected billions will ever succumb to
or even exhibit symptoms of the disease. This bespeaks a
major role for genetic variability in determining the
outcome of mycobacterial exposure and infection – a role
that, together with significant environmental exposures, has
been substantiated by heritability and other analyses over
the decades. Identifying the relevant genetic variation has
stymied investigators for some time, with most progress to
date arising from studies of severe Mendelian defects in
pathways conferring unusual susceptibility to mycobacterial
infections. In a preliminary analysis of two distinct data
sets comprising (1) ~10,000 SNPs distributed throughout
the human genome and genotyped in 50 active tuberculosis
cases matched with 50 household and community controls and
(2) ~110,000 SNPs genotyped in 120 active cases and 120
controls, we have seen statistically significant
associations for numerous SNPs. Given the study design,
many of these are expected to be spurious, but we see evidence
that many of these associations result from genuine involvement in
susceptibility to tuberculosis. We are attempting to
validate some of these associations by genotyping 3,000
additional SNPs in a larger collection of 500 cases and 500
controls from the same Jakarta population. In addition to
the likelihood of uncovering variation that contributes to
Tb susceptibility, this study will provide an early
assessment of whole genome association approaches, which are
currently poised to completely revolutionize the conduct and
success of genetic association studies.
Statistics of runs of multiletter
alphabet and their applications to biological sequence
analysis
Yong Kong, National University of Singapore
Exact distributions of run statistics are traditionally
obtained by using combinatorial methods, which under certain
situations become very tedious. Run distributions of
multiple object systems, although they appear frequently in
applications from various fields such as computational
biology, are not commonly used, partially due to the lack of
easy-to-use formulas. In this presentation, a method for
evaluating partition functions of lattice models in the
field of statistical mechanics is used to develop a
systematic method to study various run statistics in
multiple object systems. By using particular generating
functions for the specified situation under study, many new
distributions can be obtained in a unified and coherent way.
The method makes it possible to manipulate formulas of run
statistics by using binomial identities to obtain more
general, yet at the same time simpler formulas. To
illustrate the applications of the general method, the
distributions of the total number of runs and the m-th
longest runs are investigated. Novel and general explicit
formulas are derived for these distributions. In addition,
some classical run statistics are recovered and generalized
in the same unified way. As examples of applications to
biological sequence analysis in computational biology and
bioinformatics, the run statistics developed using the
general method are applied to several protein sequences to
look at their global and local features.
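As a small illustration of the quantities involved (the observed statistics only, not the exact distributions derived in the talk), the Python sketch below computes the total number of runs and the m-th longest run length in a sequence over a multiletter alphabet. The function names are hypothetical, and the m-th longest run is taken here over runs of all letters together, which is an assumption about the intended definition.

```python
from itertools import groupby


def run_lengths(seq):
    """Split seq into maximal runs of identical letters.
    Returns a list of (letter, run_length) pairs in order of appearance."""
    return [(letter, sum(1 for _ in group)) for letter, group in groupby(seq)]


def run_summary(seq, m=1):
    """Return (total number of runs, length of the m-th longest run).
    The second entry is None if the sequence has fewer than m runs."""
    runs = run_lengths(seq)
    lengths = sorted((length for _, length in runs), reverse=True)
    mth_longest = lengths[m - 1] if m <= len(lengths) else None
    return len(runs), mth_longest


# Hypothetical example sequence with runs A3, B2, C1, A4.
print(run_lengths("AAABBCAAAA"))
print(run_summary("AAABBCAAAA", m=2))
```

For the example sequence there are four runs, the longest has length 4 and the second longest has length 3.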
The occurrence and exploitation of
simple tandem repeats in bacterial and human genomes
Eric Yap, DSO National Laboratories, Singapore
A tandem repeat is an occurrence of two or more adjacent,
often approximate copies of a sequence of nucleotides.
Simple tandemly repeated (STR) sequences have been found to
be common and ubiquitous motifs in the known genomes of all
eukaryotes (including human, plant, animal and
single-celled organisms) and prokaryotes (bacteria). In
humans, expansion mutations of STRs (triplet repeats) cause
inherited neurological diseases including fragile-X mental
retardation and Huntington's disease, and are associated
with other diseases and traits. STRs have been commonly used
as genetic markers for gene mapping by linkage and linkage
disequilibrium analysis, genetic profiling for forensics,
and molecular epidemiological tracing of bacterial strains.
Therefore we have been motivated to mine genomes for STR
markers, predict their variability between individuals
(polymorphisms) and hence utility as genetic markers, study
factors that account for their genomic distribution and
biological function, and develop novel lab methods for
analysing them and exploiting their genotypic information. I
will attempt to illustrate these with applications in human
population genetics and infection outbreak investigation.
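The mining step can be illustrated with a minimal Python sketch that scans a DNA string for exact simple tandem repeats using a regular expression with a backreference. It is a simplified stand-in, not the speaker's pipeline: it finds only exact copies, whereas the definition above also allows approximate ones. The function name and the default thresholds (unit length up to 6, at least 3 copies) are assumptions.

```python
import re


def find_tandem_repeats(seq, max_unit=6, min_copies=3):
    """Find exact tandem repeats in an uppercase DNA string.
    Returns a list of (start, unit, copies) tuples, scanning left to right."""
    # Lazy quantifier: the shortest repeat unit is tried first, so a run
    # such as AAAAAA is reported with unit A rather than AA or AAA.
    pattern = re.compile(
        rf"([ACGT]{{1,{max_unit}}}?)\1{{{min_copies - 1},}}"
    )
    repeats = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        repeats.append((match.start(), unit, copies))
    return repeats


# Hypothetical example: four copies of the triplet CAG flanked by TT.
print(find_tandem_repeats("TTCAGCAGCAGCAGTT"))
```

On the example string the function reports a single repeat starting at offset 2 with unit CAG and 4 copies. Because matching proceeds from left to right, a repeat may be reported under a rotated unit (for example GCA instead of CAG) when the flanking sequence happens to extend it.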