


Workshop on Genomics
(14 - 17 Nov 2005)


An algorithm for choosing significant PCA components on expression microarrays
I-ping Tu, Institute of Statistical Science Academia Sinica, Taiwan

PCA (Principal Component Analysis) is one of the oldest and best-known statistical tools for multivariate analysis. Even earlier, the same technique was applied fruitfully by physicists to solve for the motion of rigid bodies in classical mechanics. In the example of the motion of a top, the first principal component (with the largest eigenvalue) of the moment of inertia is the direction of the axis around which the top can spin stably. Adopting this insight, we propose an algorithm based on robustness properties to choose statistically significant components of PCA. We will use a microarray data set to demonstrate this algorithm.


A Bayes regression approach to array-CGH data
I-Shou Chang, National Health Research Institute, Taiwan

This paper develops a Bayes regression approach for the analysis of array-CGH data by utilizing not only the underlying spatial structure of the genomic alterations but also the observation that the noise associated with the ratio of the fluorescence intensities is larger when the intensities get smaller. We show that this Bayes regression approach is particularly suitable for the analysis of cDNA microarray-CGH data, which are generally noisier than those using genomic clones. A simulation study and a real data analysis are included to illustrate this approach.


Superiority of spaced seeds for genomic sequence comparison
Kwok Pui Choi, National University of Singapore

Homology search, or local alignment, finds similar segments between two DNA or protein sequences. It is the most fundamental task in bioinformatics. In the design of index-based homology search programs, as exemplified by BLAST, spaced seeds are observed to be more sensitive than consecutive seeds. However, it is challenging to elucidate the mechanism that confers power on spaced seeds. This talk presents our recent work towards this open problem.
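The notion of a seed "hit" can be made concrete with a minimal sketch. It is not taken from the talk: the function names, the toy seeds `111` and `1101`, and the model of independent matches with probability `p` are all illustrative assumptions. An alignment is encoded as a string of 1s (matches) and 0s (mismatches), and a seed hits if all of its 1-positions line up with matches at some offset.

```python
from itertools import product


def seed_hits(seed: str, match_string: str) -> bool:
    """Return True if `seed` hits `match_string` at some offset.

    seed: '1' marks a position that must match, '0' a don't-care position.
    match_string: '1' where two aligned sequences agree, '0' where they differ.
    """
    required = [i for i, c in enumerate(seed) if c == "1"]
    for start in range(len(match_string) - len(seed) + 1):
        if all(match_string[start + i] == "1" for i in required):
            return True
    return False


def sensitivity(seed: str, length: int, p: float) -> float:
    """Exact hit probability over all 0/1 strings of the given length.

    Assumes each position matches independently with probability p
    (an illustrative model). Brute force, so only usable for short lengths.
    """
    total = 0.0
    for bits in product("01", repeat=length):
        s = "".join(bits)
        if seed_hits(seed, s):
            k = s.count("1")
            total += p ** k * (1 - p) ** (length - k)
    return total


# Two seeds of equal weight (three required matches each).
CONSECUTIVE = "111"
SPACED = "1101"

# A region the spaced seed detects but the consecutive seed misses.
region = "1101101"
print(seed_hits(CONSECUTIVE, region), seed_hits(SPACED, region))
```

Comparing `sensitivity(CONSECUTIVE, n, p)` with `sensitivity(SPACED, n, p)` for a few small `n` and `p` gives a hands-on feel for the comparison the abstract describes. The brute-force enumeration is only a teaching device.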


Detection of genes for ordinal traits in nuclear families and a unified approach for association studies
Heping Zhang, Yale University

There is growing interest in genome-wide association analysis using single-nucleotide polymorphisms (SNPs), because traditional linkage studies are not as powerful in identifying genes for common, complex diseases. A variety of tests for linkage disequilibrium have been developed and examined for binary and quantitative traits. However, since many human conditions and diseases are measured on an ordinal scale, methods need to be developed to investigate the association of genes and ordinal traits. Thus, in the current study we propose and derive a score test statistic that identifies genes that are associated with ordinal traits when gametic disequilibrium between a marker and trait loci exists. Through simulation, the performance of this new test is examined for both ordinal and quantitative traits. The proposed statistic not only accommodates ordinal traits and has superior power for them, but also has power similar to that of existing tests when the trait is quantitative. Therefore, our proposed statistic has the potential to serve as a unified approach to identifying genes that are associated with any trait, regardless of how the trait is measured.


The effect of missing information on gene mapping
Benjamin Yakir, The Hebrew University of Jerusalem and National University of Singapore

Many of the commonly used techniques for gene mapping are formulated in terms of unobservable quantities. Examples include identity-by-descent relations in human linkage analysis, haplotypes in association studies, and the population origin of an allele in admixture mapping. In all these cases the quantities need to be inferred from the observed genotypes. In this talk we will discuss some of the issues involved in statistical inference in the context of missing information and try to identify the major factors that have an impact on statistical power.


Phylogeny via an EM algorithm based on a general nucleotide substitution model
Von Bing Yap, National University of Singapore

DNA sequences are routinely used to reconstruct phylogenies, i.e., evolutionary relationships among organisms. An increasingly popular approach is to lean on a Markov nucleotide substitution model, and do maximum likelihood or Bayesian inference. Most models involve rather restrictive constraints, for example, time reversibility and that all branch transition matrices are generated by the same rate matrix, in order to reduce the number of model parameters. In this talk, the most general Markov model, which includes the usual models as special cases, will be discussed. It turns out that the new optimisation problems are comparatively much easier. Indeed, a simple EM algorithm can be used to do maximum likelihood estimation, and to solve the mathematical problem of determining the model parameters, given a joint distribution of the leaf states on a given phylogeny. Some argument/evidence will be given for the view that the large number of parameters in the general model may not hurt phylogeny reconstruction.


Model selection in irregular problems: applications to gene mapping and CGH
David Siegmund, Stanford University and National University of Singapore

I discuss two methods of model selection for change-point-like problems arising in genetic linkage analysis. The first is a method that selects the model with the smallest p-value, while the second is a modification of the Bayes Information Criterion (BIC). The methods are compared theoretically and on examples from the literature. For these examples, the methods are roughly comparable, although the p-value based method is somewhat more liberal in selecting a high dimensional model. The BIC for a standard change-point formulation with applications to comparative genomic hybridization (CGH) is also discussed. This is joint research with N. Zhang.

- Bogdan, M., Doerge, R. and Ghosh, J. K. (2004). How to modify Schwarz Bayesian information criterion to locate multiple interacting quantitative trait loci, Genetics 167, 989-999.
- Broman, K. and Speed, T. (2002). A model selection approach for the identification of quantitative trait loci in experimental crosses, J. R. Statist. Soc. B 64, Part 4, 1-16.
- Olshen, A., Venkatraman, E., Lucito, R. and Wigler, M. (2004). Circular binary segmentation for the analysis of array based DNA copy number data, Biostatistics 5, 557-572.
- Sen, S. and Churchill, G. A. (2001). A statistical framework for quantitative trait mapping, Genetics 159, 371-87.
- Siegmund, D. (2004). Model selection in irregular problems: applications to mapping QTLs, Biometrika 91, 785-800.
- Zhang, N. and Siegmund, D. (2005). A modified Bayes information criterion with applications to the analysis of comparative genomic hybridization data, submitted for publication.


DNPTrapper: an assembly editing tool for finishing of complex repeat regions
Erik Arner, Karolinska Institute, Sweden

The emergence of high-throughput methods for genome sequencing, in combination with increased computer power and better algorithms for sequence assembly, has yielded a plethora of genomes accessible for analysis. However, complicated parts of sequenced genomes tend to be left unfinished to a large extent. This is due to a lack of proper tools specifically designed to resolve complex regions, including repeated regions and cases where homologs show a high degree of polymorphism. Virtually all genomes sequenced have complex regions to some extent. In many cases, the complex regions encountered have biological function. Examples are repeated surface antigen genes in the parasite Trypanosoma cruzi, or repeated splice leader sequences in its relative Leishmania major. If the goal is to obtain a better understanding of the biology of these organisms, the complex, repeated regions need to be resolved.

Previously described methods for finishing complicated genomes include the use of mate pairs and defined nucleotide positions, DNPs, that represent single base differences between repeat copies. We propose that a combination of these two approaches will constitute a powerful tool for automatic resolving of nearly identical repeats. We illustrate this principle in DNPTrapper, an assembly editing and visualization tool specifically designed for finishing of complex regions. DNPTrapper makes it possible to perform DNP and mate pair analysis on assemblies and subsequently resolve repeats in a semi-automatic fashion using the combined information. The program offers flexibility in the editing choices available, allowing for testing of alternative solutions to the problem at hand. We describe how DNPTrapper is being put to use for resolving repeated regions in T. cruzi and L. major, but the program is applicable for finishing of complex regions in any organism.

DNPTrapper relies on Open Source software for the graphical user interface and the underlying database. DNPTrapper itself will shortly be released under an Open Source license. The program is designed to be easy to extend, with a flexible plug-in system and a well-documented API. This makes it straightforward to add new visualizable features, supported file formats, and algorithms.


Characterization of the maximal score of optimal pairwise local alignments
Nancy Zhang, Stanford University

This problem is inspired by the comparison of protein and DNA sequences. We ask the question: For which scoring functions does the optimal local alignment score grow logarithmically with sequence length? We define the concept of "Local Optimality" and use it to prove a sufficient condition on the scoring parameters for logarithmic growth of the optimal score for gapped alignments. "Local Optimality" refers to the fact that in an optimal alignment, any local changes around gaps should not increase the overall score. We use numerical studies to compare our local optimality based result to previous results and also draw some theoretical connections. This gives a new theoretical proof that some commonly used scoring functions are in the logarithmic region, and provides a more accurate large deviations rate for the p-value of the optimal score.


Asymptotics of the local alignment score for non-affine gap penalties
Hock Peng Chan, National University of Singapore

The computation of local alignment scores in DNA or protein sequences takes into account penalties due to gaps in the alignments. Though affine gap penalties are in widespread use due to their computational ease, empirical evidence and the underlying biological mechanism support the consideration of non-affine penalties that are small compared to the length of the gaps. We provide here asymptotics of the growth rate of the local alignment scores and determine the types of non-affine gap penalties that are statistically useful.
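To make the role of the gap penalty concrete, here is a minimal sketch of local alignment scoring with an arbitrary gap penalty function, using the classical cubic-time dynamic programming recursion. It is not the method of the talk. The match and mismatch scores, the function names, and the particular affine and logarithmic penalties are illustrative assumptions.

```python
import math
from typing import Callable


def local_alignment_score(a: str, b: str,
                          gap_penalty: Callable[[int], float],
                          match: float = 1.0,
                          mismatch: float = -1.0) -> float:
    """Best local alignment score of a and b; a gap of length k costs gap_penalty(k)."""
    n, m = len(a), len(b)
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score = max(0.0, H[i - 1][j - 1] + s)
            # A gap of any length k may end here, in either sequence.
            for k in range(1, i + 1):
                score = max(score, H[i - k][j] - gap_penalty(k))
            for k in range(1, j + 1):
                score = max(score, H[i][j - k] - gap_penalty(k))
            H[i][j] = score
            best = max(best, score)
    return best


def affine(k: int, gap_open: float = 2.0, gap_extend: float = 1.0) -> float:
    """Affine penalty: linear in the gap length (hypothetical parameters)."""
    return gap_open + gap_extend * (k - 1)


def logarithmic(k: int, gap_open: float = 2.0, scale: float = 1.0) -> float:
    """A non-affine penalty growing much more slowly than the gap length."""
    return gap_open + scale * math.log(k)


x, y = "ACGTACGT", "ACGTTTACGT"
print(local_alignment_score(x, y, affine), local_alignment_score(x, y, logarithmic))
```

Because the logarithmic penalty here never exceeds the affine one, the score under it is at least as large. The asymptotic question in the abstract is which such penalties still keep the score in a statistically useful growth regime.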


Chromosome rearrangements in evolution and cancer
Guillaume Bourque, Genome Institute of Singapore

In recent years, impressive sequencing and comparative mapping endeavors have made available numerous detailed whole-genome sequences and maps. One of the stated goals of these projects is to better our understanding of evolution through comparative analyses. Our main focus is the comparison of the relative order of conserved segments and the recovery of a rearrangement scenario that best explains the observed architectures. We will summarize our contributions to one such analysis involving 8 mammalian genomes (3 sequenced and 5 with dense Radiation-Hybrid maps).

Closely related to the chromosomal rearrangements observed in evolution are the chromosomal aberrations found in cancer. Although full scale sequencing of these tumor genomes would provide great insights into the disease, the costs remain prohibitive. Nevertheless, a few alternative approaches can mine some of the unique features of these aberrant genomes. We will present one such approach that uses a novel Pair-End-Tags (ditag) sequencing technology. By carefully classifying the different types of ditags, we will show that we can identify rearrangement breakpoints in the cancer genome.


Genetic factors influencing Tb susceptibility
Mark Seielstad, Harvard University and Genome Institute of Singapore

One third of humanity is infected by Mycobacterium tuberculosis and more than two million people die from the infection each year. And yet, despite this awful toll, only a tenth of the infected billions will ever succumb to or even exhibit symptoms of the disease. This bespeaks a major role for genetic variability in determining the outcome of mycobacterial exposure and infection – a role that, together with significant environmental exposures, has been substantiated by heritability and other analyses over the decades. Identifying the relevant genetic variation has stymied investigators for some time, with most progress to date arising from studies of severe Mendelian defects in pathways conferring unusual susceptibility to mycobacterial infections. In a preliminary analysis of two distinct data sets, comprising (1) ~10,000 SNPs distributed throughout the human genome and genotyped in 50 active tuberculosis cases matched with 50 household and community controls and (2) ~110,000 SNPs genotyped in 120 active cases and 120 controls, we have seen statistically significant associations for numerous SNPs. Given the study design, many of these are expected to be spurious, but we see evidence that many of these associations result from genuine involvement in susceptibility to tuberculosis. We are attempting to validate some of these associations by genotyping 3,000 additional SNPs in a larger collection of 500 cases and 500 controls from the same Jakarta population. In addition to the likelihood of uncovering variation that contributes to Tb susceptibility, this study will provide an early assessment of whole genome association approaches, which are currently poised to completely revolutionize the conduct and success of genetic association studies.


Statistics of runs of multiletter alphabet and their applications to biological sequence analysis
Yong Kong, National University of Singapore

Exact distributions of run statistics are traditionally obtained by using combinatorial methods, which in certain situations become very tedious. Run distributions of multiple object systems, although they appear frequently in applications from various fields such as computational biology, are not commonly used, partly due to the lack of easy-to-use formulas. In this presentation, a method for evaluating partition functions of lattice models in the field of statistical mechanics is used to develop a systematic method to study various run statistics in multiple object systems. By using particular generating functions for the specified situation under study, many new distributions can be obtained in a unified and coherent way. The method makes it possible to manipulate formulas of run statistics by using binomial identities to obtain more general, yet at the same time simpler formulas. To illustrate the applications of the general method, the distributions of the total number of runs and the m-th longest runs are investigated. Novel and general explicit formulas are derived for these distributions. In addition, some classical run statistics are recovered and generalized in the same unified way. As examples of applications to biological sequence analysis in computational biology and bioinformatics, the run statistics developed using the general method are applied to several protein sequences to look at their global and local features.
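The two statistics named above, the total number of runs and the m-th longest run, can be computed directly for any sequence. The sketch below is illustrative only: the function names are assumptions, and the exact distribution is obtained by brute-force enumeration of arrangements rather than by the generating function method of the talk, so it is practical only for very short sequences.

```python
from collections import Counter
from itertools import groupby, permutations


def runs(sequence):
    """Split a sequence into maximal runs, returned as (symbol, length) pairs."""
    return [(symbol, sum(1 for _ in group)) for symbol, group in groupby(sequence)]


def total_runs(sequence) -> int:
    """Total number of runs in the sequence."""
    return len(runs(sequence))


def mth_longest_run(sequence, m: int = 1) -> int:
    """Length of the m-th longest run (0 if there are fewer than m runs)."""
    lengths = sorted((length for _, length in runs(sequence)), reverse=True)
    return lengths[m - 1] if m <= len(lengths) else 0


def total_runs_distribution(letters) -> Counter:
    """Exact distribution of the total number of runs over all distinct
    arrangements of the given letters, by brute-force enumeration."""
    arrangements = set(permutations(letters))
    return Counter(total_runs(arrangement) for arrangement in arrangements)


seq = "AAABBCAAAA"
print(runs(seq))                       # [('A', 3), ('B', 2), ('C', 1), ('A', 4)]
print(total_runs(seq))                 # 4
print(mth_longest_run(seq, 2))         # 3
print(total_runs_distribution("AAB"))  # two arrangements with 2 runs, one with 3
```

Enumeration of this kind is useful mainly as a check on closed-form formulas for small cases.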


The occurrence and exploitation of simple tandem repeats in bacterial and human genomes
Eric Yap, DSO National Laboratories, Singapore

A tandem repeat is an occurrence of two or more adjacent, often approximate copies of a sequence of nucleotides. Simple tandemly repeated (STR) sequences have been found to be common and ubiquitous motifs in the known genomes of all eukaryotes (including human, plant, animal and single-celled organisms) and prokaryotes (bacteria). In humans, expansion mutations of STRs (triplet repeats) cause inherited neurological diseases including fragile-X mental retardation and Huntington's disease, and are associated with other diseases and traits. STRs have been commonly used as genetic markers for gene mapping by linkage and linkage disequilibrium analysis, genetic profiling for forensics, and molecular epidemiological tracing of bacterial strains. Therefore we have been motivated to mine genomes for STR markers, predict their variability between individuals (polymorphisms) and hence their utility as genetic markers, study factors that account for their genomic distribution and biological function, and develop novel lab methods for analysing them and exploiting their genotypic information. I will attempt to illustrate these with applications in human population genetics and infection outbreak investigation.
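As an illustration of what mining a genome for STRs involves at its simplest, the sketch below finds exact tandem repeats with a regular expression. It is not the speaker's method: the function name, the unit length limit of 6, and the minimum of 3 copies are illustrative assumptions, and approximate copies, which the definition above allows, are not detected.

```python
import re


def find_tandem_repeats(seq: str, max_unit: int = 6, min_copies: int = 3):
    """Find exact tandem repeats in a DNA string.

    Returns (start, unit, copies) tuples for each non-overlapping repeat
    whose unit has at most `max_unit` bases and occurs at least
    `min_copies` times in a row. At each position the shortest repeating
    unit is preferred.
    """
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1)
    )
    return [
        (m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
        for m in pattern.finditer(seq)
    ]


# A CAG triplet repeated four times, flanked by other bases.
print(find_tandem_repeats("TTCAGCAGCAGCAGTT"))  # [(2, 'CAG', 4)]
```

The copy number reported for each locus is the quantity that varies between individuals and makes such loci useful as markers.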
