                   Workshop on Genomics 
                  (14 - 17 Nov 2005)
                  
                  
An algorithm for choosing significant PCA components on expression microarrays
I-ping Tu, Institute of Statistical Science, Academia Sinica, Taiwan
PCA (Principal Component Analysis) is one of the oldest and
best-known statistical tools for multivariate analysis. Even
earlier, PCA was applied fruitfully by physicists in solving
the motion of rigid bodies in classical mechanics. In the
example of a spinning top, the first principal component (the
one with the largest eigenvalue) of the moment of inertia
tensor is the direction of the axis around which the top can
spin stably. Adopting this insight, we propose an algorithm
based on robustness properties to choose statistically
significant PCA components. We will use a microarray data set
to demonstrate this algorithm.

A Bayes regression approach to array-CGH data
I-Shou Chang, National Health Research Institute, Taiwan
					This paper develops a Bayes regression approach for the 
					analysis of array-CGH data by utilizing not only the 
					underlying spatial structure of the genomic alterations but 
					also the observation that the noise associated with the 
					ratio of the fluorescence intensities is larger when the 
					intensities get smaller. We show that this Bayes regression 
					approach is particularly suitable for the analysis of cDNA 
					microarray-CGH data, which are generally noisier than those 
					using genomic clones. A simulation study and a real data 
					analysis are included to illustrate this approach. 

Superiority of spaced seeds for genomic sequence comparison
Kwok Pui Choi, National University of Singapore
Homology search, or local alignment, finds similar
segments between two DNA or protein sequences. It is the
most fundamental task in bioinformatics. In index-based
homology search programs, as exemplified by BLAST,
spaced seeds are observed to be more sensitive than
consecutive seeds. However, it is challenging to elucidate
the mechanism that confers this power on spaced seeds. This
talk presents our recent work towards this open problem.
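As an illustration of the seed terminology (not taken from the talk), the following minimal Python sketch treats a seed as a pattern of 1s (positions that must match) and 0s (positions that are ignored) and checks whether it hits an ungapped similarity region, encoded as a string of 1s (matches) and 0s (mismatches). The function names, the example patterns, and the exhaustive enumeration used to estimate sensitivity under an independent match probability p are all assumptions made for illustration; the enumeration is only feasible for short regions.

```python
from itertools import product


def seed_hits(seed: str, region: str) -> bool:
    """Return True if `seed` hits somewhere in `region`.

    seed:   pattern over {'1', '0'}; '1' = position must be a match,
            '0' = position is ignored ("don't care").
    region: ungapped alignment over {'1', '0'}; '1' = match column.
    """
    required = [i for i, c in enumerate(seed) if c == "1"]
    span = len(seed)
    for start in range(len(region) - span + 1):
        if all(region[start + i] == "1" for i in required):
            return True
    return False


def hit_probability(seed: str, length: int, p: float) -> float:
    """Probability that `seed` hits a random region of the given length.

    Assumes each column is a match independently with probability p.
    Brute force over all 2**length regions, so keep `length` small.
    """
    total = 0.0
    for bits in product("10", repeat=length):
        region = "".join(bits)
        if seed_hits(seed, region):
            k = region.count("1")
            total += p ** k * (1 - p) ** (length - k)
    return total


if __name__ == "__main__":
    # Two weight-3 seeds: a consecutive one and a spaced one.
    for seed in ("111", "1101"):
        print(seed, hit_probability(seed, length=12, p=0.7))
```

Running the script prints the estimated sensitivity of each seed, which gives a small-scale way to compare consecutive and spaced patterns of equal weight under this simplified model.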

Detection of genes for ordinal traits in nuclear families and a unified approach for association studies
Heping Zhang, Yale University
					There is growing interest in genome-wide association 
					analysis using single-nucleotide polymorphisms (SNPs), 
					because traditional linkage studies are not as powerful in 
					identifying genes for common, complex diseases. A variety of 
					tests for linkage disequilibrium have been developed and 
					examined for binary and quantitative traits. However, since 
many human conditions and diseases are measured on an
ordinal scale, methods need to be developed to investigate
the association of genes and ordinal traits. Thus, in the
current study we propose and derive a score test statistic
that identifies genes that are associated with ordinal
traits when gametic disequilibrium exists between a marker
and trait loci. Through simulation, the performance of
this new test is examined for both ordinal and
quantitative traits. The proposed statistic not only
accommodates ordinal traits and has superior power for
them, but also has power similar to that of existing tests
when the trait is quantitative. Therefore, our proposed
					statistic has the potential to serve as a unified approach 
					to identifying genes that are associated with any trait, 
					regardless of how the trait is measured. 

The effect of missing information on gene mapping
Benjamin Yakir, The Hebrew University of Jerusalem and National University of Singapore
					Many of the commonly used techniques for gene mapping are 
					formulated in terms of unobservable quantities. Examples 
include identity-by-descent relations in human linkage
analysis, haplotypes in association studies, and the
population origin of an allele in admixture mapping. In all
these cases the quantities need to be inferred from the
observed genotypes. In this talk we will discuss some of the
issues involved in statistical inference in the context of
missing information and try to identify the major factors
that have an impact on statistical power.

Phylogeny via an EM algorithm based on a general nucleotide substitution model
Von Bing Yap, National University of Singapore
					DNA sequences are routinely used to reconstruct 
					phylogenies, i.e., evolutionary relationships among 
					organisms. An increasingly popular approach is to lean on a 
					Markov nucleotide substitution model, and do maximum 
					likelihood or Bayesian inference. Most models involve rather 
					restrictive constraints, for example, time reversibility and 
					that all branch transition matrices are generated by the 
					same rate matrix, in order to reduce the number of model 
					parameters. In this talk, the most general Markov model, 
					which includes the usual models as special cases, will be 
					discussed. It turns out that the new optimisation problems 
					are comparatively much easier. Indeed, a simple EM algorithm 
					can be used to do maximum likelihood estimation, and to 
					solve the mathematical problem of determining the model 
					parameters, given a joint distribution of the leaf states on 
					a given phylogeny. Some argument/evidence will be given for 
					the view that the large number of parameters in the general 
					model may not hurt phylogeny reconstruction.  

Model selection in irregular problems: applications to gene mapping and CGH
David Siegmund, Stanford University and National University of Singapore
					I discuss two methods of model selection for change-point 
					like problems arising in genetic linkage analysis. The first 
					is a method that selects the model with the smallest 
					p-value, while the second is a modification of the Bayes 
					Information Criterion (BIC). The methods are compared 
					theoretically and on examples from the literature. For these 
					examples, the methods are roughly comparable although the 
					p-value based method is somewhat more liberal in selecting a 
high dimensional model. The BIC for a standard change-point
formulation with applications to comparative genomic
hybridization (CGH) is also discussed. This is joint
research with N. Zhang.
References
- Bogdan, M., Doerge, R. and Ghosh, J. K. (2004). How to
modify Schwarz Bayesian information criterion to locate
multiple interacting quantitative trait loci, Genetics
167, 989-999.
- Broman, K. and Speed, T. (2002). A model selection
approach for the identification of quantitative trait loci
in experimental crosses, J. R. Statist. Soc. B 64,
Part 4, 1-16.
- Olshen, A., Venkatraman, E., Lucito, R. and Wigler, M.
(2004). Circular binary segmentation for the analysis of
array based DNA copy number data, Biostatistics 5,
557-572.
- Sen, S. and Churchill, G. A. (2001). A statistical framework
for quantitative trait mapping, Genetics 159,
371-87.
- Siegmund, D. (2004). Model selection in irregular
problems: applications to mapping QTLs, Biometrika
91, 785-800.
- Zhang, N. and Siegmund, D. (2005). A modified Bayes
information criterion with applications to the analysis of
comparative genomic hybridization data, submitted for
publication.

DNPTrapper: an assembly editing tool for finishing of complex repeat regions
Erik Arner, Karolinska Institute, Sweden
					The emergence of high-throughput methods for genome 
					sequencing, in combination with increased computer power and 
					better algorithms for sequence assembly, has yielded a 
					plethora of genomes accessible for analysis. However, 
					complicated parts of sequenced genomes tend to be left 
					unfinished to a large extent. This is due to a lack of 
					proper tools specifically designed to resolve complex 
					regions, including repeated regions and cases where homologs 
					show a high degree of polymorphism. Virtually all genomes 
					sequenced have complex regions to some extent. In many 
					cases, the complex regions encountered have biological 
					function. Examples are repeated surface antigen genes in the 
					parasite Trypanosoma cruzi, or repeated splice leader 
					sequences in its relative Leishmania major. If the goal is 
					to obtain a better understanding of the biology of these 
					organisms, the complex, repeated regions need to be 
					resolved.  
					Previously described methods for finishing complicated 
					genomes include the use of mate pairs and defined nucleotide 
					positions, DNPs, that represent single base differences 
between repeat copies. We propose that a combination of
these two approaches will constitute a powerful tool for
the automatic resolution of nearly identical repeats. We
					illustrate this principle in DNPTrapper, an assembly editing 
					and visualization tool specifically designed for finishing 
					of complex regions. DNPTrapper makes it possible to perform 
					DNP and mate pair analysis on assemblies and subsequently 
					resolve repeats in a semi-automatic fashion using the 
					combined information. The program offers flexibility in the 
					editing choices available, allowing for testing of 
					alternative solutions to the problem at hand. We describe 
					how DNPTrapper is being put to use for resolving repeated 
					regions in T. cruzi and L. major, but the program is 
					applicable for finishing of complex regions in any organism.
					 
					DNPTrapper relies on Open Source software for the 
					graphical user interface and the underlying database. 
					Shortly, DNPTrapper itself will be released under an Open 
					Source license. The program is designed to be easy to 
					extend, with a flexible plug-in system and a well-documented 
API. This makes it straightforward to add new features that
can be visualized, supported file formats, and algorithms.

Characterization of the maximal score of optimal pairwise local alignments
Nancy Zhang, Stanford University
					This problem is inspired by the comparison of protein and 
					DNA sequences. We ask the question: For which scoring 
					functions does the optimal local alignment score grow 
logarithmically with sequence length? We define the concept
of "Local Optimality" and use it to prove a sufficient
condition on the scoring parameters for logarithmic growth
of the optimal score for gapped alignments. "Local
Optimality" refers to the fact that in an optimal alignment,
					any local changes around gaps should not increase the 
					overall score. We use numerical studies to compare our local 
					optimality based result to previous results and also draw 
					some theoretical connections. This gives new theoretical 
					proof that some commonly used scoring functions are in the 
					logarithmic region, and provides a more accurate large 
					deviations rate for the p-value of the optimal score.  
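For orientation, the quantity studied here, the optimal local alignment score, can be computed with the standard Smith-Waterman dynamic programme. The sketch below assumes a simple scoring function (a match reward, a mismatch penalty, and a linear gap penalty); the function name, parameter names and default values are illustrative assumptions, not the scoring schemes analysed in the talk.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 1,
                          mismatch: int = -1,
                          gap: int = -2) -> int:
    """Optimal local alignment score of `a` and `b` (Smith-Waterman).

    Uses a linear gap penalty: each gap position costs `gap`.
    Only the score is returned; no traceback is performed.
    """
    prev = [0] * (len(b) + 1)  # scores for the previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(
                0,                # start a new local alignment here
                prev[j - 1] + s,  # align a[i-1] with b[j-1]
                prev[j] + gap,    # gap in b
                cur[j - 1] + gap  # gap in a
            )
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    print(local_alignment_score("TTACGTTT", "GGACGGG"))
```

Evaluating this score on random sequences of increasing length, for different choices of the three parameters, is one way to observe empirically whether a scoring function behaves logarithmically or linearly.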

Asymptotics of the local alignment score for non-affine gap penalties
Hock Peng Chan, National University of Singapore
					The computation of local alignment scores in DNA or 
					protein sequences takes into account penalties due to gaps 
in the alignments. Though affine gap penalties are in
widespread use due to their computational ease, empirical
					evidence and the underlying biological mechanism have 
					supported the consideration of non-affine penalties that are 
					small compared to the length of the gaps. We provide here 
					asymptotics of the growth rate of the local alignment scores 
					and determine the types of non-affine gap penalties that are 
					statistically useful. 

Chromosome rearrangements in evolution and cancer
Guillaume Bourque, Genome Institute of Singapore
					In recent years, impressive sequencing and comparative 
					mapping endeavors have made available numerous detailed 
					whole-genome sequences and maps. One of the stated goals of 
					these projects is to better our understanding of evolution 
					through comparative analyses. Our main focus is the 
					comparison of the relative order of conserved segments and 
					the recovery of a rearrangement scenario that best explains 
					the observed architectures. We will summarize our 
					contributions to one such analysis involving 8 mammalian 
					genomes (3 sequenced and 5 with dense Radiation-Hybrid 
					maps).  
					Tied with the chromosomal rearrangements observed in 
					evolution are the chromosomal aberrations found in cancer. 
					Although full scale sequencing of these tumor genomes would 
					provide great insights into the disease, the costs remain 
					prohibitive. Nevertheless, a few alternative approaches can 
					mine some of the unique features of these aberrant genomes. 
					We will present one such approach that uses a novel 
					Pair-End-Tags (ditag) sequencing technology. By carefully 
					classifying the different types of ditags, we will show that 
					we can identify rearrangement breakpoints in the cancer 
					genome. 

Genetic factors influencing Tb susceptibility
Mark Seielstad, Harvard University and Genome Institute of Singapore
					One third of humanity is infected by Mycobacterium 
					tuberculosis and more than two million people die from 
					the infection each year. And yet, despite this awful toll, 
					only a tenth of the infected billions will ever succumb to 
					or even exhibit symptoms of the disease. This bespeaks a 
					major role for genetic variability in determining the 
					outcome of mycobacterial exposure and infection – a role 
					that, together with significant environmental exposures, has 
					been substantiated by heritability and other analyses over 
					the decades. Identifying the relevant genetic variation has 
					stymied investigators for some time, with most progress to 
					date arising from studies of severe Mendelian defects in 
					pathways conferring unusual susceptibility to mycobacterial 
infections. In a preliminary analysis of two distinct data
sets comprising (1) ~10,000 SNPs distributed throughout
the human genome and genotyped in 50 active tuberculosis
cases matched with 50 household and community controls and
(2) ~110,000 SNPs genotyped in 120 active cases and 120
controls, we have seen statistically significant
associations for numerous SNPs. Given the study design,
many of these are expected to be spurious, but we see
evidence that many result from genuine involvement in
susceptibility to tuberculosis. We are attempting to
					validate some of these associations by genotyping 3,000 
					additional SNPs in a larger collection of 500 cases and 500 
					controls from the same Jakarta population. In addition to 
					the likelihood of uncovering variation that contributes to 
					Tb susceptibility, this study will provide an early 
					assessment of whole genome association approaches, which are 
					currently poised to completely revolutionize the conduct and 
					success of genetic association studies.  

Statistics of runs of multiletter alphabet and their applications to biological sequence analysis
Yong Kong, National University of Singapore
					Exact distributions of run statistics are traditionally 
					obtained by using combinatorial methods, which under certain 
situations become very tedious. Run distributions of
multiple object systems, although they appear frequently in
applications from various fields such as computational
biology, are not commonly used, partially due to the lack of
easy-to-use formulas. In this presentation, a method for
					evaluating partition functions of lattice models in the 
					field of statistical mechanics is used to develop a 
					systematic method to study various run statistics in 
					multiple object systems. By using particular generating 
					functions for the specified situation under study, many new 
					distributions can be obtained in a unified and coherent way. 
					The method makes it possible to manipulate formulas of run 
					statistics by using binomial identities to obtain more 
					general, yet at the same time simpler formulas. To 
					illustrate the applications of the general method, the 
					distributions of the total number of runs and the m-th 
					longest runs are investigated. Novel and general explicit 
					formulas are derived for these distributions. In addition, 
					some classical run statistics are recovered and generalized 
					in the same unified way. As examples of applications to 
					biological sequence analysis in computational biology and 
					bioinformatics, the run statistics developed using the 
					general method are applied to several protein sequences to 
					look at their global and local features. 
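To make the run statistics named above concrete, here is a minimal Python sketch that computes the observed total number of runs and the m-th longest run of a sequence over an arbitrary alphabet. It also obtains the exact distribution of the total number of runs by brute-force enumeration of all distinct arrangements, which is only practical for very short sequences and is exactly the kind of computation the generating-function method is meant to replace. All function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import groupby, permutations


def run_lengths(seq):
    """List of (symbol, length) for each maximal run in `seq`."""
    return [(sym, sum(1 for _ in grp)) for sym, grp in groupby(seq)]


def total_runs(seq) -> int:
    """Total number of runs, counted over all symbols."""
    return len(run_lengths(seq))


def mth_longest_run(seq, m: int, symbol=None) -> int:
    """Length of the m-th longest run (m = 1 is the longest).

    If `symbol` is given, only runs of that symbol are considered.
    Returns 0 when fewer than m runs exist.
    """
    lengths = sorted(
        (n for s, n in run_lengths(seq) if symbol is None or s == symbol),
        reverse=True,
    )
    return lengths[m - 1] if m <= len(lengths) else 0


def run_count_distribution(seq) -> Counter:
    """Exact distribution of the total number of runs over all distinct
    arrangements of the symbols in `seq` (brute force, short inputs only)."""
    arrangements = set(permutations(seq))
    return Counter(total_runs(arr) for arr in arrangements)


if __name__ == "__main__":
    s = "AABBBCAAAA"
    print(run_lengths(s), total_runs(s), mth_longest_run(s, 2))
    print(run_count_distribution("AAB"))
```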

The occurrence and exploitation of simple tandem repeats in bacterial and human genomes
Eric Yap, DSO National Laboratories, Singapore
					A tandem repeat is an occurrence of two or more adjacent, 
					often approximate copies of a sequence of nucleotides. 
					Simple tandemly repeated (STR) sequences have been found to 
					be common and ubiquitous motifs in the known genomes of all 
					eukaryotes (including human, plant, animal and single 
					cellular organisms) and prokaryotes (bacteria). In the 
human, expansion mutations of STRs (triplet repeats) cause
inherited neurological diseases including fragile-X mental
retardation and Huntington's disease, and are associated
with other diseases and traits. STRs have been commonly used
as genetic markers for gene mapping by linkage and linkage
disequilibrium analysis, genetic profiling for forensics,
and molecular epidemiological tracing of bacterial strains.
					Therefore we have been motivated to mine genomes for STR 
					markers, predict their variability between individuals 
					(polymorphisms) and hence utility as genetic markers, study 
					factors that account for their genomic distribution and 
					biological function, and develop novel lab methods for 
					analysing them and exploiting their genotypic information. I 
					will attempt to illustrate these with applications in human 
					population genetics and infection outbreak investigation.
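As a simple illustration of mining a sequence for tandem repeats, the following Python sketch scans for exact tandem repeats with a fixed unit length. It is a deliberately minimal example: it ignores the approximate copies mentioned above, reports the leftmost phase of each repeat, and does not check whether a unit is itself a repeat of a shorter unit. The function name and parameters are assumptions for illustration and do not describe the methods used by the speaker.

```python
def find_tandem_repeats(seq: str, unit_len: int, min_copies: int = 2):
    """Find exact tandem repeats with repeat unit of length `unit_len`.

    Returns a list of (start, unit, copies) tuples, scanning left to
    right and skipping past each repeat once it has been reported.
    """
    results = []
    n = len(seq)
    i = 0
    while i + unit_len * min_copies <= n:
        unit = seq[i:i + unit_len]
        copies = 1
        j = i + unit_len
        # Extend while the next block is an exact copy of the unit.
        while seq[j:j + unit_len] == unit:
            copies += 1
            j += unit_len
        if copies >= min_copies:
            results.append((i, unit, copies))
            i = j  # continue after the repeat
        else:
            i += 1
    return results


if __name__ == "__main__":
    # A CAG triplet repeat flanked by other bases.
    print(find_tandem_repeats("TTCAGCAGCAGCAGAA", unit_len=3))
```

Comparing the reported copy numbers at the same locus across individuals or strains is the kind of information that makes such repeats usable as polymorphic markers.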
					 