Statistical Genomics
(1 - 28 Jun 2009)

Jointly organized with the Department of Statistics and Applied Probability in celebration of the 80th Anniversary of the Faculty of Science


~ Abstracts ~

 

Allowing for population structure and cryptic kinship in genetic association studies
David J. Balding, Imperial College, UK


William Astle and David Balding

Population structure is widely recognised as the principal potential confounding factor in genetic association studies, and there has been much debate over the magnitude of its effects and the best approaches to dealing with the problem. "Population structure" can be misleading as it is often interpreted to imply a simple partition of the population into K subpopulations, but the problem can be caused by any systematic differences in the ancestry of cases versus controls, including for example differences in cryptic kinship. Thus it is the unobserved pedigree of the study subjects that can be regarded as the confounding factor, and approaches to dealing with the problem should be based on appropriate descriptions of it. The matrix of kinship coefficients, which can be accurately estimated from genome-wide SNP data, provides a better description of the underlying pedigree, and hence can provide a better resolution of the problem of population structure in association studies, than do approaches based on a K-subpopulation model. Kinship-based approaches have been standard in animal and plant breeding genetics for many years, based on known pedigrees, via random-effects regression models. In addition to the issue of the best estimates of kinship, there are substantial challenges in making inferences under such models for the numbers of markers and study subjects used in current association studies. We propose a fast algorithm for inference, and illustrate its performance under various study designs relative to other treatments of population structure. Ascertainment, the bane of human genetics that is almost absent from animal and plant genetics, poses serious problems which we discuss. Based on work to be published in the PhD thesis of William Astle (Sept 08).
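For concreteness, here is a minimal sketch (not the authors' code) of one standard way to estimate the kinship matrix from genome-wide SNP genotypes, using the standardized-genotype estimator K = ZZ'/m; the simulated data are hypothetical:

```python
import numpy as np

def kinship_matrix(G):
    """Standardized-genotype kinship estimate K = Z Z^T / m for a genotype
    matrix G (n subjects x m SNPs coded 0/1/2); assumes polymorphic SNPs."""
    p = G.mean(axis=0) / 2.0                      # per-SNP allele frequency
    Z = (G - 2 * p) / np.sqrt(2 * p * (1 - p))    # centre and scale each SNP
    return Z @ Z.T / G.shape[1]

# Toy data: 100 unrelated subjects, 1000 SNPs
rng = np.random.default_rng(0)
G = rng.binomial(2, rng.uniform(0.1, 0.5, size=1000), size=(100, 1000)).astype(float)
K = kinship_matrix(G)
print(np.round(K[:4, :4], 2))   # near-identity block for unrelated subjects
```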


 

Simultaneous analysis of all SNPs in genome-wide and resequencing association studies
David J. Balding, Imperial College, UK


Clive Hoggart1, John Whittaker2, Maria De Iorio1 and David Balding1

1Department of Epidemiology and Public Health, Imperial College London
2Non-communicable Disease Epidemiology Unit, London School of Hygiene & Tropical Medicine.

Testing one SNP at a time does not fully realise the potential of genome-wide association studies to identify multiple causal variants, which is a plausible scenario for many complex diseases. We show that simultaneous analysis of the entire set of SNPs from a genome-wide study, to identify the subset that best predicts disease outcome, is now feasible thanks to developments in stochastic search methods. We employ a Bayesian-inspired penalised maximum likelihood approach in which every SNP can be considered for additive, dominant and recessive contributions to disease risk. Posterior mode estimates are obtained for regression coefficients that are each assigned a prior with a sharp mode at zero. A non-zero coefficient estimate is interpreted as corresponding to a significant SNP. We investigate two prior distributions and show that the normal-exponential-gamma prior leads to improved SNP selection in comparison with single-SNP tests. We also derive an explicit approximation for type-I error that avoids the need to employ permutation procedures. As well as genome-wide analyses, our method is well-suited to fine-mapping with very dense SNP sets obtained from resequencing and/or imputation. It can accommodate quantitative as well as case-control phenotypes, covariate adjustment, and can be extended to search for interactions. We demonstrate the method using simulated case-control data sets of up to 500K SNPs, a real genome-wide data set of 300K SNPs, and a sequence-based dataset, each of which can be analysed in a few hours on a desktop workstation. The talk is based on PLoS Genetics 4(7): e1000130, 2008 (doi:10.1371/journal.pgen.1000130) and more recent work.
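As a toy stand-in for the approach described (the NEG-penalised method of the talk is not in standard libraries), here is a sketch of generic L1-penalised logistic regression, i.e., the posterior mode under a double-exponential prior with a sharp mode at zero; the simulated data and penalty strength are hypothetical:

```python
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, m = 500, 2000                        # tiny relative to a real 500K scan
X = rng.binomial(2, 0.3, size=(n, m)).astype(float)
beta = np.zeros(m)
beta[[10, 50, 300]] = 0.8               # three causal SNPs (additive coding)
y = rng.binomial(1, expit((X - X.mean(axis=0)) @ beta))

# Nonzero posterior-mode coefficients are the selected SNPs;
# C (inverse penalty strength) is chosen ad hoc for this toy example
fit = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print(np.flatnonzero(fit.coef_[0]))
```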


 

Conservation and evolution of the mammalian transcription factor binding repertoire
Guillaume Bourque, Genome Institute of Singapore


Molecular technologies relying on chromatin immunoprecipitation have been developed to assay protein-DNA interactions and to study gene regulatory networks. For a particular transcription factor under a particular condition, these technologies combined with next-generation sequencing can provide a map of all the sites that are bound across the genome. To understand the functional significance of these sites, comparative genomics represents a powerful tool for prioritizing these regulatory elements based on conservation. We will review some of the statistical approaches that are used to evaluate the conservation of regulatory elements identified by these genomic approaches. At the same time, because the identification of lineage-specific innovations in genomic control elements is critical for understanding phenotypic heterogeneity, we will also discuss a pervasive association with genomic repeats by showing that a large fraction of the bona fide binding sites are embedded in distinctive families of transposable elements. Using the age of the repeats, we established that these repeat-associated binding sites (RABS) have been associated with significant regulatory expansions throughout the mammalian phylogeny.


 

Scan statistics and bootstrap resampling to identify chromosome regions of association in genome-wide studies
Shelley Bull, University of Toronto, Canada


Co-Authors: JL Asimit, L Faye, AD Paterson, L Sun, and YJ Yoo, University of Toronto

Much of current analysis of high-dimensional genetic marker data based on array technologies proceeds by treating each single marker as the unit of analysis for false positive error control and ranking of detected associations for follow-up. On the other hand, examining sets of markers within regions and treating the region as the unit of analysis can reduce the dimensionality problem substantially at the genome level, and is natural when the region corresponds to a candidate gene. Alternatively, regions may be defined statistically, for example via a scan statistic. After categorizing a marker association as significant or not (based on the single marker association p-value), the scan statistic approach simultaneously identifies regions that contain more significant markers than expected by chance and tests for regional significance. Our interest in region-based analysis is motivated by two settings: (1) detection of association between microarray gene expression and copy number measures in tumour samples, and (2) detection and characterization of association between a quantitative trait and SNP array variants in a population-based sample of individuals. In this talk I will present some evaluations of the statistical approaches we propose in the context of datasets from these settings. To account for genome-wide multiple testing, we also consider the utility of marker reappearance frequencies and bias-reduced effect estimators based on bootstrap resampling methods.
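A toy sketch of the windowed counting idea behind a scan statistic: count significant markers in a sliding window and compare against a binomial null. This ignores the dependence between overlapping windows and between markers in LD, which the authors' regional significance test must handle, so it is illustrative only:

```python
import numpy as np
from scipy.stats import binom

def scan_regions(pvals, alpha=1e-3, window=50):
    """Count markers with p < alpha in each sliding window and return the
    counts with a per-window binomial tail probability (treats markers as
    independent and ignores overlap between windows)."""
    hits = (np.asarray(pvals) < alpha).astype(int)
    counts = np.convolve(hits, np.ones(window, dtype=int), mode="valid")
    tail = binom.sf(counts - 1, window, alpha)     # P(count >= observed)
    return counts, tail

pvals = np.random.default_rng(2).uniform(size=10_000)
pvals[4000:4020] = 1e-5                            # an associated region
counts, tail = scan_regions(pvals)
print(int(counts.argmax()), int(counts.max()))     # window near 4000, ~20 hits
```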


 

Adaptive EBIC and its application in genome-wide studies
Zehua Chen, National University of Singapore


The ordinary Bayes information criterion is too liberal for feature selection when the feature space is large, which is typical in genome-wide genetic studies. Chen and Chen (2008) developed an extended family of Bayes information criteria (BIC) for feature selection with large feature space. Unlike the original BIC, which puts a penalty on the number of unknown parameters in a model, the extended BIC (EBIC) puts an additional penalty on the complexity of the model space. The criteria involve an adjusting parameter which controls the severity of the penalty on model complexity. Chen and Chen (2008) established the consistency of the EBIC when the dimension of the feature space goes to infinity much faster than the sample size, for a range of the adjusting parameter. In terms of positive selection rate (PSR) and false discovery rate (FDR) in feature selection, the consistency of the EBIC implies that, when the EBIC is used for model selection, the PSR and FDR converge to 1 and 0 respectively as the sample size goes to infinity, for any value of the adjusting parameter in its consistency range. However, this asymptotic result is of limited use in practical problems with finite, fixed sample size, where one wants to control the FDR at a desirable level. In this talk, we present a data-driven procedure for the estimation of FDR for any value of the adjusting parameter. This estimation facilitates the choice of the adjusting parameter for controlling FDR. Simulation results demonstrating the efficacy of the adaptive EBIC procedure will be presented.
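A minimal sketch of the EBIC formula of Chen and Chen (2008) as described above; the log-likelihoods and dimensions in the example are hypothetical:

```python
import math

def ebic(loglik, k, n, p, gamma=1.0):
    """Extended BIC of Chen and Chen (2008) for a model using k of p
    candidate features: the usual BIC plus 2*gamma*log(C(p, k)), the extra
    penalty on the size of the model space; gamma = 0 is the ordinary BIC."""
    return -2.0 * loglik + k * math.log(n) + 2.0 * gamma * math.log(math.comb(p, k))

# Hypothetical comparison of a 3-SNP and a 6-SNP model, n = 200, p = 10,000
print(ebic(loglik=-120.0, k=3, n=200, p=10_000))
print(ebic(loglik=-118.5, k=6, n=200, p=10_000))   # larger model, bigger penalty
```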


 

Estimating the proportion of true null hypotheses in identifiable nonparametric models
Hanfeng Chen, Bowling Green State University, USA


In biology, it is often the case that researchers want to test multiple hypotheses simultaneously (e.g., thousands at a time in genome-wide experiments). This gives rise to the so-called multiple testing problem in statistics. The false discovery rate and the positive false discovery rate (pFDR) have emerged as powerful criteria for measuring or controlling the overall rate of false rejections in multiple testing. Unlike the significance level in testing a single hypothesis, the pFDR remains unknown even after the experiment and data analysis, and implementing the pFDR in practice requires sophisticated estimation. Storey (2002 and 2003) gave a Bayesian interpretation of the pFDR. Under this approach, central to estimating the pFDR is estimating the proportion of true null hypotheses. In this talk, we discuss methods for estimating this proportion and propose an MLE approach in identifiable nonparametric models. When the nonparametric model is approximated by a Riemann sum, the likelihood function is asymptotically that of a finite beta mixture model. The EM algorithm is readily applicable to compute the MLE. This is joint work with X. Wu.
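As a point of reference for the estimation problem, here is a sketch of Storey's simple threshold estimator of the proportion of true nulls (the nonparametric MLE of the talk is more involved); the simulated p-values are hypothetical:

```python
import numpy as np

def pi0_storey(pvals, lam=0.5):
    """Storey's estimator of the proportion of true nulls: p-values above
    lambda come almost entirely from the uniform null component, so
    #{p > lambda} / ((1 - lambda) * m) estimates pi0."""
    p = np.asarray(pvals)
    return min(1.0, (p > lam).mean() / (1.0 - lam))

rng = np.random.default_rng(3)
pvals = np.concatenate([rng.uniform(size=9_000),          # 90% true nulls
                        rng.beta(0.2, 5.0, size=1_000)])  # non-nulls, skewed small
print(pi0_storey(pvals))                                  # close to 0.9
```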


 

Statistical methods for mapping quantitative traits using high density single nucleotide polymorphisms in family samples
Josee Dupuis, Boston University, USA


Multiple genome-wide association scans using hundreds of thousands of single nucleotide polymorphisms have been performed recently, and have enabled researchers to identify genetic variants with small effects on quantitative traits. While most of these genome scans use unrelated samples, a small number of studies have pursued genome-wide association approaches in related subjects. Association analysis in family-based samples presents certain additional statistical challenges because of the correlated nature of the observations; however, the advantages of family designs in genetic studies greatly outweigh the added analysis complexity. We present statistical approaches to exploit family attributes when searching for genetic variants influencing quantitative traits of interest. We illustrate the methods using examples from a high density scan in the Framingham Heart Study cohorts.


 

Concepts of information, correlation and optimality in evolutionary population genetics
Warren Ewens, University of Pennsylvania, USA


This talk will discuss recent ideas of Frank and others concerning the concept of Fisher information in evolutionary population genetics, where the formula for Fisher information formally appears in connection with the correlation between relatives and optimality principles. Connections are also made with Fisher's Fundamental Theorem of Natural Selection. Unresolved problems in this area will be mentioned.


 

The quantitative TDT
Warren Ewens, University of Pennsylvania, USA


The qualitative (affected / not affected) TDT (transmission/disequilibrium test) was developed to overcome problems of population stratification when using marker locus information to locate disease genes. This has been extended by several authors to consider finding genes for a quantitative trait. Several software packages are available for this purpose. In this talk the assumptions and details of the models in these packages will be discussed, with particular attention being paid to the question of whether the problem of population stratification is indeed overcome.
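For readers unfamiliar with the test, here is a sketch of the classic qualitative TDT statistic, which compares transmissions of the two alleles from heterozygous parents to affected offspring; the counts below are hypothetical:

```python
from scipy.stats import chi2

def tdt(b, c):
    """Qualitative TDT: among heterozygous parents of affected offspring,
    b = transmissions of the marker allele, c = non-transmissions.
    Under the null of no linkage/association, (b-c)^2/(b+c) ~ chi2(1)."""
    stat = (b - c) ** 2 / (b + c)
    return stat, chi2.sf(stat, df=1)

stat, p = tdt(b=78, c=46)           # hypothetical transmission counts
print(round(stat, 2), round(p, 4))  # 8.26, ~0.004
```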


 

Penalized methods for bi-level variable selection
Jian Huang, University of Iowa, USA


In many applications, covariates possess a grouping structure that can be incorporated into the analysis to select important groups as well as important members of those groups. This work focuses on the incorporation of grouping structure into penalized regression. We investigate the previously proposed group lasso and group bridge penalties as well as a novel method, group MCP, introducing a framework and conducting simulation studies that shed light on the behavior of these methods. To fit these models, we use the idea of a locally approximated coordinate descent to develop algorithms which are fast and stable even when the number of features is much larger than the sample size. Finally, these methods are applied to a genetic association study of age-related macular degeneration. This is joint work with Patrick Breheny.
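For intuition, here is a sketch of the block soft-thresholding operation that underlies group-lasso coordinate descent; group bridge and group MCP replace this rule with nonconvex variants, and this is not the authors' implementation:

```python
import numpy as np

def group_soft_threshold(beta_g, t):
    """Proximal (block soft-thresholding) step for one group: shrink the
    group's whole coefficient vector toward zero, and zero it out entirely
    when its Euclidean norm falls below the threshold t."""
    norm = np.linalg.norm(beta_g)
    if norm <= t:
        return np.zeros_like(beta_g)
    return (1.0 - t / norm) * beta_g

print(group_soft_threshold(np.array([0.4, -0.3]), t=0.6))  # whole group zeroed
print(group_soft_threshold(np.array([2.0, -1.0]), t=0.6))  # shrunk, kept
```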


 

Inversion distribution in RNA sequences
Ming-Ying Leung, University of Texas at El Paso, USA


In our attempts to improve the accuracy and efficiency of predicting secondary structures of ribonucleic acid (RNA) computationally using a grid of heterogeneous computers, we first need to characterize the distribution of inversions in random nucleotide sequences and establish statistical criteria for assessing significantly high concentrations of inversions in fixed-length segments. RNA is a single-stranded molecule made up of 4 types of nucleotide bases: cytosine (C), guanine (G), adenine (A), and uracil (U). In some viruses (e.g., HIV, nodavirus, West Nile virus), the entire genome is made of RNA. An RNA molecule can fold back onto itself to form a 3-dimensional conformation by pairing up the complementary bases (i.e., C with G and A with U), which is an important structural requirement for its replication process. All secondary structural elements must contain a string of nucleotides followed closely by its inverted complementary sequence downstream; these patterns are called inversions. In this talk, we shall discuss the distribution of inversions in comparison with Poisson-type processes.
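A naive sketch of what counts as an inversion here: a short word whose reverse complement occurs a few bases downstream, so the two halves could pair into a hairpin stem; the stem and gap parameters are illustrative only:

```python
COMP = str.maketrans("ACGU", "UGCA")

def revcomp(s):
    return s.translate(COMP)[::-1]

def find_inversions(seq, stem=5, max_gap=8):
    """Scan for stem-length words whose reverse complement appears within
    max_gap bases downstream (so the two halves could form a hairpin stem)."""
    hits = []
    for i in range(len(seq) - 2 * stem + 1):
        target = revcomp(seq[i:i + stem])
        j = seq[i + stem:i + stem + max_gap + stem].find(target)
        if j >= 0:
            hits.append((i, i + stem + j))  # starts of the word and its partner
    return hits

print(find_inversions("GGGAACUUUCCC"))      # [(0, 7)]: GGGAA pairs with UUCCC
```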


 

A method of SNP re-ranking at the initial stage of genome-wide association studies
Yi Li, Genome Institute of Singapore


Genome-wide association studies (GWASs) have become increasingly popular and fruitful in detecting common genetic variants that predispose to common complex diseases. They are presently performed in a two- or multi-stage manner, where a number of promising SNPs are selected at the initial stage and validated in follow-up independent samples. Candidate SNPs are chosen based on the single-SNP association p-values, which does not directly guarantee a high replication rate because of the huge number of tests performed. The local False Discovery Rate (FDR) is an ideal selection criterion, because a low local FDR for a SNP is equivalent to a high replication rate. However, conventionally estimated local FDRs do not change the ranks obtained from the association p-values, since the effect size involved in the local FDR estimation is closely correlated with the test statistic and is biased for extremely significant SNPs in GWASs. Although two main methods have been proposed to correct the bias in effect size estimation, neither can be applied to re-rank a large number of SNPs at the initial stage of a GWAS. Here we propose a modified .632+ bootstrapping method to estimate bias-reduced effect sizes, and re-rank tens of thousands of SNPs based on the corrected local FDRs. Both simulation and experimental data show that our re-ranking method improves the ranks of truly-associated SNPs about 70% of the time when their p-value-based ranks lie in [500, 6000]. Our method easily accommodates covariates, and is therefore applicable to GWAS meta-analysis, which is becoming increasingly popular.


 

Multiple interval mapping for quantitative trait loci with a spike in the trait distribution
Wenyun Li, Sun Yat-Sen University, China


Statistical methods for QTL mapping have been developed in the literature mainly for traits with regular distributions. However, there are many traits whose distribution has a spike, i.e., a single point carrying an irregular probability mass. In this talk, we present a multiple interval mapping (MIM) procedure for trait distributions with a spike. The MIM procedure is based on a mixture of joint generalized linear models (GLIMs). These mixture GLIMs are used together with an extended Bayesian information criterion (EBIC) in the MIM procedure. Unlike methods based on single-QTL models, the MIM procedure considers multiple QTL simultaneously and hence enhances the efficiency of detecting QTL in a genome-wide search. The MIM procedure is compared with the interval mapping method based on single-QTL models in a real data example as well as in simulation studies. It is demonstrated that the MIM procedure greatly improves on the methods based on single-QTL models in terms of positive selection rate and false discovery rate.


 

An epistatic model for dissecting genetic susceptibility to disease
Tian Liu, Genome Institute of Singapore


Interactions between different genes, termed epistasis, have been increasingly recognized to play an important role in the pathogenesis of most common human diseases, such as cancer or cardiovascular disease. Integrating the principles of quantitative genetics, we propose a computational model for dissecting a complex disease into its genetic action and interaction components, composed of causal single nucleotide polymorphisms (SNPs), in a simple case-control association study. We formulate a model based on reconstructed two-by-two contingency tables to test epistasis of various kinds between the case and control groups. Computer simulations show that the method is more powerful and informative than an exhaustive pair-wise analysis using a logistic regression model. The new model was tested on a stroke candidate-gene case-control data set, leading to the discovery of significant 'additive by dominant' interactions for stroke.

Joint work with Anbupalam Thalamuthu, Jianjun Liu, Christopher Chen (Dept. of Pharmacology, National University of Singapore), and Rongling Wu (Department of Public Health Sciences and Statistics, Pennsylvania State University).


 

DNA methylation profiling using MSNP
Venkat Seshan, Columbia University Medical School, USA


DNA methylation is one of the epigenetic mechanisms that control the expression of genes. Aberrant DNA methylation is suspected to be a factor in the development and progression of cancer. In this talk I will give some background and introduce methodologies available for the genome-wide measurement of DNA methylation patterns. I will expand on the MSNP method, which uses Affymetrix SNP chips for this purpose, using data from an ongoing project with Dr. Benjamin Tycko.


 

DNA copy numbers and the circular binary segmentation algorithm
Venkat Seshan, Columbia University Medical School, USA


DNA sequence copy number is the number of copies of DNA at a region of a genome. The development of malignant tumors and their progression often involve alterations in DNA copy number. We will present the motivation for the Circular Binary Segmentation algorithm we developed (Olshen et al., Biostatistics, 2004) to segment the genome into regions of equal copy number. We will also present refinements to the algorithm to handle the large arrays that are being used more commonly now (Venkatraman & Olshen, Bioinformatics, 2007). We will present extensions to the problem such as parental copy numbers and the application to tumor data.
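For flavor, a toy sketch of plain (non-circular) binary segmentation with a two-sample t-like split statistic; CBS itself tests arcs of the circularized segment and assesses significance by permutation, so this is only a simplified relative:

```python
import numpy as np

def binary_segment(x, threshold=4.0, min_len=3):
    """Recursively split at the location maximizing a two-sample t-like
    statistic; stop when the best split no longer exceeds the threshold.
    Returns change-point indices relative to the start of x."""
    n = len(x)
    if n < 2 * min_len:
        return []
    best_t, best_k = 0.0, None
    for k in range(min_len, n - min_len + 1):
        se = np.sqrt(x.var() * (1.0 / k + 1.0 / (n - k)))
        t = abs(x[:k].mean() - x[k:].mean()) / se
        if t > best_t:
            best_t, best_k = t, k
    if best_t < threshold:
        return []
    return (binary_segment(x[:best_k], threshold, min_len) + [best_k]
            + [best_k + c for c in binary_segment(x[best_k:], threshold, min_len)])

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(0, 0.3, 100), rng.normal(1, 0.3, 60),
                    rng.normal(0, 0.3, 100)])
print(binary_segment(x))   # change-points near 100 and 160
```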


 

Statistical challenges with next-generation sequence data
Terry Speed, The Walter and Eliza Hall Institute of Medical Research, Australia


Recent improvements in the efficiency, quality, and cost of genome-wide sequencing are prompting biologists to abandon microarrays in favor of next-generation sequencers, including Illumina's Genome Analyzer and several others. These high-throughput sequencing technologies have already been applied to studying genome-wide transcription levels (mRNA-Seq), transcription factor binding sites (ChIP-Seq), chromatin structure, DNA copy number, and DNA methylation status, among other topics. This talk, which draws heavily on work of Sandrine Dudoit and her students, and my student Oleg Mayba, touches on some of these areas, with the major focus being on ChIP-Seq.


 

Genome-wide association studies in mixed populations
Hua Tang, Stanford University, USA


Admixture mapping is a method that exploits ancestral allele frequency differences to map disease susceptibility genes in recently admixed populations such as African Americans and Latinos. With the advent of high density genotyping platforms with as many as 500K SNPs or more for genome-wide association studies, a question arises as to the relative power of direct association analysis versus admixture mapping in such populations. Previously we have shown that with high-density SNP arrays, it is possible to accurately reconstruct the ancestry block structure of an admixed individual. Here we evaluate the relative efficiency of genotype- and ancestry-based association analyses. We also consider a strategy that combines the two sources of information.


 

Joint analysis of multiple genes in a pathway or a gene set
Anbupalam Thalamuthu, Genome Institute of Singapore


Jingyuan Zhao, Anbupalam Thalamuthu, Simone Gupta, Garrett Teoh Hor Keong and Jianjun Liu

Multiple genes are known to be involved in the etiology of common diseases. Some of the existing methods for the joint association analysis of multiple genetic variants combine the effects of many variants within a single gene, or of independent variants from multiple genes. Here we propose a joint association testing methodology to study the effect of multiple variants from several genes. We use gene-level attributes to combine the information from multiple genes, which enables us to test the joint association of several genes. Further, we propose a methodology to identify a subset of significantly associated genes. The proposed method can be used for testing the joint effect of several genes in a candidate-gene study and can easily be extended to identify important genes in specific biological pathways generated from a genome-wide association study (GWAS). We evaluate the performance of the proposed methodology using simulated data sets as well as data sets from a candidate-gene study. The results show that testing the joint effect of multiple genes is more powerful than single-gene analysis.
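As the simplest possible gene-level aggregation (not the method proposed in the talk), here is Fisher's combination of per-SNP p-values within each gene; it assumes independent p-values, which LD violates, so permutation is typically needed in practice. The genes and p-values below are hypothetical:

```python
from scipy.stats import combine_pvalues

# Hypothetical per-SNP p-values grouped by gene
gene_pvals = {"GENE_A": [0.002, 0.03, 0.40],
              "GENE_B": [0.52, 0.71, 0.09, 0.33]}

for gene, ps in gene_pvals.items():
    stat, p = combine_pvalues(ps, method="fisher")  # -2*sum(log p) ~ chi2(2k)
    print(gene, round(p, 4))
```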


 

Enhancing signal detection ability through information sharing
Naisyin Wang, Texas A&M University, USA


It is of great interest to identify genes that play a crucial role in the promotion stage of tumor formation. However, the differential signals at this stage tend to be much weaker than those obtained in comparisons between tumor and normal tissues. One strategy in the study of dietary prevention effects in tumorigenesis is to collect multivariate information, for example microRNA and various types of mRNA measurements, from the same animals under different experimental setups. This practice allows researchers to borrow strength from the related variables to detect the weak but practically important diet differences at the early stage of tumorigenesis. I will present some challenges we encountered during the study and methods we developed.


 

On the relationships between population characteristics and QTL mapping
Benjamin Yakir, The Hebrew University, Israel


Recent decades have seen lively debates in the attempt to identify the best population and the best experimental design for QTL mapping. In this talk we will try to add our own contribution to the debate. Our proposed analysis relies upon a unified approach for constructing score statistics for mapping based on samples of related subjects that was introduced in Dupuis et al. (PNAS 104(51), 20210-20215). The component we add to the analysis in that paper is a population model that enables the assessment of the efficiency of the statistics as a function of the parameters of the population's evolution. The hope is to produce a quantitative, rather than qualitative, assessment of the appropriateness of a given population for mapping a specific trait, based on estimable characteristics of the population and of the trait.


 

Change-point detection and copy number variation
Benjamin Yakir, The Hebrew University, Israel


Change-point models are natural in settings that involve shifts in the characteristics of a sequence. Traditionally, the theory was developed for a single sequence and with respect to simple models of a single interval of shifted distribution. Modern applications call for the treatment of multiple data streams and more complex scenarios of change. In this talk we will consider the detection of DNA copy number variation in the context of the theory of change-point detection. We will argue that the same principles that produce efficient change-point detection rules may be applied in order to develop procedures for the identification of copy number variations. Moreover, we will claim that the probabilistic theory being developed in the change-point detection literature is relevant to the analysis of DNA sequences as well.


 

Characterization of allele-specific copy number in tumor genomes
Nancy Zhang, Stanford University, USA


We develop a stochastic segmentation model to estimate allele-specific DNA copy number in tumor samples, which can be applied to data from high density genotyping platforms. Our method estimates, at each probe location, the quantities of both inherited chromosomes, and thus gives a more comprehensive picture than methods based solely on total copy number. Since the genotypes at each marker are unknown, our method simultaneously infers this missing information. The model assumes a hidden Markov model with continuous-valued states for the bivariate parent-specific chromosome copy number, and is able to model the fractional copy number changes that are commonly seen in tumors. We give a computationally efficient algorithm for fitting the model, which scales well to current high density arrays.

The proposed method is applied to an analysis of 223 glioblastoma samples from the Cancer Genome Atlas (TCGA) project, giving a more comprehensive summary of the composition of allele-specific copy number events in these samples. Case studies using the TCGA glioblastoma samples reveal the additional insights that can be gained from an allele-specific copy number analysis, such as quantifying fractional gains and losses, identifying copy neutral loss of heterozygosity, and elucidating regions of simultaneous changes of both inherited chromosomes.


 

Statistical challenges in genetic studies of mental disorders
Heping Zhang, Yale University, USA


Early family studies of psychiatric disorders began about a century ago, but our understanding of the genetics of mental and behavioral disorders remains limited. One challenge arises from the fact that multiple phenotypes are needed to characterize psychiatric disorders, which usually do not occur alone. In fact, comorbidity is the rule rather than the exception. To address this challenge, I will first demonstrate the usefulness of considering multiple traits in genetic studies of complex disorders. Then, I will present a non-parametric test for studying the association between multiple traits and a candidate marker. After a brief summary of the theoretical properties of the test, the nominal type I error and power of the proposed test will be compared with those of existing tests through simulation studies. The advantage of the proposed test will also be demonstrated by a study of alcoholism. This is joint work with Ching-Ti Liu, Xueqin Wang, and Wensheng Zhu.


 

Forest-based approach to genetic studies
Heping Zhang, Yale University, USA


Multiple genes, gene-by-gene interactions, and gene-by-environment interactions are believed to underlie most complex diseases. However, such interactions are difficult to identify. While there have been recent successes in identifying genetic variants for complex diseases, it remains difficult to identify gene-gene and gene-environment interactions. To overcome this difficulty, we explore a forest-based approach and propose a concept of variable importance. We demonstrate the validity of the proposed approach through simulation studies and illustrate its use with a real data analysis. Analyses of both real data and simulated data based on published genetic models show the effectiveness of the proposed approach. This is joint work with Xiang Chen, Ching-Ti Liu, and Minghui Wang.


 

A feature selection approach to case-control genome-wide association studies
Jingyuan Zhao, Genome Institute of Singapore


Genome-wide association studies have become possible with the advancement of biotechnology, which makes genome-wide typing of SNPs affordable. However, statistical methods for genome-wide association studies are still wanting. The multiple testing approach commonly used in genome-wide studies at present is not really appropriate. In this presentation, I will introduce a feature selection approach to genome-wide association studies. This approach consists of a screening procedure and a low-dimensional selection procedure. In the screening procedure, both main and interaction effects are screened using an L1-penalized likelihood to reduce the number of features to a desirably low level. In the selection procedure, the retained features are ranked by a modified SCAD-penalized likelihood (Fan and Li, 2001) and assessed by an extended Bayesian information criterion (Chen and Chen, 2008). The feature selection approach is compared with a pairwise multiple testing approach recently considered by Marchini et al. (2005). It is demonstrated by simulation studies that the feature selection approach controls the false discovery rate at lower levels and achieves a higher positive detection rate than the pairwise multiple testing approach.
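For reference, a sketch of the SCAD penalty of Fan and Li (2001) mentioned above; it behaves like the lasso near zero but levels off so that large coefficients are not over-shrunk:

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001): linear (lasso-like) up to lam,
    quadratic taper on (lam, a*lam], then constant, so large coefficients
    are not over-shrunk. a = 3.7 is the value suggested by Fan and Li."""
    t = np.abs(theta)
    linear = lam * t
    taper = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    flat = np.full_like(t, lam**2 * (a + 1) / 2)
    return np.where(t <= lam, linear, np.where(t <= a * lam, taper, flat))

print(scad_penalty(np.array([0.1, 1.0, 5.0]), lam=0.5))
```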


 

What are network modules?
Hongyu Zhao, Yale University, USA


Many computational and statistical methods have been proposed to better organize, understand, and visualize gene expression data across different samples/conditions. A number of concepts have emerged to provide useful summaries of expression patterns; among them, the network module is commonly used to represent a set of genes with coherent expression patterns. Despite its popularity, there is no consensus on how to define a network module and what the biological basis for a specific module is. In this talk, we will survey the different module definitions employed in the literature and discuss how to formulate a definition through joint statistical and biological considerations.


 

Control of population stratification in whole-genome scans
Fei Zou, University of North Carolina at Chapel Hill, USA


Association studies using unrelated individuals have become the most popular design for mapping complex traits. Among the major challenges of association mapping is avoiding spurious association due to population stratification. Principal component analysis (PCA) is one of the leading stratification-control methods. However, the PCA approach implicitly assumes that the markers are in linkage equilibrium, a condition that is rarely satisfied in genome scans. We have developed a shrinkage PCA approach that can be applied to all available markers, regardless of the linkage disequilibrium patterns. We have further identified a relationship between principal components and over-dispersion of association test statistics that provides precise guidance on the selection of principal components to use in adjusted test statistics. Our approach selects a much smaller number of principal components than that suggested by Tracy-Widom statistics, providing substantial computational savings in genome scans.
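A sketch of the standard (unshrunk) PCA adjustment that the talk's shrinkage method refines: compute top principal components of the standardized genotype matrix and project them out of the phenotype; the function name and data here are hypothetical:

```python
import numpy as np

def pc_adjust(G, y, n_pcs=10):
    """EIGENSTRAT-style adjustment: take top principal components of the
    standardized genotype matrix as ancestry axes and regress them out of
    the phenotype (the same projection is applied to each SNP in practice)."""
    p = G.mean(axis=0) / 2.0
    Z = (G - 2 * p) / np.sqrt(2 * p * (1 - p))
    U, _, _ = np.linalg.svd(Z, full_matrices=False)
    pcs = U[:, :n_pcs]
    return y - pcs @ (pcs.T @ y), pcs    # residual phenotype, ancestry PCs

rng = np.random.default_rng(5)
G = rng.binomial(2, 0.3, size=(300, 1000)).astype(float)
y = rng.normal(size=300)
y_adj, pcs = pc_adjust(G, y)
```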


 

Bayesian variable selection in semiparametric regression modeling with applications to genetic mapping
Fei Zou, University of North Carolina at Chapel Hill, USA


Quantitative traits and complex diseases are affected by the joint action of multiple genes. Most of the available genetic mapping methods map only one or a few QTL simultaneously, with up to two-way gene-gene interactions considered, and are therefore not efficient for mapping the key genes influencing such complex traits. The identification of these genes is a very large variable selection problem: for q potential genes, with q in the hundreds or thousands, there are 2^q possible main-effect models, q(q-1)/2 possible two-way interactions, and C(q, k) possible higher-order (k > 2) interactions. In this talk, we introduce a Bayesian variable selection approach for semiparametric genetic mapping. The approach allows us to select genetic variants that are not necessarily all individually important but are jointly important.


 

Bias correction in whole-genome scans
Fei Zou, University of North Carolina at Chapel Hill, USA


It is widely recognized that genome-wide association studies suffer from inflation of the risk estimates for genetic variants identified as significant in the genome scan, the so-called "winner's curse". To handle this significance bias, we have developed an approximate conditional likelihood approach for risk estimation and a principled method for confidence interval construction that acknowledges the conditioning on statistical significance. We discuss extensions to the situation where risk estimation is performed for multiple correlated phenotypes in the genome scan. Our approach is widely applicable, is far easier to implement than competing approaches, and may often be applied to published studies without access to the original data. The results have considerable importance for the proper design of follow-up studies and for risk characterization.
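A one-parameter sketch of the conditional-likelihood idea: for a hit passing the significance threshold, maximize the likelihood of the observed Z-score conditional on significance rather than the unconditional likelihood. This is a simplified version under stated assumptions, not the authors' exact method:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def conditional_mle(z, c):
    """Bias-reduced effect estimate for a hit with |Z| > c: maximize the
    likelihood of z ~ N(mu, 1) conditional on significance, which deflates
    the naive estimate mu_hat = z."""
    def neg_loglik(mu):
        p_sig = norm.sf(c - mu) + norm.cdf(-c - mu)   # P(|Z| > c | mu)
        return -(norm.logpdf(z - mu) - np.log(p_sig))
    res = minimize_scalar(neg_loglik, bounds=(-abs(z) - 2, abs(z) + 2),
                          method="bounded")
    return res.x

# Genome-wide threshold |Z| > 5.45 (two-sided alpha = 5e-8)
print(conditional_mle(z=5.8, c=5.45))   # noticeably below the naive 5.8
```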


 