Institute for Mathematical Sciences Event Archive
Post-Genome Knowledge Discovery
(January – June 2002)
~ Abstract ~
Sub-theme 2. Population and statistical genetics (Mar - Apr 2002)
Tutorial on population and statistical genetics (19 - 22 March 2002)
Laura Almasy, Southwest Foundation for Biomedical Research, USA
Basic concepts of quantitative trait
genetics
This lecture will introduce some basic concepts behind
genetic analysis of quantitative traits using variance
component methods, including additive and dominance variance,
shared environmental effects, kinship, and heritability.
Examples will be provided from a family study of hemostasis.
Quantitative trait linkage analysis and identity by
descent
Identity-by-descent (IBD) allele sharing among family
members will be discussed and compared to identity-by-state
(IBS) allele sharing. Methods of estimating IBD and IBD-based
quantitative trait linkage methods will be explained. Examples
will be provided from a genome-wide linkage screen in the GAIT
(Genetic Analysis of Idiopathic Thrombosis) study.
Quantitative trait association analysis
Measured genotype association analysis in quantitative
traits will be introduced and the design and interpretation of
association studies will be discussed, with particular
emphasis on concepts and implications of linkage
disequilibrium. Additional study design issues relevant to
both linkage and association studies will also be addressed.
Advanced topics: Multivariate analysis and interaction effects
This lecture will introduce extensions to variance
component linkage methods which permit multivariate analyses
and epistatic or gene-environment interaction effects.
Multivariate analyses of related traits can be used to
identify sources of phenotypic correlation, genetic or
environmental, and to increase linkage power. Incorporation of
interaction effects also may provide insight into the
mechanisms of gene action. Examples will be drawn from a
variety of studies.
Statistical inference for population genetics
David Balding, University of London, UK
The series of 4 lectures will cover a range of modern, computational methods for drawing statistical inferences from samples of genetic data drawn from one or more populations. The parameter of interest may be the mutation rate, or else demographic parameters such as the current effective population size, growth rate and/or time since start of growth. We will also consider inferences about the ancestry of the sample, such as the time since the most recent common ancestor. To keep matters relatively simple, we will not deal at length with either recombination or selection.
The methods will be illustrated using real-time computations in S-Plus or R. The latter is freely available for several computer platforms at www.cran.r-project.org.
The coalescent model and its generalisations provide the most appropriate family of statistical models for genetic data. Those not already familiar with coalescent theory should regard Dr Nordborg's lectures on this topic as a pre-requisite. Preparatory reading (optional) can be found in the chapters by Nordborg (Chapter 6) and Stephens (Chapter 7) of the Handbook of Statistical Genetics, Balding, Bishop & Cannings (eds), Wiley 2001. See also the BATWING software documentation, available at www.maths.abdn.ac.uk/~ijw.
Warren Ewens, University of Pennsylvania, USA
Using mathematics in genetics: historical aspects
The use of mathematics and statistics in genetics,
particularly evolutionary genetics, in the period 1900-1980,
will be surveyed. It will be shown, for example, how a
mathematical argument overcame the main problem with the
Darwinian theory, and how mathematical and statistical
arguments quickly became central to an understanding of
evolution, of the genetics of human diseases, and of plant and
animal breeding.
Using mathematics in genetics: recent developments
This talk will survey the use of mathematics and statistics
in genetics, particularly evolutionary genetics, in the period
1980-2000. Whole genome evolutionary questions will be
discussed, as well as the stochastic and sampling theory
associated with samples of genes in contemporary populations,
using "molecular" population genetics models.
Mathematics and statistics in human genetics
Two uses of statistics in human genetics will be described.
The first of these concerns the ascertainment problem which
arises when non-random samples are taken in assessing the
genetic basis of a given disease. The second is an
introduction to statistical methods of linkage analysis, that
is of locating disease genes through the use of so-called
marker loci. The focus will be on non-parametric methods of
linkage analysis.
Linkage analysis and the TDT
This talk will discuss more advanced forms of linkage
analysis, applying to more recent forms of data. Problems with
the case-control method of linkage analysis will be described,
leading to a description of the transmission-disequilibrium
test (TDT).
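As a minimal sketch of the statistic behind the TDT, assuming its usual McNemar-type form: among heterozygous parents, let b be the number of times one marker allele is transmitted to an affected child and c the number of times the other allele is transmitted. Under the null hypothesis of no linkage, (b - c)^2 / (b + c) is approximately chi-square distributed with 1 degree of freedom. The function name and the counts below are illustrative and are not taken from the lecture.

```python
def tdt_statistic(b: int, c: int) -> float:
    """McNemar-type TDT statistic from transmission counts.

    b: transmissions of allele 1 from heterozygous parents to affected offspring
    c: transmissions of allele 2 from the same kind of parents
    """
    if b < 0 or c < 0:
        raise ValueError("transmission counts must be non-negative")
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (b - c) ** 2 / (b + c)


# Hypothetical counts: 60 transmissions of allele 1 versus 40 of allele 2.
stat = tdt_statistic(60, 40)
print(stat)
```

Because only transmissions within families are compared, the statistic is not inflated by allele frequency differences between subpopulations, which is the motivation given above for moving away from case-control comparisons.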
Population genetics: making sense of genetic variation
Magnus Nordborg, University of Southern California, USA
The purpose of these lectures is to introduce the major questions currently being asked in population genetics, and the basic models that have shaped these questions. Emphasis will be on the concepts and the data: statistical details and methodology will be covered by other lectures, in particular by Dr. Balding.
Modeling genetic variation using the coalescent
Population genetics theory during the last 20 years has
been dominated by coalescent models. This lecture will
introduce the coalescent process and its basic properties, and
discuss why it may be a reasonable model. The relationship
between coalescent theory and classical population genetics
theory will be discussed.
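A small simulation can make the basic coalescent process concrete. The sketch below assumes the standard neutral coalescent for a sample of n lineages, with time measured in coalescent units: while k lineages remain, the waiting time to the next coalescence is exponential with rate k(k-1)/2, and the expected time to the most recent common ancestor is 2(1 - 1/n). The function names, seed and sample size are illustrative choices, not part of the lecture.

```python
import random


def simulate_tmrca(n: int, rng: random.Random) -> float:
    """Simulate the time to the most recent common ancestor of n lineages."""
    t = 0.0
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2  # number of lineage pairs that could coalesce
        t += rng.expovariate(rate)
    return t


def expected_tmrca(n: int) -> float:
    """Theoretical mean of the TMRCA in coalescent time units."""
    return 2.0 * (1.0 - 1.0 / n)


rng = random.Random(1)
n_lineages = 10
n_reps = 20000
mean_tmrca = sum(simulate_tmrca(n_lineages, rng) for _ in range(n_reps)) / n_reps
print(mean_tmrca, expected_tmrca(n_lineages))
```

The simulated mean should lie close to the theoretical value, and most of that time is spent waiting for the last two lineages to coalesce.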
Population structure and demographic history
Population genetics data has often been used for inference
about the demographic history of populations. We may, for
example, be interested in historical population sizes or
patterns of migration. This lecture will discuss models of
population structure, and how much information about the past
polymorphism data contains.
Linkage disequilibrium and the genealogy of chromosomes
In the era of genomic polymorphism data, it is necessary
to have models that incorporate recombination. This lecture
will discuss recombination and its effects on the pattern of
variation, in particular on the phenomenon known as linkage
disequilibrium, which is of great importance for human
genetics and genetic epidemiology.
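To illustrate how linkage disequilibrium between two biallelic loci is commonly quantified, the sketch below uses the standard coefficient D = p_AB - p_A p_B and the squared correlation r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)). It also shows the textbook decay D_t = D_0 (1 - c)^t under random mating with recombination fraction c. These formulas and the function names are assumptions brought in for illustration; the lecture abstract does not specify which measures it uses.

```python
def ld_measures(p_ab: float, p_a: float, p_b: float) -> tuple[float, float]:
    """Return (D, r^2) from the AB haplotype frequency and the allele frequencies."""
    d = p_ab - p_a * p_b
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom == 0.0:
        raise ValueError("both loci must be polymorphic")
    return d, d * d / denom


def d_after_generations(d0: float, c: float, t: int) -> float:
    """Expected D after t generations of random mating with recombination fraction c."""
    return d0 * (1.0 - c) ** t


# Hypothetical example: alleles A and B always occur together (complete LD).
d, r2 = ld_measures(p_ab=0.5, p_a=0.5, p_b=0.5)
print(d, r2)
print(d_after_generations(d, c=0.5, t=1))
```

With free recombination (c = 0.5) the association halves every generation, whereas tightly linked markers retain it for many generations, which is why recombination must be modelled explicitly.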
Selection and molecular evolution
An important use of genetic polymorphism data is to
identify the trace of past selection. This lecture will
introduce models of selection, and evidence for selection
within and between species.
Workshop on population and statistical genetics (25 - 28 March 2002)
Mapping genes for quantitative trait loci in humans:
current methods and study designs
Eleanor Feingold, University of Pittsburgh, USA
By quantitative traits, geneticists mean traits that are measured on an ordinal (usually continuous) scale, such as height, blood pressure, cholesterol, etc. Loci on the genome that link to quantitative traits are known as quantitative trait loci (QTLs). Agricultural geneticists have been mapping QTLs for many years, but human geneticists have started doing QTL mapping much more recently. Many methods from animal genetics extend easily to humans, but for QTL mapping, family structures and study designs are different enough that new methods are required for human data. In this talk I will summarize the most popular current methods for QTL mapping in humans, with an emphasis on the interaction between study design and analysis method. I will finish by describing my recent work on analysis methods for the study design known as discordant sibling pairs, which involves using nuclear families where the children have very different values of the trait (e.g. one very high cholesterol and one very low).
Mapping mutations on genealogies
Rasmus Nielsen, Cornell University, USA
Mapping of mutations on a phylogeny or a gene genealogy has been a commonly used analytical tool in phylogenetics, population genetics and molecular evolution. However, the common approaches for mapping mutations based on parsimony have lacked a solid statistical foundation. In this talk I present a Bayesian method for mapping mutations on a genealogy. I illustrate some of the common problems associated with using parsimony and suggest instead that inferences in molecular evolution and population genetics can be made on the basis of the posterior distribution of the mappings of mutations. A method for simulating a mapping from the posterior distribution of mappings is also presented and the utility of the method is illustrated on two previously published data sets. Applications include a method for detecting positively selected amino acid sites and a method for testing whether the dN/dS ratio varies among lineages in the phylogeny. The method is also used for estimating ages of mutations in population genetical models.
Fine scale mapping of disease loci via shattered
coalescent modelling of genealogies
Andrew Morris, Oxford University, UK
A Bayesian, Markov chain Monte Carlo method for fine scale linkage disequilibrium mapping using high-density marker maps will be presented. The method explicitly models the genealogy underlying a sample of case chromosomes in the vicinity of a putative disease locus. Within this framework, it is straightforward to allow for missing marker information and for uncertainty about the true underlying genealogy and the makeup of ancestral marker haplotypes. A crucial advantage of the method is the incorporation of the shattered coalescent model for genealogies, allowing for multiple founding mutations at the disease locus and sporadic cases of disease. The advantages of the method will be illustrated by application to real data.
Investigating stem cells in human colon using
methylation patterns
Simon Tavare, University of Southern California, USA
The stem cells that maintain human colon crypts are poorly characterized. To better determine stem cell numbers and how they divide, methylation patterns can be used as cell fate markers. Methylation exhibits somatic inheritance and random changes that potentially record lifelong stem cell division histories as tags in adjacent CpG sites. We sampled methylation patterns in individual crypts using bisulfite sequencing at several loci. In this talk I will describe a simple stochastic model for the evolution of the cells in a crypt, and outline a computational approach to finding the posterior distribution of the number of stem cells that are present. The results are useful in describing the origins of colon cancer.
A unified multipoint linkage analysis of qualitative and
quantitative traits for sib-pairs
I-shou Chang, National Health Research Institutes, Taiwan
By introducing functions of the phenotypes of a sib-pair as weight functions in the study of IBD processes, we present a unified non-parametric approach to linkage analysis of qualitative and quantitative traits in sib-pairs based on IBD data obtained from a set of polymorphic markers. With the introduction of weight functions and an appropriate conditional expectation of IBD processes, these statistical methods should be more efficient in the detection of genetic factors for complex diseases. In particular, we do not assume any genetic map functions and we do not make use of hidden Markov models. These methods will be also useful in planning genetic studies. Large sample properties of these methods are demonstrated. Computational aspects of these methods are addressed.
The effect of natural selection on the fixation
probability of a mutant under the high mutation rate
Fumio Tajima, University of Tokyo
Both mutation and natural selection are important for understanding evolutionary processes. First I will consider the case where two mutants occur at almost the same time under the no-recombination model. In this case the fixation probability of a deleterious mutant increases when an advantageous mutant becomes fixed and the fixation probability of an advantageous mutant increases when a deleterious mutant becomes fixed. On the other hand, the fixation probability of a deleterious mutant decreases when a deleterious mutant becomes fixed and the fixation probability of an advantageous mutant decreases when an advantageous mutant becomes fixed. This means that the fixation of an advantageous mutant tends to accompany the fixation of a deleterious mutant. Next I will show under the no-recombination model that as the mutation rate per genome increases, the effect of natural selection decreases. Namely, the fixation probability of a deleterious mutant increases as the mutation rate per genome increases. On the other hand, the fixation probability of an advantageous mutant decreases as the mutation rate per genome increases. This indicates that random genetic drift is more important than natural selection for the evolution of haploid organisms with a high genomic mutation rate. There is a hypothesis that the RNA world existed before the DNA world. As shown by the data from RNA viruses, the mutation rate might have been very high in the RNA world. The mutation rate might have been high at the early stage of the DNA world because of low efficiencies of polymerases and repair systems. In such cases, evolution might have been determined mainly by the mutation rate rather than by natural selection.
A new molecular approach for haplotyping in large
population-based association studies
Benjamin Yakir, Hebrew University of Jerusalem, Israel
Determination of haplotype frequencies (the joint distribution of genetic markers) in large population samples is a powerful tool for association studies. Population haplotype frequencies evaluate linkage disequilibrium between markers. Haplotypes are of great value for association studies due to their greater extent of variability. Therefore, a single haplotype may capture any given functional polymorphism with higher statistical power than its SNP components. The statistical estimation of haplotype frequencies, usually employed in LD studies, requires individual genotyping of each SNP in the haplotype, which makes it an expensive process. In this talk, we describe a new method for direct measurement of haplotype frequencies in DNA pools by allele-specific, long-range amplification of the pool. The proposed method allows high-throughput genotyping of haplotypes composed of two SNPs in close vicinity (up to 20 kb). We will discuss some of the statistical implications of applying this approach in large population-based association studies.
Some population genetic considerations in association
studies
Mark Seielstad, Genome Institute of Singapore
Many association studies seeking to posit a causal link between a locus and a phenotype are likely to rely on the existence of linkage disequilibrium between a marker locus and the functional variant. Linkage disequilibrium is a measure of the frequency with which neighboring alleles are coinherited on the same chromosome in a population. Ideally, for mapping purposes, there would be a clear link between the distance among loci and the extent of LD they exhibit. Despite recent demonstrations of such a relationship and the apparent involvement of hotspots of recombination in the sometimes rapid decay of LD among markers, the effect of differing demographic histories on LD has yet to be extensively studied. Data are presented on LD over a 500 kb stretch of chromosome 22 in several rural Chinese groups and an urban Lebanese population.
Another concern in association studies that do not use family data is the possibility of cryptic population substructure among case and control groups. When this occurs, any alleles showing a frequency difference in the two (or more) subpopulations will be implicated in the phenotypic difference identified in the case and control groups. Though rarely demonstrated, this possibility is routinely suggested to explain false-positive associations. To assess the likelihood of cryptic substructure in typical case-control studies, four moderate case-control samples comprising 3472 individuals were examined. The four population samples include: 500 US Caucasians and 236 US African-Americans each with hypertension (HTN); and 500 US Caucasians and 500 Polish Caucasians each with Type 2 Diabetes (DM2), all with matched controls. In each of the four samples, population substructure was tested using the sum of the case-control allele frequency χ² statistics for 9 STR and 35 SNP markers. Weak evidence for population structure was found only in the African-American sample, but further refining the sample to include only individuals with US-born parents and grandparents eliminated even this source of stratification. These examples provide insight into the factors affecting the replication of association studies, and suggest that carefully matched, moderate-sized case-control samples in cosmopolitan U.S. and European populations are unlikely to contain levels of structure that would result in significantly inflated numbers of false positive associations.
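The abstract does not spell out the exact form of the substructure test. One plausible reading, sketched below under the assumption that the markers are unlinked and independent, is to compute a Pearson chi-square statistic on case versus control allele counts for each marker and to sum the statistics, with degrees of freedom equal to the sum of (number of alleles - 1) over markers. The function names and the allele counts are hypothetical and are not data from the study.

```python
def allele_chi2(case_counts, control_counts):
    """Pearson chi-square for a 2 x k table of allele counts at one marker.

    Returns (statistic, degrees_of_freedom).
    """
    if len(case_counts) != len(control_counts):
        raise ValueError("case and control must list the same alleles")
    n_case, n_ctrl = sum(case_counts), sum(control_counts)
    if n_case == 0 or n_ctrl == 0:
        raise ValueError("both groups need at least one observed allele")
    n_total = n_case + n_ctrl
    stat, alleles_seen = 0.0, 0
    for a, b in zip(case_counts, control_counts):
        col = a + b
        if col == 0:
            continue  # allele absent from both groups carries no information
        alleles_seen += 1
        for observed, row_total in ((a, n_case), (b, n_ctrl)):
            expected = row_total * col / n_total
            stat += (observed - expected) ** 2 / expected
    return stat, alleles_seen - 1


def summed_structure_statistic(markers):
    """Sum per-marker chi-square statistics and degrees of freedom."""
    total, df = 0.0, 0
    for case_counts, control_counts in markers:
        stat, marker_df = allele_chi2(case_counts, control_counts)
        total += stat
        df += marker_df
    return total, df


# Hypothetical allele counts for two biallelic markers (cases, controls).
markers = [((60, 40), (40, 60)), ((50, 50), (50, 50))]
total_stat, total_df = summed_structure_statistic(markers)
print(total_stat, total_df)
```

A summed statistic that is large relative to a chi-square distribution with the summed degrees of freedom would suggest that cases and controls differ in allele frequencies across the genome, which is the signature of stratification.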
Statistical problems of genetic mapping
David Siegmund, Stanford University, USA
The goal of genetic mapping is to locate genes affecting particular traits (e.g., genes that affect human susceptibility to particular diseases or genes that affect productivity of agriculturally important species) by comparing the phenotypes and genotypes of related individuals. Changes in experimental technique that provide large numbers of informative genetic markers at known locations throughout a genome suggest new statistical problems concerned with the design and analysis of gene mapping experiments. I will discuss three such problems arising from genome scans to detect anonymous genes: (i) multiple comparisons arising from the simultaneous testing of many markers for linkage to the trait of interest; (ii) statistical power to map genes as a function of the true genetic model, especially when there is gene-gene or gene-environment interaction; and (iii) confidence bounds for estimation of genetic effects.
DNA variation of human genes
Clay Stephens, Genaissance Pharmaceuticals Inc., USA
We have investigated the level of DNA-based variation (both SNPs and haplotypes) for over 5,400 human genes. In addition, we have characterized how this variation is distributed in a number of biologically and clinically important ways. First, we have determined how SNPs are distributed in human genes: where they occur relative to various functional regions; levels of variability of human SNPs; pattern of the molecular sequence of SNPs; and how these compare to the corresponding sequence of a chimpanzee. Second, we have determined how these aspects of SNP distribution vary among four human population samples. All genes were sequenced on DNA obtained from 82 unrelated individuals: 20 African-Americans, 20 East Asians, 21 European-Americans, 18 Hispanic-Latinos and 3 Native Americans. In particular, we looked at patterns of SNP and haplotype sharing among the four larger population samples. Third, we have determined the patterns of linkage disequilibrium among SNPs, which of course determine the haplotype variability of each gene. This pattern also varies substantially among populations. In order to connect important clinical variability (e.g., genetic disease or susceptibility, variable drug response) to the DNA variability of human genes, an understanding of these patterns of variability within and among human genes is a fundamental prerequisite.
The effect of epistatic interactions on popular methods
for finding complex disease genes
Susan Wilson, Australian National University, Australia
Empirical evidence from model organisms indicates that the genetic background can strongly influence the phenotype exhibited by a specific genotype due to epistatic (gene-gene) interactions. The prevalent paradigm for the analysis of common human diseases assumes, however, that a single gene is largely responsible for affecting individual disease risk. The consequence of examining each gene as though it were solely responsible for conferring disease risk when in fact that risk is contingent upon interactions with another disease locus has not been fully investigated. Here the simplest case, namely the effect of two (or more) major epistatic disease genes when data are analysed assuming a single disease gene, is examined. A general genetic model for two marker loci is developed. Based on this model it is shown that results can vary markedly depending on the parameters associated with the "unidentified" disease gene. The results indicate that if parameters associated with the second gene were to vary between studies, then the conclusions from those studies may also vary. Essentially the same result holds for case-control studies, for affected sib-pair studies and for the transmission/disequilibrium (TDT) test. This is a theoretically broad result with important implications for interpreting different results from individual studies and comparing results between studies. It demonstrates that failure to factor in such interactions can lead to elevated rates of false positives and negatives. This is particularly troubling for genomic scan type study designs. Finally, methods for fitting epistatic models to complex disease data will be discussed.
Topics in computational genomics
Wen-Hsiung Li, University of Chicago, USA
Several eukaryotic genomes have been completed and many more will soon be completely sequenced. An extremely challenging problem is how to align genomic sequences, which is often the first step in comparative analysis. Alignment of genomic sequences is much more difficult than alignment of protein sequences or DNA sequences of protein coding regions because of (1) the large demand for computer time and memory space, (2) the fact that the sequences often have been scrambled by insertion of transposable elements, translocation, and inversion, and (3) the existence of highly divergent regions. Some methods for aligning genomic sequences will be presented. Another topic to be presented is "Methods for detecting duplicated genes in a genome". Gene duplication has been thought to be the source of genetic novelties. The abundance of genomic sequence data allows the study of gene duplication at the genomic level. However, detecting duplicate genes and classifying them into gene families is not a simple matter. I shall discuss some newly developed methods.
Mapping adaptive polymorphisms in Arabidopsis using
linkage disequilibrium
Magnus Nordborg, University of Southern California, USA
There is currently tremendous interest in using population associations for fine-scale mapping of human disease loci. This talk will argue that the same methods may be even more useful in other species. In particular, it will be shown that highly selfing species, like Arabidopsis, may be uniquely suited for association mapping because of the high levels of linkage disequilibrium that result from inbreeding.
Approximate Bayesian computation in population genetics
David Balding, University of London, UK
I will discuss methods that have been evolving recently within the population genetics literature for approximating low-dimensional marginal posterior distributions under complex models involving large numbers of nuisance parameters. Although MCMC is sometimes feasible, there are typically problems with poor mixing, and model comparison is usually unachievable. The alternative being proposed is based on simulation of parameters and datasets, from the prior and model respectively, followed by local regression to model the posterior density in terms of appropriate data summary statistics. Several levels of approximation are involved, but the reward is the ability to handle complex models and to perform model comparison via approximate Bayes factors.
I will discuss applications in human population genetics and conservation genetics, as well as the possibility of applications in other fields.
This is joint work with Wenyang Zhang, Statistics, University of Kent, and Mark Beaumont, Animal & Microbial Sciences, University of Reading.
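The simulate-then-regress idea described in this abstract can be sketched on a toy problem. Everything specific below is an assumption made for illustration and is not taken from the talk: the model (20 observations from a normal distribution with unknown mean theta and unit variance), the Uniform(0, 10) prior, the sample mean as the single summary statistic, the Epanechnikov kernel, and the 5% acceptance fraction. Parameters are drawn from the prior, datasets are simulated from the model, the simulations whose summaries are closest to the observed summary are kept, and a weighted local linear regression adjusts the kept parameter values toward the observed summary.

```python
import random


def simulate_summary(theta, n_obs, rng):
    """Simulate one dataset under the toy model and reduce it to its sample mean."""
    return sum(rng.gauss(theta, 1.0) for _ in range(n_obs)) / n_obs


def abc_regression(s_obs, n_sims, n_obs, accept_frac, rng):
    """Rejection ABC followed by a weighted local linear regression adjustment.

    Returns (adjusted_thetas, weights).
    """
    # 1. Draw parameters from the prior and simulate a summary for each draw.
    sims = []
    for _ in range(n_sims):
        theta = rng.uniform(0.0, 10.0)
        sims.append((theta, simulate_summary(theta, n_obs, rng)))
    # 2. Keep the simulations whose summaries are closest to the observed one.
    sims.sort(key=lambda pair: abs(pair[1] - s_obs))
    kept = sims[: max(2, int(accept_frac * n_sims))]
    cutoff = abs(kept[-1][1] - s_obs)
    thetas = [theta for theta, _ in kept]
    xs = [s - s_obs for _, s in kept]
    # 3. Epanechnikov weights: largest at the observed summary, zero at the cutoff.
    ws = [1.0 - (x / cutoff) ** 2 for x in xs]
    # 4. Weighted least-squares slope of theta on (summary - observed summary).
    w_sum = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_sum
    t_bar = sum(w * t for w, t in zip(ws, thetas)) / w_sum
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    sxt = sum(w * (x - x_bar) * (t - t_bar) for w, x, t in zip(ws, xs, thetas))
    slope = sxt / sxx
    # 5. Shift each kept parameter to what it would be at the observed summary.
    adjusted = [t - slope * x for t, x in zip(thetas, xs)]
    return adjusted, ws


def weighted_mean(values, weights):
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


rng = random.Random(0)
adjusted, weights = abc_regression(
    s_obs=4.0, n_sims=20000, n_obs=20, accept_frac=0.05, rng=rng
)
posterior_mean = weighted_mean(adjusted, weights)
print(posterior_mean)
```

In this toy model the exact posterior mean is very close to the observed sample mean, so the estimate can be checked directly. The attraction of the approach in realistic settings is that only the ability to simulate from the model is needed, not an explicit likelihood.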