Joint NUS-ISI Workshop on Recent Advances in Statistics and Probability - IMS

(18 - 19 Nov 2008)

Jointly organized with Indian Statistical Institute, Kolkata, and Department of Statistics & Applied Probability, NUS

~ Abstracts ~

Variance estimation for tree order restricted models
Sanjay Chaudhuri, National University of Singapore

We initially consider s+1 independent normal populations with unknown means \mu_i, i=0,1,...,s, and a common variance. We further assume that the means are constrained by a tree order restriction, namely \mu_0<=\mu_i, i=1,2,...,s. It is known that the constrained maximum likelihood estimator of \mu_0 can be obtained as the minimum of extremely correlated functions of the observations. Thus it is always biased and, under very mild conditions, may even diverge to -infinity as s increases to infinity. However, conditions under which the same MLE is bounded from below in probability, or even consistent, are also known. In this talk we consider estimation of the common variance under various such conditions. We show that as s tends to infinity, depending on the rate of growth of the sample sizes from the above populations with s, the MLE of the common variance may sometimes be consistent and asymptotically normal under appropriate centring and scaling. We further discuss similar properties of the least squares estimates when the populations are not Gaussian.

Authors: Antar Bandyopadhyay, Indian Statistical Institute, India and Sanjay Chaudhuri, Department of Statistics and Applied Probability, National University of Singapore
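
As a rough illustration of the "minimum of functions of the observations" mentioned in the abstract above, the sketch below computes the constrained estimate of the means from sample means and sample sizes. It relies on the standard characterization of the tree-order MLE of \mu_0 as the smallest pooled average of the root with the populations having the lowest sample means. The function name tree_order_mle and its arguments are hypothetical and are not taken from the talk.

```python
from itertools import accumulate


def tree_order_mle(ybar, n):
    """Constrained estimate of the means under mu_0 <= mu_i, i = 1..s.

    ybar[0] and n[0] belong to the root population; a common variance
    is assumed, so each sample mean is weighted by its sample size.
    """
    # Sort the non-root populations by sample mean, smallest first.
    order = sorted(range(1, len(ybar)), key=lambda i: ybar[i])
    idx = [0] + order
    # Running weighted sums: root alone, root + smallest, root + two smallest, ...
    weighted = list(accumulate(n[i] * ybar[i] for i in idx))
    weights = list(accumulate(n[i] for i in idx))
    # The estimate of mu_0 is the minimum of these pooled averages.
    mu0 = min(w / m for w, m in zip(weighted, weights))
    # Every other mean keeps its sample mean unless it falls below mu0.
    return [mu0] + [max(y, mu0) for y in ybar[1:]]
```

With sample means [5, 1, 3, 10] and equal sample sizes, the root is pooled with the two smallest means, so the first three estimates share one value while the last is left unchanged.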

On periodicities in the occurrence of nucleotides in protein coding stretches of genomes
Probal Chaudhuri, Indian Statistical Institute, India

The protein coding stretches of a genome often exhibit interesting periodic patterns. We shall start with a brief review of biological and statistical significance of such periodicities in the occurrence of nucleotides in some stretches of prokaryotic genomes. Then some statistical models and techniques for detecting and analyzing such periodicities in DNA sequences will be presented.

Application of these concepts and techniques in the prediction of genes in prokaryotic genomes will be discussed. The gene prediction problem that arises for a newly sequenced un-annotated genome can be viewed as an unsupervised learning problem, and an analysis of possible base periodicities of different ORFs can be effectively utilized there for an initial training of the classifier.
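
One common way to quantify a base periodicity, given here only as an illustrative sketch and not necessarily the technique presented in the talk, is to look at the Fourier power of a nucleotide's indicator sequence at the frequency of interest. The function name periodicity_power and the default period of 3 are assumptions made for this example.

```python
import cmath


def periodicity_power(seq, base, period=3):
    """Squared magnitude of the Fourier coefficient, at frequency 1/period,
    of the 0/1 indicator sequence marking where `base` occurs in `seq`."""
    coeff = sum(
        cmath.exp(-2j * cmath.pi * k / period)
        for k, ch in enumerate(seq)
        if ch == base
    )
    return abs(coeff) ** 2
```

A sequence in which a base recurs at every third position gives a large value, while a sequence with that base at every position gives a value near zero when its length is a multiple of the period.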

Bias adjustments in data analysis of double-bounded dichotomous choice contingent valuation surveys
Chen-Hsin Chen, Academia Sinica, Taiwan

Contingent valuation (CV) methods have been extensively used to elicit information on people's willingness to pay (WTP) for non-market goods or services. The double-bounded dichotomous choice approach has become one of the commonly used techniques among CV methods. Although its advantages over the other methods have been well documented, the possible existence of starting-point bias, and/or yea-saying bias or nay-saying bias, has also been recognized. In other words, extreme respondents who may be willing to pay any price, or unwilling to pay any price, are frequently encountered. Also, ordinary respondents, those who are willing to pay a reasonable price, may be subject to the starting-point bias associated with the anchoring effect. As a consensus on how to adjust for these biases has not been reached, in this study we utilize a three-component mixture model to tackle the issues simultaneously. A multinomial logistic model specifies the proportions of the three different types of respondents and adjusts for starting-point bias, yea-saying bias or nay-saying bias, and an accelerated failure time model with the adjusted anchoring effect formulates the distribution of WTP prices for ordinary respondents. An empirical example on WTP prices for a new hypertension treatment is also provided to illustrate the proposed method.

Optimal design of epidemic experiments
Alex R Cook, National University of Singapore

Alex Cook^1, Gavin Gibson^2 and Christopher Gilligan^3
1: Department of Statistics and Applied Probability (DSAP), National University of Singapore
2: Department of Actuarial Mathematics and Statistics and the Maxwell Institute, Heriot-Watt University, UK
3: Department of Plant Sciences, University of Cambridge, UK

We describe a method for optimal design of experiments for systems governed by stochastic processes, motivated by the processes typically used to model disease dynamics. The search for the optimal design uses Bayesian computational methods to explore the joint parameter-data-design space [4], accounting for a priori parametric uncertainty as well as population stochasticity. Statistical methods for inference in infectious diseases are well established [3] but are typically highly computationally intensive, making them impractical for a Monte Carlo search over multiple possible outcomes. We therefore resort to approximation to make the likelihood approximately tractable.
We have shown that well-designed experiments can yield almost as much information as much more costly designs [1, 2]. The method is illustrated by application to botanical epidemics. Extension of the work to designing observational studies of human diseases and to account for economic costs will be discussed.

References
[1] Cook AR, Otten W, Marion G, Gibson GJ, Gilligan CA (2007). Estimation of multiple transmission rates for epidemics in heterogeneous populations. Proc Natl Acad Sci USA 104:20392-20397.
[2] Cook AR, Gibson GJ, Gilligan CA (2008). Optimal observation times in epidemic processes. Biometrics 64:860-868.
[3] Gibson GJ, Renshaw E (1998). Estimating parameters in stochastic compartment models using Markov chain methods. IMA J Math Appl Med Biol 15:19-40.
[4] Müller P (1998). Simulation based optimal design. Bayesian Statistics 6:459-474.

Strong laws for balanced triangular urns
Amites Dasgupta, Indian Statistical Institute, India

Consider an urn model whose replacement matrix is triangular, has all entries nonnegative and the row sums are all equal to one. We obtain the strong laws for the number of balls corresponding to each color. The scalings for these laws depend on the diagonal elements of a rearranged replacement matrix. We use the strong laws obtained to study further behavior of certain three color urn models.
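
A minimal simulation sketch of such an urn is given below, assuming the usual dynamics: a colour is drawn with probability proportional to its current weight, and the corresponding row of the replacement matrix is added to the urn. The function simulate_urn, its arguments and the example matrix are hypothetical and serve only to make the setting concrete.

```python
import random


def simulate_urn(R, start, steps, seed=0):
    """Simulate an urn with replacement matrix R (list of rows).

    `start` holds the initial weight of each colour. Because every row
    of R sums to one, the total weight grows by exactly one per draw.
    """
    rng = random.Random(seed)
    counts = list(start)
    for _ in range(steps):
        # Draw a colour with probability proportional to its current weight.
        colour = rng.choices(range(len(counts)), weights=counts)[0]
        # Add the row of the replacement matrix for the drawn colour.
        counts = [c + r for c, r in zip(counts, R[colour])]
    return counts
```

For instance, simulate_urn([[0.5, 0.5], [0.0, 1.0]], [1.0, 1.0], 1000) uses an upper triangular matrix with unit row sums, and the returned weights can be compared across different numbers of steps to see how each colour grows.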

Nonparametric estimation of quality adjusted lifetime (QAL) distribution in a simple illness-death model
Anup Dewanji, Indian Statistical Institute, India

In this work, we consider nonparametric estimation of the quality adjusted lifetime (QAL) distribution in a simple illness-death model. We first derive the expression of the QAL distribution in terms of the distribution of sojourn time in each health state. Next we substitute the estimates of the sojourn time distributions into this expression to obtain an estimate of the QAL distribution. Consistency and asymptotic normality of the proposed nonparametric estimator are established. Estimation in the presence of some missing data on the transition time to illness is also discussed. We conduct a simulation study to investigate the performance of the proposed estimator. A data set from the Stanford Heart Transplant program is analyzed for illustration. An extension to a multi-state progressive model is discussed, along with an example using the International Breast Cancer Study Group (IBCSG) Trial V data.

On the robustness and efficiency of trimmed estimates
Subhra Sankar Dhar, Indian Statistical Institute, India

In this talk, we will start with a brief review of different trimming procedures in univariate and multivariate set ups. Then, we will present some results on the effect of trimming on robustness and efficiency of estimates.

On multivariate generalization of univariate nonparametric tests
Anil K. Ghosh, Indian Statistical Institute, India

Over the last couple of decades, several univariate rank based nonparametric methods have been generalized into the multivariate set up. In this talk, we will propose a simple recipe for such generalization. Here we project the observations in an appropriate direction estimated from a small part of the data and use the rest of the projected observations to perform the usual univariate test.
Since we estimate the projection based on a subsample, in order to remove the subjectivity and to provide better stability, we propose to repeat it over different subsamples and aggregate the results.
Our proposed test is distribution free and it has the same asymptotic efficiency as that of its univariate version. A modified version of this proposed method can be used even when the dimension is much larger than the sample size.

A new latent structure model for binary network analysis
Jing-Shiang Hwang, Academia Sinica, Taiwan

A new latent structure model is proposed for directly modeling the dependence between tie variables in a binary directed network. The tie variable is determined by three latent variables representing, respectively, the sending status of an actor, the receiving status of another actor, and the chance of successful contact between them. Another binary variable is also included to specify the noise unexplained by the available covariates. The proposed model allows us to easily describe the interdependence between ties. The parameters within the latent variables are further linked with covariates such as actor attributes and other actor-to-actor relations. The likelihood function for the model parameters can be expressed in a closed but complex form. We propose a fully Bayesian method that uses a Markov chain Monte Carlo sampling technique to make inferences on the parameters and predictions. The feasibility of practical applications of the proposed model is explored using the network data of Lazega's lawyer study. The advantages and disadvantages of the proposed model are discussed. This is joint work with Wei-Chung Liu.

Probabilistic properties of the sum-of-digits functions of random integers
Hsien-Kuei Hwang, Academia Sinica, Taiwan

A short survey will be given of most of the known results concerning the distributional properties of the sum-of-digits function of random integers. Several new findings will then be presented. In particular, I'll talk about a new approach based on the orthogonality of the Krawtchouk polynomials.

(This talk is based on joint work with Vytas Zacharovas.)
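
For readers unfamiliar with the object of study, the sketch below defines the sum-of-digits function and computes its exact mean over all integers with a fixed number of digits. A uniformly chosen integer below base^k has independent, uniformly distributed digits, so this mean equals k(base-1)/2. The function names are hypothetical, and the code does not reflect the Krawtchouk-polynomial approach of the talk.

```python
def digit_sum(n, base=2):
    """Sum of the digits of the non-negative integer n written in `base`."""
    total = 0
    while n > 0:
        n, digit = divmod(n, base)
        total += digit
    return total


def mean_digit_sum(num_digits, base=2):
    """Exact mean of digit_sum over all integers in [0, base**num_digits)."""
    limit = base ** num_digits
    return sum(digit_sum(n, base) for n in range(limit)) / limit
```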

Estimating the number of neurons in multi-neuronal spike trains
Mengxin Li, National University of Singapore

Spike sorting is a class of procedures used in the analysis of electrophysiological data. In particular, spike sorting procedures use the waveshapes of electric action potentials (spikes) collected with one or more electrodes in the brain to identify the neurons that generated them. This talk proposes a method-of-moments technique for estimating the number of neurons in multi-neuronal spike trains. The technique applies to isolated spikes and overlapping spikes. The resulting estimate is shown to be strongly consistent under mild conditions and an upper bound on the convergence rate is obtained.

This is joint work with Wei-Liem Loh.

New principles for model selection when models are possibly misspecified
Jun Liu, Harvard University, USA

Model misspecification is commonly encountered when we misspecify the family of distributions or when we have the correct family of distributions but miss some true predictor. In this paper, we derive extensions of the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) when the posited generalized linear model (GLM) may be misspecified. The new formulations enable us to propose a new semi-Bayesian model selection criterion (SIC), which is a combination of the estimated Kullback-Leibler (KL) divergence between the Bayesian marginal distribution and the true distribution of the response variable and the excess KL divergence of the model relative to the minimum one. SIC is shown to trade off among model fit, model complexity, and model misspecification. Through numerical studies we show that SIC is a promising model selection criterion in finite samples, even when the model is correctly specified.

Authors: Jinchi Lv (USC) and Jun Liu (Harvard)

Estimating the parameters of burst-type signals
Swagata Nandi, Indian Statistical Institute, India

In this paper, we study a model which, under certain conditions, exhibits burst-type features such as those seen in ECG signals. The model was proposed by Sharma and Sircar (2001), and we refer to it as the burst-type signal model. It is a generalization of the fixed amplitude sinusoidal model. The amplitudes take a certain form with several parameters. We assume that the error random variables are independent and identically distributed. The least squares method is proposed to estimate the unknown parameters. We show that the least squares estimators are strongly consistent and that their asymptotic distribution is Gaussian. Some numerical results based on simulations are presented for illustrative purposes.
(This is joint work with D. Kundu.)

Sharma, R. K. and Sircar, P. (2001). Parametric modelling of burst-type signals. Journal of the Franklin Institute. 338, 817-832.

Approximating the marginal likelihood using copula
David Nott, National University of Singapore

Model selection is an important activity in modern data analysis and the conventional Bayesian approach to this problem involves calculation of marginal likelihoods for different models, together with diagnostics which examine specific aspects of model fit. Calculating the marginal likelihood is a difficult computational problem. In this talk we discuss some extensions of the Laplace approximation for this task that are related to copula models and which are easy to apply. Variations which can be used both with and without simulation from the posterior distribution are considered, as well as use of the approximations with bridge sampling and in random effects models with a large number of latent variables. The use of a t-copula to obtain higher accuracy when multivariate dependence is not well captured by a Gaussian copula is also discussed.

Some models of random oriented trees
Anish Sarkar, Indian Statistical Institute, India

Random oriented trees have been used in many physical models such as river networks. In this talk, I will describe some of these models, such as Howard's model, the Scheidegger model and other related models. Scaling limits of these models are also of interest and bring out the connection between the Brownian web and these models. I will state some of the results and describe some of the open problems.

A kernel classifier for two populations with partially dependent data
Debasis Sengupta, Indian Statistical Institute, India

Data available for training two-population classifiers sometimes occur in pairs. We consider a general form of training data consisting of paired as well as individual samples from the two populations. Using the nonparametric maximum likelihood estimator of the joint distribution of the paired samples, we derive a kernel density estimator of the joint density.
We show theoretically, in a simple special case, that the implied estimator of the marginal density has smaller integrated mean squared error than that of a similar estimator obtained by ignoring dependence of the paired observations. We establish consistency of the marginal density estimator under suitable conditions and show that the misclassification probability of the resulting classifier is asymptotically equivalent to that of the Bayes classifier. We demonstrate small sample superiority of the proposed classifier over classifiers that ignore dependence and non-normality of training samples, through a simulation study with dependent and non-normal data. We also include a data analytic illustration.

Statistical inference for P( X < Y < Z)
Xiping Wang, National University of Singapore

Co-author(s): Pan Guangming^2, Zhou Wang^1

1: National University of Singapore
2: Nanyang Technological University

Let X, Y and Z be three independent random variables. In this study, we make statistical inference for P( X < Y < Z) via two methods: normal approximations and the jackknife empirical likelihood. Simulation studies indicate that both methods perform well.
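
As a point of reference, a natural plug-in estimate of P( X < Y < Z) from three independent samples is the fraction of triples (x, y, z) with x < y < z. The sketch below computes only this point estimate. It does not implement the normal approximation or the jackknife empirical likelihood discussed in the talk, and the function name estimate_p_xyz is hypothetical.

```python
from bisect import bisect_left, bisect_right


def estimate_p_xyz(xs, ys, zs):
    """Fraction of triples (x, y, z) from the three samples with x < y < z."""
    xs_sorted = sorted(xs)
    zs_sorted = sorted(zs)
    total = 0
    for y in ys:
        # Number of x values strictly below y.
        below = bisect_left(xs_sorted, y)
        # Number of z values strictly above y.
        above = len(zs_sorted) - bisect_right(zs_sorted, y)
        total += below * above
    return total / (len(xs) * len(ys) * len(zs))
```

Sorting the X and Z samples once avoids enumerating all triples explicitly, which matters when the samples are large.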
