Reliable information about a target population, in the form of a representative data-set, is indispensable for any statistical analysis. The accuracy of the estimates or measures computed by a statistician depends directly on the information contained in the data-set used. Often, however, such a data-set does not hold adequate information to produce estimates of the required accuracy. It may have too few observations, which reduces the accuracy of the estimates. More frequently, it may have too few variables to build a meaningful model for the response of interest. Another common problem is that the available data are not representative of the target population because of informative sampling and, even more problematically, because of high rates of informative non-response. Even though a statistician would prefer to design and collect appropriate data for the study of interest, data collection is often prohibitively expensive. As a result, one needs to devise procedures either for merging different data-sets or for borrowing information from similar observations within the same data-set.
The easy collection and storage of big data-sets, obtained from web-based applications, social networks or medical records, provide interesting opportunities. Such data sources appear to offer what a statistician needs, namely millions of observations on thousands of variables. These data-sets, however, are not collected according to any sampling design. In other words, they are observational and may not represent the population targeted by the analyst. Using big data sources, with or without integration with carefully designed survey data, would often be beneficial in computing the official statistics that are required for making policy decisions. Integration of various data sources is an active topic of research in several branches of statistics.
The first week of the programme will be devoted to a workshop on statistical data integration. The workshop will consist of expository lectures on different aspects of data integration methods in statistics. The broad topics of the lectures will include small area estimation; statistical methods for record linkage; data confidentiality, disclosure methods and privacy assessment; multiple imputation techniques and the generation of synthetic data to protect privacy; big data integration techniques in official statistics; and methods for analysing big data-sets obtained from social networks, online transactions and similar sources. The workshop is designed to be a precursor to the second week of the programme, which will consist of a three-day conference on current trends in survey statistics, where more recent developments in the above topics will be discussed.