Workshop on Living Analytics: Analyzing High-Dimensional Behavioral and Other Data from Dynamic Network Environments
(26 - 28 Feb 2014)


~ Abstracts ~

 

Gaussian process emulation of models with massive output
Jim Berger, Duke University, USA


Complex models of processes are often difficult to work with - e.g., for prediction at new model inputs - simply because of the massiveness of the model output. A common approach for dealing with this when working with computer models of physical processes is called emulation, which means developing a fast approximation of the computer model. Approaches that have been considered include utilization of multivariate emulators, modeling of the output (e.g., through some basis representations, including PCA), and construction of parallel emulators at each output point. These approaches will be reviewed, highlighting the startling computational simplicity with which the last approach can be implemented. This research has been conducted in the context of computer models of physical processes such as pyroclastic flow and wind fields, which will be used for illustration. It is hoped that these ideas might transfer to models of network processes.
Authors: James Berger and Mengyang Gu
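
The computational simplicity of the last approach can be made concrete. If the emulators at all output points share the same design inputs and (under a simplifying assumption) the same correlation parameters, a single Cholesky factorization of the n x n correlation matrix serves every output coordinate at once. The sketch below is an illustration under these assumptions, not the authors' implementation; the kernel, the zero-mean assumption, and the parameter values are placeholders.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def sq_exp_corr(X, lengthscales, nugget=1e-8):
        # Squared-exponential correlation matrix over the shared design points.
        d = (X[:, None, :] - X[None, :, :]) / lengthscales
        return np.exp(-0.5 * np.sum(d**2, axis=-1)) + nugget * np.eye(len(X))

    def fit_parallel_emulators(X, Y, lengthscales):
        # X: (n, p) design inputs; Y: (n, k) model output, k possibly huge.
        # One O(n^3) factorization is shared by all k per-output emulators.
        chol = cho_factor(sq_exp_corr(X, lengthscales))
        return cho_solve(chol, Y)  # (n, k): solves R alpha = Y for all outputs at once

    def predict_mean(Xstar, X, lengthscales, alpha):
        # Posterior means at m new inputs, for all k outputs, in one matrix product.
        d = (Xstar[:, None, :] - X[None, :, :]) / lengthscales
        r = np.exp(-0.5 * np.sum(d**2, axis=-1))  # (m, n) cross-correlations
        return r @ alpha  # (m, k) predicted output field

The point is that the cost is dominated by a single factorization in the number of model runs n, independent of how massive the output dimension k is; prediction over the entire output field reduces to one matrix product.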


Successful data mining in practice
Richard De Veaux, Williams College, USA


The sheer volume and complexity of the data collected by or available to most organizations have created an imposing barrier to their effective use. These challenges have propelled data mining and predictive analytics to the forefront of making profitable and effective use of data. Like traditional statistical analysis, data mining is a process that uses a variety of data analysis and modeling techniques to discover patterns and relationships in data that may be used to make accurate predictions.

While the most widespread applications of data mining are in CRM (customer relationship management), other important applications include fraud detection and identifying good credit risks. In this course, we'll first take a brief tour of the types of problems best suited to data mining and predictive analytics. Then, we'll discuss some of the challenges that an analyst faces and suggest solutions to the most common problems. Via a series of case studies, we'll compare and contrast some of the most widely used methods.


LARC-LiveLabs Overview
Stephen E. Fienberg, Carnegie Mellon University, USA


The Living Analytics Research Centre (LARC) and LiveLabs are two research centres established by SMU to pursue research on analytics for business, consumer and social insights, an area of research excellence identified by the university. LARC in particular was jointly established with Carnegie Mellon University, Pittsburgh, to develop novel data and decision analytics techniques that improve users' experience by offering them personalized services and that optimize the use of resources to improve service quality. Beyond analytics, LARC researchers also conduct randomized controlled experiments to measure the effects of user interventions based on insights gained from analytics. LiveLabs, on the other hand, is a testbed initiative consisting of thousands of volunteers whose locations and activities within the SMU campus can be tracked via their wireless devices connected to Wi-Fi. In this talk, we will briefly highlight the research work in LARC and LiveLabs, providing workshop participants with the necessary data and research context for meaningful interactions and idea exchange.


More effective distributed ML via a stale synchronous parallel parameter server
Qirong Ho, Carnegie Mellon University, USA


Modern applications awaiting next generation machine intelligence systems have posed unprecedented scalability challenges. These scalability needs arise from at least two aspects: 1) massive data volume, such as societal-scale social graphs with up to hundreds of millions of nodes; and 2) massive model size, such as the Google Brain deep neural network containing billions of parameters. Although there exist means and theories to support reductionist approaches like subsampling data or using small models, there is an imperative need for sound and effective distributed ML methodologies for users who cannot be well-served by such shortcuts. To this end, we propose a parameter server system for distributed ML, which follows a Stale Synchronous Parallel (SSP) model of computation that maximizes the time computational workers spend doing useful work on ML algorithms, while still providing correctness guarantees. The parameter server provides an easy-to-use shared interface for read/write access to an ML model's values (parameters and variables), and the SSP model allows distributed workers to read older, stale versions of these values from a local cache, instead of waiting to get them from a central storage. This significantly increases the proportion of time workers spend computing, as opposed to waiting. Furthermore, the SSP model ensures ML algorithm correctness by limiting the maximum age of the stale values. We provide a proof of correctness under SSP, as well as empirical results demonstrating that the SSP model achieves faster algorithm convergence on several different ML problems, compared to fully-synchronous and asynchronous schemes.
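
The SSP read rule itself fits in a few lines. The following is a minimal sketch of the idea, assuming a single staleness bound s and a central server object with get/add methods (a hypothetical interface for illustration, not the actual parameter server API):

    class SSPCache:
        # Worker-side cache under Stale Synchronous Parallel with staleness bound s.
        def __init__(self, server, staleness):
            self.server = server    # central store of model values (assumed API)
            self.s = staleness      # maximum allowed staleness, in clock ticks
            self.cache = {}         # key -> (value, worker clock when fetched)
            self.clock = 0          # this worker's iteration counter

        def read(self, key):
            value, fetched_at = self.cache.get(key, (None, -1))
            # SSP rule: a cached value may be used iff it is at most s clocks old;
            # otherwise the worker must wait for a fresh copy from central storage.
            if fetched_at < self.clock - self.s:
                value = self.server.get(key)
                self.cache[key] = (value, self.clock)
            return value

        def inc(self, key, delta):
            self.server.add(key, delta)  # writes are pushed; reads may stay stale

        def advance_clock(self):
            self.clock += 1  # called once per completed iteration

With s = 0 this degenerates to fully synchronous execution; larger s lets workers keep computing from their caches instead of waiting, and the bound on how stale a read can be is what makes the correctness guarantees possible.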


Online learning for big data and living analytics
Steven Hoi, Nanyang Technological University


The rapidly increasing data from our daily lives represent an important form of big data, presenting numerous challenges and opportunities for automated data analytics with machine learning techniques. Common challenges of living analytics with big data include high volume, high velocity, high dimensionality, sparsity, and a variety of diverse data sources and formats. In this talk, I will introduce some of our recent work on online learning, a family of highly efficient and scalable machine learning algorithms, for addressing some of these open challenges of living analytics with big data, and will also discuss some future directions.
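
As a concrete instance of the kind of algorithm meant here (an illustration, not necessarily one of the methods in the talk), online gradient descent for logistic regression touches each example exactly once, so its memory footprint is constant in the number of examples and each update is cheap even in high dimensions:

    import numpy as np

    def online_logistic_regression(stream, dim, lr=0.1):
        # One-pass learner: (x, y) pairs with y in {0, 1} arrive one at a time.
        w = np.zeros(dim)
        for x, y in stream:
            p = 1.0 / (1.0 + np.exp(-w @ x))  # predict before the label is revealed
            w -= lr * (p - y) * x             # gradient step on the log loss
        return w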


Mining viewpoints from online forums
Jing Jiang, Singapore Management University


Threaded discussion forums provide an important social media platform that allows people to express their opinions and communicate with others. In particular, comments on major sociopolitical issues can often be found in discussion forums, and people often hold different opinions. Such information serves as an important source of public feedback. To automatically discover the viewpoints expressed in discussion forums, we propose a probabilistic model that considers different aspects of an issue as well as users' interactions in order to identify different viewpoints and user groups. Evaluation results show that our model clearly outperforms a number of baseline models in terms of both clustering posts based on viewpoints and clustering users with different viewpoints.


Point-of-interest recommendation in location-based social networks
Irwin King, The Chinese University of Hong Kong, Hong Kong


Personalized point-of-interest (POI) recommendation is an important task in location-based social networks (LBSNs), as it can help targeted users to explore their surroundings and third-party developers to provide personalized services. Unlike typical social networks, LBSNs contain check-in data, geographical information, and social information, which reveal unique characteristics of frequency data, sparsity, and context. In this talk, I will introduce some of our recent work on POI recommendation in LBSNs, including how to model and incorporate geographical information together with social information, how to exploit temporal information, and how to tackle the computational issues. I will also address some open challenges and discuss some future directions.
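
To give one concrete flavour of the modelling problem, a common baseline for fusing the signals (a hedged illustration only, not the speaker's method) multiplies a latent preference score by a geographical weight; check-in distances in LBSNs are frequently modelled with a power law:

    import numpy as np

    def poi_score(u_vec, l_vec, visited_coords, l_coord, a=1.0, b=-1.5):
        # u_vec, l_vec: latent factors, e.g. from matrix factorization of check-ins.
        # a, b: hypothetical power-law coefficients fitted to check-in distances.
        pref = u_vec @ l_vec
        dists = np.linalg.norm(visited_coords - l_coord, axis=1) + 1e-6
        geo = np.prod(a * dists ** b)  # distance decay w.r.t. the user's visited POIs
        return pref * geo

Candidate POIs are then ranked by this fused score for each user.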


Statistical computing in protein folding
Samuel Kou, Harvard University, USA


Predicting the native structure of a protein from its amino acid sequence is a long-standing problem. A significant bottleneck of computational prediction is the lack of efficient sampling algorithms to explore the configuration space of a protein. In this talk we will introduce a sequential Monte Carlo method to address this challenge: fragment regrowth via energy-guided sequential sampling (FRESS). The FRESS algorithm combines statistical learning (namely, learning from the Protein Data Bank) with sequential sampling to guide the computation, resulting in a fast and effective exploration of the configurations. We will illustrate the FRESS algorithm with both lattice protein models and real proteins.
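
To make "fragment regrowth via energy-guided sequential sampling" concrete, the toy sketch below regrows the tail of a 2D lattice chain one residue at a time, choosing each new site with probability proportional to exp(-E/tau) over the free neighbouring sites. It is a bare-bones caricature of the idea, not the FRESS algorithm itself, which also regrows internal fragments, carries importance weights, and uses statistics learned from the Protein Data Bank:

    import math, random

    MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]

    def regrow_tail(chain, start, energy, tau=1.0):
        # chain: list of (x, y) lattice sites; start >= 1 marks the kept prefix.
        # energy(site, partial_chain) -> float scores a candidate placement.
        # Returns a regrown chain, or None if the walk got trapped (then retry).
        new_chain = list(chain[:start])
        occupied = set(new_chain)
        for _ in range(len(chain) - start):
            prev = new_chain[-1]
            free = [(prev[0] + dx, prev[1] + dy) for dx, dy in MOVES]
            free = [c for c in free if c not in occupied]
            if not free:
                return None
            weights = [math.exp(-energy(c, new_chain) / tau) for c in free]
            site = random.choices(free, weights=weights)[0]  # energy-guided step
            new_chain.append(site)
            occupied.add(site)
        # A full SMC treatment also attaches an importance weight correcting
        # for this guided proposal; omitted here for brevity.
        return new_chain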


Active learning for probabilistic models
Wee Sun Lee, National University of Singapore


Data from networked environments often have dependencies that can be modeled using probabilistic graphical models. We study the problem of selecting a subset of variables in a probabilistic model to query, so that the remaining variables can be well predicted. In the non-adaptive case, where the entire subset of query variables has to be selected before their labels are revealed, a greedy method that iteratively selects the variable that maximizes the increase of the Shannon entropy of the query subset can be shown to be within a constant factor of optimal. However, this result does not extend to the adaptive case, where the label is revealed immediately after each variable is queried. We look at greedy methods based on a different measure of uncertainty, the Gibbs error, which measures the expected error of the classifier that classifies by sampling a labeling of the variables from the probabilistic model. Both the Shannon entropy and the Gibbs error are special cases of the Tsallis entropy, a generalization of entropy that has been proposed in statistical mechanics. Greedy methods for variable selection can be shown to be within a constant factor of optimal for both the non-adaptive and adaptive cases when the Gibbs error is used as the performance criterion.
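
The relationship between the three uncertainty measures can be written down directly. For a distribution p over labelings y of the query subset, the Tsallis entropy is

    T_q(p) = \frac{1}{q-1}\Big(1 - \sum_y p(y)^q\Big),

which recovers the Shannon entropy -\sum_y p(y)\log p(y) in the limit q \to 1, while q = 2 gives the Gibbs error

    T_2(p) = 1 - \sum_y p(y)^2,

which is exactly the expected error of a classifier that predicts by sampling a labeling from p when the true labeling is itself drawn from p.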


LARC-LiveLabs Overview
Ee-Peng Lim, Singapore Management University


The Living Analytics Research Centre (LARC) and LiveLabs are two research centres established by SMU to pursue research on analytics for business, consumer and social insights, an area of research excellence identified by the university. LARC in particular was jointly established with Carnegie Mellon University, Pittsburgh, to develop novel data and decision analytics techniques that improve users' experience by offering them personalized services and that optimize the use of resources to improve service quality. Beyond analytics, LARC researchers also conduct randomized controlled experiments to measure the effects of user interventions based on insights gained from analytics. LiveLabs, on the other hand, is a testbed initiative consisting of thousands of volunteers whose locations and activities within the SMU campus can be tracked via their wireless devices connected to Wi-Fi. In this talk, we will briefly highlight the research work in LARC and LiveLabs, providing workshop participants with the necessary data and research context for meaningful interactions and idea exchange.


Learning-based approaches for link discovery given unlabelled data
Shou-De Lin, National Taiwan University, Taiwan


Link discovery aims at identifying hidden connections between instances in social networks. A popular strategy for link discovery relies on training a classifier using labelled links with a set of meaningful features. However, in some cases the labels of the links to be discovered are not available. For instance, out of privacy concerns, in some social network services such as Foursquare.com the 'like' relationship is not available to the public. In this talk, I'll introduce two kinds of challenging link discovery scenarios in which the links to be discovered have never appeared in the dataset: links that represent unseen relationships and links that represent novel diffusion in social networks. I'll then describe a learning-based framework that integrates diverse knowledge and data to address the challenge.
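
The "popular strategy" that the talk takes as its starting point can be sketched in a few lines (an illustration of the supervised baseline, not the speaker's framework): represent each candidate node pair by standard topological features and fit any off-the-shelf classifier.

    import networkx as nx
    from sklearn.linear_model import LogisticRegression

    def pair_features(G, u, v):
        # Classic topological features for a candidate link (u, v).
        common = len(list(nx.common_neighbors(G, u, v)))
        jaccard = next(nx.jaccard_coefficient(G, [(u, v)]))[2]
        adamic = next(nx.adamic_adar_index(G, [(u, v)]))[2]
        return [common, jaccard, adamic]

    def train_link_classifier(G, labelled_pairs):
        # labelled_pairs: list of ((u, v), label), with label 1 if the link exists.
        X = [pair_features(G, u, v) for (u, v), _ in labelled_pairs]
        y = [label for _, label in labelled_pairs]
        return LogisticRegression().fit(X, y)

The scenarios in the talk are precisely those where the labels above are unavailable for the target relationship, which is what the knowledge-integration framework is designed to work around.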


Large-scale social identity linkage via heterogeneous behavior modeling
Siyuan Liu, Carnegie Mellon University, USA


We study the problem of large-scale social identity linkage across different social media platforms, which is of critical importance to business intelligence, as it allows a deeper understanding and more accurate profiling of users to be gained from social data. We propose HYDRA, a solution framework consisting of three key steps: (I) modeling heterogeneous behavior by long-term behavior distribution analysis and multi-resolution temporal information matching; (II) constructing a structural consistency graph to measure the high-order structure consistency of users' core social structures across different platforms; and (III) learning the mapping function by multi-objective optimization composed of both supervised learning on pair-wise ID linkage information and cross-platform structure consistency maximization. Extensive experiments on 10 million users across seven popular social network platforms demonstrate that HYDRA correctly identifies real user linkage across different platforms, outperforming existing state-of-the-art algorithms by at least 20% across different settings, and by a factor of four in most settings.
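
Step (III) can be summarized schematically, consistent with the description above (the notation f, \ell, y_{ij}, \lambda and C_{struct} is introduced here only for illustration; the precise losses and regularizers are in the paper):

    \min_f \; \sum_{(i,j) \in \mathcal{L}} \ell\big(f(u_i, v_j),\, y_{ij}\big) \;-\; \lambda\, C_{\mathrm{struct}}(f)

where the first term is the supervised loss on the known pair-wise ID linkages \mathcal{L} and the second rewards cross-platform consistency of the core social structures induced by the mapping f.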


Bayesian approaches to describing complex problems
Kerrie Mengersen, Queensland University of Technology, Australia


High-dimensional data provide tremendous opportunities to learn more about biological, environmental, industrial and social systems. However, the development of models, methods and computational algorithms to realise these opportunities remains a challenge. In this presentation, I will discuss two projects that contribute to addressing this challenge. The first project involves identifying spatio-temporal trends and changes in large, high-dimensional data. We suggest a Bayesian approach to analyze change points in multivariate time series and space-time data, and demonstrate the approach by assessing the impact of extended inundation on the ecosystem of the Gulf Plains bioregion in northern Australia. The associated computational algorithm is appreciably faster than a standard implementation in such cases, possibly by a factor of millions, making it feasible to analyze high-dimensional space-time data sets of realistic size. The second project involves modelling airport systems based on multiple sources of data. We focus on a particular airport process and illustrate the use of a Bayesian network to describe the system.
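
As a minimal example of the Bayesian change-point machinery behind the first project (a textbook single-change-point version, far simpler than the multivariate space-time model in the talk), the posterior over the change location in a Gaussian sequence follows from conjugate segment marginal likelihoods:

    import numpy as np

    def changepoint_posterior(x, sigma=1.0, tau0=10.0):
        # Posterior over one change in the mean of a Gaussian series x[0..n-1],
        # with N(0, tau0^2) priors on the segment means, known noise sd sigma,
        # and a uniform prior on the change location t = 1..n-1.
        n = len(x)

        def seg_loglik(seg):
            # Log marginal likelihood of a segment with its mean integrated out
            # (up to terms that are constant in the change location).
            m, s = len(seg), seg.sum()
            var = m * tau0**2 + sigma**2
            return (0.5 * (tau0**2 * s**2) / (sigma**2 * var)
                    - 0.5 * np.log(var / sigma**2))

        logpost = np.array([seg_loglik(x[:t]) + seg_loglik(x[t:])
                            for t in range(1, n)])
        logpost -= logpost.max()          # stabilize before exponentiating
        post = np.exp(logpost)
        return post / post.sum()          # post[t-1] = P(change after position t | x)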


On the algorithmic and system interface of big learning
Eric Xing, Carnegie Mellon University, USA


In many modern applications built on massive data and using high-dimensional models, such as web-scale content extraction via topic models, genome-wide association mapping via sparse regression, and image understanding via deep neural networks, one needs to handle BIG machine learning problems that threaten to exceed the limits of current infrastructures and algorithms. While the ML community continues to strive for new scalable algorithms, and several attempts at developing new system architectures for BIG ML have emerged to address the challenge on the backend, good dialogue between ML and systems remains difficult: most algorithmic research remains disconnected from the real systems and data it must face, and the generality, programmability, and theoretical guarantees of most systems on ML programs remain largely unclear. In this talk, I will present Petuum, a general-purpose framework for distributed machine learning, and demonstrate how innovations in scalable algorithms and distributed systems design work in concert to achieve multiple orders of magnitude of scalability on a modest cluster for a wide range of large-scale problems in social networks (mixed-membership inference on 100M nodes), personalized genome medicine (sparse regression on 100M dimensions), and computer vision (classification over 20K labels), with provable guarantees on the correctness of distributed inference.


Real-time bursty topic detection on social media
Feida Zhu, Singapore Management University


Social media websites like Twitter have become the major platforms for users around the world to share anything happening around them with friends and beyond. A bursty topic in Twitter is one that triggers a surge of relevant tweets within a short time, which often reflects important events of mass interest. How to leverage Twitter for early detection of bursty topics has therefore become an important research problem with immense practical value. Despite the wealth of research work on topic modeling and analysis in Twitter, it remains a huge challenge to detect bursty topics in real-time. As existing methods can hardly scale to handle the task with the tweet stream in real-time, we introduce in this talk TopicSketch, a novel sketch-based topic model together with a set of techniques to achieve real-time detection. We show experiment results evaluated on a tweet stream with over 30 million tweets to demonstrate both the efficiency and effectiveness of our approach.
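
The flavour of sketch-based burst detection can be conveyed with a tiny word-level caricature (TopicSketch itself maintains richer sketched statistics so that whole topics, not just single words, can be recovered): hash words into a fixed-size count-min sketch per time window, and flag a word whose current-window rate sharply exceeds its historical rate. The burst test and its thresholds below are hypothetical:

    import hashlib
    import numpy as np

    class CountMinSketch:
        # Fixed-memory frequency estimates; memory is independent of vocabulary size.
        def __init__(self, depth=4, width=2**16):
            self.depth, self.width = depth, width
            self.counts = np.zeros((depth, width), dtype=np.int64)

        def _cells(self, word):
            h = hashlib.md5(word.encode()).digest()  # 16 bytes -> 4 hash rows
            return [int.from_bytes(h[4*i:4*i+4], "little") % self.width
                    for i in range(self.depth)]

        def add(self, word):
            for row, col in enumerate(self._cells(word)):
                self.counts[row, col] += 1

        def estimate(self, word):
            return min(self.counts[row, col]
                       for row, col in enumerate(self._cells(word)))

    def is_bursty(prev_window, curr_window, word, ratio=5.0, floor=20):
        # Hypothetical test: current-window count >> previous-window count.
        prev, curr = prev_window.estimate(word), curr_window.estimate(word)
        return curr >= floor and curr >= ratio * max(prev, 1)

Because the sketch's memory is fixed, such per-window statistics can be maintained over the full tweet stream in real time.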
