Data Science and Mining team (DaSciM)
LIX Laboratory, École Polytechnique.

Research

My main research interests are in machine learning, data science, and artificial intelligence. In particular, I work on models that predict multiple outputs, and on doing this scalably and in challenging contexts such as data streams and time series (including many applications involving sensory data). More recently, I have also become interested in reinforcement learning.

Multi-label Classification

In multi-label classification, values for multiple target variables are predicted for each instance, as opposed to the traditional supervised learning problem where a single target variable (class) is associated with each instance. The main challenge is detecting and modelling dependencies among labels while maintaining scalability to large problems. This task is relevant to many domains where multiple labels can be assigned to text documents, images, video and other media, and it also arises in many medical and biological applications.
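As a rough illustration of what modelling label dependencies can look like, the sketch below implements a classifier chain, one well-known approach in which the model for each label also receives the values of the preceding labels as extra inputs. The `NearestNeighbour` base learner, the class names, and the toy data are assumptions made purely for illustration; this is a minimal sketch, not a description of any specific system mentioned on this page.

```python
from typing import List, Sequence


class NearestNeighbour:
    """Tiny 1-nearest-neighbour classifier, a stand-in base learner."""

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[int]) -> "NearestNeighbour":
        self.X = [list(x) for x in X]
        self.y = list(y)
        return self

    def predict_one(self, x: Sequence[float]) -> int:
        # Squared Euclidean distance to every stored training instance.
        dists = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in self.X]
        return self.y[dists.index(min(dists))]


class ClassifierChain:
    """One binary model per label; model j also sees labels 0..j-1."""

    def __init__(self, n_labels: int) -> None:
        self.models = [NearestNeighbour() for _ in range(n_labels)]

    def fit(self, X: Sequence[Sequence[float]], Y: Sequence[Sequence[int]]) -> "ClassifierChain":
        for j, model in enumerate(self.models):
            # Training: augment the features with the TRUE earlier labels.
            Xj = [list(x) + list(y[:j]) for x, y in zip(X, Y)]
            model.fit(Xj, [y[j] for y in Y])
        return self

    def predict_one(self, x: Sequence[float]) -> List[int]:
        preds: List[int] = []
        for model in self.models:
            # Prediction: augment with the labels PREDICTED so far.
            preds.append(model.predict_one(list(x) + preds))
        return preds


# Toy data (hypothetical): one feature, two labels per instance.
X = [[0.0], [1.0], [2.0], [3.0]]
Y = [[0, 0], [0, 1], [1, 1], [1, 0]]
chain = ClassifierChain(n_labels=2).fit(X, Y)
prediction = chain.predict_one([2.9])
```

The order of the labels in the chain is a design choice; here it simply follows the column order of `Y`.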

In recent years, multi-label methods have become increasingly scalable, which has broadened the range of possible applications. Multi-label prediction is in fact a particular type of structured output prediction, and can be applied to topics such as sequence prediction and forecasting, localization and segmentation, and many image and text-based tasks. There are strong connections with other topics, namely probabilistic graphical models, neural networks (including deep learning), time series forecasting, sequence segmentation and mining, and dynamical models; all of which are among my research interests.

Data Stream Classification

Many real-world applications are found in the context of data streams, where data instances arrive continuously in a theoretically infinite stream; for example, in sensor networks, online social media and text streams, and anomaly and event detection.

When temporal dependence is present (rather than an i.i.d. stream) we may consider this a time series, but with additional important challenges: methods must be able to process large volumes of data quickly and learn and make predictions in real time (a full forward-backward pass is not possible), as well as detect and adapt to concept drift.
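The constraints above can be sketched with a test-then-train (prequential) loop, in which each instance is first used for prediction and only then for learning, so the stream is processed in a single pass. The following is a minimal sketch under stated assumptions: the synthetic `drifting_stream`, the `WindowedNearestNeighbour` learner, and the window size are all hypothetical, and drift is handled here only implicitly, by forgetting old examples through a sliding window rather than by an explicit drift detector.

```python
from collections import deque
from typing import Iterator, List, Tuple


def drifting_stream(n: int = 200, drift_at: int = 100) -> Iterator[Tuple[float, int]]:
    """Synthetic stream whose concept (labelling rule) flips at `drift_at`."""
    xs = [0.1, 0.9, 0.3, 0.7]
    for t in range(n):
        x = xs[t % len(xs)]
        label = int(x > 0.5)
        if t >= drift_at:
            label = 1 - label  # abrupt concept drift
        yield x, label


class WindowedNearestNeighbour:
    """1-NN over a bounded window, so old concepts are eventually forgotten."""

    def __init__(self, window_size: int = 20) -> None:
        self.window: deque = deque(maxlen=window_size)

    def predict(self, x: float) -> int:
        if not self.window:
            return 0  # arbitrary default before any example has been seen
        # Iterate newest-first so that ties are broken by recency.
        nearest = min(reversed(self.window), key=lambda ex: abs(ex[0] - x))
        return nearest[1]

    def learn(self, x: float, y: int) -> None:
        self.window.append((x, y))


def prequential(stream: Iterator[Tuple[float, int]], learner: WindowedNearestNeighbour) -> List[int]:
    """Test-then-train: predict on each instance before learning from it."""
    correct = []
    for x, y in stream:
        correct.append(int(learner.predict(x) == y))  # test first
        learner.learn(x, y)                           # then train
    return correct


results = prequential(drifting_stream(), WindowedNearestNeighbour())
accuracy_before = sum(results[:100]) / 100
accuracy_after = sum(results[100:]) / 100
```

In this toy setting the errors cluster immediately after the drift point, and accuracy recovers once the window contains examples of the new concept.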

Applications

I have been interested in many applications involving real-world data and deployments, often dealing with sensory data.

Learning to predict a traveller’s route and destination

Given only a week or so of location data from a mobile phone device, it was possible to make reasonably accurate predictions about a traveller’s route and future destination in an urban setting. See, for example, this Demo Animation (the captions explain what is going on), which uses real data collected in the greater Helsinki area.

Tracking on very low-power sensor motes

In Madrid I worked on formulating and implementing a distributed particle filter on very low-power motes (4 MHz CPU) for real-time target tracking. For more information, see this video of a demo of a testbed deployment for tracking using only light sensor observations.
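For readers unfamiliar with particle filters, the sketch below shows the basic propagate, weight, and resample cycle of a standard bootstrap particle filter. It is a centralised, one-dimensional illustration only, not the distributed implementation deployed on the motes: the random-walk motion model, the Gaussian observation model, and all parameter values are assumptions chosen for simplicity.

```python
import math
import random
from typing import List, Sequence


def gaussian_likelihood(observed: float, predicted: float, sigma: float) -> float:
    """Unnormalised Gaussian likelihood of an observation given a particle."""
    return math.exp(-0.5 * ((observed - predicted) / sigma) ** 2)


def particle_filter_step(
    particles: Sequence[float],
    observation: float,
    rng: random.Random,
    process_sigma: float = 0.3,  # assumed motion noise
    obs_sigma: float = 0.5,      # assumed sensor noise
) -> List[float]:
    # 1. Propagate each particle through the (assumed) random-walk motion model.
    moved = [p + rng.gauss(0.0, process_sigma) for p in particles]
    # 2. Weight each particle by how well it explains the observation.
    weights = [gaussian_likelihood(observation, p, obs_sigma) for p in moved]
    if sum(weights) == 0.0:
        weights = [1.0] * len(moved)  # fall back to uniform weights
    # 3. Resample with replacement, in proportion to the weights.
    return rng.choices(moved, weights=weights, k=len(moved))


def track(observations: Sequence[float], n_particles: int = 500, seed: int = 0) -> List[float]:
    """Return the posterior-mean position estimate after each observation."""
    rng = random.Random(seed)
    particles = [rng.uniform(-5.0, 5.0) for _ in range(n_particles)]
    estimates = []
    for z in observations:
        particles = particle_filter_step(particles, z, rng)
        estimates.append(sum(particles) / len(particles))
    return estimates


# Hypothetical target drifting slowly to the right, observed without noise.
observations = [0.1 * t for t in range(30)]
estimates = track(observations)
```

Resampling at every step keeps the sketch short; resampling only when the weights become too uneven is a common alternative that saves computation, which matters on constrained hardware.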

Modelling tree growth in Scots pine

In Helsinki I worked with forestry scientists to model intra-annual growth of Scots pine trees in Finland and France using time series and machine learning models.

Modelling and diagnosis of insomnia

We are building predictive models for diagnosing different types of insomnia (in particular, those which can be treated with non-pharmaceutical options) and for predicting the response to different variations in treatment, in order to formulate the most appropriate treatment for a fast and effective recovery. This involves a number of subtasks such as event detection in sequences and streams involving multiple correlated outputs.