Research

In many high-impact applications, ranging from online services and cyber-physical systems to the health sciences, massive amounts of data are gathered. The need to extract meaningful information from this ever-growing data forms the foundation of our research. Our group focuses on the development of robust data mining and machine learning techniques for large, complex data. Our projects are highly interdisciplinary and analyze data from multiple domains: e-commerce and social network data, biology and health sciences, and the monitoring of technical systems are only a few examples. Our work is based on sound theoretical principles, covering Bayesian statistics and combinatorial optimization.

Below we give an overview of some of our ongoing projects. If you are interested in these research directions, please don't hesitate to contact us.

Analysis of Complex Networks

With the rapid growth of social media, sensor technologies, and life science applications, large-scale complex graphs have become a ubiquitous and highly informative data source. Beyond the mere description of individual objects, the relations between objects are captured by an underlying graph structure. Examples include review and co-purchase networks (e.g. Amazon, Yelp, ...), protein interaction networks (e.g. BioGrid), and social networks (e.g. Facebook). The goal of this project is to develop and analyze robust data mining techniques for large-scale complex graphs. Since complex graphs in real-world applications are often corrupted, prone to outliers, and vulnerable to attacks, we focus specifically on the robustness properties of these methods. The research results will serve as a foundation for research and development in areas such as spam and fraud detection, advanced data cleansing, and recommender systems.

Robust Temporal Data Mining

While many analysis techniques focus on static data, a temporal analysis is often much more expressive. Online user reviews reflect the changing perception and quality of products, the sensor measurements of a technical system form a multivariate time series, and friendship relations in a social network evolve over time. In this project, we focus on the design of temporal analysis techniques. For example, we analyze the temporal rating data ubiquitous on e-commerce websites. These datasets are often prone to anomalies, e.g., because spammers try to push the success of a product or because the product’s underlying manufacturing process has changed. Ensuring a meaningful analysis despite such anomalies is the goal of our research. As an example, our methods have been used to spot irregularities in large-scale datasets from Amazon, Yelp, and TripAdvisor. In our ongoing projects and collaborations, these principles are extended to the domains of technical systems and the health sciences.
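To make the idea of spotting irregularities in temporal rating data concrete, the following is a minimal sketch and not one of our published methods. It assumes the ratings have already been aggregated into one mean value per day, and it swaps in a generic robust statistic (the modified z-score based on the median absolute deviation) so that a few extreme days do not distort the baseline. The function name, the example values, and the threshold of 3.5 are illustrative assumptions.

```python
from statistics import median


def flag_rating_anomalies(daily_means, threshold=3.5):
    """Return indices of days whose mean rating deviates strongly from the median.

    Hypothetical helper for illustration only. It uses the modified z-score,
    0.6745 * |x - median| / MAD, which stays stable even when some days
    are anomalous.
    """
    center = median(daily_means)
    mad = median(abs(x - center) for x in daily_means)
    if mad == 0:
        # More than half of the days share the same value:
        # flag every day that differs from it.
        return [i for i, x in enumerate(daily_means) if x != center]
    return [
        i
        for i, x in enumerate(daily_means)
        if 0.6745 * abs(x - center) / mad > threshold
    ]


# Made-up daily mean ratings of a single product; day 6 drops sharply.
ratings = [4.1, 4.0, 4.2, 4.1, 3.9, 4.0, 1.2, 4.1, 4.0, 4.2]
anomalies = flag_rating_anomalies(ratings)
print(anomalies)
```

A real analysis would additionally have to model gradual trends and the number of ratings per day, which this sketch ignores.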

Subspace Learning Principles

Advances in storage technology have made it possible to conveniently record a multitude of characteristics for each object: a person in a social network is characterized by hundreds of attributes, and genes are described in detail by a large number of expression values. In general, the data we collect is often high-dimensional. As the number of collected features grows, so does the risk of including noisy and irrelevant ones. When all features are considered together, one cannot expect to find meaningful patterns in the data. The goal of this project is to develop analysis techniques that are robust to noisy and irrelevant features. In particular, we study the principle of subspace learning, where the analysis is automatically performed in subspace projections of the data. As an example, we are designing methods for subspace clustering, which aim to simultaneously find groups of similar instances and their relevant features.
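The notion of "relevant features" of a group can be illustrated with a small sketch. It covers only half of the subspace clustering problem: it assumes a candidate group is already given and selects the features on which that group is tight compared to the whole dataset. Actual subspace clustering methods have to find groups and features simultaneously. The function name, the variance-ratio criterion, the `ratio` parameter, and the toy data are all assumptions made for illustration.

```python
from statistics import pvariance


def relevant_features(group, all_points, ratio=0.25):
    """Return indices of features on which `group` is tight relative to all data.

    Hypothetical helper: feature j counts as relevant when the group's
    variance on j is at most `ratio` times the variance of the full dataset.
    """
    n_features = len(all_points[0])
    selected = []
    for j in range(n_features):
        global_var = pvariance([p[j] for p in all_points])
        group_var = pvariance([p[j] for p in group])
        if global_var > 0 and group_var <= ratio * global_var:
            selected.append(j)
    return selected


# Toy data: features 0 and 1 separate the two groups, feature 2 is noise.
group_a = [(1.0, 1.1, 0.0), (1.1, 0.9, 9.0), (0.9, 1.0, 5.0), (1.0, 1.0, 2.0)]
group_b = [(8.0, 8.1, 1.0), (8.1, 7.9, 8.0), (7.9, 8.0, 4.0), (8.0, 8.0, 6.0)]
data = group_a + group_b

subspace_a = relevant_features(group_a, data)
subspace_b = relevant_features(group_b, data)
print(subspace_a, subspace_b)
```

In this toy example the noisy third feature is excluded for both groups, so similarity within each group would be measured only in the projection onto the first two features.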