Wouter Duivesteijn, University of Leiden, 16:15, OH 14, E23

Event Date: March 14, 2013 16:15

Exceptional Model Mining - Identifying Deviations in Data

Finding subsets of a dataset that somehow deviate from the norm, i.e. where something interesting is going on, is an ancient task. In traditional local pattern mining methods, such deviations are measured in terms of a relatively high occurrence (frequent itemset mining), or an unusual distribution for one designated target attribute (subgroup discovery). These, however, do not encompass all forms of "interesting".

To capture a more general notion of interestingness in subsets of a dataset, we develop Exceptional Model Mining (EMM). This is a supervised local pattern mining framework, where several target attributes are selected, and a model over these attributes is chosen to be the target concept. Then, subsets are sought on which this model is substantially different from the model on the whole dataset. For instance, we can find parts of the data where:

  • two target attributes have an unusual correlation;
  • a classifier has a deviating predictive performance;
  • a Bayesian network fitted on several target attributes has an exceptional structure.
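The first case in the list can be made concrete with a small sketch. This is only an illustration under stated assumptions, not the method presented in the talk: the function names `pearson` and `correlation_quality`, the toy records, and the choice of the absolute difference between the subgroup correlation and the whole-dataset correlation as quality measure are all hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def correlation_quality(records, in_subgroup, x_key, y_key):
    """Hypothetical quality measure: |r_subgroup - r_whole_dataset|."""
    sub = [r for r in records if in_subgroup(r)]
    r_all = pearson([r[x_key] for r in records], [r[y_key] for r in records])
    r_sub = pearson([r[x_key] for r in sub], [r[y_key] for r in sub])
    return abs(r_sub - r_all)

# Toy data (invented): in region A demand falls as price rises,
# in region B demand rises with price.
records = (
    [{"region": "A", "price": p, "demand": 10 - p} for p in range(1, 7)]
    + [{"region": "B", "price": p, "demand": p + 1} for p in range(1, 4)]
)

# Score two candidate subgroups, each described by a condition on "region".
quality_a = correlation_quality(records, lambda r: r["region"] == "A", "price", "demand")
quality_b = correlation_quality(records, lambda r: r["region"] == "B", "price", "demand")
```

On this toy data the region B subgroup scores higher than region A, because its price/demand correlation deviates more from the correlation on the whole dataset. A full EMM run would search over many such subgroup descriptions and rank them by quality.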

We will discuss some fascinating real-world applications of EMM instances, such as using the Bayesian network model to identify meteorological conditions under which food chains are displaced, and using a regression model to find the subset of households in the Chinese province of Hunan that do not follow the general economic law of demand. Additionally, we will statistically validate whether the found local patterns are merely caused by random effects. We will simulate such random effects by mining on swap-randomized data, which allows us to attach a p-value to each found pattern, indicating whether it is likely to be a false discovery. Finally, we will briefly hint at ways to use EMM for global modeling, enhancing the predictive performance of multi-label classifiers and improving the goodness-of-fit of regression models.
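The validation idea can be sketched in a few lines. The abstract does not say which randomization variant or quality measure the talk uses, so everything here is an assumption: the sketch applies the classic margin-preserving swap on a 0/1 matrix, scores a pattern by a simple co-occurrence count, and derives an empirical p-value from the scores on randomized copies. The names `swap_randomize`, `cooccurrence`, and `empirical_p_value`, and the toy matrix, are hypothetical.

```python
import random

def swap_randomize(matrix, n_swaps, rng):
    """Copy a 0/1 matrix and attempt n_swaps swaps that keep all row and column sums."""
    m = [row[:] for row in matrix]
    ones = [(i, j) for i, row in enumerate(m) for j, v in enumerate(row) if v == 1]
    for _ in range(n_swaps):
        a, b = rng.randrange(len(ones)), rng.randrange(len(ones))
        (r1, c1), (r2, c2) = ones[a], ones[b]
        # A swap is valid only if the two opposite corners are empty.
        if m[r1][c2] == 0 and m[r2][c1] == 0:
            m[r1][c1] = m[r2][c2] = 0
            m[r1][c2] = m[r2][c1] = 1
            ones[a], ones[b] = (r1, c2), (r2, c1)
    return m

def cooccurrence(matrix, c1, c2):
    """Toy quality measure: number of rows in which both columns are 1."""
    return sum(1 for row in matrix if row[c1] and row[c2])

def empirical_p_value(observed, null_scores):
    """Share of randomized scores at least as high as the observed one (add-one corrected)."""
    exceed = sum(1 for s in null_scores if s >= observed)
    return (1 + exceed) / (1 + len(null_scores))

# Invented toy data: rows are records, columns are binary attributes.
data = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 0],
]

rng = random.Random(0)  # fixed seed so the run is repeatable
observed = cooccurrence(data, 0, 1)
null_scores = [cooccurrence(swap_randomize(data, 200, rng), 0, 1) for _ in range(99)]
p_value = empirical_p_value(observed, null_scores)
```

A small `p_value` suggests the observed score is unlikely to arise from random effects alone. In practice the same mining procedure would be rerun on each randomized copy, and the number of swaps and randomized copies would be chosen much larger than in this toy run.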
