Event Date: December 5, 2013 16:15

**MDL for Pattern Mining**

Pattern mining is arguably the biggest contribution of data mining to data analysis, with scaling to massive data volumes as a close contender. There is, however, a big problem at the very heart of pattern mining: the pattern explosion. Either we get very few, presumably well-known, patterns, or we end up with a collection of patterns that dwarfs the original data set. This problem is inherent to pattern mining because patterns are evaluated individually. The only solution is to evaluate sets of patterns simultaneously, i.e., pattern set mining.

In this talk I will introduce one approach to solving this problem, viz., our Minimum Description Length (MDL) based approach with the KRIMP algorithm. After introducing the pattern set problem, I will discuss how MDL may help us. Next, I will introduce the heuristic algorithm called KRIMP. While KRIMP yields very small pattern sets, we must, of course, validate that the results are characteristic pattern sets. We do so in two ways: by swap randomization and by classification.
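As a rough illustration of the MDL idea behind pattern set mining, the toy sketch below scores a set of patterns by the total number of bits needed to describe both the patterns and the data encoded with them. It is a simplified assumption-laden sketch, not the KRIMP implementation: the function names (`cover`, `total_encoded_size`), the greedy non-overlapping cover, and the simplified model cost are all chosen here for illustration only.

```python
from collections import Counter
from math import log2


def cover(transaction, code_table):
    """Greedily cover a transaction with non-overlapping itemsets,
    trying the itemsets in the order they appear in the code table."""
    remaining = set(transaction)
    used = []
    for itemset in code_table:
        if itemset <= remaining:
            used.append(itemset)
            remaining -= itemset
    return used


def total_encoded_size(database, code_table):
    """Two-part description length in bits: model cost plus data cost.

    Assumes the code table contains every singleton itemset, so that
    each transaction can always be covered completely.
    """
    # How often each itemset is used when covering the whole database.
    usage = Counter()
    for transaction in database:
        usage.update(cover(transaction, code_table))
    total_usage = sum(usage.values())

    # Frequently used itemsets get short codes (Shannon-optimal lengths).
    code_len = {s: -log2(usage[s] / total_usage) for s in usage}
    data_bits = sum(usage[s] * code_len[s] for s in usage)

    # Simplified model cost: each used itemset is spelled out with codes
    # derived from single-item frequencies, plus its own code.
    item_counts = Counter(item for t in database for item in t)
    n_items = sum(item_counts.values())
    item_len = {i: -log2(c / n_items) for i, c in item_counts.items()}
    model_bits = sum(code_len[s] + sum(item_len[i] for i in s) for s in usage)

    return model_bits + data_bits


# Toy data: "a" and "b" always occur together.
database = [{"a", "b", "c"}] * 4 + [{"a", "b"}] * 2 + [{"c"}] * 2
singletons = [frozenset("a"), frozenset("b"), frozenset("c")]
with_pattern = [frozenset("ab")] + singletons

print(total_encoded_size(database, singletons))
print(total_encoded_size(database, with_pattern))
```

On this toy data, adding the itemset {a, b} to the code table lowers the total encoded size, which is the sense in which a pattern set that compresses the data well is preferred over one that does not.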

Time permitting, I will then discuss some of the statistical problems for which we have used the results of KRIMP, such as data generation, data imputation, and data smoothing.

Short Biography

Arno has been Chair of Algorithmic Data Analysis at Utrecht University since 2000. After completing his PhD and spending some years as a postdoc in database research, he switched his attention to data mining in 1993 and he still hasn’t recovered. His research has been mostly in the area of pattern mining and, for about the past eight years, in pattern set mining. In the second half of the nineties he was a co-founder, chief evangelist, and occasional consultant at Data Distilleries, which by way of SPSS is now a part of IBM. He has acted as PC member, vice chair, or even PC chair of many of the major conferences in the field for many years. Currently he is also on the editorial boards of DMKD and KAIS.