Introduction to machine learning
Schedule: Tuesday, 04.09., 09.00-15.30
Lecturer: Michèle Sébag

Machine learning is all about building some (educated) common sense from the data jungle in which we live. WHAT: The goal is to form hypotheses and to use them to predict, decide or play. HOW: Machine learning proceeds by defining the quality of a hypothesis, aka the learning criterion, and finding hypotheses with optimal, or sufficiently good, quality. Many learning criteria have been designed and none is universal: your prior knowledge - about the application domain and/or about the learning algorithms - is what makes the difference. The course will provide you with some general principles (which criteria are sound, which are effective depending on the context) and methodology (rules of good practice - how to conduct an ML application).
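As a hedged illustration of the learning-criterion idea sketched above (a standard textbook example, not necessarily the criterion emphasized in the course), empirical risk minimization picks the hypothesis that minimizes the average loss on the training data:

    \hat{h} = \arg\min_{h \in H} \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)

Regularized variants add a complexity penalty to this average loss; the choice of loss \ell, hypothesis space H and penalty is exactly where the prior knowledge mentioned above enters, which is why no single criterion is universal.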
Numerical Optimization in Data Analysis
Schedule: Tuesday, 04.09., 16.00-17.30
Lecturer: Sangkyun Lee

Many interesting problems in data analysis can be formulated as mathematical programs for which solutions can be found via numerical optimization. Optimization studies canonical forms of such programs, providing us with useful tools to understand their structure and thereby to design resource-efficient computational algorithms. In this lecture we discuss some fundamental ideas in optimization that are important for efficient data analysis.
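As a hedged sketch of what such a canonical form looks like (standard notation, not necessarily the lecture's), a generic mathematical program reads

    \min_{x} f(x) \quad \text{subject to} \quad g_i(x) \le 0, \; i = 1, \dots, m, \qquad h_j(x) = 0, \; j = 1, \dots, p

A typical data-analysis instance is the lasso, \min_{w} \frac{1}{2} \| X w - y \|_2^2 + \lambda \| w \|_1, whose structure (a smooth loss plus a non-smooth but separable penalty) is the kind of structure that resource-efficient optimization algorithms exploit.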
Data Mining with RapidMiner
Schedule: Tuesday, 04.09., 18.00-19.00
Lecturer: Tim Ruhe

Data mining can become a lot easier using the right tools. One of these popular tools, voted the most widely used solution in the KDnuggets poll, is RapidMiner. This lecture will introduce the basic principles of RapidMiner and its application to data mining problems. The usage of several basic operators (Input/Output, Learner, Performance Evaluation, Validation, ...) will be explained in the form of simple use cases. Each process will be designed such that the students can follow along on their computers. It is thus highly recommended that every attendee has a current version of RapidMiner up and running. The RapidMiner Community Edition can be downloaded at no cost from: rapid-i.com/content/view/26/84/. Although the installation takes just two clicks, an installation guide is available at: rapid-i.com/content/view/17/211/lang,en/. Example datasets will be provided for download on this website a few weeks prior to the workshop.
Data Mining from Ubiquitous Data Streams
Schedule: Wednesday, 05.09., 09.00-10.30 and 11.00-12.30
Lecturer: João Gama

The lecture discusses the challenges in learning from distributed sources of continuous data generated by dynamic environments. Learning in these environments faces new challenges: we need to continuously maintain a decision model that is consistent with the most recent data. Stream learning algorithms work with limited computational resources. They need to be able to maintain anytime decision models, modify the decision model when new information is available, detect and react to changes in the underlying process generating the data, and forget outdated information. The tutorial will introduce the area of data stream mining using illustrative problems, present state-of-the-art learning algorithms for change detection, clustering and classification, and discuss current trends and opportunities for research in learning from ubiquitous data streams. The second part includes exercises using the Massive Online Analysis (MOA) software and other software for stream mining.
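As a minimal, hedged sketch of the requirements listed above (anytime availability, constant resources, forgetting), the following toy Python estimator maintains the mean of a stream with a fading factor; it is an illustration only, not MOA code or an algorithm from the tutorial:

    class FadingMean:
        """Anytime estimate of a stream's mean that gradually forgets old data."""

        def __init__(self, alpha=0.99):
            self.alpha = alpha          # fading factor: closer to 1.0 means longer memory
            self.weighted_sum = 0.0
            self.weight = 0.0

        def update(self, x):
            # decay the contribution of past observations, then add the new one
            self.weighted_sum = self.alpha * self.weighted_sum + x
            self.weight = self.alpha * self.weight + 1.0

        def estimate(self):
            # can be queried at any time; O(1) memory and O(1) work per update
            return self.weighted_sum / self.weight if self.weight > 0 else float("nan")

    # toy usage: after a level shift in the stream, the estimate tracks the new level
    est = FadingMean(alpha=0.99)
    for x in [1.0] * 500 + [5.0] * 500:
        est.update(x)
    print(round(est.estimate(), 2))     # close to 5.0; the old data has mostly been forgotten

Each update costs constant time and memory, the model can be queried at any time, and outdated observations are gradually down-weighted; the same design pressures shape the stream classifiers, clusterers and change detectors covered in the tutorial.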
Statistical methods for model selection
Schedule: Wednesday, 05.09., 14.00-15.30 and 16.00-17.30
Lecturer: Jörg Rahnenführer
Exercises: Michel Lang

Model selection is a central challenge both for regression and classification tasks. For evaluating the quality of a statistical model, many different, partly conflicting criteria are applied, for example the fit of the model in terms of likelihood, the sparsity of the model, the computation time of an algorithm for model fitting, or the interpretability of the resulting model. In the first part, we give a general introduction to model selection criteria, starting with the tradeoff between model fit and complexity, referring to bias-variance decompositions. We explain how, in a linear model, estimating the optimism of the model fit motivates the popular AIC (Akaike information criterion). As alternatives to such an explicit score, methods based on resampling or cross-validation are introduced, which require fewer model assumptions but more computation time. Finally, we discuss variable selection algorithms such as forward or backward selection, or approaches based on regularized estimates. All these approaches are demonstrated in a practical session in the statistical programming language R.

Reasons for time-consuming experiments are the application of resampling algorithms, a large number of potential models, or large data sets. In these situations, comparisons of statistical algorithms are best performed on high performance computing clusters. We present two new R packages that greatly simplify working in batch computing environments. The package BatchJobs implements the basic objects and procedures to control a batch cluster from within R. It is structured around cluster versions of the well-known higher-order functions Map/Reduce/Filter from functional programming. The package BatchExperiments is tailored to the general scenario of analyzing arbitrary algorithms on problem instances. It extends BatchJobs by letting the user define an array of jobs of the kind 'apply algorithm A to problem instance P and store results R'. It is possible to associate statistical designs with the parameters of algorithms and problems and thereby systematically study their influence on the algorithms' performance. A persistent database guarantees reproducible results, even on other systems. Examples for the application of these packages are presented in a practical session in R.
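For reference, the criterion mentioned above has the standard textbook form (stated here as background; the lecture's derivation via the optimism of the fit may present it differently):

    \mathrm{AIC} = 2k - 2 \log \hat{L}

where k is the number of estimated parameters and \hat{L} the maximized likelihood. For a linear model with Gaussian errors this equals, up to an additive constant, n \log(\mathrm{RSS}/n) + 2k, so the score directly trades residual fit against model complexity.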
Exploitation of memory hierarchies
Schedule: Thursday, 06.09., 09.00-10.30
Lecturer: Peter Marwedel

According to Burks, Goldstine and von Neumann (1946): "Ideally one would desire an indefinitely large memory capacity such that any particular ... word ... would be immediately available - i.e. in a time which is ... shorter than the operation time of a fast electronic multiplier. ... It does not seem possible physically to achieve such a capacity. We are therefore forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible." Modern computers contain such memory hierarchies. Their presence may have a dramatic impact on the performance of applications. Nevertheless, programmers are frequently unaware of this fact. In the talk, we will present examples of memory hierarchy levels and demonstrate how these levels can be exploited. In particular, we will look at techniques for improving the utilization of caches and of scratchpad memories. Corresponding source-to-source code transformations will be presented. We will close with a brief look at secondary memory and the related so-called I/O algorithms.
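One classic cache-oriented, source-to-source transformation of the kind mentioned above is loop tiling (blocking). The sketch below is an illustrative assumption, not material from the talk, and is written in Python purely for readability; the transformation pays off when applied to compiled code (e.g. C) on a machine with caches:

    def blocked_matmul(A, B, block=32):
        """Tiled (blocked) matrix product C = A * B for list-of-lists matrices.

        The outer loops walk over blocks so that a small tile of A, B and C is
        reused while it would still be resident in a fast memory level.
        """
        n, k, m = len(A), len(B), len(B[0])
        C = [[0.0] * m for _ in range(n)]
        for ii in range(0, n, block):
            for kk in range(0, k, block):
                for jj in range(0, m, block):
                    for i in range(ii, min(ii + block, n)):
                        for p in range(kk, min(kk + block, k)):
                            a = A[i][p]
                            for j in range(jj, min(jj + block, m)):
                                C[i][j] += a * B[p][j]
        return C

    # small usage example (inner dimensions of A and B must match)
    A = [[float(i + j) for j in range(64)] for i in range(64)]
    B = [[float((i * j) % 7) for j in range(64)] for i in range(64)]
    C = blocked_matmul(A, B, block=16)

Restructuring the loops so that small blocks are reused while they still reside in a fast memory level reduces the number of slow-memory accesses without changing the computed result.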
Battery capacity models
Schedule: Thursday, 06.09., 11.00-12.30
Lecturer: Peter Marwedel

Batteries are crucial components of all portable electronic devices. Nevertheless, users of batteries are frequently unaware of their characteristics. In this talk, we would like to provide fundamental knowledge in this area. We will start with a look at the expected future of battery technology and continue with a presentation of models of the remaining battery charge. We will then briefly present real-time calculus and show how it can be applied to model the remaining battery charge over time.
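As one classical example of a capacity model (a well-known textbook model, included purely as an illustration and not necessarily among the models presented in the talk), Peukert's law expresses how the usable capacity shrinks as the discharge current grows. A minimal sketch:

    def discharge_time_peukert(peukert_capacity_ah, current_a, k=1.2):
        """Peukert's law: C_p = I**k * t, hence t = C_p / I**k.

        peukert_capacity_ah: Peukert capacity C_p (capacity observed at a 1 A
        discharge current); k: Peukert exponent (1.0 for an ideal battery,
        typically around 1.1-1.3 for lead-acid cells). Returns the expected
        discharge time in hours for a constant current of current_a amperes.
        """
        return peukert_capacity_ah / current_a ** k

    # example: a battery rated 60 Ah at 1 A, discharged at a constant 10 A
    print(discharge_time_peukert(60.0, 10.0))   # noticeably less than 6 hours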
Towards Self-Powered Systems
Schedule: Thursday, 06.09., 14.00-15.30 and Friday, 07.09., 09.00-10.30
Lecturer: Jan Madsen

A Wireless Sensor Network (WSN) is a distributed network in which a large number of computational components (also referred to as "sensor nodes" or simply "nodes") are deployed in a physical environment. Each component collects information about and offers services to its environment, e.g. environmental monitoring and control, healthcare monitoring and traffic control, to name a few. The collected information is processed either at the component, in the network or at a remote location (e.g. the “cloud”), or in any combination of these. WSNs are typically required to run unattended for very long periods of time, often several years, powered only by standard batteries. This makes energy-awareness a particularly important issue when designing WSNs. With the advances in harvesting technologies, energy harvesting has become an attractive new source of energy to power the individual nodes of a WSN. Not only is it possible to extend the lifetime of the WSN, it may eventually be possible to run the nodes without batteries – effectively turning them into self-powered systems. However, this requires that the WSN system be carefully designed to make effective use of adaptive energy management, which adds to the complexity of the problem. One of the key challenges is that the amount of energy harvested over a period of time is highly unpredictable. In this lecture I will address trends and challenges of energy efficiency for both single sensor nodes and networks of sensor nodes, when powered by energy harvested from the environment – leading towards self-powered systems.
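To make the idea of adaptive energy management slightly more concrete, here is a deliberately simplified, hypothetical duty-cycling policy in Python (all names and numbers are assumptions for illustration, not the scheme presented in the lecture): the node spends no more energy over a planning horizon than it has stored, minus a safety reserve, plus what it predicts it will harvest.

    def plan_duty_cycle(energy_stored, predicted_harvest, horizon_slots,
                        active_cost, sleep_cost, reserve=0.1):
        """Toy adaptive duty-cycling policy (illustrative, hypothetical).

        Translate the per-slot energy budget into the fraction of each slot the
        node may be active. Assumes active_cost > sleep_cost (energy per slot).
        """
        budget = max(0.0, energy_stored * (1.0 - reserve) + predicted_harvest)
        per_slot = budget / horizon_slots
        if per_slot <= sleep_cost:
            return 0.0                               # not even enough to idle reliably
        duty = (per_slot - sleep_cost) / (active_cost - sleep_cost)
        return min(1.0, duty)

    # example: a node replans whenever its harvest prediction changes
    print(plan_duty_cycle(energy_stored=20.0, predicted_harvest=5.0,
                          horizon_slots=100, active_cost=1.0, sleep_cost=0.05))

Because the harvested energy is highly unpredictable, such a policy must be re-evaluated as predictions change, which is exactly the adaptivity the lecture refers to.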
From Data Taking to Rapid Mining – Analysis of large Data Volumes in Astroparticle Physics
Schedule: Thursday, 06.09., 16.00-17.30
Lecturer: Wolfgang Rhode

In astroparticle physics, experiments are set up at barely accessible places like the South Pole or the top of high mountains such as the Roque de los Muchachos, La Palma, to explore the most exotic sources of astrophysically accelerated particles. Under restrictions on the resources energy, bandwidth, CPU time and storage, huge amounts of data are pre-analyzed and stored, containing only one signal event per 10^3 - 10^9 background events. The separation of signal and background, the construction of sky maps and the reconstruction of energy spectra are carried out based on the finally stored data and high-precision Monte Carlo simulations describing the complete chain from the primary particles approaching the Earth, through their interactions in the atmosphere, to the detector. Artificial intelligence is used to solve the classification and measurement problem. Taking this setup as an example, the complete data analysis chain for such complex experiments is discussed in this lecture.
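As a hedged toy illustration of the signal/background separation task described above (synthetic data and scikit-learn defaults; the real experiments use their own reconstruction chains and far more extreme class imbalance):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # synthetic stand-in for an imbalanced signal/background sample
    # (class 1 = "signal", about 1% of the events)
    X, y = make_classification(n_samples=20000, n_features=20,
                               weights=[0.99, 0.01], random_state=0)

    # class weighting counteracts the imbalance; ROC AUC is a threshold-free score
    clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                 random_state=0)
    print(cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())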
Use the power of your GPU: Massively parallel programming with CUDA
Schedule: Thursday, 05.09., 18.00-19.00
Lecturer: Nico Piatkowski

Today's graphics processing units (GPUs) can be used as highly parallel co-processors alongside the usual central processing unit. They are now an established platform for high-performance scientific computing, and a multitude of general and domain-specific programming environments, libraries and tools have emerged over the last years. We will give an overview of common techniques and libraries which can be used directly to benefit from this high parallelism. The goals of this tutorial are to provide an introduction to GPU computing and to explain how to accelerate data mining and machine learning with GPUs.
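The tutorial itself targets CUDA; as a hedged sketch of the programming model, here is the classic element-wise vector addition, written with the Numba Python bindings rather than CUDA C only so that this document stays in a single code language (it requires a CUDA-capable GPU and the numba package, and is not material from the tutorial):

    import numpy as np
    from numba import cuda

    @cuda.jit
    def vector_add(x, y, out):
        i = cuda.grid(1)                 # global thread index
        if i < out.shape[0]:             # guard against out-of-range threads
            out[i] = x[i] + y[i]

    n = 1_000_000
    x = np.random.rand(n).astype(np.float32)
    y = np.random.rand(n).astype(np.float32)
    out = np.zeros(n, dtype=np.float32)

    threads_per_block = 256
    blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
    vector_add[blocks_per_grid, threads_per_block](x, y, out)   # kernel launch
    assert np.allclose(out, x + y)

The same structure, a small kernel executed by thousands of lightweight threads indexed by a grid of blocks, carries over to the data mining and machine learning kernels discussed in the tutorial.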