Welcome breakfast and introduction to resource-aware machine learning
Schedule: Monday, 29.09., 09.00-10.30
Lecturer: Katharina Morik

Start the summer school on resource-aware machine learning together with your fellow participants at our welcome breakfast. During the breakfast, listen to an introduction to hot topics in resource awareness: Where are the limitations, be it due to huge amounts of data, high dimensionality or restrictions imposed by small devices? And what about the four Vs of data analysis: Velocity, Variety, Volume and Value?

Mining patterns in attributed dynamic graphs
Schedule: Monday, 29.09., 11.00-12.30
Lecturer: Céline Robardet

People, organizations and systems, as well as researchers, are generating torrents of data that provide exciting opportunities to enhance our knowledge of the underlying mechanisms. These data are increasingly both interconnected by relationships and changing over time. In this lecture, we will present data mining tools for the analysis of such attributed dynamic graphs, to understand how network structure and node attribute values relate to and affect each other. Such approaches make it possible to retrieve sub-parts of such graphs that satisfy properties related to the graph structure, the attribute values associated with the nodes and the edges, as well as the dynamics that transform these data. We will consider methods that (1) identify vertex attributes whose values are related to the graph structure in static graphs, (2) retrieve homogeneous sub-graphs in dynamic graphs, (3) capture some causality between attribute value variations and graph structure evolution, and (4) find sub-graphs that are specific to the traces left by a portion of the population. Several applications will be considered: the analysis of bicycle sharing system footprints, trace mining from urban sensors, and social network mining.

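To make the notion of an attributed graph concrete, here is a toy Python sketch using networkx. It is purely illustrative background, not the pattern mining algorithms of the lecture; the node attribute "station_load" and its values are invented for this example.

```python
# Toy attributed graph: nodes carry attribute values, and we extract the
# subgraph induced by nodes sharing an attribute value. Purely illustrative;
# the lecture's methods for dynamic graphs go far beyond this.
import networkx as nx

G = nx.Graph()
G.add_nodes_from([
    (1, {"station_load": "high"}),
    (2, {"station_load": "high"}),
    (3, {"station_load": "low"}),
    (4, {"station_load": "high"}),
])
G.add_edges_from([(1, 2), (2, 3), (2, 4), (3, 4)])

# Induced subgraph of all nodes with the same attribute value.
high_nodes = [n for n, d in G.nodes(data=True) if d["station_load"] == "high"]
H = G.subgraph(high_nodes)
print(H.nodes(), H.edges())   # a homogeneous sub-part of the graph
```
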
From k-means clustering to DESICOM: Matrix Factorization for Data Analysis
Schedule: Monday, 29.09., 16.00-19.00
Lecturer: Christian Bauckhage

The goal of this tutorial is to demonstrate that the problem of clustering multivariate numerical data can be thought of as a matrix factorization task. We will discuss classical clustering algorithms and point out their connection to rank-reduction and factorization problems. Once this idea has been established, we will look into more recent, conceptually more demanding approaches. We discuss how the corresponding factorization problems can be solved and present examples from several application areas to illustrate their potential for descriptive data mining or BIG DATA settings.

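As a minimal illustration of the clustering-as-factorization view (an illustrative NumPy/scikit-learn sketch, not material from the tutorial): the k-means objective can be written as the squared reconstruction error of a factorization X ≈ W H, where W one-hot-encodes the cluster assignments and H holds the centroids.

```python
# Sketch: k-means viewed as a constrained matrix factorization X ~ W @ H,
# where W is a one-hot assignment matrix and H contains the centroids.
# Data and parameters are made up for this example.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Build the factor matrices from the k-means result.
W = np.zeros((X.shape[0], 2))            # assignment matrix (one-hot rows)
W[np.arange(X.shape[0]), km.labels_] = 1
H = km.cluster_centers_                  # centroid matrix

# The k-means objective equals the squared Frobenius reconstruction error.
print(np.linalg.norm(X - W @ H) ** 2)    # approximately km.inertia_
print(km.inertia_)
```
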
Cyber-Physical Systems: Opportunities, Challenges and (some) Solutions
Schedule: Tuesday, 30.09., 09.00-10.30
Lecturer: Peter Marwedel

The term cyber-physical systems characterizes the integration of information and communication technologies (ICT) with their physical environment. This integration results in a huge potential for the development of intelligent systems in a large set of industrial sectors. This potential will be covered in the first part of the talk. The sectors comprise industrial automation (Industry 4.0), traffic, consumer devices, the smart grid, the health sector, urban living, and computer-based analysis in science and engineering. A multitude of goals can be supported in this way, e.g., higher standards of living, higher efficiency of many processes, the generation of knowledge, and safety for society. However, the realization of this integration implies manifold challenges. The challenges covered in the second part of this talk include security, timing, safety, reliability, energy efficiency, interfacing, and the discovery of information in huge amounts of data. The inherent multidisciplinarity also poses challenges for knowledge acquisition and application. In the third and final part of the talk, we will present some of our contributions addressing these issues. These contributions include techniques for improving energy efficiency, the integration of a timing model into the code generation process, tradeoffs between timeliness and reliability, and approaches for education that crosses the boundaries of the involved disciplines.

Computation offloading for performance and energy improvements
Schedule: Tuesday, 30.09., 11.00-12.30
Lecturer: Jian-Jia Chen

Embedded systems have been adopted in many application domains. However, most of these systems have limited resources. To accommodate higher computation demands at run-time, one solution is to adopt the computation offloading concept by moving some computation-intensive tasks to a powerful remote processing unit. Computation offloading mechanisms have recently been adopted to improve the performance of resource-constrained devices and to reduce their energy consumption. In this tutorial, we will examine several solutions and approaches for adopting computation offloading, especially with respect to timing, performance, and energy efficiency, for real-time and embedded systems.

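As a back-of-the-envelope illustration of the basic offloading tradeoff (a hedged sketch; the function name and all parameter values below are invented and are not taken from the tutorial): offloading pays off when the energy of local execution exceeds the energy spent transmitting the input data plus idling while the remote unit computes.

```python
# Rough sketch of a classic offloading decision rule. All numbers and names
# are illustrative assumptions, not values from the tutorial.

def should_offload(cycles, data_bits,
                   f_local=1e9, p_compute=2.0,      # local CPU: 1 GHz, 2 W active
                   bandwidth=5e6, p_transmit=1.0,   # uplink: 5 Mbit/s, 1 W
                   f_remote=4e9, p_idle=0.1):       # remote CPU speed, local idle power
    """Return True if offloading is expected to save energy on the device."""
    e_local = p_compute * cycles / f_local                  # compute locally
    e_offload = (p_transmit * data_bits / bandwidth         # send the input
                 + p_idle * cycles / f_remote)              # wait for the remote result
    return e_offload < e_local

print(should_offload(cycles=5e9, data_bits=1e6))   # compute-heavy task -> True
print(should_offload(cycles=1e7, data_bits=1e8))   # data-heavy task    -> False
```
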
Presentation of student grant awardees
Schedule: Tuesday, 30.09., 13.00-14.30

Presenter: Daniel Alves
Credit analysis through a clustering weightless neural network approach
Noise is a common characteristic of real-world data distributions, so robustness is an important aspect in the consideration of any state-of-the-art classifier. An example instance of this problem is the use of classifiers for credit analysis, i.e. identifying good and bad payers. Not only are both categories hard to distinguish, but changing tendencies also cause shifts in the data patterns. The ClusWiSARD, a clustering variant of the WiSARD weightless neural network, was proposed as a classifier for this problem. In a comparison with the Support Vector Machine (SVM), ClusWiSARD shows competitive accuracy. ClusWiSARD also supports online learning and is faster to train and classify.

Presenter: Zaffar Haider Janjua
Intelligent System for Early Diagnosis and Follow-up at Home
In this research, elderly people are monitored while performing ADLs inside a smart home. The long-term observation and quality assessment of ADLs helps physicians reach an improved diagnosis of MCI. The smart homes are equipped with sensors of various modalities such as pressure, temperature, presence and door sensors. The sensors collect information about the events that happen during an ADL, and this data is later processed for both ADL and anomaly recognition. I am using a Markov Logic Network (MLN) as the basic framework for this purpose. MLNs provide the ability to combine knowledge-based (KB) models with probabilistic graphical models. Besides that, I have also performed some experiments with simple machine learning techniques such as the Support Vector Machine (SVM).

Presenter: Christos Karatsalos
Transfer Learning and Applications
The most basic assumption used in machine learning is that training data and test data are drawn from the same underlying distribution. Unfortunately, in many applications, the target domain data is drawn from a distribution that is related, but not identical, to the source domain distribution of the training data. We consider the common case in which labeled source domain data is plentiful, but labeled target data is scarce. After analyzing the conditions under which we can transfer knowledge from a previous related problem to a new one, we introduce a novel algorithm for this problem based on a modification of k-means, so that we can classify the unlabeled target domain data via clustering. Our experimental results show that our approach leads to improved performance on the classification of the target domain data using the previous knowledge of the source domain data. We apply our algorithm to text classification problems and to problems from the bioinformatics area such as splice site recognition.

Presenter: Fabio Pulvirenti
Misleading generalized itemset mining in the cloud
After a high-level introduction to Hadoop and the MapReduce paradigm, the talk presents a distributed framework whose goal is to discover misleading situations in big datasets. Our tool, MGICLOUD, a cloud-based data mining framework, makes it possible to discover hidden and actionable patterns in potentially big datasets: it extracts Misleading Generalized Itemsets (MGIs), which represent contrasting correlations among itemsets at different abstraction levels. We prove its effectiveness in a Smart City environment, analysing a real dataset collected in an urban scenario.

Presenter: Xin Xiao
A Clustering-Based Approach to Analyse Examination Pathways for Diabetic Patients
Health care data collections are usually characterized by an inherent sparseness due to a large cardinality of patient records and a large variety of medical treatments usually adopted for a given pathology. Innovative data analytics approaches are needed to effectively extract interesting knowledge from these large collections. In our research paper, we present an explorative data mining approach, based on a density-based clustering algorithm, to identify the examination pathways commonly followed by patients with a given disease. To cluster patients undergoing similar medical treatments and sharing common patient profiles (i.e., patient age and gender), a novel combined distance measure has been proposed. Furthermore, to focus on different dataset portions and locally identify groups of patients, the clustering algorithm has been exploited in a multiple-level fashion. Cluster results have been evaluated through the Silhouette index as well as by medical domain experts. Based on the cluster set, a classification model has been created to characterize the content of the clusters and measure the effectiveness of the clustering process. The experiments, performed on a real diabetic patient dataset, demonstrate the effectiveness of the proposed approach in discovering interesting groups of patients with a similar examination history and with increasing disease severity.

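For readers unfamiliar with the evaluation step mentioned above, here is a minimal scikit-learn sketch of density-based clustering scored with the Silhouette index. It uses plain Euclidean distance on synthetic data; the combined distance measure and the multiple-level scheme of the presented work are not reproduced here.

```python
# Generic illustration: density-based clustering evaluated with the Silhouette
# index. Synthetic data and default Euclidean distance only.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (80, 2)), rng.normal(3, 0.3, (80, 2))])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Ignore noise points (label -1) when computing the Silhouette index.
mask = labels != -1
if len(set(labels[mask])) > 1:
    print("silhouette:", silhouette_score(X[mask], labels[mask]))
else:
    print("fewer than two clusters found")
```
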
Privacy Aware Learning
Schedule: Tuesday, 30.09., 16.00-19.00
Lecturer: John Duchi

The increased collection of data, and the potential problems of personal identification, have given impetus to a growing field of work on privacy at the intersection of databases, statistics, machine learning, and cryptography. In these talks, I will review recent work that gives sharp characterizations of the fundamental tradeoffs between providing privacy and maintaining the utility of the results of data analyses. By combining statistical tools from minimax and decision theory, information-theoretic techniques, and stochastic optimization algorithms, we will see tight bounds on the performance of privacy-preserving procedures and develop provably optimal algorithms.

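As a simple, standard example of a privacy-preserving procedure and its privacy/utility tradeoff, the sketch below implements randomized response for locally private mean estimation. This is a textbook mechanism chosen for illustration, not necessarily one of the procedures analyzed in the talks; all names and numbers are invented.

```python
# Randomized response: each user reports their private bit truthfully with
# probability e^eps / (1 + e^eps), otherwise flips it. A debiased average of
# the reports estimates the true mean; smaller eps = more privacy, more noise.
import numpy as np

def randomized_response(bits, eps, rng):
    p = np.exp(eps) / (1 + np.exp(eps))          # probability of telling the truth
    keep = rng.random(len(bits)) < p
    return np.where(keep, bits, 1 - bits)

def debiased_mean(reports, eps):
    p = np.exp(eps) / (1 + np.exp(eps))
    return (reports.mean() - (1 - p)) / (2 * p - 1)

rng = np.random.default_rng(0)
true_bits = (rng.random(100_000) < 0.3).astype(float)   # true mean is 0.3
for eps in (0.1, 0.5, 2.0):
    reports = randomized_response(true_bits, eps, rng).astype(float)
    print(f"eps={eps}: estimate={debiased_mean(reports, eps):.3f}")
```
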
The multi-core revolution
Schedule: Wednesday, 01.10., 09.00-10.30
Lecturers: Peter Marwedel, Jian-Jia Chen

Over the last decade, there has been a clear trend towards multi-core processors, including multi-core processors in embedded systems. The key driving force behind this trend is the need to provide higher performance from a limited energy budget and under certain thermal constraints. The replacement of single-core processors by multi-core processors has led to a revolution in hardware and software design with far-reaching consequences. In this talk, we will explain why multi-core processors are more energy efficient than high-speed single cores. We will provide examples of how sequential applications can be parallelized to take advantage of parallel execution. The impact of parallel execution on energy efficiency will be shown. The talk will also briefly describe how to consider the multi-core thermal behavior and how to exploit new opportunities for the design of fault-tolerant systems. Finally, we will touch upon the consequences of multi-core availability for embedded real-time applications.

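A rough back-of-the-envelope calculation of why parallel execution at lower clock speeds can save energy, assuming dynamic power scales roughly with C·V²·f and that supply voltage can be lowered together with frequency. All numbers below are invented for illustration and are not taken from the talk.

```python
# Toy model: dynamic power ~ C * V^2 * f; perfect parallel speedup assumed.
C = 1.0                      # arbitrary switched-capacitance constant

def energy(f, v, cycles, cores=1):
    power_per_core = C * v**2 * f
    time = cycles / (cores * f)              # all cores share the work evenly
    return cores * power_per_core * time

cycles = 1e9
# One fast core at 2 GHz / 1.2 V vs. two cores at 1 GHz / 0.8 V.
e_single = energy(f=2e9, v=1.2, cycles=cycles, cores=1)
e_dual   = energy(f=1e9, v=0.8, cycles=cycles, cores=2)
print(e_single, e_dual)      # same work, same runtime, noticeably less energy
```
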
Model Compression
Schedule: Wednesday, 01.10., 11.00-12.30 and 14.00-15.30
Lecturer: Rich Caruana

In some settings it is not enough for classifiers to be accurate; they also have to meet stringent time and space requirements. For example, if models will be executed billions of times (e.g., rankers at Bing and Google, or face recognizers that need to scan over sub-images in video streams), if storage space is limited (e.g., PDAs, or Mars rovers), or if computational power is limited (e.g., hearing aids, or satellites), the classifiers must be time, space, and/or power efficient. Often, however, the most accurate models are too slow and too large to meet these requirements, while fast and compact models are significantly less accurate. In these settings, model compression can be used to obtain fast and compact models that are also very accurate. The main idea behind model compression is to train a fast/compact model to approximate the function learned by a slower, larger, higher-accuracy model. The function learned by the high-accuracy model can be used to label large amounts of pseudo data, which can then be used to train the compact model with little chance of overfitting. This allows slow, complex models such as massive ensembles to be compressed into fast, compact models with little loss in accuracy. It also allows the complex functions learned by deep neural nets to be learned by shallow nets, suggesting that compact shallow models may be capable of learning models as accurate as those learned by deep learning.

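A minimal sketch of this idea on synthetic data: a large random forest plays the slow, accurate teacher, and a small decision tree is trained as the fast student on teacher-labeled pseudo data. Model choices, the perturbation-based pseudo-data generator and all parameters are illustrative assumptions, not the setup used in the lecture.

```python
# Model compression sketch: label pseudo data with a large, accurate teacher
# model, then train a small, fast student model on those labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, y_train, X_test, y_test = X[:1000], y[:1000], X[1000:], y[1000:]

teacher = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

# Generate plenty of pseudo data by perturbing real inputs (a simple stand-in
# for more careful pseudo-data generation schemes), then label it with the teacher.
rng = np.random.default_rng(0)
idx = rng.integers(0, len(X_train), size=50_000)
X_pseudo = X_train[idx] + rng.normal(scale=0.3, size=(50_000, 20))
y_pseudo = teacher.predict(X_pseudo)

student = DecisionTreeClassifier(max_depth=8, random_state=0).fit(X_pseudo, y_pseudo)

print("teacher accuracy:", teacher.score(X_test, y_test))
print("student accuracy:", student.score(X_test, y_test))
```
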
Processing Data Streams: From Small Devices to Big Data
Schedule: Monday, 29.09., 14.00-15.30, Wednesday, 01.10., 16.00-19.00 and Thursday, 02.10., 11.00-12.30
Lecturer: Christian Bockermann

The way data is acquired in modern IT systems has changed: in almost all areas, data acquisition has evolved into continuous streams of events such as sensor readings, log messages, status updates, video content or transactions. Handling and continuously processing data streams has therefore become a key aspect of modern IT and of data analytics. In this tutorial, we will recap the landscape of big data streaming architectures and introduce the streams framework, which has been developed in the context of the SFB program and provides high-level means of defining streaming applications. The tutorial will focus on hands-on exercises with real-world data (e.g. DEBS 2013/2014 challenge, FACT telescope data, video data) to define and implement feature extraction processes and apply machine learning in a continuously running system. Besides feature extraction, we will outline the integration of MOA, an online machine learning library, to produce prediction models from streaming data. As another aspect, we will cover the online application of models trained on offline data, e.g. with the RapidMiner tool suite.

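To give a flavor of learning on a stream, here is a generic scikit-learn sketch that updates a linear model incrementally as batches arrive from a simulated stream. It does not use the streams framework or MOA, whose APIs are not reproduced here; the stream generator and all parameters are invented.

```python
# Generic illustration of online learning: a linear classifier updated
# incrementally (partial_fit) on mini-batches from a simulated stream.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier()
classes = np.array([0, 1])

def next_batch(n=100):
    """Simulate one batch of streaming events following a simple linear concept."""
    X = rng.normal(size=(n, 5))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
    return X, y

for step in range(200):                          # consume the stream batch by batch
    X, y = next_batch()
    if step > 0 and step % 50 == 0:
        # Prequential evaluation: test on the new batch before training on it.
        print(f"step {step}: accuracy {model.score(X, y):.3f}")
    model.partial_fit(X, y, classes=classes)
```
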
Coresets for k-means Clustering
Schedule: Thursday, 02.10., 09.00-10.30
Lecturer: Melanie Schmidt

Modern technologies challenge our algorithmic knowledge and, in particular, our ability to perform machine learning tasks. Data comes in large amounts, as a constant stream of new information. This situation poses many new research problems in very different areas. In addition to the huge number of practical challenges, it also calls for new theoretical models and foundations. In this lecture, we consider an unsupervised learning method, clustering, to illustrate a theoretical concept developed for data that arrives as a stream. The technique in question is the development of so-called coresets. A coreset of a set of input objects is a summary of the data. When presented with a stream of items that is too large to store, it is natural to compute some sort of compressed version of the input. It is clear that we cannot keep all important information. A coreset is a summary specifically designed for a prespecified optimization problem. For this specific problem, however, a coreset comes with a theoretically proven guarantee that it has approximately the same cost as the original point set, for any possible solution candidate. We review the detailed definition of coresets and different techniques to construct coresets in the context of k-means clustering, the task of finding k centers such that the sum of squared distances of all input points to their respective closest centers is minimized. Finally, we see an example of how the insights from the theoretical study of coresets can lead to a practically fast algorithm for the k-means problem.

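For reference, the standard guarantee can be stated as follows (given here as general background; the lecture's exact formulation may differ in details). A weighted point set S with weights w is an epsilon-coreset of the input P for k-means if, for every candidate set C of k centers,

\[
(1-\varepsilon)\,\mathrm{cost}(P,C) \;\le\; \sum_{s \in S} w(s)\,\min_{c \in C}\lVert s-c\rVert^2 \;\le\; (1+\varepsilon)\,\mathrm{cost}(P,C),
\qquad\text{where}\quad
\mathrm{cost}(P,C)=\sum_{p \in P}\min_{c \in C}\lVert p-c\rVert^2 .
\]
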
Introduction to Astroparticle Physics
Schedule: Thursday, 02.10., 11.00-12.30
Lecturer: Marlene Doert

Experiments in astroparticle physics are built to study signals from exotic phenomena such as super-massive black holes at the centers of galaxies or huge bursts of radiation created in the collision of neutron stars. Typically, the sites of operation are also exotic: a few examples are the high-energy neutrino detector IceCube, which is built deep into the Antarctic ice at the geographic South Pole, and the gamma-ray telescopes MAGIC and FACT, which are located near the top of the Roque de los Muchachos on the island of La Palma, at a height of 2200 m. The astroparticle physics research group at TU Dortmund participates in these experiments, in particular by contributing to the constant improvement and innovation of the applied data analysis techniques. One of the main challenges for astroparticle physics experiments in general is the search for extremely rare particle events in a vast background of data stemming from other interactions, with signal-to-background ratios typically ranging from 1 in 10^3 up to 1 in 10^9 events. To fulfill this task, large amounts of data are collected, saved and filtered for interesting signal patterns. Here, data mining algorithms and classification methods clearly present the key to better performance of the experiments and thus to the detection of fainter signals and rare (astro-)physical events. During this school, a project will be set up which aims at searching for a signal of cosmic muons. These cosmic particles are able to pass through the atmosphere to the Earth's surface and can be detected with common CCD devices. The analysis of the acquired data will be performed in the context of streaming algorithms, using the data analysis software package FACT-Tools.

Astroparticle Detector
Schedule: Thursday, 02.10., 14.00-17.30
Presenter: Stefan Michaelis

Earth is hit every day by countless particles from outside the solar system. Research collaborations from all over the world are building gigantic telescopes to capture these events. But even a standard CCD sensor, as included in every smartphone, is able to detect these phenomena. While regarded as noise in normal operation, tiny variations in CCD cell charge due to particle impacts can be attributed to these events. The constantly captured images then need to be processed to differentiate interesting events from background noise caused by heating or faulty pixels. This lecture will teach how to implement the necessary steps: capturing the images from the phone's camera, postprocessing, and particle classification. Methods from the previous lectures will be applied to the detection process. Finally, learn how to detect the impact on your phone's battery of running the detection pipeline. For active participation it is highly recommended to have a mobile computer with you. Instructions on the development environment will be issued before the summer school. Previous experience in Java programming is helpful. Optionally, you will learn how to run the software on an Android smartphone or in the emulation software.

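As a rough illustration of the postprocessing step described above, the sketch below flags pixels that stand out against an estimated noise floor in a single frame. The threshold, the function name and the simulated data are invented for this example; the actual exercise implements the pipeline in Java on Android, not in Python.

```python
# Toy postprocessing sketch: flag pixels far above the frame's noise level
# as particle candidates. Thresholds and data are invented for this example.
import numpy as np

def find_candidates(frame, sigma=5.0):
    """Return coordinates of pixels more than `sigma` std devs above the median."""
    noise_floor = np.median(frame)
    noise_std = np.std(frame)
    ys, xs = np.where(frame > noise_floor + sigma * noise_std)
    return list(zip(ys.tolist(), xs.tolist()))

# Simulated dark frame with sensor noise plus one bright, particle-like hit.
rng = np.random.default_rng(0)
frame = rng.normal(loc=10.0, scale=2.0, size=(480, 640))
frame[120, 200] += 40.0                      # injected signal

print(find_candidates(frame))                # expected: roughly [(120, 200)]
```
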