Active Simulation Data Mining

Simulations can serve as data generators for machine learning. Their high computational cost, however, requires a smart (i.e., an active) sampling of what to simulate. Depending on the simulation, such sampling amounts to active learning or to active class selection.

Learning from Simulations

A simulation models how the state of a system evolves over time, given the simulation parameters. Taken together, multiple steps of a simulation constitute a black-box data generator that produces data from its input parameters.

A simulation is a black-box data generator
Figure 1: A simulation generates training data from a vector of parameters. For active sampling, we consider either the feature vector or the label to be part of the parameters.
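
To make the black-box view concrete, here is a minimal sketch of a simulation that repeatedly applies a step function to a state. The dynamics (a noisy drift) and the names `step`, `run_simulation`, `drift`, `noise`, and `initial_state` are assumptions made for illustration only; they do not describe any particular simulator.

```python
import random

def step(state, params, rng):
    """One hypothetical simulation step: the state drifts and receives noise."""
    return state + params["drift"] + rng.gauss(0.0, params["noise"])

def run_simulation(params, n_steps, seed=0):
    """Black-box data generator: parameters in, final state out."""
    rng = random.Random(seed)  # seeded, so that runs are reproducible
    state = params["initial_state"]
    for _ in range(n_steps):
        state = step(state, params, rng)
    return state

# With zero noise, the toy system drifts deterministically.
output = run_simulation(
    {"initial_state": 0.0, "drift": 0.1, "noise": 0.0}, n_steps=10
)
```

From the outside, only `run_simulation` is visible: a caller supplies parameters and receives data, without access to the intermediate states.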

Active sampling tries to acquire an optimal set of labeled data for training a supervised model. If we consider the observation or the label to be part of the simulation input, this task corresponds to active learning [1] or to active class selection [2], respectively.
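
The two views differ only in which quantity is an input of the simulation. The following sketch contrasts them with two toy simulators; both functions and their dynamics (a logistic label model and class-dependent Gaussian features) are hypothetical stand-ins, not the simulations discussed on this page.

```python
import math
import random

def simulate_label(x, rng):
    """Active-learning view: the feature vector x is a parameter,
    and the simulation returns a label for it."""
    p_positive = 1.0 / (1.0 + math.exp(-sum(x)))  # assumed toy label model
    return 1 if rng.random() < p_positive else 0

def simulate_features(y, rng, n_features=2):
    """Active-class-selection view: the label y is a parameter,
    and the simulation returns a feature vector for it."""
    mean = 1.0 if y == 1 else -1.0  # assumed class-dependent mean
    return [rng.gauss(mean, 1.0) for _ in range(n_features)]

rng = random.Random(0)
y_sim = simulate_label([0.5, -0.2], rng)  # we choose x, the simulation yields y
x_sim = simulate_features(1, rng)         # we choose y, the simulation yields x
```

In the first case, the sampling strategy decides which feature vectors to query; in the second case, it decides which classes to generate.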

Active Class Selection

The goal of active class selection (ACS) [2] is to optimize the class proportions in newly acquired data; a classifier trained on that data should exhibit maximum performance during its deployment. With a simulation in which the label is part of the simulation parameters, we face exactly the ACS problem: in which class proportions should we simulate?
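
As an illustration of what acquiring data in given class proportions means in practice, the sketch below turns proportions into per-class simulation counts and then generates one batch. How the proportions themselves are chosen is the actual ACS question and is not addressed here; the class names, the `toy_simulation` function, and the largest-remainder rounding are assumptions made for this example.

```python
import random

def allocate_counts(proportions, batch_size):
    """Convert class proportions into integer counts that sum to batch_size,
    using largest-remainder rounding."""
    total = sum(proportions.values())
    exact = {c: batch_size * p / total for c, p in proportions.items()}
    counts = {c: int(v) for c, v in exact.items()}  # round down first
    leftover = batch_size - sum(counts.values())
    # Give the remaining slots to the classes with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda c: exact[c] - counts[c], reverse=True)
    for c in by_remainder[:leftover]:
        counts[c] += 1
    return counts

def toy_simulation(label, rng):
    """Hypothetical stand-in for an expensive simulation run."""
    mean = 1.0 if label == "signal" else -1.0
    return [rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)]

def acquire_batch(proportions, batch_size, rng):
    """Simulate one batch of (features, label) pairs in the requested proportions."""
    batch = []
    for label, n in allocate_counts(proportions, batch_size).items():
        batch.extend((toy_simulation(label, rng), label) for _ in range(n))
    return batch

rng = random.Random(0)
batch = acquire_batch({"signal": 0.25, "background": 0.75}, batch_size=10, rng=rng)
```

An ACS strategy would call `acquire_batch` repeatedly, updating the proportions between batches based on the classifier trained so far.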

You can find all of our online conference talks on YouTube:

Video 1: Our certificate motivates a strategy for ACS data acquisition under uncertainty (Interactive Adaptive Learning Workshop @ ECML-PKDD 2021).
Video 2: We refine our earlier theory to certify the robustness of classifiers against label shift—in ACS and anywhere else (ECML-PKDD 2021).
Video 3: We consider the ACS problem from the viewpoint of information theory, yielding an upper bound on the classifier's error (ICDM 2020).

Use Case—Cherenkov Astronomy

Cherenkov astronomy reasons about the characteristics of cosmic objects by studying their gamma radiation. Since no labeled data is available from the actual detectors, it is necessary to simulate the training data. In each simulation run, we can arbitrarily choose the type of the particle to be simulated—which is the label in the prediction task at hand.

A gamma particle interacting in Earth's atmosphere
Figure 2: A high-energy particle interacting in Earth's atmosphere produces a cascade of secondary particles, the air shower. This shower emits Cherenkov light, which is measured by a telescope [3]. The entire process of particle detection is reproduced in simulations.

Publications

Supplementary Material

Bibliography

  1. B. Settles. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2012.
  2. R. Lomasky, C. E. Brodley, M. Aernecke, D. Walt, and M. A. Friedl. Active class selection. In Proc. of the ECML, pages 640–647, 2007.
  3. C. Bockermann, K. Brügge, J. Buss, A. Egorov, K. Morik, W. Rhode, and T. Ruhe. Online analysis of high-volume data streams in astroparticle physics. In Proc. of the ECML-PKDD, pages 100–115, 2015.

Share your ideas with us!

Mirko Bunse

We are always looking for comments, criticism, and collaborators. Can we count you in? :)

mirko.bunse [ät] cs.tu-dortmund.de

You may also like our work on deconvolution.