Active Simulation Data Mining

Simulations can serve as data generators for machine learning. Their high computational cost, however, requires a smart (that is, an active) sampling of what to simulate. Depending on the simulation, such sampling amounts to active learning or to active class selection.

Learning from Simulations

A simulation models how the state $\,\boldsymbol s\in\mathcal S\,$ of a system evolves over time, given the simulation parameters $\,\boldsymbol\rho\in\mathcal P\,$. Multiple steps of a simulation constitute a black-box data generator that produces data from its input parameters.

$Sim_{\boldsymbol\rho}(\boldsymbol s_{t}, \, \Delta t) \;=\; \boldsymbol s_{t+\Delta t} \,,\qquad 0 \leq t \leq T$
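As a minimal sketch of how repeated simulation steps form a data generator, the step function can be applied in a loop from $t = 0$ up to $T$. The names `rollout` and `make_decay_step`, as well as the toy exponential-decay system, are assumptions made purely for illustration; they are not taken from any actual simulation.

```python
from typing import Callable, List

State = List[float]
StepFn = Callable[[State, float], State]


def make_decay_step(rho: float) -> StepFn:
    """Hypothetical toy system: ds/dt = -rho * s, advanced by one explicit Euler step."""
    def step(s: State, dt: float) -> State:
        return [x - rho * x * dt for x in s]
    return step


def rollout(sim_step: StepFn, s0: State, dt: float, T: float) -> List[State]:
    """Apply the step function repeatedly, starting at t = 0, until t reaches T.

    Returns all visited states, including the initial one.
    """
    n_steps = int(round(T / dt))
    states = [s0]
    for _ in range(n_steps):
        states.append(sim_step(states[-1], dt))
    return states


# The parameter rho fully determines the generated trajectory.
trajectory = rollout(make_decay_step(rho=0.5), s0=[1.0], dt=0.1, T=1.0)
```

From the outside, only the parameter `rho` and the resulting `trajectory` are visible, which is what makes the simulation a black box for the learner.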

Figure 1: A simulation generates training data from a vector of parameters. For active sampling, we consider either the feature vector or the label to be part of the parameters.

Active sampling tries to acquire an optimal set of labeled data to be used for training a supervised model. If we consider the observation $\,\boldsymbol x\,$ or the label $\,y\,$ to be a part of the simulation input $\,\boldsymbol\rho\,$, this task corresponds to active learning [1] or to active class selection [2], respectively.
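The distinction between the two settings can be sketched with two toy generators. Both functions and their noise models are hypothetical and only illustrate which quantity is chosen by us and which is returned by the simulation.

```python
import random


def simulate_label(x: float, rng: random.Random) -> int:
    """Active learning view: we choose the observation x, the simulation returns y.

    Hypothetical toy rule: a noisy threshold at x = 0.
    """
    return int(x + rng.gauss(0.0, 0.1) > 0.0)


def simulate_observation(y: int, rng: random.Random) -> float:
    """Active class selection view: we choose the label y, the simulation returns x.

    Hypothetical toy rule: a class-conditional Gaussian with mean -1 (y = 0) or +1 (y = 1).
    """
    mean = 1.0 if y == 1 else -1.0
    return rng.gauss(mean, 1.0)


rng = random.Random(0)  # fixed seed for reproducibility

x_chosen = 0.7
al_sample = (x_chosen, simulate_label(x_chosen, rng))       # (x, y) with x chosen

y_chosen = 1
acs_sample = (simulate_observation(y_chosen, rng), y_chosen)  # (x, y) with y chosen
```

In both cases the result is a labeled pair `(x, y)`; what differs is which component the sampling strategy controls.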

Active Class Selection

The goal of active class selection (ACS) [2] is to optimize the class proportions in newly acquired data; a classifier trained on that data should exhibit maximum performance during its deployment. With a simulation in which the label $\,y\,$ is part of the simulation parameters $\,\boldsymbol\rho\,$, we face exactly the ACS problem: in which class proportions should we simulate?
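To make the question concrete, the following sketch turns class-wise accuracy estimates into class proportions and then into simulation counts for the next batch. The inverse-accuracy rule is only one simple heuristic, assumed here for illustration; it is not the strategy derived in the publications listed on this page. The function names, class names, and accuracy values are hypothetical.

```python
from typing import Dict


def inverse_accuracy_proportions(class_accuracies: Dict[str, float]) -> Dict[str, float]:
    """Assign each class a proportion that is proportional to 1 / accuracy.

    Classes on which the current classifier performs poorly receive more data.
    """
    eps = 1e-6  # guards against division by zero
    weights = {c: 1.0 / max(a, eps) for c, a in class_accuracies.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def allocate(proportions: Dict[str, float], batch_size: int) -> Dict[str, int]:
    """Convert proportions into integer counts that sum to batch_size.

    Uses the largest-remainder method to distribute rounding leftovers.
    """
    raw = {c: p * batch_size for c, p in proportions.items()}
    counts = {c: int(r) for c, r in raw.items()}
    leftover = batch_size - sum(counts.values())
    by_remainder = sorted(raw, key=lambda c: raw[c] - counts[c], reverse=True)
    for c in by_remainder[:leftover]:
        counts[c] += 1
    return counts


# Hypothetical accuracy estimates of the current classifier, per class.
accuracies = {"class_a": 0.9, "class_b": 0.6}
proportions = inverse_accuracy_proportions(accuracies)
counts = allocate(proportions, batch_size=100)
```

Each entry of `counts` is the number of simulation runs to start with the corresponding label fixed in the simulation parameters.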

You can find all of our online conference talks on YouTube:

Video 1: Our certificate motivates a strategy for ACS data acquisition under uncertainty (Interactive Adaptive Learning Workshop @ ECML-PKDD 2021).

Video 2: We refine our earlier theory to certify the robustness of classifiers against label shift, in ACS and elsewhere (ECML-PKDD 2021).

Video 3: We consider the ACS problem from the viewpoint of information theory, yielding an upper bound on the classifier's error (ICDM 2020).

Use Case—Cherenkov Astronomy

Cherenkov astronomy reasons about the characteristics of cosmic objects by studying their gamma radiation. Since no labeled data is available from the actual detectors, it is necessary to simulate the training data. In each simulation run, we can arbitrarily choose the type of the particle to be simulated—which is the label in the prediction task at hand.

Figure 2: A high-energy particle interacting in Earth's atmosphere produces a cascade of secondary particles, the air shower. This shower emits Cherenkov light, which is measured by a telescope [3]. The entire process of particle detection is reproduced in simulations.

Publications

M. Bunse and K. Morik: Active Class Selection with Uncertain Deployment Class Proportions. In Interactive Adaptive Learning Workshop at ECML-PKDD, 2021 (to appear).
M. Bunse and K. Morik: Certification of Model Robustness in Active Class Selection. In Europ. Conf. on Mach. Learn. and Knowledge Discovery in Databases, 2021 (to appear).
M. Bunse, D. Weichert, A. Kister, and K. Morik: Optimal Probabilistic Classification in Active Class Selection. In Int. Conf. on Data Mining, 2020.
M. Bunse, A. Saadallah, and K. Morik: Towards Active Simulation Data Mining. In Int. Tutorial and Workshop on Interactive Adaptive Learning at ECML-PKDD, pages 104-107, 2019.
M. Bunse and K. Morik: What Can We Expect from Active Class Selection? In Lernen, Wissen, Daten, Analysen (LWDA), pages 79-83, 2019.

Supplementary Material

Experiments on ACS at ECML-PKDD 2021: https://github.com/mirkobunse/AcsCertificates.jl
Experiments on ACS at ICDM 2020: https://github.com/mirkobunse/acs-icdm20
Interactive Adaptive Learning at ECML-PKDD 2019: https://p.ies.uni-kassel.de/ial2019/index.html

Bibliography

[1] B. Settles. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2012.
[2] R. Lomasky, C. E. Brodley, M. Aernecke, D. Walt, and M. A. Friedl. Active class selection. In Proc. of the ECML, pages 640–647, 2007.
[3] C. Bockermann, K. Brügge, J. Buss, A. Egorov, K. Morik, W. Rhode, and T. Ruhe. Online analysis of high-volume data streams in astroparticle physics. In Proc. of the ECML-PKDD, pages 100–115, 2015.

Mirko Bunse

We are always looking for comments, criticism, and for collaborators. Can we count you in? :)

mirko.bunse [ät] cs.tu-dortmund.de

You may also like our work on deconvolution.