Data Science

Using data to gain scientific insight has long been a key aim of statistics and of knowledge acquisition and discovery. The current interest in data science is motivated by the volume and high dimensionality of data, which no longer allow a scientist to inspect and explore the data with standard procedures. The international Conference on Data Science and Advanced Analytics started in 2014 and has attracted growing interest. In 2017, Katharina Morik gave the invited keynote, reporting on the results of C3. As its title already indicates, the CRC 876 aims at providing information based on data. Projects in bioinformatics and physics demonstrate the impact of resource-constrained data analysis for science.


Life sciences have experienced a tremendous upswing due to modern techniques that deliver detailed data. Cancer patients are better diagnosed and receive a more personalised therapy thanks to genomic data on the order of 100 gigabytes per patient. Yet, detecting biomarkers in the high-dimensional data is still challenging. The oncology project C1 analyses tumour development using neuroblastoma as an example. The question of whether mutations distinguish between primary and relapse neuroblastoma was successfully investigated, and a paper on the results was published in Nature Genetics. Mapping DNA fragments against a reference genome is computationally demanding. C1 has developed a new approach to read mapping that uses hashing to distinguish true read matches from random matches. Recent nanopore sequencing may turn DNA sequencing into a commodity, but at the same time demands new data analysis methods. The raw data here is a large-volume, high-frequency signal of ion currents, which needs to be translated into a DNA sequence. Identifying biomarkers requires either better methods for DNA base calling from ion currents or a different representation of the biomarkers. C1 will investigate both alternatives.
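The idea of using hashing to separate true read matches from random ones can be illustrated with a generic k-mer index. The following is a minimal sketch and not the method developed in C1: the function names, the toy reference sequence, and the `min_votes` threshold are all assumptions made for illustration. Every k-mer of the read that is found in the hash index votes for the read start position it implies; a true match collects many consistent votes, while random matches scatter over different positions.

```python
from collections import Counter, defaultdict


def build_index(reference: str, k: int) -> dict:
    """Hash every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read: str, index: dict, k: int, min_votes: int):
    """Return the best-supported start position, or None if support is weak."""
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            # Each k-mer hit votes for the read start position it implies.
            votes[pos - offset] += 1
    if not votes:
        return None
    start, count = votes.most_common(1)[0]
    # Hypothetical threshold: too few consistent votes are treated as random matches.
    return start if count >= min_votes else None


# Toy data (assumed for illustration only).
reference = "ACGTTGCATGCCGATAGCTTAGGCTAACGT"
index = build_index(reference, k=4)
true_read = reference[8:20]
random_read = "TTTTTTTTTTTT"

true_position = map_read(true_read, index, k=4, min_votes=5)
random_position = map_read(random_read, index, k=4, min_votes=5)
```

In this toy setting, the read cut from the reference is placed at its original position, while the unrelated read finds no supporting k-mers and is rejected.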

The transfer project TB1 has developed methods for the analysis of breath measurements obtained by ion mobility spectrometry (IMS). The CRC 876, together with the Center of Breath Research at the University of the Saarland, Reutlingen University and B & S Analytik, organised an international symposium on Metabolites in Process Exhaust Air and Breath Air in 2015. The company B & S Analytik offers the products Edmon (which measures exhaled drugs), BioScout, BreathDiscovery-Animal and BreathDiscovery-Bacteria. From the resource-aware automatic preprocessing of raw IMS data to the decision support for disease diagnosis and treatment selection, state-of-the-art statistical classification algorithms have been identified and tested on prototypical cases. A publication in PLOS ONE summarizes the results. Hence, the transfer project ends successfully with the second phase.

Project B2 on detecting biological nano-objects such as viruses or vesicles uses Plasmon Assisted Microscopy Of Nano-sized Objects (PAMONO). The journal Sensors had PAMONO on its cover and the paper on nanoparticle size distribution as the leading article in March 2017. A challenge is that the image sequences should be processed in (soft) real-time while minimising resource consumption, e.g., of energy and memory. Efficient and effective algorithms, partly in the framework of deep learning, have been developed. It turned out that the PAMONO sensor can be extremely useful in pharmaceutical quality control and in-process control of biological materials that may contain viruses (blood products) or virus-like particles (vaccines). Hence, for a third funding phase, project B2 is proposed to be continued as a transfer project. Application partners are ARTES Biotechnology GmbH and Paul-Ehrlich-Institute. The PAMONO technology will be extended to become an adaptive biosensor/actuator unit that can be put to good use in medicinal product quality control.


Understanding production processes and developing models for controlling them are now supported by data-driven prognosis. Predictive maintenance, quality prediction, and support of model predictive control have found their way into international engineering conferences. The theme of the 2018 Conference on Manufacturing Systems is Smart Manufacturing, and two keynote speakers address the use of data, called the "fourth industrial revolution" or "industrial digitalisation", and "disruptive technologies" such as big data analytics. The CRC 876 project B3 on data mining in automated processes contributed a paper on optimising the milling process through machine learning. Machine learning for production often suffers from too little data, and the available data are extremely unbalanced with respect to ok and not-ok parts. Simulation contributes additional data, but is often very slow. A way out of this dilemma is active learning, which orders specific simulations that improve the currently learned model. Combining simulation and active learning is a promising approach that also has an impact on another project, namely the Mercur project on tunnelling processes with the Ruhr University Bochum.
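How active learning can keep the number of slow simulation runs small may be seen in a deliberately simple case. The sketch below is an illustration under stated assumptions, not the approach of project B3: `simulate`, the single process parameter `feed_rate`, and the hidden limit of 0.37 are hypothetical. The learned model is a one-dimensional ok/not-ok threshold, and the learner always orders the simulation at the setting where this model is least certain, i.e. midway between the largest setting known to be ok and the smallest known to be not-ok.

```python
def simulate(feed_rate: float) -> bool:
    """Hypothetical stand-in for a slow simulation: True means the part is ok."""
    return feed_rate <= 0.37  # hidden quality limit, unknown to the learner


def active_learning(lo: float, hi: float, budget: int):
    """Estimate the ok/not-ok boundary with at most `budget` simulation runs.

    `lo` is a setting known to be ok, `hi` a setting known to be not-ok.
    """
    calls = 0
    for _ in range(budget):
        # Uncertainty sampling: query where the current model is least certain.
        query = (lo + hi) / 2
        calls += 1
        if simulate(query):
            lo = query  # boundary lies above the queried setting
        else:
            hi = query  # boundary lies at or below the queried setting
    return (lo + hi) / 2, calls


estimate, calls = active_learning(0.0, 1.0, budget=10)
```

With ten ordered simulations the uncertain interval shrinks by a factor of 2^10, whereas ten simulations on a fixed grid would only narrow the boundary to about a tenth of the range. Real production models have many parameters and noisy labels, so this only conveys the principle.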

In addition to gaining insight into production processes, quality prediction and predictive monitoring based on distributed sensor measurements are key to practical applications. Project B3, together with the company RapidMiner, organised the Industrial Data Science Day in 2017. Use cases on outlier detection, quality prediction and reduced testing efforts exist and will be used for feedback on our methods. However, further research on real-time learning and the management of many models is still urgently needed.

The book on Industrie 4.0, edited by Michael ten Hompel and colleagues, has 2.16 million downloads, making it number one in the field of engineering. These two examples already show the visibility of Dortmund's research in the field.

Traffic prognosis and vehicle routing are important for sustainable transport systems, for private routing, routing of fleets of carrier vehicles, and for public transportation. After early studies of traffic flow, the more individual aspect of multimodal route planning has been investigated based on graph theory and algorithm engineering [8]. Mobility patterns of citizens have been analysed in geographic knowledge discovery. Now, automated vehicles have raised many questions about traffic control, routing, signalling, and communication between vehicles, and about how autonomous cars and human drivers will coordinate their driving. Project B4 integrates traffic data analysis with information gathered from vehicles and road infrastructure through communication technologies, e.g., 5G networks, in order to model hybrid traffic.

The European project VaVel has already used the spatio-temporal random fields of the analysis project A1 on streaming data and the distributed label proportion approach of the production project B3 for privacy-preserving traffic prediction. Christian Wietfeld contributes results from the CRC 876 and his other projects to the 5G initiative for Germany and collaborates with car manufacturers on optimal transfer times for car data. In this way, basic research of the CRC 876 is already being transferred.


The role of experiments in physics has become more and more data-oriented, because the events of interest demand sophisticated indirect sensing that produces petabytes of data. The finding of the 17th of August 2017, the merger of two neutron stars, illustrates the many senses of astrophysics. 130 million light years away, two stars that had gone supernova and become super-dense neutron stars had been circling each other for most of the history of the universe. They merged in a collision, and gamma ray bursts from twin jet streams, an outflow of neutrinos, and ejected material were measured by the Swope telescope in Chile. Moreover, the gravitational wave observatories in the USA and Italy detected ripples in space and time [9]. Such evidence for hypotheses is only possible if extremely rare events can be captured. The instruments require data management and analysis, which support the way to a more fundamental understanding of the universe. The CRC 876 takes part in the endeavour with two projects, namely astrophysics project C3 and particle physics project C5.

In the second funding phase of the CRC 876, the C3 project developed the FACT tools, a complete pipeline for the FACT telescope that leads from calibration and data cleaning through feature extraction and selection to signal separation and energy estimation, built on the streams-framework. Over one year, the student group PG 594 used the compute cluster of the CRC 876 to implement the processes of the workflow in a parallel and distributed programming paradigm that combines MapReduce and Spark or Storm, testing various machine learning alternatives for each step in the pipeline. The next generation of large telescopes, as well as the upcoming Cherenkov Telescope Array (CTA) and IceCube-Gen2, demand even more resource-efficient analytics. During a visit to the French Alternative Energies and Atomic Energy Commission (CEA) in spring 2017, C3 member Kai Brügge ported the concepts and algorithms of the FACT tools to CTA. Work on real-time tools for CTA will be continued. The work on spectral reconstruction and deep convolutional networks will also be continued in order to exploit the telescopes' full potential.

The particle physics project C5 started in the second phase of the CRC 876. It is based on a collaboration with the Large Hadron Collider beauty (LHCb) experiment at CERN and aims at the development of compact and low-energy streaming algorithms that are accelerated using GPUs and FPGAs. As in C3, the learning task amounts to searching for a needle in a haystack. C5 focuses on memory and storage of the data and then uses the MapReduce paradigm for selecting the extremely rare events. The goal for the proposed funding phase is to connect the results from the current phase, yielding a pipeline that scales up to the coming upgrade of the LHCb project.
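The MapReduce pattern for selecting rare events can be sketched in a few lines. This is an illustrative toy, not the C5 implementation: the event records, the `energy` field, and the threshold of 5.0 are assumptions. The map step filters each data chunk independently, so it could run in parallel on separate nodes, and the reduce step merges the few surviving candidates.

```python
from functools import reduce


def map_chunk(chunk, threshold):
    """Map step: keep only the candidate events of one data chunk."""
    return [event for event in chunk if event["energy"] > threshold]


def reduce_selected(left, right):
    """Reduce step: merge the selections of two chunks."""
    return left + right


# Hypothetical event records, split into chunks as they might be stored.
chunks = [
    [{"id": 1, "energy": 0.2}, {"id": 2, "energy": 0.4}],
    [{"id": 3, "energy": 9.7}, {"id": 4, "energy": 0.1}],
    [{"id": 5, "energy": 0.3}],
]

# Each chunk is mapped on its own; the partial results are then reduced.
mapped = (map_chunk(chunk, threshold=5.0) for chunk in chunks)
selected = reduce(reduce_selected, mapped, [])
selected_ids = [event["id"] for event in selected]
```

Because the map step discards almost all events locally, only a small amount of data has to be moved to the reduce step, which is the property that makes the pattern attractive for needle-in-a-haystack selections.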

The physics projects combine hardware aspects of the measuring instruments and the platforms for analysis with the models of learning and algorithms. In addition to delivering tools and publishing collaborative papers at conferences, both in physics and in machine learning, the transfer into the physics community has recently been stepped up via the working group "Physics, Information Technology, Artificial Intelligence" within the German Physical Society (DPG), where Wolfgang Rhode and Katharina Morik are on the steering committee. Moreover, the data sets from our CRC will be used for machine learning training, since the data are not subject to privacy or competition restrictions. It is planned to provide data sets to the Competence Center on Machine Learning Rhein Ruhr for its education transfer activities.