C4 Regression approaches for large-scale high-dimensional data


Prof. Dr. Ickstadt, Katja
Prof. Dr. Sohler, Christian

The main objective of project C4 is the development of highly efficient regression approaches. We want to make modern statistical regression methods scalable to very large and high-dimensional data sets and settings where computational resources are scarce.

We focus on algorithmic approaches that can be efficiently implemented in streaming as well as in distributed environments. In particular, we develop methods to aggregate data and to reduce the number of observations using, e.g., random linear projections and sampling, as well as methods to reduce the dimensionality of the underlying, possibly Bayesian, model classes.
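
As an illustration of reducing the number of observations with a random linear projection, the following is a minimal sketch for ordinary least squares, assuming a dense Gaussian projection matrix. The names `sketch_and_solve` and `demo` and all parameter values are hypothetical; this is not the project's RaProR implementation, only one simple instance of the general idea.

```python
import numpy as np

def sketch_and_solve(X, y, k, seed=0):
    """Solve least squares on a k-row Gaussian sketch of (X, y)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Dense Gaussian projection, scaled so that E[S.T @ S] is the identity.
    S = rng.standard_normal((k, n)) / np.sqrt(k)
    # The reduced problem has only k observations instead of n.
    beta, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)
    return beta

def demo():
    """Compare the full and the sketched solution on synthetic data."""
    rng = np.random.default_rng(1)
    n, d = 5000, 5
    X = rng.standard_normal((n, d))
    beta_true = np.arange(1.0, d + 1.0)
    y = X @ beta_true + 0.1 * rng.standard_normal(n)
    beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
    beta_sketch = sketch_and_solve(X, y, k=500)
    return beta_full, beta_sketch

if __name__ == "__main__":
    full, sketched = demo()
    print("max coefficient difference:", np.max(np.abs(full - sketched)))
```

A dense Gaussian matrix is chosen here only for simplicity; in streaming or distributed settings a sparse or structured projection would usually be preferable, because it can be applied to one observation at a time.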

Sketching and sampling methods for regression approaches on large-scale data are important areas of research with many interesting open questions. Although basic models are well studied, research on complex and modern statistical methods has just begun. We pursue the study of novel data reduction techniques for, e.g., Bayesian generalised linear models, and aim at the challenging objective of unifying their algorithmic treatment to provide blueprints for broad statistical settings.
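
For the well-studied basic case of linear regression, a sampling-based reduction can be sketched as follows. The example assumes leverage-score sampling with reweighting, which is one standard choice and not necessarily the method used in the project; the function names and parameter values are hypothetical.

```python
import numpy as np

def leverage_score_sample(X, y, m, seed=0):
    """Draw m reweighted rows of (X, y) with probability proportional to leverage."""
    rng = np.random.default_rng(seed)
    # Leverage scores are the squared row norms of an orthonormal basis
    # of the column space of X.
    Q, _ = np.linalg.qr(X)
    scores = np.sum(Q**2, axis=1)
    probs = scores / scores.sum()
    idx = rng.choice(X.shape[0], size=m, replace=True, p=probs)
    # Reweight so that the subsampled squared loss is an unbiased
    # estimate of the squared loss on the full data.
    w = 1.0 / np.sqrt(m * probs[idx])
    return X[idx] * w[:, None], y[idx] * w

def demo_sampling():
    """Compare least squares on the full data and on the weighted sample."""
    rng = np.random.default_rng(2)
    n, d = 20000, 4
    X = rng.standard_normal((n, d))
    y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + 0.1 * rng.standard_normal(n)
    beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
    Xs, ys = leverage_score_sample(X, y, m=1000)
    beta_sample, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return beta_full, beta_sample

if __name__ == "__main__":
    full, sampled = demo_sampling()
    print("max coefficient difference:", np.max(np.abs(full - sampled)))
```

Computing exact leverage scores through a QR decomposition needs the full data in memory, so this version only illustrates the principle; a large-scale variant would have to approximate the scores.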

Project management:

Prof. Dr. Ickstadt, Katja
Prof. Dr. Sohler, Christian

Project members:

Geppert, Leo N.
Dr. Munteanu, Alexander
Riehl, Julian

Alumni:

Dr. Driemel, Anne
Dr. Köllmann, Claudia
König, Helena

Software:

RaProR - Random Projections for Bayesian linear regression (R package)

Publications:

Ickstadt/etal/2018a Ickstadt, Katja and Schäfer, Martin and Zucknick, Manuela. Toward Integrative Bayesian Analysis in Molecular Biology. In Annual Review of Statistics and Its Application, Vol. 5, pages 141-167, 2018.


Molina/etal/2018a Molina, Alejandro and Munteanu, Alexander and Kersting, Kristian. Core Dependency Networks. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI), 2018.


Munteanu/etal/2018a Alexander Munteanu and Chris Schwiegelshohn and Christian Sohler and David P. Woodruff. On Coresets for Logistic Regression. In Advances in Neural Information Processing Systems (NIPS), to appear, 2018.


Munteanu/Schwiegelshohn/2018a Munteanu, Alexander and Schwiegelshohn, Chris. Coresets - Methods and History: A Theoreticians Design Pattern for Approximation and Streaming Algorithms. In KI - Künstliche Intelligenz, Vol. 32, No. 1, pages 37-53, 2018.


Weihs/Ickstadt/2018a Weihs, Claus and Ickstadt, Katja. Data Science: the impact of statistics. In International Journal of Data Science and Analytics, Springer, 2018.


Geppert/etal/2017a Geppert, Leo N. and Ickstadt, Katja and Munteanu, Alexander and Quedenfeld, Jens and Sohler, Christian. Random projections for Bayesian regression. In Statistics and Computing, Vol. 27, No. 1, pages 79-101, 2017.


Schlieker/etal/2017a Schlieker, Laura and Telaar, Anna and Lueking, Angelika and Schulz-Knappe, Peter and Theek, Carmen and Ickstadt, Katja. Multivariate binary classification of imbalanced datasets - A case study on high-dimensional multiplex autoimmune assay data. In Biometrical Journal, 2017.


Treppmann/etal/2017a Treppmann, Tabea and Ickstadt, Katja and Zucknick, Manuela. Integration of multiple genomic data sources in a Bayesian Cox model for variable selection and prediction. In Computational and Mathematical Methods in Medicine, Vol. 2017, pages 1-19, 2017.


Huels/etal/2016a Hüls, Anke and Krämer, Ursula and Stolz, Sabine and Hennig, Frauke and Hoffmann, Barbara and Ickstadt, Katja and Vierkötter, Andrea and Schikowski, Tamara. Applicability of the Global Lung Initiative 2012 Reference Values for Spirometry for Longitudinal Data of Elderly Women. In PLOS ONE, Vol. 11, No. 6, pages e0157569, 2016.


Koellmann/etal/2016a Köllmann, Claudia and Ickstadt, Katja and Fried, Roland. Beyond unimodal regression: modelling multimodality with piecewise unimodal regression or deconvolution models. arXiv:1606.01666 [stat.AP], 2016.


Munteanu/Wornowizki/2015a Alexander Munteanu and Max Wornowizki. Correcting statistical models via empirical distribution functions. In Computational Statistics, Vol. 31, No. 2, pages 465-495, Springer, 2016.


Geppert/etal/2014a Geppert, Leo N. and Ickstadt, Katja and Munteanu, Alexander and Sohler, Christian. Random projections for Bayesian regression. No. 4, TU Dortmund, 2014.


Koellmann/etal/2014a Köllmann, Claudia and Bornkamp, Björn and Ickstadt, Katja. Unimodal regression using Bernstein-Schoenberg-splines and penalties. In Biometrics, Vol. 70, No. 4, pages 783-793, 2014.


Koellmann/etal/2014b Köllmann, Claudia and Ickstadt, Katja and Fried, Roland. Beyond unimodal regression: modelling multimodality with piecewise unimodal, mixture or additive regression. No. 8, TU Dortmund, 2014.


Munteanu/Wornowizki/2014a Alexander Munteanu and Max Wornowizki. Demixing empirical distribution functions. No. 2, TU Dortmund, 2014.


Schwiegelshohn/Sohler/2014a Chris Schwiegelshohn and Christian Sohler. Logistic Regression for Datastreams. No. 1, TU Dortmund, 2014.


Binder/etal/2012a Binder, Harald and Müller, Tina and Schwender, Holger and Golka, Klaus and Steffens, Michael and Hengstler, Jan G. and Ickstadt, Katja and Schumacher, Martin. Cluster-localized sparse logistic regression for SNP data. In Statistical Applications in Genetics and Molecular Biology, Vol. 11, No. 4, 2012.


Canzar/etal/2011c Canzar, Stefan and Marschall, Tobias and Rahmann, Sven and Schwiegelshohn, Chris. Solving The Minimum String Cover Problem. In David A. Bader and Petra Mutzel (editors), Proceedings of the SIAM Meeting on Algorithm Engineering and Experiments (ALENEX'12), pages 75-83, 2012.


Koellmann/etal/2012a Köllmann, Claudia and Bornkamp, Björn and Ickstadt, Katja. Unimodal regression using Bernstein-Schoenberg-splines and penalties. No. 6, TU Dortmund, 2012.


Lohr/etal/2012a Lohr, M. and Köllmann, C. and Freis, E. and Hellwig, B. and Hengstler, J. G. and Ickstadt, K. and Rahnenführer, J. Optimal strategies for sequential validation of significant features from high-dimensional genomic data. In Journal of Toxicology and Environmental Health, Part A, Vol. 75, No. 8-10, pages 447-460, 2012.


Schwender/etal/2012a Schwender, Holger and Selinski, Silvia and Blaszkewicz, Meinolf and Marchan, Rosemarie and Ickstadt, Katja and Golka, Klaus and Hengstler, Jan G. Distinct SNP combinations confer susceptibility to urinary bladder cancer in smokers and non-smokers. In PLOS ONE, Vol. 7, No. 12, 2012.


Ickstadt/etal/2011b Ickstadt, Katja and Bornkamp, Björn and Grzegorczyk, Marco and Wieczorek, Jakob and Sheriff, M. Rahuman and Grecco, Hernán E. and Zamir, Eli. Nonparametric Bayesian Networks (with discussion). In Bernardo, José M. and Bayarri, M. J. and Berger, James O. and Dawid, A. Philip and Heckerman, David and Smith, Adrian F. M. and West, M. (editors), Bayesian Statistics, Vol. 9, pages 283-316, 2011.


Schwender/etal/2011a Schwender, Holger and Ruczinski, Ingo and Ickstadt, Katja. Testing SNPs and sets of SNPs for importance in association studies. In Biostatistics, Vol. 12, No. 1, pages 18-32, 2011.


Sohler/Woodruff/2011a Sohler, Christian and Woodruff, David P. Subspace embeddings for the \(L_1\)-norm with applications. In Lance Fortnow and Salil P. Vadhan (editors), Proceedings of the 43rd ACM Symposium on Theory of Computing (STOC), pages 755-764, ACM, 2011.



Dissertations:

  • Munteanu/2018a - On large-scale probabilistic and statistical data analysis
  • Koellmann/2016a - Unimodal spline regression and its use in various applications with single or multiple modes

Final Theses:

Freese/2017a Freese, Maximilian. Sketch-basierte Bayes-regression mit MapReduce. TU Dortmund, 2017.


Lategahn/2016a Lategahn, Niels. Vergleich von Methoden zur Auswahl von Beobachtungen bei Regression mit fehlenden Y-Werten. TU Dortmund, 2016.


Mueller/2016a Müller, Steffen. Untersuchung von Regression auf eingebetteten Datensätzen unter Verwendung von verschiedenen Abstandsnormen und Penalisierungstermen. TU Dortmund, 2016.


Wollenberg/2016a Wollenberg, Alexander. Reduktion hochdimensionaler Datensätze für die logische Regression unter Verwendung von Leverage Scores mit besonderer Berücksichtigung von SNP-Daten. TU Dortmund, 2016.


Horn/2015a Simon Horn. Analyse von Metabolom-Daten der Arzneipflanze Duboisia: Hauptkomponentenanalyse, Clusterung und Peakidentifizierung. TU Dortmund, 2015.


Lange/2015a Laura Lange. Analyse von GC/IMS-Atemluftmessungen unter Berücksichtigung verschiedener Atemerfrischer. TU Dortmund, 2015.


Rathjens/2015a Rathjens, Jonathan. Hierarchische Bayes-Regression bei Einbettung großer Datensätze. TU Dortmund, 2015.


Mueller/2013a Müller, Steffen. Untersuchung der praktischen Anwendbarkeit von unimodaler Regression auf diverse naturwissenschaftliche Datensätze. TU Dortmund, 2013.


Okroy/2013a Okroy, Lena. Untersuchung der praktischen Anwendbarkeit von nichtlinearer Regression auf verschiedene Datensätze. TU Dortmund, 2013.


Quedenfeld/2013a Quedenfeld, Jens. Experimentelle Analyse verschiedener linearer \(\ell_2\)-Einbettungen von dünn besetzten Eingabedaten. TU Dortmund, 2013.


Jabs/2012a Jabs, Verena. Vergleich von Methoden zur Dimensionsreduktion unter Berücksichtigung der Rechenzeit und des Speicherbedarfs. TU Dortmund, 2012.


Rueppert/2012a Rüppert, Andreas. LASSO Regression für große Datenmengen. TU Dortmund, 2012.


Zhu/2012a Qingchui Zhu. Datenstromalgorithmen für Regression. TU Dortmund, 2012.



Preliminary Work:

Bornkamp/etal/2010a B. Bornkamp and K. Ickstadt and D. B. Dunson. Stochastically ordered multiple regression. In Biostatistics, Vol. 11, No. 3, pages 419-431, 2010.


Feldman/etal/2010a Dan Feldman and Morteza Monemizadeh and Christian Sohler and David Woodruff. Coresets and Sketches for High Dimensional Subspace Approximation Problems. In Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms, pages 630-649, 2010.


Bornkamp/etal/2009a B. Bornkamp and A. Fritsch and O. Kuss and K. Ickstadt. Penalty specialists among goalkeepers: A nonparametric Bayesian analysis of 44 years of German Bundesliga. In B. Schipp and W. Krämer (editors), Statistical Inference, Econometric Analysis and Matrix Algebra: Festschrift in Honour of Götz Trenkler, pages 63-76, Physica Verlag, 2009.


Bornkamp/Ickstadt/2009b Bornkamp, Björn and Ickstadt, Katja. Bayesian nonparametric estimation of continuous monotone functions with applications to dose-response analysis. In Biometrics, Vol. 65, pages 198-205, 2009.


Frahling/etal/2008a Gereon Frahling and Piotr Indyk and Christian Sohler. Sampling in Dynamic Data Streams and Applications. In International Journal of Computational Geometry and Applications (Special Issue with selected papers from the 21st ACM Symposium on Computational Geometry), Vol. 18, No. 1/2, pages 3-28, 2008.


Schwender/Ickstadt/2008a Schwender, H. and Ickstadt, K. Identification of SNP interactions using logic regression. In Biostatistics, Vol. 9, pages 187-198, 2008.


Feldman/etal/2007a Dan Feldman and Morteza Monemizadeh and Christian Sohler. A PTAS for k-means clustering based on weak coresets. In Proceedings of the 23rd ACM Symposium on Computational Geometry, pages 11-18, 2007.


Fritsch/2007a Fritsch, A. and Ickstadt, K. Comparing logic regression based methods for identifying SNP interactions. In Hochreiter, S. and Wagner, R. (editors), Bioinformatics in Research and Development, Springer, 2007.


Nunkesser/etal/2007a Nunkesser, R. and Bernholt, T. and Schwender, H. and Ickstadt, K. and Wegener, I. Detecting high-order interactions of single nucleotide polymorphisms using genetic programming. In Bioinformatics, Vol. 23, pages 3280-3288, 2007.


Ickstadt/Wolpert/99a K. Ickstadt and R. L. Wolpert. Spatial regression for marked point processes. In J. M. Bernardo and J. O. Berger and A. P. Dawid and A. F. M. Smith (editors), Bayesian Statistics 6, pages 323-341, Oxford, Oxford University Press, 1999.


Wolpert/Ickstadt/98a R. L. Wolpert and K. Ickstadt. Poisson/Gamma random field models for spatial statistics. In Biometrika, Vol. 85, pages 251-267, 1998.