C4 Regression approaches for large-scale high-dimensional data


Prof. Dr. Ickstadt, Katja
Dr. Munteanu, Alexander

The main objective of project C4 is the development of highly efficient regression approaches. We want to make modern statistical regression methods scalable to very large and high-dimensional data sets and settings where computational resources are scarce.

We focus on algorithmic approaches that can be efficiently implemented in streaming as well as in distributed environments. In particular, we develop methods to aggregate data and to reduce the number of observations using, e.g., random linear projections and sampling, as well as methods to reduce the dimensionality of the underlying, possibly Bayesian, model classes.
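To illustrate the random linear projections mentioned above, the following minimal sketch reduces the number of observations of an ordinary least squares problem before solving it. It is not taken from the project's software: the function name `sketch_and_solve`, the choice of a dense Gaussian sketching matrix, and the synthetic data are all assumptions made for this example.

```python
import numpy as np


def sketch_and_solve(X, y, sketch_rows, rng):
    """Solve least squares on a randomly projected copy of (X, y).

    Hypothetical helper: a dense Gaussian sketch is only one possible
    choice of random linear projection.
    """
    n = X.shape[0]
    # Gaussian sketching matrix, scaled so that E[S.T @ S] is the identity.
    S = rng.standard_normal((sketch_rows, n)) / np.sqrt(sketch_rows)
    # The reduced problem has sketch_rows observations instead of n.
    beta, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)
    return beta


# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
n, d = 5000, 5
X = rng.standard_normal((n, d))
beta_true = np.arange(1, d + 1, dtype=float)
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Reference solution on the full data and solution on the sketch.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_sketch = sketch_and_solve(X, y, sketch_rows=500, rng=rng)

print("max deviation from full solution:", np.max(np.abs(beta_sketch - beta_full)))
```

In this toy setting the sketched problem uses 500 rows instead of 5000. A dense Gaussian matrix is used only for readability, since it is costly to apply to very large inputs.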

Sketching and sampling methods for regression approaches on large-scale data are important areas of research with many interesting open questions. Although basic models are well studied, research on complex and modern statistical methods has just begun. We pursue the study of novel data reduction techniques for, e.g., Bayesian generalised linear models, and aim at the challenging objective of unifying their algorithmic treatment to provide blueprints for broad statistical settings.

Project management:

Prof. Dr. Ickstadt, Katja
Dr. Munteanu, Alexander

Alumni project management:


Prof. Dr. Sohler, Christian

Alumni:

Denecke, Esther
Ding, Zeyu
Dr. Driemel, Anne
Dr. Geppert, Leo N.
Dr. Köllmann, Claudia
König, Helena
Dr. Munteanu, Alexander
Omlor, Simon
Riehl, Julian

Software:

Coresets and Oblivious Sketches for Logistic Regression
R package mrregression: Regression Analysis for Very Large Data Sets via Merge and Reduce
RaProR - Random Projections for Bayesian linear regression (R package)

Publications:

Munteanu/etal/2022a Munteanu, Alexander and Omlor, Simon and Peters, Christian. p-Generalized Probit Regression and Scalable Maximum Likelihood Estimation via Sketching and Coresets. In The 25th International Conference on Artificial Intelligence and Statistics (AISTATS), 2022.


Munteanu/etal/2022b Munteanu, Alexander and Omlor, Simon and Song, Zhao and Woodruff, David P. Bounding the Width of Neural Networks via Coupled Initialization - A Worst Case Analysis. In Proceedings of the 39th International Conference on Machine Learning (ICML), 2022.


Madjar/etal/2021a Madjar, Katrin and Zucknick, Manuela and Ickstadt, Katja and Rahnenführer, Jörg. Combining heterogeneous subgroups with graph-structured variable selection priors for Cox regression. In BMC Bioinform., Vol. 22, No. 1, pages 586, 2021.


Munteanu/etal/2021a Munteanu, Alexander and Omlor, Simon and Woodruff, David P. Oblivious Sketching for Logistic Regression. In Proceedings of the 38th International Conference on Machine Learning (ICML), 2021.


Parry/etal/2021a Parry, Katharina and Geppert, Leo N. and Munteanu, Alexander and Ickstadt, Katja. Cross-Leverage Scores for Selecting Subsets of Explanatory Variables. In arXiv e-prints, Vol. abs/2109.08399, 2021.


Geppert/etal/2020a Geppert, Leo N. and Ickstadt, Katja and Munteanu, Alexander and Sohler, Christian. Streaming statistical models via Merge & Reduce. In International Journal of Data Science and Analytics, Vol. 10, No. 4, pages 331-347, 2020.


Krivosija/Munteanu/2019a Krivošija, Amer and Munteanu, Alexander. Probabilistic smallest enclosing ball in high dimensions via subgradient sampling. In Proceedings of the 35th International Symposium on Computational Geometry (SoCG), pages 47:1-47:14, 2019.


Meintrup/etal/2019a Meintrup, Stefan and Munteanu, Alexander and Rohde, Dennis. Random projections and sampling algorithms for clustering of high-dimensional polygonal curves. In Advances in Neural Information Processing Systems 32 (NeurIPS), pages 12807-12817, 2019.


Munteanu/etal/2019a Munteanu, Alexander and Nayebi, Amin and Poloczek, Matthias. A Framework for Bayesian Optimization in Embedded Subspaces. In Proceedings of the 36th International Conference on Machine Learning (ICML), Vol. 97, pages 4752-4761, Long Beach, California, USA, PMLR, 2019.


Tietz/etal/2019a Tietz, Tobias and Selinski, Silvia and Golka, Klaus and Hengstler, Jan G. and Gripp, Stephan and Ickstadt, Katja and Ruczinski, Ingo and Schwender, Holger. Identification of interactions of binary variables associated with survival time using survivalFS. In Archives of Toxicology, Vol. 93, No. 3, pages 585-602, 2019.


Wigmann/etal/2019a Wigmann, Claudia and Lange, Laura and Vautz, Wolfgang and Ickstadt, Katja. Modelling and Classification of GC/IMS Breath Gas Measurements for Lozenges of Different Flavours. In Applications in Statistical Computing, pages 31-48, Springer, 2019.


Ickstadt/etal/2018a Ickstadt, Katja and Schäfer, Martin and Zucknick, Manuela. Toward Integrative Bayesian Analysis in Molecular Biology. In Annual Review of Statistics and Its Application, Vol. 5, No. 1, pages 141-167, 2018.


Molina/etal/2018a Molina, Alejandro and Munteanu, Alexander and Kersting, Kristian. Core Dependency Networks. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI), 2018.


Munteanu/etal/2018a Munteanu, Alexander and Schwiegelshohn, Chris and Sohler, Christian and Woodruff, David P. On Coresets for Logistic Regression. In Advances in Neural Information Processing Systems 31 (NeurIPS), 2018.


Munteanu/Schwiegelshohn/2018a Munteanu, Alexander and Schwiegelshohn, Chris. Coresets - Methods and History: A Theoreticians Design Pattern for Approximation and Streaming Algorithms. In KI - Künstliche Intelligenz, Vol. 32, No. 1, pages 37-53, 2018.


Weihs/Ickstadt/2018a Weihs, Claus and Ickstadt, Katja. Data Science: the impact of statistics. In International Journal of Data Science and Analytics, Springer, 2018.


Geppert/etal/2017a Geppert, Leo N. and Ickstadt, Katja and Munteanu, Alexander and Quedenfeld, Jens and Sohler, Christian. Random projections for Bayesian regression. In Statistics and Computing, Vol. 27, No. 1, pages 79-101, 2017.


Schlieker/etal/2017a Schlieker, Laura and Telaar, Anna and Lueking, Angelika and Schulz-Knappe, Peter and Theek, Carmen and Ickstadt, Katja. Multivariate binary classification of imbalanced datasets - A case study on high-dimensional multiplex autoimmune assay data. In Biometrical Journal, 2017.


Treppmann/etal/2017a Treppmann, Tabea and Ickstadt, Katja and Zucknick, Manuela. Integration of multiple genomic data sources in a Bayesian Cox model for variable selection and prediction. In Computational and Mathematical Methods in Medicine, Vol. 2017, pages 1-19, 2017.


Huels/etal/2016a Hüls, Anke and Krämer, Ursula and Stolz, Sabine and Hennig, Frauke and Hoffmann, Barbara and Ickstadt, Katja and Vierkötter, Andrea and Schikowski, Tamara. Applicability of the Global Lung Initiative 2012 Reference Values for Spirometry for Longitudinal Data of Elderly Women. In PLOS ONE, Vol. 11, No. 6, pages e0157569, 2016.


Koellmann/etal/2016a Köllmann, Claudia and Ickstadt, Katja and Fried, Roland. Beyond unimodal regression: modelling multimodality with piecewise unimodal regression or deconvolution models. arXiv:1606.01666 [stat.AP], 2016.


Munteanu/Wornowizki/2015a Munteanu, Alexander and Wornowizki, Max. Correcting statistical models via empirical distribution functions. In Computational Statistics, Vol. 31, No. 2, pages 465-495, Springer, 2016.


Koellmann/etal/2014a Köllmann, Claudia and Bornkamp, Björn and Ickstadt, Katja. Unimodal regression using Bernstein-Schoenberg-splines and penalties. In Biometrics, Vol. 70, No. 4, pages 783-793, 2014.


Koellmann/etal/2014b Köllmann, Claudia and Ickstadt, Katja and Fried, Roland. Beyond unimodal regression: modelling multimodality with piecewise unimodal, mixture or additive regression. No. 8, TU Dortmund, 2014.


Schwiegelshohn/Sohler/2014a Chris Schwiegelshohn and Christian Sohler. Logistic Regression for Datastreams. No. 1, TU Dortmund, 2014.


Binder/etal/2012a Binder, Harald and Müller, Tina and Schwender, Holger and Golka, Klaus and Steffens, Michael and Hengstler, Jan G. and Ickstadt, Katja and Schumacher, Martin. Cluster-localized sparse logistic regression for SNP data. In Statistical Applications in Genetics and Molecular Biology, Vol. 11, No. 4, 2012.


Canzar/etal/2011c Canzar, Stefan and Marschall, Tobias and Rahmann, Sven and Schwiegelshohn, Chris. Solving The Minimum String Cover Problem. In David A. Bader and Petra Mutzel (editors), Proceedings of the SIAM Meeting on Algorithm Engineering and Experiments (ALENEX'12), pages 75-83, 2012.


Koellmann/etal/2012a Köllmann, Claudia and Bornkamp, Björn and Ickstadt, Katja. Unimodal regression using Bernstein-Schoenberg-splines and penalties. No. 6, TU Dortmund, 2012.


Lohr/etal/2012a Lohr, M. and Köllmann, C. and Freis, E. and Hellwig, B. and Hengstler, J. G. and Ickstadt, K. and Rahnenführer, J. Optimal strategies for sequential validation of significant features from high-dimensional genomic data. In Journal of Toxicology and Environmental Health, Part A, Vol. 75, No. 8-10, pages 447-460, 2012.


Schwender/etal/2012a Schwender, Holger and Selinski, Silvia and Blaszkewicz, Meinolf and Marchan, Rosemarie and Ickstadt, Katja and Golka, Klaus and Hengstler, Jan G. Distinct SNP combinations confer susceptibility to urinary bladder cancer in smokers and non-smokers. In PLOS ONE, Vol. 7, No. 12, 2012.


Ickstadt/etal/2011b Ickstadt, Katja and Bornkamp, Björn and Grzegorczyk, Marco and Wieczorek, Jakob and Sheriff, M. Rahuman and Grecco, Hernán E. and Zamir, Eli. Nonparametric Bayesian Networks (with discussion). In Bernardo, José M. and Bayarri, M. J. and Berger, James O. and Dawid, A. Philip and Heckerman, David and Smith, Adrian F. M. and West, M. (editors), Bayesian Statistics, Vol. 9, pages 283-316, 2011.


Schwender/etal/2011a Schwender, Holger and Ruczinski, Ingo and Ickstadt, Katja. Testing SNPs and sets of SNPs for importance in association studies. In Biostatistics, Vol. 12, No. 1, pages 18-32, 2011.


Sohler/Woodruff/2011a Sohler, Christian and Woodruff, David P. Subspace embeddings for the \(L_1\)-norm with applications. In Proceedings of the 43rd ACM Symposium on Theory of Computing (STOC), pages 755-764, ACM, 2011.


Dissertations:

  • Geppert/2018a - Bayesian and Frequentist Regression Approaches for Very Large Data Sets
  • Munteanu/2018a - On large-scale probabilistic and statistical data analysis
  • Koellmann/2016a - Unimodal spline regression and its use in various applications with single or multiple modes

Final Theses:

MoellerEhmcke/2022a Moeller-Ehmcke, Rieke Deborah. Effiziente Variablenselektion aus großen SNP Daten mittels approximierter Cross-Leverage Scores. TU Dortmund, 2022.


Peters/2021a Christian Peters. Data Reduction for Efficient Probit Regression. TU Dortmund, 2021.


Sandig/2019a Ludger Sandig. Effiziente Verfahren für Bayes'sche Mischungsmodelle. TU Dortmund, 2019.


Freese/2017a Freese, Maximilian. Sketch-basierte Bayes-regression mit MapReduce. TU Dortmund, 2017.


Lategahn/2016a Lategahn, Niels. Vergleich von Methoden zur Auswahl von Beobachtungen bei Regression mit fehlenden Y-Werten. TU Dortmund, 2016.


Mueller/2016a Müller, Steffen. Untersuchung von Regression auf eingebetteten Datensätzen unter Verwendung von verschiedenen Abstandsnormen und Penalisierungstermen. TU Dortmund, 2016.


Wollenberg/2016a Wollenberg, Alexander. Reduktion hochdimensionaler Datensätze für die logische Regression unter Verwendung von Leverage Scores mit besonderer Berücksichtigung von SNP-Daten. TU Dortmund, 2016.


Horn/2015a Simon Horn. Analyse von Metabolom-Daten der Arzneipflanze Duboisia: Hauptkomponentenanalyse, Clusterung und Peakidentifizierung. TU Dortmund, 2015.


Lange/2015a Laura Lange. Analyse von GC/IMS-Atemluftmessungen unter Berücksichtigung verschiedener Atemerfrischer. TU Dortmund, 2015.


Rathjens/2015a Rathjens, Jonathan. Hierarchische Bayes-Regression bei Einbettung großer Datensätze. TU Dortmund, 2015.


Mueller/2013a Müller, Steffen. Untersuchung der praktischen Anwendbarkeit von unimodaler Regression auf diverse naturwissenschaftliche Datensätze. TU Dortmund, 2013.


Okroy/2013a Okroy, Lena. Untersuchung der praktischen Anwendbarkeit von nichtlinearer Regression auf verschiedene Datensätze. TU Dortmund, 2013.


Quedenfeld/2013a Quedenfeld, Jens. Experimentelle Analyse verschiedener linearer \(\ell_2\)-Einbettungen von dünn besetzten Eingabedaten. TU Dortmund, 2013.


Jabs/2012a Jabs, Verena. Vergleich von Methoden zur Dimensionsreduktion unter Berücksichtigung der Rechenzeit und des Speicherbedarfs. TU Dortmund, 2012.


Rueppert/2012a Rüppert, Andreas. LASSO Regression für große Datenmengen. TU Dortmund, 2012.


Zhu/2012a Qingchui Zhu. Datenstromalgorithmen für Regression. TU Dortmund, 2012.


Preliminary Work:

Bornkamp/etal/2010a B. Bornkamp and K. Ickstadt and D. B. Dunson. Stochastically ordered multiple regression. In Biostatistics, Vol. 11, No. 3, pages 419-431, 2010.


Feldman/etal/2010a Dan Feldman and Morteza Monemizadeh and Christian Sohler and David Woodruff. Coresets and Sketches for High Dimensional Subspace Approximation Problems. In Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms, pages 630-649, 2010.


Bornkamp/etal/2009a B. Bornkamp and A. Fritsch and O. Kuss and K. Ickstadt. Penalty specialists among goalkeepers: A nonparametric Bayesian analysis of 44 years of German Bundesliga. In B. Schipp and W. Krämer (editors), Statistical Inference, Econometric Analysis and Matrix Algebra: Festschrift in Honour of Götz Trenkler, pages 63-76, Physica Verlag, 2009.


Bornkamp/Ickstadt/2009b Bornkamp, Björn and Ickstadt, Katja. Bayesian nonparametric estimation of continuous monotone functions with applications to dose-response analysis. In Biometrics, Vol. 65, pages 198-205, 2009.


Frahling/etal/2008a Gereon Frahling and Piotr Indyk and Christian Sohler. Sampling in Dynamic Data Streams and Applications. In International Journal of Computational Geometry and Applications (Special Issue with selected papers from the 21st ACM Symposium on Computational Geometry), Vol. 18, No. 1/2, pages 3-28, 2008.


Schwender/Ickstadt/2008a Schwender, H. and Ickstadt, K. Identification of SNP interactions using logic regression. In Biostatistics, Vol. 9, pages 187-198, 2008.


Feldman/etal/2007a Dan Feldman and Morteza Monemizadeh and Christian Sohler. A PTAS for k-means clustering based on weak coresets. In Proceedings of the 23rd ACM Symposium on Computational Geometry, pages 11-18, 2007.


Fritsch/2007a Fritsch, A. and Ickstadt, K. Comparing logic regression based methods for identifying SNP interactions. In Hochreiter, S. and Wagner, R. (editors), Bioinformatics in Research and Development, Springer, 2007.


Nunkesser/etal/2007a Nunkesser, R. and Bernholt, T. and Schwender, H. and Ickstadt, K. and Wegener, I. Detecting high-order interactions of single nucleotide polymorphisms using genetic programming. In Bioinformatics, Vol. 23, pages 3280-3288, 2007.


Ickstadt/Wolpert/99a K. Ickstadt and R. L. Wolpert. Spatial regression for marked point processes. In J. M. Bernardo and J. O. Berger and A. P. Dawid and A. F. M. Smith (editors), Bayesian Statistics 6, pages 323-341, Oxford, Oxford University Press, 1999.


Wolpert/Ickstadt/98a R. L. Wolpert and K. Ickstadt. Poisson/Gamma random field models for spatial statistics. In Biometrika, Vol. 85, pages 251-267, 1998.