TES OptML/CoMinds Workshop Optimization Methods for and within Machine Learning

Institut für Mathematik | TES OptML/ CoMinds Workshop

TES OptML/CoMinds Workshop Optimization Methods for and within Machine Learning

The third workshop of the Thematic Einstein Semester Optimization for Machine Learning will be held from July 12 to 14 at the Humboldt-Universität zu Berlin. It is organized together with the GAMM Activity group Computational and Mathematical Methods in Data Science (CoMinds). It focuses on the development and the application of optimization methods in the context of machine learning as well as on the analysis of optimization problems arising in machine learning.

Invited Speakers

Serkan Gugercin (Virginia Tech)

Philipp Hennig (University of Tübingen)

Fanny Yang (ETH Zürich)

PROGRAMME

Our workshop will roughly constitute seventeen talks. This includes both regular (with 35 minutes for the presentation and 10 minutes for questions) and invtited presentations (lasting 50 minutes with an additional 10 minutes for questions). Below is the detailed schedule.

July, 12th

13:00

Opening

13:15 - 14:15

Invited presentation - Serkan Gugercin, Virgina Tech: Learning dynamical systems from time- and frequency-response data

14:15 - 15:00

Miltiadis Poursanidis, ITWM Kaiserslautern:

🖉

Shape-constrained regression is an active subfield of informed machine learning that aims to incorporate shape constraints into regression problems. Shape constraints, such as boundedness, monotonicity, or convexity, restrict the shape of the regression function over the entire domain. Shape-constrained regression problems can be formulated as semi-infinite optimization problems that are typically convex. In our work, we propose an adaptive discretization algorithm for convex semi-infinite optimization with the following properties: first, the algorithm terminates at a feasible point whose value deviates from the optimal value of the problem only by a preset error bound. For general semi-infinite optimization problems, the best possible precision of the approximate solution can be arbitrarily bad. Second, the algorithm terminates after a finite number of iterations, and we provide an upper bound on the number of iterations. Third, the discretizations performed by the algorithm are considerably smaller thanks to an exchange-type procedure. Fourth, all occurring finite optimization subproblems in our algorithm need to be solved only approximately. Moreover, strict feasibility is the only assumption made on the feasible set and it can easily be verified for a given shape-constrained polynomial regression problem. In the end, we apply the algorithm to a real-world example from quality prediction in manufacturing.

15:00 - 15:30

Coffee break (Erwin-Schrödinger Zentrum, Foyer)

15:30 - 17:45

Jia-Jie Zhu, WIAS:

🖉

Motivated by the force-balance formulation of PDE gradient flows, I will demonstrate using the optimality condition in the dual space to solve distributionally robust optimization (DRO) problems, which seek learning and optimization solutions under distribution shifts. One of the recent breakthroughs of DRO is the adoption of the Wasserstein geometry. While the literature has exploded, it is restricted to the pure transport regime, i.e., no creation or destruction of mass. Its practical effect is also similar to existing regularization techniques already used by machine learners. In addition, many state-of-the-art Wasserstein DRO algorithms place severe limitations on the learning models and scalability. With those in mind, I will introduce applications of the duality principle beyond the Wasserstein DRO using unbalanced optimal transport and kernel geometry.

Rohit Pochampalli, RPTU Kaiserslautern:

🖉

It is a well known empirical phenomenon that although stochastic gradient descent (SGD) is slower than some adaptive gradient based methods such as Adam or AdaGrad, it produces an optimum that, typically, has a better generalization ability. The loss functions of neural networks are non-convex but have interesting local properties, such as one point convexity. We show that, given an appropriate selection of step sizes, SGD can converge to points of small gradient norm, whereas the adaptive methods are more unpredictable based on the problem. This ties into the idea that SGD converges to a more robust minimum, in the sense that it converges to an optimum where points in its neighborhood have small gradient norms.

Hanno Gottschalk, TU Berlin:

18:00 -

Get together (Erwin-Schrödinger Zentrum, Foyer)

July, 13th

10:00 - 11:30

Shubhaditya Burela, TU Berlin:

Parametric model order reduction of a 2D wildland fire model with the shifted POD-based deep learning method 🖉

Parametric model order reduction techniques often struggle to accurately represent transport-dominated phenomena due to a slowly decaying Kolmogorov n-width. To address this challenge, we propose a non-intrusive,data-driven methodology that combines the shifted proper orthogonal decomposition (POD) with deep learning. Specifically, the shifted POD technique is utilized to derive a high-fidelity, low-dimensional model of the flow, which is subsequently utilized as input to a deep learning framework to forecast the flow dynamics under various temporal and parameter conditions. The efficacy of the proposed approach is demonstrated through the analysis of one- and two-dimensional wildland fire models with varying reaction rates, and its performance is evaluated using multiple error measures. The results indicate that the proposed approach yields highly accurate results within the percent range, while also enabling rapid prediction of system states within seconds.

Julia Pelzer, Uni Stutgart: Surrogate Models for Groundwater Flow Simulations

11:30 - 12:00

Coffee break (Erwin-Schrödinger Zentrum, Foyer)

12:00 - 12:45

Timo Kreimeier, HU Berlin: Solving Data-Based Two-Step Nonsmooth Retail Portfolio Maximization Problems

12:45 - 14:00

Lunch break (attendees on their own)

14:00 - 15:00

Invited presentation - Philipp Hennig, University of Tübingen:

Learning from Computation or Computation for Learning? 🖉

Optimization methods and other numerical algorithms are widely considered as internal routines of learning machines. But in this task, optimization methods estimate an unknown function (the optimal argument of the optimization objective) that is defined implicitly through data. I will argue that this creates a unnecessarily distinction between ``compute’’ — numbers produced from a processing chip and ``data’’ — numbers produced from a storage chip. And that this distinction causes tangible problems of complexity and inconsistency in machine learning systems. Both computational and empirical data should thus be phrased in the joint language of probabilistic inference. This gives rise to a formalism of probabilistic computation that allows machine learning systems to actively manage both their computational cost and their consumption of data.

15:00 - 15:45

Martin Spindler, Uni Hamburg:

🖉

Boosting is one of the most significant developments in machine learning. This paper studies the rate of convergence of L2Boosting, which is tailored for regression, in a high-dimensional setting. Moreover, we introduce so-called "post-Boosting". This is a post-selection estimator which applies ordinary least squares to the variables selected in the first stage by L2Boosting. Another variant is "Orthogonal Boosting" where after each step an orthogonal projection is conducted. We show that both post-L2Boosting and the orthogonal boosting achieve the same rate of convergence as LASSO in a sparse, high-dimensional setting. We show that the rate of convergence of the classical L2Boosting depends on the design matrix described by a sparse eigenvalue constant. To show the latter results, we derive new approximation results for the pure greedy algorithm, based on analyzing the revisiting behavior of L2Boosting. We also introduce feasible rules for early stopping, which can be easily implemented and used in applied work. Our results also allow a direct comparison between LASSO and boosting which has been missing from the literature. Finally, we present simulation studies and applications to illustrate the relevance of our theoretical results and to provide insights into the practical aspects of boosting. In these simulation studies, post-L2Boosting clearly outperforms LASSO.

15:45 - 16:15

Coffee break (Erwin-Schrödinger Zentrum, Foyer)

16:15 - 17:45

Franz Bethke, HU Berlin: On a Semismooth Conjugate Gradients Method

Martin Stoll, Uni Chemnitz:

Efficient Training of Gaussian Process Regression 🖉

Gaussian processes are a versatile tool in machine learning and statistics. We will in this talk introduce the general concept of Gaussian processes and in particular focus on the optimization problems that are associated with obtaining the hyperparameters that best fit the training data. For the optimization of the likelihood function one not only needs to solve a linear system of equations but also requires the evaluation of a log-determinant. While this is easy for small data sets it requires different approaches if the kernel matrices get large. This leads to a trace estimation problem that can be solved via a Krylov subspace method. No optimization without gradients, right? So computing them also becomes challenging and some mathematical care is needed to compute the correct gradient. We are in luck as this again can be done with some Krylov subspace magic and I can hopefully illustrate on some examples that this can be used beyond the traditional matrix case.

19:00 -

Workshop dinner (Don Giovanni, Bismarckstraße 28, 10625 Berlin)

July 14th

9:00 - 10:00

Invited presentation - Fanny Yang, ETH Zürich:

Strong inductive biases provably prevent harmless interpolation 🖉

Interpolating models have recently gained popularity in the statistical learning community due to common practices in modern machine learning: complex models achieve good generalization performance despite interpolating high-dimensional training data. In this talk, we prove generalization bounds for high-dimensional linear models that interpolate noisy data generated by a sparse ground truth. In particular, we first show that minimum-l1-norm interpolators achieve high-dimensional asymptotic consistency at a logarithmic rate. Further, as opposed to the regularized or noiseless case, for min-lp-norm interpolators with 1<p<2 we surprisingly obtain polynomial rates. Our results suggest a new trade-off for interpolating models: a stronger inductive bias encourages a simpler structure better aligned with the ground truth at the cost of an increased variance. We finally discuss our latest results where we show that this phenomenon also holds for nonlinear models.

10:00 - 10:30

Coffee break (Erwin-Schrödinger Zentrum, Foyer)

10:30 - 12:45

Markus Reiss, HU Berlin:

Early stopping in iterative learning methods 🖉

Many statistical and learning procedures rely on iterative solvers, e.g. gradient descent steps for minimizing an empirical risk criterion. It turns out that stopping the iterates before convergence ("early stopping") usually regularizes the problem and is thus recommendable from a statistical perspective while it also reduces computational costs considerably. The underlying statistical problem is to design early stopping rules that are completely data-driven and lead to optimal statistical estimation bounds. We clarify the basic ideas for spectral-cutoff projection methods with exact oracle inequalities and matching lower bounds under sequential information constraints. We then look at the nonlinear conjugate gradient or partial least squares algorithm for linear models and discuss how the basic ideas can be adapted to a modern greedy learning procedure.

Frederik Köhne, Uni Bayreuth

Adaptive Step Size Control for Stochastic Optimization 🖉

The training process of machine learning models usually requires several hyperparameters to be set in advance. The lack of clear guidelines on how to choose such hyperparameters optimally leads to computationally expensive testing of several hyperparameter - settings. A prominent example of such a hyperparameter is the step size (or learning rate) for a stochastic optimization algorithm like Stochastic Gradient Descent (SGD).
In this talk, we show that in a special setting it is possible to derive the step size adaptively from quantities, that can be computed relatively cheep during the run of the algorithm. This leads to a version of SGD with adaptive step size control. It is shown, that step size selection and convergence properties can benefit from preconditioning, when the algorithm is applied to stochastic quadratic problems.
We demonstrate, that this algorithm performs well even in problem domains not covered by theory, e.g. classic image classification tasks.

Volker Mehrmann, TU Berlin:

12:45 - 13:00

Closing

A list of participants can be found here.

VENUE

Attention: Important change!

Due to the closing of the Mathematic building of the TU Berlin, the workshop will take place at the Erwin-Schrödinger-Zentrum of the HU Berlin, Campus Adlershof, Rudower Chaussee 26, 12489 Berlin, in the room 0'119.

Here you can find information on how to get to this building.

Organizers

Tobias Breiten, Axel Klawonn, Martin Stoll, Andrea Walther

Humboldt-Universität zu Berlin - Mathematisch-Naturwissenschaftliche Fakultät - Institut für Mathematik

PROGRAMME

July, 12th

13:00

13:15 - 14:15

14:15 - 15:00

Miltiadis Poursanidis, ITWM Kaiserslautern:

15:00 - 15:30

15:30 - 17:45

18:00 -

July, 13th

10:00 - 11:30

11:30 - 12:00

12:00 - 12:45

12:45 - 14:00

14:00 - 15:00

15:00 - 15:45

15:45 - 16:15

16:15 - 17:45

19:00 -

July 14th

9:00 - 10:00

10:00 - 10:30

10:30 - 12:45

12:45 - 13:00

VENUE