
Keynote Speakers

Francis Bach
INRIA - École Normale Supérieure

Matthias Bethge
University of Tübingen

Niao He
ETH Zurich

Andreas Krause
ETH Zurich

Yingyu Liang
University of Hong Kong, University of Wisconsin-Madison

Bernhard Schölkopf
Max Planck Institute for Intelligent Systems / ELLIS Institute Tübingen

Taiji Suzuki
University of Tokyo / RIKEN AIP

Jared Tanner
University of Oxford

Leena Chennuru Vankadara
University College London

Fanny Yang
ETH Zurich

Talk Details
Francis Bach
INRIA - École Normale Supérieure
Title: Recent advances in uncertainty quantification: anytime guarantees and multivariate predictions
Time and Location: Day 3, 2:30 PM, MPH Lecture Hall, Max-Planck-Ring 6
Abstract
Quantifying uncertainty in statistics and machine learning is crucial, but challenging in high-dimensional prediction problems. Probabilistic calibration and conformal prediction have emerged as key frameworks that are both practical and theoretically well motivated. In this talk, I will present recent advances that allow greater flexibility in their application: anytime guarantees, and extensions to multivariate prediction problems beyond univariate regression and binary classification.
Matthias Bethge
University of Tübingen
Title: Designing Frontier AI Institutions as Institutional Learning Machines
Time and Location: Day 3, 9:00 AM, MPH Lecture Hall, Max-Planck-Ring 6
Abstract
Advances in AI are rapidly reducing transaction costs, enabling continuous, large-scale interaction data and closed-loop optimization in real-world systems. A hallmark of frontier AI is its close symbiosis with business models and institutional data processing that optimize objectives derived from user feedback. In institutions such as firms, markets, education, and governance, objectives are traditionally implicit—encoded in laws, organizational structures, and incentive systems. As these systems become increasingly digital, they can be modeled as institutional learning machines, where inputs, outputs, and feedback signals become observable and programmable with vanishing transaction costs. Building on the transition to the post-dataset era, this perspective introduces a new level of meta-adaptivity at the level of objectives, opening the door to explicit and adaptive institutional design beyond fixed and opaque objectives. In this talk, I illustrate this perspective through the AIS project, an AI-based education platform that treats the school system itself as a learning system. By enabling continuous, data-driven optimization of curricula and learning processes, AIS provides a concrete example of how institutional learning machines can be built in practice. From this perspective, the central challenge shifts from optimizing models under a given objective to designing the learning signals, feedback loops, and data access mechanisms that govern how objectives evolve—particularly in markets, where demand acts as a primary learning signal. The value proposition of institutional machine learning thus spans both the design of frontier AI companies and the reframing of policy debates—such as Europe’s search for its role in the AI race—into a well-defined research agenda at the intersection of machine learning, mechanism design, and digital infrastructure. 
More fundamentally, it recasts frontier AI debates on social acceptance, antitrust, and data silos as a technological design problem: how institutions learn, and who controls that process to maximize public rather than privately concentrated value.
Niao He
ETH Zurich
Title: From Learning Dynamics to Minima Selection—and Back
Time and Location: Day 4, 9:00 AM, MPH Lecture Hall, Max-Planck-Ring 6
Abstract
Deep learning operates in highly overparameterized, nonconvex regimes where the loss landscape often contains many global minimizers forming connected manifolds. In such settings, optimization does not simply find a minimizer. Empirically, even after training error reaches zero, test performance often continues to improve, suggesting that learning dynamics keep evolving within the manifold of solutions. This raises a fundamental question: which minima do learning algorithms select? More ambitiously, can we steer learning dynamics toward desirable minima? In this talk, I will present recent results that shed light on both implicit and active minima selection, highlighting the roles played by optimization dynamics (first- and zeroth-order methods), stochastic noise, and the geometry of the solution manifold.
Andreas Krause
ETH Zurich
Title: Learning at Test Time: Foundations and Agents
Time and Location: Day 4, 10:30 AM, MPH Lecture Hall, Max-Planck-Ring 6
Abstract
Test-time training (TTT) is an emerging paradigm for adapting large pre-trained foundation models to specific instances at inference time. Instead of treating the model as fixed at test time, TTT performs a small number of gradient updates using data related to the test instance. Empirical studies have shown that this simple idea can significantly improve performance across a wide range of applications. In this talk, I will present our recent work aimed at understanding why test-time training works, even when test instances are in-distribution. I will also present methods for actively selecting informative data for TTT, and recent efforts to enable reinforcement learning agents to adapt at test time when solving difficult reasoning tasks.
Yingyu Liang
University of Hong Kong, University of Wisconsin-Madison
Title: A Tangram Theory of Generalization: Rethinking Machine Learning via the Lens of Composition
Time and Location: Day 2, 9:00 AM, MPH Lecture Hall, Max-Planck-Ring 6
Abstract
Modern machine learning models display abilities that exceed the assumptions of classical statistical learning, particularly their capacity to solve test-time tasks far beyond those seen during training—an ability widely viewed as central to progress toward AGI. Such phenomena call for new theoretical frameworks. This talk presents a perspective based on composition: the idea that models generalize by recombining learned skills to address novel, more complex tasks. Empirical evidence and preliminary theoretical results will be provided to support this viewpoint, aiming to motivate further investigation into this tangram theory of generalization.
Bernhard Schölkopf
Max Planck Institute for Intelligent Systems / ELLIS Institute Tübingen
Title: TBA
Time and Location: Day 4, 4:00 PM, MPH Lecture Hall, Max-Planck-Ring 6
Abstract
TBA
Taiji Suzuki
University of Tokyo / RIKEN AIP
Title: Feature learning theories for test-time inference
Time and Location: Day 2, 10:30 AM, MPH Lecture Hall, Max-Planck-Ring 6
Abstract
Feature learning is a fundamental advantage of deep foundation models, including Transformers. Specifically, the quality of the features acquired during the pre-training stage is crucially important for effective test-time inference. While acquiring irrelevant features can severely degrade performance, learning robust representations can enable strong out-of-distribution (OOD) generalization. In this talk, I will explore this phenomenon through the lenses of associative recall and in-context learning. In the first half, I will focus on associative recall, demonstrating that training data diversity allows Transformers to acquire useful features that generalize to OOD test data. Furthermore, I will show that Transformers can handle infinite-length contexts in nonlinear associative recall problems to achieve a minimax optimal rate. In the second half, I will discuss in-context learning from a feature learning perspective. I will demonstrate that nonlinear feature learning can occur even at test time without gradient descent. Finally, I will show how applying test-time training can further improve performance while requiring fewer instructions.
Jared Tanner
University of Oxford
Title: Deep neural network initialisation: Nonlinear activations impact on the Gaussian process
Time and Location: Day 2, 2:30 PM, MPH Lecture Hall, Max-Planck-Ring 6
Abstract
Randomly initialised deep neural networks are known to generate a Gaussian process for their pre-activation intermediate layers. We will review this line of research, with extensions to deep networks having structured random entries such as block-sparse or low-rank weight matrices. We will then discuss how the choice of nonlinear activations impacts the evolution of the Gaussian process. Specifically, we will discuss why sparsifying nonlinear activations such as soft thresholding are unstable, show conditions under which such issues can be overcome, and show how increasing the variance of the hidden-layer Gaussian process can mitigate training instability for sparsifying activations. This work is joint with Ilan Price (DeepMind), and from Oxford: Alireza Naderi, Thiziri Nait Saada, and Emily Dent.
Leena Chennuru Vankadara
University College London
Title: Towards a theory of scaling in deep learning
Time and Location: Day 4, 2:30 PM, MPH Lecture Hall, Max-Planck-Ring 6
Abstract
Scale plays a central role in modern deep learning, where increasing model size, data, and compute often produces regular improvements in performance, but can also bring models into qualitatively different regimes. Yet, these gains are not determined by size alone: whether larger models perform well depends crucially on how models and training procedures are scaled. In this talk, I will discuss scaling theory as a principled framework for studying these questions. I will focus on scaling limits as a natural lens for large-scale learning, showing how they help us derive principled scaling rules while shedding light on the empirical behavior of practical, finite-width networks.
Fanny Yang
ETH Zurich
Title: On the sample complexity of semi-supervised multi-objective learning
Time and Location: Day 3, 10:30 AM, MPH Lecture Hall, Max-Planck-Ring 6
Abstract
In multi-objective learning (MOL), several possibly competing prediction tasks must be solved jointly by a single model. Achieving good trade-offs may require a model class G with larger capacity than what is necessary for solving the individual tasks. This, in turn, increases the statistical cost, as reflected in known MOL bounds that depend on the complexity of G. We show that this cost is unavoidable for some losses, even in an idealized semi-supervised setting, where the learner has access to the Bayes-optimal solutions for the individual tasks as well as the marginal distributions over the covariates. On the other hand, for objectives defined with Bregman losses, we prove that the complexity of G may come into play only in terms of unlabeled data. Concretely, we establish sample complexity upper bounds, showing precisely when and how unlabeled data can significantly alleviate the need for labeled data. These rates are achieved by a simple, semi-supervised algorithm via pseudo-labeling.