
Keynote Speakers

Francis Bach
INRIA - École Normale Supérieure

Matthias Bethge
University of Tübingen

Niao He
ETH Zurich

Andreas Krause
ETH Zurich

Yingyu Liang
University of Hong Kong, University of Wisconsin-Madison

Bernhard Schölkopf
Max Planck Institute for Intelligent Systems / ELLIS Institute Tübingen

Taiji Suzuki
University of Tokyo / RIKEN AIP

Jared Tanner
University of Oxford

Leena Chennuru Vankadara
University College London

Fanny Yang
ETH Zurich

Talk Details
Francis Bach
INRIA - École Normale Supérieure
Title: Recent advances in uncertainty quantification: anytime guarantees and multivariate predictions
Time and Location: Day 3, 2:30 PM, MPH Lecture Hall, Max-Planck-Ring 6
Abstract
Quantifying uncertainty in statistics and machine learning is crucial, but challenging in high-dimensional prediction problems. Probabilistic calibration and conformal prediction have emerged as key frameworks that are both practical and theoretically well motivated. In this talk, I will present recent advances that allow greater flexibility in their application: anytime guarantees, and extensions to multivariate prediction problems beyond univariate regression and binary classification.
Matthias Bethge
University of Tübingen
Title: Designing Frontier AI Institutions as Institutional Learning Machines
Time and Location: Day 3, 9:00 AM, MPH Lecture Hall, Max-Planck-Ring 6
Abstract
Advances in AI are rapidly reducing transaction costs, enabling continuous, large-scale interaction data and closed-loop optimization in real-world systems. A hallmark of frontier AI is its close symbiosis with business models and institutional data processing that optimize objectives derived from user feedback. In institutions such as firms, markets, education, and governance, objectives are traditionally implicit—encoded in laws, organizational structures, and incentive systems. As these systems become increasingly digital, they can be modeled as institutional learning machines, where inputs, outputs, and feedback signals become observable and programmable with vanishing transaction costs. Building on the transition to the post-dataset era, this perspective introduces a new level of meta-adaptivity at the level of objectives, opening the door to explicit and adaptive institutional design beyond fixed and opaque objectives. In this talk, I illustrate this perspective through the AIS project, an AI-based education platform that treats the school system itself as a learning system. By enabling continuous, data-driven optimization of curricula and learning processes, AIS provides a concrete example of how institutional learning machines can be built in practice. From this perspective, the central challenge shifts from optimizing models under a given objective to designing the learning signals, feedback loops, and data access mechanisms that govern how objectives evolve—particularly in markets, where demand acts as a primary learning signal. The value proposition of institutional machine learning thus spans both the design of frontier AI companies and the reframing of policy debates—such as Europe’s search for its role in the AI race—into a well-defined research agenda at the intersection of machine learning, mechanism design, and digital infrastructure. 
More fundamentally, it recasts frontier AI debates on social acceptance, antitrust, and data silos as a technological design problem: how institutions learn, and who controls that process to maximize public rather than privately concentrated value.
Niao He
ETH Zurich
Title: From Learning Dynamics to Minima Selection—and Back
Time and Location: Day 4, 9:00 AM, MPH Lecture Hall, Max-Planck-Ring 6
Abstract
Deep learning operates in highly overparameterized, nonconvex regimes where the loss landscape often contains many global minimizers forming connected manifolds. In such settings, optimization does not simply find a minimizer. Empirically, even after training error reaches zero, test performance often continues to improve, suggesting that learning dynamics keep evolving within the manifold of solutions. This raises a fundamental question: which minima do learning algorithms select? More ambitiously, can we steer learning dynamics toward desirable minima? In this talk, I will present recent results that shed light on both implicit and active minima selection, highlighting the roles played by optimization dynamics (first- and zeroth-order methods), stochastic noise, and the geometry of the solution manifold.
Andreas Krause
ETH Zurich
Title: Learning at Test Time: Foundations and Agents
Time and Location: Day 4, 10:30 AM, MPH Lecture Hall, Max-Planck-Ring 6
Abstract
Test-time training (TTT) is an emerging paradigm for adapting large pre-trained foundation models to specific instances at inference time. Instead of treating the model as fixed at test time, TTT performs a small number of gradient updates using data related to the test instance. Empirical studies have shown that this simple idea can significantly improve performance across a wide range of applications. In this talk, I will present our recent work aimed at understanding why test-time training works, even when test instances are in-distribution. I will also present methods for actively selecting informative data for TTT, and recent efforts to enable reinforcement learning agents to adapt at test time when solving difficult reasoning tasks.
Yingyu Liang
University of Hong Kong, University of Wisconsin-Madison
Title: A Tangram Theory of Generalization: Rethinking Machine Learning via the Lens of Composition
Time and Location: Day 2, 9:00 AM, MPH Lecture Hall, Max-Planck-Ring 6
Abstract
Modern machine learning models display abilities that exceed the assumptions of classical statistical learning, particularly their capacity to solve test-time tasks far beyond those seen during training—an ability widely viewed as central to progress toward AGI. Such phenomena call for new theoretical frameworks. This talk presents a perspective based on composition: the idea that models generalize by recombining learned skills to address novel, more complex tasks. Empirical evidence and preliminary theoretical results will be provided to support this viewpoint, aiming to motivate further investigation into this tangram theory of generalization.
Bernhard Schölkopf
Max Planck Institute for Intelligent Systems / ELLIS Institute Tübingen
Title: TBA
Time and Location: Day 4, 4:00 PM, MPH Lecture Hall, Max-Planck-Ring 6
Abstract
TBA
Taiji Suzuki
University of Tokyo / RIKEN AIP
Title: Feature learning theories for test-time inference
Time and Location: Day 2, 10:30 AM, MPH Lecture Hall, Max-Planck-Ring 6
Abstract
Feature learning is a fundamental advantage of deep foundation models, including Transformers. Specifically, the quality of the features acquired during the pre-training stage is crucially important for effective test-time inference. While acquiring irrelevant features can severely degrade performance, learning robust representations can enable strong out-of-distribution (OOD) generalization. In this talk, I will explore this phenomenon through the lenses of associative recall and in-context learning. In the first half, I will focus on associative recall, demonstrating that training data diversity allows Transformers to acquire useful features that generalize to OOD test data. Furthermore, I will show that Transformers can handle infinite-length contexts in nonlinear associative recall problems to achieve a minimax optimal rate. In the second half, I will discuss in-context learning from a feature learning perspective. I will demonstrate that nonlinear feature learning can occur even at test time without gradient descent. Finally, I will show how applying test-time training can further improve performance while requiring fewer instructions.
Jared Tanner
University of Oxford
Title: Deep neural network initialisation: Nonlinear activations impact on the Gaussian process
Time and Location: Day 2, 2:30 PM, MPH Lecture Hall, Max-Planck-Ring 6
Abstract
Randomly initialised deep neural networks are known to generate a Gaussian process for their pre-activation intermediate layers. We will review this line of research, with extensions to deep networks having structured random entries such as block-sparse or low-rank weight matrices. We will then discuss how the choice of nonlinear activations impacts the evolution of the Gaussian process. Specifically, we will discuss why sparsifying nonlinear activations such as soft thresholding are unstable, show conditions under which such issues can be overcome, and show how increasing the variance of the hidden-layer Gaussian process can mitigate training instability for sparsifying activations. This work is joint with Ilan Price (DeepMind), and from Oxford: Alireza Naderi, Thiziri Nait Saada, and Emily Dent.
Leena Chennuru Vankadara
University College London
Title: Towards a theory of scaling in deep learning
Time and Location: Day 4, 2:30 PM, MPH Lecture Hall, Max-Planck-Ring 6
Abstract
Scale plays a central role in modern deep learning, where increasing model size, data, and compute often produces regular improvements in performance, but can also bring models into qualitatively different regimes. Yet, these gains are not determined by size alone: whether larger models perform well depends crucially on how models and training procedures are scaled. In this talk, I will discuss scaling theory as a principled framework for studying these questions. I will focus on scaling limits as a natural lens for large-scale learning, showing how they help us derive principled scaling rules while shedding light on the empirical behavior of practical, finite-width networks.
Fanny Yang
ETH Zurich
Title: On the sample complexity of semi-supervised multi-objective learning
Time and Location: Day 3, 10:30 AM, MPH Lecture Hall, Max-Planck-Ring 6
Abstract
In multi-objective learning (MOL), several possibly competing prediction tasks must be solved jointly by a single model. Achieving good trade-offs may require a model class G with larger capacity than what is necessary for solving the individual tasks. This, in turn, increases the statistical cost, as reflected in known MOL bounds that depend on the complexity of G. We show that this cost is unavoidable for some losses, even in an idealized semi-supervised setting, where the learner has access to the Bayes-optimal solutions for the individual tasks as well as the marginal distributions over the covariates. On the other hand, for objectives defined with Bregman losses, we prove that the complexity of G may come into play only in terms of unlabeled data. Concretely, we establish sample complexity upper bounds, showing precisely when and how unlabeled data can significantly alleviate the need for labeled data. These rates are achieved by a simple, semi-supervised algorithm via pseudo-labeling.