Each awarded CPAL Rising Star will give a talk about their research in one of three sessions during the conference.
Presentations are ten minutes in duration, with two minutes for Q&A.
The ordering of session numbers matches their chronological ordering, and presentations will be delivered in the order they are listed. See the full program for the precise time and location of each CPAL Rising Stars session.
Title: Stochastic Collapse: How Gradient Noise Attracts SGD Dynamics Towards Simpler Subnetworks
Abstract: In this work, we reveal a strong implicit bias of stochastic gradient descent (SGD) that drives overly expressive networks to much simpler subnetworks, thereby dramatically reducing the number of independent parameters, and improving generalization. To reveal this bias, we identify invariant sets, or subsets of parameter space that remain unmodified by SGD. We focus on two classes of invariant sets that correspond to simpler (sparse or low-rank) subnetworks and commonly appear in modern architectures. Our analysis uncovers that SGD exhibits a property of stochastic attractivity towards these simpler invariant sets. We establish a sufficient condition for stochastic attractivity based on a competition between the loss landscape’s curvature around the invariant set and the noise introduced by stochastic gradients. Remarkably, we find that an increased level of noise strengthens attractivity, leading to the emergence of attractive invariant sets associated with saddle-points or local maxima of the train loss. We observe empirically the existence of attractive invariant sets in trained deep neural networks, implying that SGD dynamics often collapses to simple subnetworks with either vanishing or redundant neurons. We further demonstrate how this simplifying process of stochastic collapse benefits generalization in a linear teacher-student framework. Finally, through this analysis, we mechanistically explain why early training with large learning rates for extended periods benefits subsequent generalization.
University of California, Berkeley
Title: White-Box Transformers via Sparse Rate Reduction
Abstract: In this talk, I will present the white-box transformer — CRATE (i.e., Coding RAte reduction TransformEr). We contend that the objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a mixture of low-dimensional Gaussian distributions supported on incoherent subspaces. The quality of the final representation can be measured by a unified objective function called sparse rate reduction. From this perspective, popular deep networks such as transformers can be naturally viewed as realizing iterative schemes to optimize this objective incrementally. Particularly, we show that the standard transformer block can be derived from alternating optimization on complementary parts of this objective: the multi-head self-attention operator can be viewed as a gradient descent step to compress the token sets by minimizing lossy coding rate. This leads to a family of white-box transformer architectures which are mathematically interpretable. Our experiments show that these networks indeed learn to optimize the designed objective: they compress and sparsify representations of large-scale real-world vision datasets such as ImageNet, and achieve performance very close to thoroughly engineered transformers (ViTs). I will also present some recent theoretical and empirical results of CRATE on emergence behavior, language modeling, and auto-encoding.
University of Michigan
Title: Understanding Hierarchical Representations in Deep Networks via Intermediate Features
Abstract: Over the past decade, deep learning has proven to be a highly effective method for learning meaningful features from raw data. This work attempts to unveil the mystery of hierarchical feature learning in deep networks. Specifically, in the context of multi-class classification problems, we explore how deep networks transform input data by investigating the output (i.e., features) of each layer after training. Towards this goal, we first define metrics for within-class compression and between-class discrimination of intermediate features, respectively. Through an analysis of these two metrics, we show that the evolution of features follows a simple and quantitative law from shallow to deep layers: Each layer of linear networks progressively compresses within-class features at a linear rate and discriminates between-class features at a sublinear rate. To the best of our knowledge, this is the first quantitative characterization of feature evolution in hierarchical representations of deep networks. Moreover, our extensive experiments validate our theoretical findings numerically.
Title: Emergent properties of heuristics in machine learning
Abstract: Successful methods in modern machine learning practice are built on solid intuition and theoretical insight by their designers, but are often ultimately heuristic and exhibit unintended emergent behaviors. Sometimes these emergent behaviors are detrimental, but surprisingly, many provide unexpected desirable benefits. By theoretically characterizing these emergent behaviors, we can develop a more robust methods development process, where more and more of these desirable behaviors can be included by design and leveraged in powerful ways. I will discuss several examples of heuristics and emergent behavior: subsampling and sketching in linear regression and their equivalence to ridge regularization; empirical risk minimization and the universality of relative performances under distribution shifts; and adaptivity in dropout and feature learning models which are equivalent to parsimony-promoting sparse or low-rank regularization.
Johns Hopkins University
Title: The Role of Parsimonious Structures in Data for Trustworthy Machine Learning
Abstract: This talk overviews recent theoretical results in the geometric foundations of adversarially robust machine learning. Modern ML classifiers can fail spectacularly when subject to specially crafted input-perturbations, called adversarial examples. On the other hand, we humans are quite robust for several tasks involving vision. Motivated by this disconnect, in the first part of this talk we will take a deeper dive into the question of when exactly we can avoid adversarial examples. We will see that a key geometric property of the data-distribution — concentration on small-volume subsets of the input space — characterizes whether any robust classifier exists. In particular, this suggests that natural image distributions are concentrated. In the second part of this talk, we will empirically instantiate these results for a few concentrated data-distributions, and discover that utilizing such structure in data leads to classifiers that enjoy better provable robustness guarantees in several regimes. This talk is based on work at NeurIPS ’23, ’20 and TMLR ’23.
Johns Hopkins University
Title: Sparsity-aware generalization theory for deep neural networks
Abstract: Deep artificial neural networks achieve surprising generalization abilities that remain poorly understood. In this paper, we present a new approach to analyzing generalization for deep feed-forward ReLU networks that takes advantage of the degree of sparsity that is achieved in the hidden layer activations. By developing a framework that accounts for this reduced effective model size for each input sample, we are able to show fundamental trade-offs between sparsity and generalization. Importantly, our results make no strong assumptions about the degree of sparsity achieved by the model, and it improves over recent norm-based approaches. We illustrate our results numerically, demonstrating non-vacuous bounds when coupled with data-dependent priors in specific settings, even in over-parametrized models.
École Polytechnique Fédérale de Lausanne
Title: On the Sparsity-Promoting Effect of Weight Decay in Deep Learning
Abstract: Deep learning has been wildly successful in practice and most state-of-the-art artificial intelligence systems are based on neural networks. Lacking, however, is a rigorous mathematical theory that adequately explains the amazing performance of deep neural networks. In this talk, I present a new mathematical framework that provides the beginning of a deeper understanding of deep learning. This framework precisely characterizes the functional properties of trained neural networks through the lens of sparsity. The key mathematical tools which support this framework include transform-domain sparse regularization, the Radon transform of computed tomography, and approximation theory. This framework explains the effect of weight decay regularization in neural network training, the importance of skip connections and low-rank weight matrices in network architectures, the role of sparsity in neural networks, and explains why neural networks can perform well in high-dimensional problems.
University of Texas at Austin / Eindhoven University of Technology / University of Oxford
Title: Sparsity in Neural Networks: Science and Practice
Abstract: Sparsity has demonstrated its remarkable performance in the realm of model compression through the selectively eliminating a large portion of model parameters. Nevertheless, conventional methods to discover strong sparse neural networks often necessitate the training of an over-parameterized dense model, followed by iterative cycles of pruning and re-training. As the size of modern neural networks exponentially increases, the costs of dense pre-training and updates have become increasingly prohibitive. In this talk, I will introduce an approach that enables the training of sparse neural networks from scratch, without the need for any pre-training steps or dense updates. By achieving the property of over-parameterization in time, our approach demonstrates the capacity to achieve performance levels equivalent to fully dense networks while utilizing only a very small fraction of weights. Beyond the advantages in model compression, I will also elucidate a broader spectrum of benefits of sparsity in neural networks including scalability, robustness, and fairness, and great potentials build large-scale responsible AI.
University of California, Berkeley
Title: Theoretical Foundations of Adversarially Robust Learning
Abstract: Despite extraordinary progress, current machine learning systems have been shown to be brittle against adversarial examples: seemingly innocuous but carefully crafted perturbations of test examples that cause machine learning predictors to misclassify. Can we learn predictors robust to adversarial examples? and how? There has been much empirical interest in this major challenge in machine learning, and in this talk, we will present a theoretical perspective. We will illustrate the need to go beyond traditional approaches and principles, such as empirical (robust) risk minimization, and present new algorithmic ideas with stronger robust learning guarantees.
Johns Hopkins University
Title: Approximately Equivariant Graph Networks
Abstract: Graph neural networks (GNNs) are commonly described as being permutation equivariant with respect to node relabeling in the graph. This symmetry of GNNs is often compared to the translation equivariance of Euclidean convolution neural networks (CNNs). However, these two symmetries are fundamentally different: The translation equivariance of CNNs corresponds to active symmetries, whereas the permutation equivariance of GNNs corresponds to passive symmetries. In this talk, we focus on the active symmetries of GNNs, by considering a learning setting where signals are supported on a fixed graph. In this case, the natural symmetries of GNNs are the automorphisms of the graph. Since real-world graphs tend to be asymmetric, we relax the notion of symmetries by formalizing approximate symmetries via graph coarsening. We propose approximately equivariant graph networks to implement these symmetries and investigate the symmetry model selection problem. We theoretically and empirically show a bias-variance tradeoff between the loss in expressivity and the gain in the regularity of the learned estimator, depending on the chosen symmetry group.
University of Wisconsin / University of Washington
Title: Optimization for statistical learning with low dimensional structure: regularity and conditioning
Abstract: Many statistical machine learning problems, where one aims to recover an underlying low-dimensional signal, are based on optimization. Existing work often overlooked the computational complexity in solving the optimization problem, or required case-specific algorithm and analysis – especially for nonconvex problems. This talk addresses the above two issues from a unified perspective of conditioning. In particular, we show that once the sample size exceeds the intrinsic dimension, (1) a broad class of convex and nonsmooth nonconvex problems are well-conditioned, (2) well conditioning in turn ensures the efficiency of out-of-box optimization methods and inspires new algorithms. Lastly, we show that a conditioning notion called flatness leads to accurate recovery in overparametrized models.
Iowa State University
Title: The Future Geometric Analysis of Optimization Problems in Signal Processing and Machine Learning
Abstract: High-dimensional data analysis and estimation appear in many signal processing and machine learning applications. The underlying low-dimensional structure in these high-dimensional data inspires us to develop optimality guarantees as well as optimization-based techniques for the fundamental problems in signal processing and machine learning. In recent years, non-convex optimization widely appears in engineering and is solved by many heuristic local algorithms, but lacks global guarantees. The recent geometric/landscape analysis provides a way to determine whether an iterative algorithm can reach global optimality. The landscape of empirical risk has been widely studied in a series of machine learning problems, including low-rank matrix factorization, matrix sensing, matrix completion, and phase retrieval. A favorable geometry guarantees that many algorithms can avoid saddle points and converge to local minima. In this presentation, I will discuss potential directions for the future geometric analysis of optimization problems in signal processing and machine learning.
California Institute of Technology
Title: Deep Interpretable Generative Learning for Science and Engineering
Abstract: Discriminative and generative AI are two deep learning paradigms that revolutionized prediction and generation of high-quality images from text prompts. Nonetheless, discriminative learning is unable to generate data, and deep generative models struggle with decoding capabilities. Moreover, both approaches are data-hungry and have low interpretability. These drawbacks have posed significant barriers to the adoption of deep learning in applications where a) acquiring supervised data is expensive or infeasible, and b) goals extend beyond data fitting to attain scientific insights. Furthermore, deep learning applications are fairly unexplored in fields with rich mathematical and optimization frameworks such as inverse problems, or those in which interpretability matters. This talk discusses the theory and applications of deep learning in data-limited or unsupervised inverse problems. These include applications in radar sensing, Poisson image denoising, and computational neuroscience.
New York University
Title: Decoding the Information Bottleneck in Self-Supervised Learning: Pathway to Optimal Representation
Abstract: Deep Neural Networks (DNNs) have excelled in many fields, largely due to their proficiency in supervised learning tasks. However, the dependence on vast labeled data becomes a constraint when such data is scarce. Self-Supervised Learning (SSL), a promising approach, harnesses unlabeled data to derive meaningful representations. Yet, how SSL filters irrelevant information without explicit labels remains unclear. In this talk, we aim to unravel the enigma of SSL using the lens of Information Theory, with a spotlight on the Information Bottleneck principle. This principle, while providing a sound understanding of the balance between compressing and preserving relevant features in supervised learning, presents a puzzle when applied to SSL due to the absence of labels during training. We will delve into the concept of ‘optimal representation’ in SSL, its relationship with data augmentations, optimization methods, and downstream tasks, and how SSL training learns and achieves optimal representations. Our discussion unveils our pioneering discoveries, demonstrating how SSL training naturally leads to the creation of optimal, compact representations that correlate with semantic labels. Remarkably, SSL seems to orchestrate an alignment of learned representations with semantic classes across multiple hierarchical levels, an alignment that intensifies during training and grows more defined deeper into the network. Considering these insights and their implications for class set performance, we conclude our talk by applying our analysis to devise more robust SSL-based information algorithms. These enhancements in transfer learning could lead to more efficient learning systems, particularly in data-scarce environments.
Carnegie Mellon University
Title: Speeding up Large-Scale Machine Learning Model Development Using Low-Rank Models and Gradients
Abstract: Large-scale machine learning (ML) models, such as GPT-4 and Llama2, are at the forefront of advances in the field of AI. Nonetheless, developing these large-scale ML models demands substantial computational resources and a deep understanding of distributed ML and systems. In this presentation, I will introduce three frameworks, namely ATOMO, Pufferfish, and Cuttlefish, which use low-rank approximations on model gradients and model weights to significantly expedite ML model training. ATOMO is a general compression framework that has experimentally established that using low-rank gradients, as opposed to sparse ones, can lead to substantially faster distributed training. Pufferfish further bypasses the cost of compression by directly training low-rank models. However, directly training low-rank models usually leads to a loss in accuracy. Pufferfish mitigates this issue by training a full-rank model and then converting to a low-rank model early in the training process. Nonetheless, Pufferfish necessitates extra hyperparameter tuning, such as determining the optimal transition time from full-rank to low-rank. Cuttlefish addresses this issue by automatically estimating and adjusting these hyperparameters during training. I will present extensive experimental results on the distributed training of large-scale ML models, including LLMs, to demonstrate the efficacy of these frameworks.
New York University
Title: Simulation-Calibrated Scientific Machine Learning
Abstract: Machine learning (ML) has achieved great success in a variety of applications suggesting a new way to build flexible, universal, and efficient approximators for complex high-dimensional data. These successes have inspired many researchers to apply ML to other scientific applications such as industrial engineering, scientific computing, and operational research, where similar challenges often occur. However, the luminous success of ML is overshadowed by persistent concerns that the mathematical theory of large-scale machine learning, especially deep learning, is still lacking and the trained ML predictor is always biased. In this talk, I’ll introduce a novel framework of (S)imulation-(Ca)librated (S)cientific (M)achine (L)earning (SCaSML), which can leverage the structure of physical models to achieve the following goals: 1) make unbiased predictions even based on biased machine learning predictors; 2) beat the curse of dimensionality with an estimator suffers from it. The SCASML paradigm combines a (possibly) biased machine learning algorithm with a de-biasing step design using rigorous numerical analysis and stochastic simulation. Theoretically, I’ll try to understand whether the SCaSML algorithms are optimal and what factors (e.g., smoothness, dimension, and boundness) determine the improvement of the convergence rate. Empirically, I’ll introduce different estimators that enable unbiased and trustworthy estimation for physical quantities with a biased machine learning estimator. Applications include but are not limited to estimating the moment of a function, simulating high-dimensional stochastic processes, uncertainty quantification using bootstrap methods, and randomized linear algebra.