Conference on Parsimony and Learning (CPAL)

March 2026, Tübingen

CPAL 2026 Rising Stars Awardees

Avrajit Ghosh

University of California, Berkeley

Postdoctoral Researcher

Title: A fundamental statistical limitation of Gradient Descent in learning sparse targets from hard labels

Abstract: The success of gradient descent in deep learning has often been attributed to a property called implicit regularization which enables finding “generalizing” solutions. Contrary to past works, we point out a statistical limitation of learning with gradient descent in the simplest yet most commonly occurring case. We study well-specified sparse logistic regression in the over-constrained regime, where the number of input-label samples (n) exceeds the dimension (d), and the learner has access only to hard discrete labels. In this setting, when gradient descent learns using a standard dot-product model between weights and inputs, it is provably suboptimal and incurs an excess risk that scales with the dimension (d). This limitation holds more broadly for a class of rotation-invariant algorithms, which includes deep neural networks with a fully connected first layer. In contrast, simple non-rotation-invariant parameterizations achieve substantially better statistical performance using early stopping, with only logarithmic dependence on the dimension. Our results highlight a limitation of the commonly celebrated implicit regularization of gradient descent and its failure to efficiently learn sparse targets from hard labels.

Ang Li

University of Maryland

Assistant Professor

Title: Parsimony and Efficiency in Foundation Models: A Holistic Framework for Structural Analysis and Resource-Aware Serving

Abstract: The increasing overparameterization of foundation models necessitates a shift toward parsimonious architectures and resource-aware serving to enable universal deployment. This research establishes a holistic framework for efficiency by identifying structural redundancies and mitigating operational bottlenecks across the AI stack. We introduce a similarity-based metric to quantify module importance, revealing that “Attention Drop” – the strategic pruning of redundant attention layers – can streamline Transformers while preserving core functional performance. For Mixture of Experts (MoE) architectures, we propose a unified compression strategy integrating “Expert Trimming” of structured modules with “Expert Slimming” of individual experts. To enhance sparse inference efficiency, we address the “Straggler Effect” caused by imbalanced token routing through Capacity-Aware Inference, which regulates expert workloads via token dropping and rerouting. Finally, we demonstrate system-level parsimony through EdgeLoRA, a multi-tenant serving system that utilizes adaptive adapter selection and heterogeneous memory management to maximize efficiency on edge hardware. Collectively, these contributions provide a scientific foundation for the next generation of sustainable and hardware-aware intelligent systems.

Bhavya Vasudeva

University of Southern California

Ph.D. Candidate

Title: How Muon’s Spectral Design Benefits Generalization: A Study on Imbalanced Data

Abstract: The growing adoption of spectrum-aware matrix-valued optimizers such as Muon and Shampoo in deep learning motivates a systematic study of their generalization properties and, in particular, when they might outperform competitive algorithms. We approach this question by introducing appropriate simplifying abstractions as follows: First, we use imbalanced data as a testbed. Second, we study the canonical form of such optimizers, which is Spectral Gradient Descent (SpecGD) – each update step is UV^T where USV^T is the truncated SVD of the gradient. Third, within this framework we identify a canonical setting for which we precisely quantify when SpecGD outperforms vanilla Euclidean GD. For a Gaussian mixture data model and both linear and bilinear models, we show that unlike GD, which prioritizes learning dominant principal components of the data first, SpecGD learns all principal components of the data at equal rates. We demonstrate how this translates to a growing gap in class-balanced loss favoring SpecGD early in training and further show that the gap remains consistent even when the GD counterpart uses adaptive step-sizes via normalization. By extending the analysis to deep linear models, we show that depth amplifies these effects. We empirically verify our theoretical findings on a variety of imbalanced datasets. Our experiments compare practical variants of spectral methods, like Muon and Shampoo, against their Euclidean counterparts and Adam. The results validate our findings that these spectral optimizers achieve superior generalization by promoting a more balanced learning of the data’s underlying components.
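
The SpecGD update described in this abstract can be illustrated with a minimal NumPy sketch. It assumes a single weight matrix and a gradient supplied by the caller; the function name `specgd_step`, the `rank` parameter, and the tolerance rule for truncation are illustrative choices, not details taken from the talk.

```python
import numpy as np

def specgd_step(W, grad, lr, rank=None):
    """One spectral GD step: W <- W - lr * U V^T, where grad = U S V^T."""
    U, S, Vt = np.linalg.svd(grad, full_matrices=False)
    if rank is None:
        # Assumption: keep every direction whose singular value is numerically nonzero.
        tol = S.max() * max(grad.shape) * np.finfo(S.dtype).eps
        rank = int(np.sum(S > tol))
    # The singular values are discarded, so every kept direction moves by the same amount.
    return W - lr * (U[:, :rank] @ Vt[:rank, :])

W0 = np.zeros((2, 2))
G = np.diag([3.0, 0.5])            # one dominant and one weak gradient direction
gd = W0 - 0.1 * G                  # plain GD: the step scales with each singular value
spec = specgd_step(W0, G, lr=0.1)  # SpecGD: both directions move by lr
```

In this toy case plain GD moves the weak direction six times less than the dominant one, while the spectral step moves both equally, which is the "equal rates" behavior the abstract refers to.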

Lingjing Kong

Carnegie Mellon University

Ph.D. Student

Title: Parsimony through Causality: Building Trustworthy and Interpretable AI

Abstract: Foundation models are rapidly becoming capable assistants for knowledge work, yet real deployment remains limited by three persistent gaps: brittle transfer under distribution shift, opaque internal reasoning, and coarse-grained controllability. I will argue that these gaps are best viewed as a structure problem: whether learning recovers the underlying mechanisms of the data-generating process rather than surface correlations. I use causal thinking as a practical lens for parsimony: separating what is invariant from what changes across environments, and identifying what can be intervened upon. First, I present a mechanism-learning approach to transfer, where models learn stable, reusable components from heterogeneous data and become controllable through interventions on learned factors. Second, I connect modern self-supervised objectives (e.g., masking and diffusion) to hierarchical latent-variable models that recover sparse, compositional concept structure even from seemingly unstructured data – yielding interpretable generative factors and enabling targeted edits at multiple levels of abstraction. Finally, I discuss when such structured representations support extrapolation to novel concept combinations, and sketch a compositional generation framework that improves prompt following beyond the training distribution.

Can Yaras

University of Michigan

Ph.D. Candidate

Title: Compressing Activations via Structured Matrices in Transformers

Abstract: Matrix multiplications between weights and activations dominate the computational cost of Transformers. Prior work primarily reduces this cost by compressing weights, whereas activation compression remains underexplored due to its online, input-dependent nature. We introduce a general framework for approximating activation matrices with structured matrices that admit fast matrix-vector multiplication, using efficient on-the-fly optimization. We demonstrate the effectiveness of this approach in the instance of attention activations with MonarchAttention, which has recently been adapted to speed up state-of-the-art video generation.

Chenyu You

Stony Brook University

Assistant Professor

Title: Towards Robust, Efficient, and Generalized Medical AI

Abstract: Artificial intelligence has advanced biomedical image analysis dramatically, yet reliable clinical deployment remains limited. A key reason is that current models do not fully leverage the parsimonious, low-dimensional structure underlying anatomy, physiology, and disease. Scarce labeled data hinder representation learning, distribution shifts break brittle invariances, and the absence of theoretical guarantees reduces trust in high-stakes decisions. My research addresses these challenges by developing robust, efficient, and generalized medical AI methods that extract compact structural representations, unify data-driven learning with mathematical principles, and build medical foundation models that transfer across populations, modalities, and institutions. The central aim of my research is to uncover and exploit intrinsic low-dimensional structure to achieve sample efficiency, computational efficiency, and reliable generalization in real clinical environments.

Lingjiao Chen

Stanford University

Ph.D. Student

Title: Economical AI Marketplaces: From Model Cascading to Compound AI Systems

Abstract: The paradigm of machine learning has shifted from training individual models to utilizing “AI as a Service” via APIs like GPT-5 and Gemini 3 Pro. While this democratizes access, it introduces significant challenges in cost and performance optimization. With thousands of available models and infinite combinations, how do we efficiently orchestrate these services? In this talk, I will present two frameworks that address this challenge at different levels of complexity. First, I will introduce FrugalGPT, a cascading framework for single-turn queries. By learning to route easy queries to cheaper models and only escalating hard queries to expensive ones, FrugalGPT matches the performance of top-tier models with up to 98% cost reduction. Second, I will discuss LLMSelector, which extends this optimization to compound AI systems. I will show that in these complex workflows, naively using the “best” model for every step is inefficient. LLMSelector uses a data-efficient optimization algorithm to assign the optimal model to each module, achieving substantial accuracy gains over baseline approaches.
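
The cascading idea in this abstract can be sketched in a few lines of Python. This is a simplified illustration, not FrugalGPT itself: the confidence scores and thresholds are fixed by hand rather than learned, and `cascade`, `cheap_model`, `expensive_model`, and the cost figures are hypothetical stand-ins for real API calls and prices.

```python
from typing import Callable, List, Tuple

# A "model" here is any callable returning (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]

def cascade(query: str, tiers: List[Tuple[Model, float, float]]) -> Tuple[str, float]:
    """Try models cheapest-first and stop at the first sufficiently confident answer.

    tiers: (model, cost_per_call, accept_threshold), ordered cheapest to most expensive.
    Returns (answer, total_cost). The last tier's answer is always accepted.
    """
    if not tiers:
        raise ValueError("need at least one tier")
    total_cost = 0.0
    answer = ""
    for i, (model, cost, threshold) in enumerate(tiers):
        answer, confidence = model(query)
        total_cost += cost
        if confidence >= threshold or i == len(tiers) - 1:
            break
    return answer, total_cost

# Hypothetical stub models: the cheap one is only confident on short queries.
def cheap_model(q: str) -> Tuple[str, float]:
    return ("cheap:" + q, 0.9 if len(q) < 20 else 0.3)

def expensive_model(q: str) -> Tuple[str, float]:
    return ("expensive:" + q, 0.95)

tiers = [(cheap_model, 1.0, 0.8), (expensive_model, 30.0, 0.0)]
easy = cascade("2 + 2?", tiers)                            # answered by the cheap tier
hard = cascade("Prove this inequality for all n.", tiers)  # escalated to the expensive tier
```

The easy query costs one cheap call, while the hard query pays for both calls; the savings come from how often the cheap tier's answer is accepted.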

Harry Dong

Carnegie Mellon University

Ph.D. Student

Title: Squeezing the Most Out of LLMs: The Interplay Between Efficiency and Performance

Abstract: Large language models (LLMs) continue to demonstrate surprising capabilities, driving their adoption across a wide range of domains and applications. Yet, inefficiencies in various components of the architecture and inference pipeline limit broader deployment of LLMs. The demanding nature of LLM inference can hurt latency and throughput and strain memory. Furthermore, underutilization of information can artificially limit performance. In this talk, we take a close look at the model architecture, data, features, and hardware behavior to reveal emergent patterns and redundancy that provide clues to optimize inference efficiency and performance. Together, our methods push towards the compute-performance Pareto frontier for LLM inference.

Yuko Kuroki

Intesa Sanpaolo AI Research

Research Scientist

Title: Sequential Learning under Uncertainty: Structure, Adaptivity, and Efficiency

Abstract: How can we make reliable decisions when data is limited, feedback is partial, and environments are uncertain? Many real-world decision-making problems involve complex and structured action spaces, yet only limited and noisy feedback is available. In this talk, I present a unified perspective on sequential learning under uncertainty through the lens of multi-armed bandits, achieving both statistical and computational efficiency by exploiting latent structure. First, we show how latent combinatorial structure can dramatically reduce the cost of exploration. We develop query-efficient methods for recovering clustered structures from noisy feedback. Second, we investigate adaptive algorithms for linear contextual bandits that achieve optimal performance in stochastic environments while remaining robust in adversarial settings – without prior knowledge of the regime. Finally, we connect these foundations to societal systems. We introduce a low-rank matrix bandit framework for optimizing opinion dynamics to mitigate polarization and disagreement in social networks. Looking ahead, my goal is to build a principled theory of adaptive exploration in complex systems and to translate these foundations into tools for data-driven discovery.

Tianjiao Ding

University of Pennsylvania

Ph.D. Candidate

Title: Parsimonious Concept Engineering

Abstract: Large Language Models (LLMs) are being used for a wide variety of tasks. While they are capable of generating human-like responses, they can also produce undesirable output including potentially harmful information, racist or sexist language, and hallucinations. Alignment methods are designed to reduce such undesirable output, via techniques such as fine-tuning, prompt engineering, and representation engineering. However, existing methods face several challenges: some require costly fine-tuning for every alignment task; some do not adequately remove undesirable concepts, failing alignment; some remove benign concepts, lowering the linguistic capabilities of LLMs. To address these issues, we propose Parsimonious Concept Engineering (PaCE), a novel activation engineering framework for alignment. First, to sufficiently model the concepts, we construct a large-scale concept dictionary in the activation space, in which each atom corresponds to a semantic concept. Given any alignment task, we instruct a concept partitioner to efficiently annotate the concepts as benign or undesirable. Then, at inference time, we decompose the LLM activations along the concept dictionary via sparse coding, to accurately represent the activations as linear combinations of benign and undesirable components. By removing the latter ones from the activations, we reorient the behavior of the LLM towards the alignment goal. We conduct experiments on tasks such as response detoxification, faithfulness enhancement, and sentiment revising, and show that PaCE achieves state-of-the-art alignment performance while maintaining linguistic capabilities.
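
The decompose-and-remove step in this abstract can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the PaCE implementation: orthogonal matching pursuit is swapped in as a generic sparse-coding solver, the dictionary is a tiny toy matrix, and the names `omp` and `remove_concepts` are hypothetical.

```python
import numpy as np

def omp(D, x, k):
    """Orthogonal matching pursuit: approximate x with at most k columns (atoms) of D."""
    x = np.asarray(x, dtype=float)
    residual = x.copy()
    support = []
    sol = np.zeros(0)
    for _ in range(k):
        scores = np.abs(D.T @ residual)
        if support:
            scores[support] = 0.0  # never pick the same atom twice
        j = int(np.argmax(scores))
        if scores[j] <= 1e-12:
            break  # the residual is already explained
        support.append(j)
        # Refit the coefficients of all selected atoms by least squares.
        sol = np.linalg.lstsq(D[:, support], x, rcond=None)[0]
        residual = x - D[:, support] @ sol
    coef = np.zeros(D.shape[1])
    if support:
        coef[support] = sol
    return coef

def remove_concepts(D, x, undesirable, k):
    """Subtract from x the part explained by atoms flagged as undesirable."""
    coef = omp(D, x, k)
    bad = np.asarray(undesirable, dtype=bool)
    return np.asarray(x, dtype=float) - D[:, bad] @ coef[bad]

D = np.eye(3)                        # toy dictionary: one atom per coordinate
x = np.array([2.0, 0.0, 3.0])        # toy "activation"
flags = [False, False, True]         # atom 2 is labeled undesirable
x_clean = remove_concepts(D, x, flags, k=2)
```

Only the component along the flagged atom is removed; the benign component is left untouched, which mirrors the goal of preserving linguistic capability.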

Yuqing Wang

Johns Hopkins University

Postdoctoral Researcher

Title: Data Uniformity Improves Training Efficiency and More, with a Convergence Framework Beyond the NTK Regime

Abstract: Data selection is crucial for data-driven decision-making, including foundation models, but beyond data quality and diversity, it is unclear whether other general quantitative principles can reliably improve complex tasks. In this talk, I will demonstrate that selecting more uniformly distributed data can improve training efficiency while enhancing performance. Specifically, we establish that a more uniform (less biased) distribution leads to a larger minimum pairwise distance between data points, denoted by h_min, and prove that a smaller h_min can slow down the training dynamics of gradient descent (GD). Moreover, we theoretically show that the approximation error of neural networks decreases as h_min increases. Our analysis introduces a convergence framework for GD beyond the Neural Tangent Kernel (NTK) regime, applicable to a broad class of architectures, including transformers, without requiring Lipschitz smoothness. This framework further provides theoretical justification for the use of residual connections and function composition in deep neural architectures. Finally, I will present comprehensive experiments, including supervised fine-tuning across various settings, different optimization strategies, model sizes, and training datasets. The results consistently demonstrate that selecting data by maximizing the minimum pairwise distance significantly accelerates training and achieves comparable or better performance across diverse datasets.
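
The quantity h_min and the idea of selecting data to enlarge it can be illustrated with a short standard-library sketch. The abstract does not say which selection algorithm is used, so greedy farthest-point selection is swapped in here as a common heuristic; the names `h_min` and `farthest_point_select` and the toy points are assumptions for illustration.

```python
import math
from itertools import combinations

def h_min(points):
    """Minimum pairwise Euclidean distance within a set of points."""
    return min(math.dist(p, q) for p, q in combinations(points, 2))

def farthest_point_select(points, k, start=0):
    """Greedily pick k points, each time taking the one farthest from those already chosen."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    # nearest[i] = distance from points[i] to its closest selected point
    nearest = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=lambda i: nearest[i])
        selected.append(nxt)
        nearest = [min(nearest[i], math.dist(points[i], points[nxt]))
                   for i in range(len(points))]
    return [points[i] for i in selected]

points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.9, 1.0)]
naive = points[:4]                          # take the first four points
chosen = farthest_point_select(points, 4)   # spread-out subset of the same size
```

On this toy set the first four points include a near-duplicate pair, so their h_min is small, whereas the greedy subset picks the four corners of the unit square.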

Lu Yin

University of Surrey

Assistant Professor

Title: Not All Layers Are Equal: Criticality and Depth Balance in LLMs

Abstract: Large Language Models (LLMs) have achieved remarkable success across diverse tasks, from text completion and question answering to complex reasoning. However, their increasing scale introduces practical challenges for inference, fine-tuning, and training. This talk examines these challenges through a layerwise lens. First, we show that layers differ substantially in their contributions to overall model performance. We demonstrate how layerwise importance signals enable targeted pruning and more selective fine-tuning by concentrating updates on the most influential parts of the network. We then turn to depth imbalance, where shallow and deep layers contribute unevenly during and after pretraining. We analyze possible causes of this phenomenon and present practical remedies, including layerwise learning-rate and weight-decay schedules, as well as the normalization method, to encourage balanced contributions across the full depth of the model.

Zhen Tan

Arizona State University

Ph.D. Student

Title: Fathoming the Unfathomable Foundation Models: From Algorithms to Scientific Discovery

Abstract: As foundation models become central to information systems, cybersecurity, and scientific decision-making, their opacity raises critical challenges for trust, accountability, and governance. While much prior work has focused on post-hoc explainability, my research shows that such approaches are intrinsically limited and can even obscure model failures. In this talk, I present a shift from explaining opaque systems to designing trustworthy and interpretable AI by construction. I will introduce diagnostic frameworks that reveal failures in explanation faithfulness and security, such as explanatory inversion and vulnerability to retrieval and communication attacks. I will then demonstrate how concept-based, human-centered model architectures enable glass-box reasoning, actionable interventions, and self-reflection for large language models (LLMs). I will conclude by demonstrating how these ideas translate into real-world impact across AI-driven cybersecurity, conversational agents, agricultural robotics, and neuroscience, and by outlining a vision for information-centric AI systems that are transparent, reliable, and aligned with human values.

Zhenyu Zhang

University of Texas at Austin

Ph.D. Student

Title: Democratizing LLM Training via Low-Rank Methods

Abstract: As Large Language Models (LLMs) continue to scale in size and capability, the cost and complexity of training – particularly memory consumption from optimizer states – have become critical bottlenecks. In this talk, we explore the evolution of low-rank optimization techniques for LLMs, tracing the progression from GaLore to APOLLO. GaLore introduces a memory-efficient training paradigm by applying low-rank projections to gradients and optimizer states, enabling the training of models with up to 7 billion parameters on consumer-grade GPUs such as the NVIDIA RTX 4090. Building on this foundation, APOLLO demonstrates the surprising effectiveness of rank-1 optimizer states with purely random projections, achieving SGD-level memory overhead while maintaining performance comparable to AdamW. This method brings substantial system-level improvements, including a 3x increase in throughput on 8xA100-80G GPUs and enhanced model scalability.
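
The memory saving from keeping optimizer states in a low-rank subspace can be sketched with NumPy. This is a simplified illustration in the spirit of the gradient projection described above, not the GaLore or APOLLO implementation: the function name `low_rank_adam_step`, the left-side-only projection, the `update_gap` refresh interval, and all hyperparameter defaults are assumptions.

```python
import numpy as np

def low_rank_adam_step(W, G, state, rank, lr=1e-3, update_gap=200,
                       betas=(0.9, 0.999), eps=1e-8):
    """One Adam-style step whose moment estimates are stored as rank x n matrices."""
    t = state.get("t", 0)
    if t % update_gap == 0:
        # Refresh the projector: top-`rank` left singular vectors of the gradient (m x rank).
        U, _, _ = np.linalg.svd(G, full_matrices=False)
        state["P"] = U[:, :rank]
    P = state["P"]
    R = P.T @ G  # projected gradient, shape (rank, n)

    b1, b2 = betas
    m = b1 * state.get("m", np.zeros_like(R)) + (1 - b1) * R
    v = b2 * state.get("v", np.zeros_like(R)) + (1 - b2) * R**2
    t += 1
    m_hat = m / (1 - b1**t)  # standard Adam bias correction
    v_hat = v / (1 - b2**t)
    state.update(m=m, v=v, t=t)

    # Normalize in the small space, then project the update back to the full shape.
    return W - lr * (P @ (m_hat / (np.sqrt(v_hat) + eps)))

rng = np.random.default_rng(0)
W = np.ones((6, 4))
G = rng.standard_normal((6, 4))
state = {}
W_new = low_rank_adam_step(W, G, state, rank=2)
```

Here the two moment matrices have shape (2, 4) instead of (6, 4); for real weight matrices with thousands of rows, that ratio is where the optimizer-state memory saving comes from.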