Overview
This course is designed to provide hands-on experience and theoretical understanding in the field of machine learning. Students will work in groups of 3 or 4 on either practical or theoretical projects, which will be presented to the class.
-
Langevin Sampling and Denoising Autoencoders
Overdamped Langevin dynamics targets a density π(x) ∝ e^{-U(x)} via the SDE dX_t = -∇U(X_t)dt + √2 dW_t, with practical discretizations such as ULA/MALA and SGLD. Denoising autoencoders (DAEs) learn to predict clean data from noisy inputs; the denoising vector field approximates the score ∇ log p(x), linking DAEs to score matching and Langevin sampling via Tweedie’s formula.
Starter references: Welling & Teh (2011) Bayesian Learning via Stochastic Gradient Langevin Dynamics; Vincent (2011) A Connection Between Score Matching and Denoising Autoencoders.
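As a concrete illustration of the ULA discretization mentioned above, here is a minimal sketch assuming a toy 1-D double-well potential U(x) = (x^2 - 1)^2 (an illustrative choice, not taken from the references):

    import numpy as np

    # Toy potential U(x) = (x^2 - 1)^2, a 1-D double well (illustrative choice).
    def grad_U(x):
        return 4.0 * x * (x**2 - 1.0)

    def ula_sample(n_steps=10_000, step=1e-2, seed=0):
        """Unadjusted Langevin algorithm: x <- x - step * grad_U(x) + sqrt(2 * step) * noise."""
        rng = np.random.default_rng(seed)
        x = rng.standard_normal()
        samples = []
        for _ in range(n_steps):
            x = x - step * grad_U(x) + np.sqrt(2.0 * step) * rng.standard_normal()
            samples.append(x)
        return np.array(samples)

    samples = ula_sample()
    print("empirical mean/std:", samples.mean(), samples.std())

Swapping -∇U for a learned score, e.g. (DAE(x̃) - x̃)/σ² via Tweedie’s formula, turns this into the DAE-driven sampler discussed above.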
-
Variational Autoencoders (VAE)
VAEs posit a latent-variable model pθ(x,z)=p(z)pθ(x|z) and optimize the evidence lower bound (ELBO) using the reparameterization trick for low-variance gradients. Core topics include the ELBO–KL decomposition, amortized inference, posterior collapse, and expressivity via richer posteriors (normalizing flows) and tighter bounds (IWAE). Theoretical angles involve identifiability, variational gaps, and bits-back coding interpretations.
Starter references: Kingma & Welling (2014) Auto-Encoding Variational Bayes.
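A minimal sketch of the reparameterization trick and a Monte-Carlo ELBO for a Gaussian encoder with a Bernoulli decoder (PyTorch; the layer sizes and dummy batch are illustrative assumptions, not taken from the paper):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyVAE(nn.Module):
        def __init__(self, x_dim=784, z_dim=16, h_dim=256):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
            self.mu, self.logvar = nn.Linear(h_dim, z_dim), nn.Linear(h_dim, z_dim)
            self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

        def forward(self, x):
            h = self.enc(x)
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
            logits = self.dec(z)
            # ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z)); KL is closed-form for diagonal Gaussians
            rec = -F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(-1)
            kl = 0.5 * (mu**2 + logvar.exp() - 1.0 - logvar).sum(-1)
            return -(rec - kl).mean()   # negative ELBO, to be minimized

    x = torch.rand(32, 784)   # dummy batch with values in [0, 1]
    loss = TinyVAE()(x)
    loss.backward()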
-
Score-Based Diffusion Models
Score-based generative modeling learns the score ∇x log p_t(x) across noise levels and samples via reverse-time SDEs or probability-flow ODEs. The framework unifies denoising score matching, annealed Langevin dynamics, and continuous-time diffusion, with precise links to Fokker–Planck equations and Girsanov’s theorem. Key questions include consistency of score estimators, stability of reverse SDE solvers, and likelihood computation.
Starter references: Hyvärinen (2005/2007) Score Matching; Song & Ermon (2019) Generative Modeling by Estimating Gradients of the Data Distribution; Song et al. (2021) Score-Based Generative Modeling through SDEs; Song et al. (2020) Improved Techniques for Training Score-Based Generative Models.
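A minimal sketch of denoising score matching at a single noise level σ, using the common σ²-weighted form of the objective, ‖σ s_θ(x̃) + ε‖² with x̃ = x + σε (the two-layer score network and 2-D toy data are illustrative assumptions):

    import torch
    import torch.nn as nn

    score_net = nn.Sequential(nn.Linear(2, 128), nn.SiLU(), nn.Linear(128, 2))  # toy 2-D score model

    def dsm_loss(x, sigma=0.5):
        """Denoising score matching at one noise level: the score of the perturbation
        kernel N(x_tilde; x, sigma^2 I) is -(x_tilde - x) / sigma^2 = -noise / sigma."""
        noise = torch.randn_like(x)
        x_tilde = x + sigma * noise
        pred = sigma * score_net(x_tilde)      # sigma^2 weighting keeps the loss scale comparable across levels
        return ((pred + noise) ** 2).sum(-1).mean()

    x = torch.randn(128, 2)                    # dummy data batch
    loss = dsm_loss(x)
    loss.backward()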
-
Flow Matching (CNFs, CFM, Rectified Flows)
Flow Matching trains continuous normalizing flows by regressing a time-dependent velocity field along a chosen probability path between a simple base distribution and the data, avoiding ODE simulation during training. It connects to diffusion via the probability-flow ODE and admits practical variants: conditional/generalized flow matching for conditional tasks and rectified flows that learn near-straight trajectories for fast sampling. Typical mathematical highlights include deriving the FM objective from continuity equations and analyzing path choices (Gaussian vs. OT/displacement interpolation) and their impact on sample complexity and stability.
Starter references: Lipman et al. (2022) Flow Matching for Generative Modeling; Gagneux et al. (2025) A Visual Dive into Conditional Flow Matching.
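A minimal sketch of the conditional flow matching objective with the linear/displacement path x_t = (1 - t)x₀ + t x₁, whose conditional target velocity is x₁ - x₀ (the small velocity network and 2-D toy data are illustrative assumptions):

    import torch
    import torch.nn as nn

    # Toy velocity field v_theta(x, t) on 2-D data; the architecture is illustrative only.
    v_net = nn.Sequential(nn.Linear(3, 128), nn.SiLU(), nn.Linear(128, 2))

    def cfm_loss(x1):
        """Conditional flow matching with the linear path x_t = (1 - t) x0 + t x1,
        whose conditional target velocity is simply x1 - x0."""
        x0 = torch.randn_like(x1)                 # base (noise) samples
        t = torch.rand(x1.shape[0], 1)            # uniform time in [0, 1]
        xt = (1.0 - t) * x0 + t * x1
        target = x1 - x0
        pred = v_net(torch.cat([xt, t], dim=-1))
        return ((pred - target) ** 2).sum(-1).mean()

    x1 = torch.randn(256, 2)                      # dummy data batch
    loss = cfm_loss(x1)
    loss.backward()

Sampling then amounts to integrating dx/dt = v_θ(x, t) from t = 0 to 1 with any ODE solver; rectified flows reuse the same regression on straightened couplings.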
-
Denoising Diffusion Models (Discrete Setting)
For categorical/sequential data, the forward process is a time-inhomogeneous Markov chain (e.g., multinomial corruption) that gradually randomizes symbols; the reverse model learns transition probabilities back to data. Design issues include choosing the corruption family (absorbing vs non-absorbing), parameterizing reverse kernels, and training via variational bounds or score-style objectives on simplices. Applications span text, protein sequences, and graphs.
Starter references: Austin et al. (2021) Structured Denoising Diffusion Models in Discrete State Spaces (D3PM).
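A minimal sketch of one forward step of a uniform (non-absorbing) multinomial corruption chain, corresponding to a D3PM-style transition matrix Q_t = (1 - β_t)I + β_t 𝟙𝟙ᵀ/K (vocabulary size and corruption schedule are toy values):

    import torch

    def uniform_corrupt(tokens, beta_t, vocab_size):
        """One forward step of a uniform multinomial corruption chain:
        with probability beta_t a token is resampled uniformly over the vocabulary."""
        resample = torch.rand(tokens.shape) < beta_t
        random_tokens = torch.randint(vocab_size, tokens.shape)
        return torch.where(resample, random_tokens, tokens)

    vocab_size = 27                                  # toy vocabulary (illustrative)
    x0 = torch.randint(vocab_size, (4, 16))          # batch of 4 sequences of length 16
    xt = x0
    for beta_t in [0.02, 0.05, 0.1]:                 # toy corruption schedule
        xt = uniform_corrupt(xt, beta_t, vocab_size)
    print((xt != x0).float().mean().item(), "fraction of symbols changed")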
-
Autoregressive Data Generation: RNNs and LSTMs
Autoregressive models factorize p(x) into products of conditionals (e.g., language models) and learn to predict the next token. Recurrent networks capture long-range dependencies through hidden states; LSTMs/GRUs mitigate vanishing gradients via gating mechanisms. Theoretical directions include expressivity, mixing properties of the induced Markov chains, and generalization under teacher forcing vs free-running.
Starter references: Hochreiter & Schmidhuber (1997) Long Short-Term Memory; Bengio et al. (2003) A Neural Probabilistic Language Model; Mikolov et al. (2010) RNN-Based Language Models.
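A minimal sketch of teacher-forced next-token training for an LSTM language model (PyTorch; the vocabulary size, layer widths, and random token batch are placeholders):

    import torch
    import torch.nn as nn

    vocab_size, emb_dim, hidden_dim = 100, 64, 128   # illustrative sizes

    class TinyLSTMLM(nn.Module):
        def __init__(self):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, tokens):
            h, _ = self.lstm(self.emb(tokens))
            return self.out(h)                       # next-token logits at every position

    model = TinyLSTMLM()
    tokens = torch.randint(vocab_size, (8, 32))      # dummy batch of token sequences
    logits = model(tokens[:, :-1])                   # teacher forcing: condition on the true prefix
    loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
    loss.backward()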
-
Autoregressive Data Generation: Transformer Architecture
Transformers replace recurrence with attention, enabling parallelizable sequence modeling with strong inductive biases for long-range dependencies. In the autoregressive regime, causal masking yields powerful language models whose performance scales predictably with data, model size, and compute. Theory topics include attention expressivity, context length extrapolation, and scaling/compute optimality.
Starter references: Vaswani et al. (2017) Attention Is All You Need.
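A minimal sketch of the causal mask that makes self-attention autoregressive: position i attends only to positions ≤ i (single head, toy weight matrices instead of learned projections; all shapes are illustrative):

    import torch
    import torch.nn.functional as F

    def causal_self_attention(x, w_q, w_k, w_v):
        """Single-head causal attention: a lower-triangular mask blocks attention to future positions."""
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5     # (seq, seq) attention logits
        seq_len = x.shape[-2]
        mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
        scores = scores.masked_fill(~mask, float("-inf"))         # causal mask: no peeking at the future
        return F.softmax(scores, dim=-1) @ v

    d_model = 32                                                  # illustrative model width
    x = torch.randn(10, d_model)                                  # one sequence of 10 tokens
    w = [torch.randn(d_model, d_model) / d_model**0.5 for _ in range(3)]
    out = causal_self_attention(x, *w)
    print(out.shape)                                              # torch.Size([10, 32])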