Approximation, Generalization, and Scaling in Neural Networks
1. Approximation in high dimension: Barron-space perspective
For a two-layer network f_m(x) = (1/m) Σ_{k=1}^m a_k σ(w_k^T x + b_k), if the target f⋆ has finite Barron norm ||f⋆||_B, there exists a parameterization such that
E_x[(f⋆(x) − f_m(x))^2] ≤ C ||f⋆||_B^2 / m.
Proof sketch
Represent f⋆ as an expectation over features σ(w^T x), using a Fourier representation whose first spectral moment is finite (this moment defines ||f⋆||_B). Approximate the expectation by Monte Carlo with m sampled features; Jensen's inequality plus variance control gives an O(m^{-1/2}) error in L2, i.e. O(1/m) in squared error.
The rate is dimension-robust: m^{-1/2} does not degrade with input dimension d; dimension enters only through ||f⋆||_B, i.e. through how the spectral mass of f⋆ is distributed.
2. Rademacher-style generalization bounds
For a loss bounded in [0, B] and L-Lipschitz in its first argument, and hypothesis class F with empirical Rademacher complexity R_n(F), uniformly over f ∈ F,
R(f) ≤ R̂_n(f) + 2L R_n(F) + 3B √(log(2/δ) / (2n))
with probability at least 1 − δ over the n training samples, where R̂_n denotes empirical risk.
For deep nets, R_n can be bounded by the product of layer spectral norms multiplied by a sum of per-layer Frobenius-to-spectral norm ratios. Such bounds highlight implicit regularization: the optimization trajectory and normalization keep these norms small even when the parameter count is huge.
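A minimal sketch of such a norm-based complexity measure, computed on hypothetical random weight matrices (the exact exponents in published bounds differ; the Frobenius-to-spectral ratio here is a simplified stand-in):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 3-layer MLP weights, random for illustration only.
weights = [rng.standard_normal((64, 32)) / 8,
           rng.standard_normal((32, 32)) / 6,
           rng.standard_normal((32, 10)) / 6]

def spectral_complexity(Ws):
    # (product of spectral norms) * (sum of Frobenius/spectral ratios):
    # a simplified proxy for the deep-net Rademacher bounds in the text.
    specs = [np.linalg.norm(W, 2) for W in Ws]   # largest singular values
    frobs = [np.linalg.norm(W) for W in Ws]      # Frobenius norms
    return float(np.prod(specs) * sum(f / s for f, s in zip(frobs, specs)))

comp = spectral_complexity(weights)
```

Tracking `comp` along a training run is one way to see whether optimization is implicitly controlling complexity.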
3. Margin-based bound for classification
Let γ > 0 be the empirical margin. A typical bound takes the form
P[error] ≤ (1/n) Σ_{i=1}^n 1{margin_i ≤ γ} + Õ( C(θ) / (γ √n) ),
where C(θ) captures norm-based complexity of the parameters θ. Training often continues to increase the margin while controlling effective complexity, even past interpolation.
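The empirical margin of a multi-class classifier is the score of the true class minus the best competing score; what enters the bound is this margin normalized by C(θ). A small sketch with hypothetical logits:

```python
import numpy as np

rng = np.random.default_rng(2)
logits = rng.standard_normal((5, 3))   # 5 examples, 3 classes (illustrative)
labels = np.array([0, 1, 2, 0, 1])

idx = np.arange(len(labels))
true_scores = logits[idx, labels]
masked = logits.copy()
masked[idx, labels] = -np.inf          # exclude the true class
margins = true_scores - masked.max(axis=1)

# A low quantile of the margin distribution is a common choice of gamma;
# gamma / C(theta) is the quantity the margin bound rewards.
gamma = np.quantile(margins, 0.1)
```

Examples with negative margin are misclassified, so the first term of the bound is exactly the fraction of training points with margin_i ≤ γ.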
4. Why scaling works: optimization + statistics decomposition
A useful decomposition for excess risk (with f_F the best function in the class F, f̃ the empirical risk minimizer over F, and f̂ the output of training):
R(f̂) − R(f⋆) = [R(f̂) − R(f̃)] + [R(f̃) − R(f_F)] + [R(f_F) − R(f⋆)]
             = optimization error + estimation error + approximation error.
As model size grows:
- approximation error decreases,
- estimation error may initially rise then fall (double descent regime),
- optimization error decreases in overparameterized lazy/NTK-like regimes.
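The non-monotone estimation-error term can be sketched with min-norm least squares on random features: test error typically peaks near the interpolation threshold m ≈ n and falls again in the overparameterized regime. The setup below (linear target, noise level, sizes) is an illustrative assumption, not from the text.

```python
import numpy as np

def avg_test_mse(m, n=100, d=10, noise=0.5, trials=5):
    """Average test MSE of min-norm least squares on m random features."""
    errs = []
    for t in range(trials):
        rng = np.random.default_rng(t)
        beta = rng.standard_normal(d) / np.sqrt(d)   # hypothetical linear target
        X_tr = rng.standard_normal((n, d))
        X_te = rng.standard_normal((1000, d))
        y_tr = X_tr @ beta + noise * rng.standard_normal(n)
        y_te = X_te @ beta
        W = rng.standard_normal((d, m))
        b = rng.uniform(0, 2 * np.pi, m)
        phi = lambda X: np.cos(X @ W / np.sqrt(d) + b)
        a = np.linalg.pinv(phi(X_tr)) @ y_tr         # min-norm solution
        errs.append(np.mean((phi(X_te) @ a - y_te) ** 2))
    return float(np.mean(errs))

errs = {m: avg_test_mse(m) for m in (20, 100, 400)}
# Typically errs[100] (m = n, interpolation) dominates errs[20] and errs[400].
```

The spike at m ≈ n comes from the ill-conditioned feature matrix at exact interpolation; past it, the min-norm solution becomes better behaved, matching the "rise then fall" bullet above.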
5. Compute-optimal scaling laws (stylized)
Empirical studies often fit a power law of the form
L(N, D) ≈ E + A / N^α + B / D^β,
with model size N (parameters), data D (tokens/examples), and compute C ≈ 6ND. Under a fixed compute budget, optimal allocation balances marginal returns along the constraint:
(∂L/∂N) / (∂C/∂N) = (∂L/∂D) / (∂C/∂D),   yielding N⋆ ∝ C^{β/(α+β)} and D⋆ ∝ C^{α/(α+β)}.
The key principle is equalizing marginal loss reduction per unit compute.
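This balancing condition has a closed form once the power-law fit is fixed. The constants below are illustrative placeholders, not fitted values from any study:

```python
# Stylized scaling-law fit L(N, D) = E + A/N**alpha + B/D**beta
# with compute constraint C ~ 6 N D. Constants are illustrative only.
E, A, B, alpha, beta = 1.7, 400.0, 410.0, 0.34, 0.28

def loss(N, D):
    return E + A / N**alpha + B / D**beta

def optimal_allocation(C):
    # Minimizing A*N^{-alpha} + B*(K/N)^{-beta} with K = C/6 gives
    # N* = (alpha*A / (beta*B))**(1/(alpha+beta)) * K**(beta/(alpha+beta)).
    K = C / 6.0
    N = (alpha * A / (beta * B)) ** (1 / (alpha + beta)) * K ** (beta / (alpha + beta))
    return N, K / N

C = 1e21  # a hypothetical compute budget (FLOPs)
N_star, D_star = optimal_allocation(C)

# Sanity check: shifting compute between N and D along the constraint
# can only increase the loss, i.e. marginal returns are equalized at (N*, D*).
base = loss(N_star, D_star)
assert all(loss(N_star * s, D_star / s) >= base for s in (0.5, 2.0))
```

Because the loss is convex in log N along the constraint, the stationary point is the unique compute-optimal split.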
6. Takeaway
Generalization in modern deep learning is better captured by norm/margin/trajectory-dependent complexity and scaling trade-offs than by VC dimension alone.
Key insights:
- Barron space provides dimension-robust approximation guarantees for neural networks
- Modern generalization depends on optimization trajectory, not just model complexity
- Scaling laws emerge from balancing approximation, estimation, and optimization errors
- Compute-optimal training requires equalizing marginal gains across resources