# Paper - Representation Benefits of Deep Feedforward Networks, Telgarsky (2015)

> Source: https://ollybritton.com/notes/uni/part-c/mt25/theories-of-deep-learning/reading/paper-representation-benefits-of-deep-feedforward-networks-telgarsky-2015/ · Updated: 2025-10-22 · Tags: uni, notes

- **Full title**: *Representation Benefits of Deep Feedforward Networks*
- **Author(s)**: Matus Telgarsky
- **Year**: 2015
- **Link**: https://arxiv.org/abs/1509.08101
- **Relevant for**:
	- [Course - Theories of Deep Learning MT25](https://ollybritton.com/notes/uni/part-c/mt25/theories-of-deep-learning/)

### Summary
> This note provides a family of classification problems, indexed by a positive integer $k$, where all shallow networks with fewer than exponentially (in $k$) many nodes exhibit error at least $1/6$, whereas a deep network with $2$ nodes in each of the $2k$ layers achieves zero error, as does a recurrent network with $3$ distinct nodes iterated $k$ times. The proof is elementary, and the networks are standard feedforward networks with ReLU nonlinearities.

This demonstrates why the "deep" part of "deep learning" is important; depth in a network allows it to be as expressive as a shallow network with exponentially many nodes.

### Flashcards
Let:

- $\mathfrak R(\sigma; m, l)$ denote the set of possible functions of a neural networks with $l$ layers and $m$ nodes.
- $\tilde f(x) = \mathbb 1[f(x) \ge 1/2]$ denote the classifier corresponding to a function $f : \mathbb R^d \to \mathbb R$
- $\mathcal R_z(f) := n^{-1} \sum_i \mathbb 1[\tilde f(x_i) \ne y_i]$ for a set of points $((x_i, y_i))^n_{i = 1}$ with $x_i \in \mathbb R^d$ and $y_i \in \{0, 1\}$.

@State the theorem in Telgarsky (2015) which formalises the importance of depth in a neural networks.::

Suppose:

- $k$ is a positive integer
- $l$ is the number of layers
- $m$ is the number of nodes per layer
- $m \le 2^{(k-3)/l-1}$

Then there exists a collection of $2^k$ points $((x_i, y_i))^n_{i = 1}$ with $x_i \in [0, 1]$ and $y \in \{0, 1\}$ such that
$$
\min_{f \in \mathfrak{R}(\sigma_R; 2, 2k} \mathcal R_z(f) = 0
$$
and
$$
\min_{g \in \mathfrak{R}(\sigma_R; m, l)} \mathcal R_z(g) \ge \frac 1 6
$$

Let:

- $\mathfrak R(\sigma; m, l)$ denote the set of possible functions of a neural networks with $l$ layers and $m$ nodes.
- $\tilde f(x) = \mathbb 1[f(x) \ge 1/2]$ denote the classifier corresponding to a function $f : \mathbb R^d \to \mathbb R$
- $\mathcal R_z(f) := n^{-1} \sum_i \mathbb 1[\tilde f(x_i) \ne y_i]$ for a set of points $((x_i, y_i))^n_{i = 1}$ with $x_i \in \mathbb R^d$ and $y_i \in \{0, 1\}$.

Telgarsky (2015) proves that if

- $k$ is a positive integer
- $l$ is the number of layers
- $m$ is the number of nodes per layer
- $m \le 2^{(k-3)/l-1}$

Then there exists a collection of $2^k$ points $((x_i, y_i))^n_{i = 1}$ with $x_i \in [0, 1]$ and $y \in \{0, 1\}$ such that
$$
\min_{f \in \mathfrak{R}(\sigma_R; 2, 2k)} \mathcal R_z(f) = 0
$$
and
$$
\min_{g \in \mathfrak{R}(\sigma_R; m, l)} \mathcal R_z(g) \ge \frac 1 6
$$
What's the rough idea behind the proof?::

Consider "$n$-ap" (alternating points) classification problems like the following:

![Screenshot 2025-10-19 at 17.18.55.png](https://ollybritton.com/assets/attachments/img/Screenshot 2025-10-19 at 17.18.55.png)

- For the lower bound, show that a shallow network has to be a sawtooth-like function, and that sawtooth functions can only cross the origin a few amount of time.
- For the upper bound, show how composing a "mirror map" with itself lets you create sawtooth functions with exponentially many spikes.

![Screenshot 2025-10-19 at 17.22.08.png](https://ollybritton.com/assets/attachments/img/Screenshot 2025-10-19 at 17.22.08.png)

---
Olly Britton — https://ollybritton.com. Machine-readable index: https://ollybritton.com/llms.txt
