Paper - Error bounds for approximations with deep ReLU networks, Yarotsky
- Full title: Error bounds for approximations with deep ReLU networks
- Author(s): Dmitry Yarotsky
- Year: 2016
- Link: https://arxiv.org/abs/1610.01145
- Relevant for:
Summary
We study expressive power of shallow and deep neural networks with piece-wise linear activation functions. We establish new rigorous upper and lower bounds for the network complexity in the setting of approximations in Sobolev spaces. In particular, we prove that deep ReLU networks more efficiently approximate smooth functions than shallow networks. In the case of approximations of 1D Lipschitz functions we describe adaptive depth-6 network architectures more efficient than the standard shallow architecture.
In [[Paper - Representation Benefits of Deep Feedforward Networks, Telgarsky (2015)]], it was shown that deep ReLU networks can be exponentially more compact than shallow ReLU networks for a specific problem (the $n$-ap problem).
This paper shows that deep ReLU networks can be much more compact than shallow ReLU networks for a much larger class of functions. More specifically, they show:
- Section 2: The specific choice of ReLU is not important: ReLU networks and networks using any other continuous, nonlinear, piecewise linear activation function with finitely many breakpoints can simulate each other with only a constant-factor increase in the number of weights.
- Section 3: For any $\epsilon \in (0, 1)$, there is a ReLU network architecture of depth $O(\ln (1/\epsilon))$ and complexity $O(\epsilon^{-d/n} \ln(1/\epsilon))$ that can approximate any function in the Sobolev unit ball $F _ {d, n}$ (defined below) to error $\epsilon$. For 1D Lipschitz functions, if the architecture of the network is allowed to depend on the function, this reduces to a depth-$6$ network with $O\left(\frac{1}{\epsilon \ln(1/\epsilon)}\right)$ connections and activation units.
- Section 4: Lower bounds on the depth and complexity of ReLU networks under various assumptions.
Flashcards
How does Yarotsky (2016) argue that the specific choice of ReLU for a nonlinear activation is not important?
Given a network that uses any other continuous piecewise linear activation function with $M$ breakpoints, you can find a ReLU network that computes the same function with only a constant factor more weights, where the constant factor depends only on $M$.
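As a toy illustration of this conversion (my own example, not taken from the paper): any continuous piecewise linear activation with $M$ breakpoints can be written as a constant, plus possibly a linear term, plus $M$ shifted ReLUs, so each unit using it can be replaced by a bounded number of ReLU units. For instance, the hard-tanh activation, which has $M = 2$ breakpoints, needs only two ReLUs:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Hard tanh: a continuous piecewise linear activation with two breakpoints
# at -1 and +1 (a toy example for the conversion, not used in the paper).
def hard_tanh(x):
    return np.clip(x, -1.0, 1.0)

# The same function written exactly as a combination of ReLUs: the slope
# changes by +1 at the breakpoint -1 and by -1 at the breakpoint +1, and
# the left tail is the constant -1.
def hard_tanh_from_relus(x):
    return -1.0 + relu(x + 1.0) - relu(x - 1.0)

x = np.linspace(-3, 3, 601)
print(np.allclose(hard_tanh(x), hard_tanh_from_relus(x)))  # True
```

If the activation has a non-zero left-tail slope, the leftover linear term $x$ can itself be written as $\operatorname{ReLU}(x) - \operatorname{ReLU}(-x)$, so the blow-up is still a constant depending only on $M$.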
Let $f \in L^\infty$. Recall that:
- $\Vert f \Vert _ {\mathcal W^{n,\infty}([0,1]^d)}= \max _ {\mathbf n:\, \vert \mathbf n \vert \le n}\operatorname*{ess\,sup} _ {\mathbf x\in[0,1]^d}\bigl \vert D^{\mathbf n} f(\mathbf x)\bigr \vert $ where $\mathbf n = (n _ 1, \ldots, n _ d) \in \{0, 1, \ldots\}^d, \vert \mathbf n \vert = n _ 1 + \cdots + n _ d$, i.e. the maximum size of any weak derivative of order up to $n$ over the domain, ignoring sets of measure $0$.
- $F _ {d, n} = \{f \in \mathcal W^{n, \infty}([0, 1]^d) : \Vert f \Vert _ {\mathcal W^{n, \infty}([0, 1]^d)} \le 1\}$
In this context, can you state the key result of Yarotsky (2016) about the expressivity of ReLU networks?
For any $d, n$ and $\epsilon \in (0, 1)$, there is a ReLU network architecture that:
- Is capable of expressing any function from $F _ {d, n}$ with error $\epsilon$, and
- Has depth at most $c(\ln(1/\epsilon) + 1)$ and at most $c \epsilon^{-d/n}(\ln(1/\epsilon) + 1)$ weights and computation units, for some constant $c = c(d, n)$.
Let $f \in L^\infty$. Recall that:
- $\Vert f \Vert _ {\mathcal W^{n,\infty}([0,1]^d)}= \max _ {\mathbf n:\, \vert \mathbf n \vert \le n}\operatorname*{ess\,sup} _ {\mathbf x\in[0,1]^d}\bigl \vert D^{\mathbf n} f(\mathbf x)\bigr \vert $ where $\mathbf n = (n _ 1, \ldots, n _ d) \in \{0, 1, \ldots\}^d, \vert \mathbf n \vert = n _ 1 + \cdots + n _ d$, i.e. the maximum size of any weak derivative of order up to $n$ over the domain, ignoring sets of measure $0$.
- $F _ {d, n} = \{f \in \mathcal W^{n, \infty}([0, 1]^d) : \Vert f \Vert _ {\mathcal W^{n, \infty}([0, 1]^d)} \le 1\}$
In this context, the key result of Yarotsky (2016) about the expressivity of ReLU networks is that for any $d, n$ and $\epsilon \in (0, 1)$, there is a ReLU network architecture that:
- Is capable of expressing any function from $F _ {d, n}$ with error $\epsilon$, and
- Has depth at most $c(\ln(1/\epsilon) + 1)$ and at most $c \epsilon^{-d/n}(\ln(1/\epsilon) + 1)$ weights and computation units, for some constant $c = c(d, n)$.
Roughly, how does the proof proceed?
- Split the domain $[0,1]^d$ into overlapping patches using a partition of unity built from ReLUs.
- Approximate $f$ on each patch by a local Taylor polynomial.
- To implement these polynomials with ReLU networks (see the numerical sketch below), first show that ReLU networks can approximate $x^2$ to arbitrary accuracy.
- Then, since $xy = \frac 1 2 ((x + y)^2 - x^2 - y^2)$, products $x \times y$ can be approximated to arbitrary accuracy.
- Chaining approximate products lets you approximate an arbitrary polynomial to arbitrary accuracy.
- Glue these pieces together with the partition of unity to approximate all of $f$.
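To make the $x^2$ step concrete, here is a minimal numerical sketch (the helper names are mine; the construction is the paper's sawtooth-based squaring scheme): the "tooth" function $g$ is built from three ReLUs, and subtracting rescaled compositions of it from $x$ gives a deep ReLU network that interpolates $x^2$ with error at most $2^{-2m-2}$ after $m$ compositions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def tooth(x):
    # "Tooth" function g: [0,1] -> [0,1], equal to 2x on [0, 1/2] and
    # 2(1 - x) on [1/2, 1], written with three ReLUs.
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def approx_square(x, m):
    # f_m(x) = x - sum_{s=1}^m g_s(x) / 2^(2s), where g_s is the s-fold
    # composition of the tooth function. This corresponds to a ReLU network
    # of depth O(m) and approximates x^2 on [0,1] with error <= 2^(-2m-2).
    out = x.copy()
    g = x.copy()
    for s in range(1, m + 1):
        g = tooth(g)
        out -= g / 4.0**s
    return out

x = np.linspace(0.0, 1.0, 1001)
for m in (1, 3, 6):
    err = np.max(np.abs(approx_square(x, m) - x**2))
    print(f"m={m}: max error {err:.2e}  (bound {2.0**(-2*m - 2):.2e})")
```

Each extra composition adds a constant number of layers and units while cutting the error by a factor of four, which is where the $O(\ln(1/\epsilon))$ depth comes from.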
What is a partition of unity?
A collection of functions, each supported on a small patch of the domain, that sum to one at every point. In the paper the patches are evenly spaced overlapping cubes covering $[0,1]^d$, and each function is a product of one-dimensional piecewise linear bump functions.
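The paper's construction uses products of trapezoid-shaped one-dimensional bumps; the sketch below (a simplification, not the paper's exact construction) uses triangular bumps instead, which are also built from three ReLUs each and also sum to one on $[0,1]$:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hat(u):
    # Triangular bump max(0, 1 - |u|), written with three ReLUs.
    return relu(u + 1.0) - 2 * relu(u) + relu(u - 1.0)

def partition_of_unity(x, N):
    # N + 1 bumps centred on the grid points m/N, each supported on an
    # interval of width 2/N; together they form a partition of unity on [0,1].
    return np.stack([hat(N * x - m) for m in range(N + 1)])

x = np.linspace(0.0, 1.0, 501)
phis = partition_of_unity(x, N=5)
print(np.allclose(phis.sum(axis=0), 1.0))  # True: the bumps sum to one
```

In $d$ dimensions one takes products of such one-dimensional bumps, one per coordinate, indexed by a grid of $(N+1)^d$ points.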
Let $f \in L^\infty$. Recall that:
- $\Vert f \Vert _ {\mathcal W^{n,\infty}([0,1]^d)}= \max _ {\mathbf n:\, \vert \mathbf n \vert \le n}\operatorname*{ess\,sup} _ {\mathbf x\in[0,1]^d}\bigl \vert D^{\mathbf n} f(\mathbf x)\bigr \vert $ where $\mathbf n = (n _ 1, \ldots, n _ d) \in \{0, 1, \ldots\}^d, \vert \mathbf n \vert = n _ 1 + \cdots + n _ d$, i.e. the maximum size of any weak derivative of order up to $n$ over the domain, ignoring sets of measure $0$.
- $F _ {d, n} = \{f \in \mathcal W^{n, \infty}([0, 1]^d) : \Vert f \Vert _ {\mathcal W^{n, \infty}([0, 1]^d)} \le 1\}$
In this context, the key result of Yarotsky (2016) about the expressivity of ReLU networks is that for any $d, n$ and $\epsilon \in (0, 1)$, there is a ReLU network architecture that:
- Is capable of expressing any function from $F _ {d, n}$ with error $\epsilon$, and
- Has depth at most $c(\ln(1/\epsilon) + 1)$ and at most $c \epsilon^{-d/n}(\ln(1/\epsilon) + 1)$ weights and computation units, for some constant $c = c(d, n)$.
They also prove another result about how this bound changes if the network architecture is allowed to depend on the function $f$. @State this result.
For any $f \in F _ {1,1}$ and $\epsilon \in (0, 1/2)$, there exists a depth-$6$ ReLU network $\eta$ with architecture depending on $f$ that provides an $\epsilon$-approximation of $f$ while having not more than $\frac{c}{\epsilon \ln(1/\epsilon)}$ weights, connections and computation units where $c$ is an absolute constant.
Let $f \in L^\infty$. Recall that:
- $\Vert f \Vert _ {\mathcal W^{n,\infty}([0,1]^d)}= \max _ {\mathbf n:\, \vert \mathbf n \vert \le n}\operatorname*{ess\,sup} _ {\mathbf x\in[0,1]^d}\bigl \vert D^{\mathbf n} f(\mathbf x)\bigr \vert $ where $\mathbf n = (n _ 1, \ldots, n _ d) \in \{0, 1, \ldots\}^d, \vert \mathbf n \vert = n _ 1 + \cdots + n _ d$, i.e. the maximum size of any weak derivative of order up to $n$ over the domain, ignoring sets of measure $0$.
- $F _ {d, n} = \{f \in \mathcal W^{n, \infty}([0, 1]^d) : \Vert f \Vert _ {\mathcal W^{n, \infty}([0, 1]^d)} \le 1\}$
Using ∆de-vore-weight-lower-bound-continuous, how does Yarotsky (2016) find a lower bound for the number of weights in a ReLU network approximating some function $f$?
View a ReLU network architecture $\eta$ with $W$ weights as a map from the weight space $\mathbb R^W$ to functions on $[0,1]^d$. DeVore's result says that any approximation scheme whose $W$ parameters are chosen continuously in $f$ must incur error at least $c' W^{-n/d}$ for some $f \in F _ {d, n}$. Hence, if the network architecture is capable of expressing any function from $F _ {d, n}$ with error $\epsilon$ under the hypothesis of continuous weight selection, it must have at least $c\epsilon^{-d/n}$ weights.
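Spelled out as a chain of inequalities (a compressed paraphrase, not the paper's exact statement; here $w : F _ {d, n} \to \mathbb R^W$ is the assumed continuous weight-selection map and $c'$ the constant from DeVore's bound):

$$\epsilon \;\ge\; \sup _ {f \in F _ {d, n}} \Vert f - \eta(w(f)) \Vert _ {\infty} \;\ge\; c' W^{-n/d} \quad\Longrightarrow\quad W \;\ge\; c\, \epsilon^{-d/n}, \qquad c = (c')^{d/n}.$$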
Questions
- The networks that are being considered to prove these upper bounds have a very special format. How similar are they to the networks that actually end up solving the problem after training?
- Is it possible to weaken the assumption that the functions being approximated live in $F _ {n, d}$?