Notes - Optimisation for Data Science HT25, Accelerated methods



Flashcards

Heavy ball method

Suppose we wish to minimise the convex quadratic function

$$f(x) = \tfrac{1}{2}x^\top A x - b^\top x$$

where $A$ is a symmetric matrix with eigenvalues in $[\gamma, L]$.

In this context, @define the general template for iterative updates used in the heavy ball method.


$$x_{k+1} = x_k - \alpha_k \nabla f(x_k) + \beta_k (x_k - x_{k-1})$$

where $\beta_k > 0$ is some constant (and initially, a steepest descent update is performed to avoid the need for two starting points).
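
A minimal numpy sketch of this template (the toy quadratic and the particular constant choices of `alpha` and `beta` below are illustrative assumptions; the next cards give a principled choice of constants):

```python
import numpy as np

def heavy_ball(grad, x0, alpha, beta, iters=100):
    """Heavy ball template: x_{k+1} = x_k - alpha*grad(x_k) + beta*(x_k - x_{k-1}).

    The first update is plain steepest descent, so only one start point is needed.
    """
    x_prev = x0
    x = x0 - alpha * grad(x0)  # initial steepest descent step
    for _ in range(iters):
        x, x_prev = x - alpha * grad(x) + beta * (x - x_prev), x
    return x

# Toy quadratic f(x) = 0.5 x^T A x - b^T x with eigenvalues in [gamma, L] = [1, 3].
A = np.diag([3.0, 1.0])
b = np.array([1.0, 1.0])
x = heavy_ball(lambda x: A @ x - b, np.zeros(2), alpha=0.25, beta=0.3)
print(x, np.linalg.solve(A, b))  # final iterate vs exact minimiser
```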

Suppose we wish to minimise the convex quadratic function

$$f(x) = \tfrac{1}{2}x^\top A x - b^\top x$$

where $A$ is a symmetric matrix with eigenvalues in $[\gamma, L]$. In the heavy ball method, we perform updates of the form

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k) + \beta_k (x_k - x_{k-1})$$

@State a theorem about the convergence of the heavy ball method.


There exists a constant $C > 0$ such that the heavy ball method, applied with the constant step lengths

$$\alpha_k = \frac{4}{(\sqrt{L} + \sqrt{\gamma})^2}, \qquad \beta_k = \beta = \frac{\sqrt{L} - \sqrt{\gamma}}{\sqrt{L} + \sqrt{\gamma}},$$

satisfies

$$\|x_k - x^*\| \le C\beta^k$$

for all $k$.

Suppose we wish to minimise the convex quadratic function

$$f(x) = \tfrac{1}{2}x^\top A x - b^\top x$$

where $A$ is a symmetric matrix with eigenvalues in $[\gamma, L]$. In the heavy ball method, we perform updates of the form

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k) + \beta_k (x_k - x_{k-1})$$

There is a result that says there is some constant $C > 0$ such that the heavy ball method, applied with the constant step lengths

$$\alpha_k = \frac{4}{(\sqrt{L} + \sqrt{\gamma})^2}, \qquad \beta_k = \beta = \frac{\sqrt{L} - \sqrt{\gamma}}{\sqrt{L} + \sqrt{\gamma}},$$

satisfies

$$\|x_k - x^*\| \le C\beta^k$$

for all $k$. This gives a result about the convergence of $x_k$ to $x^*$. Can you @State and @prove a theorem about the convergence of the objective value, possibly under some additional assumptions?


$$f(x_k) - f(x^*) \overset{(1)}{\le} \nabla f(x^*)^\top (x_k - x^*) + \frac{L}{2}\|x_k - x^*\|^2 \overset{(2)}{\le} \frac{LC^2}{2}\beta^{2k} \overset{(3)}{\approx} \frac{LC^2}{2}\left(1 - 2\sqrt{\frac{\gamma}{L}}\right)^{2k}$$

where:

  • (1) follows from the fact that $f$ is $L$-smooth
  • (2) follows from the fact that $\nabla f(x^*) = 0$ and from the result above
  • (3) follows from the approximation $\beta \approx 1 - 2\sqrt{\gamma/L}$, valid when $L \gg \gamma$
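
As a numerical sanity check of this bound (a sketch only; the random test problem and all sizes below are my assumptions), the ratio $(f(x_k) - f(x^*))/\beta^{2k}$ should remain bounded along the iterates:

```python
import numpy as np

# Random quadratic f(x) = 0.5 x^T A x - b^T x with eigenvalues in [gamma, L].
rng = np.random.default_rng(0)
gamma, L, n = 1.0, 100.0, 50
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.linspace(gamma, L, n)) @ Q.T
b = rng.standard_normal(n)
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
x_star = np.linalg.solve(A, b)

# Constant step lengths from the result above.
alpha = 4.0 / (np.sqrt(L) + np.sqrt(gamma)) ** 2
beta = (np.sqrt(L) - np.sqrt(gamma)) / (np.sqrt(L) + np.sqrt(gamma))

x_prev = np.zeros(n)
x = x_prev - alpha * grad(x_prev)  # initial steepest descent step
for k in range(1, 41):
    x, x_prev = x - alpha * grad(x) + beta * (x - x_prev), x
    if k % 10 == 0:
        # a bounded ratio is consistent with f(x_k) - f(x*) <= (L C^2 / 2) beta^(2k)
        print(k, (f(x) - f(x_star)) / beta ** (2 * k))
```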

Nesterov acceleration for L-smooth γ-strongly convex functions

Suppose we wish to minimise a strongly convex objective $f(x)$:

$$\min_{x \in \mathbb{R}^n} f(x)$$

In this context, @define the general template of the iterative updates used in Nesterov’s accelerated gradient method.


$$x_{k+1} = x_k - \alpha_k \nabla f\big(x_k + \beta_k(x_k - x_{k-1})\big) + \beta_k(x_k - x_{k-1})$$

Suppose we wish to minimise a $\gamma$-strongly convex and $L$-smooth objective $f(x)$:

$$\min_{x \in \mathbb{R}^n} f(x)$$

In this context, Nesterov’s accelerated gradient method uses updates of the form

$$x_{k+1} = x_k - \alpha_k \nabla f\big(x_k + \beta_k(x_k - x_{k-1})\big) + \beta_k(x_k - x_{k-1})$$

@State candidate constant step lengths and give a result about the convergence of f(xk) in this case.


$$\alpha_k = \frac{1}{L}, \qquad \beta_k = \frac{\sqrt{L} - \sqrt{\gamma}}{\sqrt{L} + \sqrt{\gamma}}$$

gives

$$f(x_k) - f(x^*) \le \frac{L + \gamma}{2}\left(1 - \sqrt{\frac{\gamma}{L}}\right)^k \|x_0 - x^*\|^2$$

for all $k \ge 1$.
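
A minimal numpy sketch of this variant with the constant step lengths above (the random quadratic test problem is an illustrative assumption):

```python
import numpy as np

# Random gamma-strongly convex, L-smooth quadratic.
rng = np.random.default_rng(0)
gamma, L, n = 1.0, 100.0, 50
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.linspace(gamma, L, n)) @ Q.T
b = rng.standard_normal(n)
grad = lambda x: A @ x - b

alpha = 1.0 / L
beta = (np.sqrt(L) - np.sqrt(gamma)) / (np.sqrt(L) + np.sqrt(gamma))

x_prev = np.zeros(n)
x = x_prev.copy()
for k in range(300):
    y = x + beta * (x - x_prev)         # extrapolation point x_k + beta*(x_k - x_{k-1})
    x, x_prev = y - alpha * grad(y), x  # gradient step taken at y
print(np.linalg.norm(A @ x - b))        # gradient norm should be near zero
```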

Nesterov acceleration for L-smooth convex functions

@State the @algorithm corresponding to Nesterov acceleration for the minimisation of an $L$-smooth convex function $f$ at start point $y_0$.


  • Initialisation:
    • Find $z \neq y_0 \in \mathbb{R}^n$ and set $\alpha_{-1} := \frac{\|y_0 - z\|}{\|\nabla f(y_0) - \nabla f(z)\|}$
    • $k := 0$
    • $\lambda_0 := 1$
    • $x_{-1} := y_0$
  • Main body: While $\|x_k - x_{k-1}\| > \mathrm{tol}$:
    • Find the minimal $i \in \mathbb{N}$ such that
      $f\big(y_k - 2^{-i}\alpha_{k-1}\nabla f(y_k)\big) \le f(y_k) - 2^{-(i+1)}\alpha_{k-1}\|\nabla f(y_k)\|^2$
    • $\alpha_k := 2^{-i}\alpha_{k-1}$
    • $x_k := y_k - \alpha_k \nabla f(y_k)$
    • $\lambda_{k+1} := \frac{1}{2}\left(1 + \sqrt{4\lambda_k^2 + 1}\right)$
    • $y_{k+1} := x_k + \frac{\lambda_k - 1}{\lambda_{k+1}}(x_k - x_{k-1})$
    • $k := k + 1$

Intuitively, this algorithm produces two sequences of iterates, $x_k$ and $y_k$. The points $x_k$ are obtained as the minimisers of a quadratic upper bound function (specifically, the model $m^u_{k,\alpha_k}$ discussed below), and the $y_k$ are approximate second-order corrections to the $x_k$. A runnable sketch of the whole algorithm follows.
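
The following Python sketch implements the algorithm as stated (the function and argument names are mine, and the quadratic test problem at the bottom is an assumption):

```python
import numpy as np

def nesterov(f, grad, y0, z, tol=1e-8, max_iter=5000):
    """Nesterov acceleration for an L-smooth convex f, with the backtracking
    step-length rule from the notes; a sketch, not a tuned implementation."""
    # alpha_{-1} := ||y0 - z|| / ||grad f(y0) - grad f(z)|| estimates 1/L
    alpha = np.linalg.norm(y0 - z) / np.linalg.norm(grad(y0) - grad(z))
    lam = 1.0
    x_prev, y = y0, y0
    for _ in range(max_iter):
        g = grad(y)
        # halve the step until the sufficient decrease condition holds,
        # i.e. find the minimal i with alpha_k = 2^{-i} alpha_{k-1}
        while f(y - alpha * g) > f(y) - 0.5 * alpha * np.dot(g, g):
            alpha *= 0.5
        x = y - alpha * g                              # x_k := y_k - alpha_k grad f(y_k)
        lam_next = 0.5 * (1.0 + np.sqrt(4.0 * lam**2 + 1.0))
        y = x + (lam - 1.0) / lam_next * (x - x_prev)  # extrapolated point y_{k+1}
        if np.linalg.norm(x - x_prev) <= tol:
            return x
        x_prev, lam = x, lam_next
    return x_prev

# Illustrative use on a smooth convex quadratic:
A = np.diag([1.0, 10.0, 100.0])
b = np.ones(3)
sol = nesterov(lambda x: 0.5 * x @ A @ x - b @ x, lambda x: A @ x - b,
               y0=np.zeros(3), z=np.ones(3))
print(sol, np.linalg.solve(A, b))  # computed vs exact minimiser
```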

Nesterov’s acceleration method for the minimisation of an $L$-smooth convex function $f$ at start point $y_0$ proceeds as follows:

  • Initialisation:
    • Find $z \neq y_0 \in \mathbb{R}^n$ and set $\alpha_{-1} := \frac{\|y_0 - z\|}{\|\nabla f(y_0) - \nabla f(z)\|}$
    • $k := 0$
    • $\lambda_0 := 1$
    • $x_{-1} := y_0$
  • Main body: While $\|x_k - x_{k-1}\| > \mathrm{tol}$:
    • Find the minimal $i \in \mathbb{N}$ such that
      $f\big(y_k - 2^{-i}\alpha_{k-1}\nabla f(y_k)\big) \le f(y_k) - 2^{-(i+1)}\alpha_{k-1}\|\nabla f(y_k)\|^2$
    • $\alpha_k := 2^{-i}\alpha_{k-1}$
    • $x_k := y_k - \alpha_k \nabla f(y_k)$
    • $\lambda_{k+1} := \frac{1}{2}\left(1 + \sqrt{4\lambda_k^2 + 1}\right)$
    • $y_{k+1} := x_k + \frac{\lambda_k - 1}{\lambda_{k+1}}(x_k - x_{k-1})$
    • $k := k + 1$

@Prove that $\alpha_k$ is non-increasing and that it stops decreasing once $\alpha_k \le L^{-1}$, and then motivate this process.


It is non-increasing because of the step $\alpha_k := 2^{-i}\alpha_{k-1}$ with $i \ge 0$. Once $\alpha_{k-1} \le L^{-1}$, $L$-smoothness gives

$$f(y_k - \alpha_{k-1}\nabla f(y_k)) \le f(y_k) + \langle \nabla f(y_k), -\alpha_{k-1}\nabla f(y_k)\rangle + \frac{L}{2}\|\alpha_{k-1}\nabla f(y_k)\|^2 \le f(y_k) - \frac{\alpha_{k-1}}{2}\|\nabla f(y_k)\|^2,$$

where the last inequality uses $\frac{L\alpha_{k-1}^2}{2} \le \frac{\alpha_{k-1}}{2}$. Therefore the condition that

$$f\big(y_k - 2^{-i}\alpha_{k-1}\nabla f(y_k)\big) \le f(y_k) - 2^{-(i+1)}\alpha_{k-1}\|\nabla f(y_k)\|^2$$

is satisfied with $i = 0$ and hence $\alpha_k = \alpha_{k-1}$ remains the same.

If $L$ were known in advance, $\alpha_k$ could be set to the optimal value $\alpha = \frac{1}{L}$ from the start, but this algorithm learns the value of $L$ over time. $\alpha \le L^{-1}$ is the condition needed to ensure that

$$m^u_{k,\alpha}(y) = f(y_k) + \langle \nabla f(y_k), y - y_k\rangle + \frac{1}{2\alpha}\|y - y_k\|^2$$

is an upper bound on $f(y)$.
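
To spell out the upper-bound claim: by $L$-smoothness, for all $y$,

$$f(y) \le f(y_k) + \langle \nabla f(y_k), y - y_k\rangle + \frac{L}{2}\|y - y_k\|^2 \le m^u_{k,\alpha}(y),$$

where the second inequality holds because $\frac{L}{2} \le \frac{1}{2\alpha}$ whenever $\alpha \le L^{-1}$.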

Nesterov’s acceleration method for the minimisation of an $L$-smooth convex function $f$ at start point $y_0$ proceeds as follows:

  • Initialisation:
    • Find $z \neq y_0 \in \mathbb{R}^n$ and set $\alpha_{-1} := \frac{\|y_0 - z\|}{\|\nabla f(y_0) - \nabla f(z)\|}$
    • $k := 0$
    • $\lambda_0 := 1$
    • $x_{-1} := y_0$
  • Main body: While $\|x_k - x_{k-1}\| > \mathrm{tol}$:
    • Find the minimal $i \in \mathbb{N}$ such that
      $f\big(y_k - 2^{-i}\alpha_{k-1}\nabla f(y_k)\big) \le f(y_k) - 2^{-(i+1)}\alpha_{k-1}\|\nabla f(y_k)\|^2$
    • $\alpha_k := 2^{-i}\alpha_{k-1}$
    • $x_k := y_k - \alpha_k \nabla f(y_k)$
    • $\lambda_{k+1} := \frac{1}{2}\left(1 + \sqrt{4\lambda_k^2 + 1}\right)$
    • $y_{k+1} := x_k + \frac{\lambda_k - 1}{\lambda_{k+1}}(x_k - x_{k-1})$
    • $k := k + 1$

Can you @justify the update for $x_k$:

  • $x_k := y_k - \alpha_k \nabla f(y_k)$

$x_k$ is the global minimiser of the function

$$\tilde{f}(y) = f(y_k) + \nabla f(y_k)^\top (y - y_k) + \frac{1}{2\alpha_k}\|y - y_k\|^2$$

which is an upper bound on $f$ given that $\alpha_k \le L^{-1}$ (straightforwardly checked by differentiating).
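
Explicitly, $\nabla \tilde{f}(y) = \nabla f(y_k) + \frac{1}{\alpha_k}(y - y_k)$, which vanishes precisely at $y = y_k - \alpha_k \nabla f(y_k) = x_k$; since $\tilde{f}$ is a strictly convex quadratic, this stationary point is its global minimiser.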

Nesterov’s acceleration method for the minimisation of an $L$-smooth convex function $f$ at start point $y_0$ proceeds as follows:

  • Initialisation:
    • Find $z \neq y_0 \in \mathbb{R}^n$ and set $\alpha_{-1} := \frac{\|y_0 - z\|}{\|\nabla f(y_0) - \nabla f(z)\|}$
    • $k := 0$
    • $\lambda_0 := 1$
    • $x_{-1} := y_0$
  • Main body: While $\|x_k - x_{k-1}\| > \mathrm{tol}$:
    • Find the minimal $i \in \mathbb{N}$ such that
      $f\big(y_k - 2^{-i}\alpha_{k-1}\nabla f(y_k)\big) \le f(y_k) - 2^{-(i+1)}\alpha_{k-1}\|\nabla f(y_k)\|^2$
    • $\alpha_k := 2^{-i}\alpha_{k-1}$
    • $x_k := y_k - \alpha_k \nabla f(y_k)$
    • $\lambda_{k+1} := \frac{1}{2}\left(1 + \sqrt{4\lambda_k^2 + 1}\right)$
    • $y_{k+1} := x_k + \frac{\lambda_k - 1}{\lambda_{k+1}}(x_k - x_{k-1})$
    • $k := k + 1$

@State a result related to its convergence.


Suppose:

  • $f : \mathbb{R}^n \to \mathbb{R}$ is a convex $L$-smooth function
  • $f$ has at least one finite minimiser $x^* \in \operatorname{argmin} f(x)$
  • $f$ has a finite minimum $f^* = \min f(x)$

Then for all $k \ge 1$:

$$f(x_k) - f(x^*) \le \frac{4L\|x_0 - x^*\|^2}{(k+2)^2}$$
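
To see where the $\varepsilon$-optimality threshold quoted in the next card comes from: the right-hand side is at most $\varepsilon$ as soon as $(k+2)^2 \ge 4L\|x_0 - x^*\|^2/\varepsilon$, that is,

$$k \ge 2\sqrt{L}\,\|x_0 - x^*\|\,\varepsilon^{-1/2} - 2.$$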

Nesterov’s acceleration method for the minimisation of an $L$-smooth convex function $f$ at start point $y_0$ proceeds as follows:

  • Initialisation:
    • Find $z \neq y_0 \in \mathbb{R}^n$ and set $\alpha_{-1} := \frac{\|y_0 - z\|}{\|\nabla f(y_0) - \nabla f(z)\|}$
    • $k := 0$
    • $\lambda_0 := 1$
    • $x_{-1} := y_0$
  • Main body: While $\|x_k - x_{k-1}\| > \mathrm{tol}$:
    • Find the minimal $i \in \mathbb{N}$ such that
      $f\big(y_k - 2^{-i}\alpha_{k-1}\nabla f(y_k)\big) \le f(y_k) - 2^{-(i+1)}\alpha_{k-1}\|\nabla f(y_k)\|^2$
    • $\alpha_k := 2^{-i}\alpha_{k-1}$
    • $x_k := y_k - \alpha_k \nabla f(y_k)$
    • $\lambda_{k+1} := \frac{1}{2}\left(1 + \sqrt{4\lambda_k^2 + 1}\right)$
    • $y_{k+1} := x_k + \frac{\lambda_k - 1}{\lambda_{k+1}}(x_k - x_{k-1})$
    • $k := k + 1$

@Prove that if

  • $f : \mathbb{R}^n \to \mathbb{R}$ is a convex $L$-smooth function
  • $f$ has at least one finite minimiser $x^* \in \operatorname{argmin} f(x)$
  • $f$ has a finite minimum $f^* = \min f(x)$

then:

  • The sequence of iterates $(x_k)_{k \in \mathbb{N}}$ satisfies $f(x_k) \le f^* + \frac{4L\|x_0 - x^*\|^2}{(k+2)^2}$ for all $k \in \mathbb{N}$, so
  • $x_k$ is $\varepsilon$-optimal for all $k \ge N(\varepsilon) = \left\lceil 2\sqrt{L}\,\|x_0 - x^*\|\,\varepsilon^{-1/2} \right\rceil$

It is easiest to analyse the algorithm in terms of

$$p_k = (\lambda_k - 1)(x_{k-1} - x_k)$$

We have:

$$y_{k+1} = x_k + \frac{\lambda_k - 1}{\lambda_{k+1}}(x_k - x_{k-1})$$

and

$$x_{k+1} = y_{k+1} - \alpha_{k+1}\nabla f(y_{k+1}) = x_k + \frac{\lambda_k - 1}{\lambda_{k+1}}(x_k - x_{k-1}) - \alpha_{k+1}\nabla f(y_{k+1})$$

Hence, considering $p_k$, we have

$$\begin{aligned} p_{k+1} - x_{k+1} &= (\lambda_{k+1} - 1)(x_k - x_{k+1}) - x_{k+1} \\ &= \lambda_{k+1}(x_k - x_{k+1}) - x_k \\ &= (1 - \lambda_k)(x_k - x_{k-1}) - x_k + \lambda_{k+1}\alpha_{k+1}\nabla f(y_{k+1}) \\ &\overset{(1)}{=} p_k - x_k + \lambda_{k+1}\alpha_{k+1}\nabla f(y_{k+1}) \end{aligned}$$

where the third equality substitutes the update for $x_{k+1}$ above, and (1) follows from $p_k = (1 - \lambda_k)(x_k - x_{k-1})$. Hence

$$\begin{aligned} \|p_{k+1} - x_{k+1} + x^*\|^2 &= \|p_k - x_k + x^*\|^2 + \lambda_{k+1}^2\alpha_{k+1}^2\|\nabla f(y_{k+1})\|^2 + 2\lambda_{k+1}\alpha_{k+1}\langle \nabla f(y_{k+1}), p_k - x_k + x^*\rangle \\ &= \|p_k - x_k + x^*\|^2 + \lambda_{k+1}^2\alpha_{k+1}^2\|\nabla f(y_{k+1})\|^2 + 2(\lambda_{k+1} - 1)\alpha_{k+1}\langle \nabla f(y_{k+1}), p_k\rangle + 2\lambda_{k+1}\alpha_{k+1}\langle \nabla f(y_{k+1}), x^* - x_k + \lambda_{k+1}^{-1}p_k\rangle \\ &= \|p_k - x_k + x^*\|^2 + \lambda_{k+1}^2\alpha_{k+1}^2\|\nabla f(y_{k+1})\|^2 + 2(\lambda_{k+1} - 1)\alpha_{k+1}\langle \nabla f(y_{k+1}), p_k\rangle + 2\lambda_{k+1}\alpha_{k+1}\langle \nabla f(y_{k+1}), x^* - y_{k+1}\rangle \end{aligned}$$

where the last equality follows from the identity

$$y_{k+1} = x_k + \frac{\lambda_k - 1}{\lambda_{k+1}}(x_k - x_{k-1}) = x_k - \frac{1}{\lambda_{k+1}}p_k$$

Now we wish to derive bounds on the last two terms on the right hand side of the big equation.

To bound $2\lambda_{k+1}\alpha_{k+1}\langle \nabla f(y_{k+1}), x^* - y_{k+1}\rangle$ (the second of the terms), note that by the convexity of $f$, we have

$$f(y_{k+1}) - f^* \le -\langle \nabla f(y_{k+1}), x^* - y_{k+1}\rangle$$

The sufficient decrease condition yields that

$$f(y_{k+1}) - f(x_{k+1}) \ge \frac{\alpha_{k+1}}{2}\|\nabla f(y_{k+1})\|^2$$

and so substituting this into the convexity inequality, we have that

$$\langle \nabla f(y_{k+1}), x^* - y_{k+1}\rangle \le -\left(f(x_{k+1}) - f^* + \frac{\alpha_{k+1}}{2}\|\nabla f(y_{k+1})\|^2\right)$$

To bound $2(\lambda_{k+1} - 1)\alpha_{k+1}\langle \nabla f(y_{k+1}), p_k\rangle$ (the first of the terms), note that by the definition of $y_{k+1}$, we have that

$$y_{k+1} = x_k + \frac{\lambda_k - 1}{\lambda_{k+1}}(x_k - x_{k-1}) = x_k - \lambda_{k+1}^{-1}p_k$$

And then convexity implies that

$$f(y_{k+1}) + \lambda_{k+1}^{-1}\langle \nabla f(y_{k+1}), p_k\rangle \le f(x_k)$$

so by $f(y_{k+1}) - f(x_{k+1}) \ge \frac{\alpha_{k+1}}{2}\|\nabla f(y_{k+1})\|^2$ above, we see that

$$\frac{\alpha_{k+1}}{2}\|\nabla f(y_{k+1})\|^2 \le f(x_k) - f(x_{k+1}) - \lambda_{k+1}^{-1}\langle \nabla f(y_{k+1}), p_k\rangle$$

Combining everything so far, we have that

$$\begin{aligned} \|p_{k+1} - x_{k+1} + x^*\|^2 &\le \|p_k - x_k + x^*\|^2 + 2(\lambda_{k+1} - 1)\alpha_{k+1}\langle \nabla f(y_{k+1}), p_k\rangle - 2\lambda_{k+1}\alpha_{k+1}(f(x_{k+1}) - f^*) + (\lambda_{k+1}^2 - \lambda_{k+1})\alpha_{k+1}^2\|\nabla f(y_{k+1})\|^2 \\ &\le \|p_k - x_k + x^*\|^2 - 2\lambda_{k+1}\alpha_{k+1}(f(x_{k+1}) - f^*) + 2(\lambda_{k+1}^2 - \lambda_{k+1})\alpha_{k+1}(f(x_k) - f(x_{k+1})) \\ &= \|p_k - x_k + x^*\|^2 + 2\alpha_{k+1}\lambda_k^2(f(x_k) - f^*) - 2\alpha_{k+1}\lambda_{k+1}^2(f(x_{k+1}) - f^*) \\ &\le \|p_k - x_k + x^*\|^2 + 2\alpha_k\lambda_k^2(f(x_k) - f^*) - 2\alpha_{k+1}\lambda_{k+1}^2(f(x_{k+1}) - f^*) \end{aligned}$$

Here the first inequality applies the bound on $\langle \nabla f(y_{k+1}), x^* - y_{k+1}\rangle$; the second applies the bound on $\langle \nabla f(y_{k+1}), p_k\rangle$, which makes the $(\lambda_{k+1}^2 - \lambda_{k+1})\alpha_{k+1}^2\|\nabla f(y_{k+1})\|^2$ terms cancel exactly; the equality uses $\lambda_{k+1}^2 - \lambda_{k+1} = \lambda_k^2$, which follows from the update $\lambda_{k+1} := \frac{1}{2}(1 + \sqrt{4\lambda_k^2 + 1})$; and the final inequality uses $\alpha_{k+1} \le \alpha_k$ together with $f(x_k) \ge f^*$.

Applying this inequality iteratively, we find

$$2\alpha_{k+1}\lambda_{k+1}^2(f(x_{k+1}) - f^*) \le 2\alpha_{k+1}\lambda_{k+1}^2(f(x_{k+1}) - f^*) + \|p_{k+1} - x_{k+1} + x^*\|^2 \le 2\alpha_k\lambda_k^2(f(x_k) - f^*) + \|p_k - x_k + x^*\|^2 \le \cdots \le 2\alpha_0\lambda_0^2(f(x_0) - f^*) + \|p_0 - x_0 + x^*\|^2 \le \|y_0 - x^*\|^2$$

where the last inequality follows from $\lambda_0 = 1$, $p_0 = (\lambda_0 - 1)(x_{-1} - x_0) = 0$, and

$$\begin{aligned} \|p_0 - x_0 + x^*\|^2 &= \|x_0 - x^*\|^2 = \|y_0 - \alpha_0 \nabla f(y_0) - x^*\|^2 \\ &= \|y_0 - x^*\|^2 + \alpha_0^2\|\nabla f(y_0)\|^2 - 2\alpha_0\langle \nabla f(y_0), y_0 - x^*\rangle \\ &\le \|y_0 - x^*\|^2 + \alpha_0^2\|\nabla f(y_0)\|^2 - 2\alpha_0\left(f(x_0) - f^* + \frac{\alpha_0}{2}\|\nabla f(y_0)\|^2\right) \\ &= \|y_0 - x^*\|^2 - 2\alpha_0\lambda_0^2(f(x_0) - f^*) \end{aligned}$$

where the inequality uses $\langle \nabla f(y_0), y_0 - x^*\rangle \ge f(y_0) - f^* \ge f(x_0) - f^* + \frac{\alpha_0}{2}\|\nabla f(y_0)\|^2$ (convexity, then sufficient decrease).

Since $\lambda_{k+1} \ge 1 + \frac{k+1}{2}$ (by induction, as $\lambda_{j+1} \ge \lambda_j + \frac{1}{2}$ for every $j$), this implies

$$2\alpha_{k+1}\left(1 + \frac{k+1}{2}\right)^2(f(x_{k+1}) - f^*) \le \|y_0 - x^*\|^2$$

and since the backtracking guarantees $\alpha_{k+1} \ge \frac{1}{2L}$ (the initialisation gives $\alpha_{-1} \ge L^{-1}$ by $L$-smoothness, and a step length is only ever halved from a value exceeding $L^{-1}$), we obtain

$$f(x_{k+1}) - f^* \le \frac{4L\|y_0 - x^*\|^2}{(k+3)^2}$$

as required.



