Optimizations
Optimization aims to find
\[w^*=\mbox{argmin}_w L(w)\]where \(w\) denotes the weights of the model and \(L\) is the loss.
There are two ways to compute the gradient:
- Numeric approach - approximate each partial derivative directly from its limit definition, using finite differences.
- Analytic approach - derive \(\nabla_w L\) in closed form using calculus.
In practice the analytic approach is always used, but the numeric gradient is useful for checking that the analytic gradient is correct.
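As a concrete illustration, here is a minimal central-difference sketch of the numeric approach; `numeric_grad` and the quadratic test loss are illustrative choices, not library functions:
```python
import numpy as np

def numeric_grad(f, w, h=1e-5):
    # Approximate each partial derivative of f at w with central differences:
    # dL/dw_i ~ (f(w + h*e_i) - f(w - h*e_i)) / (2h)
    grad = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = h
        grad[i] = (f(w + e) - f(w - e)) / (2 * h)
    return grad

# Example: L(w) = ||w||^2 has analytic gradient 2w; the two should agree.
L = lambda w: np.sum(w ** 2)
w = np.random.randn(5)
print(np.allclose(numeric_grad(L, w), 2 * w, atol=1e-6))  # True
```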
PyTorch packages this check as `torch.autograd.gradcheck`:
```python
import torch
from torch.autograd import Function, gradcheck

class MyFunction(Function):
    # A customized autograd function; here, a linear map y = x W^T.
    @staticmethod
    def forward(ctx, input, weight):
        ctx.save_for_backward(input, weight)
        return input.mm(weight.t())

    @staticmethod
    def backward(ctx, grad_output):
        input, weight = ctx.saved_tensors
        grad_input = grad_output.mm(weight)      # dL/dx = (dL/dy) W
        grad_weight = grad_output.t().mm(input)  # dL/dW = (dL/dy)^T x
        return grad_input, grad_weight

my_func = MyFunction.apply

# gradcheck takes a tuple of tensors as input, checks whether the analytic
# gradients evaluated at these tensors are close enough to numerical
# approximations, and returns True if they all verify this condition.
input = (torch.randn(20, 20, dtype=torch.double, requires_grad=True),
         torch.randn(30, 20, dtype=torch.double, requires_grad=True))
test = gradcheck(my_func, input, eps=1e-6, atol=1e-4)
print(test)  # True
```
For more info see the PyTorch autograd documentation.
GD
Gradient descent (GD) refers to the general optimization method of updating the model's parameters in the direction of steepest descent of the loss.
GD requires all data to be loaded at the same time:
\[L(W)=\frac{1}{N}\sum_{i=1}^{N}L_i(x_i,y_i,W)+\lambda R(W)\\ \nabla_WL(W)=\frac{1}{N}\sum_{i=1}^{N}\nabla_WL_i(x_i,y_i,W)+\lambda \nabla_WR(W)\\\]Full sum is expensive when \(N\) is large.
Minibatch gradient descent instead uses a small batch of examples to approximate the sum.
SGD
Stochastic gradient descent approximates the expectation via sampling:
\[L(W)=\mathbb{E}_{(x,y) \sim p_{data}}[L(x,y,W)]+\lambda R(W)\\ \approx\frac{1}{N}\sum_{i=1}^{N}L(x_i,y_i,W)+\lambda R(W)\\\] \[\nabla_WL(W)=\nabla_W\mathbb{E}_{(x,y) \sim p_{data}}[L(x,y,W)]+\lambda \nabla_WR(W)\\ \approx\frac{1}{N}\sum_{i=1}^{N}\nabla_WL(x_i,y_i,W)+\lambda \nabla_WR(W)\\\]where the \(N\) examples now form a sampled minibatch. One can prove that this minibatch gradient is an unbiased estimator of the full gradient.
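A quick numerical check of this unbiasedness on a least-squares loss; the data, batch size, and number of sampled batches below are illustrative assumptions:
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 5))
y = rng.normal(size=1_000)
w = rng.normal(size=5)

def grad(Xb, yb, w):
    # Gradient of the mean squared error (1/2N) * ||Xb w - yb||^2
    return Xb.T @ (Xb @ w - yb) / len(Xb)

full = grad(X, y, w)
# Average many minibatch gradients: the mean approaches the full gradient.
batches = [rng.integers(0, 1_000, size=10) for _ in range(20_000)]
mini = np.mean([grad(X[i], y[i], w) for i in batches], axis=0)
print(np.max(np.abs(mini - full)))  # small, shrinking as more batches are averaged
```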
Recap: Unbiased estimator, and why the definition of sample variance has \(n-1\) instead of \(n\)
https://en.wikipedia.org/wiki/Bias_of_an_estimator
Suppose \(X_1,...,X_n\) are independent and identically distributed (i.i.d.) random variables with expectation \(\mu\) and variance \(\sigma^2\).
Let the sample mean be \(\overline{X}=\frac{1}{n}\sum_{i=1}^{n}X_i\)
and the (naive) sample variance be \(S^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i-\overline{X})^2\).
Note that:
- \(\overline{X}-\mu\) does not depend on \(i\), so \(\sum_{i=1}^{n}(\overline{X}-\mu)=n(\overline{X}-\mu)\);
- \(\overline{X}-\mu=\frac{1}{n}\sum_{i=1}^{n}(X_i-\mu)\);
- by the Bienaymé formula, \(\mbox{Var}(\overline{X})=\frac{\sigma^2}{n}\), where the definition of variance is \(\mbox{Var}(X)= E[(X-\mu)^2]\).
Then
\[\begin{align*} E[S^2]&=E[\frac{1}{n}\sum_{i=1}^{n}(X_i-\overline{X})^2]\\ &= E[\frac{1}{n}\sum_{i=1}^{n}((X_i-\mu) - (\overline{X} - \mu))^2] \\ &= E[\frac{1}{n}\sum_{i=1}^{n}(X_i-\mu)^2 - \frac{2}{n}(\overline{X} - \mu)\sum_{i=1}^{n}(X_i-\mu) + \frac{1}{n}\sum_{i=1}^{n}(\overline{X}-\mu)^2] \\ &= E[\frac{1}{n}\sum_{i=1}^{n}(X_i-\mu)^2 - \frac{2}{n}(\overline{X} - \mu)\sum_{i=1}^{n}(X_i-\mu) + (\overline{X}-\mu)^2] \\ &= E[\frac{1}{n}\sum_{i=1}^{n}(X_i-\mu)^2 - \frac{2}{n}(\overline{X} - \mu)\cdot n(\overline{X} - \mu) + (\overline{X}-\mu)^2] \\ &= E[\frac{1}{n}\sum_{i=1}^{n}(X_i-\mu)^2 - (\overline{X}-\mu)^2] \\ &= E[\frac{1}{n}\sum_{i=1}^{n}(X_i-\mu)^2] - E[(\overline{X}-\mu)^2] \\ &= \sigma^2 - \mbox{Var}(\overline{X}) = \sigma^2 - \frac{\sigma^2}{n} = (1-\frac{1}{n})\sigma^2 < \sigma^2 \end{align*}\]Therefore, \(S^2\) is a biased estimator of \(\sigma^2\): it underestimates by a factor of \(\frac{n-1}{n}\). We counter the bias by multiplying \(S^2\) by \(\frac{n}{n-1}\), which yields the unbiased sample variance \(\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\overline{X})^2\).
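A quick simulation of this bias; the sample size, distribution, and trial count are arbitrary illustrative choices:
```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, sigma2 = 5, 200_000, 1.0
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))

naive = samples.var(axis=1, ddof=0).mean()      # divides by n
corrected = samples.var(axis=1, ddof=1).mean()  # divides by n-1
print(naive)      # ~ (1 - 1/n) * sigma2 = 0.8
print(corrected)  # ~ sigma2 = 1.0
```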
Why SGD over GD: SGD updates the parameters far more frequently than GD, and therefore typically makes progress much faster. If a dataset has 10,000 examples, GD must process all 10,000 examples for a single update (slow), while SGD with a batch size of 10 makes each update 1,000x cheaper, performing 1,000 updates in one pass over the same data. A minimal update-loop sketch follows.
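This sketch runs minibatch SGD on least-squares linear regression; all names and hyperparameters (`X`, `y`, `lr`, `batch`) are illustrative assumptions, not tuned values:
```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic regression data: 10,000 examples, 20 features
X = rng.normal(size=(10_000, 20))
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=10_000)

w, lr, batch = np.zeros(20), 0.05, 10
for step in range(5_000):
    idx = rng.integers(0, len(X), size=batch)  # sample a minibatch
    err = X[idx] @ w - y[idx]
    grad = X[idx].T @ err / batch              # minibatch gradient
    w -= lr * grad                             # one cheap update
```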
SGD + Momentum
Problems with SGD
- High condition number: the ratio of the largest to the smallest singular value of the Hessian matrix is large.
Recap: Hessian Matrix
Suppose \(f:\mathbb{R}^n\to\mathbb{R}\) is a function with vector input \(\textbf{x}\in \mathbb{R}^n\) and scalar output \(f(\textbf{x})\in \mathbb{R}\).
If all second partial derivatives of \(f\) exist and are continuous, then the Hessian matrix \(\textbf{H}\) of \(f\) is the \(n\times n\) matrix:
\[\textbf{H}_f= \begin{bmatrix} \frac{\partial^2f}{\partial x_1^2} & \frac{\partial^2f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2f}{\partial x_1 \partial x_n}\\ \frac{\partial^2f}{\partial x_2 \partial x_1} & \frac{\partial^2f}{\partial x_2^2} & \cdots & \frac{\partial^2f}{\partial x_2 \partial x_n}\\ \vdots&\vdots&\ddots&\vdots\\ \frac{\partial^2f}{\partial x_n \partial x_1} & \frac{\partial^2f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2f}{\partial x_n^2} \end{bmatrix}\]or, \((\textbf{H}_f)_{i,j}=\frac{\partial^2f}{\partial x_i \partial x_j}\).
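For a concrete check, PyTorch can compute the Hessian of a scalar-valued function directly via `torch.autograd.functional.hessian`; the quadratic \(f\) below is an arbitrary illustrative choice:
```python
import torch
from torch.autograd.functional import hessian

# f(x) = x1^2 + 3*x1*x2 + 2*x2^2, so H = [[2, 3], [3, 4]] everywhere
def f(x):
    return x[0] ** 2 + 3 * x[0] * x[1] + 2 * x[1] ** 2

H = hessian(f, torch.tensor([1.0, 2.0]))
print(H)  # tensor([[2., 3.], [3., 4.]])
```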
Recap: Condition Number
In numerical analysis, the condition number of a function with respect to an argument measures how much the output of the function can change for a small change in the input. For example, in a linear system \(Ax=b\), the condition number measures the sensitivity of the solution \(x\) to a small perturbation of \(b\). A problem is said to be well-conditioned if the condition number is small and ill-conditioned if it is large.
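As a quick illustration (the matrices below are arbitrary), `numpy.linalg.cond` computes the condition number as the ratio of the largest to the smallest singular value, and an ill-conditioned system amplifies input perturbations by roughly that factor:
```python
import numpy as np

A_good = np.array([[2.0, 0.0], [0.0, 1.0]])   # cond = 2
A_bad = np.array([[100.0, 0.0], [0.0, 0.1]])  # cond = 1000
print(np.linalg.cond(A_good), np.linalg.cond(A_bad))

# Perturb b along the small-singular-value direction of A_bad.
b = np.array([1.0, 0.0])
x = np.linalg.solve(A_bad, b)
x_pert = np.linalg.solve(A_bad, b + np.array([0.0, 0.01]))
rel_in = 0.01 / np.linalg.norm(b)                       # 1% input change
rel_out = np.linalg.norm(x_pert - x) / np.linalg.norm(x)
print(rel_out / rel_in)                                 # ~1000 = cond(A_bad)
```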
In the context of SGD, suppose we are to optimize \(f(x)\). By the second-order Taylor expansion around the current point \(x_0\),
\[\begin{align*} f(x)&=f(x_0)+(x-x_0)^Tg+\frac{1}{2}(x-x_0)^TH(x-x_0)+...\\ &\approx f(x_0)+(x-x_0)^Tg+\frac{1}{2}(x-x_0)^TH(x-x_0)\\ f(x_0-\epsilon g)&\approx f(x_0)-\epsilon g^T g + \frac{1}{2}\epsilon ^2 g^T H g \end{align*}\]where \(x_0\) is the current location, \(\epsilon\) is the learning rate, \(g\) is the gradient of \(f\) at \(x_0\), and \(H\) is the Hessian matrix at \(x_0\).
Property of real symmetric matrix
For any real symmetric matrix \(H\),
\[\forall {\bf v} \in \mathbb{R}^n ,\left\|{\bf v}\right\|=1\longrightarrow {\bf v}^T H {\bf v}\le\lambda_{\max}\]Proof.
\[H = Q\Lambda Q^T \mbox{(diagonalization)}\\ {\bf v}^T H{\bf v} = {\bf v}^T Q\Lambda Q^T{\bf v}\]Let \({\bf y}=Q^T {\bf v}\), then
\[\begin{align} {\bf v}^T H {\bf v} &= {\bf y}^T\Lambda{\bf y}\\ &= \lambda_1 y_1^2 + \lambda_2 y_2^2 + \cdots +\lambda_n y_n^2 \\ &\le \lambda_{\max} y_1^2 + \lambda_{\max} y_2^2 + \cdots +\lambda_{\max} y_n^2 \\ &= \lambda_{\max} (y_1^2 + y_2^2 + \cdots + y_n^2) \\ &= \lambda_{\max} {\bf y}^T {\bf y} \\ \end{align}\]where the eigenvalues are ordered so that \(\lambda_1=\lambda_{\max}\) and \(\lambda_n=\lambda_{\min}\). Because \(Q^{-1}=Q^T\), \(QQ^T=I\),
\[{\bf y}^T {\bf y}={\bf v}^TQQ^T {\bf v}={\bf v}^T{\bf v}\\ {\bf y}^T\Lambda{\bf y}\le\lambda_{\max} {\bf y}^T {\bf y}={\bf v}^T{\bf v}\\\]And because \({\bf v}^T H {\bf v} = {\bf y}^T\Lambda{\bf y}\), \(\left\|{\bf v}\right\| =1\) ,
\[\begin{align} {\bf v}^T H {\bf v}&\le\lambda_{\max} {\bf v}^T {\bf v}\\ {\bf v}^T H {\bf v}&\le \lambda_{\max} &\hfill \square \end{align}\]Similarly, \({\bf v}^T H {\bf v} \ge \lambda_{\min}\).
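A quick numerical sanity check of both bounds, using a random symmetric matrix (purely illustrative):
```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
H = (A + A.T) / 2             # random real symmetric matrix
lam = np.linalg.eigvalsh(H)   # eigenvalues in ascending order

v = rng.normal(size=6)
v /= np.linalg.norm(v)        # random unit vector
q = v @ H @ v
print(lam[0] <= q <= lam[-1])  # True: lambda_min <= v^T H v <= lambda_max
```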
So the optimal learning rate \(\epsilon=\epsilon ^*\) satisfies
\[\begin{align*} f(x_0-\epsilon g)&\approx f(x_0)-\epsilon g^T g + \frac{1}{2}\epsilon ^2 g^T H g\\ \left.\frac{\partial f(x_0-\epsilon g)}{\partial\epsilon}\right|_{\epsilon=\epsilon ^*}=0 \implies \epsilon ^* &= \frac{g^T g}{g^T H g} \ge \frac{1}{\lambda_{\max}} \end{align*}\]where the inequality follows from applying the bound above to \({\bf v}=g/\left\|g\right\|\), which gives \(g^T H g\le\lambda_{\max}\,g^T g\). Similarly, \(\epsilon ^*\le \frac{1}{\lambda_{\min}}\). Therefore
\[\frac{1}{\lambda_{\max}} \le \epsilon ^* \le \frac{1}{\lambda_{\min}}\]When \(\lambda_{\max}/\lambda_{\min}\) is large, no single learning rate fits all directions: updates zig-zag along steep directions while crawling along flat ones.
- Local minima & saddle points: the loss is not convex, so the gradient can vanish at points that are not minima.
- The stochastic sampling of minibatches introduces noise into the gradient estimate.
Improvement: SGD + momentum. Maintain a velocity \(v\) as a running average of past gradients and step along it:
\[v_{t+1}=\rho v_t + \nabla f(x_t) \\ x_{t+1}=x_t - \alpha v_{t+1}\]
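A minimal sketch comparing the plain gradient step with the momentum update on a deterministic, ill-conditioned quadratic bowl (no minibatch noise, so the conditioning effect is isolated; the Hessian, \(\rho\), and \(\alpha\) below are illustrative choices, not tuned values):
```python
import numpy as np

H = np.diag([100.0, 1.0])  # ill-conditioned Hessian (condition number 100)
grad = lambda x: H @ x     # gradient of f(x) = (1/2) x^T H x

def run(rho, steps=100, alpha=0.009):
    x, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(steps):
        v = rho * v + grad(x)  # v_{t+1} = rho * v_t + grad f(x_t)
        x = x - alpha * v      # x_{t+1} = x_t - alpha * v_{t+1}
    return np.linalg.norm(x)   # distance to the optimum at 0

print(run(0.0))  # rho = 0 reduces to plain GD: slow along the flat direction
print(run(0.9))  # momentum: much closer to the optimum after the same steps
```
With \(\rho=0\) the velocity is just the current gradient, recovering vanilla gradient descent; with \(\rho=0.9\) the accumulated velocity damps the zig-zag along the steep direction and speeds up progress along the flat one.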