Training a neural network Using Newton’s Method
The aim of this Blog is to understand if Newton’s Method can be used for optimisation in training a Neural network and compared with Gradient Decent. Neural networks are used for example in financial services for forecast, machine learning and artificial intelligence, representing algorithms that imitate the brain of a human using a large amount of data as input. The aim is to represent the connection between neurons and synapses in our brain.
A simple Neural Network is usually made by an Input Layer, a Hidden Layer (𝑵𝒏), and an Output Layer.
Neural networks take some existing values and generate predictions based on these values or parameters. For example, if you want to train a machine to recognize a dog, you need to give the machine as much information about a picture of dogs and find a way to tell the machine that it is wrong when the guess is not the expected value and is right when the expected value is correct.
Example of a Neural Networks
In this example the values given to the Neural Network are width and length and based on these values, the next time is given only these two pieces of information should be able to guess if the object is a car or a motorbike.
In this example the Neural Network should be able to identify if a given object is a car or a motorbike by only receiving two inputs, width and length.
V= value (width, length) W = weight
B = bias term
P = prediction
The parameter of (w1, w2 and b) are random and to find the right one the cost function can help. The cost function takes in current data and predictions and tells how far our prediction is from the given data set.
The given data set has to be minimized to tell if our prediction is close to the data.
To get help in this part we are going o compare Gradient Descente and Newton’s method.
Gradient Descent is an optimisation Method, when fit a line with Linear Regression, it optimizes the intercept and Slope.
Example with a data set where the X-axis is the Weight and the Y-axies is the Height.
If we extend a line in the data set, we can use the line to predict some data. For example, if we have a weight of 1.5, we can say that the height is 1.9
The Gradient Descent help to fit the line to data by finding the optimal values for the intercept and the Slope.
Prediction of Y = Intercept + slope * Weight
We insert the 0.64 as slope using the Least Squares estimate and we use the
method of Gradient Descent to find the best value for the intercept.
Fist step is to pick a random value for the intercept, this value is what gradient decent will use to improve the guess.
Starting value = 0
Now our line has a starting value of zero
Prediction of Y = 0 + 0.64 * Weight
Now to see how well the line fits in the data we use the Sum of the Squared Residuals which is a type of loss function.
The first residual (x1, y2) (0.5, 1.4) give the prediction by plugging the Weight (X1) in the equation.
Prediction of Y = 0 + 0.64 * 0.5 = 0.32
The residual is given by calculating the Observed height and subtracting the Predicted Height
Residual = Height — prediction or
Residual = 1.4–0.32 = 1.1
Using the residual of 1.1 we need to square it and sum it with the remaining residual to get the Sum of squared residuals.
𝑺𝒖𝒎 𝒐𝒇 𝒔𝒒𝒖𝒂𝒓𝒆𝒅 𝒓𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔 = 𝑌𝑛 − (𝑦𝑛 − (𝑖𝑛𝑡𝑒𝑟𝑐𝑒𝑝𝑡 + 𝑠𝑙𝑜𝑝𝑒 ∗ 𝑤𝑒𝑖𝑔h𝑡)) 𝑺𝒖𝒎 𝒐𝒇 𝒔𝒒𝒖𝒂𝒓𝒆𝒅 𝒓𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔 = (1.1)2 + (0.4)2 + (1.3)2 = 3.1
The below graph shows how a change in the intercept change the Sum of the Squared Residuals, increase from 0 to 1 gives a lower Sum Squared Residual, increase from 1 to 2 see an increase in the Sum Squared Residual.
Gradient decent does less calculation when the value is far from the optimal solution and increase the number of calculation when getting closer to the best value.
Now we take the derivative of the equation of the curve (𝑂𝑏𝑠𝑒𝑟𝑣𝑒𝑑 — 𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑒𝑑)2 to determine the slope at any value for the intercept.
After taking the derivative of the sum of squared residuals, Gradient Descent can be used to find the lowest point for the Residuals. The value is found by taking small steps from the initial guess until it finds the best value. The closer the optimal value is from the intercept, the closer to zero is the slope. If the steps are to big the point may end up on the other side of the curve, if is step are too small it may take forever to arrive. To determines the Step Size Gradient Descent, multiply the Slope by a Learning rate for example 0.1.
Step Size = Slope * Learning Rate
Next intercept = Old Intercept — Step Size
The algorithm will stop by using either a max iteration for example 1000 or when it gets close to zero
Δ = diffrence between the predict and actual data point 𝑛 = 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑑𝑎𝑡𝑎 𝑝𝑜𝑖𝑛𝑡
1𝑛 𝑀𝑒𝑎𝑛 𝑆𝑞𝑢𝑎𝑟𝑒𝑑 𝑒𝑟𝑟𝑜𝑟 𝑜𝑟 𝑐𝑜𝑠𝑡 𝑓𝑢𝑛𝑐𝑡𝑖𝑜𝑛:
Find partial derivative respect to m and b to get the slope
def gradiant_d(x, y): m=0
n_iterations = 10000
n = len(x)
learning_r = 0.08
for i in range(n_iterations):
pred_y = m * x + b
cost = (1/n) * sum([vl**2 for vl in (y-pred_y)])
md = -(2/n)* sum(x*(y-pred_y)) bd = -(2/n)* sum(y-pred_y)
m = m — learning_r *md b = b — learning_r *bd
Newtons method is based on the observation that using a second derivative in addition to the first one can help to get a better approximation. The resulting function is no longer linear but quadratic.
To find the root it first starts by picking a random point (X1) and find out what the function evaluates at that value f(X1)
Now when f(x = X1) is going to be equal to the height at point, the slope is going to be the derivative of this function evaluate at that point.
Slope = f’(x = X1)
Using the slope of the line we can find the second point (X2)
The next point is X2, to find it is going to be X1 minus the ∆𝑥 or the ratio between
Now to find the next point X3 the process can be repeated
Newton’s method when applied to Optimization gives a way to find the minimal or maxima. It considered a 2nd order method because it uses not only the first derivative and the gradient but also a 2nd derivative 𝑓′′(𝑥) also called the Hassien in a multivariable.
Instead to find the derivative of the 𝑓(𝑥) it tries to find the one at the derivative of the function evaluated at X.
Conclusion Newtons Method vs Gradient Descent
Gradient decent is a first order function that use the derivative of that function to find the minimal. Newton’s method is a root finding algorithm that use a second order derivative to find the minimal of that function. A second order derivative can be faster only if is known and can be computed easily, but most of the case this will require a lot of computation and can be expensive. If a 𝑁 is require for the first derivative a 𝑁2will be required to find the second one.
Gradient Decent is sensitive to the learning rate, and in case of a stationary points the algorithms will continuities to run. Newtons Method instead require to elaborate
𝑓′(𝑥)/𝑓′′(𝑥) 𝑓𝑜𝑟 𝑓′(𝑥) = 𝑓′′(𝑥) = 0, mean the program will stop with a division by zero error.
Machine Learning Lecture 12 “Gradient Descent / Newton’s Method” -Cornell CS4780 SP17,
added by Kilian Weinberger. Available at URL [https://www.youtube.com/watch?v=o6FfdP2uYh4]
Rebentrost, P., 2022. ShieldSquare Captcha. [online] Iopscience.iop.org. Available at:
<https://iopscience.iop.org/article/10.1088/1367-2630/ab2a9e> [Accessed 2 January 2022].
Fletcher, R. (2005). “On the Barzilai–Borwein Method”. In Qi, L.; Teo, K.; Yang, X. (eds.). Optimization and
Control with Applications. Applied Optimization. Vol. 96. Boston: Springer. pp. 235–256
Avriel, Mordecai (2003). Nonlinear Programming: Analysis and Methods. Dover Publishing