Machine learning is all about finding that perfect function approximation, as Tom Mitchell mentioned in his very first lecture. So more often than not we end up minimizing something (an error), which leaves us with an optimization problem. An extremely popular method for optimization is gradient descent.
Basic discussion
It is (at least as far as I know):
- An algorithm to find an optimum (usually a minimum) of a function.
- Works best with convex functions (with a differentiable convex function and constraints on the step size, one can guarantee convergence).
The main intuition is to go down in the direction of steepest descent, i.e. against the “slope”. Let’s take a single-variable function f(x) = x^2. Let’s say we start with some value of x. What do we need to do to go down to the minimum? Calculate the slope and move in the opposite direction. Just move “a little bit” so as not to overshoot the minimum. With a small enough “jump” (or step size), we would eventually get arbitrarily close to a point where the slope is 0. You have hit the minimum.
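The f(x) = x^2 walk-through above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `gradient_descent_1d`, the starting point, the step size and the stopping tolerance are all my own assumptions, not part of any library.

```python
def gradient_descent_1d(grad, x0, step_size=0.1, tol=1e-8, max_iters=10_000):
    """Minimize a single-variable function, given its derivative `grad`."""
    x = x0
    for _ in range(max_iters):
        slope = grad(x)
        if abs(slope) < tol:
            # The slope is (almost) zero, so we are (almost) at the minimum.
            break
        # Move "a little bit" in the direction opposite to the slope.
        x = x - step_size * slope
    return x


# f(x) = x^2 has derivative f'(x) = 2x and its minimum at x = 0.
x_min = gradient_descent_1d(lambda x: 2 * x, x0=5.0)
print(x_min)  # a value very close to 0
```

Starting from x = 5, each step shrinks x by a constant factor, so the loop stops once the slope drops below the tolerance.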
For functions of more than one variable, the graph of the function can be thought of as a terrain in a higher-dimensional space. We can visualize it for a function of 2 variables, as the whole thing then lives in 3D. A Khan Academy video explains a bit about gradients in general. Now we again do a similar thing. Start at an arbitrary point and go in the direction opposite to the gradient, since the gradient itself points in the direction of steepest ascent. For our function of 2 variables, you can imagine a marble rolling on the surface. It would roll off in the direction of the steepest downward slope (the negative gradient). The same would be true for a function of n variables, whose graph lives in (n+1)-dimensional space.
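Here is a rough sketch of the same idea for more than one variable, using plain Python lists as vectors. The example function f(x, y) = x^2 + 3y^2, the name `gradient_descent`, and all the parameter values are assumptions chosen for illustration.

```python
def gradient_descent(grad, start, step_size=0.1, tol=1e-8, max_iters=10_000):
    """Minimize a multi-variable function, given a function returning its gradient."""
    point = list(start)
    for _ in range(max_iters):
        g = grad(point)
        # Stop when the gradient is (almost) the zero vector.
        if sum(gi * gi for gi in g) ** 0.5 < tol:
            break
        # Step against the gradient in every coordinate at once.
        point = [p - step_size * gi for p, gi in zip(point, g)]
    return point


def grad_f(p):
    """Gradient of f(x, y) = x^2 + 3*y^2, which is (2x, 6y)."""
    x, y = p
    return [2 * x, 6 * y]


minimum = gradient_descent(grad_f, start=[4.0, -3.0])
print(minimum)  # both coordinates end up very close to 0
```

Nothing in the loop depends on the number of variables, so the same code works for a function of n variables as long as `grad` returns n components.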
Deeper:
The important thing is for the step size to be small. Otherwise, it may overshoot the optimum again and again. Keep the step size too small, though, and as you guessed, it might take extremely long to converge. Thus one possible solution is not to keep the step size constant but to vary it as well.
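To see the trade-off concretely, here is a small experiment on f(x) = x^2 with three fixed step sizes. The helper `run` and the particular step sizes are made up for this illustration.

```python
def run(step_size, x0=1.0, iters=50):
    """Run `iters` gradient descent steps on f(x) = x^2 (derivative 2x)."""
    x = x0
    for _ in range(iters):
        x = x - step_size * 2 * x
    return x


too_large = run(1.1)     # overshoots the minimum every step and grows in magnitude
too_small = run(0.001)   # moves in the right direction, but barely
reasonable = run(0.4)    # gets very close to the minimum at 0

print(too_large, too_small, reasonable)
```

With a step size of 1.1 every update multiplies x by -1.2, so the iterate jumps across the minimum and ends up farther away each time. With 0.001 every update multiplies x by 0.998, which is why 50 steps are nowhere near enough.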
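A cool aspect of step sizes: line search, where each iteration searches along the descent direction for a step size that reduces the function enough.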
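One common flavor of line search is backtracking: start with a big step and keep shrinking it until the function value decreases "enough" (the Armijo condition). The sketch below is a minimal single-variable version; the names `backtracking_step` and `descend`, the constants `shrink` and `c`, and the test function x^4 + x^2 are all assumptions for illustration.

```python
def backtracking_step(f, grad, x, initial_step=1.0, shrink=0.5, c=1e-4):
    """Pick a step size at x by shrinking until the Armijo condition holds."""
    g = grad(x)
    step = initial_step
    # Armijo condition: the new value must be below f(x) by at least c * step * g^2.
    while f(x - step * g) > f(x) - c * step * g * g:
        step *= shrink
    return step


def descend(f, grad, x0, tol=1e-8, max_iters=1000):
    """Gradient descent where every iteration chooses its own step size."""
    x = x0
    for _ in range(max_iters):
        if abs(grad(x)) < tol:
            break
        x = x - backtracking_step(f, grad, x) * grad(x)
    return x


f = lambda x: x ** 4 + x ** 2
grad = lambda x: 4 * x ** 3 + 2 * x

x_min = descend(f, grad, x0=3.0)
print(x_min)  # a value very close to 0
```

The nice part is that no step size has to be tuned by hand: far from the minimum, where the slope is steep, the search automatically backs off to a small step, and closer in it accepts larger ones.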
Still deeper:
Guarantees:
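We can guarantee convergence if the following hold:
- The function is convex and differentiable.
- The step sizes are small enough (if the gradient is Lipschitz continuous with constant L, a fixed step size of at most 1/L works) or are chosen adaptively, for example by line search.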
Limitations or constraints:
- The function has to be differentiable wherever the gradient is evaluated (for convex functions that are not differentiable everywhere, subgradient methods are the usual substitute).
- Have to find other limitations. Probably some info about practical nuances and limitations of using GD.
Which modifications or variants of it are used in industry?
This is my favorite section since I can actually provide some more input than just restating the information available on the net. I'll keep editing the post so as to enhance this section with the best knowledge I can provide.
Generally I have seen stochastic gradient descent being applied in industry. SGD, as it is called, has an implementation in Apache Spark too. What is cool about it is that Spark's SGD is distributed. Another method that I have seen being used is L-BFGS. I will soon come up with an in-depth discussion of both and update this blog post.
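To give a feel for what the "stochastic" part means, here is a tiny single-machine sketch that fits a line y = w*x + b by updating on one example at a time. This is not Spark's implementation or API; the function `sgd_linear_fit`, the toy data and all parameter values are assumptions made up for this post.

```python
import random


def sgd_linear_fit(xs, ys, step_size=0.05, epochs=500, seed=0):
    """Fit y = w*x + b with stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a random order each epoch
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of 0.5 * error^2 for this single example only.
            w -= step_size * error * xs[i]
            b -= step_size * error
    return w, b


# Toy, noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]

w, b = sgd_linear_fit(xs, ys)
print(w, b)  # close to 2 and 1
```

Each update looks at a single example instead of the whole dataset, which is what makes the approach attractive when the data is too large to process in one go.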
If you find anything erroneous, please comment and let me know. In the meantime I'll keep improving my knowledge and revising this post. Thanks for reading.