Saturday, November 19, 2016

The good old gradient descent

Machine learning is all about finding that perfect function approximation, as Tom Mitchell mentioned in his very first lecture. So more often than not we end up minimizing something (usually an error), which means we end up with an optimization problem. An extremely popular method for optimization is gradient descent.

Basic discussion
It is (at least as far as I know): 
  • An algorithm to find an optimum. 
  • Best suited to convex functions (with a convex function and suitable constraints on the step size, one can guarantee convergence).
The main intuition is to go downhill in the direction of steepest descent, i.e. against the “slope”. Let’s take a single-variable function f(x) = x^2, and say we start with some value of x. What do we need to do to move towards the minimum? Calculate the slope and move in the opposite direction. Just move “a little bit” so as to not overshoot the minimum. With a small enough “jump” (or step size), we would finally reach a point where the slope is 0. You have hit the minimum. 
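To make that loop concrete, here is a minimal sketch in Python; the step size, tolerance and iteration cap are just illustrative choices, not recommendations:

def gradient_descent_1d(df, x0, step=0.1, tol=1e-8, max_iters=1000):
    """Minimize a 1-D function given its derivative df."""
    x = x0
    for _ in range(max_iters):
        grad = df(x)
        if abs(grad) < tol:      # slope is (almost) zero: we have hit the minimum
            break
        x -= step * grad         # move a little bit against the slope
    return x

# f(x) = x^2, so f'(x) = 2x; the minimum is at x = 0
print(gradient_descent_1d(lambda x: 2 * x, x0=5.0))   # prints a value very close to 0.0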

For a function of more than one variable, the graph of the function can be thought of as a terrain. We can visualize it for a function of 2 variables, since the whole thing sits in 3D. A Khan Academy video here explains a bit about the gradient in general. Now we again do a similar thing. Start with an arbitrary point and move in the direction of the negative gradient, which is the direction of steepest descent. For our 2-variable function, you can imagine a marble rolling on the surface: it would roll off in the direction of steepest slope (the negative gradient). The same idea carries over to a function of n variables, whose graph lives in (n+1)-dimensional space. 
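The 1-D sketch above extends naturally to several variables. Below is an illustrative NumPy version on a made-up function f(x, y) = x^2 + 3y^2, whose gradient is (2x, 6y); again, the parameters are arbitrary:

import numpy as np

def gradient_descent(grad, x0, step=0.1, tol=1e-8, max_iters=1000):
    """Minimize a multivariate function given a function that returns its gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # gradient is (almost) the zero vector
            break
        x -= step * g                 # step in the direction of steepest descent
    return x

# f(x, y) = x^2 + 3y^2  =>  grad f = (2x, 6y); minimum at (0, 0)
print(gradient_descent(lambda v: np.array([2 * v[0], 6 * v[1]]), x0=[4.0, -2.0]))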

Deeper:
The important thing is that the step size should be small. Otherwise, it may happen that the update overshoots and misses the optimum again and again. Keep the step size small and, as you guessed, it might take extremely long to converge. Thus one possible solution is to not keep the step size constant but to vary it as the algorithm progresses.  
A few cool aspects of step sizes: line search, where at each iteration you search along the descent direction for a step size that gives a sufficient decrease. 
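Here is a minimal sketch of one common flavour, backtracking line search with the Armijo sufficient-decrease condition; the shrink factor beta and the constant c are conventional but arbitrary choices:

import numpy as np

def backtracking_step(f, grad_f, x, step0=1.0, beta=0.5, c=1e-4):
    """Shrink the step size until the Armijo sufficient-decrease condition holds."""
    g = grad_f(x)
    step = step0
    # Accept the step only if it decreases f by at least c * step * ||g||^2
    while f(x - step * g) > f(x) - c * step * np.dot(g, g):
        step *= beta                    # otherwise shrink the step and try again
    return x - step * g

# One step on f(x, y) = x^2 + 3y^2, starting from (4, -2)
f = lambda v: v[0] ** 2 + 3 * v[1] ** 2
grad_f = lambda v: np.array([2 * v[0], 6 * v[1]])
print(backtracking_step(f, grad_f, np.array([4.0, -2.0])))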

Still deeper:
Guarantees:
We can provide a guarantee of convergence if the following hold:
  1. The function is convex (and differentiable, with a Lipschitz-continuous gradient). 
  2. The step size is small enough, e.g. a constant step of at most 1/L where L is the Lipschitz constant of the gradient, or a suitably decreasing schedule.
Limitations or constraints:

  1. The function has to be differentiable everywhere (for the plain method; non-differentiable but convex cases need variants such as the subgradient method).
  2. In practice there are other pain points too: on non-convex functions it can get stuck in local minima or saddle points, it is sensitive to feature scaling and to the choice of step size, and computing the full gradient over a huge dataset at every iteration can be expensive. 
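As a tiny illustration of the local-minima point, here is a made-up 1-D function with two minima; plain gradient descent lands in a different one depending on where you start:

def gd_1d(df, x, step=0.01, iters=5000):
    """Plain 1-D gradient descent with a fixed step size."""
    for _ in range(iters):
        x -= step * df(x)
    return x

# f(x) = x^4 - 3x^2 + x has a local minimum near x ~ 1.13 and a global one near x ~ -1.30
df = lambda x: 4 * x ** 3 - 6 * x + 1
print(gd_1d(df, x=2.0))    # settles in the right-hand (local) minimum
print(gd_1d(df, x=-2.0))   # settles in the left-hand (global) minimum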
What modification or variant of it is used in industry?
This is my favorite section since I can actually provide some more input rather than just restate the information present on the net. I'll keep editing the post so as to enhance this section with the best knowledge I can provide. 
Generally I have seen stochastic gradient descent being applied in industry. SGD, as it is called, has an implementation in Apache Spark too. What is cool is that Spark's implementation is a distributed SGD. Another method that I have seen being used is L-BFGS. I will soon come up with a more in-depth discussion about both of these and update this blog post.
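To give a flavour of SGD, here is a minimal NumPy sketch of mini-batch SGD for plain least-squares linear regression; the batch size, step size, number of epochs and the synthetic data are all illustrative choices:

import numpy as np

def sgd_linear_regression(X, y, step=0.01, batch_size=32, epochs=50, seed=0):
    """Fit w to minimize ||X w - y||^2 using mini-batch stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)                      # shuffle once per epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            err = X[batch] @ w - y[batch]
            grad = 2 * X[batch].T @ err / len(batch)  # gradient on this mini-batch only
            w -= step * grad
    return w

# Tiny synthetic example: y = 3*x1 - 2*x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=500)
print(sgd_linear_regression(X, y))   # roughly [3, -2]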

If you find anything erroneous, please comment and let me know. In the meantime I'll also keep on improving my knowledge and revising this post. Thanks for reading.  

Friday, November 4, 2016

More about trees, boosted ones!

Hey,

Sorry for being a bit dormant; a lot of office-related work cropped up. But then it also means a lot of learnings which I am itching to pen down. So, continuing with what was being studied, decision trees, I will elaborate a bit more and probably hint at what is generally done in industry regarding that. Rest assured no NDAs are being broken and I am not giving you highly confidential information. 

So let the learning begin:

So along with decision trees there are the following terms which one hears a lot:
  1. Random Forests
  2. Gradient Boosted trees
So I decided to jump into these, with insights that I have from working in industry. Rest assured no secrets of my current (or former) company are being spilled; it's just the exposure which has made me aware of the various algos/frameworks which I will be discussing. 

So decision trees were considered in the last post, which probably was written in another lifetime. Here is something which sounds similar to decision trees but in fact may be very different in terms of concept: gradient boosted trees. Here is my current understanding of it, what I see in industry about it in general, and other comments. Of course one can find more rigorous work/papers if one wants to dive deep into the complexities. 

Gradient boosted trees:
Hoping that the name is not too far off, one can see that there are three distinct words and associated concepts. Since life is a learning experience and I will keep on learning about these concepts in more detail, what follows is according to my current understanding and would be updated as and when I learn more. So the words being:
Gradient, boosted and trees. 
Gradient: We can guess that it has something to do with the gradient we deal with in optimization. Hell, it is machine learning we are talking about, of course this would be the same gradient. 
Boosted: The word reminds me of a technique I had studied as part of my undergrad data mining course. Boosting of classifiers. More rigorous details below. 
Trees: Probably decision trees are being used somewhere. Let’s see where and how!

Boosting: a bit of dive deep
Idea: A question was asked: if we can indeed get a classifier which is weak but better than random (say one that gives accuracy better than 0.5 on a binary classification problem), can we somehow use a few such weak classifiers to make a better, “stronger” one? An intuitive question to ask. And the answer? Yes! Well, before you start celebrating that we have found a philosopher's stone… there are a few ifs and they are quite big IFs. 

The basic idea: combine different weak classifiers, each of which might be working well only on some limited dimensions (features). Combine them on the basis of how well they work. Weighted average, anyone?
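As an illustration of that weighted-combination idea, here is a minimal AdaBoost-style sketch using decision stumps from scikit-learn as the weak classifiers; the number of rounds is an arbitrary choice:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=20):
    """y must be in {-1, +1}. Returns (stumps, alphas) for a weighted vote."""
    n = len(y)
    sample_weights = np.full(n, 1.0 / n)
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=sample_weights)
        pred = stump.predict(X)
        err = np.sum(sample_weights * (pred != y)) / np.sum(sample_weights)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)          # better stumps get a bigger say
        sample_weights *= np.exp(-alpha * y * pred)    # up-weight the misclassified points
        sample_weights /= sample_weights.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Weighted vote of the weak classifiers."""
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(scores)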

What if you had decision trees as your base models? Then you get “boosted trees”. But where does the word gradient come from?
The problem of boosting, i.e. combining weaker classifiers to form a strong one, can be thought of as an optimization problem. What are you trying to optimize? The error (loss) of the final classifier. What is your search space? Well, if you take the problem of modeling as finding an approximate function, then the problem of boosting becomes that of finding many approximate functions and combining them together. Thus your search space is the space of these approximate functions, and at each stage you want to pick a new function that moves the overall loss downhill, i.e. one that points along the negative gradient of the loss. That is where the “gradient” comes from.  

Your final model is a weighted combination of weaker classifiers h_m(x): something like F(x) = γ_1 h_1(x) + γ_2 h_2(x) + … + γ_M h_M(x). The impact of each classifier is weighted by its coefficient γ (gamma). 
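To make that concrete, here is a minimal sketch of gradient boosting for regression with squared loss, where each new tree is fit to the current residuals (which are exactly the negative gradient of the squared loss); the depth, learning rate and number of trees are illustrative choices:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=2):
    """Build F(x) = base + sum_m lr * h_m(x) by fitting each tree to the residuals."""
    base = y.mean()                      # start with a constant prediction
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred             # negative gradient of 0.5*(y - F)^2 w.r.t. F
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees

def gradient_boost_predict(base, trees, X, learning_rate=0.1):
    return base + learning_rate * sum(t.predict(X) for t in trees)

Libraries like xgboost and Spark's gradient boosted trees build on this same loop, with a lot of extra machinery on top (regularization, clever tree construction, distributed training).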

With the intentions and bigger picture set, it’s time to dive into some sweet equations to formalize things a bit more. Updates on it soon. 
(To be continued…)


P.S.: I started writing this post in the hope of talking about gradient boosted trees, their implementation in Spark, and xgboost. Looks like it is going to take a lot more than a single post. Thus the rest of the topics will be dealt with in the next post.