Friday, November 4, 2016

More about trees, boosted ones!

Hey,

Sorry for being a bit dormant; a lot of office-related work cropped up. But that also means a lot of learnings which I am itching to pen down. So, continuing with what was being studied, decision trees, I will elaborate a bit more and hint at what is generally done in industry regarding them. Rest assured, no NDAs are being broken and I am not giving you highly confidential information.

So let the learning begin:

So along with decision trees, there are the following terms which one hears a lot:
  1. Random Forests
  2. Gradient Boosted trees
So I decided to jump into these, with insights that I have from working in industry. Rest assured, no secrets of my current (or former) company are being spilled out. It’s just the exposure which has made me aware of the various algos/frameworks which I will be discussing.

So decision trees were considered in the last post, which probably was written in another lifetime. Here is something which sounds similar to decision trees but may in fact be very different in concept: gradient boosted trees. What follows is my current understanding of it, what I generally see about it in industry, and other comments. Of course, one can find more rigorous work/papers if one wants to dive deep into the complexities.

Gradient boosted trees:
Hoping that the name is not too far off, one can see that there are three distinct words and associated concepts. Since life is a learning experience and I will keep on learning about these various concepts in more detail, this write-up reflects my current understanding and will be updated as and when I learn more. So the words being:
Gradient, boosted and trees. 
Gradient: We can guess that it has something to do with the gradient we deal with in optimization. Hell, it is machine learning we are talking about; of course this would be the same gradient.
Boosted: The word reminds me of a technique I had studied as part of my undergrad data mining course: boosting of classifiers. More rigorous details below.
Trees: Probably decision trees are being used somewhere. Let’s see where and how!

Boosting: a bit of a deep dive
Idea: A question was asked: if we can indeed get a classifier which is weak but better than random (say, one which gives accuracy better than 0.5 on a binary classification problem), can we somehow use a few such weak classifiers to make a better, “stronger” one? An intuitive question to ask. And the answer? Yes! Well, before you start celebrating that we have found a philosopher’s stone… there are a few ifs, and they are quite big IFs.

The basic idea: combine different weak classifiers, each of which might be working well thanks to some limited set of dimensions (features). Combine them on the basis of how well they work. Weighted average anyone?
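To make that weighted-average idea concrete, here is a minimal sketch, assuming scikit-learn decision stumps (trees of depth 1) as the weak classifiers. The weighting scheme (accuracy minus 0.5) is just something I picked for illustration, not the exact formula a proper boosting algorithm like AdaBoost uses.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
y_signed = 2 * y - 1  # map {0, 1} labels to {-1, +1} so a weighted vote is just a sum

stumps, weights = [], []
for seed in range(5):
    # each weak learner sees a different bootstrap sample, so they make different mistakes
    idx = np.random.RandomState(seed).choice(len(X), size=len(X), replace=True)
    stump = DecisionTreeClassifier(max_depth=1, random_state=seed).fit(X[idx], y_signed[idx])
    stumps.append(stump)
    weights.append(stump.score(X, y_signed) - 0.5)  # weight = how much better than random it is

# combine: sign of the weighted sum of the individual votes
combined = np.sign(sum(w * s.predict(X) for w, s in zip(weights, stumps)))
print("combined accuracy:", (combined == y_signed).mean())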

What if you had decision trees as your basic models? Then you get “boosted trees”. But where does the word gradient come from?
The problem of boosting, i.e. combining weaker classifiers to form a strong one, can be thought of as an optimization problem. What are you trying to optimize? The error rate of the final classifier. What is your search space? Well, if you take the problem of modeling as finding an approximating function, then the problem of boosting becomes that of finding many approximating functions and combining them together. Thus your search space is the space of these approximating functions, and at each step you want to select a good function to add.
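Here is a minimal sketch of that function-space view, assuming a squared-error loss, where the negative gradient with respect to the current prediction is simply the residual y - F(x). Each new tree is fit to those residuals, i.e. each step nudges the combined function a little "downhill":

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
prediction = np.zeros_like(y)   # start from the zero function
trees = []

for _ in range(100):
    residuals = y - prediction                      # negative gradient of 0.5 * (y - F(x))^2
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # small step along that direction
    trees.append(tree)

print("training MSE:", np.mean((y - prediction) ** 2))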

Your final model is a combination of weaker classifiers h(x). The contribution of each classifier is weighted by a coefficient gamma.
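Written out informally (ahead of the promised equations), the combined model is just an additive one, where each weak classifier h_m contributes according to its weight gamma_m:

F_M(x) = \sum_{m=1}^{M} \gamma_m \, h_m(x)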

With the intentions and bigger picture set, it’s time to dive into some sweet equations to formalize things a bit more. Updates on it soon. 
(To be continued…)


P.S: I started writing this post in the hope of talking about gradient boosted trees and their implementation in spark and xgboost. Looks like it is going to take a lot more than a single post. Thus the rest of the topics will be dealt with in the next post.
