Saturday, January 30, 2016

Decision trees and intro to ML from CMU's course

Hey! Decision trees as a classifier for machine learning are, I guess, one of the best starting points. Probably because they are so close to our intuition. I'm going to watch this course video from CMU's ML class about decision trees. Let's embark on our journey to learn something new, disabuse ourselves of our wrong notions and learn from a great teacher.

So here is what I already know about decision trees (yeah, I did an ML course in college as an undergrad. Haven't touched decision trees since then though.)

What I already know:

  1. Supervised method to classify records/tuples/examples/points.
  2. Main aim is to map each record to a leaf node; the label corresponding to that leaf node is assigned to the record as its class.
  3. Construction of the tree is based on attribute values: starting from the root, branches come out of each node based on the values of the attribute tested there.
  4. The main aim is splitting the space. By intuition or common sense, we would like to split on the attribute which splits the input space best. By input space I mean the feature space.
  5. Since there can be many leaf nodes, decision trees are well suited for multi-class problems, unlike SVM or Logistic Regression (see the small sketch after this list).
  6. A decision tree is an inherently non-linear classifier.
  7. They too suffer from overfitting/underfitting. Pruning is one way to deal with it.
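To make points 5, 6 and 7 a bit more concrete, here is a minimal sketch (my own, not from the lecture; it assumes scikit-learn is installed): fitting a decision tree on the 3-class iris dataset, with `max_depth` as a crude stand-in for pruning to keep overfitting in check.

```python
# Minimal sketch: a decision tree on a multi-class problem, with max_depth
# used as a simple pre-pruning knob against overfitting.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)          # 3 classes, 4 numeric attributes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Limiting depth is one simple way to keep the tree from overfitting.
clf = DecisionTreeClassifier(max_depth=3, criterion="entropy", random_state=0)
clf.fit(X_train, y_train)

print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy:", clf.score(X_test, y_test))
```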


Questions I have in mind, already!

  1. How is it different from rule-based classifiers? Because in a way, each node is just a rule, and executing these rules one after another is like traversing the decision tree.
  2. Despite having such great advantages (being inherently non-linear and suitable for multi-class problems), why aren't they used much? Random forests are used, but I don't know much about them.

Time to watch the lecture guys. This is the course website which I am following. 

And here is the video which I am about to watch:

So the video has been watched. Here are some initial comments and questions. I'll come back later to improve upon them, neatly classify them and dive deeper into them:

  1. The prof talks about algorithms which work with some labeled data and a lot of unlabelled data. What was he hinting at? Have to see….
  2. Structured vs unstructured data… got to know about the terms. I guess seeing programs learn from unstructured data is what enthralls novices!! Where would a blob of text fit? It is not exactly structured… and marking or tagging a few parts of the sentences would make it different from plain unstructured data as well. Though labeling it as structured or unstructured is not that important, it's just for fun.
  3. The part about self-adapting algorithms… haven't explored them much. Should check out the paper about optimizing matrix multiplication.
  4. Can any function be represented using a decision tree if the attributes take boolean values? The fact is that we can easily represent NAND and NOR gates using decision trees, and since NAND and NOR gates can be used to model any boolean function, we can say yes, any boolean function can be represented. The cool thing is the guy in the video gives a more general and awesome answer. His answer of having N layers corresponding to the N attributes, with the number of branches at each level depending on the cardinality of the values the attribute can take, is elegant and more general than what I thought (a toy sketch of this is after the list). The prof says that attributes take "discrete" values. For the numeric case, we can deal with them by binning them. But alas, what about "text" features? Can't use some attribute like "title of the book" for making a decision tree. There should be some way… would need to check it out.
  5. In the real example of a decision tree which the prof showed (from some hospital), the leaf nodes were not pure and certain probabilities attached to them were shown. Hmm… this is somewhat like pruning, isn't it? (I know the basics of machine learning… so you can say I am cheating… but learning from such a great prof is a chance I cannot pass up ;) )
  6. I had earlier mentioned that you would pick an attribute based on its efficacy at splitting your feature space well. I should have rather said that you would pick an attribute based on its efficacy at separating the classes (to predict) well. As in, the split should be as homogeneous as possible. This is kind of what entropy would be measuring.
  7. So while choosing the next attribute to split on, the entropy (information) gain calculations for all the attributes can be done in parallel. Thus there is an element of parallelization involved which can make the algorithm a bit faster (see the second sketch below).
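Here is a toy sketch of point 4's "N layers for N attributes" idea (names and code are my own, not from the lecture): a full binary tree with one level per boolean attribute and 2^N leaves can store the output of any boolean function, including XOR, which no linear classifier can separate.

```python
# Toy sketch: a full tree with one level per boolean attribute and 2**N leaves
# can encode ANY boolean function, because each leaf just stores the function's
# value for that particular assignment of the attributes.
from itertools import product

def build_full_tree(func, n_attrs):
    """Return nested dicts: level i branches on attribute i (value 0 or 1)."""
    def build(prefix):
        if len(prefix) == n_attrs:          # leaf: stored label for this assignment
            return func(*prefix)
        return {0: build(prefix + (0,)), 1: build(prefix + (1,))}
    return build(())

def classify(tree, assignment):
    node = tree
    for value in assignment:                # walk one level per attribute
        node = node[value]
    return node

xor = lambda a, b: a ^ b                    # a non-linearly-separable function
tree = build_full_tree(xor, 2)
for a, b in product([0, 1], repeat=2):
    print((a, b), "->", classify(tree, (a, b)))
```

And a small sketch for points 6 and 7 (again my own toy code, assuming nothing beyond the Python standard library): entropy as a measure of how homogeneous the children of a split are, and the information gain of each candidate attribute computed independently, so the per-attribute work can be farmed out in parallel.

```python
# Entropy of a set of labels, information gain of splitting on an attribute,
# and per-attribute gains computed in parallel (each one is independent).
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, labels, attr):
    """Information gain of splitting (rows, labels) on attribute index `attr`."""
    parent = entropy(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    children = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return parent - children

# Tiny made-up dataset: attributes = (outlook, windy), label = play?
rows   = [("sunny", 0), ("sunny", 1), ("rain", 0), ("rain", 1), ("overcast", 0)]
labels = ["no", "no", "yes", "no", "yes"]

# Gains for all attributes are independent, so they can be computed in parallel.
with ThreadPoolExecutor() as pool:
    gains = list(pool.map(lambda a: info_gain(rows, labels, a), range(2)))
print(gains)   # pick the attribute with the highest gain for the next split
```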
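In a real implementation you would apply the gain computation recursively: pick the best attribute, split the rows into groups by its value, and repeat on each group until the leaves are (nearly) pure or a depth limit is hit.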