Some of you might know that I have been following CMU's machine learning course by Tom Mitchell and Nina Balcan, and I was going through a few of the initial lectures. Although everything was going okay-ish, I had this one basic doubt about how MLE and MAP essentially differ from probabilistic approaches to machine learning that use Bayes' theorem. Some of those who know the answer and have it clear in their heads would be thinking what a foolish thing this is to get confused about. Well, this post is not for them. This post might be very basic for many of you, but it was still the aha moment for me. So here I am, writing about how MLE is different from Naive Bayes (NB).
Note: the following was jotted down as "notes" in the default "Notes" app and hence lacks proper formatting (subscripts/superscripts) and refinement in terms of language. But the ideas are probably clear and worth sharing. I promise I will come back and format it properly, but I am too tired after a full day of work to edit it right now. I just hope I made sense here. Good night!
Maximum Likelihood Estimate
- You are given data.
- You need to determine the probability of something. (In some examples you may see that the final answer matches your intuition, and you may wonder why you are doing this lengthy process, but there will be cases where your intuition won't be that strong, and then this MLE principle helps a lot.)
- That probability is expressed in terms of some parameters. These parameters are called probability parameters.
- If we can express the probability of our (already observed) data in terms of these parameters, then the logic says we should take the value of the parameters that maximizes the probability of the data. That is, pick θ = argmax over θ of P(D/θ). That's the whole logic/hope of this.
- We express the probability of our data in terms of these parameters. This probability of the data, seen as a function of the parameters, is called the likelihood function.
- So finally, just maximize this function. How do you do that? Differentiate it and set the derivative to 0. That surely works when we have only one parameter. For more than one parameter, we use gradient descent (on the negative log-likelihood), which I think is kind of a generalization of the first case anyway. There is a small coin-flip sketch of this right after the list.
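To make this concrete, here is a minimal sketch in Python (a toy coin-flip example of my own, not from the course). The probability parameter is θ, the probability of heads; differentiating the log-likelihood and setting it to 0 gives simply (number of heads) / (number of flips), and the brute-force grid search is just a sanity check that the closed form really is the argmax.

```python
import numpy as np

# Toy data (made up for illustration): 1 = heads, 0 = tails
data = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])

def log_likelihood(theta, data):
    # log P(D/theta) for i.i.d. coin flips: sum of log Bernoulli probabilities
    return np.sum(data * np.log(theta) + (1 - data) * np.log(1 - theta))

# Closed form: differentiate the log-likelihood, set it to 0 -> theta = (#heads) / (#flips)
theta_closed_form = data.mean()

# Brute force: evaluate the log-likelihood on a grid of theta values and take the argmax
grid = np.linspace(0.001, 0.999, 999)
theta_grid = grid[np.argmax([log_likelihood(t, data) for t in grid])]

print(theta_closed_form, theta_grid)  # both come out around 0.7 for this data
```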
So MLE is just a way to determine a probability (via probability parameters) from given data. For now, let's forget about MAP. And notice that there is no meddling with Bayes' rule anywhere here. Keeping that in mind should help you understand NB without any confusion.
My confusion regarding NB and MLE.
- I just used to think that MAP and NB are so similar. They are, in turn, computing P(θ/D) from P(D/θ).P(θ)/P(D) and P(Y/X) from P(X/Y).P(Y)/P(X) respectively. So where exactly is the difference? Both are using Bayes' rule to find their probabilities.
- For NB (and actually all generative models), we are trying to determine P(Y/X) using Bayes' rule, in terms of P(X/Y).P(Y) (the P(X) in the denominator can be ignored, since it is the same for every class). This method wouldn't actually need MLE or MAP or any other probability estimation technique if we were given the joint probability distribution. If that were the case, we could simply determine the value of P(X/Y) just by counting the matching examples in the training data. Same for P(Y). But alas, as we saw, the JPD is not that easy to come by. Thus we have to estimate P(X/Y) in other ways. One way to go about it is to make no assumptions and try to learn all the probabilities (and thus probability parameters) we may need.
- Now if Y can take k values (the number of classes) and X is a vector of features <X1, X2, ..., Xn>, each of which is discrete and can take j values, then we would need to estimate about (j^n)*k probabilities. We could use MLE (or MAP) to model each of these probabilities as one or more parameters and then estimate them, but the number of parameters (i.e., probabilities to estimate) is huge. For example, with n = 10 binary features (j = 2) and k = 2 classes, that is already 2^10 * 2 = 2048 probabilities.
- What if we assume that all the Xi are conditionally independent given Y? Then P(X/Y) decomposes into P(X1/Y).P(X2/Y).P(X3/Y)...P(Xn/Y). We can estimate each of these individual little probabilities and multiply them to compose the required P(X/Y). With the same assumptions about X and Y, the number of parameters we need to estimate is now n*j*k (just 40 in the example above), so we save a lot.
- But again, we use MLE or MAP to estimate these little probability parameters.
- So Bayes' theorem may help us with the structure of the classifier, but to find those individual little probabilities you are still using MLE (or MAP). There is a small sketch of this right after the list.
- And MLE and MAP are still what you use for estimating the parameters of discriminative classifiers like logistic regression. In LR you won't get confused with Bayes' rule at all; you just use MLE or MAP to determine those parameters directly. The same goes for linear regression.
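As a minimal sketch of the NB story above (a toy, made-up example, nothing from the course): each P(Xi/Y) and P(Y) is estimated by MLE, which for discrete features reduces to counting, and Bayes' rule then composes them into a prediction (ignoring P(X), as discussed).

```python
import numpy as np

# Toy training data (made up for illustration): 4 binary features, 2 classes
X = np.array([
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
])
y = np.array([1, 1, 1, 0, 0, 0])

classes = np.unique(y)

# MLE for P(Y=c): just the class frequencies in the training data
prior = {c: np.mean(y == c) for c in classes}

# MLE for P(Xi=1 / Y=c): counting, done separately within each class
likelihood = {c: X[y == c].mean(axis=0) for c in classes}

def predict(x):
    # Bayes' rule: score(c) = P(Y=c) * prod_i P(Xi=xi / Y=c); P(X) is ignored
    scores = {}
    for c in classes:
        p = likelihood[c]
        scores[c] = prior[c] * np.prod(np.where(x == 1, p, 1 - p))
    return max(scores, key=scores.get)

print(predict(np.array([1, 0, 1, 0])))  # prints 1 for this toy data
```

If you replaced the plain counts with counts plus a small pseudo-count (Laplace smoothing), that would be the MAP version of the same estimates.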
To be honest, I thought I would discuss MAP as well. But I guess I have covered the important differences between what MLE and MAP try to do versus what is going on in NB. So I will leave the explicit explanation of MAP here.