Minimum Description Length Principle

We can interpret the definition of $h_{MAP}$ in terms of the basic concepts of information theory \[ \array{ h_{MAP} &= &\arg \max_{h \in H}P(D\,|\,h) P(h) \\ &= &\arg \max_{h \in H} \log_{2} P(D\,|\,h) + \log_{2} P(h) \\ &= &\arg \min_{h \in H} - \log_{2} P(D\,|\,h) - \log_{2} P(h) } \] This equation can be interpreted as a statement that short hypotheses are preferred!
- $- \log_{2} P(h)$ is length of $h$ under optimal code
- $- \log_{2} P(D\,|\,h)$ is length of $D$ given $h$ under optimal code
The minimum description length principle recomends choosing the hypothesis that minimizes the sum of these two description lengths.
If a representation of hypotheis is chosen so that the size of $h$ is $- \log_2 P(h)$ and if a representation for exceptions is chosen so that the encoding length of $D$ given $h$ is $- \log_2 P(D\,|\,h)$ then the MDL principle produces MAP hypotheses.