Attributes with Many Values
- Information gain is biased toward attributes with many
values (e.g. a date or ID attribute that splits $S$ into many small, pure subsets).
- One approach: use $GainRatio$ instead
\[GainRatio(S,A) \equiv \frac{Gain(S,A)}{SplitInformation(S,A)} \]
\[ SplitInformation(S,A) \equiv - \sum_{i=1}^{c} \frac{|S_{i}|}{|S|} \log_{2}
\frac{|S_{i}|}{|S|} \]
where $S_{i}$ is the subset of $S$ for which $A$ has value $v_{i}$, and $c$ is the number of distinct values of $A$
- The $SplitInformation$ term discourages the selection of
attributes with many uniformly distributed values.
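
The effect of the $SplitInformation$ denominator can be illustrated with a minimal Python sketch. The function names, the toy data, and the choice to return 0 when $SplitInformation$ is 0 are assumptions made for illustration; they are not taken from the slides.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (base 2) of the empirical distribution of items."""
    n = len(items)
    return -sum((k / n) * log2(k / n) for k in Counter(items).values())


def gain(values, labels):
    """Gain(S, A): entropy of the labels minus the weighted entropy after splitting on A."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for a, lab in zip(values, labels) if a == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def split_information(values):
    """SplitInformation(S, A): entropy of S with respect to the values of A."""
    return entropy(values)


def gain_ratio(values, labels):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    si = split_information(values)
    if si == 0:
        # Assumption: an attribute with a single value carries no split, so report 0.
        return 0.0
    return gain(values, labels) / si


# Hypothetical toy data: four examples, two classes.
labels = ["yes", "yes", "no", "no"]
id_attr = [1, 2, 3, 4]              # unique value per example
binary_attr = ["a", "a", "b", "b"]  # two values, also separates the classes

id_gain, bin_gain = gain(id_attr, labels), gain(binary_attr, labels)
id_ratio, bin_ratio = gain_ratio(id_attr, labels), gain_ratio(binary_attr, labels)
```

On this toy data both attributes have the same information gain, but the ID-like attribute has twice the split information, so its gain ratio is half that of the binary attribute.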
José M. Vidal