Text Attributes
- Target concept: IsItInteresting? : Document $\rightarrow
\{\oplus,\ominus\}$
- Represent each document by a vector of words, with one attribute
per word position, where $a_i$ is the $i^{th}$ word
in the document. For simplicity, $doc \equiv a_1, a_2, \ldots$.
- Let $w_k$ be the $k^{th}$ word in the English vocabulary.
- Use training examples to estimate
\[P(\oplus), P(\ominus), P(doc\,|\,\oplus), P(doc\,|\,\ominus)\]
- Naive Bayes conditional independence assumption
\[ P(doc\,|\,v_j) = \prod_{i=1}^{length(doc)} P(a_i=w_k \,|\, v_j) \]
where $P(a_i=w_k\,|\, v_j)$ is the probability that the word in position $i$ is
$w_k$, given class $v_j$.
- Also, assume that for all positions $i,m$: $P(a_i=w_k\,|\,v_j) =
P(a_m=w_k\,|\,v_j)$, i.e., the probability of a word is independent of
its position in the document.
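The estimation and classification steps above can be sketched as a small multinomial Naive Bayes classifier. This is a minimal illustration, not the slide's own code: `train_nb` and `classify` are hypothetical names, and Laplace (add-one) smoothing is assumed for estimating $P(w_k\,|\,v_j)$, a common choice when a word never appears under some class.

```python
import math
from collections import Counter

def train_nb(docs, labels, vocab):
    """Estimate P(v_j) and position-independent P(w_k | v_j)
    from training documents, using add-one (Laplace) smoothing."""
    n = len(docs)
    priors, cond = {}, {}
    for c in set(labels):
        class_docs = [d for d, l in zip(docs, labels) if l == c]
        priors[c] = len(class_docs) / n  # P(v_j)
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values())
        # P(w_k | v_j) = (count(w_k) + 1) / (total words + |vocab|)
        cond[c] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return priors, cond

def classify(doc, priors, cond):
    """Return argmax over classes of
    log P(v_j) + sum_i log P(a_i = w_k | v_j)."""
    def score(c):
        return math.log(priors[c]) + sum(
            math.log(cond[c][w]) for w in doc if w in cond[c])
    return max(priors, key=score)

# Toy usage: two "interesting" (+) and two "uninteresting" (-) documents.
docs = [["good", "movie"], ["great", "film"],
        ["bad", "movie"], ["awful", "film"]]
labels = ["+", "+", "-", "-"]
vocab = {w for d in docs for w in d}
priors, cond = train_nb(docs, labels, vocab)
print(classify(["good", "film"], priors, cond))
```

Sums of log-probabilities are used instead of products so that long documents do not underflow floating point; this does not change the argmax.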
José M. Vidal