Text Attributes
- Target concept: IsItInteresting? : Document $\rightarrow
\{\oplus,\ominus\}$
- Represent each document by a vector of words, with one attribute
per word position, where $a_i$ is the $i^{th}$ word
in the document. For simplicity, $doc \equiv a_1, a_2, \ldots$.
- Let $w_k$ be the $k^{th}$ word in the English vocabulary.
- Use training examples to estimate
\[P(\oplus), P(\ominus), P(doc\,|\,\oplus), P(doc\,|\,\ominus)\]
- Naive Bayes conditional independence assumption
\[ P(doc\,|\,v_j) = \prod_{i=1}^{length(doc)} P(a_i=w_k \,|\, v_j) \]
where $P(a_i=w_k\,|\, v_j)$ is the probability that the word in position $i$ is
$w_k$, given class $v_j$.
- Also, assume that for all positions $i,m$: $P(a_i=w_k\,|\,v_j) =
P(a_m=w_k\,|\,v_j)$, i.e., the probability of a word is independent of
its position in the document.
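The estimation and classification steps above can be sketched as a small multinomial Naive Bayes classifier. This is a minimal illustration, not the slide's own code: `train_nb` and `classify` are hypothetical names, and Laplace (add-one) smoothing is assumed for estimating $P(w_k\,|\,v_j)$, a common choice when a word never appears under some class.

```python
import math
from collections import Counter

def train_nb(docs, labels, vocab):
    """Estimate P(v_j) and position-independent P(w_k | v_j)
    from training documents, using add-one (Laplace) smoothing."""
    n = len(docs)
    priors, cond = {}, {}
    for c in set(labels):
        class_docs = [d for d, l in zip(docs, labels) if l == c]
        priors[c] = len(class_docs) / n  # P(v_j)
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values())
        # P(w_k | v_j) = (count(w_k) + 1) / (total words + |vocab|)
        cond[c] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return priors, cond

def classify(doc, priors, cond):
    """Return argmax over classes of
    log P(v_j) + sum_i log P(a_i = w_k | v_j)."""
    def score(c):
        return math.log(priors[c]) + sum(
            math.log(cond[c][w]) for w in doc if w in cond[c])
    return max(priors, key=score)

# Toy usage: two "interesting" (+) and two "uninteresting" (-) documents.
docs = [["good", "movie"], ["great", "film"],
        ["bad", "movie"], ["awful", "film"]]
labels = ["+", "+", "-", "-"]
vocab = {w for d in docs for w in d}
priors, cond = train_nb(docs, labels, vocab)
print(classify(["good", "film"], priors, cond))
```

Sums of log-probabilities are used instead of products so that long documents do not underflow floating point; this does not change the argmax.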
José M. Vidal