This talk is based on

- Tom M. Mitchell. Machine Learning [1]. McGraw Hill, 1997. Chapter 6.
- and his slides [2].

- Bayesian learning methods provide useful learning algorithms and help us understand other learning algorithms.
- The practical learning algorithms are:
- Naive Bayes learning.
- Bayesian belief network learning—combines prior knowledge with observed data.

- Bayes reasoning provides the "gold standard" for evaluating other algorithms.

- Bayes theorem [3] states that
\[ P(h\,|\,D) = \frac{P(D\,|\,h) P(h)}{P(D)} \]
where
- $P(h)$ = prior probability of hypothesis $h$
- $P(D)$ = prior probability of training data $D$
- $P(h\,|\,D)$ = probability of $h$ given $D$, also known as the posterior probability of $h$.
- $P(D\,|\,h)$ = probability of $D$ given $h$

- Generally want the most probable hypothesis given the training data.
- The maximum a posteriori hypothesis is $h_{MAP}$ where \[ \array{ h_{MAP} & = \arg \max_{h \in H} P(h\,|\,D) \\ & = \arg \max_{h \in H} \frac{P(D\,|\,h) P(h)}{P(D)} \\ & = \arg \max_{h \in H}P(D\,|\,h) P(h) } \] We dropped $P(D)$ because it is independent of $h$.
- If we assume that all hypotheses have the same prior probability, that is, $\forall_{i \neq j}P(h_i) = P(h_j)$, then we can simplify even further and choose the maximum likelihood hypothesis \[h_{ML} = \arg \max_{h \in H} P(D\,|\,h) \]

- A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only $98\%$ of the cases in which the disease is actually present, and a correct negative result in only $97\%$ of the cases in which the disease is not present. Furthermore, $.008$ of the entire population has this cancer. \[ \array{ P(cancer) = .008 & P(\neg cancer) = .992 \\ P(\oplus \,|\, cancer) = .98 & P(\ominus \,|\, cancer) = .02 \\ P(\oplus \,|\, \neg cancer) = .03 & P(\ominus \,|\, \neg cancer) = .97} \]
- If a new patient comes in with a positive test result, what's the probability that he has cancer? \[ \array{ P(\oplus\,|\,cancer) P(cancer) = (.98).008 = .0078 \\ P(\oplus\,|\,\neg cancer) P(\neg cancer) = (.03).992 = .0298 } \] so $h_{MAP} = \neg cancer$.
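The arithmetic above can be checked with a few lines of code; normalizing by $P(\oplus)$ also gives the actual posterior probability of cancer, which the argmax alone hides.

```python
# Applying Bayes theorem to the lab-test example (a minimal sketch).
p_cancer = 0.008
p_pos_given_cancer = 0.98
p_pos_given_not = 0.03

# Unnormalized posteriors P(+|h)P(h) for each hypothesis.
score_cancer = p_pos_given_cancer * p_cancer      # 0.00784
score_not = p_pos_given_not * (1 - p_cancer)      # 0.02976

# Dividing by P(+) = sum of both scores yields the true posterior.
p_cancer_given_pos = score_cancer / (score_cancer + score_not)

print(round(score_cancer, 4), round(score_not, 4))  # 0.0078 0.0298
print(round(p_cancer_given_pos, 2))                 # 0.21
```

Note that even with a positive test, the posterior probability of cancer is only about 21%, because the prior $P(cancer)$ is so small.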

- Product rule: probability $P(A \wedge B)$ of a conjunction of two events A and B: \[P(A \wedge B) = P(A\,|\,B) P(B) = P(B\,|\,A) P(A) \]
- Sum rule: probability of a disjunction of two events A and B: \[P(A \vee B) = P(A) + P(B) - P(A \wedge B) \]
- Theorem of total probability: if events $A_{1}, \ldots, A_{n}$ are mutually exclusive with $\sum_{i = 1}^{n} P(A_{i}) = 1$, then \[P(B) = \sum_{i = 1}^{n} P(B\,|\,A_{i}) P(A_{i})\]

- Brute-Force MAP Learning algorithm:
- For each hypothesis $h$ in $H$, calculate the posterior probability \[ P(h\,|\,D) = \frac{P(D\,|\,h) P(h)}{P(D)}\] where $D = (d_1\ldots d_m)$ is the set of target values from the set of examples $X = ((x_1,d_1) \cdots (x_m,d_m))$.
- Output the hypothesis $h_{MAP}$ with the highest posterior probability \[h_{MAP} = \argmax_{h \in H} P(h\,|\,D)\]
- This can be very slow since it applies Bayes theorem to every $h\in H$.

- In concept learning we have an instance space $X$, a hypothesis space $H$, and training examples $D$. The *Find-S* algorithm finds the most specific hypothesis from the version space $VS_{H,D}$.
- Assume a fixed set of instances $\langle x_{1}, \ldots, x_{m}\rangle$
- Assume $D$ is the set of classifications $D = \langle c(x_{1}), \ldots, c(x_{m})\rangle$
- Choose $P(D\,|\,h)$ \[ P(D\,|\,h) = \left\{ \array{ 1 & \text{if } \forall_{d_i \in D}\, d_i = h(x_i) \\ 0 & \text{otherwise} } \right. \]
- Choose $P(h)$ to be uniform distribution so that $P(h) = \frac{1}{|H|}$ for all $h$ in $H$
- Then we can say that \[ P(h\,|\,D) = \left\{ \array{ \frac{1}{|VS_{H,D}|} & \text{if } h \text{ is consistent with } D \\ 0 & \text{otherwise.} } \right. \]
- What happens in brute-force Bayes concept learning is that all hypotheses start with the same probability; the inconsistent hypotheses then drop to zero while the rest share the probability equally.
- Every consistent hypothesis is a MAP hypothesis.

- We showed that, in the given setting, every hypothesis consistent with $D$ is a MAP hypothesis.
- Let a consistent learner be one that outputs a hypothesis with zero error over training examples.
- *Every consistent learner outputs a MAP hypothesis* if we assume a uniform prior over $H$ and deterministic, noise-free examples.
- For example, since *Find-S* outputs a consistent hypothesis, it outputs a MAP hypothesis, all without manipulating probabilities!

- Consider any real-valued target function $f$.
- Training examples $\langle x_{i}, d_{i} \rangle$, where $d_{i}$ is a noisy training value: $d_{i} = f(x_{i}) + e_{i}$, and $e_{i}$ is a random variable (noise) drawn independently for each $x_{i}$ according to some Gaussian distribution with mean 0.
- The maximum likelihood hypothesis $h_{ML}$ we defined earlier as \[ \array{ h_{ML} &= \argmax_{h \in H} p(D\,|\,h) \\ &= \argmax_{h \in H} \prod_{i=1}^{m} p(d_{i}\,|\,h) } \] if we replace that probability with its equation and simplify we will eventually get \[ h_{ML} = \argmin_{h \in H} \sum_{i=1}^{m} \left(d_{i} - h(x_{i})\right)^{2} \]
- Therefore, $h_{ML}$ is the hypothesis that minimizes the sum of the squared errors, if the observations are generated by adding Normal noise with zero mean to the true target values.
- Under these conditions
*any learning algorithm that minimizes the squared error will output a maximum likelihood hypothesis.*
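This can be illustrated with a small sketch: under zero-mean Gaussian noise, picking the hypothesis that minimizes the sum of squared errors approximately recovers the true target. The linear target $f(x) = 2x$, the noise level, and the grid of candidate slopes are all assumptions of this illustration, not from the original slides.

```python
import random

random.seed(0)

# Target f(x) = 2x, observations corrupted by zero-mean Gaussian noise.
xs = [i / 10 for i in range(50)]
ds = [2 * x + random.gauss(0, 0.3) for x in xs]

def sse(w, xs, ds):
    """Sum of squared errors of the hypothesis h(x) = w*x."""
    return sum((d - w * x) ** 2 for x, d in zip(xs, ds))

# h_ML = argmin of the SSE over a grid of candidate slopes (hypothesis space H).
candidates = [w / 100 for w in range(100, 301)]  # slopes 1.00 .. 3.00
h_ml = min(candidates, key=lambda w: sse(w, xs, ds))
print(h_ml)  # close to the true slope 2
```

The minimizer lands near the true slope because, under Gaussian noise, minimizing squared error and maximizing likelihood are the same optimization.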

- Consider predicting survival probability from patient data
- Training examples $\langle x_{i}, d_{i} \rangle$, where $d_{i}$ is 1 or 0
- Want to train neural network to output a probability given $x_i$ (not a 0 or 1)
- In this case can show \[ h_{ML} = \argmax_{h \in H} \sum_{i=1}^{m} d_{i} \ln h(x_{i}) + (1-d_{i}) \ln (1 - h(x_{i})) \] The negation of this quantity is known as the cross entropy.
- In order to maximize this quantity we do gradient ascent on it with respect to the edge weights. The weight update works out to be \[ w_{jk} \leftarrow w_{jk} + \Delta w_{jk}\] where \[ \Delta w_{jk} = \eta \sum_{i=1}^{m} (d_{i} - h(x_{i})) x_{ijk} \]
- This is the same rule used by Backpropagation except that Backpropagation multiplies by an extra term $h(x_i)(1 - h(x_i))$, which is the derivative of the sigmoid function.
- Backpropagation updates seek ML hypothesis under the assumption that training data can be modeled by Normal noise on the target function.
- Cross entropy updates seek ML hypothesis under the assumption that observed boolean value is a probabilistic function of input instance.
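The cross-entropy update above can be sketched for a single sigmoid unit. The data (labels that depend on the sign of the input), the learning rate, and the bias-as-constant-input convention are assumptions of this sketch.

```python
import math
import random

random.seed(1)

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid math.exp overflow
    return 1.0 / (1.0 + math.exp(-z))

# One sigmoid unit trained by gradient ascent on the cross-entropy
# likelihood, using the update Delta w_j = eta * sum_i (d_i - h(x_i)) x_ij.
# Each input includes a constant 1 so that w[0] acts as a bias weight.
data = [([1.0, x], 1 if x > 0 else 0)
        for x in [random.uniform(-3, 3) for _ in range(200)]]

w = [0.0, 0.0]
eta = 0.05
for _ in range(200):
    grad = [0.0, 0.0]
    for x, d in data:
        h = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
        for j in range(len(w)):
            grad[j] += (d - h) * x[j]   # (d_i - h(x_i)) x_ij, summed over i
    w = [wj + eta * gj for wj, gj in zip(w, grad)]

# The trained unit outputs a probability; threshold it at 0.5 to classify.
correct = sum((sigmoid(w[0] + w[1] * x[1]) > 0.5) == (d == 1) for x, d in data)
print(correct / len(data))  # high training accuracy
```

Note the update contains no sigmoid-derivative factor, which is exactly the difference from the Backpropagation rule mentioned above.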

- We can interpret the definition of $h_{MAP}$ in terms of the basic concepts of information theory
\[
\array{
h_{MAP} &= &\arg \max_{h \in H}P(D\,|\,h) P(h) \\
&= &\arg \max_{h \in H} \log_{2} P(D\,|\,h) + \log_{2} P(h) \\
&= &\arg \min_{h \in H} - \log_{2} P(D\,|\,h) - \log_{2} P(h)
}
\]
This equation can be interpreted as a statement that short hypotheses
are preferred!
- $- \log_{2} P(h)$ is length of $h$ under optimal code
- $- \log_{2} P(D\,|\,h)$ is length of $D$ given $h$ under optimal code

- The minimum description length principle recommends choosing the hypothesis that minimizes the sum of these two description lengths.
*If a representation of hypotheses is chosen so that the size of $h$ is $- \log_2 P(h)$, and if a representation for exceptions is chosen so that the encoding length of $D$ given $h$ is $- \log_2 P(D\,|\,h)$, then the MDL principle produces MAP hypotheses.*

- So far we've sought the most probable hypothesis given the data $D$ (i.e., $h_{MAP}$)
- Given new instance $x$, what is its most probable classification?
- $h_{MAP}(x)$ is not the most probable classification!
- For example, given three possible hypotheses: \[ P(h_{1}\,|\,D)=.4, P(h_{2}\,|\,D)=.3, P(h_{3}\,|\,D)=.3 \] ($h_1$ is the MAP hypothesis) and given a new instance $x$, \[ h_{1}(x)=\oplus, h_{2}(x)=\ominus, h_{3}(x)=\ominus \] what is the most probable classification of $x$, $\oplus$ or $\ominus$?
- The MAP hypothesis ($h_1$) says it is $\oplus$, but if we consider all the hypotheses, together they say it is $\ominus$ with probability $.6$. So the most probable classification is $\ominus$.

- Bayes optimal classification \[ \arg \max_{v_{j} \in V} \sum_{h_{i} \in H} P(v_{j}\,|\,h_{i}) P(h_{i}\,|\,D)\]
- Example: \[ \array{ P(h_{1}\,|\,D)=.4, & P(\ominus \,|\,h_{1})=0, & P(\oplus \,|\,h_{1})=1 \\ P(h_{2}\,|\,D)=.3, & P(\ominus \,|\,h_{2})=1, & P(\oplus \,|\,h_{2})=0 \\ P(h_{3}\,|\,D)=.3, & P(\ominus \,|\,h_{3})=1, & P(\oplus \,|\,h_{3})=0 } \] therefore \[ \array{ \sum_{h_{i} \in H} P(\oplus \,|\,h_{i}) P(h_{i}\,|\,D) & = & .4 \\ \sum_{h_{i} \in H} P(\ominus \,|\,h_{i}) P(h_{i}\,|\,D) & = & .6 } \] and \[ \array{ \arg \max_{v_{j} \in V} \sum_{h_{i} \in H} P(v_{j}\,|\,h_{i}) P(h_{i}\,|\,D) & = & \ominus }\]
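The example above can be computed directly: weight each hypothesis's vote by its posterior and take the argmax over classifications.

```python
# Sketch of the Bayes optimal classifier for the three-hypothesis example.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # P(h_i | D)
votes = {"h1": "+", "h2": "-", "h3": "-"}        # h_i(x)

def bayes_optimal(posteriors, votes, values=("+", "-")):
    # Sum P(v_j | h_i) P(h_i | D); here each h_i votes deterministically,
    # so P(v_j | h_i) is 1 for its vote and 0 otherwise.
    totals = {v: sum(p for h, p in posteriors.items() if votes[h] == v)
              for v in values}
    return max(totals, key=totals.get), totals

label, totals = bayes_optimal(posteriors, votes)
print(label, totals)  # '-' wins with total weight 0.6
```

No hypothesis in $H$ need agree with the Bayes optimal classification; the combined vote can differ from every individual $h_i$.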

- Bayes optimal classifier provides best result, but can be expensive if many hypotheses.
- Gibbs algorithm:
- Choose one hypothesis at random, according to $P(h\,|\,D)$
- Use this to classify new instance

- Surprising fact: Assume target concepts are drawn at
random from $H$ according to priors on $H$. Then:
\[ E[error_{Gibbs}] \leq 2 E[error_{Bayes Optimal}] \]
So, if the learner correctly assumes a uniform prior distribution over $H$, then
- Pick any hypothesis from VS, with uniform probability
- Its expected error is no worse than twice Bayes optimal

- The naive Bayes classifier is one of the most practical learning methods.
- It should be used when
- There is a large set of training examples.
- The attributes that describe instances are conditionally independent given the classification.

- It has been used in many applications such as diagnosis and the classification of text documents.

- Assumes all instances $x \in X$ are described by attributes $\langle a_{1}, a_{2} \ldots a_{n} \rangle$.
- Assume target function $f: X \rightarrow V$, where $V$ is some finite set.
- Most probable value of $f(x)$ given $x = \langle a_{1}, a_{2} \ldots a_{n} \rangle$ is: \[ \array{ v_{MAP} &= &\argmax_{v_{j} \in V} P(v_{j} \,|\, a_{1}, a_{2} \ldots a_{n}) \\ v_{MAP} &= &\argmax_{v_{j} \in V} \frac{P(a_{1}, a_{2} \ldots a_{n}\,|\,v_{j}) P(v_{j})}{P(a_{1}, a_{2} \ldots a_{n})} \\ v_{MAP} &= &\argmax_{v_{j} \in V} P(a_{1}, a_{2} \ldots a_{n}\,|\,v_{j}) P(v_{j}) } \]
- Naive Bayes assumes conditional independence given the target value, that is \[ P(a_{1}, a_{2} \ldots a_{n}\,|\,v_{j}) = \prod_{i} P(a_{i} \,|\, v_{j}) \]
- Naive Bayes classifier: \[ v_{NB} = \argmax_{v_{j} \in V} P(v_{j}) \prod_{i} P(a_{i} \,|\, v_{j}) \]

- The Naive Bayes algorithm has a learning function and a classifying function.

**Naive_Bayes_Learn(examples)**

- For each target value $v_j$
- $\hat{P}(v_j) \leftarrow $ estimate $P(v_j)$
- For each attribute value $a_i$ of each attribute $a$
- $\hat{P}(a_i\,|\,v_j) \leftarrow$ estimate $P(a_i\,|\,v_j)$

**Classify_New_Instance($x$)**

\[ v_{NB} = \argmax_{v_{j} \in V} \hat{P}(v_{j}) \prod_{a_i \in x} \hat{P}(a_{i} \,|\, v_{j}) \]

- We are trying to determine if it's time to PlayTennis, given that \[\langle Outlook=sunny, Temp=cool, Humid=high, Wind=strong \rangle \]
- We want to compute \[ v_{NB} = \argmax_{v_{j} \in V} P(v_{j}) \prod_{i} P(a_{i} \,|\, v_{j}) \]
- Given this set of examples we can calculate that \[ P(PlayTennis = yes) = \frac{9}{14} = .64 \] \[ P(PlayTennis = no) = \frac{5}{14} = .36 \] similarly \[P(Wind = strong \,|\, PlayTennis = yes) = \frac{3}{9} = .33 \] \[P(Wind = strong \,|\, PlayTennis = no) = \frac{3}{5} = .6 \] using these and a few more similar probabilities we can calculate \[P(yes) \cdot P(Outlook=sunny \,|\, yes)\cdot P(Temperature = cool \,|\, yes)\cdot P(Humidity = high \,|\, yes)\cdot P(Wind = strong \,|\, yes) = .005 \] \[P(no)\cdot P(Outlook=sunny \,|\, no)\cdot P(Temperature = cool \,|\, no)\cdot P(Humidity = high \,|\, no)\cdot P(Wind = strong \,|\, no) = .021 \]
- So \[ v_{NB} = no \]
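The two products can be verified numerically. The conditional probabilities not quoted in the text ($P(sunny\,|\,yes)$, etc.) are taken from Mitchell's standard 14-example PlayTennis table; treat them as assumptions of this sketch.

```python
# Estimated probabilities from the PlayTennis example.
p = {
    "yes": 9 / 14, "no": 5 / 14,
    ("sunny", "yes"): 2 / 9,  ("sunny", "no"): 3 / 5,
    ("cool", "yes"): 3 / 9,   ("cool", "no"): 1 / 5,
    ("high", "yes"): 3 / 9,   ("high", "no"): 4 / 5,
    ("strong", "yes"): 3 / 9, ("strong", "no"): 3 / 5,
}

# v_NB = argmax_v P(v) * prod_i P(a_i | v) over the observed attributes.
attrs = ("sunny", "cool", "high", "strong")
score = {}
for v in ("yes", "no"):
    s = p[v]
    for a in attrs:
        s *= p[(a, v)]
    score[v] = s

print(round(score["yes"], 4), round(score["no"], 4))  # 0.0053 0.0206
v_nb = max(score, key=score.get)
print(v_nb)  # no
```

Normalizing the two scores shows the classifier assigns about $0.0206/(0.0053+0.0206) \approx .80$ to $no$.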

- Conditional independence assumption is often violated \[ P(a_{1}, a_{2} \ldots a_{n}\,|\,v_{j}) = \prod_{i} P(a_{i} \,|\, v_{j}) \] but it works surprisingly well anyway.
- We don't need estimated posteriors $\hat{P}(v_j\,|\,x)$ to be correct; need only that \[\argmax_{v_{j} \in V} \hat{P}(v_{j}) \prod_{i} \hat{P}(a_{i} \,|\, v_{j}) = \argmax_{v_{j} \in V} P(v_{j}) P(a_{1} \ldots, a_n \,|\, v_{j}) \]
- Naive Bayes posteriors often unrealistically close to 1 or 0.
- If none of the training instances with target value $v_j$ have attribute value $a_i$, then
\[ \hat{P}(a_i\,|\,v_j) = 0 \text{, and...}\]
\[ \hat{P}(v_{j}) \prod_{i} \hat{P}(a_{i} \,|\, v_{j}) = 0 \]
Typical solution is Bayesian estimate for $\hat{P}(a_{i} \,|\, v_{j})$
\[ \hat{P}(a_{i} \,|\, v_{j}) \leftarrow \frac{n_{c} + mp}{n + m} \]
where
- $n$ is number of training examples for which $v=v_j$,
- $n_c$ number of examples for which $v=v_j$ and $a=a_i$
- $p$ is prior estimate for $\hat{P}(a_{i} \,|\, v_{j})$
- $m$ is weight given to prior (i.e., the number of "virtual" examples)
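The m-estimate is a one-line function; the example values below are illustrative assumptions.

```python
# A minimal sketch of the m-estimate for P(a_i | v_j).
def m_estimate(n_c, n, p, m):
    """Bayesian estimate (n_c + m*p) / (n + m): blends the observed
    frequency n_c/n with the prior p, weighted by m 'virtual' examples."""
    return (n_c + m * p) / (n + m)

# With no observed occurrences (n_c = 0) the estimate stays above zero,
# so a single unseen attribute value no longer zeroes the whole product.
print(m_estimate(0, 9, 0.5, 1))   # 0.05 rather than 0
```

With $m = 0$ the estimate reduces to the plain frequency $n_c/n$; larger $m$ pulls it toward the prior $p$.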

- Can we learn to classify which news articles are of interest? Which web pages are about astronomy?
- Naive Bayes is among the most effective algorithms.
- What attributes shall we use to represent text documents?

- Target concept: IsItInteresting? : Document $\rightarrow \{\oplus,\ominus\}$
- Represent each document by a vector of words, one attribute per word position in the document, where $a_i$ is the $i^{th}$ word in the document. For simplicity, $doc \equiv a_1, a_2 \ldots$.
- Let $w_k$ be the $k$th word in the English vocabulary.
- Use training examples to estimate \[P(\oplus), P(\ominus), P(doc\,|\,\oplus), P(doc\,|\,\ominus)\]
- Naive Bayes conditional independence assumption \[ P(doc\,|\,v_j) = \prod_{i=1}^{length(doc)} P(a_i=w_k \,|\, v_j) \] where $P(a_i=w_k\,|\, v_j)$ is probability that word in position $i$ is $w_k$, given $v_j$.
- Also, assume that $\forall_{i,m \in \mathcal{N}} P(a_i=w_k\,|\,v_j) = P(a_m=w_k\,|\,v_j)$, i.e., any word any place.

- Given a set of training $Examples$ and a set of target values $V$.

- Collect all words and other tokens that occur in
Examples.
- $Vocabulary \leftarrow$ all distinct words and other tokens in $Examples$

- Calculate the required $P(v_{j})$ and $P(w_{k}\,|\,v_{j})$ probability terms.
- $docs_{j} \leftarrow $ subset of $Examples$ for which the target value is $v_{j}$
- $P(v_{j}) \leftarrow \frac{|docs_{j}|}{|Examples|}$
- $Text_{j} \leftarrow $ a single document created by concatenating all members of $docs_{j}$
- $n \leftarrow$ total number of words in $Text_{j}$ (counting duplicate words multiple times)
- for each word $w_{k}$ in $Vocabulary$
- $n_{k} \leftarrow$ number of times word $w_{k}$ occurs in $Text_{j}$
- $P(w_{k}\,|\,v_{j}) \leftarrow \frac{n_{k} + 1}{n + |Vocabulary|}$

- Given a Doc

- $positions \leftarrow$ all word positions in $Doc$ that contain tokens found in $Vocabulary$
- Return $v_{NB}$, where \[v_{NB} = \argmax_{v_{j} \in V} P(v_{j}) \prod_{i \in positions}P(a_{i}\,|\,v_{j}) \]
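The learn and classify procedures above can be sketched compactly. The toy documents and labels are assumptions for illustration; log probabilities are used instead of the raw product to avoid underflow on long documents.

```python
import math
from collections import Counter

def learn_naive_bayes_text(examples):
    """examples: list of (list_of_words, label). Returns (priors, cond, vocab)."""
    vocab = {w for doc, _ in examples for w in doc}
    labels = {v for _, v in examples}
    priors, cond = {}, {}
    for v in labels:
        docs_v = [doc for doc, lab in examples if lab == v]
        priors[v] = len(docs_v) / len(examples)
        text = [w for doc in docs_v for w in doc]  # Text_j: concatenation
        counts = Counter(text)
        n = len(text)                              # total words, with duplicates
        for w in vocab:
            # (n_k + 1) / (n + |Vocabulary|), as in the algorithm above.
            cond[(w, v)] = (counts[w] + 1) / (n + len(vocab))
    return priors, cond, vocab

def classify(doc, priors, cond, vocab):
    # Sum log probabilities over positions whose tokens appear in Vocabulary.
    def score(v):
        return math.log(priors[v]) + sum(
            math.log(cond[(w, v)]) for w in doc if w in vocab)
    return max(priors, key=score)

examples = [("stars galaxy telescope".split(), "astronomy"),
            ("goal referee penalty".split(), "sports"),
            ("galaxy nebula orbit".split(), "astronomy")]
priors, cond, vocab = learn_naive_bayes_text(examples)
print(classify("telescope orbit".split(), priors, cond, vocab))  # astronomy
```

The position-independence assumption appears here implicitly: a word's estimate depends only on the word and the class, never on where it occurs.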

- This algorithm was shown to classify Usenet articles into their appropriate newsgroups with 89% accuracy.
- A similar approach was proposed by Paul Graham in A Plan for Spam [4]. Several implementations exist such as Spambayes [5].

- Naive Bayes assumption of conditional independence is too restrictive, but learning is intractable without some such assumptions. Hmmm.....
- Bayesian Belief networks describe conditional independence among subsets of variables
- They allow combining prior knowledge about (in)dependencies among variables with observed training data
- Also known as Bayes nets.

- $X$ is conditionally independent of $Y$ given $Z$ if the probability distribution governing $X$ is independent of the value of $Y$ given the value of $Z$; that is, if \[ \forall_{x_i,y_j,z_k} P(X = x_i \,|\, Y = y_j, Z = z_k) = P(X = x_i \,|\, Z = z_k) \] more compactly, we write \[ P(X \,|\, Y,Z) = P(X \,|\, Z) \]
- For example, $Thunder$ is conditionally independent of $Rain$, given $Lightning$ \[ P(Thunder \,|\, Rain, Lightning) = P(Thunder \,|\, Lightning) \]
- Naive Bayes uses cond. indep. to justify \[ \array{ P(X,Y\,|\,Z) &= & P(X\,|\,Y,Z)\cdot P(Y\,|\,Z) \\ &= & P(X\,|\,Z)\cdot P(Y\,|\,Z) } \]

- The network represents a set of conditional independence assertions.
- Each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors.
- It forms a directed acyclic graph.
- In general, \[P(y_1, \ldots, y_n) = \prod_{i=1}^{n} P(y_i \,|\, Parents(Y_i)) \] where $Parents(Y_i)$ denotes immediate predecessors of $Y_i$ in graph
- Joint distribution is fully defined by graph, plus the $P(y_i\,|\,Parents(Y_i))$.
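The factored joint distribution can be sketched for a tiny net. The structure (Storm and BusTourGroup as parents of Campfire) borrows the example variables used later in this talk, but every CPT number below is a made-up assumption for illustration.

```python
# P(Storm) and P(BusTourGroup): root nodes with no parents.
p_storm = {True: 0.2, False: 0.8}
p_bus = {True: 0.1, False: 0.9}
# P(Campfire = T | Storm, BusTourGroup): one CPT row per parent assignment.
p_campfire = {(True, True): 0.4, (True, False): 0.1,
              (False, True): 0.8, (False, False): 0.2}

def joint(storm, bus, campfire):
    # P(y_1, ..., y_n) = product over nodes of P(y_i | Parents(Y_i)).
    pc = p_campfire[(storm, bus)]
    return p_storm[storm] * p_bus[bus] * (pc if campfire else 1 - pc)

# The joint over all assignments sums to 1, as required of a distribution.
total = sum(joint(s, b, c) for s in (True, False)
            for b in (True, False) for c in (True, False))
print(total)  # 1.0
```

Only $2 + 2 + 4 = 8$ independent numbers are needed here, versus $2^3 - 1 = 7$ for an unstructured joint; the savings grow dramatically as the number of variables increases.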

- How can one infer the (probabilities of) values of one or more network variables, given observed values of others?
- Bayes net contains all information needed for this inference.
- If there is only one variable with an unknown value then easy to infer it.
- In the general case the problem is NP-hard.
- In practice, we can succeed in many cases:
- Exact inference methods work well for some network structures.
- Monte Carlo methods “simulate” the network randomly to calculate approximate solutions.

- Much of the research in Bayesian nets goes into how to use them. We are, instead, interested in learning them.
- The network structure might be *known* or *unknown*.
- Training examples might provide values of *all* network variables, or just *some*.
- If the structure is known and we can observe all the variables, then it is as easy as training a naive Bayes classifier.

- Suppose the structure is known and the variables are partially observable. For example, we observe (ForestFire, Storm, BusTourGroup, Thunder), but not (Lightning, Campfire).
- We can use a method similar to training neural network with hidden units.
- Specifically, we will use gradient ascent.
- We want to converge to network $h$ that (locally) maximizes $P(D\,|\,h)$

- Let $w_{ijk}$ denote one entry in the conditional probability table for variable $Y_i$ in the network \[ w_{ijk} = P(Y_i=y_{ij}\,|\,Parents(Y_i) = u_{ik}) \] where $u_{ik}$ is a list of parent values; e.g., if $Y_i = Campfire$, then $u_{ik}$ might be $\langle Storm=T, BusTourGroup=F \rangle$
- Perform gradient ascent by repeatedly
- update all $w_{ijk}$ using training data $D$ \[w_{ijk} \leftarrow w_{ijk} + \eta \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik} \,|\, d)}{w_{ijk}} \]
- then, re-normalize the $w_{ijk}$ to assure $\sum_{j} w_{ijk} = 1$ and $0 \leq w_{ijk} \leq 1$

- If some of the probabilities are not present in the data they can be derived via inference from the net.
- Again, this only finds locally optimal conditional probabilities.

- The EM algorithm can also be used to learn a Bayes net.

- Calculate probabilities of unobserved variables, assuming $h$.
- Calculate new $w_{ijk}$ to maximize $E[\ln P(D\,|\,h)]$ where $D$ now includes both observed and (calculated probabilities of) unobserved variables.

- When the structure of the network is unknown we can use algorithms that use greedy search to add/subtract edges and nodes. This is an active area of research.

- Data is only partially observable.
- Unsupervised clustering (target value unobservable).
- Supervised learning (some instance attributes unobservable).
- Can be used to:
- Train Bayesian belief networks.
- Perform unsupervised clustering (AUTOCLASS).
- Learn Hidden Markov Models.

- Imagine the examples are generated by choosing instances $x$ from $k$ Gaussians, with uniform probability.
- The learning task is to output a hypothesis that describes the means $\langle \mu_1, \ldots, \mu_k \rangle$ of the $k$ distributions.
- We don't know which instance $x_i$ was generated by which Gaussian.
- We would like to find a maximum likelihood hypothesis for those means, that is, the maximum likelihood estimates of $\langle \mu_1, \ldots, \mu_k \rangle$.
- Think of full description of each instance as $y_i = \langle x_i, z_{i1}, z_{i2}
\rangle$, where
- $z_{ij}$ is 1 if $x_i$ generated by $j$th Gaussian
- $x_i$ observable
- $z_{ij}$ unobservable

- EM Algorithm: Pick random initial $h = \langle \mu_1, \mu_2 \rangle$, then iterate

**E step:** Calculate the expected value $E[z_{ij}]$ of each hidden variable $z_{ij}$, assuming the current hypothesis $h = \langle \mu_1, \mu_2 \rangle$ holds. \[ \array{ E[z_{ij}] & = & \frac{p(x=x_i \,|\, \mu = \mu_j)}{\sum_{n=1}^{2} p(x = x_i \,|\, \mu=\mu_n)} \\ & = & \frac{e^{-\frac{1}{2 \sigma^2} (x_i - \mu_j)^2}}{\sum_{n=1}^{2} e^{-\frac{1}{2 \sigma^2} (x_i - \mu_n)^2}} } \]

**M step:** Calculate a new maximum likelihood hypothesis $h' = \langle \mu_1', \mu_2' \rangle$, assuming the value taken on by each hidden variable $z_{ij}$ is its expected value $E[z_{ij}]$ calculated above. Then replace $h = \langle \mu_1, \mu_2 \rangle$ by $h' = \langle \mu_1', \mu_2' \rangle$. \[ \mu_j \leftarrow \frac{\sum_{i=1}^m E[z_{ij}] x_i}{\sum_{i=1}^m E[z_{ij}]} \]
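The two steps above can be sketched directly. The true means (0 and 10), the known common variance $\sigma^2 = 1$, the sample sizes, and the initial guesses are all assumptions of this sketch.

```python
import math
import random

random.seed(2)

# Data drawn from two Gaussians with a known common variance sigma^2 = 1.
sigma = 1.0
xs = ([random.gauss(0.0, sigma) for _ in range(200)] +
      [random.gauss(10.0, sigma) for _ in range(200)])

mu = [1.0, 9.0]  # arbitrary initial hypothesis h = <mu_1, mu_2>
for _ in range(30):
    # E step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2)),
    # normalized over the two Gaussians.
    e = []
    for x in xs:
        w = [math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) for m in mu]
        total = sum(w)
        e.append([wj / total for wj in w])
    # M step: mu_j <- mean of the x_i weighted by E[z_ij].
    mu = [sum(e[i][j] * xs[i] for i in range(len(xs))) /
          sum(e[i][j] for i in range(len(xs))) for j in range(2)]

print([round(m, 1) for m in mu])  # close to the true means 0 and 10
```

Each iteration re-estimates the soft memberships and then the means, so the recovered means settle near the generating ones; a poor initialization can still leave EM at an inferior local maximum.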

- Converges to local maximum likelihood $h$ and provides estimates of hidden variables $z_{ij}$
- In fact, local maximum in $E[\ln P(Y\,|\,h)]$
- $Y$ is complete (observable plus unobservable variables) data
- Expected value is taken over possible values of unobserved variables in $Y$

Given:

- Observed data $X=\{x_1, \ldots, x_m\}$
- Unobserved data $Z=\{z_1, \ldots, z_m\}$
- Parameterized probability distribution $P(Y\,|\,h)$, where
- $Y=\{y_1, \ldots, y_m\}$ is the full data $y_i = x_i \cup z_i$ and $h$ are the parameters

Determine:

- $h$ that (locally) maximizes $E[\ln P(Y\,|\,h)]$

- Define likelihood function $Q(h' \,|\, h)$ which calculates $Y = X \cup Z$ using observed $X$ and current parameters $h$ to estimate $Z$ \[ Q(h' \,|\, h) \leftarrow E[ \ln P(Y \,|\, h') \,|\, h, X ] \]

**E step:** Calculate $Q(h'\,|\,h)$ using the current hypothesis $h$ and the observed data $X$ to estimate the probability distribution over $Y$. \[ Q(h' \,|\, h) \leftarrow E[ \ln P(Y \,|\, h') \,|\, h, X ] \]

**M step:** Replace hypothesis $h$ by the hypothesis $h'$ that maximizes this $Q$ function. \[ h \leftarrow \argmax_{h'} Q(h' \,|\, h) \]

- Combine prior knowledge with observed data
- Impact of prior knowledge (when correct!) is to lower the sample complexity
- Active research areas
- Extend from boolean to real-valued variables
- Parameterized distributions instead of tables
- Extend to first-order instead of propositional systems
- More effective inference methods

1. Machine Learning book at Amazon, http://www.amazon.com/exec/obidos/ASIN/0070428077/multiagentcom/
2. Slides by Tom Mitchell on Machine Learning, http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html
3. Wikipedia: Bayes Theorem, http://www.wikipedia.org/wiki/Bayes_theorem
4. A Plan for Spam by Paul Graham, http://www.paulgraham.com/spam.html
5. Spambayes Homepage, http://spambayes.sourceforge.net/

This talk available at http://jmvidal.cse.sc.edu/talks/bayesianlearning/

Copyright © 2009 José M. Vidal. All rights reserved.

25 February 2003, 02:16PM