Analytical Learning
1 Introduction
- So far, we have studied inductive learning methods.
- Induction fails when there is very little data. In fact,
computational learning theory gives us a lower bound on the
number of training examples needed.
- Q: Can we break that bound?
- Yes, if we re-state the learning problem.
1.1 New Learning Problem
- Learning algorithm accepts explicit prior knowledge as an
input, in addition to the training data.
- Inverted
deduction systems also use background knowledge, but they
use it to augment the description of instances.
\[\forall_{\langle x_{i}, f(x_i) \rangle \in D} B \wedge h \wedge
x_{i} \entails f(x_{i}) \]
This results in increasing the size of H.
- In explanation-based learning the prior
knowledge is used to reduce the size of H. EBL
assumes that
\[
\forall_{\langle x_{i}, f(x_i) \rangle \in D} B' \wedge x_{i} \entails f(x_{i})
\]
and outputs $h$ such that
\[
\forall_{\langle x_{i}, f(x_i) \rangle \in D} h \wedge x_{i} \entails f(x_{i})
\]
and
\[
D \wedge B' \entails h
\]
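The entailment conditions above can be checked mechanically for ground Horn clauses. Here is a minimal forward-chaining sketch in Python; the atoms and the one-clause hypothesis are made-up stand-ins, not examples from the text:

```python
# Minimal forward-chaining entailment check for ground Horn clauses.
# Illustrates the EBL output condition h ∧ x_i |= f(x_i). The atoms
# and the one-clause hypothesis below are made-up stand-ins.

def entails(facts, rules, goal):
    """True if `facts` plus the Horn `rules` derive `goal`.

    facts: set of ground atoms (strings)
    rules: list of (body, head) pairs; body is a tuple of atoms
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in known and all(b in known for b in body):
                known.add(head)
                changed = True
    return goal in known

# x_i: the description of one instance; h: a one-clause hypothesis.
x_i = {"Red(o1)", "Light(o1)"}
h = [(("Red(o1)", "Light(o1)"), "SafeToStack(o1,o2)")]
print(entails(x_i, h, "SafeToStack(o1,o2)"))  # True
```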
1.2 EBL Example
- Want program to recognize "chessboard positions in which black will lose its queen within two moves."
- Because there are so many possible chessboards, we would
need to provide a lot of examples.
- And yet, humans can learn this concept really
quickly. Why?
- Humans appear to rely heavily on explaining the training example in terms of their prior knowledge.
1.3 Explanations
- The explanation tells us how to rationally
generalize from the details of the correct training example to
a correct general hypothesis.
- Namely, the features of the training example that are
mentioned in the explanation are the ones that should be
included in the hypothesis.
- What is the prior knowledge needed in the chess example?
- The knowledge about the legal moves of chess, the fact
that players alternate moves, and the goal is to capture
the king.
- Note that, in principle, we could use this
knowledge to calculate the optimal move for every board, but
this is not practical in most situations (cf. Deep Blue [2]).
1.4 Inductive vs. Analytical Learning
- In inductive learning the learner is given $H$ and $D=
\{\langle x_1, f(x_1) \rangle,\ldots,\langle x_n,
f(x_n)\rangle\}$, where $f(x_i)$ is the target value for
instance $x_i$. Learner must pick $h \in H$ that is consistent
with $D$.
- In analytical learning the learner is given
$H$, $D$, and a domain theory $B$. Learner must
pick $h \in H$ that is consistent with both $D$, and
$B$.
- We say that $h$ is consistent with $B$ if $B$ does not
entail $\neg h$:
\[
\neg (B \entails \neg h) \]
1.5 Example
- X: Each instance is a pair of objects described by: Type,
Color, Volume, Owner, Material, Density, and On.
- H: Each hypothesis is a set of Horn clauses.
- Target Concept: SafeToStack(x,y)
- D: One example looks like:
SafeToStack(o1,o2),On(o1,o2), Type(o1,Box), Type(o2,
endtable), Color(o1,red), Color(o2,blue), Volume(o1, 2),
Owner(o1, Fred), Owner(o2, Louis), Density(o1, 0.3),
Material(o1, Cardboard), Material(o2, Wood)
- B: Domain Theory
SafeToStack(x,y) ← ¬Fragile(y)
SafeToStack(x,y) ← Lighter(x,y)
Lighter(x,y) ← Weight(x,wx) ∧ Weight(y,wy) ∧ LessThan(wx,wy)
Weight(x,w) ← Volume(x,v) ∧ Density(x,d) ∧ Equal(w,times(v,d))
Weight(x,5) ← Type(x, endtable)
Fragile(x) ← Material(x,Glass)
- Find h that is consistent with training examples and domain theory.
2 Learning With Perfect Domain Theories
- A perfect domain theory is correct and complete.
- A domain theory is correct if each of its
assertions is a truthful statement about the world.
- A domain theory is complete wrt the target concept and
X if it covers every positive example in the instance
space.
- So, if we have a perfect domain theory, why do we need to
learn?
- Chess. Often the theory leads to too many deductions
(large breadth) making it impossible to find the optimal
strategy. The examples help to focus search.
- Perfect domain theories are often unrealistic, but,
learning in them is a first step before learning with
imperfect theories (next chapter).
- Prolog-EBG is an EBL learner. It uses sequential covering.
2.1 Prolog-EBG
Prolog-EBG(TargetConcept, TrainingExamples, DomainTheory)
- LearnedRules = {}
- Pos = the positive examples from TrainingExamples.
- for each PositiveExample in Pos that is not covered by LearnedRules do
- Explanation = an explanation, in terms of DomainTheory,
of how PositiveExample satisfies the TargetConcept.
- SufficientConditions = the most general set of
features of PositiveExample sufficient to satisfy the
TargetConcept according to the Explanation.
- LearnedRules = LearnedRules + {TargetConcept ← SufficientConditions}.
- return LearnedRules
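The loop above can be sketched as runnable Python. The explain-and-generalize steps are collapsed into a stub that keeps a fixed set of "relevant" feature names; all feature names and data below are illustrative assumptions, not the algorithm's real proof machinery:

```python
# Runnable skeleton of Prolog-EBG's sequential-covering loop. The
# explain-and-generalize steps are collapsed into a stub: an
# "explanation" here is just a fixed set of relevant feature names.
# All feature names and values are illustrative assumptions.

RELEVANT = {"Volume_x", "Density_x", "Type_y"}  # stub for the proof analysis

def covers(rule, example):
    """A rule covers an example if all its conditions match."""
    return all(example.get(k) == v for k, v in rule.items())

def prolog_ebg(positives):
    learned_rules = []
    for ex in positives:
        if any(covers(r, ex) for r in learned_rules):
            continue  # already covered: skip, as in sequential covering
        # Keep only the features the stub "explanation" marks as relevant.
        rule = {k: v for k, v in ex.items() if k in RELEVANT}
        learned_rules.append(rule)
    return learned_rules

positives = [
    {"Volume_x": 2, "Density_x": 0.3, "Type_y": "endtable", "Color_x": "red"},
    {"Volume_x": 2, "Density_x": 0.3, "Type_y": "endtable", "Color_x": "blue"},
]
print(prolog_ebg(positives))
# [{'Volume_x': 2, 'Density_x': 0.3, 'Type_y': 'endtable'}]
```

Note how the second positive example is already covered by the rule learned from the first, so only one rule is produced.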
2.1.1 Explaining the Example
- Give a proof, using the domain theory, that the (positive)
training example satisfies the target concept.
- In our ongoing example, the positive example
SafeToStack(o1,o2) can be explained using the domain
theory as follows:
1. Volume(o1,2) ∧ Density(o1,0.3) ∧ Equal(0.6, 2*0.3) → Weight(o1,0.6)
2. Type(o2,endtable) → Weight(o2,5)
3. Weight(o1, 0.6) ∧ LessThan(0.6, 5) ∧ Weight(o2,5) → Lighter(o1,o2)
4. Lighter(o1, o2) → SafeToStack(o1,o2)
- In Prolog-EBG this explanation is generated using backward
chaining search, as done by Prolog.
- Like Prolog, it halts when it finds a proof.
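The four proof steps above can be replayed mechanically. This sketch forward-chains ground instances of the domain theory, evaluating the numeric predicates (Equal, LessThan) directly in Python rather than proving them; it confirms the proof, but it is not a general theorem prover:

```python
# Replaying the SafeToStack(o1,o2) explanation by forward chaining over
# ground instances of the domain theory. Numeric predicates (Equal,
# LessThan) are evaluated directly in Python; a sketch, not a prover.

facts = {
    ("Volume", "o1", 2.0), ("Density", "o1", 0.3),
    ("Type", "o2", "endtable"),
}

def step(facts):
    new = set(facts)
    # Weight(x,w) <- Volume(x,v) ∧ Density(x,d) ∧ Equal(w, v*d)
    for (p1, x, v) in facts:
        for (p2, x2, d) in facts:
            if p1 == "Volume" and p2 == "Density" and x == x2:
                new.add(("Weight", x, v * d))
    # Weight(x,5) <- Type(x, endtable)
    for (p, x, t) in facts:
        if p == "Type" and t == "endtable":
            new.add(("Weight", x, 5.0))
    # Lighter(x,y) <- Weight(x,wx) ∧ Weight(y,wy) ∧ LessThan(wx,wy)
    for (p1, x, wx) in facts:
        for (p2, y, wy) in facts:
            if p1 == p2 == "Weight" and wx < wy:
                new.add(("Lighter", x, y))
    # SafeToStack(x,y) <- Lighter(x,y)
    for (p, x, y) in facts:
        if p == "Lighter":
            new.add(("SafeToStack", x, y))
    return new

while True:  # iterate to a fixed point
    nxt = step(facts)
    if nxt == facts:
        break
    facts = nxt

print(("SafeToStack", "o1", "o2") in facts)  # True
```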
2.1.2 Analyze the Explanation
- Which of the many features of the objects are relevant to
the target concept?
- Those that appear in the explanation we just built.
- We can collect these and substitute x,y for o1,o2 to get
Volume(x,2) ∧ Density(x,0.3) ∧ Type(y,endtable) → SafeToStack(x,y)
So, we now have the most general set of features, right?
- Wrong! An even more general rule can be
obtained using a more careful analysis of the
explanation.
2.1.3 Weakest Preimage
- The weakest preimage of a conclusion C with
respect to a proof P is the most general set of assertions A,
such that A entails C according to P.
- Prolog-EBG computes the most general rule that can be
justified by the explanation by computing the weakest
preimage.
- It is calculated by using regression—work
iteratively backward through the explanation, computing the
weakest preimage at each step.
Frontier: SafeToStack(x,y)
  regress through 4. Lighter(o1,o2) → SafeToStack(o1,o2)
Frontier: Lighter(x,y)
  regress through 3. Weight(o1,0.6) ∧ LessThan(0.6,5) ∧ Weight(o2,5) → Lighter(o1,o2)
Frontier: Weight(x,wx) ∧ LessThan(wx,wy) ∧ Weight(y,wy)
  regress through 2. Type(o2,endtable) → Weight(o2,5)
Frontier: Weight(x,wx) ∧ LessThan(wx,5) ∧ Type(y,endtable)
  regress through 1. Volume(o1,2) ∧ Density(o1,0.3) ∧ Equal(0.6,2*0.3) → Weight(o1,0.6)
Frontier: Volume(x,vx) ∧ Density(x,dx) ∧ Equal(wx,vx*dx) ∧ LessThan(wx,5) ∧ Type(y,endtable)
- The final Horn clause has a body that corresponds to the
weakest preconditions, and the head is the concept:
SafeToStack(x,y) ← Volume(x, vx) ∧ Density(x,dx) ∧ Equal(wx, vx*dx) ∧ LessThan(wx,5) ∧ Type(y,endtable)
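A single regression step can be sketched with a small unifier: unify the target literal with a rule head, then push the substitution through the rule body. Atoms are flat tuples and variables are '?'-prefixed strings, a representation chosen here purely for illustration:

```python
# One regression step from the weakest-preimage computation: regress
# SafeToStack(x,y) through rule 4, Lighter(x,y) → SafeToStack(x,y).
# Atoms are flat tuples; variables are '?'-prefixed strings. A sketch
# of the idea, not a full goal-regression procedure.

def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def unify(a, b, subst):
    """Unify two flat atoms; return an extended substitution or None."""
    if len(a) != len(b):
        return None
    s = dict(subst)
    for ta, tb in zip(a, b):
        ta, tb = s.get(ta, ta), s.get(tb, tb)
        if ta == tb:
            continue
        if is_var(tb):          # bind rule variables to target terms
            s[tb] = ta
        elif is_var(ta):
            s[ta] = tb
        else:
            return None
    return s

def substitute(atom, s):
    return tuple(s.get(t, t) for t in atom)

def regress(target, rule_head, rule_body):
    """Replace `target` by the rule body under the unifying substitution."""
    s = unify(target, rule_head, {})
    if s is None:
        return None
    return [substitute(b, s) for b in rule_body]

target = ("SafeToStack", "?x", "?y")
head = ("SafeToStack", "?u", "?v")   # rule 4's head, variables renamed
body = [("Lighter", "?u", "?v")]
print(regress(target, head, body))   # [('Lighter', '?x', '?y')]
```

Repeating this step over each rule in the explanation, and conjoining the still-unproved literals, yields the frontier sequence shown above.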
2.1.4 Inductive Bias of Prolog-EBG
- Since all the candidate hypotheses are generated from B it
follows that the inductive bias of Prolog-EBG is simply B,
right?
- Almost. We also have to consider how it chooses from
among the alternative clauses.
- Since it uses sequential covering by growing the Horn
clauses we can say that it prefers small sets of Horn
clauses.
- So, the inductive bias is B plus a preference for small
sets of maximally general Horn clauses.
- The inductive bias is largely determined by the
input domain theory, not the algorithm.
3 Thinking about EBL
- EBL as a theory-guided (rational) generalization of
examples.
- EBL as example-guided reformulation of theories. That is,
reformulating the domain theory into a more usable form.
- EBL as simply restating what the learner already
knows. Remember that what one knows in principle is very
different from what one knows in practice (f=ma).
3.1 Knowledge Level Learning
- In Prolog-EBG the $h$ follows (logically) directly from B
alone, independent of D. So, why do we need examples?
- Examples focus Prolog-EBG on generating rules that cover
the distribution of instances that occur.
- So, will it ever learn to classify an instance that
could not be classified by B?
- No. Since $B \entails h$ then any classification
entailed by $h$ is also entailed by $B$.
- OK, so is this a problem with all analytical
learning methods?
- No. For example, let B contain a statement like
GrandDaughter(sister(x),spouse(y)) ← GrandDaughter(x,y)
This rule entails nothing by itself, but once we are given
one GrandDaughter example it lets us infer new
GrandDaughter() assertions, so the learner goes beyond B.
- Another example is provided by assertions known as
determinations. If we are trying to identify
"people who speak Portuguese", a determinant might be
The language spoken by a person is
determined by their nationality.
4 EBL of Search Control Knowledge
- Search is an endemic problem in AI (in other words, AI
is search). But, often the spaces we search are
huge.
- Search problem:
S: set of possible search states
O: set of operators that transform one state into another
G: a predicate over S. G(s) is true if s is a goal state.
- The largest-scale attempts to apply EBL have addressed the
problem of learning to control search, AKA "speedup"
learning.
- Specifically, for each operator o learn "the set of states
for which o leads toward a goal state".
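That target concept can be pictured as a learned filter on operator selection. A toy sketch follows; the states, operators, and control rules are invented for illustration and are not any real planner's representation:

```python
# Sketch of speedup learning as operator filtering: a learned control
# rule approximates "the set of states for which this operator leads
# toward a goal state", pruning the search. All data here is a toy.

# State: an integer; goal: reach 0.
operators = {
    "dec": lambda s: s - 1,
    "inc": lambda s: s + 1,
}

# Learned control rules: each says when its operator moves toward 0.
control = {
    "dec": lambda s: s > 0,
    "inc": lambda s: s < 0,
}

def solve(state, goal=0, max_steps=100):
    """Greedy search that only applies operators whose learned
    control rule says they lead toward the goal."""
    path = [state]
    for _ in range(max_steps):
        if state == goal:
            return path
        for name, op in operators.items():
            if control[name](state):
                state = op(state)
                path.append(state)
                break
        else:
            return None  # no operator's control rule fires
    return None

print(solve(3))  # [3, 2, 1, 0]
```

With perfect control rules the search degenerates into straight-line execution; with none, the searcher would have to try both operators at every state.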
4.1 PRODIGY
- The PRODIGY [3] system uses EBL to
improve search. It does AI planning.
- It uses a means-ends planner that decomposes problems into
subgoals, solves them, then combines their solutions into a
solution for the full problem.
- At each step it faces the question "which operator should
I consider for solving this subgoal?"
- EBL in Prodigy chooses appropriate target concepts, such as
"the set of states in which subgoal A should be solved before
subgoal B". A sample rule learned is
SolveBefore(On(y,z), On(x,y)) ← NeedToSolve(On(x,y)) ∧ NeedToSolve(On(y,z))
- Prodigy learns this rule by first doing it wrong,
encountering the conflict, and explaining to itself why it had
to unstack the block it had just stacked.
4.2 SOAR
- The Soar [4] system implements even more problem-solving
strategies.
"Soar is a general cognitive architecture for
developing systems that exhibit intelligent
behavior. Researchers all over the world, both from the
fields of artificial intelligence and cognitive science, are
using Soar for a variety of tasks. It has been in use since
1983, evolving through many different versions to where it
is now Soar, Version 8.2."
- Soar is (more than) a rule-based system.
- It also learns by explaining situations in which its
current search strategy leads to inefficiencies.
- Soar uses a variant of EBL called chunking to
extract the general conditions under which an explanation
applies.
4.3 The Allure of Numbers
- Prodigy and Soar have used EBL methods to great success,
and yet most search programs still use numerical evaluation
functions to guide search (like we did). Why?
- Sometimes the number of control rules that must be
learned is large. The more, the slower.
- Often it is intractable to construct an explanation for
the desired target concept. e.g., "states for which operator
A leads toward the optimal solution" in chess!
- I think it's also because EBL seems a lot harder to
implement. (It's not, since Prolog, Soar, etc. already do it
for you.)
URLs
- Machine Learning book at Amazon, http://www.amazon.com/exec/obidos/ASIN/0070428077/multiagentcom/
- Deep Blue homepage, http://www.research.ibm.com/deepblue/
- Prodigy Homepage, http://www.cs.cmu.edu/~prodigy/
- Soar Homepage, http://ai.eecs.umich.edu/soar/
This talk available at http://jmvidal.cse.sc.edu/talks/analyticallearning/
Copyright © 2009 José M. Vidal. All rights reserved.
10 April 2003, 12:16PM