Combining Inductive and Analytical Learning

This talk is based on

Tom M. Mitchell. Machine Learning. [1] McGraw Hill. 1997. Chapter 12.
and his slides [2].

	Inductive Learning	Analytical Learning
Goal	Hypothesis fits data	Hypothesis fits domain theory
Justification	Statistical inference	Deductive Inference
Advantages	Requires little prior knowledge	Learns from scarce data
Pitfalls	Scarce data, incorrect bias	Imperfect domain theory

They seem very complementary.
We want something in between!
Example: If your had to review a medical database and learn "symptoms for which drug X is more effective than Y" you would look at relevant attributes (temperature, not insurance), then refine using data.
Some domain theory (background knowledge) does squeeze into inductive methods, e.g., when choosing an encoding.

We want a learning method such that:

Given no domain theory it should be as good as purely inductive methods.
Given a perfect domain theory it should be as good as analytical methods.
Given imperfect domain theory and imperfect data it should combine the two and do batter than both inductive and analytical.
Accommodate an unknown level of error in training data.
Accommodate an unknown level of error in domain theory.

The learning problem:

D is the set of training examples, maybe with errors.
B is the domain theory, maybe with errors.
H is the space of candidate hypotheses.

Determine

A hypothesis that best fits the training examples and domain theory.

So, what does "best fits" mean?

User the old $error_D(d)$ definition—the proportion of examples from $D$ misclassified by $h$.
But, what about B?
Define $error_B(h)$ to be the probability that $h$ will disagree with B on the classification of a randomly drawn instance.Then find \[ h = \argmin_{h \in H} k_D error_D(h) + k_B error_B(h) \]
But, what do we set $k_D$ and $k_B$ to?
Which one is more reliable, the data or the theory?

As always, we view the problem as that of search over H where

H is the hypothesis space.
O is the set of operators (search steps).
G is the search objective.

We see that we have several possible approaches:

User prior knowledge to derive an initial hypothesis from which to begin the search.
Use prior knowledge to alter the objective of the hypothesis space search.
User prior knowledge to alter the available search steps.

The Knowledge-Based Artificial Neural Network (KBANN [3]) algorithm uses prior knowledge to derive hypothesis from which to begin search.
It first constructs a ANN that classifies every instance as the domain theory would.
So, if B is correct then we are done!
Otherwise, we use Backpropagation to train the network.

KBANN(domainTheory, trainingExamples)
domainTheory: set of propositional non-recursive Horn clauses

for each instance attribute create a network input.
for each Horn clause in domainTheory, create a network unit
1. Connect inputs to attributes tested by antecedents.
2. Each non-negated antecedent gets a weight W.
3. Each negated antecedent gets a weight -W
4. Threshold weight is -(n - .5), where n is the number of non-negated antecedents.
Make all other connections between layers, giving these very low weights.
Apply Backpropagation using trainingExamples

Domain theory
Cup ← Stable, Liftable, OpenVessel
Stable ← BottomIsFlat
Liftable ← Graspable, Light
Graspable ← HasHandle
OpenVessel ← HasConcavity, ConcavityPointsUp

Training Examples

	Cup				Non-Cups
BottomIsFlat	X	X	X	X	X	X	X			X
ConcavityPointsUp	X	X	X	X	X		X	X
Expensive	X		X				X		X
Fragile	X	X			X	X		X		X
HandleOnTop					X		X
HandleOnSide	X			X					X
HasConcavity	X	X	X	X	X		X	X	X	X
HasHandle	X			X	X		X		X
Light	X	X	X	X	X	X	X		X
MadeOfCeramic	X				X		X	X
MadeOfPaper				X					X
MadeOfStyrofoam		X	X			X				X

In classifying promoter regions in DNA: Backpropagation got 8/106 error rate, KBANN got 4/106.
It does typically generalize more accurately than backpropagation.

We can view it as search in H.
KBANN starts at a better stop.
As such, it likely to converge to a hypothesis that generalizes beyond the data in a way that is similar to domain theory predictions.
On the negative side, KBANN can only deal with propositional domain theories.

The TangentProp [4] algorithm incorporates the prior knowledge into the error criterion minimized by gradient descent.
Specifically, the prior knowledge is in the form of known derivatives of the target function.

$X$ are images of single handwritten characters.
Task is to correctly classify these characters.
B is "the target function is invariant to small rotations of the character in the image". How do we express this mathematically?
Define a transformation $s(\alpha, x)$ which rotates $x$ by $\alpha$ degrees. Then say \[ \frac{\partial f(s(\alpha,x_i))}{\partial \alpha} = 0 \] Then incorporate these into the error function. How?
Add an additional term to the error function \[ E = \sum_i \left[ (f(x_{i}) - \hat{f}(x_{i}))^{2} + \mu_{i} \sum_{j} \left(\frac{\partial f(s_j(\alpha, x_i))}{\partial \alpha} - \frac{\partial \hat{f}(s_j(\alpha, x_i))}{\partial{\alpha}} \right)_{(\alpha=0)}^{2} \right] \] then modify the gradient descent rule to use this. How? Not shown.

The value of $\mu$ must be chosen carefully by the designer.
TangentProp is not robust to errors in the prior knowledge (it throws off Backpropagation).
TangentProp's is searching for different (maybe) hypothesis from simple Backpropagation. It searches on a different path.

The Explanation-Based Neural Network (EBNN [5]) algorithm extends TangentProp.
It computes the derivatives itself.
The value of $\mu$ is chosen independently for each example.
It represents the domain theory with a collection of neural networks.
Then, learns the target function as another network.

There is one network for each of the Horn clauses in the domain theory.
EBNN uses the top network to calculate the partial derivative of the prediction with respect to each feature of the instance. (i.e., how much does the output change as I tweak BottomIsFlat?).
These derivatives are given to the bottom network which is trained with a variation of TangentProp.

EBNN has been shown to generalize more accurately than backpropagation, especially when training data is scarce.
It has been used to learn to control a simulated mobile robot.
EBNN, like Prolog-EBG, constructs explanations, but they are based on a domain theory consisting of neural networks rather than Horn clauses.
EBNN accommodates imperfect domain theories.
EBNN learns a fixed size network, so it might be unable to represent complex functions.

The FOCL [6] system is an extension of the purely inductive FOIL.
FOIL generates each candidate specialization by adding a new literal to the preconditions.
FOCL generates additional specializations based on the domain theory. How?
We say that a literal is operational if it is allowed to be used in describing an output hypothesis.
At each point in the search FOCL adds candidates by
1. Considering each operational literal for the precondition.
2. Creating an operational and logically sufficient condition for the target concept according to the domain theory. Prune any unnecessary preconditions.

Note that in order to get the rule we had to "unfold" the domain theory to get to operational literals.

FOCL theory-suggested specializations correspond to large steps in FOIL.
The bias is a preference for Horn clauses most similar to operations, sufficient conditions entailed by the domain theory.
Test results show that FOCL achieves a lower error than FOIL.

Combining Inductive and Analytical Learning

1 Motivation

1.1 What We Want

2 The Learning Problem

2.1 Best Fits?

2.2 Hypothesis Space Search

3 KBANN

3.1 KBANN Algorithm

3.2 KBANN Example

3.3 KBANN Example Network

3.4 After Training

3.5 KBANN Results

3.6 Hypothesis Space Search

4 TangentProp

4.1 TangentProp Example

4.2 TangentProp Search

5 EBNN

5.1 EBNN Example

5.2 EBNN Summary

6 FOCL

6.1 FOCL Example

6.2 FOCL Search

URLs