Learning Sets of Rules

This talk is based on Tom Mitchell's Machine Learning book (see the URLs at the end).

1 Learning Rules

2 Rules

3 Sequential Covering

;Examples is the set of examples;
; e.g.
; 1- Wind(d1)=Strong, Humidity(d1)=Low, Outlook(d1)=Sunny, PlayTennis(d1)=No
; 2- Wind(d2)=Weak, Humidity(d2)=Med, Outlook(d2)=Sunny, PlayTennis(d2)=Yes
; 3- Wind(d3)=Med, Humidity(d3)=Med, Outlook(d3)=Rain, PlayTennis(d3)=No
;Target_attribute is the attribute we wish to learn.
; e.g. PlayTennis(x)
;Attributes is the set of all possible attributes.
; e.g. Wind(x), Humidity(x), Outlook(x)
; Threshold is the desired performance.
;
Sequential_covering (Target_attribute, Attributes, Examples, Threshold) :
  Learned_rules = {}
  Rule = Learn-One-Rule(Target_attribute, Attributes, Examples)

  while Performance(Rule, Examples) > Threshold :
    Learned_rules = Learned_rules + Rule
    Examples = Examples - {examples correctly classified by Rule}
    Rule = Learn-One-Rule(Target_attribute, Attributes, Examples)

  Learned_rules = sort Learned_rules according to performance over Examples
  return Learned_rules
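
A minimal Python sketch of the same loop is below. It is an illustration, not part of the original slides: it assumes each example is a dict mapping attribute names to values, a rule is a pair (preconditions, prediction) where preconditions is a dict, and learn_one_rule and performance are supplied (for instance, as in the Learn-One-Rule sketch later in this talk).

def covers(rule, example):
    """True if every precondition of the rule matches the example."""
    preconditions, _ = rule
    return all(example.get(a) == v for a, v in preconditions.items())

def classifies_correctly(rule, example, target):
    _, prediction = rule
    return covers(rule, example) and example[target] == prediction

def sequential_covering(target, attributes, examples, threshold,
                        learn_one_rule, performance):
    learned_rules = []
    rule = learn_one_rule(target, attributes, examples)
    while examples and performance(rule, examples, target) > threshold:
        learned_rules.append(rule)
        # remove the examples this rule already classifies correctly,
        # then learn the next rule on what remains
        examples = [e for e in examples
                    if not classifies_correctly(rule, e, target)]
        if not examples:
            break
        rule = learn_one_rule(target, attributes, examples)
    learned_rules.sort(key=lambda r: performance(r, examples, target),
                       reverse=True)
    return learned_rules

Because each pass only removes the examples already covered and never revisits earlier rules, the procedure is a greedy, non-backtracking covering of the training data.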

3.1 Learn-One-Rule

Learn one Rule

3.1.1 Learn-One-Rule Algorithm

Learn-One-Rule(target_attribute, attributes, examples, k)
;Returns a single rule that covers some of the Examples
  best-hypothesis = the most general hypothesis
  candidate-hypotheses = {best-hypothesis}
  while candidate-hypotheses is not empty :    ;Generate the next more specific candidate-hypotheses
    all-constraints = all "attribute=value" constraints
    new-candidate-hypotheses = all specializations of candidate-hypotheses by adding all-constraints
    remove from new-candidate-hypotheses any that are duplicates, inconsistent, or not maximally specific
    ;Update best-hypothesis
    best-hypothesis = $\argmax_{h \in \text{new-candidate-hypotheses}}$ Performance(h,examples,target_attribute)
    ;Update candidate-hypotheses
    candidate-hypotheses = the k best from new-candidate-hypotheses according to Performance.
  prediction = most frequent value of target_attribute from examples that match best-hypothesis
  return IF best-hypothesis THEN prediction


Performance(h, examples, target_attribute)
  h-examples = the set of examples that match h
  return - Entropy(h-examples) wrt target_attribute
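
A Python sketch of this beam search is below (same assumptions as the sequential-covering sketch above; the names are illustrative, and the "not maximally specific" pruning is omitted for brevity).

import math
from collections import Counter

def neg_entropy(covered, target):
    """Performance(h): negative entropy of the target attribute over the
    examples that h matches (empty coverage is treated as worst possible)."""
    if not covered:
        return float("-inf")
    counts = Counter(e[target] for e in covered)
    n = len(covered)
    return sum((c / n) * math.log2(c / n) for c in counts.values())

def matches(preconditions, example):
    return all(example.get(a) == v for a, v in preconditions.items())

def learn_one_rule(target, attributes, examples, k=5):
    """Greedy beam search from the most general hypothesis (no preconditions)."""
    def perf(h):
        return neg_entropy([e for e in examples if matches(h, e)], target)

    best = {}                              # most general hypothesis: matches everything
    candidates = [best]
    all_constraints = [(a, v) for a in attributes
                       for v in {e[a] for e in examples}]
    while candidates:
        # specialize every candidate by every attribute=value constraint
        new_candidates = []
        for h in candidates:
            for a, v in all_constraints:
                if a not in h:                   # skip inconsistent specializations
                    s = dict(h, **{a: v})
                    if s not in new_candidates:  # skip duplicates
                        new_candidates.append(s)
        if not new_candidates:
            break
        best = max(new_candidates + [best], key=perf)  # update best-hypothesis
        new_candidates.sort(key=perf, reverse=True)
        candidates = new_candidates[:k]                # keep the k best
    covered = [e for e in examples if matches(best, e)]
    prediction = Counter(e[target] for e in covered).most_common(1)[0][0]
    return (best, prediction)

Keeping the k best candidates rather than only the single best makes this a beam search, which is less prone to the mistakes of pure greedy hill climbing.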

3.1.2 Learn-One-Rule Example

Day Outlook Temp Humid Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Low Weak Yes
D6 Rain Cool Low Strong No
D7 Overcast Cool Low Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Low Weak Yes
D10 Rain Mild Low Weak Yes
D11 Sunny Mild Low Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Low Weak Yes
D14 Rain Mild High Strong No
  1. best-hypothesis = IF T THEN PlayTennis(x) = Yes
  2. candidate-hypotheses = {best-hypothesis}
  3. all-constraints = {Outlook(x)=Sunny, Outlook(x)=Overcast, Temp(x)=Hot, ......}
  4. new-candidate-hypotheses = {IF Outlook=Sunny THEN PlayTennis=YES, IF Outlook=Overcast THEN PlayTennis=YES, ...}
  5. best-hypothesis = IF Outlook=Overcast THEN PlayTennis=YES (it covers only Yes examples, so it maximizes -Entropy; see the computation below)
  6. candidate-hypotheses = the k best specializations, e.g. {IF Outlook=Overcast THEN PlayTennis=YES, IF Humid=Low THEN PlayTennis=YES, ...}
  7. .....
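
Step 5 follows directly from the performance measure. Over the 14 examples, Outlook=Sunny covers D1, D2, D8 (No) and D9, D11 (Yes); Outlook=Rain covers D4, D5, D10 (Yes) and D6, D14 (No); Outlook=Overcast covers D3, D7, D12, D13, all Yes. So \[-\text{Entropy}(\text{Sunny}) = \frac{2}{5}\log_2\frac{2}{5} + \frac{3}{5}\log_2\frac{3}{5} \approx -0.971 = -\text{Entropy}(\text{Rain}),\] while \[-\text{Entropy}(\text{Overcast}) = 1\cdot\log_2 1 = 0,\] the best possible value, so IF Outlook=Overcast THEN PlayTennis=YES is chosen as best-hypothesis. The next best single-constraint rule is IF Humid=Low THEN PlayTennis=YES, with $-\text{Entropy} \approx -0.592$.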

3.1.3 Learn One Summary

3.2 CN2

3.3 Other Variations

3.4 Other Performance Measures

  1. Relative frequency: Let $n$ be the number of examples the rule matches and $n_c$ be the number that it classifies correctly. \[\frac{n_c}{n}\]
  2. m-estimate of accuracy: Let $p$ be the prior probability that a random example will be classified correctly and let $m$ be the weight given to this prior. \[\frac{n_c + mp}{n+m}\]
  3. Entropy: Let $S$ be the set of examples that match the rule precondition, $c$ be the number of distinct values the target function may take on, and $p_i$ the proportion of examples for which the target function takes the $i$th value; the performance is the negative entropy (a sketch computing all three measures follows this list). \[-\text{Entropy}(S) = \sum_{i=1}^c p_i\log_2 p_i\]
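
The sketch below computes the three measures in Python; the names are illustrative: covered is the list of examples the rule matches, prediction is the rule's predicted target value, and m, p are the m-estimate parameters.

import math
from collections import Counter

def relative_frequency(covered, target, prediction):
    n = len(covered)
    n_c = sum(1 for e in covered if e[target] == prediction)
    return n_c / n

def m_estimate(covered, target, prediction, m, p):
    n = len(covered)
    n_c = sum(1 for e in covered if e[target] == prediction)
    return (n_c + m * p) / (n + m)

def neg_entropy(covered, target):
    # same measure used by Learn-One-Rule above
    n = len(covered)
    counts = Counter(e[target] for e in covered)
    return sum((c / n) * math.log2(c / n) for c in counts.values())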

4 Learning Rule Sets Summary

5 Learning Rule Sets Summary

6 Learning First-Order Rules

6.1 First-Order Logic Definitions

6.2 Learning First-Order Horn Clauses

6.3 FOIL

FOIL(target-predicate, predicates, examples)
  pos = those examples for which target-predicate is true
  neg = those examples for which target-predicate is false
  learnedRules = {}
  while pos do
    ;learn a new rule
    newRule = the rule that predicts target-predicate with no preconditions
    newRuleNeg = neg
    while newRuleNeg do
      ;add a new literal to specialize newRule
      candidateLiterals = candidate new literals for newRule based on predicates
      bestLiteral = $\argmax_{l \in \text{candidateLiterals}}$ Foil-gain(l, newRule)
      add bestLiteral to preconditions of newRule
      newRuleNeg = subset of newRuleNeg that satisfies the preconditions of newRule
    learnedRules = learnedRules + newRule
    pos = pos - {members of pos covered by newRule}
  return learnedRules
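
A structural Python sketch of these two nested loops is given below. The first-order details (generating candidate literals and testing whether an example satisfies a rule) are abstracted into supplied functions candidate_literals, foil_gain, and satisfies; those names are illustrative and not part of the original algorithm description. Examples are assumed to be dicts in which e[target_predicate] is a boolean.

def foil(target_predicate, predicates, examples,
         candidate_literals, foil_gain, satisfies):
    pos = [e for e in examples if e[target_predicate]]
    neg = [e for e in examples if not e[target_predicate]]
    learned_rules = []
    while pos:
        # learn a new rule: start with no preconditions, then specialize
        new_rule = (target_predicate, [])        # (head, list of body literals)
        new_rule_neg = list(neg)
        while new_rule_neg:
            literals = candidate_literals(new_rule, predicates)
            if not literals:                     # nothing left to add
                break
            best = max(literals,
                       key=lambda l: foil_gain(l, new_rule, pos, new_rule_neg))
            new_rule = (target_predicate, new_rule[1] + [best])
            new_rule_neg = [e for e in new_rule_neg if satisfies(new_rule, e)]
        newly_covered = [e for e in pos if satisfies(new_rule, e)]
        if not newly_covered:                    # guard: rule covers no new positives
            break
        learned_rules.append(new_rule)
        pos = [e for e in pos if e not in newly_covered]
    return learned_rules

Note that the inner loop is a pure hill climb (no beam), guided by the Foil-gain measure of section 6.7 below.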

6.4 Generating Candidate Specializations

6.5 FOIL Example

6.6 Guiding Search in FOIL

6.7 Foil-Gain
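
For reference, the gain measure used by Quinlan's FOIL (and called as Foil-gain in the pseudocode above) is the following, where $p_0$ and $n_0$ are the numbers of positive and negative bindings of rule $R$, $p_1$ and $n_1$ are the numbers for the specialized rule obtained by adding literal $L$, and $t$ is the number of positive bindings of $R$ still covered after adding $L$: \[\text{Foil\_Gain}(L, R) = t\left(\log_2\frac{p_1}{p_1 + n_1} - \log_2\frac{p_0}{p_0 + n_0}\right)\]

A direct Python transcription of this formula (the function and parameter names are illustrative):

import math

def foil_gain(p0, n0, p1, n1, t):
    """Foil_Gain(L, R) from the counts of positive/negative bindings before
    (p0, n0) and after (p1, n1) adding literal L; t is the number of positive
    bindings of R still covered afterwards."""
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))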

6.8 Learning Recursive Rule Sets

6.9 FOIL Summary

7 Induction As Inverted Deduction

7.1 Inverted Example

7.2 Induction and Deduction

Induction is, in fact, the inverse operation of deduction, and cannot be conceived to exist without the corresponding operation, so that the question of relative importance cannot arise. Who thinks of asking whether addition or subtraction is the more important process in arithmetic? But at the same time much difference in difficulty may exist between a direct and inverse operation; ... it must be allowed that inductive investigations are of a far higher degree of difficulty and complexity than any questions of deduction...
(Jevons 1874)

7.3 Inverse Entailment

7.4 Inverse Entailment Pros and Cons

Pros: Cons:

7.5 Resolution Rule

7.6 Inverting Resolution

7.7 Learning With Inverted Resolution

  1. Select a training example that is not yet covered by learned clauses.
  2. Use inverse resolution rule to generate candidate hypothesis h that satisfies B ∧ h ∧ x → f(x), where B = background knowledge plus any learned clauses (a small example follows this list).
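
For example (an illustration constructed here, not taken from the original slides): suppose a training example has description x containing Father(Tom, Bob) and classification f(x) = Child(Bob, Tom). Resolving the candidate clause h = Child(u, v) ∨ ¬Father(v, u), i.e. IF Father(v, u) THEN Child(u, v), against Father(Tom, Bob) under the substitution {u/Bob, v/Tom} yields exactly Child(Bob, Tom), so this h satisfies B ∧ h ∧ x → f(x). The inverse resolution operator constructs such an h by running that resolution step backwards, starting from the example and the known fact.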

7.8 First-Order Resolution

7.9 Inverting First-Order Resolution

7.10 Inverted First-Order Example

7.11 Inverse Resolution Summary

8 Generalization, θ-Subsumption, and Entailment

9 PROGOL

URLs

  1. Machine Learning book at Amazon, http://www.amazon.com/exec/obidos/ASIN/0070428077/multiagentcom/
  2. Slides by Tom Mitchell on Machine Learning, http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html
  3. wikipedia:First-order_logic, http://www.wikipedia.org/wiki/First-order_logic
  4. wikipedia:Prolog, http://www.wikipedia.org/wiki/Prolog
  5. CN2 Homepage, http://www.cs.utexas.edu/users/pclark/software.html#cn2
  6. wikipedia:Soundness, http://www.wikipedia.org/wiki/Soundness
  7. wikipedia:Completeness, http://www.wikipedia.org/wiki/Completeness

This talk available at http://jmvidal.cse.sc.edu/talks/learningrules/
Copyright © 2009 José M. Vidal . All rights reserved.

08 April 2003, 01:53PM