This talk is based on

- Tom M. Mitchell. *Machine Learning* [1]. McGraw-Hill, 1997. Chapter 8.
- and his slides [2].

- So far, we have considered algorithms that search for a hypothesis that matches all the training data.
- Instance-based algorithms simply store the presented data.
- When a new query arrives they try to match it with similar instances and use them to classify.
- For example, given 10 fossils, each from a different species, I might classify a new fossil by finding the stored one it most resembles.

- Assume all instances correspond to points in $n$-dimensional space.
- Store all training examples $\langle x_i, f(x_i) \rangle$.
- Given that an instance $x$ is described by its feature vector \[ \langle a_1(x), a_2(x), \ldots, a_n(x)\rangle \] we define the distance as \[ d(x_i,x_j) \equiv \sqrt{\sum_{r=1}^{n}(a_r(x_i)-a_r(x_j))^2} \]
**Nearest Neighbor:** To classify a new instance $x_q$, first locate the nearest training example $x_n$, then estimate $\hat{f}(x_q) \leftarrow f(x_n)$.

**$k$-Nearest Neighbor:** If $f$ is discrete, take a vote among the $k$ nearest neighbors. If $f$ is continuous, take their mean \[\hat{f}(x_{q}) \leftarrow \frac{\sum_{i=1}^{k}f(x_{i})}{k}\]
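As a concrete sketch, here is a minimal Python implementation of both variants (the names `distance` and `knn_predict` are mine, not from the chapter):

```python
import math
from collections import Counter

def distance(xi, xj):
    # Euclidean distance over the feature vectors a_1(x)..a_n(x)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def knn_predict(train, xq, k, discrete=True):
    """train: list of (feature_vector, f_value) pairs."""
    neighbors = sorted(train, key=lambda ex: distance(ex[0], xq))[:k]
    values = [f for _, f in neighbors]
    if discrete:
        return Counter(values).most_common(1)[0][0]   # majority vote
    return sum(values) / k                            # mean for continuous f

train = [((0, 0), 'neg'), ((0, 1), 'neg'),
         ((5, 5), 'pos'), ((6, 5), 'pos'), ((5, 6), 'pos')]
print(knn_predict(train, (4, 4), k=3))  # -> 'pos'
```

With `k=1` this is plain nearest neighbor; sorting the whole training set makes each query $O(n \log n)$, which is why query time, not training time, dominates.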

**When to use it:**
- When the instances are points in Euclidean space.
- When there are few attributes (why?)
- When there is a lot of training data
**Advantages**

- Training is very fast
- Learn complex target functions
- Don't lose information

**Disadvantages**

- Slow at query time
- Easily fooled by irrelevant attributes

- The hypothesis space implicitly considered by the 1-Nearest neighbor algorithm is given by a Voronoi diagram.
- What does the picture look like for the k-Nearest?

- Let $p(x)$ be the probability that instance $x$ will be labeled 1 (positive) rather than 0 (negative).
**Nearest neighbor:** As the number of training examples goes to $\infty$, it approaches the Gibbs algorithm (i.e., with probability $p(x)$ predict 1, else 0).

**$k$-Nearest neighbor:** As the number of training examples goes to $\infty$ and $k$ gets large, it approaches the Bayes optimal classifier (i.e., if $p(x)>.5$ then predict 1, else 0).

- We might want to weight the nearer neighbors more heavily: \[ \hat{f}(x_{q}) \leftarrow \frac{\sum_{i=1}^{k} w_{i} f(x_{i})}{\sum_{i=1}^{k} w_{i}} \] where \[ w_{i} \equiv \frac{1}{d(x_{q}, x_{i})^{2}} \] and $d(x_{q}, x_{i})$ is distance between $x_{q}$ and $x_{i}$.
- With this weighting it makes sense to use all training examples instead of just the nearest $k$.
- If all training examples are used we call the algorithm a global method (also known as Shepard's method). If only the nearest $k$ are used then it is a local method.
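A minimal Python sketch of the distance-weighted version (names are mine; `k=None` selects the global method, an integer `k` the local one):

```python
import math

def distance(xi, xj):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def weighted_knn(train, xq, k=None):
    # Weight each neighbor by w_i = 1 / d(x_q, x_i)^2.
    examples = sorted(train, key=lambda ex: distance(ex[0], xq))
    if k is not None:
        examples = examples[:k]       # local method; k=None uses all examples
    num = den = 0.0
    for x, f in examples:
        d = distance(x, xq)
        if d == 0:
            return f                  # exact match: return its value directly
        w = 1.0 / d ** 2
        num += w * f
        den += w
    return num / den
```

The `d == 0` check is needed because the weight $1/d^2$ is undefined when the query coincides with a training point.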

- The inductive bias of the kNN corresponds to the assumption that the classification of an instance will be most similar to other instances that are nearby in Euclidean distance.
- This means that *all attributes are used*.
- A problem: 20 attributes but only 2 are relevant to the target function.
- This is the curse of dimensionality
- One approach is to weigh each attribute differently.
- Stretch $j$th axis by weight $z_j$, where $z_1, \ldots, z_n$ chosen to minimize prediction error.
- Use cross-validation to automatically choose weights $z_1, \ldots, z_n$.
- Note setting $z_j$ to zero eliminates this dimension altogether (another approach).
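A toy sketch of choosing the weights by leave-one-out cross-validation, assuming a 1-NN classifier and a tiny grid of candidate weights $\{0, 1\}$ per axis (a real implementation would optimize over a finer grid or by gradient methods; all names here are mine):

```python
import math
from itertools import product

def weighted_distance(z, xi, xj):
    # Stretch the j-th axis by z_j before measuring Euclidean distance
    return math.sqrt(sum((zj * (a - b)) ** 2 for zj, a, b in zip(z, xi, xj)))

def loo_error(train, z):
    # Leave-one-out 1-NN error rate under axis weights z
    errors = 0
    for i, (xi, fi) in enumerate(train):
        rest = train[:i] + train[i + 1:]
        xn, fn = min(rest, key=lambda ex: weighted_distance(z, ex[0], xi))
        errors += fn != fi
    return errors / len(train)

def choose_weights(train, candidates=(0.0, 1.0)):
    # Grid search; z_j = 0 eliminates attribute j altogether
    n = len(train[0][0])
    return min(product(candidates, repeat=n), key=lambda z: loo_error(train, z))
```

On data where only the first attribute predicts the class, the search drives the second weight to zero, illustrating how weighting subsumes attribute elimination.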

- Before we go on, we need some definitions:
- Regression means approximating a real-valued target function.
- Residual is the error $\hat{f}(x) - f(x)$ in approximating the target function.
- Kernel function is the function of distance that is used to determine the weight of each training example. That is, $K$ where \[ w_i = K(d(x_i, x_q))\]

- The kNN approximates the target function for a single query point.
- Locally Weighted Regression generalizes this approach: it constructs an explicit approximation $\hat{f}(x)$ over a local region surrounding $x_q$.
- Assume that \[ \hat{f}(x) \equiv w_0 + w_1 a_1(x) + \cdots + w_n a_n(x) \]

- We can minimize the squared error over the $k$ nearest neighbors \[E_{1}(x_q) \equiv \frac{1}{2} \sum_{x \in \text{$k$ nearest nbrs of } x_q} (f(x) - \hat{f}(x))^2 \]
- We can minimize the distance-weighted squared error over all training examples \[E_{2}(x_q) \equiv \frac{1}{2} \sum_{x \in D} (f(x) - \hat{f}(x))^2 K(d(x_{q}, x)) \]
- Or, we can combine these two \[E_{3}(x_q) \equiv \frac{1}{2} \sum_{x \in \text{$k$ nearest nbrs of } x_q} (f(x) - \hat{f}(x))^2 K(d(x_{q}, x)) \]

- Eq. 2 is more aesthetically pleasing, but Eq. 3 is a good approximation of it with costs that depend only on $k$.

- If we choose Eq. 3, we have to re-derive a gradient descent rule using the same arguments as for neural nets.
- As such, we can adjust the weights of $\hat{f}(x)$ with \[ \Delta w_j \equiv \eta \sum_{x\in \text{$k$ nearest nbrs of } x_q} K(d(x_q,x))\,(f(x)-\hat{f}(x))\,a_j(x) \]
- This equation will do gradient descent on the weights of $\hat{f}(x)$ so as to minimize its error against $f(x)$.
- There are many other much more efficient methods to fit linear functions to a set of fixed training examples (see book for references).
- Locally Weighted Regression often uses linear or quadratic functions because more complex functions are too hard to fit and only provide marginal benefits, at best.
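As an illustration, here is a minimal Python sketch of locally weighted linear regression in a single attribute, minimizing the combined criterion (Eq. 3) in closed form via the weighted normal equations rather than by gradient descent; the names are mine:

```python
import math

def lwr_predict(train, xq, k=3, sigma=1.0):
    """Fit f_hat(x) = w0 + w1*x to the k nearest neighbors of xq,
    weighting each squared residual by K(d) = exp(-d^2 / (2 sigma^2)).
    train: list of (x, f(x)) pairs with scalar x."""
    nbrs = sorted(train, key=lambda ex: abs(ex[0] - xq))[:k]
    K = [math.exp(-((x - xq) ** 2) / (2 * sigma ** 2)) for x, _ in nbrs]
    # Solve the 2x2 weighted normal equations for the line w0 + w1*x
    s = sum(K)
    sx = sum(ki * x for ki, (x, _) in zip(K, nbrs))
    sy = sum(ki * y for ki, (_, y) in zip(K, nbrs))
    sxx = sum(ki * x * x for ki, (x, _) in zip(K, nbrs))
    sxy = sum(ki * x * y for ki, (x, y) in zip(K, nbrs))
    det = s * sxx - sx * sx
    w1 = (s * sxy - sx * sy) / det
    w0 = (sy - w1 * sx) / s
    return w0 + w1 * xq
```

Note that the fitted $w_0, w_1$ are discarded after each query: a fresh local model is built for every $x_q$.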

- Instead of using a linear function, we try to learn a function of the form \[ \hat{f}(x) = w_0 + \sum_{u=1}^{k} w_u K_u(d(x_u,x)) \] where each $x_u$ is an instance from $X$ and the kernel function $K_u(d(x_u,x))$ is defined so that it decreases as the distance $d(x_u,x)$ increases.
- $k$ is a user-defined constant that specifies the number of kernel functions to be included.
- Note that, even though $\hat{f}(x)$ is a global approximation to $f(x)$, the contribution from each of the $K_u$ terms is localized to a region near $x_u$.
- It is common to choose $K_u$ to be a Gaussian centered around $x_u$ \[ K_u(d(x_u,x)) = e^{-\frac{d^2(x_{u},x)}{2\sigma_u^2}} \]
- It has been shown that this $\hat{f}(x)$ can approximate any function with arbitrarily small error, given a sufficiently large $k$ and given that each $\sigma_u^2$ can be separately specified.

- We can view the new $\hat{f}(x)$ as a two-layer network where the first layer computes the various $K_u$ and the second computes a linear combination of them.
- The network can be trained in two stages:

- Stage 1: the number $k$ of hidden units is determined, and each unit $u$ is defined by choosing the values of $x_u$ and $\sigma_u^2$ that define its kernel function $K_u(d(x_u,x))$ (e.g., with EM).
- Stage 2: the weights $w_u$ are trained to maximize the fit of the network to the data using the global squared error criterion.

- Since the $K_u$ are fixed during the second stage, the network can be trained with efficient linear least-squares methods.
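A minimal Python sketch of the two-stage idea, assuming stage 1 has already fixed the centers and a shared width; stage 2 here uses simple LMS-style gradient descent on the global squared error (any linear least-squares method would do, since the problem is linear in the weights). All names are mine:

```python
import math

def gaussian(d, sigma=1.0):
    # K_u(d) = exp(-d^2 / (2 sigma^2)); note the negative exponent
    return math.exp(-d * d / (2.0 * sigma * sigma))

def features(x, centers, sigma=1.0):
    # One hidden unit per center, plus a constant feature for w_0
    return [1.0] + [gaussian(abs(x - c), sigma) for c in centers]

def train_rbf(xs, ys, centers, sigma=1.0, eta=0.1, epochs=3000):
    # Stage 2: with the K_u fixed, fitting the weights is linear;
    # here plain LMS gradient descent on the squared error.
    w = [0.0] * (len(centers) + 1)
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            phi = features(x, centers, sigma)
            err = y - sum(wi * pi for wi, pi in zip(w, phi))
            w = [wi + eta * err * pi for wi, pi in zip(w, phi)]
    return w

def rbf_predict(w, x, centers, sigma=1.0):
    return sum(wi * pi for wi, pi in zip(w, features(x, centers, sigma)))
```

Because the hidden layer is frozen, the error surface for the $w_u$ is a quadratic bowl, which is why this stage is cheap compared to training a full neural net.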

- In CBR the instances are represented by rich symbolic descriptions, not by points in space.
- Useful when we can't represent instances as points in space.
- The representations often use symbolic logic. The examples consist of a series of facts: ((user-complaint error53-on-shutdown) (cpu-model PowerPC) (operating-system Windows) (network-connection PCIA) (memory 48meg) (installed-applications Excel Netscape VirusScan) (disk 1gig) (likely-cause ???))

- The CADET [3] system was designed to assist in the design of mechanical devices.
- It stores 75 examples of mechanical devices.
- Each training example consists of a pair [Structure, Function].
- A query provides the functions and asks for any structure that has a similar function.
- The structure is a diagram of the device. The function is a high-level description of what it does.

- CADET first tries to find an exact match.
- If no exact match is found, it tries to match various subgraphs of the desired functional specification (that is, it matches parts).
- If multiple cases match different subgraphs then maybe the entire query can be pieced together.
- If still no match, then elaborate existing examples in order to create graphs that might match. For example \[A \rightarrow B\] becomes \[A \rightarrow x \rightarrow B\]

- CADET represents instances with rich symbolic descriptions so it needs a new similarity metric (e.g., size of largest shared subgraph).
- Both kNN and CADET can use multiple examples in classification, but they combine them in different ways; CADET must have special rules for combining.
- CBR, in general, shows a tight coupling between case retrieval, knowledge-based reasoning, and problem solving.
- CBR is useful for simple matching of queries such as help-desk questions.

- CBR continues to be an area of research.
- There are many successful commercial applications. It is useful for help-desk-type situations where the same few problems keep cropping up (queries cluster around only a few points).
- More info on the CBR web [4].

- Instance-based methods are also known as lazy learning because they do not generalize until needed.
- All the other learning methods we have seen (and even radial basis function networks) are eager learning methods because they generalize before seeing the query.
- The eager learner must create a global approximation.
- The lazy learner can create many local approximations.
- If they both use the same $H$ then, in effect, the lazy learner can represent more complex functions.
- For example, if $H$ consists of linear functions then a lazy learner can construct what amounts to a non-linear global function.

- [1] Machine Learning book at Amazon, http://www.amazon.com/exec/obidos/ASIN/0070428077/multiagentcom/
- [2] Slides by Tom Mitchell on Machine Learning, http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html
- [3] CADET Homepage, http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/cadet/ftp/docs/CADET.html
- [4] CBR Web, http://www.cbr-web.org/

This talk available at http://jmvidal.cse.sc.edu/talks/instancelearning/

Copyright © 2009 José M. Vidal. All rights reserved.

04 March 2003, 11:13AM