Artificial Neural Networks
Gradient Descent
The delta rule is designed to converge even if the examples are not linearly separable.
It performs gradient descent on the hypothesis space.
Consider a simpler linear unit, where
$o = w_0 + w_1 x_1 + \cdots + w_n x_n$
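As a concrete illustration, here is a minimal Python sketch of such a linear unit; the weight values and inputs are invented for the example, and the bias weight $w_0$ is stored as the first entry of the weight vector.

```python
import numpy as np

# A minimal sketch of the linear unit above: o = w_0 + w_1*x_1 + ... + w_n*x_n.
# The weight vector w includes the bias term w_0 as its first entry.
def linear_unit_output(w, x):
    return w[0] + np.dot(w[1:], x)

# Illustrative (assumed) weights and inputs for a unit with two inputs.
w = np.array([0.5, -1.0, 2.0])   # w_0, w_1, w_2
x = np.array([3.0, 1.5])         # x_1, x_2
print(linear_unit_output(w, x))  # 0.5 + (-1.0)*3.0 + 2.0*1.5 = 0.5
```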
Let's learn the $w_i$'s that minimize the squared error
$E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
where $D$ is the set of training examples.
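A short sketch of this error computation follows; the training matrix `X` (one example per row, without the bias input), the target vector `t`, and the weight values are assumptions made up for illustration.

```python
import numpy as np

# E(w) = 1/2 * sum over d in D of (t_d - o_d)^2, where o_d is the
# linear unit's output on training example d.
def squared_error(w, X, t):
    o = X @ w[1:] + w[0]              # outputs o_d for every example d
    return 0.5 * np.sum((t - o) ** 2)

# Illustrative data: three training examples with two inputs each.
X = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]])
t = np.array([1.0, 0.0, 2.0])
w = np.array([0.1, 0.2, -0.3])        # hypothetical initial weights
print(squared_error(w, X, t))
```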
Let's try to minimize this error.
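One way to do this is to repeatedly step the weights in the direction of the negative gradient of $E$. The sketch below uses the standard batch update $\Delta w_i = \eta \sum_{d \in D} (t_d - o_d)\, x_{i,d}$; the learning rate `eta` and the number of epochs are assumed hyperparameters, not values from the slide.

```python
import numpy as np

# A minimal batch gradient descent sketch on E(w).
# eta (learning rate) and epochs are assumed hyperparameters.
def gradient_descent(X, t, eta=0.01, epochs=1000):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x_0 = 1 for the bias w_0
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        o = Xb @ w                     # outputs o_d for all d in D
        w += eta * Xb.T @ (t - o)      # Delta w_i = eta * sum_d (t_d - o_d) * x_{i,d}
    return w

X = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]])
t = np.array([1.0, 0.0, 2.0])
print(gradient_descent(X, t))
```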