It performs gradient descent over the entire network weight
vector.
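As a minimal sketch of what "gradient descent over the entire
weight vector" can look like, the example below trains a tiny
2-2-1 sigmoid network on the AND function. The network size, the
flat 9-element weight layout, the learning rate `eta`, the step
count, the starting weights, and all function names are
assumptions chosen for illustration; they do not come from the
notes.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w, x):
    """Forward pass of a 2-2-1 network.

    w is one flat list of 9 weights (assumed layout):
    w[0:3] hidden unit 0 (bias, x0, x1), w[3:6] hidden unit 1,
    w[6:9] output unit (bias, h0, h1).
    """
    h = [sigmoid(w[3 * j] + w[3 * j + 1] * x[0] + w[3 * j + 2] * x[1])
         for j in range(2)]
    o = sigmoid(w[6] + w[7] * h[0] + w[8] * h[1])
    return h, o

def total_error(w, data):
    """Half the sum of squared errors over the training examples."""
    return 0.5 * sum((t - forward(w, x)[1]) ** 2 for x, t in data)

def gradient(w, data):
    """Gradient of total_error w.r.t. every weight, via backpropagation."""
    g = [0.0] * 9
    for x, t in data:
        h, o = forward(w, x)
        # Error term at the output unit's net input.
        delta_o = (o - t) * o * (1.0 - o)
        # Error terms propagated back to the hidden units' net inputs.
        delta_h = [delta_o * w[7 + j] * h[j] * (1.0 - h[j]) for j in range(2)]
        for j in range(2):
            g[3 * j] += delta_h[j]
            g[3 * j + 1] += delta_h[j] * x[0]
            g[3 * j + 2] += delta_h[j] * x[1]
        g[6] += delta_o
        g[7] += delta_o * h[0]
        g[8] += delta_o * h[1]
    return g

def gradient_descent(w, data, eta=0.5, steps=2000):
    """Repeatedly step the whole weight vector against the gradient."""
    for _ in range(steps):
        g = gradient(w, data)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

# Hypothetical training set (logical AND) and arbitrary small starting weights.
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w0 = [0.1, -0.2, 0.3, -0.1, 0.2, -0.3, 0.05, 0.15, -0.25]
w_trained = gradient_descent(w0, AND_DATA)
```

Comparing `total_error(w0, AND_DATA)` with
`total_error(w_trained, AND_DATA)` shows how far the descent has
reduced the training error. Every weight in the network is updated
in the same step, which is the sense in which the search runs over
the whole weight vector.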
It can be generalized to handle arbitrary directed
graphs.
It will find a local minimum, which might not be the global
minimum. In practice, this has not been a large problem: local
minima are often good enough, and the high dimensionality of the
weight space provides more escape routes.
A popular technique is to add momentum to the weight update.
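One common form of momentum keeps a velocity vector v and updates
it as v <- alpha * v - eta * gradient, then w <- w + v, so that
part of the previous step carries over into the next one. The
sketch below assumes this form. The function names, the values of
`eta` and `alpha`, and the quadratic "bowl" used as a stand-in
error surface are all assumptions for illustration, not taken from
the notes.

```python
def descend_with_momentum(w, grad_fn, eta=0.02, alpha=0.9, steps=200):
    """Gradient descent with a momentum term.

    alpha is the momentum coefficient; alpha = 0.0 gives plain
    gradient descent. grad_fn(w) must return the gradient at w.
    """
    v = [0.0] * len(w)  # velocity: a running blend of past steps
    for _ in range(steps):
        g = grad_fn(w)
        # Keep a fraction alpha of the previous step, add the new gradient step.
        v = [alpha * vi - eta * gi for vi, gi in zip(v, g)]
        w = [wi + vi for wi, vi in zip(w, v)]
    return w

def bowl_gradient(w):
    """Gradient of the toy surface 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return [w[0], 25.0 * w[1]]

# Same start, same learning rate, with and without momentum.
w_plain = descend_with_momentum([1.0, 1.0], bowl_gradient, alpha=0.0)
w_momentum = descend_with_momentum([1.0, 1.0], bowl_gradient, alpha=0.9)
```

On this toy surface the shallow direction limits plain gradient
descent, and the momentum run ends closer to the minimum at the
origin after the same number of steps. In a network, `grad_fn`
would be the backpropagated gradient of the training error.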
As always, it minimizes the error over the training
examples. Will it generalize to unseen examples?
It can be very slow to train.
But, once trained, it can classify new examples very
quickly.
Inductive bias: it is very hard to quantify, but can
be characterized as smooth interpolation between data
points.