Learning To Predict Probabilities
- Consider predicting survival probability from patient data
- Training examples $\langle x_{i}, d_{i} \rangle$, where
$d_{i}$ is 1 or 0
- Want to train a neural network to output the probability
that $d_i = 1$ given $x_i$ (not a hard 0 or 1)
- In this case one can show that the maximum likelihood
hypothesis is
\[ h_{ML} = \argmax_{h \in H} \sum_{i=1}^{m} \left[ d_{i} \ln h(x_{i}) + (1-d_{i})
\ln (1 - h(x_{i})) \right] \]
The negation of this quantity is known as the cross entropy.
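As a concrete illustration, here is a minimal sketch in Python (using NumPy; the function and variable names are ours, not from the slide) that evaluates the cross entropy, i.e. the negated log-likelihood above, for a vector of boolean targets and predicted probabilities:

```python
import numpy as np

def cross_entropy(d, h):
    """Negated log-likelihood: -sum_i [d_i ln h(x_i) + (1-d_i) ln(1-h(x_i))]."""
    eps = 1e-12                      # guard against log(0)
    h = np.clip(h, eps, 1 - eps)
    return -np.sum(d * np.log(h) + (1 - d) * np.log(1 - h))

d = np.array([1, 0, 1])              # observed boolean outcomes d_i
h = np.array([0.9, 0.2, 0.6])        # network outputs h(x_i)
print(cross_entropy(d, h))           # ~0.84; maximizing likelihood minimizes this
```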
- To maximize this quantity we perform gradient ascent with
respect to the network weights. The weight update works out
to be
\[ w_{jk} \leftarrow w_{jk} + \Delta w_{jk}\]
where
\[ \Delta w_{jk} = \eta \sum_{i=1}^{m} (d_{i} - h(x_{i})) x_{ijk} \]
and $x_{ijk}$ is the $k$th input to unit $j$ for the $i$th
training example.
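For a single sigmoid unit $h(x) = \sigma(w \cdot x)$ this batch update can be sketched as follows (a minimal illustration; the input matrix `X`, whose rows are the $x_i$, and the other names are assumptions, not from the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent_step(w, X, d, eta=0.1):
    """One batch step: w_k <- w_k + eta * sum_i (d_i - h(x_i)) * x_ik."""
    h = sigmoid(X @ w)               # predicted probabilities h(x_i)
    return w + eta * X.T @ (d - h)   # gradient of the log-likelihood

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
d = np.array([1.0, 0.0, 1.0])
w = np.zeros(2)
for _ in range(100):                 # repeat until the likelihood stops improving
    w = gradient_ascent_step(w, X, d)
```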
- This is the same rule used by Backpropagation except that
Backpropagation multiplies by an extra term $h(x_i)(1 -
h(x_i))$, which is the derivative of the sigmoid
function.
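To see why the extra term disappears here, consider a single sigmoid unit $h(x) = \sigma(w \cdot x)$; by the chain rule (a one-line check, not on the original slide),
\[
\frac{\partial}{\partial w_{k}} \bigl[ d \ln h(x) + (1-d) \ln (1 - h(x)) \bigr]
= \frac{d - h(x)}{h(x)(1 - h(x))} \, h(x)(1 - h(x)) \, x_{k}
= (d - h(x)) \, x_{k},
\]
so the sigmoid derivative $h(x)(1 - h(x))$ cancels exactly.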
- Backpropagation updates seek ML hypothesis under the
assumption that training data can be modeled by Normal noise
on the target function.
- Cross entropy updates seek ML hypothesis under the
assumption that observed boolean value is a probabilistic
function of input instance.
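For contrast, under the Normal-noise assumption the maximum likelihood hypothesis is the one that minimizes the sum of squared errors (the standard result behind the Backpropagation rule):
\[ h_{ML} = \argmin_{h \in H} \sum_{i=1}^{m} (d_{i} - h(x_{i}))^{2} \]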