Learning To Predict Probabilities
- Consider predicting survival probability from patient data
- Training examples $\langle x_{i}, d_{i} \rangle$, where
$d_{i}$ is 1 or 0
- Want to train a neural network to output the probability
that $d_i = 1$ given $x_i$ (not a hard 0 or 1)
- In this case one can show that the maximum likelihood
hypothesis is
\[ h_{ML} = \argmax_{h \in H} \sum_{i=1}^{m} \left[ d_{i} \ln h(x_{i}) + (1-d_{i})
\ln (1 - h(x_{i})) \right] \]
The negation of this quantity is known as the cross entropy.
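As a concrete illustration, here is a minimal sketch in Python (using NumPy; the function and variable names are ours, not from the slide) that evaluates the cross entropy, i.e. the negated log-likelihood above, for a vector of boolean targets and predicted probabilities:

```python
import numpy as np

def cross_entropy(d, h):
    """Negated log-likelihood: -sum_i [d_i ln h(x_i) + (1-d_i) ln(1-h(x_i))]."""
    eps = 1e-12                      # guard against log(0)
    h = np.clip(h, eps, 1 - eps)
    return -np.sum(d * np.log(h) + (1 - d) * np.log(1 - h))

d = np.array([1, 0, 1])              # observed boolean outcomes d_i
h = np.array([0.9, 0.2, 0.6])        # network outputs h(x_i)
print(cross_entropy(d, h))           # ~0.84; maximizing likelihood minimizes this
```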
- To maximize this quantity we perform gradient ascent with
respect to the network weights. The weight update works out
to be
\[ w_{jk} \leftarrow w_{jk} + \Delta w_{jk}\]
where
\[ \Delta w_{jk} = \eta \sum_{i=1}^{m} (d_{i} - h(x_{i})) x_{ijk} \]
and $x_{ijk}$ is the $k$th input to unit $j$ for the $i$th
training example.
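For a single sigmoid unit $h(x) = \sigma(w \cdot x)$ this batch update can be sketched as follows (a minimal illustration; the input matrix `X`, whose rows are the $x_i$, and the other names are assumptions, not from the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent_step(w, X, d, eta=0.1):
    """One batch step: w_k <- w_k + eta * sum_i (d_i - h(x_i)) * x_ik."""
    h = sigmoid(X @ w)               # predicted probabilities h(x_i)
    return w + eta * X.T @ (d - h)   # gradient of the log-likelihood

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
d = np.array([1.0, 0.0, 1.0])
w = np.zeros(2)
for _ in range(100):                 # repeat until the likelihood stops improving
    w = gradient_ascent_step(w, X, d)
```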
- This is the same rule used by Backpropagation except that
Backpropagation multiplies by an extra term $h(x_i)(1 -
h(x_i))$, which is the derivative of the sigmoid
function.
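To see why the extra term disappears here, consider a single sigmoid unit $h(x) = \sigma(w \cdot x)$; by the chain rule (a one-line check, not on the original slide),
\[
\frac{\partial}{\partial w_{k}} \bigl[ d \ln h(x) + (1-d) \ln (1 - h(x)) \bigr]
= \frac{d - h(x)}{h(x)(1 - h(x))} \, h(x)(1 - h(x)) \, x_{k}
= (d - h(x)) \, x_{k},
\]
so the sigmoid derivative $h(x)(1 - h(x))$ cancels exactly.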
- Backpropagation updates seek ML hypothesis under the
assumption that training data can be modeled by Normal noise
on the target function.
- Cross entropy updates seek ML hypothesis under the
assumption that observed boolean value is a probabilistic
function of input instance.
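For contrast, under the Normal-noise assumption the maximum likelihood hypothesis is the one that minimizes the sum of squared errors (the standard result behind the Backpropagation rule):
\[ h_{ML} = \argmin_{h \in H} \sum_{i=1}^{m} (d_{i} - h(x_{i}))^{2} \]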