Evaluating Performance
How good is your model?
Measure using a loss function: evaluate on a test pair $(x, y)$ and check whether the prediction $f(x)$ matches the true label $y$.
\[\text{loss}(y, f(x)) = \begin{cases}
1 & \text{if } y \neq f(x) \\
0 & \text{if } y = f(x)
\end{cases}\]
This is called 0/1 loss. Accuracy is its flip side: the fraction of test pairs with zero loss, i.e. $1 -$ the average 0/1 loss.
Your homework computes this already!
309 / 469 0.6588486140724946
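That is 309 correct predictions out of 469 test points, about 66% accuracy. Here is a minimal sketch of the same computation; the y_true/y_pred lists are made-up stand-ins for your gold labels and model predictions:

```python
# A hand-rolled 0/1 loss / accuracy computation.
# y_true and y_pred are hypothetical stand-ins for your homework's
# gold labels and model predictions.
y_true = ["PTEXT", "NNHEAD", "SIG", "PTEXT", "QUOTED"]
y_pred = ["PTEXT", "NNHEAD", "PTEXT", "PTEXT", "SIG"]

# 0/1 loss per test pair: 1 for a mismatch, 0 for a match
losses = [0 if t == p else 1 for t, p in zip(y_true, y_pred)]

correct = losses.count(0)
accuracy = correct / len(losses)             # = 1 - mean 0/1 loss
print(correct, "/", len(losses), accuracy)   # 3 / 5 0.6
```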
Other measures: precision and recall
scikit-learn's classification_report() calculates this for you:
              precision    recall  f1-score   support

     ADDRESS       0.00      0.00      0.00         0
     GRAPHIC       0.00      0.00      0.00         7
       HEADL       0.50      1.00      0.67         1
        ITEM       0.80      0.08      0.15        50
      NNHEAD       0.81      0.92      0.86       171
       PTEXT       0.50      0.85      0.63       110
      QUOTED       0.88      0.35      0.50       107
         SIG       0.64      0.70      0.67        23
       TABLE       0.00      0.00      0.00         0

   micro avg       0.66      0.66      0.66       469
   macro avg       0.46      0.43      0.39       469
weighted avg       0.73      0.66      0.63       469
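Producing a report like the one above takes one call; a sketch, with hypothetical parallel lists of gold and predicted tags standing in for your data:

```python
# A sketch of generating the report above. y_true and y_pred are
# hypothetical parallel lists of gold and predicted tags.
from sklearn.metrics import classification_report

y_true = ["PTEXT", "NNHEAD", "SIG", "PTEXT", "QUOTED"]
y_pred = ["PTEXT", "NNHEAD", "PTEXT", "PTEXT", "SIG"]

# zero_division=0 (available in recent scikit-learn versions) silences
# the warnings for labels that are never predicted, like ADDRESS above.
print(classification_report(y_true, y_pred, zero_division=0))
```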
How to read this:
- Precision: out of everything you labeled as ADDRESS, how many were actually addresses?
- Recall: out of all the addresses, how many did you label as ADDRESS?
- F1 score: the harmonic mean of precision and recall, $\frac{2pr}{p + r}$ (computed by hand in the sketch after this list)
- Support: how many data points had this label
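These definitions are easy to compute by hand for a single label. A sketch, where LABEL and the toy data are hypothetical:

```python
# Precision, recall, and F1 for one label, mirroring the definitions
# above. LABEL, y_true, and y_pred are hypothetical toy data.
LABEL = "SIG"
y_true = ["SIG", "SIG", "PTEXT", "SIG", "PTEXT"]
y_pred = ["SIG", "PTEXT", "SIG", "SIG", "PTEXT"]

tp = sum(1 for t, p in zip(y_true, y_pred) if p == LABEL and t == LABEL)
fp = sum(1 for t, p in zip(y_true, y_pred) if p == LABEL and t != LABEL)
fn = sum(1 for t, p in zip(y_true, y_pred) if p != LABEL and t == LABEL)

precision = tp / (tp + fp) if tp + fp else 0.0  # of everything labeled SIG
recall = tp / (tp + fn) if tp + fn else 0.0     # of all the true SIGs
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(precision, recall, f1)  # 0.667 0.667 0.667 (rounded)
```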
Averages
- micro average: compute metrics globally, from the total true positives, false negatives, and false positives across all labels
- macro average: the unweighted mean of the per-label metrics
- weighted average: the mean of the per-label metrics, weighted by each label's support (all three are compared in the sketch below)
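A sketch comparing the three averaging modes on the same predictions, using scikit-learn's precision_recall_fscore_support; the toy labels are hypothetical:

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical toy data; swap in your own gold labels and predictions.
y_true = ["SIG", "SIG", "PTEXT", "SIG", "PTEXT", "NNHEAD"]
y_pred = ["SIG", "PTEXT", "SIG", "SIG", "PTEXT", "NNHEAD"]

for avg in ("micro", "macro", "weighted"):
    # With average set, the per-label support slot comes back as None.
    p, r, f, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0)
    print(f"{avg:>8}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```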