Evaluating Performance
How good is your model?
Measure using a loss function: evaluate on a test pair $(x, y)$ and check whether the prediction $f(x)$ matches the true label $y$.
\[\text{loss}(y, f(x)) = \begin{cases}
1 & \text{if } y \neq f(x) \\
0 & \text{if } y = f(x)
\end{cases}\]
This is called 0/1 loss. Accuracy is its flip side: the fraction of test pairs with zero loss, i.e. $1 -$ the average 0/1 loss.
Your homework computes this already!
309 / 469 0.6588486140724946
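That is 309 correct predictions out of 469 test points, about 66% accuracy. Here is a minimal sketch of the same computation; the y_true/y_pred lists are made-up stand-ins for your gold labels and model predictions:

```python
# A hand-rolled 0/1 loss / accuracy computation.
# y_true and y_pred are hypothetical stand-ins for your homework's
# gold labels and model predictions.
y_true = ["PTEXT", "NNHEAD", "SIG", "PTEXT", "QUOTED"]
y_pred = ["PTEXT", "NNHEAD", "PTEXT", "PTEXT", "SIG"]

# 0/1 loss per test pair: 1 for a mismatch, 0 for a match
losses = [0 if t == p else 1 for t, p in zip(y_true, y_pred)]

correct = losses.count(0)
accuracy = correct / len(losses)             # = 1 - mean 0/1 loss
print(correct, "/", len(losses), accuracy)   # 3 / 5 0.6
```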
Other measures: precision and recall
scikit-learn's classification_report() calculates this for you:
              precision    recall  f1-score   support

     ADDRESS       0.00      0.00      0.00         0
     GRAPHIC       0.00      0.00      0.00         7
       HEADL       0.50      1.00      0.67         1
        ITEM       0.80      0.08      0.15        50
      NNHEAD       0.81      0.92      0.86       171
       PTEXT       0.50      0.85      0.63       110
      QUOTED       0.88      0.35      0.50       107
         SIG       0.64      0.70      0.67        23
       TABLE       0.00      0.00      0.00         0

   micro avg       0.66      0.66      0.66       469
   macro avg       0.46      0.43      0.39       469
weighted avg       0.73      0.66      0.63       469
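Producing a report like the one above takes one call; a sketch, with hypothetical parallel lists of gold and predicted tags standing in for your data:

```python
# A sketch of generating the report above. y_true and y_pred are
# hypothetical parallel lists of gold and predicted tags.
from sklearn.metrics import classification_report

y_true = ["PTEXT", "NNHEAD", "SIG", "PTEXT", "QUOTED"]
y_pred = ["PTEXT", "NNHEAD", "PTEXT", "PTEXT", "SIG"]

# zero_division=0 (available in recent scikit-learn versions) silences
# the warnings for labels that are never predicted, like ADDRESS above.
print(classification_report(y_true, y_pred, zero_division=0))
```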
How to read this:
- Precision: out of everything you labeled as ADDRESS, how many were actually addresses?
- Recall: out of all the addresses, how many did you label as ADDRESS?
- F1 score: the harmonic mean of precision and recall, $\frac{2pr}{p + r}$ (computed by hand in the sketch after this list)
- Support: how many data points had this label
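These definitions are easy to compute by hand for a single label. A sketch, where LABEL and the toy data are hypothetical:

```python
# Precision, recall, and F1 for one label, mirroring the definitions
# above. LABEL, y_true, and y_pred are hypothetical toy data.
LABEL = "SIG"
y_true = ["SIG", "SIG", "PTEXT", "SIG", "PTEXT"]
y_pred = ["SIG", "PTEXT", "SIG", "SIG", "PTEXT"]

tp = sum(1 for t, p in zip(y_true, y_pred) if p == LABEL and t == LABEL)
fp = sum(1 for t, p in zip(y_true, y_pred) if p == LABEL and t != LABEL)
fn = sum(1 for t, p in zip(y_true, y_pred) if p != LABEL and t == LABEL)

precision = tp / (tp + fp) if tp + fp else 0.0  # of everything labeled SIG
recall = tp / (tp + fn) if tp + fn else 0.0     # of all the true SIGs
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(precision, recall, f1)  # 0.667 0.667 0.667 (rounded)
```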
Averages
- micro average: compute metrics globally, from the total true positives, false negatives, and false positives across all labels
- macro average: the unweighted mean of the per-label metrics
- weighted average: the mean of the per-label metrics, weighted by each label's support (all three are compared in the sketch below)
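A sketch comparing the three averaging modes on the same predictions, using scikit-learn's precision_recall_fscore_support; the toy labels are hypothetical:

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical toy data; swap in your own gold labels and predictions.
y_true = ["SIG", "SIG", "PTEXT", "SIG", "PTEXT", "NNHEAD"]
y_pred = ["SIG", "PTEXT", "SIG", "SIG", "PTEXT", "NNHEAD"]

for avg in ("micro", "macro", "weighted"):
    # With average set, the per-label support slot comes back as None.
    p, r, f, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0)
    print(f"{avg:>8}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```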