Learning from Data

Split your data into a training set, a development/validation set, and a test set (a sketch of such a split follows the workflow below).

The basic workflow is:

  1. Train your model on the training set.
  2. Tune hyperparameters on the development set.
  3. Evaluate on the test set (only once!).
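
One way to produce this split, sketched with scikit-learn's train_test_split; X and y are hypothetical feature and label arrays, and the 60/20/20 ratio is just an illustration:

from sklearn.model_selection import train_test_split

# Hold out 20% as the test set, then 25% of the remainder as the dev set,
# giving a 60/20/20 train/dev/test split.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)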

Aside: Parameters are internal to the model; they're what make your model work. Hyperparameters are external to the model; they specify how your model learns. For example, if your model is your brain, parameters are the cells, neurons, synapses, etc., and hyperparameters are how long to study, how much to study, where to study, etc.
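
To make the distinction concrete, a small sketch (reusing the hypothetical X_train and y_train from the split above): max_depth is a hyperparameter you choose before training, while the split thresholds the tree learns are parameters.

from sklearn.tree import DecisionTreeClassifier

# Hyperparameter: chosen by us, before any learning happens.
clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X_train, y_train)
# Parameters: learned from the data (here, the tree's split thresholds).
print(clf.tree_.threshold)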

An Analogy

You are studying for a math exam. The teacher gives you a list of practice problems like

2 + 5 = 7
3 - 4 = -1
1 + 2 = 3

and tells you that you will be tested on similar problems at the end of the semester.

The practice problems are the training data.

The exam is the test data.

You can approach this in a few different ways:

The consistent student

You glance at the problems, but you just can't be bothered to learn simple addition and subtraction. You decide that you will write '5' as the answer to every question you are given. You're bound to get at least some correct, right?

This is called underfitting, where you don't actually learn from the data.
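
The consistent student is roughly scikit-learn's DummyClassifier with a constant strategy; a sketch using the hypothetical split from earlier (note that scikit-learn requires the constant to be one of the training labels):

from sklearn.dummy import DummyClassifier

# Answer '5' to every question, ignoring the input entirely.
clf = DummyClassifier(strategy='constant', constant=5)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # the accuracy of always guessing 5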

The memorizer

You memorize the answer to every practice problem. So you know that 2 + 5 = 7. However, when you go to take the test, you see 2 + 6. That was not one of the practice problems, so you don't know how to solve it.

This is called overfitting, where you do well on things you have seen but do poorly on things you haven't seen. Your model does not generalize.
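
A 1-nearest-neighbor classifier is a fair stand-in for the memorizer, since it stores the training set verbatim; a sketch, again with the hypothetical split from earlier:

from sklearn.neighbors import KNeighborsClassifier

# 1-NN memorizes the training set outright.
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train, y_train)
print(clf.score(X_train, y_train))  # usually 1.0: it has seen these exact problems
print(clf.score(X_test, y_test))    # often much lower: it never learned the rule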

The studious student

You set aside some of the problems as a pretest. You work through your practice problems, then test yourself with the pretest. You review the problems you missed and figure out how to get them right the next time.

You have separated your data into a train and dev set. Using this approach, your model should generalize better to unseen (test) data.
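
As a sketch of that loop (reusing the hypothetical X_train, y_train, X_dev, y_dev split from earlier, with tree depth as the hyperparameter being tuned):

from sklearn.tree import DecisionTreeClassifier

# Try a few depths on the training data; keep the one the dev set likes best.
best_depth, best_score = None, -1.0
for depth in [1, 2, 4, 8, None]:
    clf = DecisionTreeClassifier(max_depth=depth).fit(X_train, y_train)
    score = clf.score(X_dev, y_dev)
    if score > best_score:
        best_depth, best_score = depth, score
print(best_depth, best_score)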

Cross Validation

Split your data into multiple train/dev sets. For 5-fold cross validation (T = training fold, D = dev fold):

  1. TTTTD
  2. TTTDT
  3. TTDTT
  4. TDTTT
  5. DTTTT
In scikit-learn, for example (extract_features stands for whatever turns a raw example into a feature vector):

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

clf = DecisionTreeClassifier()
X = [extract_features(x) for x in trainX]  # featurize the raw training examples
print(cross_val_score(clf, X, trainY, cv=5))  # one score per fold

Common values for the number of folds are 10 and $n$, where $n$ is the number of training examples. Using $n$ is called leave-one-out cross validation.
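
Scikit-learn's LeaveOneOut splitter makes the $n$-fold case explicit; a minimal sketch, reusing the hypothetical featurized X and labels trainY from above:

from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# n folds, each holding out exactly one example as the dev "set".
scores = cross_val_score(DecisionTreeClassifier(), X, trainY, cv=LeaveOneOut())
print(scores.mean())  # average accuracy across the n folds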