Learning from Data
Split your data into a training set, a development/validation (dev) set, and a test set.
The basic workflow is:
- Train your model on the training set.
- Tune model hyperparameters on the dev set.
- Evaluate on the test set (only once!)
Aside: Parameters are internal to the model; they're what make your model work. Hyperparameters are external to the model; they specify how your model learns. For example, if your model is your brain, parameters are the cells, neurons, synapses, etc., and hyperparameters are how long to study, how much to study, where to study, etc.
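The workflow above can be sketched with scikit-learn. This is a minimal illustration on synthetic data (the `make_classification` dataset and the 60/20/20 split sizes are illustrative choices, not from the notes); note where parameters are learned versus where a hyperparameter is chosen.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, just for illustration.
X, y = make_classification(n_samples=1000, random_state=0)

# 60% train, 20% dev, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Tune a hyperparameter (max_depth) on the dev set.
best_depth, best_score = None, -1.0
for depth in [2, 4, 8, None]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)        # parameters (the tree itself) are learned here
    score = clf.score(X_dev, y_dev)  # the hyperparameter is chosen here
    if score > best_score:
        best_depth, best_score = depth, score

# Evaluate on the test set only once, with the chosen hyperparameter.
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print(final.score(X_test, y_test))
```

The test set never influences any choice; it is touched exactly once, at the end.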
An Analogy
You are studying for a math exam. The teacher gives you a list of practice problems like
2 + 5 = 7
3 - 4 = -1
1 + 2 = 3
and tells you that you will be tested on similar problems at the end of the semester.
The practice problems are the training data.
The exam is the test data.
You can approach this in a few different ways:
The consistent student
You glance at the problems, but you just can't be bothered to learn simple addition and subtraction. You decide that you will write '5' as the answer to every question you are given. You're bound to get at least some correct, right?
This is called underfitting, where you don't actually learn from the data.
The memorizer
You memorize the answer to every practice problem. So you know that 2 + 5 = 7. However, when you go to take the test, you see 2 + 6. That was not one of the practice problems, so you don't know how to solve it.
This is called overfitting, where you do well on things you have seen but do poorly on things you haven't seen. Your model does not generalize.
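Both failure modes show up numerically as a gap between training and test accuracy. A small sketch, again with an illustrative synthetic dataset: an unconstrained decision tree can memorize the training set (the memorizer), while a depth-1 stump barely learns anything (the consistent student).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Overfitting: an unconstrained tree memorizes the training data,
# so its training accuracy is (near) perfect but its test accuracy lags.
memorizer = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(memorizer.score(X_train, y_train))  # on seen problems
print(memorizer.score(X_test, y_test))    # on unseen problems

# Underfitting: a depth-1 stump is too simple to learn the pattern,
# so it does poorly even on the training data.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_train, y_train)
print(stump.score(X_train, y_train))
```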
The studious student
You set aside some of the problems as a pretest. You work through your practice problems, then test yourself with the pretest. You review the problems you missed and figure out how to get them right the next time.
You have separated your data into a train and dev set. Using this approach, your model should generalize better to unseen (test) data.
Cross Validation
Split your data into multiple train/dev sets. For 5-fold cross validation:
- TTTTD
- TTTDT
- TTDTT
- TDTTT
- DTTTT
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
X = [extract_features(x) for x in trainX]  # featurize each training example
# 5-fold cross validation: each score comes from training on 4 folds
# and evaluating on the held-out fold.
print(cross_val_score(clf, X, trainY, cv=5))
Common values for the number of folds are 10 and $n$ (the number of examples). Using $n$ folds is called leave-one-out cross validation.
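Leave-one-out runs through the same machinery; a sketch, assuming scikit-learn and using the built-in iris dataset for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Leave-one-out: train on n - 1 examples, test on the single held-out one,
# repeated n times. cross_val_score returns one score per example.
scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print(len(scores), scores.mean())
```

Each individual score is 0 or 1 (one test example per fold), so the mean is the overall accuracy; the cost is fitting the model $n$ times.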