What to Save from an Experiment
Everything.
(I jest, but really)
Saving everything is especially important if your model takes forever to train. The must-haves are:
Bare Necessities
- Train, validation, and test scores for each run
- Final model weights (or the entire model)
- Raw prediction probabilities
  - Good for checking where your model struggles and for double-checking your final scores
- Loss from each epoch (I usually average the loss over batches)
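As a rough illustration of the list above, here is a minimal sketch that writes the scores, the per-epoch loss (averaged over batches), and the raw prediction probabilities to a run folder. The function name `save_run_artifacts`, the file names, and the demo values are all made up for this example, and it uses only the standard library so it stays framework-agnostic.

```python
import csv
import json
import tempfile
from pathlib import Path
from statistics import mean


def save_run_artifacts(run_dir, scores, batch_losses_per_epoch, probabilities):
    """Write scores, per-epoch loss, and raw probabilities under run_dir."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)

    # Train/validation/test scores for this run.
    (run_dir / "scores.json").write_text(json.dumps(scores, indent=2))

    # One loss value per epoch: the mean over that epoch's batches.
    epoch_losses = [mean(losses) for losses in batch_losses_per_epoch]
    (run_dir / "epoch_loss.json").write_text(json.dumps(epoch_losses))

    # Raw prediction probabilities, one row per sample ID.
    n_classes = len(next(iter(probabilities.values()), []))
    with open(run_dir / "probabilities.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["sample_id"] + [f"p_class{i}" for i in range(n_classes)])
        for sample_id, probs in probabilities.items():
            writer.writerow([sample_id, *probs])

    # Model weights are framework-specific and not handled here. In PyTorch,
    # torch.save(model.state_dict(), path) is the usual way to store them.
    return epoch_losses


# Demo with made-up numbers, written to a temporary folder.
demo_dir = tempfile.mkdtemp()
demo_epoch_losses = save_run_artifacts(
    demo_dir,
    scores={"train": 0.91, "val": 0.82, "test": 0.79},
    batch_losses_per_epoch=[[1.0, 0.5], [0.5, 0.25]],
    probabilities={"s1": [0.1, 0.9], "s2": [0.75, 0.25]},
)
```

Keeping the sample ID next to each probability row makes it easier to trace a bad prediction back to the example that caused it.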
A+
- Model checkpoint every x epochs
- Sample IDs for each train/val/test split
- The runtime settings
  - This is usually a combination of dumping the args and registering buffers for my model (PyTorch-specific)
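The items above could look roughly like the sketch below, assuming the experiment is configured with `argparse`. The flags, file names, and helper names (`build_parser`, `save_settings_and_splits`, `is_checkpoint_epoch`) are hypothetical. The PyTorch-specific half, registering buffers on the model with `register_buffer`, is not shown, to keep the example dependency-free.

```python
import argparse
import json
import tempfile
from pathlib import Path


def build_parser():
    # Hypothetical flags, for illustration only.
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=1e-3)
    parser.add_argument("--epochs", type=int, default=20)
    parser.add_argument("--checkpoint-every", type=int, default=5)
    return parser


def save_settings_and_splits(args, split_ids, out_dir):
    """Dump the runtime settings and the sample IDs of each split as JSON."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # vars(args) turns the argparse Namespace into a plain dict.
    (out_dir / "args.json").write_text(
        json.dumps(vars(args), indent=2, sort_keys=True)
    )
    (out_dir / "splits.json").write_text(json.dumps(split_ids, indent=2))


def is_checkpoint_epoch(epoch, every):
    """True on every `every`-th epoch, counting epochs from 1."""
    return every > 0 and epoch % every == 0


# Demo: parse an explicit argument list instead of sys.argv.
demo_args = build_parser().parse_args(["--lr", "0.01"])
demo_splits = {"train": ["s1", "s2", "s3"], "val": ["s4"], "test": ["s5"]}
demo_out = tempfile.mkdtemp()
save_settings_and_splits(demo_args, demo_splits, demo_out)

checkpoint_epochs = [
    epoch
    for epoch in range(1, demo_args.epochs + 1)
    if is_checkpoint_epoch(epoch, demo_args.checkpoint_every)
]
```

Writing the splits as lists of IDs, rather than as copies of the data, keeps the file small and still lets you rebuild the exact same train/val/test partition later.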
The above are guidelines, but to make sure you save everything you need, it is best to run a miniature version of the entire experiment first.
Run a very small subset of the experiment (10 batches, 1 epoch, whatever; preferably something that takes less than 5 minutes so you can quickly iterate over bugs), then plot the results and ask questions. Do you want to be able to see the loss for each batch? Do you want a different metric (accuracy, F1, etc.)? Good. Now go back to your code, re-run the miniature experiment, and iterate until you feel confident. THEN you can run the entire experiment.
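One way to make the miniature run cheap is to give the training loop a cap on the number of batches, so the small run and the full run go through exactly the same code. The sketch below assumes a hypothetical `run_experiment` function and a stand-in `fake_train_step`; a real training step would update the model and return its loss.

```python
from itertools import islice
from statistics import mean


def run_experiment(batches, train_step, epochs=1, max_batches=None):
    """Run `epochs` passes over `batches`, capped at `max_batches` per pass.

    `batches` must be re-iterable (e.g. a list) and yield at least one batch.
    """
    history = {"batch_loss": [], "epoch_loss": []}
    for _ in range(epochs):
        # islice(batches, None) yields every batch, so the full run
        # takes the same code path as the miniature one.
        losses = [train_step(batch) for batch in islice(batches, max_batches)]
        history["batch_loss"].append(losses)   # loss for each batch
        history["epoch_loss"].append(mean(losses))  # averaged over batches
    return history


def fake_train_step(batch):
    # Stand-in for a real training step: returns the batch mean as the "loss".
    return sum(batch) / len(batch)


# 100 made-up batches; the miniature run only touches the first 10.
demo_batches = [[float(i), float(i + 1)] for i in range(100)]
smoke = run_experiment(demo_batches, fake_train_step, epochs=1, max_batches=10)
full = run_experiment(demo_batches, fake_train_step, epochs=2)
```

Because the history keeps both the per-batch and the per-epoch loss, you can plot either one after the miniature run and decide which you actually want before committing to the long run.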