What to Save from an Experiment

30 August 2021

Tags: Advice

Everything.

(I jest, but really)

This is especially important if your model takes forever to train. The must-haves are:

Bare Necessities

- Train, validation, and test scores for each run
- Final model weights (or the entire model)
- Raw prediction probabilities
  - Good for checking where your model struggles and to double-check your final scores
- Loss from each epoch (I usually average the loss over batches)
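
A minimal sketch of writing these out for one run, assuming everything fits in plain Python containers. The names `save_run` and `mean_epoch_loss`, the file names, and the directory layout are all hypothetical. Weights are left out to keep the sketch dependency-free; in PyTorch they would typically be written with `torch.save(model.state_dict(), path)`.

```python
import json
from pathlib import Path


def mean_epoch_loss(batch_losses):
    """Average the per-batch losses of one epoch into a single number."""
    return sum(batch_losses) / len(batch_losses)


def save_run(out_dir, scores, epoch_losses, probabilities):
    """Write scores, per-epoch losses, and raw probabilities as JSON files.

    The function name and file layout are illustrative assumptions.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # scores: {"train": ..., "val": ..., "test": ...}
    (out / "scores.json").write_text(json.dumps(scores, indent=2))
    # epoch_losses: one averaged loss per epoch
    (out / "epoch_losses.json").write_text(json.dumps(epoch_losses))
    # probabilities: {sample_id: [p_class0, p_class1, ...]}
    (out / "probabilities.json").write_text(json.dumps(probabilities))
    return out
```

Keying the probabilities by sample ID is a design choice for this sketch: it makes it easy to look up the exact examples the model struggles with later.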

A+

- Model checkpoint every x epochs
- Sample IDs for each train/val/test split
- The runtime settings
  - This is usually a combination of dumping the args and registering buffers for my model (PyTorch-specific)
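
One way the settings and split IDs could be dumped, as a rough sketch using only the standard library. The argument names, `build_parser`, `save_settings`, and `should_checkpoint` are hypothetical, and the buffer-registering half of the approach (PyTorch's `register_buffer`) is not shown here.

```python
import argparse
import json
from pathlib import Path


def build_parser():
    """Example parser; the flags here are placeholders, not a real config."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=1e-3)
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--checkpoint-every", type=int, default=5)
    return parser


def save_settings(args, split_ids, out_dir):
    """Dump parsed args and the sample IDs of each split to JSON."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # vars() turns an argparse.Namespace into a plain dict
    (out / "args.json").write_text(json.dumps(vars(args), indent=2))
    # split_ids: {"train": [...], "val": [...], "test": [...]}
    (out / "splits.json").write_text(json.dumps(split_ids, indent=2))
    return out


def should_checkpoint(epoch, every):
    """True on every `every`-th epoch, counting epochs from 1."""
    return epoch % every == 0
```

Inside a training loop, `should_checkpoint(epoch, args.checkpoint_every)` would decide when to write the model to disk.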

The above are guidelines, but to make sure you save everything you need, it is best to run a miniature version of the entire experiment.

Run a very small subset of the experiment (10 batches, 1 epoch, whatever; preferably something that takes less than 5 minutes, so you can quickly iterate over bugs), then plot the results and ask questions. Do you want to be able to see the loss for each batch? Do you want a different metric (accuracy, F1, etc.)? Good. Now go back to your code, re-run the miniature experiment, and iterate until you feel confident. THEN you can run the entire experiment.
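
A miniature run can be as simple as capping the number of batches. The sketch below assumes a training step that takes a batch and returns its loss; `run_epoch`, `step_fn`, and `max_batches` are hypothetical names, not part of any framework.

```python
from itertools import islice


def run_epoch(batches, step_fn, max_batches=None):
    """Run step_fn over batches, optionally stopping after max_batches.

    Returns the per-batch losses so they can be plotted afterwards.
    """
    # islice with None as the stop value consumes the whole iterable
    limited = islice(batches, max_batches)
    return [step_fn(batch) for batch in limited]
```

Calling it with `max_batches=10` gives the quick test run; leaving `max_batches` unset runs the full epoch through the same code path, so the saving logic you just tested is the logic the real experiment uses.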