Monitor Training

While training is running, we can collect training statistics to monitor progress or to plot graphs such as loss per epoch.

Once training has started, we can monitor it in real time using the /stats endpoint.

curl http://localhost:1004/api/v1/stats

Expected response:

{
  "AP": {
    "face": 21.42,
    "person": 12.54
  },
  "ETA": 0,
  "architecture": "SSD",
  "backbone": "ResNet10",
  "batch_size": 32,
  "classes": [
    "face",
    "person"
  ],
  "cur_epoch": 3,
  "elapsed": 1298.49595,
  "end_time": "unknown",
  "estimated_gpu_power_consumption": 248.3848484,
  "et": 0,
  "flow_steps": {
    "start": "completed",
    "download_pretrained_model": "completed",
    "gen_tfrecords": "completed",
    "import_dataset": "completed",
    "monitor": "started",
    "training": "started"
  },
  "loss": 0.23554,
  "losses": {
    "1": 3.45251,
    "2": 1.34353,
    "3": 0.84652
  },
  "mAP": 16.975,
  "model_size": "54.20 MB",
  "num_gpus": 8,
  "pid": {
    "flask": 30521,
    "metaflow_shell": 30623,
    "tlt_train": 30841
  },
  "resolution": {
    "height": 300,
    "padding": false,
    "width": 300
  },
  "st": 1596022472.4209566,
  "start_time": "2020-07-29 11:34:32",
  "state": "training",
  "total_epoch": 50,
  "total_gpu_power_consumed": 12.24324323,
  "training_state": "progress",
  "training_time": 0
}
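
The response can also be consumed from a script instead of curl, for example to print a progress line or to prepare the per-epoch losses for plotting. The following is a minimal sketch rather than part of the API itself: the helper names (fetch_stats, loss_series, summarize) are hypothetical, and reading "cur_epoch", "total_epoch" and "losses" as the current epoch, the planned number of epochs and the loss per epoch is an assumption based on the field names in the sample response.

```python
import json
import urllib.request

# Assumption: the same URL as in the curl example.
STATS_URL = "http://localhost:1004/api/v1/stats"


def fetch_stats(url=STATS_URL, timeout=5.0):
    """GET the stats endpoint and decode the JSON body into a dict."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def loss_series(stats):
    """Return (epochs, losses) as two lists sorted by epoch number.

    The keys of "losses" are strings ("1", "2", ...), so they are
    converted to int before sorting; as strings, "10" would sort
    before "2".
    """
    pairs = sorted(
        (int(epoch), loss) for epoch, loss in stats.get("losses", {}).items()
    )
    epochs = [epoch for epoch, _ in pairs]
    losses = [loss for _, loss in pairs]
    return epochs, losses


def summarize(stats):
    """Build a one-line progress summary from a stats payload."""
    cur = stats["cur_epoch"]
    total = stats["total_epoch"]
    # Guard against a zero total so the summary never divides by zero.
    percent = 100.0 * cur / total if total else 0.0
    return f"{stats['state']}: epoch {cur}/{total} ({percent:.0f}%), mAP {stats['mAP']}"
```

With the sample response above, summarize(stats) returns "training: epoch 3/50 (6%), mAP 16.975", and loss_series(stats) returns the epochs [1, 2, 3] with their losses, ready to pass to a plotting library. Calling fetch_stats() in a loop with a short sleep gives a simple live monitor.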

To summarise: the model training is running successfully, and we can follow its progress in real time through the API.