Stop Training

Often a model stops improving after repeated passes over the same data. At that point, it is a good idea to stop the training to avoid overfitting.

While stopping the running training container directly is always an option, there are a few reasons why you shouldn't do that.

  • It terminates the running jobs instantly, so the application does not get a chance to update the information that is collected at the end of the training.

  • The latest data inside the shared project directory may get corrupted.

  • GPU resources may get blocked.

The better way is to send a stop signal to the container through the API. This way the container can finish its own shutdown work and stop gracefully.

This can be done by sending a POST request to the /api/v1/stop endpoint.

curl -X POST http://localhost:1004/api/v1/stop

Expected responses:

  • {"processing": "kill signal accepted"} and status code 202.

  • {"success": "flow stopped"} and status code 200.

Once you send the stop signal, the API returns a status code of 202. This indicates that the container has initiated the process of stopping the training.

You can check again after a few seconds with the same POST request; once the training has stopped, you will receive a status code of 200. After this, you can stop the container by issuing the docker stop command if the container hasn't stopped on its own.

docker stop tlt-ssd-resnet18
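
The send-and-check-again sequence described above can be scripted. The following is a minimal sketch rather than an official tool: it assumes bash and curl are available, and the names post_stop, wait_for_stop, STOP_URL, MAX_TRIES and POLL_SECONDS are hypothetical helpers and defaults introduced only for this illustration. It repeats the same POST request until the API answers with status code 200.

```shell
#!/usr/bin/env bash
# Sketch only: all names and default values below are illustrative assumptions,
# not settings documented by the application.
STOP_URL="${STOP_URL:-http://localhost:1004/api/v1/stop}"
MAX_TRIES="${MAX_TRIES:-30}"        # assumed upper bound on polling attempts
POLL_SECONDS="${POLL_SECONDS:-5}"   # assumed pause between attempts

# Send one POST request and print only the HTTP status code.
post_stop() {
  curl -s -o /dev/null -w '%{http_code}' -X POST "$STOP_URL"
}

# Repeat the POST until it returns 200.
# Returns 0 once the stop is confirmed, 1 if MAX_TRIES is exhausted.
wait_for_stop() {
  local try code
  for ((try = 1; try <= MAX_TRIES; try++)); do
    code="$(post_stop)"
    if [ "$code" = "200" ]; then
      echo "training stopped"
      return 0
    fi
    echo "attempt $try: status $code, not stopped yet"
    sleep "$POLL_SECONDS"
  done
  return 1
}
```

After sourcing the script, running wait_for_stop && docker stop tlt-ssd-resnet18 would issue docker stop only after the API has confirmed that the training has stopped. If wait_for_stop returns 1, the stop was not confirmed within the assumed number of attempts, and it is worth checking the container before stopping it by hand.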

Once the container is stopped, you can get your trained model files from the project directory, i.e., /home/user/project.