API endpoints used to control the training flow.

The training flow exposes three endpoints that can be triggered through the API:

Run Training

  • Running the training means sending a POST request to the trainer API, which generates all the required config files and directories, then executes and tracks the TLT CLI commands that start and monitor the training.

  • Everything runs in multiple processes governed by Metaflow.

  • The API responds with a 200 status code and the message {"success": "flow started"}, and tracks the IDs of the spawned processes.

  • If the API receives multiple run-training requests, it checks the registered PIDs; if those processes are still running, it responds with a 200 status code and the message {"warning": "flow running"}.

  • This ensures that a training already in progress is not disturbed and that no new training is started alongside it.
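
The duplicate-request guard described above can be sketched roughly as follows. This is a minimal illustration, not the trainer's actual code: TrainingRegistry, spawn, and is_alive are hypothetical names, and the real service delegates process handling to Metaflow. The liveness check assumes a POSIX system.

```python
import os


def pid_is_alive(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, it sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


class TrainingRegistry:
    """Hypothetical helper that remembers the PIDs of the running flow."""

    def __init__(self, spawn, is_alive=pid_is_alive):
        self._spawn = spawn        # assumed callable: starts the flow, returns PIDs
        self._is_alive = is_alive  # assumed callable: PID -> bool
        self._pids = []

    def run_training(self):
        # A registered process that is still alive means a flow is in progress,
        # so the request is answered without starting anything new.
        if any(self._is_alive(pid) for pid in self._pids):
            return 200, {"warning": "flow running"}
        self._pids = list(self._spawn())
        return 200, {"success": "flow started"}
```

Passing spawn and is_alive in as callables keeps the guard independent of how the processes are actually launched, which also makes it easy to exercise with stand-ins.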

Training Status

  • Other applications that use the API to start training can retrieve the training progress and various stats, either to show them in a UI or to take necessary actions.

  • The /stats endpoint allows getting the training stats.

  • It responds with a status code of 200 and JSON data containing all the training-specific information, such as the current epoch, training accuracy, and time elapsed.
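
A client that reads /stats might look like the sketch below. Several details are assumptions: the base URL is a placeholder, the request is assumed to be a GET, and the JSON keys "epoch", "accuracy", and "elapsed" are hypothetical, since the exact field names are not listed here.

```python
import json
from urllib.request import urlopen


def fetch_stats(base_url, timeout=5.0):
    """Request /stats and return the decoded JSON body (assumes a GET endpoint)."""
    with urlopen(f"{base_url}/stats", timeout=timeout) as resp:
        if resp.status != 200:
            raise RuntimeError(f"unexpected status {resp.status}")
        return json.load(resp)


def summarize_stats(stats):
    """Build a one-line progress summary from hypothetical field names."""
    fields = ("epoch", "accuracy", "elapsed")
    # Missing keys are shown as "?" rather than raising, because the real
    # payload may use different names.
    return " ".join(f"{name}={stats.get(name, '?')}" for name in fields)
```

A UI could call fetch_stats on a timer and pass the result to summarize_stats, for example summarize_stats(fetch_stats("http://localhost:8000")), where the host and port are placeholders.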

Stop Training

  • Stopping the training can be tricky because the training is started with a command-line tool, tlt-train.

  • Metaflow governs these processes and keeps track of their process IDs.

  • When a stop command is received, a background thread is started that repeatedly sends POST requests to the /kill endpoint until the training is terminated.

  • If the API receives multiple stop requests, it ignores them and responds with status code 202 and the message {"processing": "kill signal accepted"} as long as the background thread is running.

  • Finally, once the training has actually stopped and its registered PID no longer exists, the endpoint responds with status code 200 and the message {"success": "flow stopped"}.
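
The stop behaviour above can be approximated with the following sketch. StopController, send_kill, and is_running are hypothetical names: send_kill stands in for the POST request to /kill, and is_running stands in for the check on the registered PID. The sketch also assumes that the first stop request is answered with 202, which is not stated explicitly above.

```python
import threading
import time


class StopController:
    """Hypothetical helper that retries the kill request in the background."""

    def __init__(self, send_kill, is_running, interval=0.5):
        self._send_kill = send_kill    # assumed callable: one POST to /kill
        self._is_running = is_running  # assumed callable: True while the PID exists
        self._interval = interval      # seconds between kill attempts
        self._thread = None
        self._lock = threading.Lock()

    def _kill_loop(self):
        # Keep sending kill requests until the training process is gone.
        while self._is_running():
            self._send_kill()
            time.sleep(self._interval)

    def stop_training(self):
        with self._lock:
            # Repeated stop requests are ignored while the thread is working.
            if self._thread is not None and self._thread.is_alive():
                return 202, {"processing": "kill signal accepted"}
            if not self._is_running():
                return 200, {"success": "flow stopped"}
            self._thread = threading.Thread(target=self._kill_loop, daemon=True)
            self._thread.start()
            return 202, {"processing": "kill signal accepted"}

    def wait(self, timeout=None):
        """Block until the background thread finishes (useful in tests)."""
        if self._thread is not None:
            self._thread.join(timeout)
```

The lock ensures that two stop requests arriving at the same moment cannot both start a kill thread.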