Once the trainer receives a stop signal, it starts a background thread that sends the kill signal continuously. This endpoint handles the task of terminating the training.
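
The retry behaviour of that background thread can be sketched roughly as follows. This is only an illustration, not the trainer's actual code: `start_kill_loop`, `send_kill`, `interval_s`, and `max_attempts` are hypothetical names, and `send_kill` stands in for whatever call issues the request to this endpoint and returns its HTTP status code.

```python
import threading
import time
from typing import Callable


def start_kill_loop(send_kill: Callable[[], int],
                    interval_s: float = 1.0,
                    max_attempts: int = 60) -> threading.Thread:
    """Start a daemon thread that calls send_kill() until it returns 200.

    send_kill is a hypothetical callable that issues one kill request and
    returns the HTTP status code of the response.
    """

    def _loop() -> None:
        for _ in range(max_attempts):
            status = send_kill()
            if status == 200:
                # 200: training has stopped, nothing left to do.
                return
            # 202 (or any other status): wait briefly, then try again.
            time.sleep(interval_s)

    thread = threading.Thread(target=_loop, daemon=True)
    thread.start()
    return thread
```

Passing the request function in as a parameter keeps the loop independent of any particular HTTP client, and the attempt limit (an assumption of this sketch) prevents the thread from running forever if the endpoint never reports success.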

Kill Training

This endpoint allows you to kill the training process.
200: OK
Training has stopped successfully.
{"success": "flow stopped"}
202: Accepted
Kill signal has been accepted.
{"processing": "kill signal accepted"}
  • A background thread, started by the /stop endpoint when it receives a stop request, keeps sending kill requests to this endpoint.

  • On each request, this endpoint sends a kill signal to the process group of the TLT training processes:

  • os.killpg(os.getpgid(tlt_train_pid), signal.SIGKILL)

  • This ensures that all processes spawned by the tlt-train command, which Metaflow executes, are stopped.

  • Finally, Metaflow updates the information that must be collected at the end of every training run, such as time elapsed, end time, and training state.

  • In future updates, this endpoint may be removed. The background thread started by the /stop endpoint would then communicate with the training processes directly and send them the kill signal itself.
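
The process-group kill described above can be sketched as a small handler. This is a minimal illustration under stated assumptions, not the service's implementation: `kill_training_group` is a hypothetical name, and mapping "process no longer exists" to the 200 response is an assumption about how the two responses are chosen. It also assumes a POSIX system and that tlt-train was launched in its own process group (for example with `start_new_session=True`), since otherwise `os.killpg` would also kill the calling process.

```python
import os
import signal
from typing import Dict, Tuple


def kill_training_group(tlt_train_pid: int) -> Tuple[int, Dict[str, str]]:
    """Send SIGKILL to the process group of tlt_train_pid.

    Returns a (status_code, body) pair shaped like the responses above.
    Assumption: a missing process means training has already stopped.
    """
    try:
        pgid = os.getpgid(tlt_train_pid)   # group that tlt-train belongs to
        os.killpg(pgid, signal.SIGKILL)    # signal every process in it
    except ProcessLookupError:
        # The process (and its group) is gone: report success.
        return 200, {"success": "flow stopped"}
    # The signal was delivered; the caller should ask again to confirm.
    return 202, {"processing": "kill signal accepted"}
```

Because the signal goes to the whole group rather than a single PID, child processes started by tlt-train are terminated as well. Repeated calls are harmless: once the group is gone, the function simply reports that the flow has stopped.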