Distributing training jobs with the Training Operator

In previous sections of this workshop, you trained the fraud detection model directly in a notebook and then in a pipeline. In this section, you learn how to train the model by using the Training Operator. The Training Operator is a Kubernetes-native tool for scalable, distributed training of machine learning (ML) models built with various ML frameworks, such as PyTorch.

You can use the Training Operator to distribute the training of a machine learning model across many hardware resources. Distributed training is not necessary for a simple model, but applying it to the example fraud detection model is a good way to learn the workflow before you scale to more complex models that require more compute power, such as many GPUs across many machines.

In your notebook environment, open the 9_distributed_training_kfto.ipynb file and follow the instructions directly in the notebook. The instructions guide you through setting up authentication, initializing the Training Operator client, and submitting a PyTorchJob.
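For orientation, the following is a minimal sketch of what those three steps can look like with the Kubeflow Training SDK. The API server URL, token, job name, and resource settings are illustrative placeholders, not values from this workshop; the notebook contains the exact code to run.

    from kubernetes import client
    from kubeflow.training import TrainingClient

    # Placeholders: replace with your cluster's API server URL and bearer token.
    api_server = "https://api.cluster.example.com:6443"
    token = "<your-bearer-token>"

    # Step 1: authenticate against the cluster API server with the token.
    configuration = client.Configuration()
    configuration.host = api_server
    configuration.api_key = {"authorization": f"Bearer {token}"}

    # Step 2: initialize the Training Operator client with that configuration.
    training_client = TrainingClient(client_configuration=configuration)

    def train_func():
        # Placeholder training function; each PyTorchJob worker runs this code.
        ...

    # Step 3: submit a PyTorchJob that runs train_func on two CPU workers.
    training_client.create_job(
        job_kind="PyTorchJob",
        name="fraud-detection-training",
        train_func=train_func,
        num_workers=2,
        resources_per_worker={"cpu": "1", "memory": "4Gi"},
    )

    # Stream the logs of the running job.
    training_client.get_job_logs(name="fraud-detection-training", follow=True)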

If you want to view the Python code for this section, you can find it in the kfto-scripts/train_pytorch_cpu.py file.
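The sketch below is not the actual contents of train_pytorch_cpu.py; it only illustrates the general shape of a distributed CPU training script, assuming PyTorch DistributedDataParallel with the gloo (CPU) backend and random stand-in data. The Training Operator injects RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT into each worker pod, so init_process_group can read them from the environment.

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # "gloo" is the CPU backend; rank and world size come from the
        # environment variables that the Training Operator sets on each pod.
        dist.init_process_group(backend="gloo")
        rank = dist.get_rank()

        model = torch.nn.Linear(10, 1)  # stand-in for the fraud detection model
        ddp_model = DDP(model)

        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
        loss_fn = torch.nn.MSELoss()

        for step in range(10):
            inputs = torch.randn(32, 10)  # stand-in for real training batches
            targets = torch.randn(32, 1)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(inputs), targets)
            loss.backward()  # DDP averages gradients across all workers
            optimizer.step()

        if rank == 0:
            print(f"final loss: {loss.item():.4f}")
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()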

[Image: Jupyter Notebook]

For more information about PyTorchJob training with the Training Operator, see the Training Operator PyTorchJob guide.