Distributing training jobs with the Training Operator
The Training Operator is a tool for scalable distributed training of machine learning (ML) models created with various ML frameworks, such as PyTorch.
You can use the Training Operator to distribute the training of a machine learning model across many hardware resources.
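
To make the idea of distributing training concrete, the following is a minimal sketch of how one worker might select its share of a dataset. It assumes each worker process can read its rank and the total number of workers from the RANK and WORLD_SIZE environment variables, which PyTorchJob pods typically receive. The helper names worker_identity and shard_indices are hypothetical and are not taken from the tutorial notebook or scripts.

```python
import os


def worker_identity(env=None):
    """Read this worker's rank and the total worker count.

    Assumes the RANK and WORLD_SIZE environment variables are set;
    falls back to a single-worker setup when they are missing.
    """
    env = os.environ if env is None else env
    rank = int(env.get("RANK", "0"))
    world_size = int(env.get("WORLD_SIZE", "1"))
    return rank, world_size


def shard_indices(num_samples, rank, world_size):
    """Return the sample indices this worker processes (strided split)."""
    return list(range(rank, num_samples, world_size))


rank, world_size = worker_identity()
my_indices = shard_indices(10, rank, world_size)
print(f"worker {rank} of {world_size} handles samples {my_indices}")
```

Each worker trains on a disjoint slice, and together the slices cover the whole dataset. PyTorch's DistributedSampler applies the same strided idea, with shuffling and padding added on top.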
In your notebook environment, open the 9_distributed_training_kfto.ipynb file and follow the instructions directly in the notebook. The instructions guide you through setting up authentication, initializing the Training Operator client, and submitting a PyTorchJob.
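
The notebook submits the job through the Training Operator client, so you do not write a manifest by hand. As a rough illustration of the resource that such a submission represents, the sketch below builds a PyTorchJob manifest as a plain Python dictionary. The build_pytorchjob helper is hypothetical, and the job name, namespace, image, and command are placeholder values, not values from the notebook.

```python
import json


def build_pytorchjob(name, namespace, image, command, num_workers):
    """Build a PyTorchJob manifest with one master and num_workers workers."""

    def replica(replicas):
        # Master and workers share the same pod template in this sketch.
        return {
            "replicas": replicas,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [
                        {
                            # PyTorchJob expects the training container
                            # to be named "pytorch".
                            "name": "pytorch",
                            "image": image,
                            "command": list(command),
                        }
                    ]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),
                "Worker": replica(num_workers),
            }
        },
    }


# Placeholder values for illustration only.
manifest = build_pytorchjob(
    name="example-pytorchjob",
    namespace="example-namespace",
    image="example.com/training-image:latest",
    command=["python", "train.py"],
    num_workers=2,
)
print(json.dumps(manifest, indent=2))
```

The master and worker replica counts are the values you would adjust to scale the job across more pods.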
Optionally, you can view the complete Python code in the kfto-scripts/train_pytorch_cpu.py file.

For more information about PyTorchJob training with the Training Operator, see the Training Operator PyTorchJob guide.