Automating workflows with data science pipelines

In previous sections of this workshop, you used a notebook to train and save your model. Optionally, you can automate these tasks by using Red Hat OpenShift AI pipelines. Pipelines offer a way to automate the execution of multiple notebooks and Python code. By using pipelines, you can execute long training jobs or retrain your models on a schedule without having to manually run them in a notebook.

In this section, you create a simple pipeline by using the GUI pipeline editor. The pipeline uses the notebook that you used in previous sections to train a model and then save it to S3 storage.

Your completed pipeline should look like the one in the 6 Train Save.pipeline file.

To explore the pipeline editor, complete the steps in the following procedure to create your own pipeline. Alternatively, you can skip the following procedure and instead run the 6 Train Save.pipeline file.

Prerequisites

You configured a pipeline server as described in Enabling data science pipelines.
If you configured the pipeline server after you created your workbench, you stopped and then started your workbench.

Create a pipeline

Open your workbench’s JupyterLab environment. If the launcher is not visible, click + to open it.
Click Pipeline Editor.

You’ve created a blank pipeline.
Set the default runtime image for when you run your notebook or Python code.
1. In the pipeline editor, click Open Panel.
2. Select the Pipeline Properties tab.
3. In the Pipeline Properties panel, scroll down to Generic Node Defaults and Runtime Image. Set the value to Tensorflow with Cuda and Python 3.11 (UBI 9).
Select File → Save Pipeline.

Add nodes to your pipeline

Add some steps, or nodes, to your pipeline. Your two nodes will use the 1_experiment_train.ipynb and 2_save_model.ipynb notebooks.

From the file-browser panel, drag the 1_experiment_train.ipynb and 2_save_model.ipynb notebooks onto the pipeline canvas.
Click the output port of 1_experiment_train.ipynb and drag a connecting line to the input port of 2_save_model.ipynb.
Save the pipeline.

Specify the training file as a dependency

Set node properties to specify the training file as a dependency.

If you don’t set this file dependency, the file is not included in the node when it runs and the training job fails.

Click the 1_experiment_train.ipynb node.
In the Properties panel, click the Node Properties tab.
Scroll down to the File Dependencies section and then click Add.
Set the value to data/*.csv, which matches the files that contain the data to train your model.
Select the Include Subdirectories option.
Save the pipeline.
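
Before you run the pipeline, it can help to check which files the data/*.csv pattern picks up. The following Python sketch approximates that check with the standard library's glob matching. It is not the pipeline editor's own matching logic, and the preview_matches function name is hypothetical. It assumes that the pattern is resolved relative to the directory that contains the notebook.

```python
from pathlib import Path


def preview_matches(notebook_dir, pattern="data/*.csv"):
    """List files under notebook_dir that match a dependency pattern.

    Assumption: the pattern is resolved relative to the notebook's directory.
    This only approximates the editor's behavior by using Python glob rules.
    """
    base = Path(notebook_dir)
    return sorted(
        path.relative_to(base).as_posix()
        for path in base.glob(pattern)
        if path.is_file()
    )


# Example: run from the directory that holds 1_experiment_train.ipynb.
for name in preview_matches("."):
    print(name)
```

If the list is empty, the training node would likely start without its data, which is the failure described earlier in this section.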

Create and store the ONNX-formatted output file

In node 1, the notebook creates the models/fraud/1/model.onnx file. In node 2, the notebook uploads that file to the S3 storage bucket. You must set the models/fraud/1/model.onnx file as the output file for both nodes.

Select node 1.
Select the Node Properties tab.
Scroll down to the Output Files section, and then click Add.
Set the value to models/fraud/1/model.onnx.
Repeat steps 2-4 for node 2.
Save the pipeline.
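
The path that you declare under Output Files must match the path that the notebook actually writes. As a rough sanity check, assuming that you have already run the training notebook manually in your workbench, the following Python sketch reports any declared output that is missing on disk. The missing_outputs helper and the DECLARED_OUTPUTS list are hypothetical names used only for illustration.

```python
from pathlib import Path

# The output file declared for both nodes in this section.
DECLARED_OUTPUTS = ["models/fraud/1/model.onnx"]


def missing_outputs(work_dir, declared=DECLARED_OUTPUTS):
    """Return the declared output paths that do not exist under work_dir."""
    base = Path(work_dir)
    return [rel for rel in declared if not (base / rel).is_file()]


# Example: check the current directory after running the training notebook.
for rel in missing_outputs("."):
    print(f"Declared output not found: {rel}")
```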

Configure the connection to the S3 storage bucket

In node 2, the notebook uploads the model to the S3 storage bucket.

You must set the S3 storage bucket keys by using the secret created by the My Storage connection that you set up in the Storing data with connections section of this workshop.

You can use this secret in your pipeline nodes without having to save the information in your pipeline code. This is important, for example, if you want to save your pipelines to source control without any secret keys.

The secret is named aws-connection-my-storage.

If you named your connection something other than My Storage, you can obtain the secret name in the OpenShift AI dashboard by hovering over the help (?) icon in the Connections tab.

The aws-connection-my-storage secret includes the following fields:

AWS_ACCESS_KEY_ID
AWS_DEFAULT_REGION
AWS_S3_BUCKET
AWS_S3_ENDPOINT
AWS_SECRET_ACCESS_KEY

You must set the secret name and key for each of these fields.

Procedure

Remove any pre-filled environment variables.
1. Select node 2, and then select the Node Properties tab.
  
  Under Additional Properties, note that some environment variables have been pre-filled. The pipeline editor inferred that you need them from the notebook code.
  
  Because you don’t want to save these values in your pipeline, remove all of these environment variables.
2. Click Remove for each of the pre-filled environment variables.
Add the S3 bucket and keys by using the Kubernetes secret.
1. Under Kubernetes Secrets, click Add.
2. Enter the following values and then click Add.
  - Environment Variable: AWS_ACCESS_KEY_ID
    - Secret Name: aws-connection-my-storage
    - Secret Key: AWS_ACCESS_KEY_ID
Repeat Step 2 for each of the following Kubernetes secrets:
- Environment Variable: AWS_SECRET_ACCESS_KEY
  - Secret Name: aws-connection-my-storage
  - Secret Key: AWS_SECRET_ACCESS_KEY
- Environment Variable: AWS_S3_ENDPOINT
  - Secret Name: aws-connection-my-storage
  - Secret Key: AWS_S3_ENDPOINT
- Environment Variable: AWS_DEFAULT_REGION
  - Secret Name: aws-connection-my-storage
  - Secret Key: AWS_DEFAULT_REGION
- Environment Variable: AWS_S3_BUCKET
  - Secret Name: aws-connection-my-storage
  - Secret Key: AWS_S3_BUCKET
Select File → Save Pipeline As to save and rename the pipeline. For example, rename it to My Train Save.pipeline.
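
With the Kubernetes secrets mapped as described above, code in node 2 can read the connection details from environment variables instead of from values stored in the pipeline. The following Python sketch shows one way a notebook could confirm that all five variables are present before it attempts an upload. The missing_s3_settings function is a hypothetical helper and is not part of the workshop notebooks; the variable names are the secret fields listed in this section.

```python
import os

# Fields of the aws-connection-my-storage secret, exposed as environment variables.
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_S3_ENDPOINT",
    "AWS_DEFAULT_REGION",
    "AWS_S3_BUCKET",
)


def missing_s3_settings(environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example: report missing names without ever printing secret values.
missing = missing_s3_settings()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All S3 connection variables are set.")
```

The check prints only variable names, so no secret value ends up in the notebook output or the pipeline logs.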

Run the pipeline

Upload the pipeline to your cluster and run it. You can do so directly from the pipeline editor. You can use your own newly created pipeline or the pipeline in the provided 6 Train Save.pipeline file.

Procedure

Click the play button in the toolbar of the pipeline editor.
Enter a name for your pipeline.
Verify that Runtime Configuration is set to Data Science Pipeline.

Click OK.

If you see an error message stating that "no runtime configuration for Data Science Pipeline is defined", you might have created your workbench before the pipeline server was available.

To address this situation, you must verify that you configured the pipeline server and then restart the workbench.

Follow these steps in the OpenShift AI dashboard:

Check the status of the pipeline server:
1. In your Fraud Detection project, click the Pipelines tab.
  - If you see the Configure pipeline server option, follow the steps in Enabling data science pipelines.
  - If you see the Import a pipeline option, the pipeline server is configured. Continue to the next step.
Restart your Fraud Detection workbench:
1. Click the Workbenches tab.
2. Click Stop and then click Stop workbench.
3. After the workbench status is Stopped, click Start.
4. Wait until the workbench status is Running.
Return to your workbench’s JupyterLab environment and run the pipeline.

In the OpenShift AI dashboard, open your data science project and expand the newly created pipeline.
Click View runs.
Click your run and then view the pipeline run in progress.

The result should be a models/fraud/1/model.onnx file in your S3 bucket, which you can serve just as you did manually in the Preparing a model for deployment section.

Next step

(Optional) Running a data science pipeline generated from Python code