Hints

Hint #1

Well, behind the scenes, we are using some "serverless" features for model serving, so maybe it would be faster with ZERO replicas of the model… What do you think?

Hint #2

Ha, the previous hint was indeed a trap. If you scale the served model’s replicas down to zero, it is the act of sending a request to the model that wakes it back up. Did you measure the time it takes to make a prediction when that request also triggers the model restart? And did you then measure how long a prediction takes once the model is fully started?
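
A quick way to check is to time both cases explicitly. Below is a minimal sketch, assuming the model is exposed over REST at a hypothetical route and accepts a hypothetical JSON payload; adapt the URL and payload to whatever your deployed model actually expects.

```python
import time
import requests

# Hypothetical endpoint and payload -- replace with your model's route and input format.
URL = "https://my-model-route.apps.example.com/v2/models/my-model/infer"
PAYLOAD = {
    "inputs": [
        {"name": "input-0", "shape": [1, 4], "datatype": "FP32", "data": [1.0, 2.0, 3.0, 4.0]}
    ]
}

def timed_prediction(label):
    """Send one prediction request and print how long it took."""
    start = time.perf_counter()
    response = requests.post(URL, json=PAYLOAD, timeout=120)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s (HTTP {response.status_code})")
    return elapsed

# First request after scaling to zero: includes the cold-start (model wake-up) time.
timed_prediction("cold start")

# Immediate second request, once the model pod is up: the "real" prediction latency.
timed_prediction("warm")
```

Comparing the two numbers makes it obvious how much of that first request is just waiting for the model to come back to life.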

Ok, so… if reducing the number of replicas from one down to zero made things worse, maybe doing the opposite would make things better? (Hint, hint!)
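
If you would rather script that change than click through the dashboard, one option is to patch the InferenceService with the Kubernetes Python client. This is only a sketch, assuming a KServe-based deployment and hypothetical resource names; the exact fields may differ in your environment.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (use load_incluster_config() inside a pod).
config.load_kube_config()
api = client.CustomObjectsApi()

# Hypothetical names -- replace with your own project and InferenceService.
NAMESPACE = "my-project"
NAME = "my-model"

# Keep at least two replicas warm so no request has to pay the cold-start cost.
patch = {"spec": {"predictor": {"minReplicas": 2, "maxReplicas": 2}}}

api.patch_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace=NAMESPACE,
    plural="inferenceservices",
    name=NAME,
    body=patch,
)
```

Once the extra replica is up, re-run your tests and see whether the added capacity actually translates into lower latency or higher throughput.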

Hint #3

Another approach could be to change the size of the container serving the model. However, is bigger necessarily better? Go back down to a single replica, and change the memory and CPU requests and limits before re-running some of the tests.

If performance improves, by how much? If performance is unchanged, over-provisioning resources only increases costs without improving throughput. Is that desirable?
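
If you want to make that change in code rather than through the UI, the same patching pattern as before works. This is a hedged sketch: it assumes the resources are set on the InferenceService predictor’s model section, whereas in your setup they might be configured on the ServingRuntime or through the model server size in the dashboard instead.

```python
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Hypothetical names and sizes -- adjust to your project, model, and what the cluster can schedule.
NAMESPACE = "my-project"
NAME = "my-model"

# JSON merge-style patch bumping the predictor's CPU and memory.
patch = {
    "spec": {
        "predictor": {
            "model": {
                "resources": {
                    "requests": {"cpu": "2", "memory": "4Gi"},
                    "limits": {"cpu": "4", "memory": "8Gi"},
                }
            }
        }
    }
}

api.patch_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace=NAMESPACE,
    plural="inferenceservices",
    name=NAME,
    body=patch,
)
```

After the resized pod is running, re-run the timing sketch from Hint #2 and compare the numbers: that answers the "by how much?" question with data instead of guesses.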

Hint #4

If we really can’t get the prediction latency below a certain threshold (say 0.5s) using CPU, then maybe we do need a GPU. But since we don’t have any in this cluster, it’s a moot point.

Hint #5

What other bottlenecks could be slowing us down, and how would you test these theories?

* Maybe the network is now too slow and requests can’t come in fast enough?
* Maybe using the route is adding network hops and slowing things down?
* Maybe the client itself is unable to deliver the amount of parallelism required. How could we change that? What tools might be better suited than raw Python coding? (See the sketch below for one possibility.)
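
For that last point, a dedicated load-testing tool such as Locust (still Python, but purpose-built for generating concurrent load) can drive far more parallelism than a hand-rolled request loop. The sketch below is one possible starting point, reusing the same hypothetical inference path and payload as the earlier examples.

```python
# locustfile.py -- run with: locust -f locustfile.py --host https://my-model-route.apps.example.com
from locust import HttpUser, task, between

# Hypothetical path and payload -- match them to your model's actual inference endpoint.
INFER_PATH = "/v2/models/my-model/infer"
PAYLOAD = {
    "inputs": [
        {"name": "input-0", "shape": [1, 4], "datatype": "FP32", "data": [1.0, 2.0, 3.0, 4.0]}
    ]
}

class ModelUser(HttpUser):
    # Each simulated user waits between 0.1 and 0.5 seconds between requests.
    wait_time = between(0.1, 0.5)

    @task
    def predict(self):
        self.client.post(INFER_PATH, json=PAYLOAD)
```

Locust’s web UI then lets you ramp up the number of simulated users and watch throughput and latency percentiles in real time, which makes it much easier to tell whether the bottleneck is the model or the client.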