Hints

Hint #1

Well, behind the scenes, we are using some "serverless" features for model serving, so maybe it would be faster with ZERO replicas of the model… What do you think?

Hint #2

Ha, the previous hint was indeed a trap. If you scale the served model’s replicas down to zero, it is the act of sending a request to the model that wakes it back up. Did you measure the time it takes to make a prediction when that request also triggers the model restart? And did you then measure how long a prediction takes once the model is fully started?
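
A quick way to check is to time both cases explicitly. Below is a minimal sketch, assuming the model is exposed over REST at a hypothetical route and accepts a hypothetical JSON payload; adapt the URL and payload to whatever your deployed model actually expects.

```python
import time
import requests

# Hypothetical endpoint and payload -- replace with your model's route and input format.
URL = "https://my-model-route.apps.example.com/v2/models/my-model/infer"
PAYLOAD = {
    "inputs": [
        {"name": "input-0", "shape": [1, 4], "datatype": "FP32", "data": [1.0, 2.0, 3.0, 4.0]}
    ]
}

def timed_prediction(label):
    """Send one prediction request and print how long it took."""
    start = time.perf_counter()
    response = requests.post(URL, json=PAYLOAD, timeout=120)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s (HTTP {response.status_code})")
    return elapsed

# First request after scaling to zero: includes the cold-start (model wake-up) time.
timed_prediction("cold start")

# Immediate second request, once the model pod is up: the "real" prediction latency.
timed_prediction("warm")
```

Comparing the two numbers makes it obvious how much of that first request is just waiting for the model to come back to life.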

Ok, so… if reducing the number of replicas from one down to zero made things worse, maybe doing the opposite would make things better? (Hint, hint!)
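
If you would rather script that change than click through the dashboard, one option is to patch the InferenceService with the Kubernetes Python client. This is only a sketch, assuming a KServe-based deployment and hypothetical resource names; the exact fields may differ in your environment.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (use load_incluster_config() inside a pod).
config.load_kube_config()
api = client.CustomObjectsApi()

# Hypothetical names -- replace with your own project and InferenceService.
NAMESPACE = "my-project"
NAME = "my-model"

# Keep at least two replicas warm so no request has to pay the cold-start cost.
patch = {"spec": {"predictor": {"minReplicas": 2, "maxReplicas": 2}}}

api.patch_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace=NAMESPACE,
    plural="inferenceservices",
    name=NAME,
    body=patch,
)
```

Once the extra replica is up, re-run your tests and see whether the added capacity actually translates into lower latency or higher throughput.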

Hint #3

Another approach could be to change the size of the container serving the model. However, is bigger necessarily better? Go back down to a single replica, and change the memory and CPU requests and limits before re-running some of the tests.

If performance improves, by how much? If performance is unchanged, over-provisioning resources only increases costs without improving throughput. Is that desirable?
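
If you want to make that change in code rather than through the UI, the same patching pattern as before works. This is a hedged sketch: it assumes the resources are set on the InferenceService predictor’s model section, whereas in your setup they might be configured on the ServingRuntime or through the model server size in the dashboard instead.

```python
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Hypothetical names and sizes -- adjust to your project, model, and what the cluster can schedule.
NAMESPACE = "my-project"
NAME = "my-model"

# JSON merge-style patch bumping the predictor's CPU and memory.
patch = {
    "spec": {
        "predictor": {
            "model": {
                "resources": {
                    "requests": {"cpu": "2", "memory": "4Gi"},
                    "limits": {"cpu": "4", "memory": "8Gi"},
                }
            }
        }
    }
}

api.patch_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace=NAMESPACE,
    plural="inferenceservices",
    name=NAME,
    body=patch,
)
```

After the resized pod is running, re-run the timing sketch from Hint #2 and compare the numbers: that answers the "by how much?" question with data instead of guesses.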

Hint #4

If we really can’t get the prediction latency below a certain threshold (say 0.5s) using CPU, then maybe we do need a GPU. But since we don’t have any in this cluster, it’s a moot point.

Hint #5

What other bottlenecks could be slowing us down, and how would you test these theories?

* Maybe the network is now too slow and requests can’t come in fast enough?
* Maybe using the route is adding network hops and slowing things down?
* Maybe the client itself is unable to deliver the amount of parallelism required. How could we change that? What tools might be better suited than raw Python coding? (See the sketch below for one possibility.)
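
For that last point, a dedicated load-testing tool such as Locust (still Python, but purpose-built for generating concurrent load) can drive far more parallelism than a hand-rolled request loop. The sketch below is one possible starting point, reusing the same hypothetical inference path and payload as the earlier examples.

```python
# locustfile.py -- run with: locust -f locustfile.py --host https://my-model-route.apps.example.com
from locust import HttpUser, task, between

# Hypothetical path and payload -- match them to your model's actual inference endpoint.
INFER_PATH = "/v2/models/my-model/infer"
PAYLOAD = {
    "inputs": [
        {"name": "input-0", "shape": [1, 4], "datatype": "FP32", "data": [1.0, 2.0, 3.0, 4.0]}
    ]
}

class ModelUser(HttpUser):
    # Each simulated user waits between 0.1 and 0.5 seconds between requests.
    wait_time = between(0.1, 0.5)

    @task
    def predict(self):
        self.client.post(INFER_PATH, json=PAYLOAD)
```

Locust’s web UI then lets you ramp up the number of simulated users and watch throughput and latency percentiles in real time, which makes it much easier to tell whether the bottleneck is the model or the client.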