Hints

Hint #1

Well, behind the scenes, we are using some "serverless" features for model serving, so maybe it would be faster with ZERO replicas of the model… What do you think?

Hint #2

Ha, the previous hint was indeed a trap. If you scale the replicas of the served model down to zero, it is the act of sending a request to the model that wakes it back up. Did you measure how long a prediction takes when it also triggers the model restart? And did you then measure how long a prediction takes once the model is fully started? (A small timing sketch follows these hints.) Ok, so… if reducing the number of replicas from 1 down to zero made things worse, maybe doing the opposite would make things better? (Hint hint!)

Hint #3

Another approach could be to change the size of the container serving the model. However, is bigger always necessarily better? Go back down to a single replica, then change the memory and CPU requests and limits before re-running some of the tests. If performance improves, by how much? If performance is unchanged, over-provisioning resources will increase costs without improving throughput. Is that desirable?

Hint #4

If we really can't get prediction latency below a certain threshold (say 0.5 s) using CPU, then maybe we do need a GPU. But since we don't have any in this cluster, it's a moot point.

Hint #5

What other bottlenecks could be slowing us down, and how would you test these theories?

* Maybe the network is now too slow and requests can't come in fast enough?
* Maybe using the route is adding network hops and slowing things down?
* Maybe the client itself is unable to deliver the amount of parallelism required. How could we change that? What tools might be better suited than raw Python coding? (See the parallel-client sketch after this list.)
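If you want to put numbers on the cold-start effect described in Hint #2, a minimal probe along these lines can help. It assumes the model is reachable over HTTP; the INFERENCE_URL and PAYLOAD values are placeholders to replace with your own endpoint and request body. The script times one request that may trigger the wake-up, then a few requests against the already-running model.

```python
# Minimal latency probe: time a "cold" request (which may trigger the model to
# scale back up from zero replicas), then a series of "warm" requests once it
# is running. INFERENCE_URL and PAYLOAD are placeholders for your own endpoint.
import time
import requests

INFERENCE_URL = "https://<your-model-route>/v2/models/<model-name>/infer"  # placeholder
PAYLOAD = {"inputs": []}  # placeholder request body for your model

def timed_request():
    start = time.perf_counter()
    response = requests.post(INFERENCE_URL, json=PAYLOAD, timeout=120)
    return time.perf_counter() - start, response.status_code

# First call: if the model was scaled to zero, this includes the wake-up time.
cold, status = timed_request()
print(f"cold request: {cold:.2f}s (HTTP {status})")

# Subsequent calls: the model is already running, so this is steady-state latency.
for i in range(5):
    warm, status = timed_request()
    print(f"warm request {i + 1}: {warm:.2f}s (HTTP {status})")
```

Comparing the first line of output with the later ones makes it obvious how much of the apparent slowness is just the restart, rather than the prediction itself.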
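For the last bullet of Hint #5, one quick way to test whether the client is the limiting factor is to send the same request from several threads at once and watch how throughput changes as the worker count grows. The sketch below uses Python's concurrent.futures for that; the URL and payload are again placeholders. Dedicated load-testing tools (for example Locust, or command-line tools such as hey) are better suited to sustained tests, but this is enough to check the theory.

```python
# Rough client-side parallelism check: fire TOTAL_REQUESTS identical requests
# using WORKERS threads and compare aggregate throughput with the
# single-threaded numbers. INFERENCE_URL and PAYLOAD are placeholders.
import time
import requests
from concurrent.futures import ThreadPoolExecutor

INFERENCE_URL = "https://<your-model-route>/v2/models/<model-name>/infer"  # placeholder
PAYLOAD = {"inputs": []}  # placeholder request body
TOTAL_REQUESTS = 50
WORKERS = 8  # try 1, 2, 4, 8... and watch how throughput changes

def one_request(_):
    start = time.perf_counter()
    requests.post(INFERENCE_URL, json=PAYLOAD, timeout=120)
    return time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    latencies = list(pool.map(one_request, range(TOTAL_REQUESTS)))
wall = time.perf_counter() - start

print(f"{TOTAL_REQUESTS} requests with {WORKERS} workers in {wall:.1f}s "
      f"({TOTAL_REQUESTS / wall:.1f} req/s), "
      f"mean latency {sum(latencies) / len(latencies):.2f}s")
```

If throughput stops improving as you add workers while per-request latency climbs, the bottleneck is probably on the serving side; if it keeps scaling, the original single-threaded client was the limit.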