Hello Llama Stack!

Goal

In this module, you will launch the Llama Stack server using a model served by Ollama, and make your first LLM request using the llama-stack-client CLI.

Prerequisites

  • Ollama installed on your local machine (https://ollama.com/download)

  • Docker or Podman installed

  • Python 3.10+ (only for installing the CLI)

Step 1: Start the Ollama Server

First, check if Ollama is already running:

ollama ps

If Ollama is running, you’ll see output like this:

NAME                     ID       SIZE      PROCESSOR    UNTIL
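An empty list is fine at this point: it simply means no model is loaded yet. If Ollama is not running at all, start it in a separate terminal (or as a background service) before continuing:

ollama serve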

Step 2: Load the Model

Llama Stack does not load models into Ollama on demand, so you need to preload the model and keep it resident in memory.

Run the following command:

ollama run llama3.2:3b-instruct-fp16 --keepalive 60m

After a few seconds, you will be presented with an interactive prompt where you can chat with the model.

Enter the following to quit the chat:

/bye

To confirm the model is loaded and in memory, run:

ollama ps

You should see llama3.2:3b-instruct-fp16 listed.
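As an alternative to the interactive session above, you can preload the model without entering the chat by passing a one-off prompt on the command line; Ollama answers it, exits, and keeps the model in memory for the keepalive window (the prompt text here is arbitrary):

ollama run llama3.2:3b-instruct-fp16 --keepalive 60m "Say hello"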

Step 3: Start the Llama Stack Server

First, export the necessary environment variables:

export LLAMA_STACK_MODEL="meta-llama/Llama-3.2-3B-Instruct"
export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
export LLAMA_STACK_PORT=8321
export LLAMA_STACK_SERVER=http://localhost:$LLAMA_STACK_PORT
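As a quick sanity check, confirm the variables are set in the shell you will start the container from:

echo "$INFERENCE_MODEL $LLAMA_STACK_PORT $LLAMA_STACK_SERVER"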

Now run the Llama Stack container using Podman:

  • Linux

podman run -it --network host \
  llamastack/distribution-ollama:0.2.0 \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$LLAMA_STACK_MODEL \
  --env OLLAMA_URL=http://localhost:11434

  • macOS

podman run -it \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  llamastack/distribution-ollama:0.2.0 \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$LLAMA_STACK_MODEL \
  --env OLLAMA_URL=http://host.containers.internal:11434

You may be prompted to choose which registry to pull the image from; if so, select the image from docker.io.

The Llama Stack server will start and listen on port 8321.
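Once the container logs show the server is up, you can verify it from another terminal by querying its health endpoint. The /v1/health path below is the one exposed by recent Llama Stack releases; adjust it if your version differs:

curl $LLAMA_STACK_SERVER/v1/health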

Step 4: Hello World with the CLI

In a second terminal, create and activate a Python virtual environment.

Before creating the virtual environment, ensure you are using Python 3.10 or later. You can check your current Python version with:

python3 --version

If this version of Python is older than 3.10, you will need to use a different version. You can find the available versions by running:

ls /usr/bin/python*

For example, if you see python3.12 listed, run:

sudo update-alternatives --install /usr/bin/python3 python /usr/bin/python3.12 2

Then run the following to make it the active version:

sudo update-alternatives --config python

Choose the option corresponding to Python 3.12 (e.g., option 2).
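Re-run the version check to confirm the switch took effect before creating the virtual environment:

python3 --version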

You can now create the Python virtual environment:

python3 -m venv llama_env
source llama_env/bin/activate

Install the Llama Stack client CLI:

pip install llama-stack-client==0.2.2
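To confirm the client installed correctly (and to see which version you got), you can inspect the package; keeping the client on the same 0.2.x line as the server image helps avoid API mismatches:

pip show llama-stack-client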

Configure the CLI to point to your running server:

llama-stack-client configure --endpoint http://localhost:8321
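To confirm the CLI can reach the server, list the models the server has registered; the model you passed as INFERENCE_MODEL should appear:

llama-stack-client models list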

Now, send your first prompt using the CLI:

llama-stack-client \
  inference chat-completion \
  --message "hello, what model are you?" \
  --model-id "meta-llama/Llama-3.2-3B-Instruct"

Sample Output

ChatCompletionResponse(completion_message=CompletionMessage(
    content="Hello! I'm a Meta LLaMA 3.2 3B Instruct model. How can I assist you today?",
    role='assistant',
    stop_reason='end_of_turn',
    tool_calls=[]),
    logprobs=None,
    metrics=[
        Metric(metric='prompt_tokens', value=12.0, unit=None),
        Metric(metric='completion_tokens', value=24.0, unit=None),
        Metric(metric='total_tokens', value=36.0, unit=None)
    ]
)

This shows a typical structured response from the model via the CLI. You may see different content depending on model version and configuration.
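You can make the same request from Python with the llama-stack-client SDK installed earlier. The sketch below assumes the 0.2.x client API (LlamaStackClient and inference.chat_completion); adjust the names if your client version exposes them differently:

from llama_stack_client import LlamaStackClient

# Connect to the Llama Stack server started in Step 3
client = LlamaStackClient(base_url="http://localhost:8321")

# Send the same prompt used with the CLI
response = client.inference.chat_completion(
    model_id="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "hello, what model are you?"}],
)

# The reply text lives on the completion message
print(response.completion_message.content)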

Summary

You have:

  • Started Ollama and preloaded the Llama 3.2 3B model

  • Launched the Llama Stack server in a container

  • Sent a basic prompt using the Llama Stack CLI

You are now ready to build more advanced Llama Stack applications using either the CLI or Python SDK!