Hello Llama Stack!
Goal
In this module, you will launch the Llama Stack server using a model served by Ollama, and make your first LLM request using the llama-stack-client CLI.
Prerequisites
- Ollama installed on your local machine (https://ollama.com/download)
- Docker or Podman installed
- Python 3.10+ (only for installing the CLI)
Step 1: Start the Ollama Server
First, check if Ollama is already running:
ollama ps
If Ollama is running, you’ll see output like this:
NAME ID SIZE PROCESSOR UNTIL
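If Ollama is not already running, you can start the server in a separate terminal (the desktop app, if installed, may already manage it for you):
ollama serve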
Step 2: Load the Model
Llama Stack does not dynamically load models from Ollama. You need to preload and keep the model in memory.
Run the following command:
ollama run llama3.2:3b-instruct-fp16 --keepalive 60m
After a few seconds, you should be presented with a prompt where you can chat with the model.
Enter the following to quit the chat:
/bye
To confirm the model is loaded and in memory, run:
ollama ps
You should see llama3.2:3b-instruct-fp16 listed.
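Ollama also exposes an HTTP API on port 11434, which is how Llama Stack will reach it. As an optional extra check, you can query the endpoint that backs the ollama ps command directly (route names may differ across Ollama versions):
curl http://localhost:11434/api/ps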
Step 3: Start the Llama Stack Server
First, export the necessary environment variables:
export LLAMA_STACK_MODEL="meta-llama/Llama-3.2-3B-Instruct"
export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
export LLAMA_STACK_PORT=8321
export LLAMA_STACK_SERVER=http://localhost:$LLAMA_STACK_PORT
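Before starting the container, you can confirm the variables are set:
echo $INFERENCE_MODEL $LLAMA_STACK_PORT $LLAMA_STACK_SERVER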
Now run the Llama Stack container using Podman. If your container host supports host networking (typically Linux), the container can reach Ollama directly on localhost:
podman run -it --network host \
  llamastack/distribution-ollama:0.2.0 \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env OLLAMA_URL=http://localhost:11434
If host networking is not available (for example, when running Podman on macOS or Windows), publish the port and point OLLAMA_URL at the host gateway alias instead:
podman run -it \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  llamastack/distribution-ollama:0.2.0 \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env OLLAMA_URL=http://host.containers.internal:11434
You may be prompted to select an image registry; if so, choose the image from docker.io.
The Llama Stack server will start and listen on port 8321.
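As a quick sanity check from another terminal, you can try the server’s health route with curl (the exact path may vary between Llama Stack versions; /v1/health is a reasonable first guess):
curl http://localhost:8321/v1/health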
Step 4: Hello World with the CLI
In a second terminal, create and activate a Python virtual environment.
Before creating the virtual environment, ensure you’re using Python 3.10 or later. You can check your current Python version with:
python3 --version
If this version of Python is older than 3.10, you will need to use a different version. You can find the available versions by running:
ls /usr/bin/python*
For example, if you see python3.12 listed, run:
sudo update-alternatives --install /usr/bin/python3 python /usr/bin/python3.12 2
Then run the following to make this version the active version:
sudo update-alternatives --config python
Choose the option for python3.12 (e.g. option 2).
You can now create the Python virtual environment:
python3 -m venv llama_env
source llama_env/bin/activate
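With the environment active, python points at the virtual environment’s interpreter. You can confirm which version it uses:
python --version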
Install the Llama Stack client CLI:
pip install llama-stack-client==0.2.2
Configure the CLI to point to your running server:
llama-stack-client configure --endpoint http://localhost:8321
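Optionally, verify that the client can reach the server and that your model is registered (subcommand names may differ slightly between client versions):
llama-stack-client models list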
Now, send your first prompt using the CLI:
llama-stack-client \
inference chat-completion \
--message "hello, what model are you?" \
--model-id "meta-llama/Llama-3.2-3B-Instruct"
Sample Output
ChatCompletionResponse(completion_message=CompletionMessage(
content="Hello! I'm a Meta LLaMA 3.2 3B Instruct model. How can I assist you today?",
role='assistant',
stop_reason='end_of_turn',
tool_calls=[]),
logprobs=None,
metrics=[
Metric(metric='prompt_tokens', value=12.0, unit=None),
Metric(metric='completion_tokens', value=24.0, unit=None),
Metric(metric='total_tokens', value=36.0, unit=None)
]
)
This shows a typical structured response from the model via the CLI. You may see different content depending on model version and configuration.