Llama-stack CLI
Goal
This module introduces the llama-stack-client command-line interface (CLI), which allows you to interact with the Llama Stack server without writing any code. You will learn how to list models, run inference, inspect the server, and more.
Prerequisites
- Llama Stack server running (see: Llama-stack Helloworld)
- Python 3.10+ (to install the CLI tool)
- Python virtual environment created (see: Llama-stack Helloworld)
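Step 1: Install the CLI
The CLI ships with the llama-stack-client Python package. Assuming you are working inside the virtual environment from the prerequisites, a typical install looks like this:
pip install llama-stack-client
You can verify the installation with llama-stack-client --help.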
Step 2: Configure the CLI
Run the following to configure the endpoint for your Llama Stack server:
llama-stack-client configure
Example interaction:
> Enter the endpoint of the Llama Stack distribution server: http://localhost:8321
Done! You can now use the Llama Stack Client CLI with endpoint http://localhost:8321
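If you prefer to skip the interactive prompt, newer releases of the CLI also accept the endpoint as a flag (run llama-stack-client configure --help to confirm your version supports it):
llama-stack-client configure --endpoint http://localhost:8321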
Step 3: List Available Models
You can use the CLI to view models registered with the server:
llama-stack-client models list
Example output:
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ model_type ┃ identifier                       ┃ provider_resource_id      ┃ metadata ┃ provider_id ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ llm        │ meta-llama/Llama-3.2-3B-Instruct │ llama3.2:3b-instruct-fp16 │          │ ollama      │
└────────────┴──────────────────────────────────┴───────────────────────────┴──────────┴─────────────┘
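The CLI is a thin wrapper around the server's REST API, so you can retrieve the same information over plain HTTP. Assuming the default /v1/models route, a request like this returns the registered models as JSON:
curl -s http://localhost:8321/v1/models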
Step 4: Run an Inference
Send a message to the model using this command:
llama-stack-client \
inference chat-completion \
--message "hello, what model are you?" \
--model-id "meta-llama/Llama-3.2-3B-Instruct"
Example output (yours will differ; this is an LLM response, after all!):
ChatCompletionResponse(
    completion_message=CompletionMessage(
        content="Hello! I'm a Meta LLaMA 3.2 3B Instruct model. How can I assist you today?",
        role='assistant',
        stop_reason='end_of_turn',
        tool_calls=[]
    ),
    logprobs=None,
    metrics=[
        Metric(metric='prompt_tokens', value=12.0, unit=None),
        Metric(metric='completion_tokens', value=24.0, unit=None),
        Metric(metric='total_tokens', value=36.0, unit=None)
    ]
)
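Under the hood, the CLI posts this request to the server's chat-completion endpoint. Assuming the default /v1/inference/chat-completion route and request shape, an equivalent raw HTTP call would look roughly like this:
curl -s http://localhost:8321/v1/inference/chat-completion \
  -H "Content-Type: application/json" \
  -d '{
        "model_id": "meta-llama/Llama-3.2-3B-Instruct",
        "messages": [{"role": "user", "content": "hello, what model are you?"}]
      }'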
Step 5: Inspect Server Info
You can inspect server metadata, such as the version the server is running:
llama-stack-client inspect version
Sample output:
VersionInfo(version='0.2.0')
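The same information is available over HTTP. Assuming the default inspect routes, these requests give you a quick liveness check and the server version:
curl -s http://localhost:8321/v1/health
curl -s http://localhost:8321/v1/version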
Step 6: List Providers
View a list of providers registered in your Llama Stack environment:
llama-stack-client providers list
Sample output:
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ API          ┃ Provider ID            ┃ Provider Type                  ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ inference    │ ollama                 │ remote::ollama                 │
│ vector_io    │ faiss                  │ inline::faiss                  │
│ safety       │ llama-guard            │ inline::llama-guard            │
│ agents       │ meta-reference         │ inline::meta-reference         │
│ telemetry    │ meta-reference         │ inline::meta-reference         │
│ eval         │ meta-reference         │ inline::meta-reference         │
│ datasetio    │ huggingface            │ remote::huggingface            │
│ datasetio    │ localfs                │ inline::localfs                │
│ scoring      │ basic                  │ inline::basic                  │
│ scoring      │ llm-as-judge           │ inline::llm-as-judge           │
│ scoring      │ braintrust             │ inline::braintrust             │
│ tool_runtime │ brave-search           │ remote::brave-search           │
│ tool_runtime │ tavily-search          │ remote::tavily-search          │
│ tool_runtime │ code-interpreter       │ inline::code-interpreter       │
│ tool_runtime │ rag-runtime            │ inline::rag-runtime            │
│ tool_runtime │ model-context-protocol │ remote::model-context-protocol │
│ tool_runtime │ wolfram-alpha          │ remote::wolfram-alpha          │
└──────────────┴────────────────────────┴────────────────────────────────┘
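Because the provider table is fairly long, ordinary shell filtering can help you focus on one API. For example, to show only the tool_runtime providers:
llama-stack-client providers list | grep tool_runtime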
Summary
In this module, you:
- Installed and configured the llama-stack-client CLI
- Listed available models and providers
- Ran your first LLM inference with a single command
- Inspected server metadata using built-in tools
Next, try using Llama-stack Playground for an interactive Jupyter-based experience.