Llama-stack CLI

Goal

This module introduces the llama-stack-client command-line interface (CLI), which lets you interact with a Llama Stack server without writing any code. You will learn how to list models, run inference requests, inspect server metadata, and list providers.

Prerequisites

  • A running Llama Stack server reachable from your machine (the examples below assume it is listening on http://localhost:8321 and uses the Ollama inference provider)

  • Python and pip available for installing the client

Step 1: Install the CLI

Install the CLI via pip:

pip install llama-stack-client==0.2.2

Step 2: Configure the CLI

Run the following to configure the endpoint for your Llama Stack server:

llama-stack-client configure

Example interaction:

> Enter the endpoint of the Llama Stack distribution server: http://localhost:8321

Done! You can now use the Llama Stack Client CLI with endpoint http://localhost:8321

Step 3: List Available Models

You can use the CLI to view models registered with the server:

llama-stack-client models list

Example output:

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ model_type   ┃ identifier                           ┃ provider_resource_id         ┃ metadata  ┃ provider_id ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ llm          │ meta-llama/Llama-3.2-3B-Instruct     │ llama3.2:3b-instruct-fp16    │           │ ollama      │
└──────────────┴──────────────────────────────────────┴──────────────────────────────┴───────────┴─────────────┘
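
If you later want the same information from a script, the llama-stack-client package also ships a Python SDK. The following is a minimal sketch, assuming the endpoint configured in Step 2:

# Sketch: list registered models with the Python SDK
# (assumes the server endpoint from Step 2).
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

for model in client.models.list():
    # Each entry carries the same fields shown in the table above.
    print(model.model_type, model.identifier, model.provider_id)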

Step 4: Run an Inference

Send a message to the model using this command:

llama-stack-client \
  inference chat-completion \
  --message "hello, what model are you?" \
  --model-id "meta-llama/Llama-3.2-3B-Instruct"

Example output (your output will differ; this is an LLM response, after all!):

ChatCompletionResponse(completion_message=CompletionMessage(
    content="Hello! I'm a Meta LLaMA 3.2 3B Instruct model. How can I assist you today?",
    role='assistant',
    stop_reason='end_of_turn',
    tool_calls=[]),
    logprobs=None,
    metrics=[
        Metric(metric='prompt_tokens', value=12.0, unit=None),
        Metric(metric='completion_tokens', value=24.0, unit=None),
        Metric(metric='total_tokens', value=36.0, unit=None)
    ]
)
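
For reference, the same request can be made from Python with the llama-stack-client SDK. A minimal sketch, reusing the model ID and endpoint from the steps above:

# Sketch: the chat-completion request from Step 4 via the Python SDK
# (assumes the endpoint from Step 2 and the model listed in Step 3).
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

response = client.inference.chat_completion(
    model_id="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "hello, what model are you?"}],
)
print(response.completion_message.content)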

Step 5: Inspect Server Info

You can inspect server metadata, such as the version the server is running:

llama-stack-client inspect version

Sample output:

VersionInfo(version='0.2.0')
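
The SDK exposes a matching inspect call; a small sketch, assuming the same endpoint:

# Sketch: read the server version with the Python SDK
# (assumes the endpoint from Step 2).
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

version_info = client.inspect.version()
print(version_info.version)  # e.g. "0.2.0", as in the sample output above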

Step 6: List Providers

View a list of providers registered in your Llama Stack environment:

llama-stack-client providers list

Sample output:

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ API          ┃ Provider ID            ┃ Provider Type                  ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ inference    │ ollama                 │ remote::ollama                 │
│ vector_io    │ faiss                  │ inline::faiss                  │
│ safety       │ llama-guard            │ inline::llama-guard            │
│ agents       │ meta-reference         │ inline::meta-reference         │
│ telemetry    │ meta-reference         │ inline::meta-reference         │
│ eval         │ meta-reference         │ inline::meta-reference         │
│ datasetio    │ huggingface            │ remote::huggingface            │
│ datasetio    │ localfs                │ inline::localfs                │
│ scoring      │ basic                  │ inline::basic                  │
│ scoring      │ llm-as-judge           │ inline::llm-as-judge           │
│ scoring      │ braintrust             │ inline::braintrust             │
│ tool_runtime │ brave-search           │ remote::brave-search           │
│ tool_runtime │ tavily-search          │ remote::tavily-search          │
│ tool_runtime │ code-interpreter       │ inline::code-interpreter       │
│ tool_runtime │ rag-runtime            │ inline::rag-runtime            │
│ tool_runtime │ model-context-protocol │ remote::model-context-protocol │
│ tool_runtime │ wolfram-alpha          │ remote::wolfram-alpha          │
└──────────────┴────────────────────────┴────────────────────────────────┘
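
The provider list is also available from Python; a sketch using the same SDK client (assuming the endpoint from Step 2):

# Sketch: list providers with the Python SDK
# (assumes the endpoint from Step 2).
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

for provider in client.providers.list():
    # Fields mirror the API / Provider ID / Provider Type columns above.
    print(provider.api, provider.provider_id, provider.provider_type)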

Summary

In this module, you:

  • Installed and configured the llama-stack-client CLI

  • Listed available models and providers

  • Ran your first LLM inference with a single command

  • Inspected server metadata using built-in tools

Next, try the Llama-stack Playground for an interactive Jupyter-based experience.