Metadata-Version: 2.4
Name: run-gauntlet
Version: 0.2.4
Summary: Multi-provider harness for testing whether AI agents can use product docs and tools.
Author: Gauntlet Contributors
Keywords: ai,agents,evaluation,cli,llms
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: httpx<1.0,>=0.27
Requires-Dist: python-dotenv<2.0,>=1.0
Requires-Dist: requests<3.0,>=2.31
Requires-Dist: beautifulsoup4<5.0,>=4.12
Requires-Dist: playwright<2.0,>=1.44
Requires-Dist: rank-bm25<1.0,>=0.2.2
Provides-Extra: server
Requires-Dist: fastapi<1.0,>=0.115; extra == "server"
Requires-Dist: uvicorn[standard]<1.0,>=0.30; extra == "server"
Requires-Dist: psycopg[binary,pool]<4.0,>=3.2; extra == "server"
Requires-Dist: numpy<3.0,>=1.26; extra == "server"
Provides-Extra: dev
Requires-Dist: build<2.0,>=1.2; extra == "dev"
Requires-Dist: pytest<9.0,>=8.0; extra == "dev"
Requires-Dist: ruff<1.0,>=0.6; extra == "dev"
Requires-Dist: twine<7.0,>=5.0; extra == "dev"

# Gauntlet

Gauntlet is a harness for answering a practical question:

**Can real external agents learn and use your product from docs, specs, and tools alone and, if not, exactly where does adoption break?**

This project does not try to simulate an idealized agent in a perfect environment. It tries to simulate the messy reality:

- different model providers
- different behavior styles
- ambiguous docs
- brittle tool contracts
- recovery after failure
- noisy execution traces

The result is a pipeline that can:

1. run an agent against a product task using only the provided docs and tools
2. record the execution automatically
3. judge the run afterward
4. produce a report showing what happened, where it drifted, and what to fix

## What Gauntlet Is Actually Doing

Gauntlet is organized as three layers.

### Layer 1: Product Context And Execution Primitives

Layer 1 is the substrate.

It knows how to:

- load `llms.txt` and `llms-full.txt`
- shape docs into a manifest and retrievable chunks
- run generic execution tools like:
  - HTTP
  - CLI
  - Python code
  - Python SDK invocation

This is where the product-facing context lives.
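As a rough illustration of what "shaping docs into retrievable chunks" can look like, the sketch below splits an `llms.txt`-style markdown document on headings and ranks chunks by keyword overlap with a query. The names `Chunk`, `chunk_by_heading`, and `retrieve` are hypothetical and are not Gauntlet's Layer 1 API. The package lists `rank-bm25` as a dependency, so the actual ranking is likely stronger than this token-overlap stand-in.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    heading: str
    text: str


def chunk_by_heading(markdown: str) -> list[Chunk]:
    """Split a markdown document into one chunk per heading (hypothetical helper)."""
    chunks: list[Chunk] = []
    heading, lines = "(preamble)", []
    for line in markdown.splitlines():
        if line.startswith("#"):
            # Close the previous chunk only if it has non-blank content.
            if any(ln.strip() for ln in lines):
                chunks.append(Chunk(heading, "\n".join(lines).strip()))
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    if any(ln.strip() for ln in lines):
        chunks.append(Chunk(heading, "\n".join(lines).strip()))
    return chunks


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(chunks: list[Chunk], query: str, k: int = 2) -> list[Chunk]:
    """Return up to k chunks that share the most tokens with the query."""
    q = tokenize(query)
    scored = [(len(q & tokenize(c.heading + " " + c.text)), c) for c in chunks]
    scored = [(score, c) for score, c in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


DOC = """# Example Product
Short summary.

## Authentication
Send the API key in the Authorization header.

## Resources
Create a resource with POST /resources.
"""

chunks = chunk_by_heading(DOC)
top = retrieve(chunks, "how do I create a resource")
print([c.heading for c in top])
```

Splitting on headings keeps each chunk aligned with a unit the docs author already chose, which makes later citations easier to read.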

### Layer 2: Agent Execution

Layer 2 is the agent loop.

It takes:

- a task
- a provider/model
- an optional persona
- docs sources

Then it:

- prompts the model
- lets it call tools
- captures step-by-step execution
- returns a final answer or failure
- writes a structured Layer 2 run record

This is the “can the agent do it?” layer.
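The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `run_agent`, `RunRecord`, `Step`, the action dictionary shape, and the scripted model are all hypothetical, and the real Layer 2 runner and record schema may differ.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Step:
    tool: str
    args: dict
    result: str


@dataclass
class RunRecord:
    task: str
    steps: list[Step] = field(default_factory=list)
    final_answer: Optional[str] = None
    failure: Optional[str] = None


def run_agent(
    task: str,
    model: Callable[[str, list[Step]], dict],
    tools: dict[str, Callable[..., str]],
    max_steps: int = 8,
) -> RunRecord:
    """Prompt the model, execute its tool calls, and record every step."""
    record = RunRecord(task=task)
    for _ in range(max_steps):
        action = model(task, record.steps)
        if action["type"] == "final":
            record.final_answer = action["text"]
            return record
        tool = tools.get(action["tool"])
        if tool is None:
            result = f"error: unknown tool {action['tool']}"
        else:
            result = tool(**action["args"])
        record.steps.append(Step(action["tool"], action["args"], result))
    record.failure = "max_steps reached"
    return record


def scripted_model(task: str, steps: list[Step]) -> dict:
    """Stand-in for a provider call: one tool call, then a final answer."""
    if not steps:
        return {"type": "tool", "tool": "http",
                "args": {"url": "https://example.com/health"}}
    return {"type": "final", "text": f"Health check returned: {steps[-1].result}"}


def fake_http(url: str) -> str:
    return "200 OK"  # no network access in this sketch


record = run_agent("Check service health", scripted_model, {"http": fake_http})
print(record.final_answer)
```

Recording the tool result on every iteration, including errors, is what lets a later judge reconstruct the path the agent actually took.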

### Layer 3: Judgment

Layer 3 is the post-run judge.

It consumes Layer 2 run records and produces:

- a normalized view of the run
- extracted evidence
- retrieved docs context
- batch-level cross-persona context
- a judged result
- a single HTML + JSON report

This is the “what happened, why, and how do we improve adoption?” layer.

## Why This Exists

Most product teams ask some version of:

- “Will agents be able to use our product?”
- “Are our docs good enough for AI agents?”
- “If an agent fails, is it our product, our docs, or just the model?”

Gauntlet is built to make those failures attributable.

The goal is not merely pass/fail.

The goal is to separate:

- product problems
- docs problems
- runtime/tooling problems
- harness problems
- model capability limits
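One way to picture this separation is as a fixed set of attribution labels that each finding is tagged with, then counted per batch. The `Attribution` enum and `breakdown` helper below are hypothetical names used only for illustration; the sample findings are made up.

```python
from collections import Counter
from enum import Enum


class Attribution(Enum):
    PRODUCT = "product"
    DOCS = "docs"
    RUNTIME = "runtime/tooling"
    HARNESS = "harness"
    MODEL = "model capability"


def breakdown(findings: list[tuple[str, Attribution]]) -> Counter:
    """Count findings per attribution label across personas."""
    return Counter(attribution for _, attribution in findings)


# Illustrative (persona, attribution) pairs, not real results.
findings = [
    ("methodical", Attribution.DOCS),
    ("impatient", Attribution.DOCS),
    ("recovery", Attribution.PRODUCT),
]
counts = breakdown(findings)
print(counts.most_common(1))
```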

## Core Ideas

### Personas Matter

A single clean run is not enough.

Gauntlet can run multiple built-in personas that stress different failure modes:

- `methodical`: follows docs literally
- `impatient`: optimizes for speed
- `chaotic`: reorders and perturbs flows
- `confused`: exposes clarity gaps
- `long-running`: stresses session continuity
- `adversarial`: pushes boundaries and validation
- `parallel`: stresses concurrency and state isolation
- `recovery`: intentionally fails, then tries to recover

These are not cosmetic personas. They change how the agent behaves and what kinds of product adoption failures become visible.
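A simple way to think about a persona is as an extra behavioral instruction layered onto the base prompt. The sketch below assumes that model; the `Persona` class, the instruction wording, and `build_system_prompt` are hypothetical, and only the persona names come from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Persona:
    name: str
    instruction: str


# Instruction text is illustrative, paraphrased from the persona descriptions.
PERSONAS = {
    "methodical": Persona("methodical", "Follow the docs literally, one step at a time."),
    "impatient": Persona("impatient", "Take the shortest path and skip optional reading."),
    "recovery": Persona("recovery", "Make one deliberate mistake, then recover from it."),
}


def build_system_prompt(base: str, persona_name: Optional[str]) -> str:
    """Append the persona's instruction to the base prompt, if one is selected."""
    if persona_name is None:
        return base
    persona = PERSONAS[persona_name]
    return f"{base}\n\nBehavior style ({persona.name}): {persona.instruction}"


base = "Use only the provided docs and tools."
prompt = build_system_prompt(base, "methodical")
print(prompt)
```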

### Docs Are First-Class Inputs

Gauntlet is designed around product docs, especially:

- `llms.txt`
- `llms-full.txt`

The agent is expected to use those as its operating context, and Layer 3 uses them again to judge whether the run:

- had enough documentation
- consulted the docs
- followed the docs
- or drifted away from the intended path

### Every Run Becomes An Artifact

By default, `gauntlet chat` creates a batch folder automatically:

```text
artifacts/gauntlet_runs/gauntlet_###/
```

That folder contains:

- one Layer 2 run record per persona
- one Layer 3 JSON report
- one Layer 3 HTML report

So one command maps to one investigation bundle.

## Installation

Gauntlet is a Python package.

```bash
python -m pip install -e .
```

Alternatively, install the dependencies and run modules directly with `python -m`, for example `python -m gauntlet.layer3.cli`.

Current package metadata is in `pyproject.toml`.

## Environment

Gauntlet loads environment variables through Layer 1's environment loader.

Depending on what you run, you may need some of:

- `GEMINI_API_KEY`
- `OPENAI_API_KEY`
- `OLLAMA_BASE_URL`
- product-specific keys required by the product or API you are testing

Examples:

- Layer 2 provider calls use the provider API keys
- product workflows may need the target product's own API credentials
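Before a long run, it can help to check that the variables for the chosen provider are set. The sketch below is a standalone pre-flight check, not part of Gauntlet. The provider-to-variable mapping is an assumption: the variable names come from the list above, but the provider keys `openai` and `ollama`, and whether `OLLAMA_BASE_URL` is strictly required, are guesses.

```python
import os
from typing import Mapping

# Assumed mapping from provider name to the variables it needs.
PROVIDER_ENV = {
    "gemini": ["GEMINI_API_KEY"],
    "openai": ["OPENAI_API_KEY"],
    "ollama": ["OLLAMA_BASE_URL"],
}


def missing_env(provider: str, environ: Mapping[str, str] = os.environ) -> list[str]:
    """Return the variables for `provider` that are unset or empty."""
    return [name for name in PROVIDER_ENV.get(provider, []) if not environ.get(name)]


missing = missing_env("gemini")
if missing:
    print("Missing environment variables:", ", ".join(missing))
```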

## Quick Start

Run a single task with a single persona:

```bash
gauntlet chat \
  --provider gemini \
  --model gemini-2.5-flash \
  --docs https://example.com/llms.txt \
  --docs-full https://example.com/llms-full.txt \
  "Use the documented API to complete the onboarding task and report what worked."
```

Run all personas:

```bash
gauntlet chat \
  --provider gemini \
  --model gemini-2.5-flash \
  --docs https://example.com/llms.txt \
  --docs-full https://example.com/llms-full.txt \
  --persona all \
  "Use the documented quickstart to create a resource, inspect it, and clean it up."
```

After completion, Gauntlet prints the batch folder and Layer 3 report paths.

## Common Commands

### List Personas

```bash
gauntlet personas
```

Show one persona in detail:

```bash
gauntlet personas recovery
```

### Run With Debug Logs

```bash
gauntlet chat \
  --provider gemini \
  --model gemini-2.5-flash \
  --docs https://example.com/llms.txt \
  --docs-full https://example.com/llms-full.txt \
  --debug \
  "Use the documented API to complete the requested workflow"
```

### Runtime Controls For Longer Workflows

Layer 2 checkpoints after every step. If a run reaches `--max-steps`, Gauntlet continues from the latest checkpoint instead of starting over, up to `--max-restarts` times.

```bash
gauntlet chat \
  --provider gemini \
  --model gemini-2.5-flash \
  --docs https://example.com/llms.txt \
  --docs-full https://example.com/llms-full.txt \
  --max-steps 8 \
  --max-restarts 3 \
  --max-model-calls 40 \
  --max-runtime-seconds 900 \
  "Complete a longer browser workflow"
```
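The interaction between `--max-steps` and `--max-restarts` can be modeled with a small simulation. It assumes that `--max-steps` is a per-segment budget and that each restart resumes from the saved step count; `run_with_restarts` is a hypothetical function and does not reflect the actual Layer 2 implementation.

```python
def run_with_restarts(total_steps_needed: int, max_steps: int, max_restarts: int) -> dict:
    """Simulate a run that resumes from its checkpoint when a segment runs out of steps."""
    checkpoint = 0  # steps completed and saved so far
    segments = 0    # the first segment plus any restarts
    while segments <= max_restarts:
        segments += 1
        for _ in range(max_steps):
            if checkpoint >= total_steps_needed:
                break
            checkpoint += 1  # a checkpoint is written after every step
        if checkpoint >= total_steps_needed:
            return {"completed": True, "steps": checkpoint, "restarts": segments - 1}
    return {"completed": False, "steps": checkpoint, "restarts": max_restarts}


print(run_with_restarts(total_steps_needed=20, max_steps=8, max_restarts=3))
print(run_with_restarts(total_steps_needed=40, max_steps=8, max_restarts=3))
```

Under these assumptions, `--max-steps 8 --max-restarts 3` allows at most 32 steps in total, so a 20-step workflow finishes after two restarts while a 40-step workflow does not finish.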

Resume a saved Layer 2 record from its latest checkpoint:

```bash
gauntlet chat \
  --provider gemini \
  --resume-run artifacts/gauntlet_runs/gauntlet_001/layer2_run.json
```

Inspect a replay-friendly step outline without re-running tools:

```bash
gauntlet replay artifacts/gauntlet_runs/gauntlet_001/layer2_run.json
```

### Override The Output Directory

By default, Gauntlet creates `artifacts/gauntlet_runs/gauntlet_###/`, where `###` is the batch number (for example, `gauntlet_003`).

If you want a different batch folder:

```bash
gauntlet chat \
  --provider gemini \
  --model gemini-2.5-flash \
  --docs https://example.com/llms.txt \
  --docs-full https://example.com/llms-full.txt \
  --run-record artifacts/my_custom_batch \
  "Use the documented quickstart to complete the requested workflow"
```

### Run Layer 3 Directly

If you already have Layer 2 run records, you can judge them directly:

```bash
python -m gauntlet.layer3.cli \
  'artifacts/gauntlet_runs/gauntlet_003/layer2_run_*.json' \
  --output-dir artifacts/judge_runs \
  --name replay_batch
```

### Use An Optional Layer 3 Judge Model

Layer 3 defaults to a deterministic judge path. You can optionally configure a model-backed judge:

```bash
python -m gauntlet.layer3.cli \
  'artifacts/gauntlet_runs/gauntlet_003/layer2_run_*.json' \
  --output-dir artifacts/judge_runs \
  --name llm_judged_batch \
  --judge-provider gemini \
  --judge-model gemini-2.5-pro \
  --judge-fallback-deterministic
```
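Judging by its name, `--judge-fallback-deterministic` falls back to the deterministic path when the model-backed judge fails. The sketch below shows that pattern in general terms; `judge_with_fallback`, `deterministic_judge`, and the verdict dictionary are hypothetical and far simpler than a real judge.

```python
from typing import Callable, Optional


def deterministic_judge(record: dict) -> dict:
    """Toy rule: a run with a recorded failure fails, anything else passes."""
    failed = bool(record.get("failure"))
    return {"verdict": "fail" if failed else "pass", "judge": "deterministic"}


def judge_with_fallback(
    record: dict,
    model_judge: Optional[Callable[[dict], dict]] = None,
    fallback: bool = True,
) -> dict:
    """Use the model judge when configured; fall back to rules if it errors."""
    if model_judge is None:
        return deterministic_judge(record)
    try:
        return model_judge(record)
    except Exception:
        if not fallback:
            raise
        return deterministic_judge(record)


def flaky_model_judge(record: dict) -> dict:
    """Stand-in for a provider call that times out."""
    raise TimeoutError("judge model did not respond")


result = judge_with_fallback({"failure": None}, model_judge=flaky_model_judge)
print(result)
```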

The same judge flags are available through `gauntlet chat`, and they will be used automatically when Layer 3 runs after Layer 2.

## Output Structure

Typical batch folder:

```text
artifacts/gauntlet_runs/gauntlet_003/
  layer2_run_default.json
  layer2_run_methodical.json
  layer2_run_impatient.json
  ...
  gauntlet_003.json
  gauntlet_003.html
```
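Because the layout is predictable, a batch folder is easy to inspect from a script. The sketch below builds a throwaway folder that mirrors the listing above and summarizes it. `summarize_batch` is a hypothetical helper, and it assumes the reports are always named after the batch folder, as they are in the example.

```python
import tempfile
from pathlib import Path


def summarize_batch(batch_dir: Path) -> dict:
    """List the personas with run records and the expected report paths."""
    runs = sorted(batch_dir.glob("layer2_run_*.json"))
    personas = [p.stem.removeprefix("layer2_run_") for p in runs]
    return {
        "personas": personas,
        "json_report": batch_dir / f"{batch_dir.name}.json",
        "html_report": batch_dir / f"{batch_dir.name}.html",
    }


# Create a temporary folder that mimics the structure shown above.
with tempfile.TemporaryDirectory() as tmp:
    batch = Path(tmp) / "gauntlet_003"
    batch.mkdir()
    for name in ("layer2_run_default.json", "layer2_run_methodical.json",
                 "gauntlet_003.json", "gauntlet_003.html"):
        (batch / name).write_text("{}")
    summary = summarize_batch(batch)

print(summary["personas"])
```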

### Layer 2 Run Records

These contain:

- provider/model/persona metadata
- docs sources
- step-by-step tool execution
- tool results
- final response
- terminal failure info if present

### Layer 3 Report

The Layer 3 report is where the analysis lives.

It includes:

- batch summary
- issue breakdowns
- top recommendations
- per-persona outcomes
- execution timeline
- documentation evidence
- trace evidence
- successful path
- reproduction path
- cross-agent comparison

The HTML report is intended to be human-readable. The JSON report is intended to be machine-readable.

## How To Think About The Reports

A “completed” run is not automatically a good run.

Gauntlet tries to surface:

- clean success
- recovered success
- suspect success
- hard failure

What matters is whether the agent:

- used the product correctly
- used the docs correctly
- produced an answer supported by evidence

The most valuable output is often not “it failed,” but:

> it succeeded only after a noisy recovery path that a real external agent might never find

That is adoption signal.
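The four outcome labels can be read as a small decision rule over what the run record shows. The sketch below is one plausible mapping, not Gauntlet's judging logic; `classify_outcome` and its three inputs are assumptions chosen to match the criteria listed above.

```python
def classify_outcome(completed: bool, errors_before_finish: int,
                     evidence_supported: bool) -> str:
    """Map coarse run facts to one of the four outcome labels."""
    if not completed:
        return "hard failure"
    if not evidence_supported:
        # The agent claimed success, but the trace does not back the answer.
        return "suspect success"
    if errors_before_finish > 0:
        return "recovered success"
    return "clean success"


print(classify_outcome(completed=True, errors_before_finish=3, evidence_supported=True))
```

Checking evidence before counting errors means an unsupported answer is never reported as a recovered success, however the run got there.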

## Project Layout

```text
gauntlet/
  cli.py                  # top-level CLI
  providers.py            # provider adapters
  run_orchestrator.py     # shared CLI/server run pipeline
  layer1/                 # docs + tool primitives
  layer2/                 # agent execution loop
  layer3/                 # judgment and reporting
  server/                 # optional FastAPI API
artifacts/
  gauntlet_runs/          # automatic end-to-end run bundles
```

Key files:

- `gauntlet/cli.py`
- `gauntlet/run_orchestrator.py`
- `gauntlet/providers.py`
- `gauntlet/layer2/agent_runner.py`
- `gauntlet/layer2/personas.py`
- `gauntlet/layer3/cli.py`
- `gauntlet/layer3/reasoning_judge.py`

## Current Status

Gauntlet is still evolving. The pipeline is real and usable, but the work is openly experimental.

What is already real:

- multi-provider execution
- multi-persona runs
- automatic Layer 2 recording
- automatic Layer 3 reporting
- deterministic and optional model-backed judgment paths

What still needs continued refinement:

- docs retrieval quality
- citation precision
- judge quality
- attribution quality
- richer product-specific mission libraries

## The Spirit Of The Project

Most agent evals ask:

> “Did the model solve the task?”

Gauntlet asks a more useful question:

> “If a serious external agent tried to adopt this product from the docs alone, where would it break, how would it break, and what should we fix first?”

That is what this repo is for.
