# bt eval

> Run JavaScript and Python evaluation files against Braintrust

`bt eval` runs evaluation files written in JavaScript, TypeScript, or Python against Braintrust.

<Note>
  `bt eval` is currently macOS and Linux only.
</Note>

## File selection

* `bt eval` — discover and run all eval files in the current directory (recursive)
* `bt eval tests/` — discover eval files under a specific directory
* `bt eval "tests/**/*.eval.ts"` — glob pattern
* `bt eval a.eval.ts b.eval.ts` — one or more explicit files

Files inside `node_modules`, `.venv`, `venv`, `site-packages`, `dist-packages`, and `__pycache__` are excluded from automatic discovery. Explicit paths and globs bypass these exclusions.
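The exclusion rule can be pictured as a check on each directory component of a candidate path. The following is a minimal sketch of that idea only, not `bt`'s actual discovery code; the helper name `is_auto_excluded` is hypothetical, and the sketch assumes the rule matches whole directory names.

```python
from pathlib import PurePosixPath

# Directory names listed above as excluded from automatic discovery.
EXCLUDED_DIRS = {
    "node_modules", ".venv", "venv",
    "site-packages", "dist-packages", "__pycache__",
}


def is_auto_excluded(path: str) -> bool:
    """Return True if any directory component of `path` is an excluded name.

    Hypothetical helper: it only models automatic discovery. Explicit paths
    and globs bypass the exclusions, so they would never go through it.
    """
    parts = PurePosixPath(path).parts
    # Ignore the final component (the file name itself).
    return any(part in EXCLUDED_DIRS for part in parts[:-1])


print(is_auto_excluded("tests/qa.eval.ts"))               # False
print(is_auto_excluded("node_modules/pkg/demo.eval.ts"))  # True
```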

## Runtime configuration

<Tabs>
  <Tab title="TypeScript" icon="https://img.logo.dev/typescriptlang.org?token=pk_BdcHD9e5SCW3j1rnJkNyMQ">
    Requires Node.js 18.19.0+ or 20.6.0+. Bun 1.0+ and Deno with Node compatibility mode are also supported.

    By default, `bt eval` auto-detects a runner from your project, trying `tsx`, `vite-node`, `ts-node`, and then `ts-node-esm`. Set one explicitly with the `--runner` flag or the `BT_EVAL_RUNNER` environment variable:

    ```bash theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
    bt eval --runner vite-node tutorial.eval.ts
    bt eval --runner tsx tutorial.eval.ts
    ```

    `bt eval` automatically resolves locally installed binaries from `node_modules/.bin`, so you can write `--runner tsx` instead of `--runner ./node_modules/.bin/tsx`. If you see ESM or top-level await errors, try `--runner vite-node`.
  </Tab>

  <Tab title="Python" icon="https://img.logo.dev/python.org?token=pk_BdcHD9e5SCW3j1rnJkNyMQ">
    Use `--language py` to force Python instead of relying on language detection. By default, if `VIRTUAL_ENV` is set, `bt` uses that virtualenv's Python; otherwise it searches `PATH` for `python3` or `python`. To use a specific interpreter, set `BT_EVAL_PYTHON_RUNNER` to its name or path (e.g. `python3.11`). The `--num-workers` flag controls concurrency for Python execution.

    ```bash theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
    bt eval my_eval.py
    bt eval --language py --num-workers 4 my_eval.py
    ```
  </Tab>
</Tabs>
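The Python interpreter lookup described in the Python tab amounts to a three-step fallback. The sketch below is illustrative only and is not taken from `bt`'s source. The function name `resolve_python` is hypothetical, and the sketch assumes that the virtualenv interpreter lives at `<venv>/bin/python` (the usual macOS/Linux layout) and that `python3` is tried before `python`.

```python
import shutil
from typing import Mapping, Optional


def resolve_python(env: Mapping[str, str]) -> Optional[str]:
    """Model the documented lookup order for the Python interpreter."""
    # 1. An explicit override (a name or a path) takes priority.
    override = env.get("BT_EVAL_PYTHON_RUNNER")
    if override:
        return override

    # 2. Otherwise use the active virtualenv's interpreter.
    #    Assumption: <venv>/bin/python, the macOS/Linux layout.
    venv = env.get("VIRTUAL_ENV")
    if venv:
        return f"{venv}/bin/python"

    # 3. Otherwise search PATH. Assumption: python3 is preferred over python.
    path = env.get("PATH", "")
    return shutil.which("python3", path=path) or shutil.which("python", path=path)


print(resolve_python({"BT_EVAL_PYTHON_RUNNER": "python3.11"}))  # python3.11
print(resolve_python({"VIRTUAL_ENV": "/work/.venv"}))           # /work/.venv/bin/python
```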

## Sampling modes

Run a subset of your evaluation data as a non-final smoke run to catch obvious regressions before committing to the full dataset.

```bash theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
bt eval --first 20 qa.eval.ts          # First 20 examples, non-final
bt eval --sample 20 qa.eval.ts         # Random 20 examples (default seed 0), non-final
bt eval --sample 20 --sample-seed 7 qa.eval.ts  # Random 20 examples with an explicit seed
bt eval qa.eval.ts                     # Full dataset, final
```

<Note>
  When `--first` or `--sample` is used, the experiment summary is labeled as non-final in Braintrust. Omitting both flags runs the full dataset and marks the summary as final.
</Note>
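To make the difference between the two modes concrete, here is a small sketch of prefix selection versus seeded random sampling. It is an illustration under assumptions, not `bt`'s sampling algorithm: the function `select_examples` is hypothetical, and a given seed will not necessarily pick the same examples that `bt eval` picks.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def select_examples(
    data: Sequence[T],
    first: Optional[int] = None,
    sample: Optional[int] = None,
    seed: int = 0,
) -> List[T]:
    """Pick the examples for a run, mirroring --first / --sample / full."""
    if first is not None:
        # --first N: a prefix of the dataset, in order.
        return list(data[:first])
    if sample is not None:
        # --sample N: a random subset that is repeatable for a given seed.
        rng = random.Random(seed)
        return rng.sample(list(data), min(sample, len(data)))
    # Neither flag: the full dataset (a final run).
    return list(data)


dataset = list(range(100))
print(select_examples(dataset, first=3))  # [0, 1, 2]
same = select_examples(dataset, sample=20, seed=7) == select_examples(dataset, sample=20, seed=7)
print(same)  # True
```

Giving `first` precedence when both arguments are passed is an arbitrary choice in this sketch; the text above does not say how `bt eval` treats that combination.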

## Flags

| Flag                             | Env var               | Description                                                                                                                                                                                                          |
| -------------------------------- | --------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `--runner <RUNNER>`              | `BT_EVAL_RUNNER`      | Runner binary (`tsx`, `bun`, `ts-node`, `python`, etc.)                                                                                                                                                              |
| `--language <LANG>`              | `BT_EVAL_LANGUAGE`    | Force language: `js` or `py`                                                                                                                                                                                         |
| `--filter <PATTERN>`             | `BT_EVAL_FILTER`      | Run only evaluators matching the pattern                                                                                                                                                                             |
| `--first <N>`                    | `BT_EVAL_FIRST`       | Run only the first N examples (non-final smoke run)                                                                                                                                                                  |
| `--sample <N>`                   | `BT_EVAL_SAMPLE`      | Run a deterministic random sample of N examples (non-final smoke run)                                                                                                                                                |
| `--sample-seed <S>`              | `BT_EVAL_SAMPLE_SEED` | Integer seed for `--sample` (default: `0`)                                                                                                                                                                           |
| `--param <KEY=VALUE>`            | `BT_EVAL_PARAMS_JSON` | Pass a named parameter into evaluators that declare a parameters schema (repeatable; also accepts a JSON object string)                                                                                              |
| `--matrix-param <KEY=V1,V2,...>` |                       | Run one experiment per Cartesian-product combination of parameter values (repeatable). Requires exactly one evaluator (use `--filter` to select it). Incompatible with `--watch`, `--dev`, and `--list`              |
| `--watch` / `-w`                 | `BT_EVAL_WATCH`       | Re-run when input files change                                                                                                                                                                                       |
| `--no-send-logs`                 | `BT_EVAL_LOCAL`       | Run without sending results to Braintrust                                                                                                                                                                            |
| `--num-workers <N>`              |                       | Worker threads for Python execution                                                                                                                                                                                  |
| `--verbose`                      |                       | Show full errors and stderr from eval files                                                                                                                                                                          |
| `--list`                         |                       | List evaluators without running them                                                                                                                                                                                 |
| `--jsonl`                        |                       | Output one JSON summary per evaluator (for scripts). See also the global `--json` flag ([overview](/reference/cli/overview#global-flags)), which formats all CLI output as JSON rather than per-evaluator summaries. |
| `--terminate-on-failure`         |                       | Stop after the first failing evaluator                                                                                                                                                                               |
| `--dev`                          |                       | Start a local web server for browser-based eval development (default port: 8300)                                                                                                                                     |

## Summary output

When using `--jsonl` or reading SSE output, each evaluator summary object includes these fields:

| Field         | Type                                | Description                                                                    |
| ------------- | ----------------------------------- | ------------------------------------------------------------------------------ |
| `runMode`     | `"full"` \| `"first"` \| `"sample"` | How the eval was run                                                           |
| `isFinal`     | `boolean`                           | Whether this is a final (full-dataset) run                                     |
| `runLabel`    | `string`                            | Human-readable description of the run mode                                     |
| `sampleCount` | `number`                            | Number of examples sampled (only present when `--first` or `--sample` is used) |
| `sampleSeed`  | `number`                            | Seed used for random sampling (only present when `--sample` is used)           |
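A script that consumes these summaries might use the fields to tell smoke runs apart from final runs. The sketch below parses JSON lines and keeps the non-final ones. The sample lines are hypothetical and contain only the fields from the table above; the `runLabel` values in particular are made-up placeholders, and real summaries may carry additional fields.

```python
import json
from typing import Any, Dict, Iterable, List


def non_final_summaries(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Parse JSONL summary lines and return those from non-final runs."""
    summaries = [json.loads(line) for line in lines if line.strip()]
    return [s for s in summaries if not s["isFinal"]]


# Hypothetical summary lines; the runLabel strings are placeholders.
example_output = [
    '{"runMode": "full", "isFinal": true, "runLabel": "full run"}',
    '{"runMode": "sample", "isFinal": false, "runLabel": "sample run", '
    '"sampleCount": 20, "sampleSeed": 7}',
]

for summary in non_final_summaries(example_output):
    # sampleCount and sampleSeed are optional, so read them with .get().
    print(summary["runMode"], summary.get("sampleCount"), summary.get("sampleSeed"))
    # sample 20 7
```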

## Parameters

`--param` overrides values for evaluators that declare a `parameters` schema via `loadParameters()` (TypeScript) or `load_parameters()` (Python). This is the same parameters system used by [remote evals](/evaluate/remote-evals), where parameters are version-tracked in Braintrust and appear as UI controls in the playground. See [Create evaluation parameters](/evaluate/write-parameters) for how to define and load parameters.

Each evaluator only receives the keys it declares. Extra keys are silently filtered, so a single command can target multiple evaluators with different schemas without errors.

```bash theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
bt eval --param model=gpt-4o --param count=5 my.eval.ts
bt eval --param '{"model":"gpt-4o","count":5}' my.eval.ts
```

Parameters are validated against the evaluator's declared schema before execution. Evaluators without a `parameters` schema are unaffected.
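The per-evaluator filtering can be sketched as follows. This is a simplified model, not `bt`'s implementation: the helper names are hypothetical, `KEY=VALUE` values are kept as plain strings because the text above does not describe how they are coerced, and schema validation is left out.

```python
import json
from typing import Any, Dict, Iterable, Set


def parse_params(args: Iterable[str]) -> Dict[str, Any]:
    """Merge --param arguments given as KEY=VALUE pairs or JSON object strings."""
    params: Dict[str, Any] = {}
    for arg in args:
        if arg.lstrip().startswith("{"):
            # JSON object form: '{"model":"gpt-4o","count":5}'
            params.update(json.loads(arg))
        else:
            # KEY=VALUE form; the value stays a string in this sketch.
            key, _, value = arg.partition("=")
            params[key] = value
    return params


def filter_for_evaluator(params: Dict[str, Any], declared: Set[str]) -> Dict[str, Any]:
    """Keep only the keys that an evaluator declares in its parameters schema."""
    return {k: v for k, v in params.items() if k in declared}


params = parse_params(["model=gpt-4o", "count=5", "unused=1"])
print(filter_for_evaluator(params, {"model", "count"}))
# {'model': 'gpt-4o', 'count': '5'}
print(filter_for_evaluator(params, {"model"}))
# {'model': 'gpt-4o'}
```

Because each evaluator sees only its own declared keys, the second evaluator in the example ignores `count` and `unused` without raising an error.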

## Parameter matrix

`--matrix-param` works with the same parameters system as `--param`. Specify multiple values for one or more parameters and `bt eval` runs one experiment per combination, naming each `<experiment-name> [key=value, ...]`.

```bash theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
# Sweep a single parameter across three values
bt eval --matrix-param model=gpt-4o,gpt-4o-mini,o1-mini my.eval.ts

# Sweep two parameters — runs 2 × 3 = 6 experiments
bt eval --matrix-param model=gpt-4o,gpt-4o-mini --matrix-param temperature=0.0,0.5,1.0 my.eval.ts
```

For values that contain commas, use a JSON array:

```bash theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
bt eval --matrix-param model='["gpt-4o","claude-3-5-haiku-20241022"]' my.eval.ts
```

`--matrix-param` requires exactly one evaluator to be selected. If your file exports multiple evaluators, use `--filter` to narrow the selection down to one. `--matrix-param` is not supported for eval files that export `btEvalMain`.
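The Cartesian product behind `--matrix-param` can be sketched with `itertools.product`. This models only the combination and naming scheme described above, not `bt`'s actual code; the helper names and the base experiment name `my-eval` are hypothetical.

```python
import itertools
import json
from typing import Dict, List


def parse_matrix_values(raw: str) -> List[str]:
    """Split a --matrix-param value list given as V1,V2,... or as a JSON array."""
    if raw.lstrip().startswith("["):
        return [str(value) for value in json.loads(raw)]
    return raw.split(",")


def matrix_experiments(base_name: str, matrix: Dict[str, List[str]]) -> List[str]:
    """Return one experiment name per combination of parameter values."""
    keys = list(matrix)
    names = []
    for combo in itertools.product(*(matrix[key] for key in keys)):
        label = ", ".join(f"{key}={value}" for key, value in zip(keys, combo))
        names.append(f"{base_name} [{label}]")
    return names


matrix = {
    "model": parse_matrix_values("gpt-4o,gpt-4o-mini"),
    "temperature": parse_matrix_values("0.0,0.5,1.0"),
}
names = matrix_experiments("my-eval", matrix)
print(len(names))  # 6
print(names[0])    # my-eval [model=gpt-4o, temperature=0.0]
```

The number of experiments is the product of the value counts, so adding a parameter with three values triples the number of runs.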

## Passing arguments to the eval file

Use `--` to forward extra arguments to the eval file via `process.argv`:

```bash theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
bt eval foo.eval.ts -- --description "Prod" --shard 1/4
```
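Inside the eval file, the forwarded arguments are an ordinary list of strings that any argument parser can handle. The sketch below parses the two arguments from the command above in Python. Several things here are assumptions: the text above only confirms `process.argv` for JavaScript, so reading `sys.argv[1:]` in a Python eval file is a guess; `--description` and `--shard` are user-defined arguments rather than `bt` flags; and the `index/total` reading of `--shard` is hypothetical.

```python
import argparse
from typing import List, Tuple


def parse_forwarded(argv: List[str]) -> Tuple[str, int, int]:
    """Parse user-defined arguments forwarded after `--`."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--description", default="")
    # Assumed format: "<index>/<total>", for example "1/4".
    parser.add_argument("--shard", default="1/1")
    args = parser.parse_args(argv)
    index, total = (int(part) for part in args.shard.split("/"))
    return args.description, index, total


# In a real eval file this list might come from sys.argv[1:] (an assumption).
print(parse_forwarded(["--description", "Prod", "--shard", "1/4"]))
# ('Prod', 1, 4)
```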

## Running in CI

Set `BRAINTRUST_API_KEY` instead of using OAuth login:

```yaml theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
# GitHub Actions example
- name: Run evals
  env:
    BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }}
  run: bt eval tests/
```

Use `--no-input` and `--json` for non-interactive output:

```bash theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
BRAINTRUST_API_KEY=... bt eval tests/ --no-input --json
```
