# LLM-as-a-judge scorers and classifiers

> Use language models to score or classify AI outputs based on natural language criteria like tone, helpfulness, or intent.

LLM-as-a-judge scorers and classifiers use a language model to evaluate outputs based on natural language criteria. A scorer returns a numeric score, while a classifier returns a categorical label. They are best for subjective judgments like tone, helpfulness, or creativity that are difficult to encode in deterministic code.

You can define LLM-as-a-judge scorers in three places:

* **Inline in SDK code**: Define scorers directly in your evaluation scripts for local development or application-specific logic.
* **Pushed via CLI**: Define scorers in TypeScript or Python files and push them to Braintrust for team-wide sharing and automatic evaluation of production logs.
* **Created in UI**: Build scorers in the Braintrust web interface for rapid prototyping and simple configurations.

Most teams prototype in the UI, then push production-ready scorers via the CLI. See [Scorers overview](/evaluate/write-scorers#where-to-define-scorers-and-classifiers) for guidance.

## Score spans

Span-level scorers evaluate individual operations or outputs. Use them for measuring single LLM responses, checking specific tool calls, or validating individual outputs. Each matching span receives an independent score.

Your prompt template can reference these variables:

* `{{input}}`: The input to your task
* `{{output}}`: The output from your task
* `{{expected}}`: The expected output (optional)
* `{{metadata}}`: Custom metadata from the test case
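
To make the substitution concrete, here is a minimal sketch of how a template using these variables might be filled in. It is not Braintrust's renderer: `render_template` is a hypothetical helper that only handles plain `{{name}}` placeholders, and it assumes that missing variables render as an empty string and that non-string values are shown as JSON.

```python
import json
import re


def render_template(template: str, variables: dict) -> str:
    """Replace each {{name}} placeholder with the matching value (hypothetical helper)."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            # Assumption: unknown or missing variables render as an empty string.
            return ""
        value = variables[name]
        # Assumption: non-string values (for example metadata objects) are shown as JSON.
        return value if isinstance(value, str) else json.dumps(value)

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


prompt = render_template(
    "Output: {{output}}\nExpected: {{expected}}\nMetadata: {{metadata}}",
    {"output": "Seven", "expected": "Se7en", "metadata": {"genre": "thriller"}},
)
print(prompt)
```

The rendered string is what the judge model would see in place of the placeholders, which is useful to keep in mind when deciding how much of `{{metadata}}` to include in a rubric.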

<Tabs className="tabs-border">
  <Tab title="SDK" icon="code">
    Use scorers inline in your evaluation code:

    <CodeGroup dropdown>
      ```typescript llm_scorer.eval.ts theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      import { Eval } from "braintrust";
      import { LLMClassifierFromTemplate } from "autoevals";
      import OpenAI from "openai";

      const client = new OpenAI();

      const MOVIE_DATASET = [
        {
          input:
            "A detective investigates a series of murders based on the seven deadly sins.",
          expected: "Se7en",
        },
        {
          input:
            "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O.",
          expected: "Inception",
        },
      ];

      async function task(input: string): Promise<string> {
        const response = await client.responses.create({
          model: "gpt-5-mini",
          input: [
            {
              role: "system",
              content:
                "Based on the following description, identify the movie. Reply with only the movie title.",
            },
            { role: "user", content: input },
          ],
        });
        return response.output_text ?? "";
      }

      const correctnessScorer = LLMClassifierFromTemplate({
        name: "Correctness",
        promptTemplate: `You are evaluating a movie-identification task.

      Output (model's answer): {{output}}
      Expected (correct movie): {{expected}}

      Does the output correctly identify the same movie as the expected answer?
      Consider alternate titles (e.g. "Harry Potter 1" vs "Harry Potter and the Sorcerer's Stone") as correct.

      Return only "correct" if the output is the right movie (exact or equivalent title).
      Return only "incorrect" otherwise.`,
        choiceScores: {
          correct: 1,
          incorrect: 0,
        },
        useCoT: true,
      });

      Eval("Movie Matcher", {
        data: MOVIE_DATASET,
        task,
        scores: [correctnessScorer],
      });
      ```

      ```python eval_llm_scorer.py theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      from braintrust import Eval
      from autoevals import LLMClassifier
      from openai import OpenAI

      client = OpenAI()

      MOVIE_DATASET = [
          {
              "input": "A detective investigates a series of murders based on the seven deadly sins.",
              "expected": "Se7en",
          },
          {
              "input": "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O.",
              "expected": "Inception",
          },
      ]


      def task(input):
          response = client.responses.create(
              model="gpt-5-mini",
              input=[
                  {
                      "role": "system",
                      "content": "Based on the following description, identify the movie. Reply with only the movie title.",
                  },
                  {"role": "user", "content": input},
              ],
          )
          return response.output_text


      correctness_scorer = LLMClassifier(
          name="Correctness",
          prompt_template="""You are evaluating a movie-identification task.

      Output (model's answer): {{output}}
      Expected (correct movie): {{expected}}

      Does the output correctly identify the same movie as the expected answer?
      Consider alternate titles (e.g. "Harry Potter 1" vs "Harry Potter and the Sorcerer's Stone") as correct.

      Return only "correct" if the output is the right movie (exact or equivalent title).
      Return only "incorrect" otherwise.""",
          choice_scores={
              "correct": 1,
              "incorrect": 0,
          },
          model="gpt-5-mini",
      )

      Eval(
          "Movie Matcher",
          data=MOVIE_DATASET,
          task=task,
          scores=[correctness_scorer],
      )
      ```

      ```ruby eval_llm_scorer.rb theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      require "braintrust"
      require "openai"

      Braintrust.init

      client = OpenAI::Client.new(api_key: ENV.fetch("OPENAI_API_KEY", nil))
      judge_client = OpenAI::Client.new(api_key: ENV.fetch("OPENAI_API_KEY", nil))

      MOVIE_DATASET = [
        {
          input: "A detective investigates a series of murders based on the seven deadly sins.",
          expected: "Se7en",
        },
        {
          input: "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O.",
          expected: "Inception",
        },
      ]

      correctness_scorer = Braintrust::Scorer.new("correctness") do |output:, expected:|
        response = judge_client.chat.completions.create(
          model: "gpt-5-mini",
          messages: [{
            role: "user",
            content: <<~PROMPT
              You are evaluating a movie-identification task.

              Output (model's answer): #{output}
              Expected (correct movie): #{expected}

              Does the output correctly identify the same movie as the expected answer?
              Consider alternate titles (e.g. "Harry Potter 1" vs "Harry Potter and the Sorcerer's Stone") as correct.

              Return only "correct" if the output is the right movie (exact or equivalent title).
              Return only "incorrect" otherwise.
            PROMPT
          }]
        )

        verdict = response.choices.first.message.content.to_s.strip.downcase
        {name: "Correctness", score: verdict == "correct" ? 1.0 : 0.0}
      end

      Braintrust::Eval.run(
        project: "Movie Matcher",
        cases: MOVIE_DATASET,
        task: lambda do |input:|
          response = client.chat.completions.create(
            model: "gpt-5-mini",
            messages: [
              {role: "system", content: "Based on the following description, identify the movie. Reply with only the movie title."},
              {role: "user", content: input}
            ]
          )
          response.choices.first.message.content || ""
        end,
        scorers: [correctness_scorer]
      )

      OpenTelemetry.tracer_provider.shutdown
      ```

      ```csharp llm_scorer.cs theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      using System.Text.Json;
      using Braintrust.Sdk;
      using Braintrust.Sdk.Eval;
      using Braintrust.Sdk.OpenAI;
      using OpenAI;
      using OpenAI.Chat;

      sealed class CorrectnessScorer(string openAIApiKey) : IScorer<string, string>
      {
          public string Name => "Correctness";

          private readonly ChatClient _chatClient =
              new OpenAIClient(openAIApiKey).GetChatClient("gpt-5-mini");

          public async Task<IReadOnlyList<Score>> Score(TaskResult<string, string> taskResult)
          {
              var prompt = $$"""
                  You are evaluating a movie-identification task.

                  Output (model's answer): {{taskResult.Result}}
                  Expected (correct movie): {{taskResult.DatasetCase.Expected}}

                  Does the output correctly identify the same movie as the expected answer?
                  Consider alternate titles (e.g. "Harry Potter 1" vs "Harry Potter and the Sorcerer's Stone") as correct.

                  Reply with JSON only: {"score": 1, "reasoning": "..."} for correct, or {"score": 0, "reasoning": "..."} for incorrect.
                  """;

              var completion = await _chatClient.CompleteChatAsync([new UserChatMessage(prompt)]);
              using var json = JsonDocument.Parse(completion.Value.Content[0].Text);
              var score = json.RootElement.GetProperty("score").GetDouble();
              var reasoning = json.RootElement.GetProperty("reasoning").GetString() ?? "";

              return [new Score(Name, score, new Dictionary<string, object> { { "reasoning", reasoning } })];
          }
      }

      class Program
      {
          static readonly DatasetCase<string, string>[] MovieDataset =
          [
              DatasetCase.Of("A detective investigates a series of murders based on the seven deadly sins.", "Se7en"),
              DatasetCase.Of("A thief plants an idea into a CEO's mind through dream-sharing technology.", "Inception"),
          ];

          static async Task Main(string[] args)
          {
              var openAIApiKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY")!;
              var braintrust = Braintrust.Sdk.Braintrust.Get();
              var activitySource = braintrust.GetActivitySource();
              var openAIClient = BraintrustOpenAI.WrapOpenAI(activitySource, openAIApiKey);

              async Task<string> MovieTask(string input)
              {
                  var response = await openAIClient.GetChatClient("gpt-5-mini").CompleteChatAsync(
                      new SystemChatMessage("Identify the movie from the description. Reply with the title only."),
                      new UserChatMessage(input));
                  return response.Value.Content[0].Text;
              }

              var eval = await braintrust
                  .EvalBuilder<string, string>()
                  .Name("Movie Matcher")
                  .Cases(MovieDataset)
                  .TaskFunction(MovieTask)
                  .Scorers(new CorrectnessScorer(openAIApiKey))
                  .BuildAsync();

              var result = await eval.RunAsync();
              Console.WriteLine(result.CreateReportString());
          }
      }
      ```
    </CodeGroup>
  </Tab>

  <Tab title="CLI" icon="terminal">
    Define scorers in TypeScript or Python, then push them to Braintrust:

    <CodeGroup dropdown>
      ```typescript title="llm_scorer.ts" theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      import braintrust from "braintrust";

      const project = braintrust.projects.create({ name: "my-project" });

      project.scorers.create({
        name: "Helpfulness scorer",
        slug: "helpfulness-scorer",
        description: "Evaluate helpfulness of response",
        tags: ["quality"],
        messages: [
          {
            role: "user",
            content:
              'Rate the helpfulness of this response: {{output}}\n\nReturn "A" for very helpful, "B" for somewhat helpful, "C" for not helpful.',
          },
        ],
        model: "gpt-5-mini",
        useCot: true,
        choiceScores: {
          A: 1,
          B: 0.5,
          C: 0,
        },
        metadata: {
          __pass_threshold: 0.7,
        },
      });
      ```

      ```python title="llm_scorer.py" theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      import braintrust

      project = braintrust.projects.create(name="my-project")

      project.scorers.create(
          name="Helpfulness scorer",
          slug="helpfulness-scorer",
          description="Evaluate helpfulness of response",
          tags=["quality"],
          messages=[
              {
                  "role": "user",
                  "content": 'Rate the helpfulness of this response: {{output}}\n\nReturn "A" for very helpful, "B" for somewhat helpful, "C" for not helpful.',
              }
          ],
          model="gpt-5-mini",
          use_cot=True,
          choice_scores={
              "A": 1,
              "B": 0.5,
              "C": 0,
          },
          metadata={"__pass_threshold": 0.7},
      )
      ```
    </CodeGroup>

    Push to Braintrust using the [`bt` CLI](/reference/cli/quickstart):

    <CodeGroup>
      ```bash TypeScript theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      bt functions push llm_scorer.ts
      ```

      ```bash Python theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      bt functions push llm_scorer.py
      ```
    </CodeGroup>
  </Tab>

  <Tab title="UI" icon="mouse-pointer-2">
    1. Go to [**<Icon icon="triangle" /> Scorers**](https://www.braintrust.dev/app/~/scorers) > **+ Scorer**.
    2. Enter a scorer name and slug.
    3. Select **LLM-as-a-judge**.
    4. Configure:
       * **Prompt**: Instructions for evaluating the output
       * **Model**: Which model to use as judge
       * **Choice scores**: Map model choices (A, B, C) to numeric scores
       * **Use CoT**: Enable chain-of-thought reasoning for complex evaluations
    5. Click **Save as custom scorer**.
  </Tab>
</Tabs>

## Score traces

Trace-level scorers evaluate entire execution traces including all spans and conversation history. Use these for assessing multi-turn conversation quality, overall workflow completion, or when your scorer needs access to the full execution context. The scorer runs once per trace.

In an experiment, the full trace is available to every scorer, so a scorer evaluates the trace whenever its prompt uses a thread variable such as `{{thread}}`. For [online scoring](/evaluate/score-online#create-scoring-rules), also set the rule's **Scope** field to **Trace**. This field controls whether the scorer runs once per trace or once per matching span.

Prompt templates for trace-level scorers support the following reserved variables:

| Variable                 | Type   | Description                                                                              |
| ------------------------ | ------ | ---------------------------------------------------------------------------------------- |
| `{{input}}`              | any    | Input from the root span                                                                 |
| `{{output}}`             | any    | Output from the root span                                                                |
| `{{expected}}`           | any    | Expected output from the root span (optional)                                            |
| `{{metadata}}`           | object | Metadata from the root span                                                              |
| `{{thread}}`             | text   | Full conversation rendered as human-readable text (excludes system messages for scorers) |
| `{{thread_with_system}}` | text   | Full conversation including system messages, rendered as human-readable text             |
| `{{thread_count}}`       | number | Total number of messages in the thread                                                   |
| `{{first_message}}`      | object | First message in the thread                                                              |
| `{{last_message}}`       | object | Last message in the thread                                                               |
| `{{user_messages}}`      | array  | All user/human messages only                                                             |
| `{{assistant_messages}}` | array  | All assistant messages only                                                              |
| `{{human_ai_pairs}}`     | array  | Turn pairs — each item has `{human, assistant}`                                          |

Use `{{thread}}` to pass the full conversation to a judge model as formatted text. For scorers, `{{thread}}` omits system messages so the rubric isn't polluted by your application's system prompt. Use `{{thread_with_system}}` when the judge needs the system prompt as context. For non-scorer prompts, the two variables are equivalent because both include system messages. `{{input}}`, `{{output}}`, `{{expected}}`, and `{{metadata}}` come from the root span of the trace.
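
The sketch below illustrates how these variables relate to one another for a short conversation. It is an approximation for intuition only: `build_thread_variables` is a hypothetical helper, and the exact text layout of `{{thread}}` and the shape of each `{{human_ai_pairs}}` item are assumptions rather than Braintrust's actual rendering.

```python
def build_thread_variables(messages: list[dict]) -> dict:
    """Derive thread-style variables from a list of {"role", "content"} messages."""

    def as_text(msgs: list[dict]) -> str:
        # Assumption: one "role: content" entry per message, separated by blank lines.
        return "\n\n".join(f"{m['role']}: {m['content']}" for m in msgs)

    non_system = [m for m in messages if m["role"] != "system"]
    users = [m for m in messages if m["role"] == "user"]
    assistants = [m for m in messages if m["role"] == "assistant"]

    return {
        # Scorer view: system messages are left out.
        "thread": as_text(non_system),
        # Same conversation with the system prompt included.
        "thread_with_system": as_text(messages),
        "user_messages": users,
        "assistant_messages": assistants,
        # Assumption: turns are paired in order, one user message per assistant reply.
        "human_ai_pairs": [
            {"human": u["content"], "assistant": a["content"]}
            for u, a in zip(users, assistants)
        ],
    }


messages = [
    {"role": "system", "content": "You are a helpful customer support agent."},
    {"role": "user", "content": "I need help resetting my password."},
    {"role": "assistant", "content": "Sure, I can help with that."},
]
variables = build_thread_variables(messages)
print(variables["thread"])
```

Comparing `variables["thread"]` with `variables["thread_with_system"]` shows the only difference a scorer sees between the two variables: whether the system prompt is part of the text handed to the judge.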

<Note>
  Trace-level scoring requires TypeScript SDK v2.2.1+, Python SDK v0.5.6+, or Ruby SDK v0.2.1+.
</Note>

<Tabs className="tabs-border">
  <Tab title="SDK" icon="code">
    Use scorers inline in your evaluation code:

    <CodeGroup dropdown>
      ```typescript trace_llm_scorer.eval.ts theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      import { Eval, wrapOpenAI, wrapTraced, type Scorer } from "braintrust";
      import OpenAI from "openai";

      const client = new OpenAI();
      const wrappedClient = wrapOpenAI(new OpenAI());

      const SUPPORT_DATASET = [
        { input: "My order hasn't arrived yet. Order #12345." },
        { input: "I need help resetting my password." },
      ];

      // Use the SDK's message type so `role` is checked against the allowed literals.
      type ChatMessage = OpenAI.Chat.ChatCompletionMessageParam;

      const callLLM = wrapTraced(async function callLLM(messages: ChatMessage[]) {
        const response = await wrappedClient.chat.completions.create({
          model: "gpt-5-mini",
          messages,
        });
        return response.choices[0].message.content || "";
      });

      async function supportTask(input: string): Promise<string> {
        const messages: ChatMessage[] = [
          { role: "system", content: "You are a helpful customer support agent." },
        ];

        messages.push({ role: "user", content: input });
        const response1 = await callLLM(messages);
        messages.push({ role: "assistant", content: response1 });

        messages.push({ role: "user", content: "Can you provide more details?" });
        const response2 = await callLLM(messages);
        messages.push({ role: "assistant", content: response2 });

        messages.push({ role: "user", content: "Thank you for your help!" });
        const response3 = await callLLM(messages);

        return response3;
      }

      const conversationCoherence: Scorer = async ({ trace }) => {
        if (!trace) return null;

        const thread = await trace.getThread();
        const threadText = thread
          .map(msg => `${msg.role}: ${msg.content}`)
          .join("\n\n");

        const response = await client.responses.create({
          model: "gpt-5-mini",
          input: [
            {
              role: "user",
              content: `Evaluate the coherence of this customer support conversation:

      ${threadText}

      Rate the conversation coherence:
      - "A" for highly coherent with natural flow and consistent context
      - "B" for mostly coherent with minor gaps or context issues
      - "C" for incoherent, disjointed, or lost context

      Return only the letter (A, B, or C).`,
            },
          ],
        });

        const rating = response.output_text?.trim().toUpperCase() || "C";
        const choiceScores = { A: 1, B: 0.6, C: 0 };
        const score = choiceScores[rating as keyof typeof choiceScores] ?? 0;

        return {
          name: "Conversation coherence",
          score,
          metadata: { rating, thread_length: thread.length },
        };
      };

      Eval("Support Conversation Quality", {
        data: SUPPORT_DATASET,
        task: supportTask,
        scores: [conversationCoherence],
      });
      ```

      ```python eval_trace_llm_scorer.py theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      from braintrust import Eval, wrap_openai, traced
      from openai import AsyncOpenAI, OpenAI

      client = OpenAI()
      wrapped_client = wrap_openai(AsyncOpenAI())

      SUPPORT_DATASET = [
          {"input": "My order hasn't arrived yet. Order #12345."},
          {"input": "I need help resetting my password."},
      ]


      @traced
      async def call_llm(messages):
          response = await wrapped_client.chat.completions.create(
              model="gpt-5-mini",
              messages=messages,
          )
          return response.choices[0].message.content or ""


      async def support_task(input):
          messages = [
              {"role": "system", "content": "You are a helpful customer support agent."}
          ]

          messages.append({"role": "user", "content": input})
          response1 = await call_llm(messages)
          messages.append({"role": "assistant", "content": response1})

          messages.append({"role": "user", "content": "Can you provide more details?"})
          response2 = await call_llm(messages)
          messages.append({"role": "assistant", "content": response2})

          messages.append({"role": "user", "content": "Thank you for your help!"})
          response3 = await call_llm(messages)

          return response3


      async def conversation_coherence(input, output, expected, trace=None):
          if not trace:
              return None

          thread = await trace.get_thread()
          thread_text = "\n\n".join([f"{msg['role']}: {msg['content']}" for msg in thread])

          response = client.responses.create(
              model="gpt-5-mini",
              input=[
                  {
                      "role": "user",
                      "content": f"""Evaluate the coherence of this customer support conversation:

      {thread_text}

      Rate the conversation coherence:
      - "A" for highly coherent with natural flow and consistent context
      - "B" for mostly coherent with minor gaps or context issues
      - "C" for incoherent, disjointed, or lost context

      Return only the letter (A, B, or C).""",
                  }
              ],
          )

          rating = (response.output_text or "C").strip().upper()
          choice_scores = {"A": 1, "B": 0.6, "C": 0}
          score = choice_scores.get(rating, 0)

          return {
              "name": "Conversation coherence",
              "score": score,
              "metadata": {"rating": rating, "thread_length": len(thread)},
          }


      Eval(
          "Support Conversation Quality",
          data=SUPPORT_DATASET,
          task=support_task,
          scores=[conversation_coherence],
      )
      ```

      ```ruby eval_trace_llm_scorer.rb theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      require "braintrust"
      require "openai"

      Braintrust.init

      client = OpenAI::Client.new(api_key: ENV.fetch("OPENAI_API_KEY", nil))
      judge_client = OpenAI::Client.new(api_key: ENV.fetch("OPENAI_API_KEY", nil))

      SUPPORT_DATASET = [
        {input: "My order hasn't arrived yet. Order #12345."},
        {input: "I need help resetting my password."},
      ]

      def chat(client, messages)
        client.chat.completions.create(model: "gpt-5-mini", messages: messages)
          .choices.first.message.content || ""
      end

      support_task = Braintrust::Task.new("support") do |input:|
        messages = [{role: "system", content: "You are a helpful customer support agent."}]

        messages << {role: "user", content: input}
        messages << {role: "assistant", content: chat(client, messages)}

        messages << {role: "user", content: "Can you provide more details?"}
        messages << {role: "assistant", content: chat(client, messages)}

        messages << {role: "user", content: "Thank you for your help!"}
        chat(client, messages)
      end

      conversation_coherence = Braintrust::Scorer.new("conversation_coherence") do |trace:|
        next nil unless trace

        thread = trace.thread
        thread_text = thread.map { |msg| "#{msg["role"]}: #{msg["content"]}" }.join("\n\n")

        response = judge_client.chat.completions.create(
          model: "gpt-5-mini",
          messages: [{
            role: "user",
            content: <<~PROMPT
              Evaluate the coherence of this customer support conversation:

              #{thread_text}

              Rate the conversation coherence:
              - "A" for highly coherent with natural flow and consistent context
              - "B" for mostly coherent with minor gaps or context issues
              - "C" for incoherent, disjointed, or lost context

              Return only the letter (A, B, or C).
            PROMPT
          }]
        )

        rating = (response.choices.first.message.content || "C").strip.upcase
        scores = {"A" => 1.0, "B" => 0.6, "C" => 0.0}

        {
          name: "Conversation coherence",
          score: scores.fetch(rating, 0.0),
          metadata: {rating: rating, thread_length: thread.length}
        }
      end

      Braintrust::Eval.run(
        project: "Support Conversation Quality",
        cases: SUPPORT_DATASET,
        task: support_task,
        scorers: [conversation_coherence]
      )

      OpenTelemetry.tracer_provider.shutdown
      ```
    </CodeGroup>
  </Tab>

  <Tab title="CLI" icon="terminal">
    Define scorers in TypeScript or Python, then push them to Braintrust:

    <CodeGroup dropdown>
      ```typescript title="trace_llm_scorer.ts" theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      import braintrust from "braintrust";
      import { z } from "zod";

      const project = braintrust.projects.create({ name: "my-project" });

      project.scorers.create({
        name: "Conversation coherence",
        slug: "conversation-coherence",
        description: "Evaluate multi-turn conversation coherence",
        parameters: z.object({
          trace: z.any(),
        }),
        messages: [
          {
            role: "user",
            content: `Evaluate the coherence of this conversation:

      {{thread}}

      Rate the coherence:
      - "A" for highly coherent with natural flow
      - "B" for mostly coherent with minor gaps
      - "C" for incoherent or disjointed`,
          },
        ],
        model: "gpt-5-mini",
        useCot: true,
        choiceScores: {
          A: 1,
          B: 0.6,
          C: 0,
        },
      });
      ```

      ```python title="trace_llm_scorer.py" theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      import braintrust
      from pydantic import BaseModel

      project = braintrust.projects.create(name="my-project")

      class TraceParams(BaseModel):
          trace: dict

      project.scorers.create(
          name="Conversation coherence",
          slug="conversation-coherence",
          description="Evaluate multi-turn conversation coherence",
          parameters=TraceParams,
          messages=[
              {
                  "role": "user",
                  "content": """Evaluate the coherence of this conversation:

      {{thread}}

      Rate the coherence:
      - "A" for highly coherent with natural flow
      - "B" for mostly coherent with minor gaps
      - "C" for incoherent or disjointed""",
              }
          ],
          model="gpt-5-mini",
          use_cot=True,
          choice_scores={
              "A": 1,
              "B": 0.6,
              "C": 0,
          },
      )
      ```
    </CodeGroup>

    Push to Braintrust:

    <CodeGroup>
      ```bash TypeScript theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      bt functions push trace_llm_scorer.ts
      ```

      ```bash Python theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      bt functions push trace_llm_scorer.py
      ```
    </CodeGroup>
  </Tab>

  <Tab title="UI" icon="mouse-pointer-2">
    1. Go to [**<Icon icon="triangle" /> Scorers**](https://www.braintrust.dev/app/~/scorers) > **+ Scorer**.
    2. Enter a scorer name and slug.
    3. Select **LLM-as-a-judge**.
    4. Configure:
       * **Prompt**: Use the `{{thread}}` variable to reference the conversation thread.
       * **Model**: Which model to use as judge
       * **Choice scores**: Map model choices (A, B, C) to numeric scores
       * **Use CoT**: Enable chain-of-thought reasoning for complex evaluations
    5. Click **Save as custom scorer**.
  </Tab>
</Tabs>

## Set pass thresholds

Define a minimum acceptable score to automatically mark results as passing or failing. Scores that meet or exceed the threshold are marked as **passing** (green highlighting with a checkmark), and scores below it are marked as **failing** (red highlighting).

<Note>
  Pass thresholds apply only to scorers that output numeric scores. Classifiers, which output labels, don't use them.
</Note>
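
As a worked example of the comparison described above, the following sketch applies a threshold of 0.7 to the choice scores used by the helpfulness scorer on this page. The `passes` function is a hypothetical illustration of the "meets or exceeds" rule, not part of the Braintrust SDK.

```python
CHOICE_SCORES = {"A": 1.0, "B": 0.5, "C": 0.0}
PASS_THRESHOLD = 0.7


def passes(score: float, threshold: float = PASS_THRESHOLD) -> bool:
    """A score passes when it meets or exceeds the threshold."""
    return score >= threshold


for choice, score in CHOICE_SCORES.items():
    status = "pass" if passes(score) else "fail"
    print(f"{choice}: {score} -> {status}")
```

With these values only choice `A` passes, because `B` maps to 0.5, which is below 0.7. When choice scores are discrete like this, it helps to pick a threshold that falls between two of the mapped values.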

<Tabs className="tabs-border">
  <Tab title="CLI" icon="terminal">
    Add `__pass_threshold` to the scorer's metadata (value between 0 and 1):

    <CodeGroup dropdown>
      ```typescript theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      project.scorers.create({
        name: "Helpfulness scorer",
        slug: "helpfulness-scorer",
        messages: [
          {
            role: "user",
            content: 'Rate the helpfulness of this response: {{output}}\n\nReturn "A" for very helpful, "B" for somewhat helpful, "C" for not helpful.',
          },
        ],
        model: "gpt-5-mini",
        choiceScores: { A: 1, B: 0.5, C: 0 },
        metadata: {
          __pass_threshold: 0.7,
        },
      });
      ```

      ```python theme={"theme":{"light":"github-light","dark":"github-dark-dimmed"}}
      project.scorers.create(
          name="Helpfulness scorer",
          slug="helpfulness-scorer",
          messages=[
              {
                  "role": "user",
                  "content": 'Rate the helpfulness of this response: {{output}}\n\nReturn "A" for very helpful, "B" for somewhat helpful, "C" for not helpful.',
              }
          ],
          model="gpt-5-mini",
          choice_scores={"A": 1, "B": 0.5, "C": 0},
          metadata={"__pass_threshold": 0.7},
      )
      ```
    </CodeGroup>
  </Tab>

  <Tab title="UI" icon="mouse-pointer-2">
    When creating or editing a scorer in the UI:

    1. Look for the **Pass threshold** slider in the scorer configuration.
    2. Drag the slider to set your minimum acceptable score (0–1).
    3. Click **Save as custom scorer**.
  </Tab>
</Tabs>

## Apply classification labels

An LLM judge can also power a [classifier](/evaluate/write-scorers#classifiers). The difference is the output: a numeric judge maps the model's choices to scores, while a classifier keeps each choice as a label. The model selects one choice from the fixed set you define.

That choice becomes both the `id` and `label` of the resulting classification, the scorer's name becomes the `name`, and the model's reasoning is stored in `metadata`. Because the model picks a single choice, an LLM-as-a-judge classifier always returns one label.
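
The mapping described above can be sketched as a plain dictionary. This is an illustration only: `to_classification` is a hypothetical helper, the example labels are made up, and storing the reasoning under a `reasoning` key is an assumption about how `metadata` is laid out.

```python
def to_classification(scorer_name: str, choice: str, reasoning: str) -> dict:
    """Build a classification-shaped record from the judge's single choice."""
    return {
        "name": scorer_name,  # the scorer's name
        "id": choice,  # the chosen label doubles as the id
        "label": choice,
        # Assumption: the model's reasoning is stored under a "reasoning" key.
        "metadata": {"reasoning": reasoning},
    }


result = to_classification(
    scorer_name="Intent",
    choice="billing",
    reasoning="The user asks about an unexpected invoice.",
)
print(result)
```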

<Note>
  LLM-as-a-judge classifiers can only be created in the UI. Unlike LLM-as-a-judge scorers, they can't be defined in code. To classify with a model in code instead, write a [custom code classifier](/evaluate/custom-code#apply-classification-labels) that calls an LLM as needed.
</Note>

To create an LLM-as-a-judge classifier:

1. Go to [**<Icon icon="triangle" /> Scorers**](https://www.braintrust.dev/app/~/scorers) and create a scorer. Under **Type**, choose **LLM judge**.
2. Select a **Model** to run the judge. An LLM-as-a-judge classifier requires a model that supports both streaming and tool use.
3. Write the **Messages**: a user message that passes in the content to evaluate (for example, `{{input}}`), and a system message with the rubric that describes each label and when to choose it.
4. Set **Output type** to **Classification**.
5. Under **Classifications**, add each label the model can choose. Labels must be unique, and the model is forced to pick exactly one through a tool schema. Optionally enable **Allow "No match"** to let the model return no label when none fits.
6. Optionally enable **Use chain of thought (CoT)** so the model reasons before choosing. Its reasoning is saved to the classification's `metadata`.
7. Click **Save as custom scorer**.
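
To show how a tool schema can force exactly one label, here is a minimal sketch that builds a JSON Schema with an `enum` of the allowed choices. The schema Braintrust actually generates is not shown on this page, so `build_choice_schema`, the `choice` property name, and the `no_match` sentinel are all assumptions made for illustration.

```python
import json


def build_choice_schema(labels: list[str], allow_no_match: bool = False) -> dict:
    """Build a JSON Schema that restricts the judge to one of the given labels."""
    if len(set(labels)) != len(labels):
        raise ValueError("Labels must be unique")

    choices = list(labels)
    if allow_no_match:
        # Assumption: "no match" is modeled as an extra sentinel choice.
        choices.append("no_match")

    return {
        "type": "object",
        "properties": {
            # A single string constrained to the enum means exactly one label is picked.
            "choice": {"type": "string", "enum": choices},
        },
        "required": ["choice"],
        "additionalProperties": False,
    }


schema = build_choice_schema(["billing", "technical", "account"], allow_no_match=True)
print(json.dumps(schema, indent=2))
```

Because the `choice` property is a required string limited to the enum, a model that supports tool use cannot return two labels or a label outside the list, which matches the requirement in step 2 for a model with tool support.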

Select the saved classifier when you [run an experiment](/evaluate/run-evaluations) or [score production traces](/evaluate/score-online), just like a scorer.

Like any LLM judge, a classifier can run at span or trace scope. See [Scorer and classifier scopes](/evaluate/write-scorers#scorer-and-classifier-scopes).

<Note>
  On self-hosted deployments, classifiers require data plane v2.0 or later.
</Note>

## Next steps

* [Autoevals](/evaluate/autoevals) for pre-built scorers you can drop in without writing a prompt
* [Custom code](/evaluate/custom-code) for deterministic logic or when you need full control
* [Run evaluations](/evaluate/run-evaluations) using your scorers
* [Score production logs](/evaluate/score-online) with online scoring rules
