Human review is a critical part of evaluating AI applications. While Braintrust helps you automatically evaluate AI software with scorers, human feedback provides essential ground truth and quality assessment. Braintrust integrates human feedback from end users, subject matter experts, and product teams in one place. Use human review to:
  • Evaluate and compare experiments.
  • Assess the efficacy of automated scoring methods.
  • Curate production logs into evaluation datasets.
  • Label categorical data and provide corrections.
  • Track quality trends over time.
A typical workflow has three stages, each covered on its own page:
  1. Configure review scores (this page) so reviewers have something to capture.
  2. Score traces and datasets to record judgments row by row.
  3. Manage review work to assign, filter, and track review across your team.
When the same span is scored by more than one person, see Review with multiple reviewers for how Braintrust combines their scores.

Configure review scores

Review scores appear in all logs and experiments in a project. Use them for quality control, data labeling, or feedback collection.
only available on Pro and Enterprise plans.
  1. Go to Settings > Human review.
  2. Click + Human review score.
  3. Enter a name and description for your score. Descriptions support Markdown.
  4. Select a score type:
    • Categorical score: Predefined options with assigned scores. Each option gets a unique percentage value between 0% and 100% (stored as 0 to 1). Use for classification tasks like sentiment or correctness categories. Can also write to the expected field instead of creating a score.
    • Continuous score: Numeric values between 0% and 100% with a slider input control. Use for subjective quality assessments like helpfulness or tone.
    • Free-form input: String values written to the metadata field at a specified path. Use for explanations, corrections, or structured feedback.
  5. (Optional) Expand Score visibility to configure who sees this score during review:
    • Select members or permission groups to limit visibility to specific reviewers. If you don’t select anyone, the score is visible to everyone.
    • Click + Condition to show the score only when a filter condition is true, such as when another score exceeds a threshold. See Show scores conditionally for details.
  6. Click Save.
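To make the three score types concrete, here is a minimal Python sketch of how a reviewer's input might land on a span record. The record shape (`scores` and `metadata` dictionaries), the class names, and the dot-separated metadata path are assumptions for illustration, not Braintrust's API. It also assumes continuous scores use the same 0 to 1 storage as categorical scores, and it does not model writing a categorical choice to the expected field.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class CategoricalScore:
    """Predefined options, each mapped to a unique percentage (0-100)."""
    name: str
    options: dict[str, float]

    def __post_init__(self) -> None:
        pct = list(self.options.values())
        if len(set(pct)) != len(pct):
            raise ValueError("each option needs a unique percentage")
        if not all(0 <= p <= 100 for p in pct):
            raise ValueError("percentages must be between 0 and 100")

    def apply(self, span: dict[str, Any], choice: str) -> None:
        # Percentages are stored on a 0-1 scale, e.g. 50% -> 0.5.
        span.setdefault("scores", {})[self.name] = self.options[choice] / 100


@dataclass
class ContinuousScore:
    """Slider input between 0% and 100%."""
    name: str

    def apply(self, span: dict[str, Any], percent: float) -> None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        # Assumption: stored on the same 0-1 scale as categorical scores.
        span.setdefault("scores", {})[self.name] = percent / 100


@dataclass
class FreeFormScore:
    """String value written into metadata at a configured path."""
    name: str
    path: str  # Assumption: dot-separated, e.g. "review.notes"

    def apply(self, span: dict[str, Any], text: str) -> None:
        node = span.setdefault("metadata", {})
        *parents, leaf = self.path.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = text


span: dict[str, Any] = {}
CategoricalScore("sentiment", {"negative": 0, "neutral": 50, "positive": 100}).apply(span, "neutral")
ContinuousScore("helpfulness").apply(span, 80)
FreeFormScore("notes", "review.notes").apply(span, "Cites the wrong policy.")
print(span)
```

Keeping the unique-percentage check in the categorical type mirrors the rule that each option gets its own value, so two options can never collapse to the same stored score.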
Score visibility controls which reviewers see a score in the review modal and helps declutter the review experience for large teams. It is not an access control or security boundary: a reviewer whose scores are hidden can reveal them with the Show all scores toggle.
You can also create human review scores as you review traces. In the trace view, click + Human review score and define the score as described above.

Restrict score visibility

By default, every reviewer sees every configured score. Restrict a score to specific members or permission groups so only relevant reviewers see it in the review modal, which keeps the review experience focused for large teams. To set visibility on a new score, expand Score visibility while configuring it (see the steps above) and select the members or permission groups that should see it. To change visibility on an existing score:
  1. Go to Settings > Human review, or open the review panel while reviewing.
  2. Select the edit icon next to the score name.
  3. Expand Score visibility and select the members or permission groups that should see the score. To make it visible to everyone again, deselect all.
  4. Click Save.
If a row has configured scores but none are visible to the current reviewer, the review panel shows "No scores are available to you for this row." Score visibility is a display filter, not an access control rule. Any reviewer who has hidden scores can reveal them with the Show all scores toggle in the review panel. To enforce who can read scores, use project permissions instead.
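The visibility rule can be sketched as a display filter in Python. `Reviewer`, `ScoreVisibility`, and `scores_for_panel` are hypothetical names, and matching reviewers by user ID or permission group name is an assumption. The point is the logic: an empty selection means everyone sees the score, and the Show all scores toggle bypasses the filter entirely, which is why it cannot serve as access control.

```python
from dataclasses import dataclass, field
from typing import Optional

EMPTY_MESSAGE = "No scores are available to you for this row."


@dataclass
class Reviewer:
    user_id: str
    groups: set[str] = field(default_factory=set)


@dataclass
class ScoreVisibility:
    name: str
    members: set[str] = field(default_factory=set)  # selected user IDs
    groups: set[str] = field(default_factory=set)   # selected permission groups

    def visible_to(self, reviewer: Reviewer) -> bool:
        # Nothing selected: the score is visible to everyone.
        if not self.members and not self.groups:
            return True
        return reviewer.user_id in self.members or bool(self.groups & reviewer.groups)


def scores_for_panel(
    scores: list[ScoreVisibility], reviewer: Reviewer, show_all: bool = False
) -> tuple[list[str], Optional[str]]:
    """Return the score names to render and an optional empty-state message.

    show_all models the Show all scores toggle: it bypasses the filter, so
    this is a display rule only, never an access control check.
    """
    names = [s.name for s in scores if show_all or s.visible_to(reviewer)]
    message = EMPTY_MESSAGE if scores and not names else None
    return names, message


scores = [
    ScoreVisibility("medical_accuracy", groups={"clinicians"}),
    ScoreVisibility("tone"),  # no selection: visible to everyone
]
alice = Reviewer("alice", {"support"})
print(scores_for_panel(scores, alice))
print(scores_for_panel(scores[:1], alice))
print(scores_for_panel(scores[:1], alice, show_all=True))
```

Alice, who is not in the hypothetical `clinicians` group, sees only `tone`; on a row whose only score is restricted she gets the empty-state message until she turns on Show all scores.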

Show scores conditionally

You can configure filter conditions that control when a score appears in the review panel. A score with conditions only shows when all its conditions evaluate to true for the span being reviewed. This is useful for dependent workflows. For example, show a detailed quality rubric only when a triage score indicates the trace needs closer review, or surface a correction score only when the expected output matches a specific category. To add conditions to a new score, expand Score visibility while configuring it and click + Condition. To add or edit conditions on an existing score:
  1. Go to Settings > Human review, or open the review panel while reviewing.
  2. Select the edit icon next to the score name.
  3. Expand Score visibility and click + Condition.
  4. Add conditions using SQL syntax. Conditions are organized into three scopes:
    • Span: Evaluates against the current span. Can reference other scores (scores.ScoreName), expected values (expected.field), or metadata (metadata.path).
    • Trace: Evaluates against all spans in the trace and is true when at least one span matches. Can reference span_attributes, metrics, scores, error, and tags.
    • Subspan: Evaluates against all child spans of the current span and is true when at least one child span matches. Uses the same fields as Trace conditions.
    Within each scope, multiple conditions are joined with AND. Conditions across scopes are also joined with AND: all configured scopes must pass for the score to appear.
  5. Click Save.
Score names in the settings table display an indicator icon when conditions are configured. Hover over it to see the full “Show when” expression. User or group visibility and conditional visibility are evaluated together: both must pass for a score to appear. Conditional visibility is a display rule based on the score data, not an access control boundary.
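The scope semantics can be sketched in Python, with plain predicates standing in for SQL conditions. `ShowWhen`, `conditions_pass`, and `score_appears` are hypothetical names. Two behaviors here are assumptions rather than documented facts: all conditions in the Trace or Subspan scope must match on the same span (rather than each on a different span), and Subspan considers only direct children identified by `parent_id`.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

Span = dict[str, Any]
Condition = Callable[[Span], bool]  # stand-in for one SQL condition


@dataclass
class ShowWhen:
    span: list[Condition] = field(default_factory=list)
    trace: list[Condition] = field(default_factory=list)
    subspan: list[Condition] = field(default_factory=list)


def matches(conds: list[Condition], s: Span) -> bool:
    # Conditions within a scope are joined with AND.
    return all(c(s) for c in conds)


def any_match(conds: list[Condition], spans: list[Span]) -> bool:
    # An unconfigured scope passes; a configured one needs at least one matching span.
    return not conds or any(matches(conds, s) for s in spans)


def conditions_pass(rule: ShowWhen, current: Span, trace: list[Span]) -> bool:
    children = [s for s in trace if s.get("parent_id") == current["id"]]
    # Scopes are also joined with AND.
    return (
        matches(rule.span, current)
        and any_match(rule.trace, trace)
        and any_match(rule.subspan, children)
    )


def score_appears(
    rule: ShowWhen, current: Span, trace: list[Span], visible_to_reviewer: bool
) -> bool:
    # User/group visibility and conditional visibility must both pass.
    return visible_to_reviewer and conditions_pass(rule, current, trace)


root = {"id": "root", "parent_id": None, "scores": {"triage": 0.25}, "error": None}
tool = {"id": "tool", "parent_id": "root", "scores": {}, "error": "timeout"}
trace = [root, tool]

rubric = ShowWhen(
    # Roughly: scores.triage < 0.5 (a missing score does not match).
    span=[lambda s: s["scores"].get("triage", 1.0) < 0.5],
    # Roughly: error IS NOT NULL on some child span.
    subspan=[lambda s: s["error"] is not None],
)
print(score_appears(rubric, root, trace, visible_to_reviewer=True))
print(score_appears(rubric, tool, trace, visible_to_reviewer=True))
```

In this example the rubric appears on the root span, because its triage score is low and one of its child spans errored. It stays hidden on the tool span, which has no triage score, and for any reviewer excluded by user or group visibility.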

Create and edit scores inline

While reviewing, create new score types or edit existing configurations without navigating to settings:
  • To create a new score, click + Human review score.
  • To edit an existing score, select the edit icon next to the score name.
Changes apply immediately across your project.
Editing a score configuration affects how that score works going forward. Existing score values on traces remain unchanged.

Annotate in playgrounds

For a lighter-weight alternative to the full review workflow, you can annotate outputs directly in playgrounds and then get prompt improvement suggestions based on your annotations. Playground annotations help with rapid iteration during prompt development, while the Review page is better for systematic evaluation of production logs and experiments.

Capture production feedback

In addition to internal reviews, capture feedback directly from production users. Production feedback helps you understand real-world performance and build datasets from actual user interactions. See Capture user feedback for implementation details and Build datasets from user feedback to learn how to turn feedback into evaluation datasets. You can also use dashboards to monitor user satisfaction trends and correlate automated scores with user feedback.
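On the production side, a common pattern is to map an in-app reaction such as thumbs up/down onto a 0 to 1 score attached to the span that produced the response. The Python sketch below only builds the payload. Its field names (`id`, `scores`, `comment`, `metadata`) are assumed to mirror the arguments of the SDK's feedback-logging call, and `build_feedback` and the `user_rating` score name are hypothetical. See Capture user feedback for the actual call and its signature.

```python
from typing import Any, Optional


def build_feedback(
    span_id: str,
    thumbs_up: bool,
    comment: Optional[str] = None,
    user_id: Optional[str] = None,
) -> dict[str, Any]:
    """Turn a thumbs up/down reaction into a feedback payload for an existing span."""
    payload: dict[str, Any] = {
        "id": span_id,  # the span the end user is reacting to
        "scores": {"user_rating": 1.0 if thumbs_up else 0.0},  # hypothetical score name
    }
    if comment:
        payload["comment"] = comment
    if user_id:
        payload["metadata"] = {"user_id": user_id}
    return payload


payload = build_feedback("span-123", thumbs_up=False, comment="Wrong refund policy", user_id="u-42")
print(payload)
# Assumed SDK usage, not run here: logger.log_feedback(**payload)
```

Recording the end user's ID in metadata makes it easier to filter these rows later when building datasets from user feedback.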

Next steps