- Evaluate and compare experiments.
- Assess the efficacy of automated scoring methods.
- Curate production logs into evaluation datasets.
- Label categorical data and provide corrections.
- Track quality trends over time.
- Configure review scores (this page) so reviewers have something to capture.
- Score traces and datasets to record judgments row by row.
- Manage review work to assign, filter, and track review across your team.
Configure review scores
Review scores appear in all logs and experiments in a project. Use them for quality control, data labeling, or feedback collection. Available only on Pro and Enterprise plans.
- Go to Settings > Human review.
- Click + Human review score.
- Enter a name and description for your score. Descriptions support Markdown.
- Select a score type:
  - Categorical score: Predefined options with assigned scores. Each option gets a unique percentage value between 0% and 100% (stored as 0 to 1). Use for classification tasks like sentiment or correctness categories. Also supports writing to the expected field instead of creating a score.
  - Continuous score: Numeric values between 0% and 100% with a slider input control. Use for subjective quality assessments like helpfulness or tone.
  - Free-form input: String values written to the metadata field at a specified path. Use for explanations, corrections, or structured feedback.
- (Optional) Expand Score visibility to configure who sees this score during review:
  - Select members or permission groups to limit visibility to specific reviewers. If you don’t select anyone, the score is visible to everyone.
  - Click + Condition to show the score only when a filter condition is true, such as when another score exceeds a threshold. See Show scores conditionally for details.
- Click Save.
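To make the three score types concrete, here is a minimal Python sketch of how their values could be modeled. It is not the product's API: the class names, the `stored_value` and `write` helpers, and the dot-separated metadata path format are all assumptions made for illustration. The only behavior taken from the descriptions above is that option values are unique percentages between 0% and 100%, that percentages are stored as 0 to 1, and that free-form input is a string written to the metadata field at a path.

```python
from dataclasses import dataclass


@dataclass
class CategoricalScore:
    """Hypothetical model of a categorical review score."""
    name: str
    options: dict[str, float]  # option label -> percentage (0 to 100)

    def __post_init__(self) -> None:
        values = list(self.options.values())
        if any(not 0 <= v <= 100 for v in values):
            raise ValueError("option values must be between 0% and 100%")
        if len(set(values)) != len(values):
            raise ValueError("each option needs a unique value")

    def stored_value(self, option: str) -> float:
        # Percentages are stored on a 0-to-1 scale.
        return self.options[option] / 100


@dataclass
class ContinuousScore:
    """Hypothetical model of a slider-based review score."""
    name: str

    def stored_value(self, percent: float) -> float:
        if not 0 <= percent <= 100:
            raise ValueError("value must be between 0% and 100%")
        return percent / 100


@dataclass
class FreeFormInput:
    """Hypothetical model of free-form text written into metadata."""
    name: str
    metadata_path: str  # assumed dot-separated, e.g. "review.correction"

    def write(self, metadata: dict, text: str) -> None:
        # Walk (and create) nested dicts, then set the final key.
        keys = self.metadata_path.split(".")
        node = metadata
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node[keys[-1]] = text


sentiment = CategoricalScore(
    "sentiment", {"negative": 0, "neutral": 50, "positive": 100}
)
helpfulness = ContinuousScore("helpfulness")
correction = FreeFormInput("correction", "review.correction")

row_metadata: dict = {}
correction.write(row_metadata, "The answer should cite the 2023 policy.")
```

In this sketch, choosing "neutral" records 0.5 and a slider position of 75% records 0.75, which mirrors the "stored as 0 to 1" rule.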
Score visibility controls which reviewers see a score in the review modal. It declutters the review experience for large teams. It is not an access control or security boundary: any reviewer with hidden scores can reveal them with the Show all scores toggle.
Restrict score visibility
By default, every reviewer sees every configured score. Restrict a score to specific members or permission groups so only relevant reviewers see it in the review modal, which keeps the review experience focused for large teams.
To set visibility on a new score, expand Score visibility while configuring it (see the steps above) and select the members or permission groups that should see it.
To change visibility on an existing score:
- Go to Settings > Human review, or open the review panel while reviewing.
- Select the edit icon next to the score name.
- Expand Score visibility and select the members or permission groups that should see the score. To make it visible to everyone again, deselect all.
- Click Save.
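The visibility rules described in this section can be summarized in a short Python sketch. The `ScoreConfig` class and `scores_shown` function are hypothetical names invented for illustration, not part of any SDK. The logic follows the text: a score with no selected members or groups is visible to everyone, and the Show all scores toggle reveals hidden scores, which is why visibility is not an access control.

```python
from dataclasses import dataclass, field


@dataclass
class ScoreConfig:
    """Hypothetical score configuration with a visibility list."""
    name: str
    # Member or permission group names; empty means visible to everyone.
    visible_to: set[str] = field(default_factory=set)


def scores_shown(
    configs: list[ScoreConfig],
    reviewer: str,
    reviewer_groups: list[str],
    show_all: bool = False,
) -> list[str]:
    """Return the score names a reviewer sees in the review modal."""
    if show_all:
        # The toggle reveals every score, so hiding is not a security boundary.
        return [c.name for c in configs]
    identities = {reviewer, *reviewer_groups}
    return [
        c.name
        for c in configs
        if not c.visible_to or c.visible_to & identities
    ]


configs = [
    ScoreConfig("Helpfulness"),                       # everyone
    ScoreConfig("Legal accuracy", {"legal-team"}),    # one group
    ScoreConfig("Tone", {"alice"}),                   # one member
]

default_view = scores_shown(configs, "bob", ["support-team"])
revealed_view = scores_shown(configs, "bob", ["support-team"], show_all=True)
```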
Show scores conditionally
You can configure filter conditions that control when a score appears in the review panel. A score with conditions only shows when all its conditions evaluate to true for the span being reviewed.
This is useful for dependent workflows. For example, show a detailed quality rubric only when a triage score indicates the trace needs closer review, or surface a correction score only when the expected output matches a specific category.
To add conditions to a new score, expand Score visibility while configuring it and click + Condition.
To add or edit conditions on an existing score:
- Go to Settings > Human review, or open the review panel while reviewing.
- Select the edit icon next to the score name.
- Expand Score visibility and click + Condition.
- Add conditions using SQL syntax. Conditions are organized into three scopes:
  - Span: Evaluates against the current span. Can reference other scores (scores.ScoreName), expected values (expected.field), or metadata (metadata.path).
  - Trace: Evaluates against all spans in the trace and is true when at least one span matches. Can reference span_attributes, metrics, scores, error, and tags.
  - Subspan: Evaluates against all child spans of the current span and is true when at least one child span matches. Uses the same fields as Trace conditions.
- Click Save.
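The way the three scopes combine can be sketched in Python. This is an illustrative model only: real conditions are written in SQL syntax, whereas here plain Python predicates stand in for them, and the `score_is_shown` function and the dictionary shape of a span are assumptions. The sketch encodes what the text states: every condition must be true, Span conditions test the current span, and Trace and Subspan conditions are true when at least one span in their scope matches.

```python
from typing import Callable, Iterable

Span = dict
Condition = Callable[[Span], bool]


def score_is_shown(
    current: Span,
    trace_spans: list[Span],
    child_spans: list[Span],
    span_conds: Iterable[Condition] = (),
    trace_conds: Iterable[Condition] = (),
    subspan_conds: Iterable[Condition] = (),
) -> bool:
    """Return True only when every configured condition holds."""
    # Span scope: each condition must hold for the current span.
    span_ok = all(cond(current) for cond in span_conds)
    # Trace scope: each condition must match at least one span in the trace.
    trace_ok = all(any(cond(s) for s in trace_spans) for cond in trace_conds)
    # Subspan scope: each condition must match at least one child span.
    subspan_ok = all(
        any(cond(s) for s in child_spans) for cond in subspan_conds
    )
    return span_ok and trace_ok and subspan_ok


# Stand-in for a Span condition such as: scores.Triage > 0.5 (illustrative).
def needs_review(span: Span) -> bool:
    return span.get("scores", {}).get("Triage", 0) > 0.5


# Stand-in for a Trace condition that checks the error field (illustrative).
def has_error(span: Span) -> bool:
    return span.get("error") is not None


current = {"scores": {"Triage": 0.8}}
failing_sibling = {"error": "timeout"}

shown = score_is_shown(
    current,
    trace_spans=[current, failing_sibling],
    child_spans=[],
    span_conds=[needs_review],
    trace_conds=[has_error],
)
hidden = score_is_shown(
    current,
    trace_spans=[current],
    child_spans=[],
    span_conds=[needs_review],
    trace_conds=[has_error],
)
```

Note that in this model a Subspan condition on a span with no children is never satisfied, because no child span exists to match it.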
Create and edit scores inline
While reviewing, create new score types or edit existing configurations without navigating to settings:
- To create a new score, click + Human review score.
- To edit an existing score, select the edit icon next to the score name.
Editing a score configuration affects how that score works going forward. Existing score values on traces remain unchanged.
Annotate in playgrounds
For a lighter-weight alternative to the full review workflow, you can annotate outputs directly in playgrounds and then get prompt improvement suggestions based on your annotations. Playground annotations help with rapid iteration during prompt development, while the Review page is better for systematic evaluation of production logs and experiments.
Capture production feedback
In addition to internal reviews, capture feedback directly from production users. Production feedback helps you understand real-world performance and build datasets from actual user interactions. See Capture user feedback for implementation details and Build datasets from user feedback to learn how to turn feedback into evaluation datasets. You can also use dashboards to monitor user satisfaction trends and correlate automated scores with user feedback.
Next steps
- Score traces and datasets to start recording human judgments
- Manage review work across your team
- Add labels and corrections to categorize and tag traces
- Build datasets from reviewed logs
- Run evaluations with human-reviewed datasets