Evaluations

iOS, macOS, tvOS, watchOS, visionOS

Evaluations provides a framework for systematically assessing intelligence-powered features built on Foundation Models. You define an Evaluation over an EvaluationSubject such as a ModelSubject, load test datasets through a Loader like JSONLoader, ArrayLoader, or StreamLoader, and use a SampleGenerator to produce model responses from a ModelSample. You score those responses with a Metric and an Evaluator, aggregate outcomes with a MetricsAggregator and AggregateMetric, and apply model-as-judge scoring through ModelJudgeEvaluator and ModelJudgePrompt using a ScoringScale and ScoreDimension. The framework also includes tool-call assessment with ToolCallEvaluator, ToolExpectation, and TrajectoryExpectation, returning structured EvaluationResult values.

Essentials 5

Define an evaluation and the subject it runs against.

  • Pr
    Evaluation
    A type that defines an evaluation.
  • Pr
    EvaluationSubject
    A type that represents the output produced by the system under test.
  • St
    ModelSubject
    The subject type for language model evaluations.
  • St
    EvaluationContext
    A context that provides the evaluation result within a test scope.
  • St
    EvaluationTrait
    A test trait that runs an evaluation and records the result as attachments.

Loading Datasets 9

Read test samples from files, arrays, or streams into an evaluation.

  • Pr
    Loader
    A protocol for types that supply a dataset for evaluation.
  • St
    JSONLoader
    A loader backed by a JSON or JSONL file.
  • St
    ArrayLoader
    A loader backed by an in-memory array.
  • St
    StreamLoader
    A loader backed by a custom async sequence.
  • Pr
    SampleProtocol
    A type that defines evaluation samples.
  • Pr
    ModelSampleProtocol
    A type that defines language model evaluation samples with prompt, instructions, and expectations.
  • St
    ModelSample
    A general-purpose language model evaluation sample.
  • St
    ModelSampleInput
    The data sent to a language model for evaluation.
  • St
    ModelSampleOutput
    The expected output value and evaluation expectations for a sample.
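
To make the JSONL dataset format concrete, here is a minimal sketch that decodes one JSON object per line using only Foundation. It deliberately does not call JSONLoader or construct a ModelSample, because their initializers are not shown on this page. `DemoSample`, `decodeJSONL`, and the field names `prompt` and `expected` are hypothetical and chosen for illustration only.

```swift
import Foundation

// Hypothetical stand-in for a dataset row; not the framework's ModelSample.
struct DemoSample: Codable, Equatable {
    let prompt: String
    let expected: String
}

// Decodes JSONL text: one JSON object per line. Empty lines are skipped.
func decodeJSONL(_ text: String) throws -> [DemoSample] {
    let decoder = JSONDecoder()
    return try text
        .split(whereSeparator: \.isNewline)
        .map { line in
            try decoder.decode(DemoSample.self, from: Data(line.utf8))
        }
}

let jsonl = """
{"prompt": "Capital of France?", "expected": "Paris"}
{"prompt": "2 + 2", "expected": "4"}
"""

let samples = try decodeJSONL(jsonl)
print(samples.count) // 2
```

A real dataset would go through one of the loaders listed above; the sketch only shows the shape such a file might take.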

Generating Responses 3

Produce model responses from samples for evaluation.

  • Ac
    SampleGenerator
    An actor that generates evaluation samples using a language model.
  • St
    StructuredTranscript
    A structured record of a model interaction used during evaluation.
  • En
    StructuredValue
    A type-safe representation of JSON values.

Scoring and Evaluators 4

Score model responses with metrics and pluggable evaluators.

  • Pr
    EvaluatorProtocol
    A type that evaluates subjects and produces metrics.
  • St
    Evaluator
    A closure-based evaluator.
  • St
    Metric
    A named metric that carries a result value.
  • Pr
    ScoreLevel
    A type that defines individual levels within a scoring scale.
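
The idea of a closure-based evaluator that produces a named metric can be sketched in plain Swift. This is an assumption-laden illustration, not the framework's API: `DemoMetric`, `DemoEvaluator`, and the `exact_match` metric name are hypothetical, and the real Evaluator and Metric types may have different initializers and value types.

```swift
import Foundation

// Hypothetical stand-in for a named metric; not the framework's Metric.
struct DemoMetric: Equatable {
    let name: String
    let value: Double
}

// Hypothetical closure-based evaluator; not the framework's Evaluator.
struct DemoEvaluator {
    let name: String
    // Scores a model response against the expected answer.
    let score: (_ response: String, _ expected: String) -> Double

    func evaluate(response: String, expected: String) -> DemoMetric {
        DemoMetric(name: name, value: score(response, expected))
    }
}

// Exact-match scoring after trimming whitespace and ignoring case.
let exactMatch = DemoEvaluator(name: "exact_match") { response, expected in
    let normalize: (String) -> String = {
        $0.trimmingCharacters(in: .whitespacesAndNewlines).lowercased()
    }
    return normalize(response) == normalize(expected) ? 1.0 : 0.0
}

let metric = exactMatch.evaluate(response: " paris\n", expected: "Paris")
print(metric.value) // 1.0
```

Keeping the scoring logic in a closure means a new metric only needs a name and a function, which is the appeal of a closure-based design over a dedicated type per metric.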

Model-as-Judge Scoring 6

Use a model to judge responses along defined dimensions and scales.

  • St
    ModelJudgeEvaluator
    An evaluator that uses a language model as a judge to score responses.
  • St
    ModelJudgePrompt
    A configuration for how a model-as-judge evaluator constructs its prompt.
  • St
    ScoringScale
    A scoring scale that defines the set of options a judge can assign.
  • St
    ScaleOption
    A single option in a scoring scale.
  • St
    ScoreDimension
    A named scoring dimension for a model judge evaluator.
  • En
    ScoringMode
    The scoring constraint mode for a model-as-judge evaluator.

Tool-Call Assessment 5

Evaluate whether a model invokes tools and follows expected trajectories.

  • St
    ToolCallEvaluator
    An evaluator that verifies agentic tool calls against an expected trajectory.
  • St
    ToolExpectation
    A specification for an expected tool call, or a group of expectations.
  • St
    TrajectoryExpectation
    The expected pattern of tool calls for an evaluation.
  • En
    ArgumentMatcher
    The values that define how to validate a tool-call argument.
  • En
    ArgumentValue
    A primitive value type for argument specifications that is @Generable.
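
One way to think about checking tool calls against an expected trajectory is an in-order subsequence match: every expected tool must appear, in order, with other calls allowed in between. The sketch below shows that one matching policy in plain Swift. It is not how ToolCallEvaluator or TrajectoryExpectation are implemented; `DemoToolCall`, `followsTrajectory`, and the tool names are hypothetical.

```swift
// Hypothetical stand-in for a recorded tool call; not a framework type.
struct DemoToolCall {
    let name: String
    let arguments: [String: String]
}

// Returns true when every expected tool name appears in `calls` in the given
// order. Other calls may occur in between (in-order subsequence match).
func followsTrajectory(_ calls: [DemoToolCall], expected: [String]) -> Bool {
    var remaining = expected[...]
    for call in calls {
        if let next = remaining.first, call.name == next {
            remaining = remaining.dropFirst()
        }
    }
    return remaining.isEmpty
}

let calls = [
    DemoToolCall(name: "searchFlights", arguments: ["to": "SFO"]),
    DemoToolCall(name: "getWeather", arguments: ["city": "San Francisco"]),
    DemoToolCall(name: "bookFlight", arguments: ["id": "UA100"]),
]

print(followsTrajectory(calls, expected: ["searchFlights", "bookFlight"])) // true
print(followsTrajectory(calls, expected: ["bookFlight", "searchFlights"])) // false
```

Stricter policies, such as requiring an exact sequence or validating argument values, would need additional checks beyond this sketch.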

Aggregating Metrics 3

Combine per-sample outcomes into aggregate measures.

  • St
    MetricsAggregator
    A utility for computing aggregate statistics from evaluation metrics.
  • St
    AggregateMetric
    An aggregate statistic computed from a metric's results across the evaluation dataset.
  • En
    AggregationOperation
    The type of aggregation operation used to compute a summary statistic.
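
Aggregating per-sample scores into a summary statistic can be illustrated with a few lines of plain Swift. This sketch assumes mean, minimum, and maximum as the operations; the actual cases of AggregationOperation are not listed on this page, and `DemoAggregation` and `aggregate` are hypothetical names.

```swift
// Hypothetical set of summary statistics; not the framework's AggregationOperation.
enum DemoAggregation {
    case mean, minimum, maximum
}

// Computes one summary statistic over per-sample scores.
// Returns nil when there are no scores to aggregate.
func aggregate(_ values: [Double], using operation: DemoAggregation) -> Double? {
    guard !values.isEmpty else { return nil }
    switch operation {
    case .mean: return values.reduce(0, +) / Double(values.count)
    case .minimum: return values.min()
    case .maximum: return values.max()
    }
}

// Four per-sample exact-match scores: three passes and one failure.
let scores = [1.0, 0.0, 1.0, 1.0]
let meanScore = aggregate(scores, using: .mean)
print(meanScore ?? .nan) // 0.75
```

Returning an optional rather than zero for an empty dataset keeps "no data" distinct from "every sample failed".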

Results 2

Structured outcomes returned from running an evaluation.

  • St
    EvaluationResult
    The results of running a model evaluation.
  • St
    ResultColumn
    A typed descriptor for a column in an evaluation result DataFrame.

Errors 3

Errors raised while running, scoring, or reading results from evaluations.

  • En
    EvaluationError
    Errors thrown during an evaluation run.
  • En
    EvaluationResultsError
    Errors the framework throws when parsing evaluation results.
  • En
    ModelJudgeError
    Errors that can occur during model-as-judge scoring.

Structures 1

  • St
    EvaluatorsBuilder
    A result builder that enables declarative evaluator lists.

Extends 5

Transcript, TestTrait, DataFrame, Collection, Array