Evaluations — Apple framework reference

Evaluations provides a framework for systematically assessing intelligence-powered features built on Foundation Models. You define an Evaluation over an EvaluationSubject such as a ModelSubject, load test datasets through a Loader like JSONLoader, ArrayLoader, or StreamLoader, and use a SampleGenerator to produce model responses from a ModelSample. You score those responses with a Metric and an Evaluator, aggregate outcomes with a MetricsAggregator and AggregateMetric, and apply model-as-judge scoring through ModelJudgeEvaluator and ModelJudgePrompt using a ScoringScale and ScoreDimension. The framework also includes tool-call assessment with ToolCallEvaluator, ToolExpectation, and TrajectoryExpectation, returning structured EvaluationResult values.

Essentials 5

Define an evaluation and the subject it runs against.

Pr
Evaluation
A type that defines an evaluation.
Pr
EvaluationSubject
A type that represents the output produced by the system under test.
St
ModelSubject
The subject type for language model evaluations.
St
EvaluationContext
A context that provides the evaluation result within a test scope.
St
EvaluationTrait
A test trait that runs an evaluation and records the result as attachments.

Loading Datasets 9

Read test samples from files, arrays, or streams into an evaluation.

Pr
Loader
A protocol for types that supply a dataset for evaluation.
St
JSONLoader
A loader backed by a JSON or JSONL file.
St
ArrayLoader
A loader backed by an in-memory array.
St
StreamLoader
A loader backed by a custom async sequence.
Pr
SampleProtocol
A type that defines evaluation samples.
Pr
ModelSampleProtocol
A type that defines language model evaluation samples with prompt, instructions, and expectations.
St
ModelSample
A general-purpose language model evaluation sample.
St
ModelSampleInput
The data sent to a language model for evaluation.
St
ModelSampleOutput
The expected output value and evaluation expectations for a sample.
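
For orientation, the JSONL format that JSONLoader accepts stores one JSON object per line. The sketch below decodes such text using only Foundation. It does not use the Evaluations API: `SketchSample`, its `prompt` and `expected` fields, and `decodeJSONL` are hypothetical names chosen for illustration, and the real sample schema may differ.

```swift
import Foundation

// Hypothetical sample shape for illustration only; this is not the
// framework's ModelSample, whose fields are not listed on this page.
struct SketchSample: Codable, Equatable {
    let prompt: String
    let expected: String
}

/// Decodes JSONL text: one JSON object per non-empty line.
func decodeJSONL(_ text: String) throws -> [SketchSample] {
    let decoder = JSONDecoder()
    var samples: [SketchSample] = []
    for line in text.split(whereSeparator: \.isNewline) {
        let trimmed = line.trimmingCharacters(in: .whitespaces)
        if trimmed.isEmpty { continue }  // skip whitespace-only lines
        samples.append(try decoder.decode(SketchSample.self, from: Data(trimmed.utf8)))
    }
    return samples
}
```

A malformed line makes the whole decode throw, which is usually preferable for test data to silently dropping samples.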

Generating Responses 3

Produce model responses from samples for evaluation.

Ac
SampleGenerator
An actor that generates evaluation samples using a language model.
St
StructuredTranscript
A structured record of a model interaction used during evaluation.
En
StructuredValue
A type-safe representation of JSON values.

Scoring and Evaluators 4

Score model responses with metrics and pluggable evaluators.

Pr
EvaluatorProtocol
A type that evaluates subjects and produces metrics.
St
Evaluator
A closure-based evaluator.
St
Metric
A named metric that carries a result value.
Pr
ScoreLevel
A type that defines individual levels within a scoring scale.
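
The idea of a closure-based evaluator that yields a named metric can be sketched in plain Swift. This is a conceptual illustration only: `SketchMetric`, `SketchEvaluator`, and `exactMatch` are hypothetical stand-ins, and the initializers and signatures of the framework's Evaluator and Metric types are not shown on this page and may look quite different.

```swift
// Hypothetical stand-in for a named metric that carries a result value.
struct SketchMetric: Equatable {
    let name: String
    let value: Double
}

// Hypothetical stand-in for a closure-based evaluator: the scoring logic
// is supplied as a closure and wrapped into a named metric.
struct SketchEvaluator {
    let name: String
    let score: (_ response: String, _ expected: String) -> Double

    func evaluate(response: String, expected: String) -> SketchMetric {
        SketchMetric(name: name, value: score(response, expected))
    }
}

// Scores 1.0 when the response equals the expected text, 0.0 otherwise.
let exactMatch = SketchEvaluator(name: "exact_match") { response, expected in
    response == expected ? 1.0 : 0.0
}
```

Keeping the scoring logic in a closure lets simple checks such as exact match sit alongside heavier evaluators without defining a new type for each one.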

Model-as-Judge Scoring 6

Use a model to judge responses along defined dimensions and scales.

St
ModelJudgeEvaluator
An evaluator that uses a language model as a judge to score responses.
St
ModelJudgePrompt
A configuration for how a model-as-judge evaluator constructs its prompt.
St
ScoringScale
A scoring scale that defines the set of options a judge can assign.
St
ScaleOption
A single option in a scoring scale.
St
ScoreDimension
A named scoring dimension for a model judge evaluator.
En
ScoringMode
The scoring constraint mode for a model-as-judge evaluator.

Tool-Call Assessment 5

Evaluate whether a model invokes tools and follows expected trajectories.

St
ToolCallEvaluator
An evaluator that verifies agentic tool calls against an expected trajectory.
St
ToolExpectation
A specification for an expected tool call, or a group of expectations.
St
TrajectoryExpectation
The expected pattern of tool calls for an evaluation.
En
ArgumentMatcher
The values that define how to validate a tool-call argument.
En
ArgumentValue
A primitive value type, marked @Generable, for argument specifications.
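
One plausible way to compare observed tool calls with an expected trajectory is an in-order match that tolerates extra calls in between. The sketch below shows that policy over plain tool names. It is an assumption about what such a check could look like, not the matching rule ToolCallEvaluator applies; `containsInOrder` and the tool names in the usage note are hypothetical.

```swift
/// Returns true when every name in `expected` appears in `actual`
/// in the same relative order; unrelated calls in between are allowed.
func containsInOrder(_ expected: [String], in actual: [String]) -> Bool {
    var remaining = expected[...]
    for call in actual {
        // Consume the next expected name when the observed call matches it.
        if call == remaining.first {
            remaining = remaining.dropFirst()
        }
    }
    return remaining.isEmpty
}
```

For example, `containsInOrder(["search", "book"], in: ["search", "weather", "book"])` holds, while reversing the expected order does not. A stricter policy would instead require `expected == actual`.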

Aggregating Metrics 3

Combine per-sample outcomes into aggregate measures.

St
MetricsAggregator
A utility for computing aggregate statistics from evaluation metrics.
St
AggregateMetric
An aggregate statistic computed from a metric's results across the evaluation dataset.
En
AggregationOperation
The type of aggregation operation used to compute a summary statistic.
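
To make "aggregate statistic" concrete, the sketch below reduces per-sample scores to a single summary value. The operations shown (mean, minimum, maximum, and a thresholded pass rate) are illustrative assumptions, not the cases of AggregationOperation, which this page does not list; `SketchAggregation` and `aggregate` are hypothetical names.

```swift
// Hypothetical set of operations for illustration only.
enum SketchAggregation {
    case mean
    case min
    case max
    case passRate(threshold: Double)
}

/// Reduces per-sample scores to one summary value.
/// Returns nil for an empty dataset, where no statistic is defined.
func aggregate(_ values: [Double], using op: SketchAggregation) -> Double? {
    guard !values.isEmpty else { return nil }
    switch op {
    case .mean:
        return values.reduce(0, +) / Double(values.count)
    case .min:
        return values.min()
    case .max:
        return values.max()
    case .passRate(let threshold):
        // Fraction of samples whose score meets or exceeds the threshold.
        let passed = values.filter { $0 >= threshold }.count
        return Double(passed) / Double(values.count)
    }
}
```

Returning an optional instead of 0 for an empty input keeps "no data" distinguishable from "every sample failed".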

Results 2

Structured outcomes returned from running an evaluation.

St
EvaluationResult
The results of running a model evaluation.
St
ResultColumn
A typed descriptor for a column in an evaluation result DataFrame.

Errors 3

Errors raised while running, scoring, or reading results from evaluations.

En
EvaluationError
Errors thrown during an evaluation run.
En
EvaluationResultsError
Errors the framework throws when parsing evaluation results.
En
ModelJudgeError
Errors that can occur during model-as-judge scoring.

Structures 1

St
EvaluatorsBuilder
A result builder that enables declarative evaluator lists.

Extends 5
