Evaluations provides a framework for systematically assessing intelligence-powered features built on Foundation Models. You define an Evaluation over an EvaluationSubject such as a ModelSubject, load test datasets through a Loader like JSONLoader, ArrayLoader, or StreamLoader, and use a SampleGenerator to produce model responses from a ModelSample. You score those responses with a Metric and an Evaluator, aggregate outcomes with a MetricsAggregator and AggregateMetric, and apply model-as-judge scoring through ModelJudgeEvaluator and ModelJudgePrompt using a ScoringScale and ScoreDimension. The framework also includes tool-call assessment with ToolCallEvaluator, ToolExpectation, and TrajectoryExpectation, returning structured EvaluationResult values.
Essentials
Define an evaluation and the subject it runs against.
- Evaluation (protocol): A type that defines an evaluation.
- EvaluationSubject (protocol): A type that represents the output produced by the system under test.
- ModelSubject (struct): The subject type for language model evaluations.
- EvaluationContext (struct): A context that provides the evaluation result within a test scope.
- EvaluationTrait (struct): A test trait that runs an evaluation and records the result as attachments.
Loading Datasets
Read test samples from files, arrays, or streams into an evaluation.
- Loader (protocol): A protocol for types that supply a dataset for evaluation.
- JSONLoader (struct): A loader backed by a JSON or JSONL file.
- ArrayLoader (struct): A loader backed by an in-memory array.
- StreamLoader (struct): A loader backed by a custom async sequence.
- SampleProtocol (protocol): A type that defines evaluation samples.
- ModelSampleProtocol (protocol): A type that defines language model evaluation samples with prompt, instructions, and expectations.
- ModelSample (struct): A general-purpose language model evaluation sample.
- ModelSampleInput (struct): The data sent to a language model for evaluation.
- ModelSampleOutput (struct): The expected output value and evaluation expectations for a sample.
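To make the JSONL case concrete, the following is a minimal sketch of decoding one sample per line with Foundation's `JSONDecoder`. It does not use `JSONLoader` or `ModelSample`; `SketchSample`, its `prompt` and `expected` fields, and `decodeJSONLines` are hypothetical names chosen for illustration, and the framework's actual file schema may differ.

```swift
import Foundation

// Hypothetical sample shape for illustration only; not the framework's ModelSample.
struct SketchSample: Codable, Equatable {
    let prompt: String
    let expected: String
}

// Decode one JSON object per line (the JSONL convention).
// Empty lines are skipped because split(whereSeparator:) omits empty subsequences.
func decodeJSONLines(_ text: String) throws -> [SketchSample] {
    let decoder = JSONDecoder()
    return try text
        .split(whereSeparator: \.isNewline)
        .map { line in
            try decoder.decode(SketchSample.self, from: Data(line.utf8))
        }
}

let jsonl = """
{"prompt": "Capital of France?", "expected": "Paris"}

{"prompt": "2 + 2", "expected": "4"}
"""

let samples = try decodeJSONLines(jsonl)
```

A malformed line makes the whole call throw, which is usually preferable to silently dropping test samples from a dataset.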
Generating Responses
Produce model responses from samples for evaluation.
- SampleGenerator (actor): An actor that generates evaluation samples using a language model.
- StructuredTranscript (struct): A structured record of a model interaction used during evaluation.
- StructuredValue (enum): A type-safe representation of JSON values.
Scoring and Evaluators
Score model responses with metrics and pluggable evaluators.
- EvaluatorProtocol (protocol): A type that evaluates subjects and produces metrics.
- Evaluator (struct): A closure-based evaluator.
- Metric (struct): A named metric that carries a result value.
- ScoreLevel (protocol): A type that defines individual levels within a scoring scale.
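The idea of a closure-based evaluator that turns a subject into a named metric can be sketched in plain Swift as follows. Every type here (`SketchMetric`, `SketchSubject`, `SketchEvaluator`) is a hypothetical stand-in written for this example; the framework's `Evaluator` and `Metric` initializers and signatures are not shown on this page and are not assumed.

```swift
import Foundation

// Hypothetical stand-ins; these are not the framework's Metric or Evaluator types.
struct SketchMetric: Equatable {
    let name: String
    let value: Double
}

struct SketchSubject {
    let response: String
    let expected: String
}

// An evaluator that simply wraps a scoring closure.
struct SketchEvaluator {
    let evaluate: (SketchSubject) -> SketchMetric
}

// Normalize text so that case and surrounding whitespace do not affect the match.
func normalized(_ text: String) -> String {
    text.trimmingCharacters(in: .whitespacesAndNewlines).lowercased()
}

// Scores 1.0 when the normalized response equals the normalized expectation, else 0.0.
let exactMatch = SketchEvaluator { subject in
    let matched = normalized(subject.response) == normalized(subject.expected)
    return SketchMetric(name: "exact_match", value: matched ? 1.0 : 0.0)
}

let hit = exactMatch.evaluate(SketchSubject(response: " Paris\n", expected: "paris"))
let miss = exactMatch.evaluate(SketchSubject(response: "Lyon", expected: "Paris"))
```

Keeping the scoring logic in a closure makes small, task-specific checks cheap to write without declaring a new conforming type for each one.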
Model-as-Judge Scoring
Use a model to judge responses along defined dimensions and scales.
- ModelJudgeEvaluator (struct): An evaluator that uses a language model as a judge to score responses.
- ModelJudgePrompt (struct): A configuration for how a model-as-judge evaluator constructs its prompt.
- ScoringScale (struct): A scoring scale that defines the set of options a judge can assign.
- ScaleOption (struct): A single option in a scoring scale.
- ScoreDimension (struct): A named scoring dimension for a model judge evaluator.
- ScoringMode (enum): The scoring constraint mode for a model-as-judge evaluator.
Tool-Call Assessment
Evaluate whether a model invokes tools and follows expected trajectories.
- ToolCallEvaluator (struct): An evaluator that verifies agentic tool calls against an expected trajectory.
- ToolExpectation (struct): A specification for an expected tool call, or a group of expectations.
- TrajectoryExpectation (struct): The expected pattern of tool calls for an evaluation.
- ArgumentMatcher (enum): The values that define how to validate a tool-call argument.
- ArgumentValue (enum): A primitive value type for argument specifications that is @Generable.
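One plausible reading of "verifying tool calls against an expected trajectory" is an ordered-subsequence check with per-argument matchers. The sketch below illustrates that reading only; `SketchToolCall`, `SketchMatcher`, `SketchExpectation`, and `followsTrajectory` are hypothetical, and the matching rules that `ToolCallEvaluator` actually applies (strict order, extra calls, grouping) are not described on this page.

```swift
import Foundation

// Hypothetical types for illustration; not the framework's API.
struct SketchToolCall {
    let name: String
    let arguments: [String: String]
}

enum SketchMatcher {
    case any                 // argument must be present, any value
    case equals(String)      // argument must equal the given value
    case contains(String)    // argument must contain the given substring

    func matches(_ value: String?) -> Bool {
        guard let value = value else { return false }
        switch self {
        case .any: return true
        case .equals(let expected): return value == expected
        case .contains(let fragment): return value.contains(fragment)
        }
    }
}

struct SketchExpectation {
    let tool: String
    let arguments: [String: SketchMatcher]
}

// A call satisfies an expectation when the tool name and every listed argument match.
func satisfies(_ call: SketchToolCall, _ expectation: SketchExpectation) -> Bool {
    call.name == expectation.tool && expectation.arguments.allSatisfy { entry in
        entry.value.matches(call.arguments[entry.key])
    }
}

// True when the expectations appear in order within the calls; unrelated calls
// in between are tolerated (an assumption of this sketch).
func followsTrajectory(_ calls: [SketchToolCall], _ expected: [SketchExpectation]) -> Bool {
    var remaining = expected[...]
    for call in calls {
        guard let next = remaining.first else { break }
        if satisfies(call, next) { remaining = remaining.dropFirst() }
    }
    return remaining.isEmpty
}

let calls = [
    SketchToolCall(name: "search", arguments: ["query": "weather Paris"]),
    SketchToolCall(name: "log", arguments: [:]),
    SketchToolCall(name: "book", arguments: ["city": "Paris"]),
]
let searchThenBook = [
    SketchExpectation(tool: "search", arguments: ["query": .contains("Paris")]),
    SketchExpectation(tool: "book", arguments: ["city": .equals("Paris")]),
]
let inOrder = followsTrajectory(calls, searchThenBook)
let outOfOrder = followsTrajectory(calls, Array(searchThenBook.reversed()))
```

Greedy matching is sufficient here because taking the earliest call that satisfies each expectation never rules out a later match.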
Aggregating Metrics
Combine per-sample outcomes into aggregate measures.
- MetricsAggregator (struct): A utility for computing aggregate statistics from evaluation metrics.
- AggregateMetric (struct): An aggregate statistic computed from a metric's results across the evaluation dataset.
- AggregationOperation (enum): The type of aggregation operation used to compute a summary statistic.
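As a rough illustration of combining per-sample scores into summary statistics, the sketch below applies a few common operations to an array of values. `SketchAggregation` and its cases are hypothetical; the operations that `AggregationOperation` actually offers are not listed on this page.

```swift
// Hypothetical aggregation operations for illustration; not the framework's enum.
enum SketchAggregation {
    case mean
    case minimum
    case maximum
    case passRate(threshold: Double)

    // Returns nil for an empty dataset rather than a misleading 0.
    func apply(to values: [Double]) -> Double? {
        guard !values.isEmpty else { return nil }
        switch self {
        case .mean:
            return values.reduce(0, +) / Double(values.count)
        case .minimum:
            return values.min()
        case .maximum:
            return values.max()
        case .passRate(let threshold):
            // Fraction of samples scoring at or above the threshold.
            let passed = values.filter { $0 >= threshold }.count
            return Double(passed) / Double(values.count)
        }
    }
}

// Per-sample scores for one metric across a four-sample dataset.
let scores = [1.0, 0.5, 0.0, 1.0]
let meanScore = SketchAggregation.mean.apply(to: scores)
let passRate = SketchAggregation.passRate(threshold: 0.5).apply(to: scores)
```

With these scores the mean is 0.625 and the pass rate at a 0.5 threshold is 0.75.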
Results
Structured outcomes returned from running an evaluation.
- EvaluationResult (struct): The results of running a model evaluation.
- ResultColumn (struct): A typed descriptor for a column in an evaluation result DataFrame.
Errors
Errors raised while running, scoring, or reading results from evaluations.
- EvaluationError (enum): Errors thrown during an evaluation run.
- EvaluationResultsError (enum): Errors the framework throws when parsing evaluation results.
- ModelJudgeError (enum): Errors that can occur during model-as-judge scoring.
Structures
- EvaluatorsBuilder (struct): A result builder that enables declarative evaluator lists.