Recommended Workflow
We recommend a systematic approach to implementing automated evaluations. This approach enables two powerful use cases:

1. Nightly Evaluations (Production Monitoring)
Run scheduled evaluations to ensure nothing has changed in your production system. The score can be sent to Slack or your alerting system with a direct link to the evaluation pipeline. This helps detect production issues by sampling a wide range of requests and comparing against expected performance.

2. CI/CD Integration
Trigger evaluations in your CI/CD pipeline (GitHub, GitLab, etc.) whenever relevant PRs are created. Wait for the evaluation score before proceeding with deployment, to make sure that your changes do not break anything.

Complete Example: Building an Evaluation Pipeline
Here’s a complete example of building an evaluation pipeline from scratch using the API:

Step-by-Step Implementation
Step 1: Create a Dataset
To run evaluations, you’ll need a dataset against which to test your prompts. PromptLayer now provides a comprehensive set of APIs for dataset management:

1.1 Create a Dataset Group
First, create a dataset group to organize your datasets:
- Endpoint: `POST /api/public/v2/dataset-groups`
- Description: Create a new dataset group within a workspace
- Authentication: JWT or API key
- Docs Link: Create Dataset Group
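For illustration, here is a minimal Python sketch of this call using the requests library. The base URL, the X-API-KEY header, and the name and workspace_id fields are assumptions rather than the documented schema; check the Create Dataset Group docs for the exact shape.

```python
import os, requests

BASE_URL = "https://api.promptlayer.com"                    # assumed base URL
headers = {"X-API-KEY": os.environ["PROMPTLAYER_API_KEY"]}  # assumed auth header

# Create a dataset group to hold related datasets (field names are illustrative)
resp = requests.post(
    f"{BASE_URL}/api/public/v2/dataset-groups",
    headers=headers,
    json={"name": "nightly-eval-datasets", "workspace_id": 1},  # hypothetical fields
)
resp.raise_for_status()
dataset_group_id = resp.json()["id"]  # assumed response key
```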
1.2 Create a Dataset Version
Once you have a dataset group, you can create dataset versions using two methods:

Option A: From Request History
Create a dataset from your existing request logs:
- Endpoint: `POST /api/public/v2/dataset-versions/from-filter-params`
- Description: Create a dataset version by filtering request logs
- Authentication: API key only
- Docs Link: Create Dataset Version from Filter Params
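A hedged sketch of this request is shown below; the dataset_group_id field and the filter_params contents are illustrative assumptions, not the documented filter schema.

```python
import os, requests

BASE_URL = "https://api.promptlayer.com"                    # assumed base URL
headers = {"X-API-KEY": os.environ["PROMPTLAYER_API_KEY"]}  # assumed auth header

# Build a dataset version from recent request logs (field names are illustrative)
resp = requests.post(
    f"{BASE_URL}/api/public/v2/dataset-versions/from-filter-params",
    headers=headers,
    json={
        "dataset_group_id": 123,           # hypothetical field
        "filter_params": {                 # hypothetical filter shape
            "start_time": "2024-01-01T00:00:00Z",
            "limit": 100,
        },
    },
)
resp.raise_for_status()
```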
Option B: From File Upload
Upload a CSV or JSON file to create a dataset:
- Endpoint: `POST /api/public/v2/dataset-versions/from-file`
- Description: Create a dataset version by uploading a file
- Authentication: API key only
- Docs Link: Create Dataset Version from File
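As a rough example, a CSV upload using requests’ multipart support might look like the sketch below; the form field names are assumptions.

```python
import os, requests

BASE_URL = "https://api.promptlayer.com"                    # assumed base URL
headers = {"X-API-KEY": os.environ["PROMPTLAYER_API_KEY"]}  # assumed auth header

# Upload a CSV to create a dataset version (form field names are illustrative)
with open("golden_questions.csv", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/api/public/v2/dataset-versions/from-file",
        headers=headers,
        files={"file": ("golden_questions.csv", f, "text/csv")},
        data={"dataset_group_id": "123"},   # hypothetical form field
    )
resp.raise_for_status()
```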
Step 2: Create an Evaluation Pipeline
Create your evaluation pipeline (called a “report” in the API) by making a POST request to `/reports`:
- Endpoint: `POST /reports`
- Description: Creates a new evaluation pipeline
- Authentication: JWT or API key
- Docs Link: Create Reports
Request Payload
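The exact schema is in the Create Reports reference; the sketch below uses an assumed shape with hypothetical name and dataset_id fields.

```python
import os, requests

BASE_URL = "https://api.promptlayer.com"                    # assumed base URL
headers = {"X-API-KEY": os.environ["PROMPTLAYER_API_KEY"]}  # assumed auth header

# Illustrative request body for POST /reports (field names are assumptions)
payload = {
    "name": "nightly-regression-eval",  # hypothetical: display name of the pipeline
    "dataset_id": 456,                  # hypothetical: dataset version to evaluate against
}
resp = requests.post(f"{BASE_URL}/reports", headers=headers, json=payload)
resp.raise_for_status()
```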
Response
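The response body is not reproduced here. Assuming the new report’s id is returned under an id key, later steps can pick it up like this:

```python
# Continuing the request above; the "id" key is an assumption, not the documented schema
report = resp.json()
report_id = report["id"]  # used in /report-columns and /reports/{report_id}/... calls below
```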
Step 3: Configure Pipeline Steps
The evaluation pipeline consists of steps, each referred to as a “report column”. Columns execute sequentially from left to right, where each column can reference the outputs of previous columns.
- Endpoint: `POST /report-columns`
- Description: Add a step to your evaluation pipeline
- Authentication: JWT or API key
Basic Request Structure
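The column-specific configuration is covered in the reference below. As a rough sketch, adding a PROMPT_TEMPLATE step might look like this, where column_type, name, and configuration are assumed field names rather than the documented schema.

```python
import os, requests

BASE_URL = "https://api.promptlayer.com"                    # assumed base URL
headers = {"X-API-KEY": os.environ["PROMPTLAYER_API_KEY"]}  # assumed auth header
report_id = 789                                             # id returned when the report was created

# Add a PROMPT_TEMPLATE step to the pipeline (field names are illustrative)
column = {
    "report_id": report_id,
    "column_type": "PROMPT_TEMPLATE",                 # one of the types listed below
    "name": "AI Answer",                              # must be unique within the pipeline
    "configuration": {                                # hypothetical configuration shape
        "template_name": "customer-support-answer",
        "input_variables": {"question": "question"},  # map template variables to dataset columns
    },
}
resp = requests.post(f"{BASE_URL}/report-columns", headers=headers, json=column)
resp.raise_for_status()
```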
Column Types Reference
Below is a complete reference of all available column types and their configurations. Each column type serves a specific purpose in your evaluation pipeline.

Primary Column Types
- PROMPT_TEMPLATE: Executes a prompt template from your Prompt Registry.
- ENDPOINT: Calls a custom API endpoint with data from previous columns.
- MCP: Executes functions on a Model Context Protocol (MCP) server.
- HUMAN: Allows manual human input for evaluation.
- CODE_EXECUTION: Executes Python or JavaScript code for custom logic.
- CODING_AGENT: Uses an AI coding agent to process data.
- CONVERSATION_SIMULATOR: Simulates multi-turn conversations for testing.
- WORKFLOW: Executes a PromptLayer workflow.

Evaluation Column Types
- LLM_ASSERTION: Uses an LLM to evaluate assertions about the data.
- AI_DATA_EXTRACTION: Extracts specific data using AI.
- COMPARE: Compares two columns for equality.
- CONTAINS: Checks if a column contains specific text.
- REGEX: Matches a regular expression pattern.
- REGEX_EXTRACTION: Extracts text using a regex pattern.
- COSINE_SIMILARITY: Calculates semantic similarity between two texts.
- ABSOLUTE_NUMERIC_DISTANCE: Calculates absolute distance between numeric values.

Helper Column Types
- JSON_PATH: Extracts data from JSON using JSONPath.
- XML_PATH: Extracts data from XML using XPath.
- PARSE_VALUE: Converts column values to different types.
- APPLY_DIFF: Applies diff patches to content.
- VARIABLE: Creates a static value column.
- ASSERT_VALID: Validates data types (JSON, number, SQL).
- COALESCE: Returns the first non-null value from multiple columns.
- COMBINE_COLUMNS: Combines multiple columns into one.
- COUNT: Counts characters, words, or paragraphs.
- MATH_OPERATOR: Performs mathematical operations.
- MIN_MAX: Finds minimum or maximum values.

Column Reference Syntax
When configuring columns that reference other columns, use these formats:
- Dataset columns: Use the exact column name from your dataset (e.g., “question”, “expected_output”)
- Previous step columns: Use the exact name you gave to the column (e.g., “AI Answer”, “Validation Result”)
- Variable columns: For columns of type VARIABLE, reference them by their name
Important Notes
- Column Order Matters: Columns execute left to right. A column can only reference columns to its left.
- Column Names: Must be unique within a pipeline. Use descriptive names.
- Dataset Columns: Are automatically available as the first columns in your pipeline.
- Error Handling: If a column fails, subsequent columns that depend on it will also fail.
- Scoring: If your last column contains all boolean or numeric values, it becomes the evaluation score.
Step 4: Trigger the Evaluation
Once your pipeline is configured, trigger it programmatically using the run endpoint:
- Endpoint: `POST /reports/{report_id}/run`
- Description: Execute the evaluation pipeline with optional dataset refresh
- Docs Link: Run Evaluation Pipeline
Example Payload
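A minimal trigger sketch is shown below; the refresh_dataset flag is an assumed field name, so check the Run Evaluation Pipeline docs for the real payload.

```python
import os, requests

BASE_URL = "https://api.promptlayer.com"                    # assumed base URL
headers = {"X-API-KEY": os.environ["PROMPTLAYER_API_KEY"]}  # assumed auth header
report_id = 789

# Trigger the evaluation run (the dataset-refresh flag name is an assumption)
resp = requests.post(
    f"{BASE_URL}/reports/{report_id}/run",
    headers=headers,
    json={"refresh_dataset": True},  # hypothetical: re-pull the dataset before running
)
resp.raise_for_status()
```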
Step 5: Monitor and Retrieve Results
You have two options for monitoring evaluation progress:

Option A: Polling
Continuously check the report status until completion:
- Endpoint: `GET /reports/{report_id}`
- Description: Retrieve the status and results of a specific report by its ID.
- Docs Link: Get Report Status
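A simple polling loop might look like the sketch below; the status field and its RUNNING/PENDING values are assumptions about the response shape.

```python
import os, time, requests

BASE_URL = "https://api.promptlayer.com"                    # assumed base URL
headers = {"X-API-KEY": os.environ["PROMPTLAYER_API_KEY"]}  # assumed auth header
report_id = 789

# Poll until the report leaves an in-progress state (status values are assumptions)
while True:
    report = requests.get(f"{BASE_URL}/reports/{report_id}", headers=headers).json()
    status = report.get("status")
    if status not in ("RUNNING", "PENDING"):  # hypothetical status values
        break
    time.sleep(30)
print("Evaluation finished with status:", status)
```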
Option B: Webhooks
Listen for the `report_finished` webhook event for real-time notifications when evaluations complete.
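If you go the webhook route, a minimal receiver could look like the Flask sketch below; only the report_finished event name comes from the docs above, and the payload fields are assumptions.

```python
from flask import Flask, request

app = Flask(__name__)

# Minimal webhook receiver; the payload shape beyond the event name is assumed
@app.route("/promptlayer-webhook", methods=["POST"])
def handle_webhook():
    event = request.get_json(force=True)
    if event.get("event") == "report_finished":
        print("Evaluation finished:", event)
    return "", 204
```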
Step 6: Get the Score
Once the evaluation is complete, retrieve the final score:
- Endpoint: `GET /reports/{report_id}/score`
- Description: Fetch the score of a specific report by its ID.
- Docs Link: Get Evaluation Score
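A final sketch to fetch the score; the shape of the response body is not documented here, so it is simply printed as-is.

```python
import os, requests

BASE_URL = "https://api.promptlayer.com"                    # assumed base URL
headers = {"X-API-KEY": os.environ["PROMPTLAYER_API_KEY"]}  # assumed auth header
report_id = 789

# Fetch the final evaluation score once the report has completed
resp = requests.get(f"{BASE_URL}/reports/{report_id}/score", headers=headers)
resp.raise_for_status()
print("Evaluation score:", resp.json())
```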

