This continues from Quickstart Part 1, where we built a cake recipe generator prompt.
In Part 1, you created a prompt, tested it in the playground, and learned about versioning. Now let’s evaluate prompt quality, test different models, and connect everything to your code.

Evaluating a Prompt

Before deploying a prompt, you want to know if it’s actually good. PromptLayer lets you build evaluation pipelines that score your prompt’s outputs automatically.

Creating a Dataset

Evaluations run against a dataset - a collection of test cases with inputs and expected outputs. Let’s create one for our cake recipe prompt. Click New Dataset and name it “cake-recipes-test”.
Creating a dataset
Add a few test cases. Each row needs the input variables your prompt expects (cake_type, serving_size) and optionally an expected output to compare against:
cake_type,serving_size,expected_output
Chocolate Cake,8,"Should include cocoa or chocolate, have clear measurements"
Vanilla Birthday Cake,12,"Should be festive, mention frosting options"
Gluten-Free Lemon Cake,6,"Must not include wheat flour, should use alternatives"
Vegan Carrot Cake,10,"No eggs or dairy, should suggest substitutes"
Download this CSV or add rows manually in the UI.
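If you’d rather build the test file in code, here’s a minimal sketch using Python’s standard csv module (the file name and the rows shown are just examples):

import csv

rows = [
    {"cake_type": "Chocolate Cake", "serving_size": 8,
     "expected_output": "Should include cocoa or chocolate, have clear measurements"},
    {"cake_type": "Vanilla Birthday Cake", "serving_size": 12,
     "expected_output": "Should be festive, mention frosting options"},
]

# Write the test cases in the same column layout shown above
with open("cake-recipes-test.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["cake_type", "serving_size", "expected_output"])
    writer.writeheader()
    writer.writerows(rows)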
Learn more about Datasets.

Creating an Eval Pipeline

Now let’s build a pipeline that runs your prompt against each test case and scores the results. Click New Evaluation and select your dataset.

First, add a Prompt Template column. This runs your prompt against each row in the dataset, using the column values as input variables. The output appears in a new column.

Next, add an LLM-as-judge scoring column. This uses AI to score each output against criteria you define. For our recipe prompt, we might check:
  • Does the recipe include all required sections (Overview, Ingredients, Instructions)?
  • Are measurements provided in both metric and US units?
  • Is the serving size correct?
LLM as judge
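Conceptually, an LLM-as-judge column does something like the sketch below: the recipe output is handed to a second model along with a rubric, and that model returns a score. This is an illustration of the idea, not PromptLayer’s internal implementation; the openai client and judge model name are assumptions.

from openai import OpenAI

openai_client = OpenAI()  # assumes OPENAI_API_KEY is set

RUBRIC = """Score the recipe from 0 to 10:
- Does it include Overview, Ingredients, and Instructions sections?
- Are measurements given in both metric and US units?
- Does it match the requested serving size?
Respond with only the number."""

def judge(recipe_text: str) -> int:
    # Ask a second model to grade the output against the rubric
    result = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # judge model is an assumption
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": recipe_text},
        ],
    )
    return int(result.choices[0].message.content.strip())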
You can also add an Equality Comparison column to compare the prompt output against the expected_output column in your dataset.
Eval pipeline setup
Run the evaluation to see scores across all test cases. Learn more about Evaluations.
Beyond LLM-as-judge, PromptLayer supports:
  • Human grading: Collect scores from domain experts
  • Equality Comparison: Compare outputs to expected results
  • Cosine similarity: Measure semantic similarity between outputs
  • Code evaluators: Write custom Python scoring functions
Agent nodes work the same way in eval pipelines.
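For the code evaluator option, the scoring function might look like the sketch below. The exact signature PromptLayer passes to custom evaluators may differ, so treat the function name and argument as assumptions - the point is that you can score outputs with plain Python logic.

def score_recipe(output: str) -> float:
    """Return a 0-1 score based on simple structural checks on the recipe text."""
    text = output.lower()
    checks = [
        "ingredients" in text,                                          # has an ingredients section
        "instructions" in text,                                         # has step-by-step instructions
        any(unit in text for unit in ("gram", "ml", "liter", "litre")), # metric units present
        any(unit in text for unit in ("cup", "tablespoon", "oz")),      # US units present
    ]
    return sum(checks) / len(checks)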

Testing Different Models

Want to compare how your prompt performs across GPT, Claude, and Gemini? Create a new evaluation for model comparison. Add multiple Prompt Template columns, each configured with a different model override. The pipeline runs your prompt on each model and shows results side by side.
Comparing models
This helps you find the best price/performance balance for your use case. PromptLayer supports all major providers, plus any OpenAI-compatible API via custom base URLs.

Historical Backtests

Once your prompt is in production, you’ll have real request logs. Use these to test new prompt versions against actual user inputs.

Creating a Historical Dataset

Go to Datasets and click Add from Request History. This opens a request log browser where you can filter and select requests.
Adding from request history
Filter by prompt name, date range, metadata, or search content. Select the requests you want and click Add Requests. This captures the real inputs your users sent, along with the outputs your current prompt produced.

Running a Backtest

Create an evaluation that runs your new prompt version against this historical dataset. Add columns for:
  • New prompt output: The response from your updated prompt version
  • Equality Comparison: Side-by-side comparison highlighting changes from the original output
Backtest results
Backtests are powerful because they show impact on real user data without needing to define exact pass/fail criteria upfront. You can quickly spot if a change produces dramatically different outputs. Learn more about backtesting.

CI/CD Evaluation

Attach an evaluation pipeline to run automatically every time you save a new prompt version - similar to GitHub Actions running tests on each commit. When saving a prompt, the commit dialog lets you select an evaluation pipeline. Choose one and click Next. From then on, each new version you create will run through the eval and show its score in the version history. This makes it easy to spot regressions before they reach production.
Eval scores by version
Learn more about continuous integration.

Ad-Hoc Batch Runs

Evaluations aren’t just for testing prompts - they’re one of the best tools for running prompts in batch. Think of it like a spreadsheet where each column can be an AI-powered computation. Upload a CSV or create a dataset, add prompt columns, run the batch, and export the results. Common use cases:
  • Data labeling: Run GPT over production data to create labeled training datasets
  • Research: Use web search prompts to find information about a list of companies or people
  • Content generation: Process hundreds of items (commits, support tickets, reviews) to generate summaries or emails
  • Data enrichment: Take a list of names and enrich with company, location, or other attributes
You don’t need to build permanent evaluation infrastructure for this. One-off, ad-hoc batch runs are a perfectly valid use case. Create a dataset, add your prompt columns, run it, export, and move on.
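As a rough sketch of the “export and move on” step, post-processing an exported results CSV might look like this (the file name and column name are assumptions based on how you configured your columns):

import csv

with open("batch-results.csv") as f:  # hypothetical name of the exported file
    rows = list(csv.DictReader(f))

# Keep only rows where the prompt column produced a non-empty result
summaries = [r for r in rows if r.get("summary", "").strip()]
print(f"{len(summaries)} of {len(rows)} rows produced a summary")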

Connecting in Code

PromptLayer serves as the source of truth for your prompts. Your application fetches prompts by name, keeping prompt logic out of your codebase and enabling non-engineers to make changes.

Installation

pip install promptlayer

Running Prompts

Initialize the client and run a prompt. The SDK fetches the prompt template from PromptLayer, runs it against your configured LLM provider locally, then logs the result back.
from promptlayer import PromptLayer

client = PromptLayer()  # Uses PROMPTLAYER_API_KEY env var

response = client.run(
    prompt_name="cake-recipe",
    input_variables={
        "cake_type": "Chocolate Cake",
        "serving_size": "8"
    }
)
The response includes a prompt blueprint - a model-agnostic format that works the same whether you’re using OpenAI, Anthropic, or any other provider. Access the generated content with:
print(response["prompt_blueprint"]["prompt_template"]["messages"][-1]["content"])
This lets you switch models without changing how you read responses.
After running, head to Logs in PromptLayer to see your request with the full input, output, and latency.
Log from SDK run
Use prompt_release_label="production" to fetch the version labeled for production. Use prompt_version=3 to pin to a specific version number. Agents work the same way - just pass the agent name. Store API keys as environment variables (PROMPTLAYER_API_KEY, OPENAI_API_KEY) - the client reads these automatically.
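Putting those options together, fetching a labeled release or a pinned version looks like this (the prompt name and values are just examples):

# Fetch whichever version currently carries the "production" release label
response = client.run(
    prompt_name="cake-recipe",
    prompt_release_label="production",
    input_variables={"cake_type": "Chocolate Cake", "serving_size": "8"},
)

# Or pin to an exact version number for reproducibility
response = client.run(
    prompt_name="cake-recipe",
    prompt_version=3,
    input_variables={"cake_type": "Chocolate Cake", "serving_size": "8"},
)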

Metadata and Logging

Add metadata to track requests by user, session, or feature flag:
response = client.run(
    prompt_name="cake-recipe",
    input_variables={"cake_type": "Chocolate", "serving_size": "8"},
    tags=["production", "recipe-feature"],
    metadata={
        "user_id": "user_123",
        "session_id": "sess_abc"
    }
)

# Add a score after reviewing the output
client.track.score(request_id=response["request_id"], score=95)
Your metadata and tags appear in the log details, letting you filter and search by user or feature.
Log with metadata
Learn more about metadata and tagging. For agents or complex workflows, enable tracing with client = PromptLayer(enable_tracing=True) and use the @client.traceable decorator on your functions to see each step as spans. Learn more about traces.
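A minimal tracing setup based on the options above might look like this; generate_recipe is a hypothetical workflow step, not part of the SDK:

from promptlayer import PromptLayer

client = PromptLayer(enable_tracing=True)  # uses PROMPTLAYER_API_KEY env var

@client.traceable
def generate_recipe(cake_type: str, serving_size: str):
    # Each decorated call shows up as a span in the trace view
    return client.run(
        prompt_name="cake-recipe",
        input_variables={"cake_type": cake_type, "serving_size": serving_size},
    )

generate_recipe("Chocolate Cake", "8")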

Organizations

Use organizations and workspaces to manage teams and environments. Common setups include separate workspaces for Production, Staging, and Development.
Changelog
Key features:
  • Role-based access control: Owner, Admin, Editor, and Viewer roles at organization and workspace levels
  • Audit logs: Track who changed what and when
  • Author attribution: See who created and modified each prompt version
  • Centralized billing: Manage usage across all workspaces
Each workspace has its own API key. Switch between workspaces using the dropdown in the top navigation.
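When your code needs to target a specific workspace, one approach is to pass that workspace’s key explicitly instead of relying on the default environment variable; the variable name below is a placeholder, not a convention PromptLayer requires:

import os
from promptlayer import PromptLayer

# Explicitly use the staging workspace's key (hypothetical env var name);
# without api_key, the client falls back to PROMPTLAYER_API_KEY
client = PromptLayer(api_key=os.environ["PROMPTLAYER_STAGING_API_KEY"])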

Next Steps