This continues from Quickstart Part 1, where we built a cake recipe generator prompt.
In Part 1, you created a prompt, tested it in the playground, and learned about versioning. Now let’s evaluate prompt quality, test different models, and connect everything to your code.

Evaluating a Prompt

Before deploying a prompt, you want to know if it’s actually good. PromptLayer lets you build evaluation pipelines that score your prompt’s outputs automatically.

Creating a Dataset

Evaluations run against a dataset - a collection of test cases with inputs and expected outputs. Let’s create one for our cake recipe prompt. Click New Dataset and name it “cake-recipes-test”.
Creating a dataset
Add a few test cases. Each row needs the input variables your prompt expects (cake_type, serving_size) and optionally an expected output to compare against:
cake_type,serving_size,expected_output
Chocolate Cake,8,"Should include cocoa or chocolate, have clear measurements"
Vanilla Birthday Cake,12,"Should be festive, mention frosting options"
Gluten-Free Lemon Cake,6,"Must not include wheat flour, should use alternatives"
Vegan Carrot Cake,10,"No eggs or dairy, should suggest substitutes"
Download this CSV or add rows manually in the UI.
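If you’d rather build the test file in code, here’s a minimal sketch using Python’s standard csv module (the file name and the rows shown are just examples):

import csv

rows = [
    {"cake_type": "Chocolate Cake", "serving_size": 8,
     "expected_output": "Should include cocoa or chocolate, have clear measurements"},
    {"cake_type": "Vanilla Birthday Cake", "serving_size": 12,
     "expected_output": "Should be festive, mention frosting options"},
]

# Write the test cases in the same column layout shown above
with open("cake-recipes-test.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["cake_type", "serving_size", "expected_output"])
    writer.writeheader()
    writer.writerows(rows)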
Learn more about Datasets.

Creating an Eval Pipeline

Now let’s build a pipeline that runs your prompt against each test case and scores the results. Click New Evaluation and select your dataset.

First, add a Prompt Template column. This runs your prompt against each row in the dataset, using the column values as input variables. The output appears in a new column.

Next, add an LLM-as-judge scoring column. This uses AI to score each output against criteria you define. For our recipe prompt, we might check:
  • Does the recipe include all required sections (Overview, Ingredients, Instructions)?
  • Are measurements provided in both metric and US units?
  • Is the serving size correct?
LLM as judge
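Conceptually, an LLM-as-judge column does something like the sketch below: the recipe output is handed to a second model along with a rubric, and that model returns a score. This is an illustration of the idea, not PromptLayer’s internal implementation; the openai client and judge model name are assumptions.

from openai import OpenAI

openai_client = OpenAI()  # assumes OPENAI_API_KEY is set

RUBRIC = """Score the recipe from 0 to 10:
- Does it include Overview, Ingredients, and Instructions sections?
- Are measurements given in both metric and US units?
- Does it match the requested serving size?
Respond with only the number."""

def judge(recipe_text: str) -> int:
    # Ask a second model to grade the output against the rubric
    result = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # judge model is an assumption
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": recipe_text},
        ],
    )
    return int(result.choices[0].message.content.strip())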
You can also add an Equality Comparison column to compare the prompt output against the expected_output column in your dataset.
Eval pipeline setup
Run the evaluation to see scores across all test cases. Learn more about Evaluations.
Beyond LLM-as-judge, PromptLayer supports:
  • Human grading: Collect scores from domain experts
  • Equality Comparison: Compare outputs to expected results
  • Cosine similarity: Measure semantic similarity between outputs
  • Code evaluators: Write custom Python scoring functions
Agent nodes work the same way in eval pipelines.
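For the code evaluator option, the scoring function might look like the sketch below. The exact signature PromptLayer passes to custom evaluators may differ, so treat the function name and argument as assumptions - the point is that you can score outputs with plain Python logic.

def score_recipe(output: str) -> float:
    """Return a 0-1 score based on simple structural checks on the recipe text."""
    text = output.lower()
    checks = [
        "ingredients" in text,                                          # has an ingredients section
        "instructions" in text,                                         # has step-by-step instructions
        any(unit in text for unit in ("gram", "ml", "liter", "litre")), # metric units present
        any(unit in text for unit in ("cup", "tablespoon", "oz")),      # US units present
    ]
    return sum(checks) / len(checks)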

Testing Different Models

Want to compare how your prompt performs across GPT, Claude, and Gemini? Create a new evaluation for model comparison. Add multiple Prompt Template columns, each configured with a different model override. The pipeline runs your prompt on each model and shows results side by side.
Comparing models
This helps you find the best price/performance balance for your use case. PromptLayer supports all major providers, plus any OpenAI-compatible API via custom base URLs.

Historical Backtests

Once your prompt is in production, you’ll have real request logs. Use these to test new prompt versions against actual user inputs.

Creating a Historical Dataset

Go to Datasets and click Add from Request History. This opens a request log browser where you can filter and select requests.
Adding from request history
Filter by prompt name, date range, metadata, or search content. Select the requests you want and click Add Requests. This captures the real inputs your users sent, along with the outputs your current prompt produced.

Running a Backtest

Create an evaluation that runs your new prompt version against this historical dataset. Add columns for:
  • New prompt output: The response from your updated prompt version
  • Equality Comparison: Side-by-side comparison highlighting changes from the original output
Backtest results
Backtests are powerful because they show impact on real user data without needing to define exact pass/fail criteria upfront. You can quickly spot if a change produces dramatically different outputs. Learn more about backtesting.

CI/CD Evaluation

Attach an evaluation pipeline to run automatically every time you save a new prompt version - similar to GitHub Actions running tests on each commit. When saving a prompt, the commit dialog lets you select an evaluation pipeline. Choose one and click Next. From then on, each new version you create will run through the eval and show its score in the version history. This makes it easy to spot regressions before they reach production.
Eval scores by version
Learn more about continuous integration.

Ad-Hoc Batch Runs

Evaluations aren’t just for testing prompts - they’re one of the best tools for running prompts in batch. Think of it like a spreadsheet where each column can be an AI-powered computation. Upload a CSV or create a dataset, add prompt columns, run the batch, and export the results. Common use cases:
  • Data labeling: Run GPT over production data to create labeled training datasets
  • Research: Use web search prompts to find information about a list of companies or people
  • Content generation: Process hundreds of items (commits, support tickets, reviews) to generate summaries or emails
  • Data enrichment: Take a list of names and enrich with company, location, or other attributes
You don’t need to build permanent evaluation infrastructure for this. One-off, ad-hoc batch runs are a perfectly valid use case. Create a dataset, add your prompt columns, run it, export, and move on.
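As a rough sketch of the “export and move on” step, post-processing an exported results CSV might look like this (the file name and column name are assumptions based on how you configured your columns):

import csv

with open("batch-results.csv") as f:  # hypothetical name of the exported file
    rows = list(csv.DictReader(f))

# Keep only rows where the prompt column produced a non-empty result
summaries = [r for r in rows if r.get("summary", "").strip()]
print(f"{len(summaries)} of {len(rows)} rows produced a summary")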

Connecting in Code

PromptLayer serves as the source of truth for your prompts. Your application fetches prompts by name, keeping prompt logic out of your codebase and enabling non-engineers to make changes.

Installation

pip install promptlayer

Running Prompts

Initialize the client and run a prompt. The SDK fetches the prompt template from PromptLayer, runs it against your configured LLM provider locally, then logs the result back.
from promptlayer import PromptLayer

client = PromptLayer()  # Uses PROMPTLAYER_API_KEY env var

response = client.run(
    prompt_name="cake-recipe",
    input_variables={
        "cake_type": "Chocolate Cake",
        "serving_size": "8"
    }
)
The response includes a prompt blueprint - a model-agnostic format that works the same whether you’re using OpenAI, Anthropic, or any other provider. Access the generated content with:
print(response["prompt_blueprint"]["prompt_template"]["messages"][-1]["content"])
This lets you switch models without changing how you read responses.
After running, head to Logs in PromptLayer to see your request with the full input, output, and latency.
Log from SDK run
Use prompt_release_label="production" to fetch the version labeled for production. Use prompt_version=3 to pin to a specific version number. Agents work the same way - just pass the agent name. Store API keys as environment variables (PROMPTLAYER_API_KEY, OPENAI_API_KEY) - the client reads these automatically.
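Putting those options together, fetching a labeled release or a pinned version looks like this (the prompt name and values are just examples):

# Fetch whichever version currently carries the "production" release label
response = client.run(
    prompt_name="cake-recipe",
    prompt_release_label="production",
    input_variables={"cake_type": "Chocolate Cake", "serving_size": "8"},
)

# Or pin to an exact version number for reproducibility
response = client.run(
    prompt_name="cake-recipe",
    prompt_version=3,
    input_variables={"cake_type": "Chocolate Cake", "serving_size": "8"},
)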

Metadata and Logging

Add metadata to track requests by user, session, or feature flag:
response = client.run(
    prompt_name="cake-recipe",
    input_variables={"cake_type": "Chocolate", "serving_size": "8"},
    tags=["production", "recipe-feature"],
    metadata={
        "user_id": "user_123",
        "session_id": "sess_abc"
    }
)

# Add a score after reviewing the output
client.track.score(request_id=response["request_id"], score=95)
Your metadata and tags appear in the log details, letting you filter and search by user or feature.
Log with metadata
Learn more about metadata and tagging. For agents or complex workflows, enable tracing with client = PromptLayer(enable_tracing=True) and use the @client.traceable decorator on your functions to see each step as spans. Learn more about traces.
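A minimal tracing setup based on the options above might look like this; generate_recipe is a hypothetical workflow step, not part of the SDK:

from promptlayer import PromptLayer

client = PromptLayer(enable_tracing=True)  # uses PROMPTLAYER_API_KEY env var

@client.traceable
def generate_recipe(cake_type: str, serving_size: str):
    # Each decorated call shows up as a span in the trace view
    return client.run(
        prompt_name="cake-recipe",
        input_variables={"cake_type": cake_type, "serving_size": serving_size},
    )

generate_recipe("Chocolate Cake", "8")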

Organizations

Use organizations and workspaces to manage teams and environments. Common setups include separate workspaces for Production, Staging, and Development.
Changelog
Key features:
  • Role-based access control: Owner, Admin, Editor, and Viewer roles at organization and workspace levels
  • Audit logs: Track who changed what and when
  • Author attribution: See who created and modified each prompt version
  • Centralized billing: Manage usage across all workspaces
Each workspace has its own API key. Switch between workspaces using the dropdown in the top navigation.
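When your code needs to target a specific workspace, one approach is to pass that workspace’s key explicitly instead of relying on the default environment variable; the variable name below is a placeholder, not a convention PromptLayer requires:

import os
from promptlayer import PromptLayer

# Explicitly use the staging workspace's key (hypothetical env var name);
# without api_key, the client falls back to PROMPTLAYER_API_KEY
client = PromptLayer(api_key=os.environ["PROMPTLAYER_STAGING_API_KEY"])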

Next Steps