We continue this series of posts comparing the two leading platforms in LLM engineering: Langfuse and LangSmith. If you’d like to learn more about the motivation behind this series, how the platforms are configured, and what steps are required to version prompts and trace LLM calls, make sure to check out the first post.

Datasets

Defining datasets is a key step in managing the lifecycle of your prompts. Datasets are collections of inputs and their corresponding expected outputs from the LLM, used to evaluate the application or specific components of it (such as prompts, the LLM itself, or the database).

It’s important to note that creating a dataset requires ground truths—the correct answers to the questions you want the LLM to handle—which serve as a benchmark to compare the model’s actual responses. In both LangSmith and Langfuse, datasets can be created iteratively, allowing you to incorporate real-world cases from production environments selectively. These datasets can be built using either the graphical interface or programmatically via code.

LangSmith

Creating a dataset in LangSmith is as straightforward as using the .create_dataset() method with the desired name. To add examples, you then use the .create_examples() method and fill in the necessary parameters:

from langsmith import Client

client = Client()  # LangSmith client, configured as described in the first post

dataset_name = "Dataset1"
dataset = client.create_dataset(dataset_name)

# Examples are added in bulk: inputs and outputs are matched by position
client.create_examples(
    inputs = [
        {"question": "What color was Little Red Riding Hood?"},
        {"question": "What does the wolf want?"},
    ],
    outputs = [
        {"answer": "Little Red Riding Hood was red."},
        {"answer": "The wolf wants to eat Little Red Riding Hood."},
    ],
    dataset_id = dataset.id,
)

In LangSmith, a dataset is displayed as follows:

View of Dataset1 in the Datasets & Testing tab. The Examples tab is selected, displaying a table with the columns Input, Output, Created At, Modified At, Splits, and an actions menu for each dataset record. There are two records: the question “What color was Little Red Riding Hood?” and the question “What does the wolf want?”

Langfuse

In Langfuse, there is also a .create_dataset() method:

from langfuse import Langfuse

langfuse = Langfuse()  # Langfuse client, configured as described in the first post

langfuse.create_dataset(
    name="simple_dataset",
    description="Test dataset",
    # Optional metadata
    metadata={
        "author": "Miguel",
        "date": "2024-08-06",
        "type": "tutorial"
    }
)

However, items must be added to the dataset one by one:

langfuse.create_dataset_item(
    dataset_name="simple_dataset",
    input={
        "text": "What color was Little Red Riding Hood?"
    },
    expected_output={
        "text": "Little Red Riding Hood was red"
    },
)

langfuse.create_dataset_item(
    dataset_name="simple_dataset",
    input={
        "text": "What does the wolf want?"
    },
    expected_output={
        "text": "The wolf wants to eat Little Red Riding Hood"
    },
)

In Langfuse, the dataset is displayed as a table with the columns Item ID, Source, Status, Created At, Input, Expected Output, Metadata, and an actions menu for each dataset record.
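
As mentioned earlier, both platforms also let you selectively promote real-world cases from production into these datasets. Below is a minimal sketch of that workflow; it reuses the clients configured above and assumes the source_trace_id (Langfuse) and source_run_id (LangSmith) parameters for linking the new item back to the production trace or run it came from. The question, answer, and IDs are made-up placeholders:

# Langfuse: link the new dataset item to the production trace it came from
langfuse.create_dataset_item(
    dataset_name="simple_dataset",
    input={"text": "What does the hunter do at the end?"},  # example taken from production
    expected_output={"text": "The hunter rescues Little Red Riding Hood and her grandmother."},
    source_trace_id="trace-id-from-production",  # placeholder ID
)

# LangSmith: the equivalent, linking the example to its source run
client.create_example(
    inputs={"question": "What does the hunter do at the end?"},
    outputs={"answer": "The hunter rescues Little Red Riding Hood and her grandmother."},
    dataset_id=dataset.id,
    source_run_id="run-id-from-production",  # placeholder ID
)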

Datasets also help keep prompt deterioration under control: they make it possible to monitor the degradation of a prompt’s performance over time, whether it stems from a change in the LLM version or from incremental drift in the input data.

Experimentation and Evaluation

As mentioned earlier, the purpose of creating a dataset is to evaluate a specific part of the application.

Both LangSmith and Langfuse integrate seamlessly with the LangChain evaluation framework and also with RAGAS (a library for automatic metrics designed to evaluate RAG use cases). In this scenario, however, we will create a custom evaluation prompt to assess whether the generated content includes the information from the reference content (generated response vs. correct answer).

For this, we will once again utilize a ChatPromptTemplate, but this time we will include some behavior examples (few-shot technique) to ensure the format is more predictable (e.g., returning “True” or “False”).

LangSmith

To begin, we will write shared code for both platforms: the system template, the template for inserting the human message, and a template with examples:

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.prompts.few_shot import FewShotChatMessagePromptTemplate
from langchain_core.messages.human import HumanMessage

evaluation_prompt_system = """
You will receive two text fragments: one as the reference 
and the other as the text to evaluate. Your task is to 
determine if the text to evaluate semantically contains 
all the information in the reference text. Respond only 
with 'True' or 'False'. Consider the following:

1. The evaluated text does not need to be literally the 
   same as the reference text.
2. The evaluated text must contain all the information 
   in the reference text.
3. The evaluated text can include additional information 
   beyond the reference text.
4. Only return 'True' or 'False' depending on whether the 
   conditions are met.


You will receive:
reference_text: Reference text
evaluate_text: Text to evaluate
"""

human_prompt = """
reference_text: {reference_text}
evaluate_text: {evaluate_text}
"""

# Few-shot examples
examples = [
    {
        "reference": "The house is blue",
        "evaluate": "The house is blue and has a brown door",
        "response": "True"
    },
    {
        "reference": "The Russian Revolution began in 1328",
        "evaluate": "The Russian Revolution began in 1917",
        "response": "False"
    },
    {
        "reference": "My cousin has blonde hair and blue eyes",
        "evaluate": "My cousin Pedro has blue eyes",
        "response": "False"
    },
    {
        "reference": "I own two houses in Formentera",
        "evaluate": "I own two beautiful houses in Formentera",
        "response": "True"
    }
]

example_prompt = ChatPromptTemplate.from_messages([
    ('human', 'reference_text: {reference}\nevaluate_text: {evaluate}'),
    ('ai', '{response}')
])

# Creating the few-shot prompt
few_shot_template = FewShotChatMessagePromptTemplate(
    example_prompt=example_prompt,
    examples=examples
)
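
As a quick sanity check of what the few-shot template produces (the alternating message structure is what the next step relies on), you can print its formatted messages. This snippet is purely illustrative:

# format_messages() expands the examples into alternating
# HumanMessage / AIMessage objects
for msg in few_shot_template.format_messages():
    print(type(msg).__name__, "->", msg.content)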

Once we have the three templates, we need to create our prompt_to_evaluate to upload it to the platform. Let’s start with LangSmith:

# Creating messages for few-shot examples
examples_msgs = [
    ("human", i.content) if isinstance(i, HumanMessage)
    else ("ai", i.content)
    for i in few_shot_template.format_messages()
]

# Combining the system prompt with few-shot examples and human prompt
prompt_to_evaluate = ChatPromptTemplate.from_messages(
    [("system", evaluation_prompt_system)] +
    examples_msgs +
    [("human", human_prompt)]
)

# Uploading the prompt to LangSmith
client.push_prompt(
    "prompt_for_evaluation",
    object=prompt_to_evaluate
)

For the custom evaluation stage, it is necessary to define two functions: a generation function, which produces the response we want to grade, and an evaluation function, which compares that response against the expected output from the dataset.

Next, we define both functions. The generation function (generate_response()) initializes the LLM, loads the prompt to be evaluated, and invokes the model. It must accept a dictionary as input, corresponding to the dictionaries defined as inputs when creating the dataset. This input will contain the necessary fields defined during the dataset creation process (e.g., question in our case). The output should also be a dictionary, with fields that will later be referenced by the evaluation function (in this case, output).

The evaluation function (evaluate_shot()) takes a Run object and an Example object from LangSmith as arguments. Inside the function, the evaluation prompt is instantiated, and we retrieve both the example output (defined as answer in the dataset storage) and the execution result, which follows the same structure defined in the return of the generation function (in our case, it will only contain the output field).

from langsmith.schemas import Example, Run
from langsmith.evaluation import evaluate
from langchain.chains.question_answering import load_qa_chain
from langchain_openai import ChatOpenAI

# Function to generate a response using the LLM
def generate_response(inputs: dict):
    llm = ChatOpenAI(model="gpt-4o", temperature=0.1)
    prompt = client.pull_prompt("prompt_rag")
    chain = load_qa_chain(
        llm=llm,
        chain_type="stuff",
        prompt=prompt,
        document_variable_name="contexto"
    )

    # Perform generation: "chunks" holds the context documents, and
    # "contexto"/"pregunta" are the variable names expected by prompt_rag
    response = chain.invoke({
        "input_documents": chunks,
        "pregunta": inputs["question"]
    })
    return {"output": response["output_text"]}

# Function to evaluate the response against the expected output
def evaluate_shot(run: Run, example: Example):
    prompt_evaluation = client.pull_prompt("prompt_for_evaluation")
    reference = example.outputs["answer"]
    output_to_evaluate = run.outputs["output"]

    chain = prompt_evaluation | ChatOpenAI(model="gpt-4o")
    response = chain.invoke({
        "evaluate_text": output_to_evaluate,
        "reference_text": reference
    }).content

    return {"key": "accuracy", "score": float(eval(response))}

Finally, to conduct the evaluation, we use the evaluate() function, passing the dataset, generation function, and evaluation function:

# Evaluate the dataset using the defined generation and evaluation functions
results = evaluate(
    generate_response,
    data="Dataset1",
    evaluators=[evaluate_shot]
)

And now we can go to LangSmith to view our experiment, which has automatically been assigned a name composed of two random words and a number separated by hyphens. In this case, advanced-exchange-25.
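
If you would rather not rely on the auto-generated name, evaluate() also accepts an experiment_prefix argument. A minimal sketch (the prefix value is just an example):

results = evaluate(
    generate_response,
    data="Dataset1",
    evaluators=[evaluate_shot],
    experiment_prefix="little-red-riding-hood-eval"  # example prefix
)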

When we open the experiment, we can see that another experiment can be added for comparison. As shown, the advanced-exchange-25 column contains the actual responses and their respective scores. In this case, the first response has a score of 0 because the actual response says, “The wolf wants to eat the grandmother, Little Red Riding Ho…”, whereas the reference states, “The wolf wants to eat Little Red Riding Hood.” The second response has a score of 1 because it says, “Little Red Riding Hood is red,” which is consistent with the reference, “Little Red Riding Hood was red.”

View of the advanced-exchange-25 experiment within the Datasets & Testing tab. A table is displayed with the columns Input, Reference Output, advanced-exchange-25, and a button to add more experiments for comparison. There are two entries: the question “What does the wolf want?” and the question “What color was Little Red Riding Hood?”

Langfuse

Now, we’ll create the prompt_to_evaluate needed to upload it to Langfuse, using the same templates we defined for LangSmith.

examples_msgs = [
    {"role": "human", "content": i.content} if isinstance(i, HumanMessage)
    else {"role": "ai", "content": i.content}
    for i in few_shot_template.format_messages()
]

prompt_to_evaluate = [{
    "role": "system", "content": evaluation_prompt_system
    }] + examples_msgs + [{
    "role": "human", "content": human_prompt
}]

langfuse.create_prompt(
    name = "evaluation_prompt",
    prompt = prompt_to_evaluate,
    labels = ["production"],
    tags = ["evaluation", "simple_evaluation_prompt"],
    config = {"llm": "gpt-4o"},
    type = "chat"
)

Next, we define the function that will run the evaluation prompt. Since it also involves a generation, it is traced as well, but named and tagged so that it is clearly identifiable as an evaluation trace:

from langfuse.decorators import observe, langfuse_context

@observe()
def generate_evaluation(llm_model, reference_text, evaluate_text, model_kwargs, metadata):
    langfuse_handler = langfuse_context.get_current_langchain_handler()
    langfuse_context.update_current_trace(
        name = "evaluation_trace",
        tags = ["evaluation", "first_evaluation"],
        metadata = metadata
    )
    llm = ChatOpenAI(model = llm_model, **model_kwargs)

    # Perform the generation; evaluation_prompt is instantiated below
    # from the prompt stored in Langfuse
    chain = evaluation_prompt | llm
    response = chain.invoke(
        {
            "reference_text": reference_text,
            "evaluate_text": evaluate_text
        },
        config = {"callbacks": [langfuse_handler]}
    )

    return response

We instantiate the dataset and the evaluation prompt:

dataset = langfuse.get_dataset("simple_dataset")
evaluation_prompt = ChatPromptTemplate.from_messages(
    langfuse.get_prompt("evaluation_prompt").get_langchain_prompt()
)

Finally, we iterate through the dataset, apply the evaluation, and save the score:

for item in dataset.items:
    with item.observe(
        run_name = "evaluation",
        run_description = "First Evaluation",
        run_metadata = {"model": "gpt-4o"}
    ) as trace_id:
        question = item.input['text']
        expected_output = item.expected_output['text']

        # Perform the generation (this is the Langfuse-traced generate_response
        # defined earlier in the series, not the LangSmith version above)
        response = generate_response(
            "gpt-4o",
            question,
            chunks,
            {"temperature": 0.1}
        )['output_text']

        # Perform the evaluation
        evaluation_value = generate_evaluation(
            "gpt-4o",
            expected_output,
            response,
            {"temperature": 0.1},
            metadata = {"date": "2024-08-06"}
        ).content

        # Save the score
        langfuse.score(
            trace_id = trace_id,
            name = "evaluate_" + item.id,
            value = float(eval(evaluation_value))
        )

Now we can observe the evaluation traces in Langfuse, along with the names, tags, and their respective scores.

View of the Traces Tab in Tracing: A table is displayed with the columns ID, Timestamp, Name, User, Session, Latency, Usage, Total Cost, Scores, and Tags. There are a couple of traces with the name generate_response, showing values in the Scores column. Directly above each of these traces, there are corresponding traces named evaluation_trace, with the tags “evaluation” and “first_evaluation”.

We can also go directly to the “scores” tab if we’re only interested in tracking how the scores have changed over time.

Finally, the dataset view also records this information in the “runs” tab, allowing us to compare evaluations for the same dataset at a glance.

View of the simple_dataset dataset within the Datasets tab. The Runs tab is selected. A table is displayed with the columns: Name, Description, Run Items, Latency (avg), Total Cost (avg), Scores (avg), Created, and Metadata. There is a single entry with the name evaluation, 4 run items, and two scores, both with a value of 1.
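
To populate that comparison, you can run the same loop again under a different run_name. The sketch below reuses the code above; the run name and the alternative generation model are illustrative choices, not part of the original setup:

# Second evaluation run over the same dataset, so both runs can be
# compared side by side in the "runs" tab
for item in dataset.items:
    with item.observe(
        run_name = "evaluation_v2",
        run_description = "Second evaluation",
        run_metadata = {"model": "gpt-4o-mini"}
    ) as trace_id:
        question = item.input['text']
        expected_output = item.expected_output['text']

        # Same generation and evaluation functions as before
        response = generate_response(
            "gpt-4o-mini", question, chunks, {"temperature": 0.1}
        )['output_text']

        evaluation_value = generate_evaluation(
            "gpt-4o", expected_output, response,
            {"temperature": 0.1}, metadata = {"date": "2024-08-06"}
        ).content

        langfuse.score(
            trace_id = trace_id,
            name = "evaluate_" + item.id,
            value = float(eval(evaluation_value))
        )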

Conclusions

As we have seen, these are two very similar platforms, but a few key differences stood out in this post. LangSmith lets you add dataset examples in bulk with create_examples() and orchestrates experiments through a single evaluate() call, which runs, names, and scores the experiment automatically. Langfuse adds dataset items one by one and leaves the evaluation loop in your hands, combining item.observe(), your own generation and evaluation functions, and langfuse.score(); in exchange, it gives you very fine-grained control over trace names, tags, and metadata.

Both platforms are excellent options for monitoring your LLM-based applications—just choose the one that best fits your needs.
