Text generation has a major impact on society and, as a result, there is a need to evaluate it: How good is my prompt? How do I know if my LLM has stopped working properly? Intuitively, a human being would seem to be the best option for answering these questions.

What happens, however, if we want to conduct a consistent, automatic evaluation? And what if these questions were answered by... another LLM?

Introduction

Evaluating an LLM-based system is key to ensuring that:

  1. The system will yield quality results once in production, and
  2. The system will maintain that quality over time.

In this post we will talk about how to evaluate these systems using metrics, focusing on LLM-based metrics. These are metrics that use an LLM to evaluate the outputs of another LLM.

First, we will describe some of the difficulties that arise when evaluating an LLM and the main aspects to take into account. Then we will discuss the existing metrics we consider most relevant, as well as some useful libraries and frameworks.

Finally, we will discuss the application of LLM-based metrics at Paradigma.

Premises: difficulties when evaluating

There are a number of factors that make the evaluation of an LLM a non-trivial matter. These difficulties need to be identified so that the evaluation methodology can be as robust as possible.

Let us go through some of the aspects that give rise to difficulties when evaluating:

  1. The form of the message. In other words, is the LLM answering in an appropriately polite tone? Are certain words being repeated across the generated answers? Is the LLM using the language we want for this task?
  2. The correctness of the information. We want to answer questions like: Do my system's answers contain all the relevant information? Is the information in those answers accurate, or is it made up?
  3. Other aspects. There are further aspects to take into account when evaluating an LLM. For example, how vulnerable is the LLM to prompt injection? Is the format of the answer the one I need? (Some of these checks can be automated with a simple script, as sketched below.)
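For instance, if we expect the model to reply with a JSON object following a given structure, a deterministic check is enough. Below is a minimal sketch in Python; the expected fields ("answer" and "sources") are just an assumption for the example.

# Minimal sketch of a deterministic format check: is the answer valid JSON
# with the fields we expect? (The expected fields are hypothetical.)
import json

EXPECTED_FIELDS = {"answer", "sources"}  # hypothetical output schema

def check_format(llm_output: str) -> bool:
    """Return True if the output is a JSON object containing the expected fields."""
    try:
        parsed = json.loads(llm_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and EXPECTED_FIELDS.issubset(parsed.keys())

print(check_format('{"answer": "42", "sources": ["doc1.pdf"]}'))  # True
print(check_format("The answer is 42."))                          # False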

Non-LLM-based metrics

Within NLP there is a set of tasks that do not require a “human” understanding of the content generated by an LLM, as they do not call for much creativity on the part of the system.

Now we will define some of the metrics that are used in evaluating these systems:
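Classic examples of such metrics are n-gram overlap scores like BLEU and ROUGE, which compare a generated text with a reference text without involving any LLM. As an illustration, here is a minimal sketch of a unigram-overlap F1 score in the spirit of ROUGE-1 (the implementation is our own simplification, not the reference one):

# Minimal sketch of a non-LLM metric: unigram-overlap F1 (in the spirit of ROUGE-1).
from collections import Counter

def unigram_f1(generated: str, reference: str) -> float:
    """F1 over the unigram overlap between a generated text and a reference text."""
    gen_tokens = generated.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(gen_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(unigram_f1("the cat sat on the mat", "a cat was sitting on the mat"))  # ~0.62

Metrics like these are cheap and reproducible, but they only capture surface overlap, which is precisely why they fall short for more open-ended generation tasks.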

LLM-based metrics

When it comes to evaluating an LLM or a prompt, it is very likely that we will want to perform a semantic evaluation, i.e. we want the answers to have a certain semantic content, even if they do not always use the same words in the same order. This is why the best way to do it is with human supervision.

When it comes to developing and maintaining an LLM-based system, however, the evaluation process has to be automatic, which rules out relying on human supervision.

But what if we asked an LLM to understand and evaluate content the way a human being would? This is how LLM-based metrics were born: metrics in which an LLM determines whether a text meets certain requirements specified in a prompt.

LLM-based metrics can be used to evaluate different tasks or types of system (RAG, Q&A, code generation…) and can be either supervised, i.e. requiring a ground truth to compare against the generated text, or unsupervised.
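To make this idea concrete, below is a minimal "LLM as judge" sketch for the unsupervised case: an evaluator LLM scores a generated answer on a 0-5 scale. The prompt wording, the scale, the model name and the use of the OpenAI Python client are illustrative assumptions, not a fixed recipe.

# Minimal "LLM as judge" sketch: an evaluator LLM scores a generated answer.
# The judge prompt, the 0-5 scale and the model name are illustrative choices.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an evaluator. Given a question and an answer, rate how well
the answer addresses the question on a scale from 0 (useless) to 5 (perfect).
Reply with the number only.

Question: {question}
Answer: {answer}"""

def judge_answer(question: str, answer: str, model: str = "gpt-4o-mini") -> int:
    """Ask an evaluator LLM to score the answer and parse its numeric verdict."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # make the judgement as deterministic as possible
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    return int(response.choices[0].message.content.strip())

print(judge_answer("What is the capital of France?", "The capital of France is Paris."))  # expected: 5

A supervised variant would simply include the ground-truth answer in the judge prompt and ask whether the generated answer is consistent with it.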

Next, we describe some of the most commonly used LLM-based metrics, focusing on the ones we find most interesting:

(Figure: example of TP, FP and FN.)

Libraries and frameworks

Given the growing importance of LLM-based metrics, new packages and frameworks to support LLM evaluation are appearing all the time. We would like to highlight the following tools:

RAGAs: a library focused on evaluating RAG systems that implements several of the metrics described above.

DeepEval: a framework specialised in approaching LLM evaluation as unit tests. It incorporates a large number of metrics and integrates with other libraries, such as RAGAs.

LangSmith: a framework with many built-in features for managing the lifecycle of an LLM-based system: traceability, prompt versioning, metric logging and so on. It is developed by the LangChain team, so integration with LangChain is practically seamless.

Langfuse: an open-source framework similar to LangSmith, although with fewer features. Our colleague José María Hernández de la Cruz gave a webinar where he talked about this framework and showed how it is used.
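As an illustration of how these tools are typically used, below is a minimal sketch of a RAGAs evaluation. It is only a sketch: the exact imports, metric names and dataset column names vary between RAGAs versions, and an evaluator LLM (e.g. an OpenAI API key) must be configured for the metrics to run.

# Minimal RAGAs sketch (imports and column names may differ between versions).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# Toy evaluation set: the question, the generated answer,
# the retrieved contexts and a reference ("ground truth") answer.
data = {
    "question": ["What does RAG stand for?"],
    "answer": ["RAG stands for Retrieval-Augmented Generation."],
    "contexts": [["RAG (Retrieval-Augmented Generation) combines a retriever with a generator LLM."]],
    "ground_truth": ["Retrieval-Augmented Generation."],
}

# Each metric calls an evaluator LLM under the hood, so an API key must be configured.
result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)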

Real-world example: benchmarking

At Paradigma Digital we have already worked with LLM-based metrics. As part of a project developed by the GENAI team, we benchmarked different technologies to obtain a baseline for developing generative AI projects.

We based this project on a contact centre use case, where the evaluation dataset consisted of a set of questions in Spanish together with the answers that, in theory, a contact centre agent would give based on calls from selected customers.

LLMs were one of the technologies we benchmarked. In order to do so, we devised metrics (based on some of the ones described above) that we thought would be useful in this use case:

After analysing the calculated metrics, we reached the following conclusions:

Example: evaluating an LLM with itself

Conclusions and discussion

In this post we have briefly covered how an LLM can be evaluated, particularly using other LLMs as evaluators. The most salient points are listed below:
