In this fourth installment of the series on running LLMs locally, we take a look at how to run them with Llamafile, an alternative tool to Ollama and LM Studio. We’ll explore how it works and walk through a step-by-step guide to getting it running.

Want to catch up on the previous posts in the series?

What is Llamafile?

Llamafile is a tool that turns LLMs into a single executable file, bundling the model weights together with a special version of the llama.cpp library. This file can be run on most computers without installing additional dependencies, and it also includes an inference server that exposes an API to interact with the model. All of this is made possible by combining the llama.cpp library with the Cosmopolitan Libc project (which allows C programs to be compiled and executed across a wide range of platforms and architectures).

Some example model llamafiles you can find are:

Model | Size | License | Llamafile
LLaMA 3.2 1B Instruct Q4_K_M | 1.12 GB | Llama 3.2 | Mozilla/Llama-3.2-1B-Instruct-Q4_K_M-llamafile
Gemma 3 1B Instruct Q4_K_M | 1.11 GB | Gemma 3 | Mozilla/gemma-3-1b-it-Q4_K_M-llamafile
Mistral-7B-Instruct v0.3 Q2_K | 3.03 GB | Apache 2.0 | Mozilla/Mistral-7B-Instruct-v0.3-Q2_K-llamafile


You can find more llamafiles in the examples listed in the official llamafile repository and in Mozilla's collections on Hugging Face.

Supported platforms

Thanks to the execution versatility provided by the Cosmopolitan Libc project, Llamafile can currently run on Linux, macOS, Windows, FreeBSD, OpenBSD, and NetBSD, on both AMD64 and ARM64 processors.

For GPUs, additional configuration may be required, such as installing the NVIDIA CUDA SDK or the AMD ROCm HIP SDK. If GPUs are not detected correctly, Llamafile will default to using the CPU.
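For example, on a machine with a working GPU setup, offloading can be requested explicitly when launching a llamafile. This is a minimal sketch based on the flags documented by llamafile (-ngl for the number of layers to offload and --gpu to select a backend); exact behavior may vary between versions:

# Offload as many layers as possible (llamafile falls back to the CPU if no GPU is detected)
./Llama-3.2-1B-Instruct-Q4_K_M.llamafile -ngl 999
# Or force a specific backend, e.g. NVIDIA, and offload 35 layers
./Llama-3.2-1B-Instruct-Q4_K_M.llamafile --gpu nvidia -ngl 35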

Execution

In this article, we will run llamafiles on the Ubuntu operating system. To do so, follow these steps to get a model’s llamafile up and running:

  1. Download the model’s llamafile. For example, the Q4_K_M build from Mozilla/Llama-3.2-1B-Instruct-llamafile.
  2. Grant execution permissions to the downloaded file using the following command:
chmod +x Llama-3.2-1B-Instruct-Q4_K_M.llamafile
  3. Run the file with the command:
./Llama-3.2-1B-Instruct-Q4_K_M.llamafile
simonrodriguez@simonrodriguez:~/llamafile$ chmod +x Llama-3.2-1B-Instruct-Q4_K_M.llamafile
simonrodriguez@simonrodriguez:~/llamafile$ ./Llama-3.2-1B-Instruct-Q4_K_M.llamafile
LLAMAFILE
software: llamafile 0.9.2
model:    Llama-3.2-1B-Instruct-Q4_K_M.gguf
compute:  11th Gen Intel Core i7-1185G7 @ 3.00GHz (tigerlake)
server:   http://127.0.0.1:8080/

A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.

>>> Hello, what can you tell me about llamafile?
Hello! I’m glad to help. Llamafile is an artificial intelligence service focused on creating and managing text documents. It is a natural language model that uses natural language processing (NLP) techniques to analyze and generate text. Llamafile is trained on a broad corpus of text data, including a variety of titles, articles, emails, and other types of documents. It uses this knowledge to generate human-like text, meaning it can create documents autonomously.
  4. When you finish interacting with the model, simply press Ctrl+C.
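The steps above start an interactive session, but a llamafile can also be used for one-off prompts from the command line. A minimal sketch, assuming the llama.cpp-style flags that llamafile inherits (-p for the prompt, -n for the number of tokens to generate, --temp for the sampling temperature):

# Run a single prompt and print the completion to stdout
./Llama-3.2-1B-Instruct-Q4_K_M.llamafile --temp 0.7 -n 128 \
  -p "Summarize in one sentence what a llamafile is."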

In addition, the llamafile itself provides a chat-style user interface to interact with the model (http://localhost:8080).

Screen shown when you access localhost:8080 while it is running
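By default the built-in server listens on 127.0.0.1:8080 and opens the browser automatically. If you need to expose it on a different address or port, a hedged sketch using the server options documented by llamafile (--server, --host, --port, and --nobrowser; in recent versions server mode may already be the default) would be:

# Serve on all interfaces on port 8081 without opening the browser
./Llama-3.2-1B-Instruct-Q4_K_M.llamafile --server --nobrowser --host 0.0.0.0 --port 8081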

API

Although direct execution allows us to interact with the models, since we are reviewing these tools from a development team’s perspective, we also need an API to integrate them into applications. Among others, Llamafile exposes the /health, /completion, /tokenize, and /detokenize endpoints, plus the OpenAI-compatible /v1/chat/completions. The examples below show each of them against the default server address (http://localhost:8080):

simonrodriguez@simonrodriguez:~$ curl http://localhost:8080/health
{"slots_idle":1,"slots_processing":0,"status":"ok"}

simonrodriguez@simonrodriguez:~$ curl -X POST http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{ "prompt": "Hello, what can you tell me about Paris?" }'
{"content":"Hello! It’s a pleasure to talk with you about the charming city of Paris! It’s a place that evokes passion and love, where every day is a new journey through time..."}

simonrodriguez@simonrodriguez:~$ curl -X POST http://localhost:8080/tokenize \
  -H "Content-Type: application/json" \
  -d '{ "content": "Hello, what can you tell me about Paris?" }'
{"tokens":[71,8083,29386,26860,60045,50018,2727,409,12366,30]}

simonrodriguez@simonrodriguez:~$ curl -X POST http://localhost:8080/detokenize \
  -H "Content-Type: application/json" \
  -d '{ "tokens": [71,8083,29386,26860,60045,50018,2727,409,12366,30] }'
{"content":"Hello, what can you tell me about Paris?"}

simonrodriguez@simonrodriguez:~$ curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{ "messages": [{ "role": "user", "content": "What can you tell me about Paris?" }] }'

The llamafile server documentation lists all the parameters available for the different endpoints.
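As an illustration, generation parameters travel in the same JSON body as the prompt. A minimal sketch using common llama.cpp server parameters (n_predict, temperature, and stop), which llamafile’s server accepts as well:

curl -X POST http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "Hello, what can you tell me about Paris?",
        "n_predict": 128,
        "temperature": 0.7,
        "stop": ["</s>"]
      }'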

Integration with previously downloaded models

As with Ollama and LM Studio, Llamafile can also work with external models as long as they are stored in GGUF format. While this is generally true, in some cases it may be necessary to make certain adjustments depending on the application used to download those models.

To run GGUF files, it is necessary to compile Llamafile on your machine. Installing Llamafile involves the following steps (in this case, on Ubuntu):

  1. Download the source code from the official Git repository (Mozilla-Ocho/llamafile).
  2. In the directory where the code was downloaded, run the following commands (you may need to install up-to-date versions of make, wget, and unzip), consolidated in the sketch after this list:
make -j8
sudo make install PREFIX=/usr/local
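Putting the two steps together, a minimal sketch assuming the official Mozilla-Ocho/llamafile repository (exact make targets may change between releases):

git clone https://github.com/Mozilla-Ocho/llamafile.git
cd llamafile
make -j8
sudo make install PREFIX=/usr/local
llamafile --help   # quick sanity check that the binary is on the PATH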

Once llamafile is installed on the system, you can run models from GGUF files that may have been previously downloaded from other applications such as LM Studio or Ollama.

LM Studio

LM Studio usually stores its models under ~/.cache/lm-studio/models or ~/.lmstudio/models, so from the model’s directory you can run llamafile simply with the following command:

llamafile -m llama-3.2-1b-instruct-q4_k_m.gguf
simonrodriguez@simonrodriguez:~/.lmstudio/models/hugging-quants/Llama-3.2-1B-Instruct-Q4_K_M-GGUF$ ls
llama-3.2-1b-instruct-q4_k_m.gguf
simonrodriguez@simonrodriguez:~/.lmstudio/models/hugging-quants/Llama-3.2-1B-Instruct-Q4_K_M-GGUF$ llamafile -m llama-3.2-1b-instruct-q4_k_m.gguf
LLAMAFILE
software: llamafile 0.9.3
model:    llama-3.2-1b-instruct-q4_k_m.gguf
compute:  11th Gen Intel Core i7-1185G7 @ 3.00GHz (tigerlake)
server:   http://127.0.0.1:8080/

A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.

>>> Hello, what can you tell me about Madrid?
Hello! Madrid is a vibrant city full of history in Spain. Located in the heart of the Iberian Peninsula, it is one of the most visited cities in the country. The city has a rich culture and architecture, with more than 2,000 historic buildings, including the Almudena Cathedral, San Miguel Cathedral, and the Almudena Mosque.

Ollama

When a model is downloaded in Ollama, its metadata is stored in a file (manifest) in the corresponding directory, usually under ~/.ollama/models/manifests/registry.ollama.ai/library/.

If we open this file, we can see its properties in JSON format. Inside layers, we need to focus on the digest of the layer whose mediaType value ends with .model.

Manifest file in JSON format, with the "digest" field inside "layers" highlighted

This digest value is used as the file name in the blobs directory (~/.ollama/models/blobs). It is this file from the blobs folder that can be used to run llamafile for the corresponding model. An example execution would be:

llamafile -m sha256-74701a8c35f6c8d9a4b91f3f3497643001d63e0c7a84e085bed452548fa88d45
simonrodriguez@simonrodriguez:/usr/share/ollama/.ollama/models/blobs$ ls
sha256-74701a8c35f6c8d9a4b91f3f3497643001d63e0c7a84e085bed452548fa88d45
simonrodriguez@simonrodriguez:/usr/share/ollama/.ollama/models/blobs$ llamafile -m sha256-74701a8c35f6c8d9a4b91f3f3497643001d63e0c7a84e085bed452548fa88d45
LLAMAFILE
software: llamafile 0.9.3
model:    sha256-74701a8c35f6c8d9a4b91f3f3497643001d63e0c7a84e085bed452548fa88d45
compute:  11th Gen Intel Core i7-1185G7 @ 3.00GHz (tigerlake)
server:   http://127.0.0.1:8080/

A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.

>>> Hello, what can you tell me about Barcelona?
Hello! Barcelona is a beautiful and vibrant city located on the Mediterranean coast, in Catalonia, Spain. It is famous for its Gothic and Renaissance architecture, its charming cobbled streets, and its stunning beaches. The city is a popular destination for tourists from all over the world and is also especially attractive for culture and history lovers. Some of Barcelona’s most notable attractions include the Sagrada Família, Park Güell, the Gothic Quarter, and Barceloneta Beach.
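Putting the whole lookup together, here is a hedged sketch that extracts the digest from a manifest with jq and runs the corresponding blob; the model name (llama3.2), the tag (latest), and the installation paths are assumptions you should adapt to your own setup:

# Locate the manifest of the model pulled with Ollama (paths and model name are assumptions)
MANIFEST=/usr/share/ollama/.ollama/models/manifests/registry.ollama.ai/library/llama3.2/latest
# Keep the digest of the layer whose mediaType ends with ".model"
DIGEST=$(jq -r '.layers[] | select(.mediaType | endswith(".model")) | .digest' "$MANIFEST")
# Blob file names use "-" instead of ":" in the digest
llamafile -m /usr/share/ollama/.ollama/models/blobs/$(echo "$DIGEST" | tr ':' '-')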

Conclusions

In this post, we’ve explored another alternative to Ollama and LM Studio: Llamafile. This tool follows a slightly different approach, in which each model is packaged as its own self-contained executable, with the advantage that no additional software installation is required.

In the next post, we’ll revisit an old acquaintance from the development world that is also joining the trend of running LLMs locally. I’ll see you in the comments!

