Many things are changing in this new era of AI. Every day, we see new AI tools reaching new milestones, and yet we are still in a relatively early stage (although, if you think about it, it’s not that early — AI as a discipline dates back to the 1940s and 50s).

On the other hand, and as we’ve seen in recent years with Cloud platforms, some things may not change much in the world of development. That’s because the major AI platforms used by most people are still controlled by a handful of large multinational corporations. We can confirm this by looking at the AI investment breakdown from the top tech companies and the recent AI-related hires.

Naturally, this means we’ll be more or less dependent on what these companies decide in terms of tool features, data handling, privacy, and pricing. While this model will likely remain (and undoubtedly provides value), we now also have a more decentralized alternative: open AI tools and models.

Alongside the rise of major AI platforms, open development communities have emerged. These communities aim to make AI applications and resources more accessible, relying on open source, without being tied to any one company or technology stack.

In this series of posts, we’ll explore some of these more decentralized platforms and what they can offer — particularly for development teams.

Foundational Concepts

While this article has a hands-on focus — showing how to interact with LLMs in a straightforward way — there are some key concepts that are important to understand to better grasp how certain configurations impact the behavior or accuracy of LLM responses.

Below, we explain a couple of high-level concepts that help shed light on how these models work under the hood.

How LLMs Work

LLMs (Large Language Models) are AI models built on neural networks and trained using self-supervised machine learning on massive datasets, allowing them to understand and generate natural language and other types of content to perform a wide range of tasks.

This gives us a general sense of how they can help us in our daily work. But… how do they really work internally? Many modern LLMs are based on the Transformer architecture, composed of neural networks trained as language models.

In the Transformer architecture, words (inputs) are converted into vector representations or embeddings which are then used, along with optional additional inputs, to generate probabilistic predictions (words) to accomplish the task at hand (text generation, translation, summarization, etc.).

The Transformer architecture consists of two main components: an encoder, which builds a contextual representation of the input, and a decoder, which generates the output from that representation.

Transformer Architecture

At a high level, this architecture relies on statistical knowledge of language, i.e., the probabilities of one word appearing in a given context. While originally developed for translation tasks, it can be adapted to other Natural Language Processing (NLP) tasks.

For example, to perform a text generation task, the Transformer architecture processes input words in parallel to predict the next word based on those inputs and their context.

Text generation example using Transformer architecture

Based on these probabilities, a word is selected and the process repeats — the newly generated word is added to the input. This cycle continues iteratively, with each generated word becoming part of the input, until the full text is generated. Naturally, during this iterative process, the output probabilities can vary depending on many tunable parameters in the LLM.
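To make this select-append-repeat cycle concrete, here is a minimal, purely illustrative Python sketch of autoregressive generation. The toy next_token_probs table is an assumption for the example and stands in for the Transformer itself (a real model computes the distribution from the whole context, not just the last word); the point is the loop and the effect of sampling from a probability distribution, which parameters such as temperature then modulate.

import random

# Toy "model": for each current word, the probabilities of possible next words.
# A real Transformer computes this distribution from the entire input context.
next_token_probs = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the":     {"cat": 0.5, "dog": 0.3, "model": 0.2},
    "a":       {"cat": 0.4, "dog": 0.6},
    "cat":     {"sat": 0.7, "<end>": 0.3},
    "dog":     {"ran": 0.7, "<end>": 0.3},
    "model":   {"<end>": 1.0},
    "sat":     {"<end>": 1.0},
    "ran":     {"<end>": 1.0},
}

def generate(max_tokens=10):
    tokens = ["<start>"]
    for _ in range(max_tokens):
        probs = next_token_probs[tokens[-1]]    # distribution for the current context
        word = random.choices(list(probs), weights=list(probs.values()))[0]  # sample one word
        if word == "<end>":
            break
        tokens.append(word)                      # the new word becomes part of the input
    return tokens[1:]

print(" ".join(generate()))  # e.g. "the cat sat"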

Quantization

Quantization is the process of reducing the precision of a model’s weights, converting numbers from floating point format (e.g., 16 or 32 bits) to integers (e.g., 4 or 8 bits).

This precision reduction is applied to the model's weights, including the embeddings (how words are represented in the Transformer architecture described above). There are several ways to implement it, but the goal is the same: reduce model file size and RAM usage, and often improve inference speed as well.

Quantization

Ultimately, quantization enables the execution of larger models on the same hardware, though it comes at the cost of some loss in quality or accuracy — sometimes negligible.
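As a rough back-of-the-envelope check, you can estimate how the bit width drives the file sizes shown in the table below. This is a naive approximation: real quantized formats (Q2_K, Q4_K_S, ...) store extra per-block metadata, so actual files come out somewhat larger.

# Rough estimate of model weight storage at different precisions.
GIB = 1024 ** 3

def weights_size_gib(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / GIB

for bits in (16, 8, 4, 2):
    print(f"7B model at {bits:>2}-bit: ~{weights_size_gib(7e9, bits):.1f} GiB")

# 16-bit: ~13.0 GiB   8-bit: ~6.5 GiB   4-bit: ~3.3 GiB   2-bit: ~1.6 GiB

The 16-bit estimate lines up with the 13.0G F16 entry in the table, while the 4-bit estimate falls slightly below the Q4_K_S size precisely because of that per-block metadata overhead.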

Below is a sample comparison of quantized models (original table):

Model | Metric                            | F16    | Q2_K   | Q3_K_M | Q4_K_S
7B    | perplexity*                       | 5.9066 | 6.7764 | 6.1503 | 6.0215
7B    | file size                         | 13.0G  | 2.67G  | 3.06G  | 3.56G
7B    | ms/token @ 8 threads, M2 Max      | 111    | 36     | 36     | 36
7B    | ms/token @ 4 threads, Ryzen 7950X | 214    | 57     | 61     | 68
13B   | perplexity*                       | 5.2543 | 5.8545 | 5.4498 | 5.3404
13B   | file size                         | 25.0G  | 5.13G  | 5.88G  | 6.80G
13B   | ms/token @ 8 threads, M2 Max      | 213    | 67     | 77     | 68
13B   | ms/token @ 4 threads, Ryzen 7950X | 414    | 109    | 118    | 130

* Perplexity refers to a metric used to evaluate the prediction performance of a model. The lower the value, the better the model.
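For intuition, perplexity is the exponential of the average negative log-probability the model assigns to each token of a test text. A quick sketch with made-up probabilities (assumed values, just for illustration) shows why lower is better:

import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

confident_model = [0.6, 0.5, 0.7, 0.4]   # assigns high probability to the real tokens
uncertain_model = [0.2, 0.1, 0.3, 0.1]   # spreads probability mass more thinly

print(perplexity(confident_model))  # ~1.9
print(perplexity(uncertain_model))  # ~6.4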

Model quantization is widely used in scenarios with limited resources, such as mobile applications, autonomous vehicles, IoT devices, drones, and more.

The concepts mentioned above may be among the most relevant for basic interaction with LLM execution tools or the LLMs themselves, but there are many more to be aware of (tokens, prompts, embeddings, RAG...).

Benefits of Running LLMs Locally

Nowadays, chatbots based on LLMs/AI can be found in nearly every application or website to help us with daily tasks.

In general, interacting with these applications is straightforward and poses no immediate security or privacy risk. However, even when platforms offer these services for free, there is always an underlying cost (often a significant one).

While these factors may not concern users initially, they are critical for businesses, as they can lead to billing issues and security risks. In such cases, it is more appropriate to run LLMs on machines controlled by the user or organization, which offers several benefits: data never leaves infrastructure you control (privacy and security), costs stay predictable because there is no per-request billing, and the models remain available even without an internet connection.

Throughout this post series, we will explore some platforms/tools that allow us to run LLMs locally on our PCs in a simple and transparent way.

Ollama

Ollama is an application that allows users to run and interact with LLMs locally on their machine without a constant internet connection. Interaction with the models happens entirely offline, although you will need an internet connection to install Ollama itself and to download models to your machine.

Ollama is built on top of the llama.cpp library, providing a wrapper layer that simplifies interaction and management of LLMs, abstracting many of the lower-level concepts for developers and users.

System Requirements

Before beginning installation, keep in mind the system requirements for running Ollama smoothly. These requirements depend primarily on the models you run within Ollama, as the Ollama software itself is lightweight. Key requirements include enough RAM to hold the model you want to run (larger models need more memory), sufficient disk space for the downloaded model files, and, optionally, a supported GPU to speed up inference.

Installing Ollama

There are several ways to run Ollama on your system depending on your OS and needs: native installers for macOS and Windows, an install script for Linux, and an official Docker image.

In this post, we will install Ollama for Linux following the official documentation. Simply download Ollama using the command provided on their website:

curl -fsSL https://ollama.com/install.sh | sh

Once Ollama is installed on your system, it's helpful to understand certain aspects that will be useful for running it effectively:

Model Storage

Each operating system has a different folder path where Ollama stores downloaded models. It's important to be aware of this for managing disk space and making backups.

On Linux, with the default recommended installation, models are stored at the path: /usr/share/ollama/.ollama/models.

Environment Variables

Ollama allows customization via environment variables. Some of the most useful include OLLAMA_MODELS (the directory where models are stored), OLLAMA_HOST (the address and port the Ollama server listens on), OLLAMA_DEBUG (enables more verbose logging), and OLLAMA_KEEP_ALIVE (how long a model stays loaded in memory after its last request).

Note that how you configure these variables depends on how Ollama was installed. With the default installation on Linux, you need to override the systemd service configuration as follows:

  1. Open an override file for the service:
sudo systemctl edit ollama.service
  2. Add the variables and their values under the [Service] section:
[Service]
Environment="OLLAMA_MODELS=/custom-path"
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_DEBUG=1"
...
  3. Save the changes. On Linux, the override is stored under /etc/systemd/system/ollama.service.d.
  4. Reload the systemd configuration to apply the changes:
sudo systemctl daemon-reload
  5. Restart the Ollama service:
sudo systemctl restart ollama

Application Logs

Log locations are specific to each operating system. On Linux (with Ollama running natively), you can view the logs with the following command (add the -f option to follow them in real time):

sudo journalctl -u ollama

This produces output like the following:

...
-- Reboot --
date&time user systemd[1]: Started Ollama Service.

date&time user ollama[2598]: ... routes.go:1230: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* ...] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"

date&time user ollama[2598]: time=... level=INFO source=images.go:432 msg="total blobs: 25"

date&time user ollama[2598]: time=... level=INFO source=images.go:439 msg="total unused blobs removed: 0"

date&time user ollama[2598]: time=... level=INFO source=routes.go:1297 msg="Listening on 127.0.0.1:11434 (version 0.6.1)"

date&time user ollama[2598]: time=... level=INFO source=gpu.go:217 msg="looking for compatible GPUs"

date&time user ollama[2598]: time=... level=INFO source=gpu.go:377 msg="no compatible GPUs were discovered"

date&time user ollama[2598]: time=... level=INFO source=types.go:130 msg="inference compute" id=0 library=cpu variant="" compute="" driver=0.0 name="" total="... GiB" available="... GiB"
...

By adding the -n 100 option to the previous command, we can view the last 100 lines of the log. Using shell output redirection, we can also export the Ollama logs to a file:

sudo journalctl -u ollama.service > ollama_logs.txt

Permission Errors

On Linux, errors may occur during model creation or download if the user doesn't have write permissions in the directory where models are stored.

History File

This file stores the history of conversations with Ollama. On Linux, it can be found at: ~/.ollama/history.

GPU Acceleration

When a model is running, the ollama ps command indicates whether GPU acceleration is being used via the PROCESSOR column: a value of 100% GPU confirms the model is running entirely on the GPU. We'll explore the behavior of ollama ps in more detail in a later section.

You may also encounter errors such as CUDA error or ROCm error during execution. If so, make sure to check the GPU drivers and system configuration for compatibility.

Below is an example of running ollama ps where GPU acceleration is not active:

GPU Acceleration

Basic Ollama Commands

Once Ollama is installed, you can interact with it and the associated models using the following commands.

PULL

This command is used to download available models from the Ollama website. It fetches the necessary files to run the model and updates them if newer versions are available. Here's an example command:

ollama pull llama3.2:1b

On the Ollama website, in the models section, you'll find various tags that let you filter models by topic:

Tags to filter models by topic in Ollama

Each model also displays multiple tags following a common structure:

Tag structure in Ollama. qwen3, "latest generation of LLMs in Qwen series"

It's worth noting that model tagging may evolve or new tags may appear. On each model’s page, you'll find relevant details about its characteristics and, in some cases, performance benchmarks.

Qwen3

RUN

Command to run a model and interact with it:

ollama run llama3.2:1b

If the model hasn't been downloaded beforehand, Ollama will download it. Once the model is fully loaded, an interactive prompt will appear where you can start sending requests to the model:

interactive prompt in Ollama "send a message"

Within this interactive prompt, several special commands are available:

/? or /help command in Ollama
/set parameter command in Ollama
/show command in Ollama
/show info command in Ollama
/show template command in Ollama
/save command in Ollama
/load command in Ollama
/bye or /exit command in Ollama

Additionally, there are other variants of the run command, such as:

ollama run llava "What's in this image? ./multimodal.jpg"
Multimodal model
ollama run llama3.2:1b "Summarize this file: $(cat README.md)"
Prompt as an argument

LIST

Command to list downloaded models:

ollama list

Displays information such as the model name, ID, size on disk, and last modified date:

list command in Ollama

SHOW

Command that displays detailed information about each model:

ollama show gemma3:1b

It will display information such as the parameter configuration, metadata, or template details.

show command in Ollama

RM

Command to remove a model from Ollama:

ollama rm gemma3:1b
rm command in Ollama

CP

Command to copy an existing model, which can be used for model customization. In reality, the command does not copy the entire file but rather its reference. This means that if, for example, you copy a 5GB model, it will not create a new 5GB file — instead, it copies the manifest file that references the model:

ollama cp llama3.2:1b llama-custom
cp command in Ollama

PS

Command to view which models are currently loaded into memory, also useful to check whether GPU acceleration is being used in Ollama:

ollama ps

Provides information about the model, ID, size, processor used (CPU/GPU), and time since last access.

ps command in Ollama

STOP

Command to stop a running model:

ollama stop llama3.2:1b
stop command in Ollama

PUSH

Command to upload a created model to the Ollama registry.

ollama push llama-custom

The Ollama registry is a centralized repository for storing models used by the tool itself, and where users can publish their own models. This registry offers several benefits, including:

HELP

Command to learn how to use the rest of Ollama's commands.

ollama help
Ollama help command

SERVE

Command to start Ollama. Useful in cases where Ollama is not running as a background system process.

ollama serve
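Whether started by systemd or with ollama serve, Ollama listens on a local HTTP API (by default on port 11434, as seen in the log output above). As a minimal sketch, assuming a model such as llama3.2:1b has already been pulled, you can query it from Python using only the standard library:

import json
import urllib.request

# Minimal call to the local Ollama HTTP API (default address 127.0.0.1:11434).
payload = {
    "model": "llama3.2:1b",               # any model you have pulled locally
    "prompt": "Explain quantization in one sentence.",
    "stream": False,                       # return a single JSON object instead of a stream
}

req = urllib.request.Request(
    "http://127.0.0.1:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())

print(body["response"])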

How to Choose the Right Model?

As we've seen in previous sections, several technical considerations come into play when selecting the model you want or can use: the resources available (RAM, disk, CPU, GPU) and the characteristics of each model (size, quantization, variants, and so on).

But beyond the technical side, models can also be selected based on tasks or use cases, following these recommendations:

Ultimately, it’s important to read the documentation and specifications of each model and run experiments to evaluate which one performs best for your specific use case and dataset.
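As a small, illustrative starting point for such experiments, the sketch below sends the same prompt to several candidate models through the local Ollama API and measures how long each takes. The model names and prompt are placeholders (assumptions for the example); it also assumes the Ollama server is running and the listed models have already been pulled.

import json
import time
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"
candidates = ["llama3.2:1b", "gemma3:1b"]   # placeholder names: use the models you have pulled
prompt = "Summarize the benefits of running LLMs locally in two sentences."

def ask(model, prompt):
    payload = {"model": model, "prompt": prompt, "stream": False}
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

for model in candidates:
    start = time.perf_counter()
    answer = ask(model, prompt)
    elapsed = time.perf_counter() - start
    print(f"--- {model} ({elapsed:.1f}s) ---\n{answer}\n")

Comparing the answers side by side (and, if needed, scoring them against a small set of expected outputs) gives a quick first impression of which model fits your use case before committing to one.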

Conclusions

This introduction to running LLMs locally has covered some of the key concepts to keep in mind when working with these models, as well as how to choose the most suitable one for your use case.

Focusing on the practical side, we've seen how simple it is to interact with models via Ollama using its various commands.

In the next article, we’ll dive into more advanced Ollama options, including model customization.
