As we saw in the previous post, Ollama allows us to interact with LLMs in a simple and transparent way. However, beyond running basic tests or getting started with LLMs to understand how they work, software development often requires more advanced and customizable options.

In this post, we will explore these advanced options that Ollama offers to customize our models, as well as how to use its API to get the most out of it and build AI applications.

Previous concepts

In the previous post we addressed some theoretical aspects related to running LLMs locally with Ollama. As we move on to more complex use cases, it’s also necessary to consider new concepts.

It should be noted that these theoretical concepts are generic to LLMs, regardless of the tools discussed. Sometimes, the tool itself may not allow a direct implementation of a concept (for example: Ollama does not train models from scratch, but can apply LoRA/QLoRA adapters to fine‑tune a model).

Here are those concepts:

Fine‑tuning

Fine‑tuning is the process of adjusting a pre‑trained model to a specific task. This is done using a dataset tied to that specific task, which can partially or fully update the neural network’s weights. This allows adapting a general-purpose model to a particular task, without retraining from scratch — resulting in significant time and resource savings.

As discussed earlier, LLMs work by predicting the next word in a context. Without a carefully crafted prompt (prompt engineering), an un‑tuned LLM might produce a coherent answer that’s nonetheless misaligned with the user’s intent.

For instance, if you ask: “How can I build a résumé?” an LLM might answer “Using Microsoft Word”. That answer is valid but might not match the user’s expectations — even if the LLM has broad knowledge about writing résumés.

Fine‑tuning becomes especially useful to customize an LLM’s style (e.g., for a chatbot), or to add domain‑specific knowledge (e.g. specialized vocabulary in legal, finance or healthcare contexts) which the base model was not trained on.

Some common types of fine‑tuning are full fine‑tuning, where all of the model's weights are updated on the new dataset, and parameter‑efficient fine‑tuning (PEFT) techniques such as LoRA and QLoRA, which train small adapter layers on top of a frozen base model.

Another classification of fine‑tuning methods is whether they’re human‑supervised (SFT – Supervised Fine‑Tuning), semi‑supervised, or unsupervised.

File formats: Safetensors and GGUF

A tensor is a mathematical object (essentially a multi‑dimensional array) used to represent data. Safetensors is a file format developed by Hugging Face designed to store such tensors securely. Unlike formats based on Python pickle, Safetensors stores only raw tensor data and metadata, so loading a file cannot trigger arbitrary code execution, and it is designed for efficiency and portability.

To illustrate what a tensor may represent in real life: a scalar (a 0‑dimensional tensor) could be a single temperature reading; a vector (1 dimension) a list of daily temperatures; a matrix (2 dimensions) a grayscale image; and a higher‑dimensional tensor a color image (height × width × color channels) or a batch of such images.

Another file format is GGUF (GPT‑Generated Unified Format), which supports optimized management of LLMs. GGUF can store tensors plus metadata, and supports different types of quantization and fine‑tuning.

GGUF is designed for extensibility and versatility: new metadata can be added without breaking compatibility with older models, which helps keep the format future‑proof.

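You can inspect this metadata directly from Ollama: the ollama show command prints the information stored in the model file. A minimal example, assuming the llama3.2:1b model from the previous post is installed:

# prints the model's architecture, parameter count, context length,
# embedding length and quantization level, along with its license
# and any system prompt or parameters defined for it
ollama show llama3.2:1b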

Model parameters

Just as applications have configurable parameters (multi‑currency, multi‑language, etc.), models also expose configurable parameters. Beyond model‑specific parameters, there are common ones critical for optimization, such as the context length (num_ctx), the temperature, the top_k and top_p sampling filters, and stop sequences.

These are among the most relevant parameters, but there are many more. Depending on how you interact with Ollama (CLI, API, Modelfile), not all of them may be configurable.

Context length

Context length defines the maximum number of tokens the model can handle at once. This includes the system prompt, the user’s prompt, the conversation history, and the tokens the model generates as its response.

Context length is critical, because it directly impacts the quality of the model’s output. In long conversations or when extensive information is provided (e.g. documents), exceeding context limits may lead to erratic or repetitive responses, or loss of ability to reference earlier information.

Although adjustable, LLMs come pre-trained with a maximum context length (e.g. 2048, 4096, 8192, … up to 128K; and some advanced models like Gemini 1.5 Pro claim support for 2 million tokens). Increasing context length beyond model limits can lead to incoherent responses and heavy resource usage (RAM, CPU), potentially causing performance issues.

You can modify this parameter per model in several ways: from the interactive CLI with the /set parameter command, through the API with the num_ctx option, or in a Modelfile with the PARAMETER num_ctx instruction (see the Modelfiles section below). For example, through the API:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:1b",
  "prompt": "How fast does the Earth rotate?",
  "options": {
    "num_ctx": 1024
   }
}'
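
The equivalent from the interactive CLI would look roughly like this (a minimal sketch; the confirmation output printed by Ollama may vary between versions):

ollama run llama3.2:1b
>>> /set parameter num_ctx 1024
>>> How fast does the Earth rotate?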

API

In addition to its interactive commands, Ollama provides an API that allows you to interact with LLMs programmatically. This API is especially useful when integrating with other applications (for example, from Spring AI), since you only need to make a call to the corresponding REST endpoint.

By default, the API is exposed on port 11434 (http://localhost:11434 or http://127.0.0.1:11434), although this can be configured. The API exposes the following endpoints:

POST /api/generate – generates a completion for a given prompt
POST /api/chat – generates the next message in a chat, given a message history
POST /api/embeddings – generates embeddings from a text
GET /api/tags – lists the models available locally
POST /api/show – shows a model’s details (Modelfile, parameters, template, etc.)
DELETE /api/delete – deletes a model
POST /api/pull – downloads a model from the Ollama library
POST /api/create – creates a model from a Modelfile
POST /api/copy – copies a model under a new name
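
As an illustration, a request to the chat endpoint could look like this (a minimal sketch reusing the llama3.2:1b model; stream is set to false so the call returns a single JSON response instead of a token stream):

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2:1b",
  "messages": [
    { "role": "user", "content": "Why is the sky blue?" }
  ],
  "stream": false
}'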

OpenAI API Compatibility

As is widely known, OpenAI is one of the main drivers behind the current wave of LLM-based AI, so many applications rely on its APIs. To keep integrations simple, Ollama provides a compatibility layer that mirrors the OpenAI API format.

This means that, in addition to the endpoints above, Ollama also exposes others under the /v1/ path (http://localhost:11434/v1/) following the OpenAI API structure. Some of the currently supported compatible endpoints include:

/v1/chat/completions
/v1/embeddings
/v1/models

With this compatibility layer, many libraries that integrate with OpenAI’s API only need minimal configuration changes to work with Ollama: point the base URL to http://localhost:11434/v1 and provide any placeholder API key (most clients require one, but Ollama ignores its value).
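
For example, the same local model can be queried through the OpenAI-style endpoint (a sketch; the Authorization header is only there to satisfy OpenAI clients, since Ollama ignores the key value):

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ollama" \
  -d '{
    "model": "llama3.2:1b",
    "messages": [{ "role": "user", "content": "Give me three facts about Madrid." }]
  }'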

Modelfiles

From its early stages, Ollama was designed with modularity and extensibility in mind (its founders include people who previously worked on Docker). This is why Modelfiles exist: configuration files that act as recipes for defining LLM models in layered form, similar to how Dockerfiles describe container images.

A Modelfile combines the model’s GGUF weights, templates, system prompt, and other instructions.

The information is divided into specific Modelfile instructions, similar in spirit to a Dockerfile. Some of the most important include FROM (the base model or weights file), PARAMETER (runtime parameters), TEMPLATE (the prompt template), SYSTEM (the system prompt), ADAPTER (a LoRA adapter to apply), LICENSE, and MESSAGE (an example conversation history). The tables below summarize common PARAMETER values and the variables available inside TEMPLATE:

Parameter   | Description                                                                  | Type   | Example
num_ctx     | Defines the size of the context window                                       | int    | 4096
temperature | Controls the model’s “creativity” level                                      | float  | 0.6
stop        | Stop sequence; when the model encounters this pattern, it stops and returns  | string | "AI assistant:"
top_k       | Filters output options to the top k tokens with the highest probabilities    | int    | 40
top_p       | Filters output options based on cumulative probability                       | float  | 0.9

Variable        | Description
{{ .System }}   | The system prompt used to configure the model’s behavior.
{{ .Prompt }}   | The user prompt.
{{ .Response }} | The LLM’s response.

An example combining the TEMPLATE, SYSTEM, ADAPTER, and LICENSE instructions could be:

TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
"""
SYSTEM """ You are a travel agent who provides useful information such as laws, prices, important places to visit, places to eat, accommodation and other important tourist information. """
ADAPTER ./llama3.2-lora.gguf
LICENSE """ MIT License """

An example of MESSAGE instructions could be:

MESSAGE user Is Madrid in Spain?
MESSAGE assistant yes
MESSAGE user Is Paris in Spain?
MESSAGE assistant no
MESSAGE user Is Barcelona in Spain?
MESSAGE assistant yes

It’s important to note that the Modelfile is not case sensitive — instructions are written in uppercase purely for clarity. Additionally, the instructions do not need to follow a strict order (although placing the FROM instruction first is recommended for readability).

Considering the points described above, a simple example of a Modelfile could be:

FROM llama3.2:1b
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
SYSTEM You are a travel agent who provides useful information such as laws, prices, important places to visit, food spots, accommodations, and other key tourist details.

After saving this file (for example as ModelfileCustom), run the following command to create the model:

ollama create custom-model -f ./ModelfileCustom

Once created, you can verify that it is now listed alongside the other models, and you can start interacting with it:

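For instance (a minimal sketch; the exact listing depends on the models you have installed):

# the new model appears alongside the others
ollama list
# and can be queried like any other model, here with a one-off prompt
ollama run custom-model "What documents do I need to travel from Spain to Japan?"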

Importing External Models

Ollama can also run models from outside its own library, as long as they are distributed in standard formats such as GGUF or Safetensors, or as fine-tuning adapters applied on top of a base model. Here are some examples of the process:

GGUF

  1. Download the file in .gguf format. You can find an example at this link.
  2. Generate the Modelfile referencing the downloaded file.
FROM ./external-models/gguf/llama-3.2-1b-instruct-q4_k_m.gguf
PARAMETER num_ctx 4096
SYSTEM "You are a chatbot that only tells jokes."
  3. Create the model so that it can be executed later:
ollama create custom-gguf-model -f ./Modelfile_gguf
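
Once created, the imported model can be run like any other (a sketch; the prompt is just an illustration of the system prompt defined above):

ollama run custom-gguf-model "Tell me a joke about databases."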

Safetensors

  1. Download the required files into the same folder. You can find an example at this link.
  2. Generate the Modelfile with a reference to the path where the files are located.
FROM ./external-models/safetensors/llama3_2_instruct_st/
PARAMETER num_ctx 4096
PARAMETER temperature 0.7
  3. Create the model so that it can be executed later:
ollama create custom-st-model -f ./modelfile

Adapters

  1. Download the required files into the same folder. You can find an example at this link.
  2. Generate the Modelfile referencing the path where the files are located.
FROM llama-3.2-3b-instruct-bnb-4bit
ADAPTER ./adapters/llama3_2_lora/llama3.2_LoRA_Spanish.gguf
PARAMETER temperature 0.5
SYSTEM "You now respond in the style taught by the LoRA."
  3. Create the model so that it can be executed later:
ollama create custom-lora-model -f ./modelfile
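
You can then check whether the adapter changed the model’s behavior; since the LoRA in this example was trained for Spanish, the answer should reflect that (a sketch based on the file names used above):

ollama run custom-lora-model "Describe a typical day in Madrid."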

Quantization with Ollama

Ollama also includes the option to quantize a model (i.e., reduce the numerical precision of its weights) when the model is created. This is done by passing the -q (or --quantize) flag to the create command, as follows:

ollama create llama-quantized -f ./Modelfile -q q4_K_M

Using the following Modelfile, you can check the result of quantization (the resulting model is noticeably smaller than the fp16 base it was built from):

FROM llama3.2:1b-instruct-fp16
PARAMETER temperature 0.9
SYSTEM """ You are a travel agent called Mr Travel."""
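
After running the create command above, ollama list lets you compare sizes (a sketch; exact sizes depend on the model and the quantization level chosen):

# the quantized llama-quantized model should appear noticeably smaller
# than the llama3.2:1b-instruct-fp16 base it was created from
ollama list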

Under the hood, Ollama uses the quantization routines from the llama.cpp library to carry out this process, saving the model under the appropriate name. Some of the possible quantization options include: q4_0, q5_0, q6_K, q8_0, q4_K_S, q4_K_M, etc.

Conclusions

This post explored more advanced interaction options with Ollama. We tested how the Ollama API works, as well as important parameters and configurations to ensure optimal model performance.

We also tested one of Ollama’s most powerful features: the ability to customize models so they better fit specific use cases and provide a better user experience in AI-powered applications.

In the next post, we’ll introduce other tools for running LLMs locally.

