Bleeding Llama CVE-2026-7482

What is Ollama

Ollama is a tool used to run large language models locally. Instead of relying on cloud providers, developers can download and run models like Llama, Mistral, Gemma, and others directly on their own systems.

It exposes an API server that allows applications to:

load AI models,
generate responses,
create custom models,
and share or push models to remote registries.

The API is a REST-style HTTP service that by default listens on:

http://localhost:11434

| Endpoint | Purpose |
| --- | --- |
| `/api/generate` | Generate text |
| `/api/chat` | Chat completion |
| `/api/tags` | List installed models |
| `/api/pull` | Download models |
| `/api/push` | Push models |
| `/api/create` | Create custom models |
| `/api/delete` | Delete models |
| `/api/embeddings` | Generate embeddings |

Example: generate text

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain SQL injection"
}'
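The same call can be made from Go with only the standard library. The sketch below assumes the default address shown above and sets `"stream": false` so the server answers with a single JSON object instead of a stream of chunks. The names `generateRequest`, `generateResponse`, and `newGenerateRequest` are hypothetical helpers for this illustration, not part of any Ollama client library.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// generateRequest mirrors the JSON body sent by the curl example.
type generateRequest struct {
	Model  string `json:"model"`
	Prompt string `json:"prompt"`
	Stream bool   `json:"stream"` // false asks for one JSON reply instead of a stream
}

// generateResponse keeps only the field this sketch reads.
type generateResponse struct {
	Response string `json:"response"`
}

// newGenerateRequest builds (but does not send) a POST to /api/generate.
func newGenerateRequest(baseURL, model, prompt string) (*http.Request, error) {
	body, err := json.Marshal(generateRequest{Model: model, Prompt: prompt})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/api/generate", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newGenerateRequest("http://localhost:11434", "llama3", "Explain SQL injection")
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("send request (is Ollama running?):", err)
		return
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		fmt.Println("unexpected status:", resp.Status)
		return
	}
	var out generateResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		fmt.Println("decode response:", err)
		return
	}
	fmt.Println(out.Response)
}
```

Note that nothing in this request carries credentials: any client that can reach the port can make the same call.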

Internally, Ollama supports models stored in the GGUF format, which is commonly used for quantized LLMs because it reduces memory usage and improves performance on consumer hardware.

What is CVE-2026-7482

CVE-2026-7482 is a critical heap out-of-bounds read vulnerability affecting Ollama versions before 0.17.1.

The issue exists inside the GGUF model loader. Ollama trusted certain values inside uploaded GGUF files without properly validating them first. A specially crafted malicious model could trick the server into reading memory outside the intended buffer.

Because the vulnerable API endpoints can be exposed over the network, an attacker may remotely leak sensitive memory contents from the Ollama process.

Root Cause

The vulnerability comes from improper validation of tensor metadata inside GGUF files.

A GGUF file contains structured metadata describing:

tensor offsets,
tensor sizes,
quantization data,
and model layout information.
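To make "structured metadata" concrete, here is a minimal sketch that decodes the fixed 24-byte header at the start of a GGUF file: the 4-byte magic `GGUF`, a 32-bit version, a 64-bit tensor count, and a 64-bit metadata entry count, all little-endian as in GGUF v2 and v3. The type `ggufHeader` and function `parseGGUFHeader` are hypothetical names for this illustration and are not Ollama's own loader code. Every value after the magic comes straight from the file, so all of it is attacker-controlled.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// ggufHeader holds the fixed-size fields at the start of a GGUF file.
type ggufHeader struct {
	Version         uint32
	TensorCount     uint64
	MetadataKVCount uint64
}

// parseGGUFHeader decodes the first 24 bytes of data. It checks the length
// before touching any field, so a truncated file is rejected, not over-read.
func parseGGUFHeader(data []byte) (ggufHeader, error) {
	const headerSize = 24 // 4 (magic) + 4 (version) + 8 + 8 (counts)
	if len(data) < headerSize {
		return ggufHeader{}, errors.New("file too short for a GGUF header")
	}
	if string(data[0:4]) != "GGUF" {
		return ggufHeader{}, errors.New("not a GGUF file: bad magic")
	}
	return ggufHeader{
		Version:         binary.LittleEndian.Uint32(data[4:8]),
		TensorCount:     binary.LittleEndian.Uint64(data[8:16]),
		MetadataKVCount: binary.LittleEndian.Uint64(data[16:24]),
	}, nil
}

func main() {
	// A hand-built header: magic, version 3, 2 tensors, 5 metadata entries.
	raw := []byte("GGUF")
	raw = binary.LittleEndian.AppendUint32(raw, 3)
	raw = binary.LittleEndian.AppendUint64(raw, 2)
	raw = binary.LittleEndian.AppendUint64(raw, 5)

	h, err := parseGGUFHeader(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("version=%d tensors=%d metadata=%d\n", h.Version, h.TensorCount, h.MetadataKVCount)
	// prints "version=3 tensors=2 metadata=5"
}
```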

Ollama parsed attacker-controlled values from the file and later used them during model loading and quantization without checking whether the requested memory region actually existed inside the file boundaries.

Example:

data := buffer[offset : offset+size]

If offset + size > actual_file_size and nothing rejects the file before this read, the program ends up reading memory past the end of the allocated heap buffer.

That unintended memory may contain sensitive information belonging to the running Ollama process.

The core problem was essentially:

trusting untrusted metadata,
missing bounds checks,
and unsafe slicing operations during GGUF parsing.
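The missing check can be sketched in a few lines. The example below is an illustration under stated assumptions, not the actual patch: `tensorRange` and `checkedSlice` are hypothetical names, and the 64-byte buffer stands in for the model file. One caveat: in ordinary Go an out-of-range slice expression panics rather than returning adjacent memory, so an explicit check like this matters most where data is reached through raw pointers or handed to native code. This write-up does not detail which of those paths the vulnerable loader used.

```go
package main

import (
	"errors"
	"fmt"
)

// errOutOfBounds is returned when a declared range does not fit in the file.
var errOutOfBounds = errors.New("tensor range outside file bounds")

// tensorRange is a hypothetical pair of values read from untrusted metadata.
type tensorRange struct {
	Offset uint64
	Size   uint64
}

// checkedSlice returns buf[offset : offset+size] only if the whole range lies
// inside buf. The test is written as Size > n-Offset rather than
// Offset+Size > n because the sum can wrap around for huge uint64 values.
func checkedSlice(buf []byte, t tensorRange) ([]byte, error) {
	n := uint64(len(buf))
	if t.Offset > n || t.Size > n-t.Offset {
		return nil, errOutOfBounds
	}
	return buf[t.Offset : t.Offset+t.Size], nil
}

func main() {
	file := make([]byte, 64) // stands in for the loaded model file

	cases := []tensorRange{
		{Offset: 16, Size: 32},            // fits inside the file
		{Offset: 48, Size: 32},            // runs 16 bytes past the end
		{Offset: 8, Size: ^uint64(0) - 4}, // Offset+Size wraps around to 3
	}
	for _, c := range cases {
		data, err := checkedSlice(file, c)
		if err != nil {
			fmt.Printf("offset=%d size=%d rejected: %v\n", c.Offset, c.Size, err)
			continue
		}
		fmt.Printf("offset=%d size=%d ok, %d bytes\n", c.Offset, c.Size, len(data))
	}
}
```

The third case is the reason for the subtraction form: a naive `offset+size > len` comparison would see the wrapped sum 3 and accept the range.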

Exploit flow

An attacker first creates a malicious GGUF model file. Instead of using normal tensor metadata, the attacker modifies fields like tensor offsets and tensor sizes so they point outside the valid model boundaries.

When the file gets uploaded to the Ollama server through endpoints such as /api/create, Ollama begins parsing the model as if it were legitimate.

During the quantization or model loading process, the server uses the attacker-controlled offsets to read memory regions that were never meant to be accessed. Since the application does not properly validate the boundaries, it accidentally leaks portions of heap memory.

The leaked memory can contain fragments of prompts, API tokens, chat history, environment variables, or other sensitive runtime data.

An attacker can then use Ollama’s own functionality, such as /api/push, to retrieve or export the corrupted model output containing the leaked memory data.

What makes this particularly dangerous is that:

no authentication is required in many default deployments,
no user interaction is needed,
and exploitation can happen entirely over the network if the Ollama instance is publicly exposed.

In real-world cases, a publicly reachable Ollama server could be scanned, exploited remotely, and used to collect sensitive information from the host system without crashing the service or drawing immediate attention.
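Since exposure is the precondition for remote exploitation, a quick way to reason about a deployment is to check whether its listen address is loopback-only. The sketch below is a hypothetical helper (`isLoopbackBind` is not an Ollama function). It assumes the listen address is configured through the `OLLAMA_HOST` environment variable in plain `host:port` form, with `127.0.0.1:11434` as the default. Values it cannot parse, including ones with a URL scheme, are conservatively reported as exposed.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// isLoopbackBind reports whether a listen address such as "127.0.0.1:11434"
// is reachable only from the local machine. Anything that is not a loopback
// IP or "localhost" is treated as exposed, to err on the safe side.
func isLoopbackBind(addr string) bool {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		host = addr // no port given: treat the whole value as the host
	}
	if host == "localhost" {
		return true
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}

func main() {
	addr := os.Getenv("OLLAMA_HOST")
	if addr == "" {
		addr = "127.0.0.1:11434" // assumed default when the variable is unset
	}
	if isLoopbackBind(addr) {
		fmt.Println(addr, "is loopback-only")
	} else {
		fmt.Println(addr, "may be reachable from the network")
	}
}
```

An empty host such as `:11434` binds every interface, which is why the helper reports it as exposed.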