Self-Hosting AI on Apple Silicon (M4, 2026)

How to run local AI on an Apple Silicon Mac in 2026: how much RAM you need, which chip to buy, MLX vs llama.cpp, and an Ollama plus Open WebUI setup.

Updated 18 June 2026

By Rob · 18 June 2026 · 8 min read

An Apple Silicon Mac makes a genuinely good local-AI machine, and not for the reason most people assume. It is not about raw GPU power - it is about unified memory. Get the two things that actually matter right (enough RAM, and the right runtime) and a Mac mini quietly running models in the corner will outperform a lot of pricier setups. Here is what to buy and how to set it up.

Apple Silicon (Apple's own M-series chips, which put the CPU, GPU and memory on one package) shares a single high-bandwidth pool of memory across the whole chip. That one design choice is why a Mac can run large models that would need an expensive dedicated GPU on a PC - and why the buying decision comes down to memory, not benchmarks.

What actually limits local AI on a Mac?

Two things, in this order. First, unified memory sets the ceiling on model size. Because the CPU, GPU and Neural Engine all share one pool of RAM, the GPU can use most of it to hold model weights - no copying between separate memory banks. So the biggest model you can run is roughly your total RAM minus a few gigabytes for the system and your context. By default macOS also limits how much of that pool the GPU may claim, to somewhere around two-thirds to three-quarters of the total, so in practice a 32GB Mac comfortably holds a model of around 20GB.

Second, memory bandwidth sets the speed. Generating tokens with a local LLM is bandwidth-bound, not compute-bound: for every token the chip has to read the model's weights from memory, so how fast it can do that - not how many GPU cores it has - decides your tokens per second. This is why the headline GPU-core counts matter far less than the memory specs when you are buying for AI.
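To make the bandwidth argument concrete, here is a rough back-of-envelope sketch. It assumes each generated token reads every weight from memory exactly once, so bandwidth divided by model size gives a ceiling on tokens per second. That is a simplification (it ignores the context cache, compute and runtime overhead), so real speeds land below it. The function name and the example figures are illustrative, not measurements.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Ceiling on generation speed if each token reads all weights once.

    Assumption: decode is purely bandwidth-bound. Real throughput is lower.
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    # Hypothetical inputs: the same 5GB model on two different memory bandwidths.
    for bandwidth in (200.0, 400.0):
        ceiling = max_tokens_per_second(bandwidth, model_size_gb=5.0)
        print(f"{bandwidth:.0f} GB/s -> at most {ceiling:.0f} tokens/s")
```

Doubling the bandwidth doubles the ceiling, and doubling the model size halves it, which is the whole buying argument in two lines of arithmetic.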

How much RAM do you need for which model?

Model files are sized by their parameter count and how aggressively they are quantized (compressed - 4-bit is the common sweet spot, trading a little quality for a much smaller footprint). These are practical, real-world tiers, leaving headroom for the OS and a reasonable context window.

8GB: Small models only - 3B to 4B (e.g. a compact Llama or Qwen). Usable for simple tasks.
16GB: 7B-8B comfortably. The realistic entry point for genuinely useful local AI.
32GB: 14B comfortably; 32B is tight but possible. A strong all-rounder.
64GB: Reaches 70B-class models at 4-bit. Serious local-AI territory.
128GB+: Largest open models and long-context work. M3 Ultra goes up to 512GB.
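The tiers above follow from simple arithmetic: weight size is roughly parameter count times bits per weight, divided by eight bits per byte. The sketch below shows that calculation and a basic fit check. The 8GB headroom is an assumed placeholder for the OS and context, not a measured figure, and real 4-bit files run somewhat larger than this estimate because they also store scaling data.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters x bits / 8 bits per byte."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         ram_gb: float, headroom_gb: float = 8.0) -> bool:
    """True if the weights fit in RAM after reserving headroom.

    headroom_gb is an assumption covering the OS and context; tune it.
    """
    return weights_gb(params_billion, bits_per_weight) <= ram_gb - headroom_gb


if __name__ == "__main__":
    for params in (8, 14, 32, 70):
        size = weights_gb(params, bits_per_weight=4)
        print(f"{params}B at 4-bit: about {size:.0f}GB of weights")
```

For example, `fits(70, 4, ram_gb=64)` is true while `fits(70, 4, ram_gb=32)` is false, which matches the 64GB tier being where 70B-class models start.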

Which chip should you buy?

For local AI the chips line up by memory bandwidth and maximum RAM rather than by name. An M4 or M4 Pro Mac mini with 32 to 64GB is the value sweet spot for most people (64GB requires the M4 Pro) - quiet, cheap to run, and capable of genuinely useful models. An M4 Max (up to around 546 GB/s of bandwidth) pushes speed and RAM higher for heavier use. The M3 Ultra sits at the top, with around 800 GB/s of bandwidth and up to 512GB of unified memory - that extra bandwidth translates fairly directly into faster generation on the same model, and the huge RAM ceiling unlocks the largest models.

For most home setups, a Mac mini with as much RAM as you can justify beats a faster chip with less memory. If you are weighing a Mac against a small PC for this job, the best mini PCs for local LLMs guide covers the other side of that decision.

MLX or llama.cpp - which runtime should you use?

This is the choice that buys you the most speed at no cost. llama.cpp is the long-standing cross-platform engine; MLX is Apple's own framework, tuned for Apple Silicon. On mid-sized models, MLX generates noticeably faster - commonly 20 to 50 percent quicker than llama.cpp - because it is built for the hardware. Above roughly 27B parameters the gap narrows, because memory bandwidth becomes the bottleneck for both and the runtime matters less.

The good news is you no longer have to choose a harder tool to get MLX speed: recent versions of Ollama added an MLX backend, so you can keep the easy Ollama workflow and still get the Apple-optimised speed. One honest caveat - MLX's decode-speed lead does not always carry over to very long contexts, where the initial prompt-processing step can dominate the wall-clock time. For everyday chat and coding at normal context lengths, MLX is the faster default.

How do you set it up with Ollama and Open WebUI?

The fastest path to a working local AI on a Mac is Ollama for the engine and Open WebUI for the interface. The steps below get you from nothing to a private chat assistant.

1. Install Ollama: Download the Ollama macOS app or install via Homebrew. It runs as a background service and exposes a local API.
2. Pull and run a model: From Terminal, run a model sized to your RAM (e.g. an 8B model on a 16GB Mac). Ollama downloads it once, then runs it locally - everything stays on your machine.
3. Enable the MLX backend: On a recent Ollama version, switch to the MLX backend for the Apple-optimised speed boost. Keep llama.cpp as the fallback for long-context jobs if you hit the prefill caveat above.
4. Add Open WebUI: Run Open WebUI (most easily via Docker) and point it at your local Ollama API. You now have a clean browser chat interface - like a private, offline version of a commercial assistant.
5. Tune the context length: Set the context window to match your RAM headroom. Longer context uses more memory and slows prompt processing, so size it to your real needs rather than maxing it out.
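Once Ollama is running, its local API is what Open WebUI talks to, and you can call it yourself. The sketch below is a minimal example using only the Python standard library. It assumes Ollama is listening on its default address (`http://localhost:11434`) and uses the `num_ctx` option to set the context window per request. The helper names are made up for this example, and the model tag in the usage note is only an example of something you might have pulled.

```python
import json
import urllib.request

# Assumption: Ollama's default local endpoint; change it if you moved the port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str,
                           num_ctx: int = 4096) -> urllib.request.Request:
    """Build a non-streaming generate request with an explicit context size."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                   # one JSON reply instead of a stream
        "options": {"num_ctx": num_ctx},   # context window, in tokens
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, num_ctx: int = 4096) -> str:
    """Send the request to the local Ollama server and return the reply text."""
    request = build_generate_request(model, prompt, num_ctx)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

With a model already pulled, `generate("llama3.1:8b", "Say hello in five words.", num_ctx=2048)` returns the reply as a string. Nothing here leaves your machine, since the request only goes to localhost.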

For a fuller walkthrough of the Ollama side, the Ollama setup guide goes step by step. And if you are weighing whether to keep AI local at all versus using the cloud, cloud LLM vs local LLM lays out the cost and privacy trade-offs.

What speeds are realistic - and what trips people up?

Speeds vary a lot by model size, quantization and context length, so treat any single tokens-per-second figure with caution. The pattern that holds: smaller models on higher-bandwidth chips feel instant, and speed drops as the model grows because you are reading more weights per token. A well-matched setup - say an 8B model on a 16GB-plus Mac with MLX - is comfortably fast for interactive chat; push to a 70B model and you trade speed for capability.
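Rather than trusting a quoted figure, you can measure your own. A non-streaming reply from Ollama's `/api/generate` endpoint includes token counts and durations in nanoseconds, and the sketch below turns those into prompt-processing and generation speeds. The function name is illustrative, and the sample reply is made-up data to show the arithmetic, not a benchmark.

```python
def tokens_per_second(reply: dict) -> dict:
    """Compute prefill and decode speed from an Ollama generate reply.

    Durations in the reply are in nanoseconds, so divide by 1e9 for seconds.
    """
    prefill_seconds = reply["prompt_eval_duration"] / 1e9
    decode_seconds = reply["eval_duration"] / 1e9
    return {
        "prefill": reply["prompt_eval_count"] / prefill_seconds,
        "decode": reply["eval_count"] / decode_seconds,
    }


if __name__ == "__main__":
    # Hypothetical reply fields, for illustration only.
    sample = {
        "prompt_eval_count": 500,
        "prompt_eval_duration": 2_000_000_000,
        "eval_count": 120,
        "eval_duration": 3_000_000_000,
    }
    print(tokens_per_second(sample))
```

Reporting the two numbers separately matters because prompt processing and generation scale differently, which is why a long prompt can feel slow even when generation is quick.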

The common mistakes are predictable. Buying for GPU cores instead of memory. Running the default backend and leaving MLX's free speed on the table. Loading a model too large for your RAM, which forces swapping and tanks performance. And maxing out the context window when you do not need it, which eats memory and slows the first response. Match the model to your RAM, use MLX for everyday work, and a Mac is one of the most pleasant ways to run AI privately at home.
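To see why an oversized context window eats memory, it helps to estimate the key-value cache that grows with every token in context. The sketch below uses the standard formula: two tensors (keys and values) per layer, each holding one vector per key-value head per token. The model shape in the example (32 layers, 8 key-value heads, head dimension 128) is an assumption typical of an 8B model with grouped-query attention, and the cache is assumed to be stored at 16 bits per value. Check your model's own configuration before relying on the numbers.

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """Estimate key-value cache size in GiB for a given context length.

    Factor of 2 covers keys plus values; bytes_per_value=2 assumes 16-bit cache.
    """
    total_bytes = (2 * layers * kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 2**30


if __name__ == "__main__":
    # Assumed 8B-class shape; not read from any specific model file.
    for context in (8_192, 32_768, 131_072):
        size = kv_cache_gib(32, 8, 128, context)
        print(f"{context} tokens of context: about {size:.0f} GiB of cache")
```

Under these assumptions the cache grows linearly: about 1 GiB at 8,192 tokens and about 16 GiB at 131,072 tokens, on top of the weights themselves.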

Frequently asked questions

Q01. How much RAM do I need to run local AI on a Mac?

16GB is the realistic entry point, running 7-8B models comfortably. 32GB handles 14B models well, and 64GB reaches 70B-class models at 4-bit quantization. Because unified memory caps model size, more RAM directly expands what you can run.

Q02. Is MLX really faster than llama.cpp on Apple Silicon?

Yes, on mid-sized models - commonly 20 to 50 percent faster, because MLX is tuned for Apple Silicon. Above roughly 27B the advantage shrinks as memory bandwidth becomes the limit. Recent Ollama versions include an MLX backend so you get the speed without changing tools.

Q03. Which Mac is best for local AI in 2026?

For most people, an M4 or M4 Pro Mac mini with as much unified memory as you can afford is the value pick. The M3 Ultra, with up to 512GB and around 800 GB/s bandwidth, is the top choice for the largest models and fastest generation.

Q04. Does running local AI keep my data private?

Yes. With Ollama and Open WebUI the model runs entirely on your Mac, so prompts and responses never leave your machine. That is the core privacy advantage of self-hosting over a cloud assistant.

Related guides

Ollama UK 2026 Setup Guide

Best Mini PCs for Local LLM UK 2026

Cloud LLM vs Local LLM: Cost + Privacy

RAG vs Fine-Tuning vs Long Context