GLM-4.7-Flash: Run a 30B Coding Model Locally

GLM-4.7-Flash is a 30B MoE model with an MIT licence that runs on 24GB of memory. What it is good at, how to run it locally, and who it suits.

A compact home server setup for running local AI

Updated 18 June 2026 How we review →

By Rob18 June 2026 · 6 min read

For a long time, running a genuinely capable AI model at home meant either a compromise or a very expensive graphics card. GLM-4.7-Flash is one of the clearest signs that the gap is closing. It is a 30B-class model you can run on a single 24GB machine, it is free to use commercially, and it is built specifically for the things local AI users actually want: coding, agent workflows, and private offline chat.

Here is what it is, why its design makes home use realistic, and whether it is worth your hard drive space.

What is GLM-4.7-Flash?

GLM-4.7-Flash is an open-weight large language model from Z.AI (formerly Zhipu AI), released on 19 January 2026. The headline numbers are a 30B total parameter count, a context window of up to 200K tokens, and an MIT licence, which means you can use it commercially without restrictive terms. The weights are published openly on Hugging Face, so there is no account or API key required to run it yourself.

The "Flash" in the name signals its intent: it is the fast, efficient member of the GLM-4.7 family, tuned to deliver strong results at a size and speed that ordinary hardware can handle.

Why does the mixture-of-experts design matter for local use?

The key to running GLM-4.7-Flash at home is its architecture. It is a mixture-of-experts (MoE) model, a design that splits the network into many specialist sub-models and only activates the few relevant ones for each token. So while the model totals 30B parameters, only around 3.6B are actually used per token.

That sparse activation is what makes the difference. You get the breadth of knowledge that comes with a large total parameter count, but the memory and speed cost closer to a small model. In practice that is why a 30B-class model can fit on 24GB of memory and still respond at a usable pace, where a dense 30B model would need far more.

What is it actually good at?

Z.AI built GLM-4.7-Flash with coding and agentic workflows front of mind, and that is where it is reported to be strongest, posting leading results among comparable open models on coding and reasoning benchmarks such as SWE-Bench and GPQA. It is also positioned for translation, chat, and creative writing, so it is a general workhorse rather than a one-trick model.

The 200K context window is the other practical win. A large context means you can feed it a whole codebase, a long document, or an extended conversation without it losing the thread, which matters a lot for agent-style tasks that accumulate state as they run.

How do you run it locally?

The realistic floor is 24GB of RAM, VRAM, or unified memory for a quantised version, with around 32GB needed for full precision. That puts it within reach of a higher-end Mac with unified memory, a 24GB graphics card, or a well-specified mini PC, rather than requiring a server.

For serving, GLM-4.7-Flash supports inference frameworks including vLLM and SGLang, with full instructions in the official repository, and the wider local-model ecosystem (Ollama, llama.cpp and friends) is the gentlest on-ramp for most people. If you are new to this, our Ollama setup guide walks through getting a local model running, and our best mini PCs for local LLMs roundup covers hardware that comfortably clears the 24GB bar.

Who should care about GLM-4.7-Flash?

If you value privacy and want a capable assistant that never sends your data to a cloud provider, this is one of the strongest options you can self-host today. Developers who want a local coding model for offline work or to avoid per-token API bills are the obvious audience, and homelab enthusiasts get a genuinely useful model that earns its place on the hardware.

It is not for everyone. If your machine has less than 24GB of memory, you will be happier with a smaller model, and if you only need occasional simple answers, a hosted service is less hassle. But for the growing group of people who want serious AI running on their own hardware, GLM-4.7-Flash is a milestone worth knowing about.

Frequently asked questions

Q01Is GLM-4.7-Flash free?

Yes. The weights are open and released under the MIT licence, which permits commercial use. You can download and run it yourself at no cost; your only outlay is the hardware to run it on, or a hosting provider if you choose not to self-host.

Q02What hardware do you need to run it?

About 24GB of RAM, VRAM, or unified memory for a quantised version, and roughly 32GB for full precision. That means a higher-end Mac with unified memory, a 24GB graphics card, or a well-specified mini PC can run it. Machines below 24GB should choose a smaller model.

Q03How can a 30B model run on just 24GB?

Because it is a mixture-of-experts model. Only about 3.6B of its 30B parameters activate for any given token, so its real-time memory and compute cost is far lower than a dense 30B model while keeping much of the capability.

Q04What is GLM-4.7-Flash best at?

Coding and agentic workflows are its strongest areas, with reported leading results on coding and reasoning benchmarks among comparable open models. It also handles chat, translation, and creative writing, and its 200K context window suits long documents and codebases.

The bottom line

GLM-4.7-Flash is a clear marker of how far local AI has come: a 30B-class model, strong at coding and agents, that runs on hardware many enthusiasts already own, under a licence that puts no strings on it. If you have been waiting for a capable model that fits a real home machine and respects your privacy, this is one of the best reasons yet to set one up.

GLM-4.7-Flash: Run a 30B Coding Model Locally

What is GLM-4.7-Flash?

Why does the mixture-of-experts design matter for local use?

What is it actually good at?

How do you run it locally?

Who should care about GLM-4.7-Flash?

Frequently asked questions

The bottom line

Ollama UK 2026 Setup Guide

Best Mini PCs for Local LLM UK 2026

Qwen 3.6: The Open-Source AI

Local LLM for Smart Home UK 2026

GLM-4.7-Flash: Run a 30B Coding Model Locally

What is GLM-4.7-Flash?

Why does the mixture-of-experts design matter for local use?

What is it actually good at?

How do you run it locally?

Who should care about GLM-4.7-Flash?

Frequently asked questions

The bottom line

Related guides

Ollama UK 2026 Setup Guide

Best Mini PCs for Local LLM UK 2026

Qwen 3.6: The Open-Source AI

Local LLM for Smart Home UK 2026