GLM-4.7-Flash: Run a 30B Coding Model Locally
GLM-4.7-Flash is a 30B MoE model with an MIT licence that runs on 24GB of memory. What it is good at, how to run it locally, and who it suits.

For a long time, running a genuinely capable AI model at home meant either a compromise or a very expensive graphics card. GLM-4.7-Flash is one of the clearest signs that the gap is closing. It is a 30B-class model you can run on a single 24GB machine, it is free to use commercially, and it is built specifically for the things local AI users actually want: coding, agent workflows, and private offline chat.
Here is what it is, why its design makes home use realistic, and whether it is worth your hard drive space.
What is GLM-4.7-Flash?
GLM-4.7-Flash is an open-weight large language model from Z.AI (formerly Zhipu AI), released on 19 January 2026. The headline numbers are a 30B total parameter count, a context window of up to 200K tokens, and an MIT licence, which means you can use it commercially without restrictive terms. The weights are published openly on Hugging Face, so there is no account or API key required to run it yourself.
The "Flash" in the name signals its intent: it is the fast, efficient member of the GLM-4.7 family, tuned to deliver strong results at a size and speed that ordinary hardware can handle.
Why does the mixture-of-experts design matter for local use?
The key to running GLM-4.7-Flash at home is its architecture. It is a mixture-of-experts (MoE) model, a design that splits the network into many specialist sub-models and only activates the few relevant ones for each token. So while the model totals 30B parameters, only around 3.6B are actually used per token.
That sparse activation is what makes the difference. You get the breadth of knowledge that comes with a large total parameter count, but the memory and speed cost closer to a small model. In practice that is why a 30B-class model can fit on 24GB of memory and still respond at a usable pace, where a dense 30B model would need far more.
What is it actually good at?
Z.AI built GLM-4.7-Flash with coding and agentic workflows front of mind, and that is where it is reported to be strongest, posting leading results among comparable open models on coding and reasoning benchmarks such as SWE-Bench and GPQA. It is also positioned for translation, chat, and creative writing, so it is a general workhorse rather than a one-trick model.
The 200K context window is the other practical win. A large context means you can feed it a whole codebase, a long document, or an extended conversation without it losing the thread, which matters a lot for agent-style tasks that accumulate state as they run.
How do you run it locally?
The realistic floor is 24GB of RAM, VRAM, or unified memory for a quantised version, with around 32GB needed for full precision. That puts it within reach of a higher-end Mac with unified memory, a 24GB graphics card, or a well-specified mini PC, rather than requiring a server.
For serving, GLM-4.7-Flash supports inference frameworks including vLLM and SGLang, with full instructions in the official repository, and the wider local-model ecosystem (Ollama, llama.cpp and friends) is the gentlest on-ramp for most people. If you are new to this, our Ollama setup guide walks through getting a local model running, and our best mini PCs for local LLMs roundup covers hardware that comfortably clears the 24GB bar.
Who should care about GLM-4.7-Flash?
If you value privacy and want a capable assistant that never sends your data to a cloud provider, this is one of the strongest options you can self-host today. Developers who want a local coding model for offline work or to avoid per-token API bills are the obvious audience, and homelab enthusiasts get a genuinely useful model that earns its place on the hardware.
It is not for everyone. If your machine has less than 24GB of memory, you will be happier with a smaller model, and if you only need occasional simple answers, a hosted service is less hassle. But for the growing group of people who want serious AI running on their own hardware, GLM-4.7-Flash is a milestone worth knowing about.
Frequently asked questions
Q01Is GLM-4.7-Flash free?
Q02What hardware do you need to run it?
Q03How can a 30B model run on just 24GB?
Q04What is GLM-4.7-Flash best at?
The bottom line
GLM-4.7-Flash is a clear marker of how far local AI has come: a 30B-class model, strong at coding and agents, that runs on hardware many enthusiasts already own, under a licence that puts no strings on it. If you have been waiting for a capable model that fits a real home machine and respects your privacy, this is one of the best reasons yet to set one up.
Ollama UK 2026 Setup Guide
Best Mini PCs for Local LLM UK 2026
Qwen 3.6: The Open-Source AI