These models run on normal computers, and they are giving them away.
Does your company not have computers?
Not really. The state-of-the-art models are huge, even the open-weight ones. You really don’t want to quantize below 4-bit, and even that’s a bit of a stretch. Ideally you’d use at least 8-bit to get good results when using these models for coding.
GLM-5.1 needs around 400GB of VRAM at 4-bit quantization. Apple aren’t making the Mac Studio with 512GB of unified RAM any more, so you’d need something like 5 × Nvidia A100 80GB to run a model like this.
Kimi K2.6 is around the same size.
Distillation works better than quantization, to the point that Qwen recently out-benchmarked its own 397B model with a 27B model, the two released just two months apart. Arguably the only reason to train comically large models is that it’s a decent strategy for finding very small ones.
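For a rough sense of the numbers being thrown around in this thread, here is a back-of-the-envelope sketch in Python. It assumes weight memory is simply parameter count times bits per weight, and ignores the KV cache, activations, and quantization metadata, so real requirements will be higher. The helper names `weight_gb` and `gpus_needed` are made up for illustration; the 397B, 27B, 400GB, and 80GB figures are the ones quoted above.

```python
import math


def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at the same bit width and there is
    no overhead for scales, zero points, KV cache, or activations.
    """
    return params_billions * bits_per_weight / 8


def gpus_needed(total_gb: float, gb_per_gpu: float = 80) -> int:
    """Minimum number of GPUs whose combined memory covers total_gb."""
    return math.ceil(total_gb / gb_per_gpu)


if __name__ == "__main__":
    # Model sizes (in billions of parameters) mentioned in the thread.
    for params in (397, 27):
        for bits in (4, 8):
            gb = weight_gb(params, bits)
            print(f"{params}B at {bits}-bit: ~{gb:.1f} GB of weights, "
                  f"{gpus_needed(gb)} x 80GB GPU(s)")

    # The ~400GB figure quoted for a 4-bit model on 80GB cards.
    print(f"400 GB total: {gpus_needed(400)} x 80GB GPU(s)")
```

On these assumptions a 27B model at 8-bit fits on a single 80GB card with room to spare, while a 400GB footprint fills five of them exactly, with nothing left over for context.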