AMD has announced a free software update enabling large language models (LLMs) of up to 128 billion parameters to run locally on Windows PCs powered by the AMD Ryzen AI Max+ 395 (128GB) processor, a capability previously accessible only through cloud infrastructure.
With this update, AMD is letting users deploy advanced AI models locally, bypassing the need for third-party infrastructure and offering greater control, lower ongoing costs, and improved privacy.
The company says this shift addresses growing demand for scalable and private AI processing at the client device level.
Previously, models of this scale, approaching the size of GPT-3, could be operated only within large-scale data centres. The new functionality comes through an upgrade to AMD Variable Graphics Memory, included with the upcoming Adrenalin Edition 25.8.1 WHQL drivers.
This upgrade leverages the 96GB Variable Graphics Memory available on the Ryzen AI Max+ 395 128GB machine, supporting the execution of memory-intensive LLM workloads directly on Windows PCs.
A broader deployment
This update also marks the AMD Ryzen AI Max+ 395 (128GB) as the first Windows AI PC processor to run Meta’s Llama 4 Scout 109B model, complete with full vision and Model Context Protocol (MCP) support.
The processor can hold all 109 billion parameters in memory, although the mixture-of-experts (MoE) architecture means only 17 billion parameters are active at any given time. AMD reports output rates of up to 15 tokens per second for this model.
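To put those numbers in perspective, a back-of-envelope weight-size calculation (an illustrative sketch, not AMD's methodology, and one that ignores the KV cache and runtime overhead) shows why quantised weights and a large unified memory pool are needed to keep the full model resident:

```python
# Back-of-envelope estimate of raw weight size at different precisions.
# Real GGUF files add metadata, activation buffers and the KV cache on top,
# so these figures are illustrative only.

def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB for a model with the given parameter count."""
    return params_billions * 1e9 * (bits_per_weight / 8) / 1e9

for bits in (16, 8, 4):
    size = weight_footprint_gb(109, bits)
    print(f"Llama 4 Scout (109B total) at {bits}-bit: ~{size:.0f} GB")

# ~218 GB at 16-bit, ~109 GB at 8-bit, ~55 GB at 4-bit: only the quantised
# variants fit inside the 96GB Variable Graphics Memory pool, even though
# just ~17B parameters are active for any single token.
```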
According to AMD, the ability to handle such large models locally is important for users who require high-capacity AI assistants on the go. The system also supports flexible quantisation and can run a range of LLMs in the GGUF format, from compact 1B-parameter models up to Mistral Large.
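As an illustration of what local GGUF deployment can look like in practice, the sketch below uses the community llama-cpp-python bindings to load a quantised model; the file path and parameter values are placeholders rather than AMD-provided settings.

```python
# Minimal sketch of loading and querying a quantised GGUF model locally with
# the llama-cpp-python bindings (pip install llama-cpp-python).
# The model path and parameter values below are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="models/example-model-Q4_K_M.gguf",  # any GGUF file, from ~1B up to Mistral Large
    n_ctx=32_768,        # context window; can be raised on high-memory systems
    n_gpu_layers=-1,     # offload all layers to the GPU if memory allows
)

result = llm("Summarise the benefits of running LLMs locally in two sentences.",
             max_tokens=128)
print(result["choices"][0]["text"])
```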
This isn’t just about bringing cloud-scale compute to the desktop; it’s about expanding the range of options for how AI can be used, built, and deployed locally.
The company further states that performance in MoE models like Llama 4 Scout correlates with the number of active parameters, while performance in dense models depends on the total parameter count.
The memory capacity of the AMD Ryzen AI Max+ platform also allows users to opt for higher-precision models, supporting up to 16-bit precision through llama.cpp when the quality gains justify the performance trade-off.
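A simple way to picture both points: the memory budget determines which precision of a given model fits, while the active-parameter count determines how much work each generated token requires. The helper below is an illustrative sketch; the bits-per-weight figures are rough community estimates for common llama.cpp quantisation levels, not AMD-published numbers.

```python
# Illustrative helper for choosing the highest-precision GGUF variant whose
# weights fit a given memory budget. Bits-per-weight values are approximate.
VARIANTS = {"F16": 16.0, "Q8_0": 8.5, "Q6_K": 6.6, "Q4_K_M": 4.8}

def pick_variant(params_billions: float, budget_gb: float) -> str:
    """Return the highest-precision variant whose weights fit within budget_gb."""
    for name, bits in sorted(VARIANTS.items(), key=lambda kv: -kv[1]):
        if params_billions * bits / 8 <= budget_gb:
            return name
    return "no variant fits"

print(pick_variant(24, 96))    # a 24B dense model fits at full 16-bit precision
print(pick_variant(109, 96))   # Llama 4 Scout (109B) needs a quantised variant,
                               # yet each token only exercises ~17B active parameters
```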
Context and workflow
AMD also highlights the importance of context size when working with LLMs. The AMD Ryzen AI Max+ 395 (128GB), equipped with the new driver, can run Meta’s Llama 4 Scout at a context length of 256,000 tokens (with Flash Attention on and a Q8 KV cache), far exceeding the 4,096-token window that remains the default in many applications.
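For readers curious how such a configuration is expressed in code, the sketch below shows one way to request a long context with Flash Attention and an 8-bit KV cache through the llama-cpp-python bindings; the flash_attn, type_k and type_v parameters and the GGML_TYPE_Q8_0 constant are assumptions that vary between library releases, so the local documentation should be checked.

```python
# Sketch of a long-context configuration using llama-cpp-python.
# The Flash Attention and KV-cache quantisation options are assumptions to
# verify against the installed version of the bindings.
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="models/Llama-4-Scout-example.gguf",  # placeholder path
    n_ctx=256_000,                     # very large context window
    n_gpu_layers=-1,                   # offload everything if memory allows
    flash_attn=True,                   # Flash Attention ON
    type_k=llama_cpp.GGML_TYPE_Q8_0,   # 8-bit quantised KV cache (keys)
    type_v=llama_cpp.GGML_TYPE_Q8_0,   # 8-bit quantised KV cache (values)
)
```

Quantising the KV cache to 8 bits roughly halves the memory the cache consumes at a given context length compared with a 16-bit cache, which is part of what makes a 256,000-token window practical on this hardware.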
Examples provided include demonstrations where an LLM summarises extensive documents, such as an SEC EDGAR filing requiring over 19,000 tokens to be held in context. Another example cited the summarisation of a research paper from the arXiv repository, needing more than 21,000 tokens from query to final output. AMD notes that more complex workflows might require even greater context capacity, particularly for multi-tool and agentic scenarios.
AMD states that while occasional users may manage with a context length of 32,000 tokens and a lightweight model, more demanding use cases will benefit from hardware and software that support expansive contexts, as offered by the AMD Ryzen AI Max+ 395 128GB.
Looking ahead, AMD points to an expanding set of agentic workflows as LLMs and AI agents become more widely adopted for local inferencing. Industry trends indicate that model developers, including Meta, Google, and Mistral, are increasingly integrating tool-calling capabilities into their training runs to facilitate local personal assistant use cases.
AMD also advises caution when enabling tool access for large language models, noting the potential for unpredictable system behaviour and outcomes, and recommends installing LLM implementations only from trusted sources.
The AMD Ryzen AI Max+ 395 (128GB) is now positioned to support most models available through llama.cpp and other tools, offering flexible deployment and model selection options for users with high-performance local AI requirements.