Intel's latest driver release for Arc Pro GPUs, 32.0.101.8517, raises the integrated GPU's memory allocation limit to enable broader LLM inference support. The new driver allows users to allocate up to 93% of their system RAM to the integrated GPU. While the driver currently supports only a select number of SKUs, Intel is paving the way for larger LLM inference workloads without hitting memory capacity bottlenecks.
Traditional memory partitioning usually limits a GPU to 50% of system RAM. AMD's Variable Graphics Memory (VGM) allows high-end configurations, such as the Strix Halo, to allocate 96GB from a 128GB pool to the iGPU. Intel has been more aggressive in this regard. Last year, Intel raised the limit to 87% with its new "Shared GPU Memory Override" for Core Ultra Series 2 processors.
The latest driver release pushes that boundary further, to 93%, for local AI inference. The higher cap applies only to integrated Arc Pro GPUs, such as the Arc Pro B390 and Arc Pro B370. While the allocation update is the headline feature and is limited to integrated GPUs, the driver itself also supports discrete Arc Pro A-series and B-series cards.
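To put those caps in perspective, here is a minimal sketch (not Intel's tooling, just arithmetic) of how much system RAM becomes addressable by the iGPU under the default 50% split, the earlier 87% override, and the new 93% limit:

```python
# Hypothetical illustration (not Intel's tooling): usable iGPU memory under the
# default 50/50 split, Intel's earlier 87% override, and the new 93% cap.
CAPS = {"default 50%": 0.50, "override 87%": 0.87, "new cap 93%": 0.93}

def usable_gpu_memory_gb(system_ram_gb: float, cap: float) -> float:
    """Share of system RAM the integrated GPU may claim under a given cap."""
    return system_ram_gb * cap

for ram in (32, 64, 128):
    cells = ", ".join(f"{name}: {usable_gpu_memory_gb(ram, cap):.1f} GB"
                      for name, cap in CAPS.items())
    print(f"{ram} GB system -> {cells}")
```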

This allows users to run much larger LLMs without expensive hardware. On a 32GB system, this allocation provides enough memory to run a Qwen 2.5 32B model at 4-bit quantization with a comfortable context window. Meanwhile, workstations equipped with 64GB of RAM can run heavyweight models like Llama 3 70B, again at 4-bit quantization, with enough headroom for the KV cache and system stability.
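As a rough, back-of-the-envelope check of those claims, the sketch below estimates weight and KV-cache footprints against a 93% memory budget; the layer counts, KV-head counts, and head dimensions are assumed illustrative values, not official model specs:

```python
# Back-of-the-envelope memory sizing (a rough sketch, not a capacity planner).
# Layer counts, KV-head counts, and head dimensions are assumed illustrative
# values; check the model cards for exact figures.

def weights_gb(params_b: float, bits: int) -> float:
    """Approximate weight footprint in GB for a given parameter count and bit width."""
    return params_b * 1e9 * bits / 8 / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache footprint in GB (keys + values) at a given context length."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e9

models = {
    "Qwen 2.5 32B": dict(params_b=32, layers=64, kv_heads=8, head_dim=128, ram=32),
    "Llama 3 70B":  dict(params_b=70, layers=80, kv_heads=8, head_dim=128, ram=64),
}

for name, cfg in models.items():
    w = weights_gb(cfg["params_b"], bits=4)        # 4-bit quantization
    kv = kv_cache_gb(cfg["layers"], cfg["kv_heads"], cfg["head_dim"], context=8192)
    budget = cfg["ram"] * 0.93                     # the new iGPU allocation cap
    print(f"{name}: ~{w:.0f} GB weights + ~{kv:.1f} GB KV cache "
          f"vs ~{budget:.0f} GB budget on a {cfg['ram']} GB system")
```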
While this is impressive, compute throughput and memory bandwidth still determine how quickly a model actually runs. Intel's Core Ultra Series 3 (Panther Lake) chips feature fast LPDDR5X-9600 memory, delivering bandwidth in the 150 GB/s range. AMD's Strix Halo, on the other hand, has a 256-bit memory bus that delivers 256 GB/s of bandwidth. Together with the larger memory allocation, that bandwidth helps ensure large models not only fit in memory but also run at respectable speeds.
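For a sense of why bandwidth matters, here is a simple memory-bound estimate under a stated assumption: if each generated token requires streaming the full set of quantized weights once, the ceiling on decode speed is roughly bandwidth divided by weight size (ignoring compute, caches, and KV-cache traffic):

```python
# Memory-bound decode ceiling (an assumption-heavy sketch): if each generated
# token streams every quantized weight once, tokens/s is bounded by
# bandwidth / weight size. Ignores compute, caches, and KV-cache traffic.

def decode_tokens_per_sec(bandwidth_gb_s: float, weights_gb: float,
                          efficiency: float = 0.7) -> float:
    """Rough upper bound on single-stream decode speed."""
    return bandwidth_gb_s * efficiency / weights_gb

weights = 35.0  # ~70B model at 4-bit quantization, in GB
for platform, bw in (("Panther Lake (LPDDR5X-9600)", 150),
                     ("Strix Halo (256-bit bus)", 256),
                     ("M5 Max", 614)):
    print(f"{platform}: ~{decode_tokens_per_sec(bw, weights):.1f} tok/s ceiling")
```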

Apple Silicon, however, remains the gold standard. The M5 Max offers 614 GB/s of bandwidth, but its real advantage is the Unified Memory Architecture (UMA). Apple's UMA sidesteps the traditional partitioning found in the x86 world: instead of setting a hard limit or fence, it makes the entire memory pool natively accessible to both the CPU and the GPU.
We've seen UMA's quirks in action, with a user running a 400B LLM on an iPhone 17 Pro. Apple offers efficiency and speed, while Intel and AMD are competing on flexibility and affordability for AI workloads, especially with the advent of LPCAMM2.
