The iPhone 17 Pro can run a 400B parameter Large Language Model on-device by streaming weights from the SSD

While the speed remains impractical for daily use, this proof of concept demonstrates how new inference engines are successfully bypassing RAM limits.

Tech Reporter
TL;DR: The open-source flash-moe engine runs a 400B-parameter MoE model on an iPhone 17 Pro by streaming weights from NVMe storage, using only 5.5GB RAM. Though slow at 0.6 tokens/sec, it proves large models can operate on consumer devices without full memory loading, highlighting SSD speed as the main bottleneck.

A new open-source inference engine, flash-moe, by Daniel Woods, has successfully run a 400B-parameter Large Language Model on an iPhone 17 Pro, a device with just 12GB of RAM. The project leverages Apple's "LLM in a Flash" research, in which model weights are streamed on demand directly from the device's NVMe storage rather than preloading the entire 400B parameter set into system RAM. That said, at 0.6 tokens per second and a TTFT (Time To First Token) of almost 50 seconds, the demonstration serves only as a proof of concept for the time being.
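The core trick — touching weights on storage only when they are needed — can be sketched with a memory-mapped file: the OS faults pages in from disk as they are read, so resident memory tracks the slice actually accessed rather than the file size. A minimal, hypothetical Python sketch (the real engine is written in Objective-C with Metal; the file name and sizes here are invented for illustration):

```python
import mmap
import os
import struct
import tempfile

# Write a small dummy "weights" file standing in for a model shard on disk.
path = os.path.join(tempfile.mkdtemp(), "expert_weights.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<1024f", *range(1024)))  # 1024 float32 "weights"

# Memory-map the file instead of reading it all into RAM: only the pages
# backing the slice we touch are actually loaded from storage.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Pull one "expert" slice (floats 256..511) on demand.
    raw = mm[256 * 4 : 512 * 4]
    expert = struct.unpack("<256f", raw)
    mm.close()

print(expert[0], expert[255])  # 256.0 511.0
```

The same page-fault mechanism is what lets a ~100GB quantized model sit on NVMe while only the currently needed experts occupy RAM.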

Typically, dense models require all of their weights to be preloaded into memory to ensure low-latency access. Mixture of Experts (MoE) models, such as Qwen's 3.5 series, instead activate only a small subset of 'experts' per token rather than the full parameter set. The specific model in question is Qwen3.5-397B-A17B (2-bit quantized): 397B total parameters with 17B active per token. Per Woods' published paper, only 5.5GB of weights are resident in memory at any time, even for a model this large.
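Back-of-envelope arithmetic (decimal GB, ignoring activations and runtime overhead) shows why the resident set is so small relative to the model:

```python
total_params = 397e9     # Qwen3.5-397B-A17B: total parameter count
active_params = 17e9     # parameters activated per token
bits_per_weight = 2      # 2-bit quantization

full_gb = total_params * bits_per_weight / 8 / 1e9     # whole model on disk
active_gb = active_params * bits_per_weight / 8 / 1e9  # weights needed per token

print(f"on disk: ~{full_gb:.0f} GB, active per token: ~{active_gb:.2f} GB")
# on disk: ~99 GB, active per token: ~4.25 GB
```

The ~4.25GB of active weights, plus caches and bookkeeping, lines up with the 5.5GB resident figure reported in the paper.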

After a series of optimizations and a custom Metal GPU pipeline written in Objective-C, the project demonstrates that streaming MoE models from consumer-grade SSDs is viable. The paper's benchmarks used an M3 Max MacBook Pro with 48GB of system RAM, reaching 5.74 tokens per second on Qwen3.5-397B-A17B. That figure came after 90+ experiments, up from a baseline of 0.28 tokens per second, a 20.5x improvement.


Another developer forked the project and created an iOS port, where the A19 Pro, though significantly weaker than the M3 Max, delivered 0.6 tokens per second with the same engine. Despite executing a massive 400B-parameter model, the iPhone kept only 5.5GB of weights resident in memory. In its current state, this is no alternative to cloud-based options like ChatGPT; rather, it demonstrates that frontier-scale models can, in principle, fit in your pocket.

As it stands, the obvious bottleneck is SSD speed. The SSD in Apple's M3 Max MacBook Pro has 17.5 GB/s of available bandwidth via the Apple Fabric, compared to 400 GB/s of system bandwidth between the CPU, GPU, and RAM. Even so, while the 400B model is heavily quantized and streamed, it is impressive that these developers managed this on an iPhone while keeping the resident weight footprint to just 5.5GB of the device's 12GB. At this rate, I wouldn't be surprised if the next candidate device is an Apple Watch.
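A rough throughput ceiling follows from dividing SSD bandwidth by the active weight set per token, assuming (pessimistically) that every token streams all 17B active parameters from storage:

```python
ssd_bandwidth_gbs = 17.5        # M3 Max SSD bandwidth via Apple Fabric
active_gb = 17e9 * 2 / 8 / 1e9  # ~4.25 GB of 2-bit active weights per token

ceiling = ssd_bandwidth_gbs / active_gb
print(f"SSD-bound ceiling: ~{ceiling:.1f} tokens/sec")
# SSD-bound ceiling: ~4.1 tokens/sec
```

That the paper's measured 5.74 tokens per second exceeds this naive bound suggests many experts stay cached in RAM across tokens, so not every token pays the full streaming cost.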

News Source: github.com

Hassam is a veteran tech journalist and editor with over eight years of experience embedded in the consumer electronics industry. His obsession with hardware began with childhood experiments involving semiconductors, a curiosity that evolved into a career dedicated to deconstructing the complex silicon that powers our world. From benchmarking PC internals to stress-testing flagship CPUs and GPUs, Hassam specializes in translating high-level engineering into deep, unbiased insights for the enthusiast community.
