Distributed AI Inference on Consumer Devices — Technical Architecture Report
1. How Model Inference Can Be Distributed
Pipeline Parallelism (Layer Splitting)
The most common approach for distributed inference over networks. The model's transformer layers are split into contiguous blocks, each assigned to a different device. Data flows sequentially: device 1 processes layers 0–19, sends hidden states to device 2 (layers 20–39), etc.
- Pros: Simple to implement, minimal communication (only hidden states between stages), works over regular networks
- Cons: Sequential — latency = sum of all stages + network hops. Devices idle while waiting. Throughput limited by slowest device.
- Used by: Petals, llama.cpp RPC, early Exo versions
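That cost model can be made concrete with a toy sketch (all numbers below are illustrative assumptions, not measurements from any of the projects named here): single-stream latency is the sum of every stage plus every hop, while batched throughput is gated by the slowest stage.

```python
# Toy cost model for pipeline parallelism. All numbers are
# illustrative assumptions, not measured benchmarks.

def pipeline_token_latency(stage_ms, hop_ms):
    """Single-stream decode: each token traverses every stage plus
    every network hop before the next token can start."""
    return sum(stage_ms) + sum(hop_ms)

def pipeline_throughput(stage_ms):
    """With many requests in flight, steady-state throughput is gated
    by the slowest stage (hops overlap with compute)."""
    return 1000.0 / max(stage_ms)

stages = [30.0, 45.0, 30.0]   # per-stage compute, ms per token (assumed)
hops = [25.0, 25.0]           # inter-stage network latency, ms (assumed)
print(pipeline_token_latency(stages, hops))   # 155.0 ms per token
print(round(pipeline_throughput(stages), 1))  # 22.2 tok/s, set by the 45 ms stage
```

Note how the 45 ms stage caps batched throughput even though the other two devices are faster, which is the "throughput limited by slowest device" point above.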
Tensor Parallelism (Intra-Layer Splitting)
Each layer's weight matrices are sharded across devices. For a given layer, devices compute partial results in parallel and then reduce (all-reduce or all-gather) before proceeding to the next layer.
- Pros: True parallel speedup — 1.8x on 2 devices, 3.2x on 4 devices (Exo benchmarks). All devices active simultaneously.
- Cons: Requires high-bandwidth, low-latency interconnect. Every layer requires a communication round. Impractical over regular Ethernet.
- Used by: Exo 1.0 (with RDMA over Thunderbolt 5), MLX distributed
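The sharding idea itself is simple linear algebra; here is a minimal pure-Python sketch (real systems shard fused GPU kernels and use collective ops like all-gather over NCCL or RDMA, not Python lists):

```python
# Toy tensor parallelism: shard a weight matrix column-wise across
# "devices", multiply the partials in parallel, then "all-gather"
# (here: concatenate) before the next layer.

def matvec(W_cols, x):
    """y[j] = sum_i x[i] * W[i][j], with W stored column-major."""
    return [sum(x[i] * col[i] for i in range(len(x))) for col in W_cols]

x = [1.0, 2.0, 3.0, 4.0]                    # one token's hidden state
# Full weight, column-major: 8 output columns of 4 weights each.
W_cols = [[float(i + j) for i in range(4)] for j in range(8)]

dev0, dev1 = W_cols[:4], W_cols[4:]         # split columns across 2 "devices"
partial0 = matvec(dev0, x)                  # computed on device 0
partial1 = matvec(dev1, x)                  # computed on device 1
y = partial0 + partial1                     # the "all-gather" step

assert y == matvec(W_cols, x)               # matches the single-device result
```

The concatenation at the end is the communication round the cons bullet refers to: it happens once per layer, which is why the interconnect dominates.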
Split Inference (Hybrid)
Client runs embedding + first N layers locally, offloads middle layers to remote servers, runs final layers locally. Useful for privacy (input/output embeddings never leave the device).
- Used by: Some Petals configurations, academic research on "split learning"
Key Insight
Pipeline parallelism works over the internet; tensor parallelism requires LAN/direct-connect. The choice is dictated by interconnect bandwidth and latency, not model architecture.
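That rule of thumb can be written as a simple selection function; the thresholds below are assumptions loosely derived from the interconnect table in section 3, not defaults from any existing project.

```python
def choose_parallelism(bandwidth_gbps, latency_us):
    """Pick a parallelism mode from measured link characteristics.
    Thresholds are illustrative assumptions, not project defaults."""
    if bandwidth_gbps >= 40 and latency_us <= 10:
        return "tensor"      # RDMA-class link: per-layer all-reduce is cheap
    if latency_us <= 1000:
        return "pipeline"    # LAN: one hidden-state hop per stage is fine
    return "pipeline-batch"  # internet: pipeline only, batch to hide latency

assert choose_parallelism(50, 5) == "tensor"               # Thunderbolt 5 + RDMA
assert choose_parallelism(1, 800) == "pipeline"            # Gigabit Ethernet
assert choose_parallelism(0.1, 50000) == "pipeline-batch"  # typical internet
```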
2. Existing Projects
Petals (BigScience / Hivemind team)
- What: BitTorrent-style collaborative inference. Community members donate GPU capacity to serve model layers. Clients connect and route inference through available servers.
- Architecture: Built on Hivemind's DHT for peer discovery. Each server hosts a contiguous block of transformer layers. Pipeline parallelism over the internet.
- Models: Llama 3.1 405B, Mixtral 8x22B, Falcon 180B, BLOOM 176B
- Performance: ~6 tokens/sec for Llama 2 70B, ~4 tokens/sec for Falcon 180B, ~1 step/sec for BLOOM 176B
- Platforms: Linux (NVIDIA/AMD GPU), macOS (Apple Silicon), Windows (WSL)
- Limitations: Relies on community swarm — performance varies with available nodes. Privacy concerns (data passes through untrusted peers). Sequential pipeline means latency scales with hop count.
Hivemind (learning-at-home)
- What: PyTorch library for decentralized deep learning. Foundation that Petals is built on.
- Focus: Primarily distributed training (not inference), using decentralized parameter averaging and DHT-based coordination.
- Key features: No master node, fault-tolerant backprop, Decentralized Mixture-of-Experts
- Relevance: Provides the networking/coordination layer that any SETI@home-style system needs
Exo (exo labs)
- What: Private AI clustering tool. Connect your own devices into a unified inference cluster.
- Architecture: Automatic device discovery (mDNS/libp2p), topology-aware auto-parallelism, supports both pipeline and tensor parallelism
- Headline feature (v1.0, 2025): RDMA over Thunderbolt 5 — reduces inter-device memory access latency from ~300μs to near-zero. Devices act as one unified memory pool.
- Performance (4× M3 Ultra Mac Studio, 1.5 TB unified memory):
  - Qwen3 235B (8-bit): 32 tok/s on 4 nodes (scales up with more nodes)
  - DeepSeek V3.1 671B (8-bit): runs across the cluster at usable speeds
  - Kimi K2 Thinking (1T params): ~30 tok/s on 4 nodes
- Platforms: macOS (primary, with MLX backend), Linux (CPU only currently, GPU WIP)
- Key differentiator: Only project currently supporting RDMA over Thunderbolt. Tensor parallelism gives actual speedup as you add devices (unlike pipeline parallelism which can slow down).
GPUStack
- What: Open-source GPU cluster manager for AI model deployment. Enterprise-focused.
- Architecture: Server/worker model. Server schedules and orchestrates; workers run inference engines (vLLM, SGLang, TensorRT-LLM, MindIE).
- Focus: Datacenter GPUs (A100, H100/H200, Ascend 910B). Not consumer-oriented.
- Features: Performance auto-tuning, KV cache optimization (LMCache, HiCache), speculative decoding (EAGLE3, MTP), load balancing
- Platforms: Linux workers only. Server can run on CPU-only machine. No macOS worker support.
llama.cpp (ggml-org)
- What: C/C++ LLM inference engine. Not a distributed system per se, but has RPC-based layer offloading.
- RPC mode: Spread layers across nodes via RPC. Pipeline parallelism.
- Performance limitation: throughput often drops as you add nodes, because network overhead exceeds the compute gained in many configurations. Jeff Geerling's benchmarks show llama.cpp cluster performance staying flat or declining as nodes are added, while Exo scales.
- Strengths: Massive model/quantization format support (GGUF), runs everywhere, extremely optimized single-node performance.
Ollama
- What: User-friendly LLM runner. No native clustering.
- Community efforts: Some projects proxy multiple Ollama instances behind a load balancer, but no true distributed inference (each instance runs complete models independently).
Together.ai (early work)
- What: Started as a distributed/decentralized inference research project. Pivoted to centralized cloud inference platform.
- Historical relevance: Demonstrated feasibility of collaborative inference, but found centralized approach more commercially viable.
Other Notable Projects
- Distributed-llama: C++ pipeline parallelism for Llama models across networked devices
- TensorRT-LLM: NVIDIA's inference library with multi-GPU tensor parallelism (datacenter-focused)
- DeepSpeed-Inference: Microsoft's distributed inference with tensor/pipeline parallelism (datacenter)
3. Key Technical Challenges
Network Latency
The fundamental bottleneck for distributed consumer inference.
| Interconnect | Bandwidth | Latency | Suitability |
|---|---|---|---|
| Internet (typical) | 50-500 Mbps | 20-100ms | Pipeline only, slow |
| Gigabit Ethernet | 1 Gbps | <1ms | Pipeline OK, tensor unusable |
| 2.5 GbE | 2.5 Gbps | <1ms | Pipeline decent |
| 10 GbE | 10 Gbps | <0.5ms | Pipeline good |
| Thunderbolt 5 (TCP) | ~50-60 Gbps | ~300μs | Pipeline great, tensor marginal |
| Thunderbolt 5 (RDMA) | ~50-60 Gbps | a few μs | Tensor parallelism viable |
| NVLink (datacenter) | 900 Gbps | ~ns | Gold standard |
Hidden state sizes for common models:
- 70B model: ~16 KB per token per layer boundary
- 405B model: ~26 KB per token per layer boundary
- For pipeline parallelism, you transfer hidden states once per stage boundary — manageable even on slow links
- For tensor parallelism, you transfer per layer — requires very fast interconnect
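The arithmetic behind those bullets can be sketched directly. The hidden dimension (8192) and fp16 width are assumptions for a Llama-70B-class model; the link figures are taken from the interconnect table above.

```python
# Back-of-envelope: per-token hidden-state size and transfer time.
# Assumes a Llama-70B-like hidden dimension of 8192 stored in fp16.

HIDDEN_DIM = 8192
BYTES_PER_VALUE = 2  # fp16

state_bytes = HIDDEN_DIM * BYTES_PER_VALUE   # 16384 B = 16 KB per token

def transfer_ms(num_bytes, bandwidth_gbps, latency_ms):
    """Time to ship one activation across a link: serialization + latency."""
    return num_bytes * 8 / (bandwidth_gbps * 1e9) * 1000 + latency_ms

# Pipeline: one hop per stage boundary, cheap even over the internet.
print(transfer_ms(state_bytes, 0.1, 50))    # ~51.3 ms, dominated by latency
# Tensor parallelism: a communication round per layer. 80 layers over
# Gigabit Ethernet already costs ~50 ms per token in communication alone.
print(80 * transfer_ms(state_bytes, 1.0, 0.5))
```

At 16 KB per hop, bandwidth barely matters; it is the per-round latency, multiplied by layer count, that kills tensor parallelism on slow links.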
Device Heterogeneity
Consumer devices vary wildly:
| Device | Memory | Memory BW | Compute (FP16) |
|---|---|---|---|
| M1 MacBook Air (8GB) | 8 GB | 68 GB/s | ~2.6 TFLOPS |
| M2 Pro Mac Mini (32GB) | 32 GB | 200 GB/s | ~6.8 TFLOPS |
| M3 Ultra Mac Studio (512GB) | 512 GB | 800 GB/s | ~27 TFLOPS |
| RTX 3060 (12GB) | 12 GB VRAM | 360 GB/s | 12.7 TFLOPS |
| RTX 3080 (10GB) | 10 GB VRAM | 760 GB/s | 29.8 TFLOPS |
| RTX 4090 (24GB) | 24 GB VRAM | 1008 GB/s | 82.6 TFLOPS |
Problem: Pipeline parallelism throughput limited by slowest device. Tensor parallelism requires roughly equal devices. Scheduling must account for memory capacity, compute speed, and interconnect quality.
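A scheduler for this could start from a simple proportional split; the sketch below assigns pipeline stages in proportion to compute and caps them by memory. The device figures are drawn from the table above, and the greedy spill step is a hypothetical simplification, not any project's actual algorithm.

```python
# Sketch of heterogeneity-aware layer assignment for pipeline
# parallelism: layers proportional to compute, capped by memory.

def assign_layers(devices, total_layers, gb_per_layer):
    """devices: list of (name, tflops, memory_gb). Returns {name: n_layers}."""
    total_tflops = sum(t for _, t, _ in devices)
    plan, remaining = {}, total_layers
    for name, tflops, mem_gb in devices:
        want = round(total_layers * tflops / total_tflops)  # compute share
        fit = int(mem_gb / gb_per_layer)                    # memory cap
        n = min(want, fit, remaining)
        plan[name] = n
        remaining -= n
    for name, _, mem_gb in devices:        # spill leftovers anywhere with room
        room = int(mem_gb / gb_per_layer) - plan[name]
        take = min(room, remaining)
        plan[name] += take
        remaining -= take
    return plan

# 70B model at Q4: ~40 GB over 80 layers -> ~0.5 GB per layer.
cluster = [("mac-studio", 27.0, 512), ("mac-mini", 6.8, 32), ("rtx3060", 12.7, 12)]
print(assign_layers(cluster, 80, 0.5))
```

Even this toy version shows the tension: the RTX 3060 has plenty of compute for 22 layers but only memory for 24, so a slightly larger model would force layers onto slower devices.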
Node Reliability
Consumer nodes are unreliable — they go offline, sleep, throttle, get busy with other tasks.
- Petals' approach: Redundant serving (multiple nodes host same layers), automatic rerouting on failure, DHT-based health monitoring
- Exo's approach: Designed for private trusted clusters — less fault tolerance needed
- Challenge: For a public SETI@home-style network, you need 2-3x redundancy per layer block, plus fast failover (Petals achieves this, but at the cost of throughput)
Quantization Requirements
Quantization is essential to fit large models on consumer hardware:
| Model | FP16 Size | Q8 (8-bit) | Q4 (4-bit) | Q2 (2-bit) |
|---|---|---|---|---|
| Llama 3 8B | 16 GB | 8 GB | 4.5 GB | ~3 GB |
| Llama 3 70B | 140 GB | 70 GB | 40 GB | ~23 GB |
| Llama 3.1 405B | 810 GB | 405 GB | ~230 GB | ~130 GB |
| DeepSeek V3 671B | 1.3 TB | 671 GB | ~380 GB | ~210 GB |
Quality impact: Q8 is nearly lossless. Q4 (GGUF Q4_K_M) is good for most use cases. Below Q3, quality degrades noticeably for complex reasoning.
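The table values follow from simple arithmetic: size is roughly parameters times bits per weight, divided by 8, plus a small overhead for quantization scales and zero-points (the ~10% overhead below is an assumption, and varies by format).

```python
# Quantized model size: params * bits / 8, plus format overhead
# (scales/zero-points). The 10% overhead figure is an assumption.

def quantized_gb(params_billions, bits, overhead=0.0):
    return params_billions * bits / 8 * (1 + overhead)

assert quantized_gb(70, 16) == 140.0   # Llama 3 70B in fp16
assert quantized_gb(70, 8) == 70.0     # Q8
assert quantized_gb(405, 8) == 405.0   # Llama 3.1 405B in Q8
print(round(quantized_gb(70, 4, overhead=0.10), 1))  # 38.5, near the table's 40 GB
```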
Minimum Device Specs
To meaningfully participate in distributed inference:
- Minimum useful: 8 GB RAM/VRAM — can host ~2-4 layers of a 70B model (Q4)
- Comfortable: 16+ GB — host significant portion of 70B or small portion of 405B
- Power user: 64+ GB unified memory (Apple Silicon) or 24 GB VRAM (RTX 4090)
4. What's Actually Feasible Today
Feasibility Matrix
| Scenario | Model Size | Hardware Needed | Throughput | Status |
|---|---|---|---|---|
| Single consumer device | ≤13B (Q4) | 8 GB RAM | 20-50 tok/s | ✅ Trivial |
| Single high-end device | ≤70B (Q4) | 48-64 GB | 10-30 tok/s | ✅ Works great |
| Private LAN cluster (2-4 nodes) | ≤405B (Q8) | 4× M3 Ultra (512GB) | 15-32 tok/s | ✅ Proven (Exo+RDMA) |
| Private LAN cluster (budget) | ≤70B (Q4) | 2× 32GB Mac Mini | 5-15 tok/s | ✅ Works |
| Private LAN cluster | ≤1T params (MoE) | 4× 512GB Mac Studio | ~30 tok/s | ✅ Proven (Kimi K2) |
| Internet swarm (Petals-style) | ≤405B | Community GPUs | 1-6 tok/s | ⚠️ Works but slow/unreliable |
| Internet swarm (consumer) | ≤70B | Heterogeneous consumer HW | 0.5-3 tok/s | ⚠️ Marginal |
Latency Penalty vs Centralized
| Setup | Llama 70B tok/s | vs A100 baseline |
|---|---|---|
| Single A100 (80GB) | 40-60 tok/s | baseline |
| Single M3 Ultra (192GB) | 25-35 tok/s | 0.5-0.6x |
| 2× Mac Studio (RDMA) | 20-30 tok/s | 0.4-0.5x |
| Petals swarm (good conditions) | 4-6 tok/s | 0.1x |
| Petals swarm (poor conditions) | 1-2 tok/s | 0.02-0.03x |
Key takeaway: Private LAN clusters with RDMA are 40-60% of datacenter speed — very usable. Internet-distributed inference incurs 10-50x latency penalty — only viable for non-interactive or batch workloads.
5. Desktop-First Approach: Runtimes & Frameworks
macOS (Apple Silicon)
| Framework | Description | Strengths |
|---|---|---|
| MLX (Apple) | Apple's ML framework for Apple Silicon. Native Metal GPU support. | Best Apple Silicon performance. MLX distributed for multi-node. |
| llama.cpp (ggml) | C++ inference. Metal backend. | Widest model support (GGUF). Great single-node. |
| Exo | Clustering layer on top of MLX | Only RDMA-capable solution. Tensor parallelism. Auto-discovery. |
| Ollama | User-friendly wrapper around llama.cpp | Easiest setup. No clustering. |
Apple Silicon sweet spot:
- M1/M2 with 16-32 GB: Run 7-13B models locally, contribute layers to larger distributed models
- M2/M3 Pro/Max with 32-96 GB: Run 70B models locally (Q4), meaningful cluster participant
- M3 Ultra with 192-512 GB: Run 405B locally, anchor node for clusters
Windows (NVIDIA GPUs)
| Framework | Description | Strengths |
|---|---|---|
| llama.cpp (CUDA) | C++ with CUDA backend | Fast, wide model support |
| vLLM | Production inference server | PagedAttention, continuous batching |
| Ollama | Wrapper for llama.cpp | Easy setup |
| Petals (WSL) | Distributed inference | Swarm participation |
| ExLlamaV2 | Optimized GPTQ/EXL2 inference | Fastest per-token for NVIDIA |
NVIDIA consumer GPU landscape:
- RTX 3060 12GB / RTX 4060 8GB: 7-13B models. Limited distributed utility.
- RTX 3090/4090 24GB: Up to 70B Q4 with heavy offloading, or strong cluster node
- Multi-GPU (2× 4090): ~48 GB VRAM, can run 70B Q4 fully in VRAM
Linux
All of the above plus:
- vLLM and SGLang run natively (not in WSL)
- TensorRT-LLM for NVIDIA GPUs
- ROCm for AMD GPUs (RX 7900 XTX with 24 GB VRAM is excellent value)
- Vulkan backend in llama.cpp for broader GPU support
Cross-Platform Clustering Stack
For a SETI@home-style platform, the architecture would be:
┌─────────────────────────────────────────────┐
│ Coordination Layer │
│ (DHT/libp2p peer discovery, health monitor) │
├─────────────────────────────────────────────┤
│ Scheduling / Routing │
│ (topology-aware, heterogeneity-aware) │
├─────────────────────────────────────────────┤
│ Communication Layer │
│ RDMA (TB5) │ TCP (LAN) │ QUIC (Internet) │
├─────────────────────────────────────────────┤
│ Inference Backends │
│ MLX (macOS) │ CUDA/llama.cpp │ ROCm │ CPU │
└─────────────────────────────────────────────┘
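One way to sketch the glue between the communication layer and the rest of the stack (purely hypothetical names and thresholds; none of this comes from an existing project): pick a transport per link from measured characteristics, mirroring the three columns in the communication layer above.

```python
# Hypothetical transport selection for the communication layer above.
from dataclasses import dataclass

@dataclass
class Link:
    bandwidth_gbps: float
    latency_us: float
    rdma: bool            # e.g. RDMA over Thunderbolt 5 available

def pick_transport(link: Link) -> str:
    if link.rdma:
        return "rdma"     # direct-connect: unified-memory-style access
    if link.latency_us < 1000:
        return "tcp"      # same LAN: plain sockets are fine
    return "quic"         # internet peers: NAT traversal, multiplexing

links = [Link(55, 3, True), Link(1, 400, False), Link(0.1, 40000, False)]
print([pick_transport(l) for l in links])  # ['rdma', 'tcp', 'quic']
```

The scheduling layer would then consume these decisions, since (per the key insight in section 1) the transport determines whether tensor parallelism is even on the table for a given pair of nodes.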
6. Recommendations & Key Takeaways
- Private LAN clusters are the sweet spot today. Exo + Apple Silicon + RDMA delivers a near-datacenter experience for up to 1T-parameter models. This is production-ready.
- Internet-distributed inference (the SETI@home model) is technically feasible but marginal. Petals proves it works, but 1-6 tok/s with reliability issues limits it to non-interactive use cases (batch processing, fine-tuning, embedding generation).
- Pipeline parallelism is the only viable approach over the internet. Tensor parallelism requires sub-millisecond latency, limiting it to direct-connect or RDMA.
- Apple Silicon is the dominant consumer platform for this. Unified memory means a $1,600 Mac Mini with 32 GB can do what would otherwise require a $1,600 GPU with only 24 GB of VRAM. The M3 Ultra at 512 GB is unmatched in consumer memory capacity.
- The technology gap is narrowing fast. In 2023, distributed consumer inference was a research curiosity. In 2025, Exo runs DeepSeek V3.1 (671B) at usable speeds on a desk-sized cluster. By 2027, with TB5 ubiquity and continued Apple Silicon scaling, 1T+ model inference on consumer clusters will be routine.
- For a new platform: build on existing primitives — MLX distributed for Apple, llama.cpp CUDA for NVIDIA, libp2p for peer discovery. The hard unsolved problems are reliable scheduling across heterogeneous, unreliable nodes, and making pipeline parallelism over the internet fast enough for interactive use.