LLM Inference Articles

Android On-device AI Memory Management: Model Loading Peaks, Tensor Lifetimes, and KV Cache Reclaim

A practical memory-management guide for Android on-device LLM deployment, covering mmap model loading, tensor lifetime reclamation, sliding-window KV cache, layer-wise decay, and surviving the low-memory killer (LMK).

Android On-device AI Prompt Engineering: Token Budgets, Few-shot Compression, and TTFT Control

A practical prompt-engineering guide for Android on-device LLMs, showing how token budgeting, few-shot template compression, and dynamic budget switching reduced time to first token (TTFT) from 8.7 seconds to under 2 seconds.