Android On-device AI
This topic covers Android on-device AI engineering.
It focuses on how AI capabilities actually land inside Android apps: how models are loaded, how inference is scheduled, how memory and power are controlled, how edge and cloud paths work together, and how Compose screens handle streaming or multimodal output.
This topic differs from AI Development Tools, which covers using AI to write and operate software. This page covers building AI features that run on Android devices.
Learning Path
- Start with platform capabilities: AICore, Gemini Nano, ML Kit, NNAPI (deprecated as of Android 15), LiteRT, and MediaPipe.
- Benchmark the full pipeline, not just the model: latency, throughput, NPU/GPU/CPU usage, memory bandwidth, power, and thermal behavior.
- Design LLM product behavior: prompt budget, context windows, streaming output, local RAG, and conversation state.
- Productionize the system: model distribution, versioning, concurrency, fallback, security, multimodal input, and privacy boundaries.
Platform and Capability Entry Points
- Android AICore and Gemini Nano: system services, model access, and local inference
- Android ML Kit pipeline: from visual detection to CameraX integration
- Android NNAPI internals: HAL abstraction and Qualcomm/MediaTek NPU paths
- Android 16 App Functions: semantic indexing and cross-app intelligent actions
Performance and Resource Control
- Android on-device AI benchmark design: latency, throughput, power, and thermal degradation
- Profiling NPU scheduling and memory bandwidth with Perfetto
- Memory-bandwidth optimization: from GPU shared memory to NPU zero-copy paths
- Power and thermal management for on-device inference
- Dynamic inference policy based on temperature, battery, and memory pressure
- Memory management for local AI: model-load peaks and KV cache recycling
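As a rough illustration of a dynamic inference policy driven by temperature, battery, and memory pressure, the sketch below maps a device snapshot to an inference tier. Everything here is hypothetical: `DeviceState`, `InferenceTier`, `InferencePolicy`, and the thresholds are illustrative names and values, not a platform API. The thermal level is assumed to follow the ordering of `PowerManager.THERMAL_STATUS_*` (0 = none, 2 = moderate, 3 = severe), and on a real device the inputs would come from system services rather than a constructor.

```java
/** Hypothetical snapshot of device conditions (not a platform class). */
final class DeviceState {
    final int thermalLevel;    // assumed ordering: 0 = none, 2 = moderate, 3 = severe, higher = worse
    final int batteryPercent;  // 0..100
    final boolean charging;
    final boolean lowMemory;   // e.g. the system has reported memory pressure

    DeviceState(int thermalLevel, int batteryPercent, boolean charging, boolean lowMemory) {
        this.thermalLevel = thermalLevel;
        this.batteryPercent = batteryPercent;
        this.charging = charging;
        this.lowMemory = lowMemory;
    }
}

/** Hypothetical tiers an app might switch between. */
enum InferenceTier { FULL_LOCAL, REDUCED_LOCAL, DEFER_OR_CLOUD }

final class InferencePolicy {
    // Illustrative thresholds; real values should come from benchmarking.
    private static final int THERMAL_MODERATE = 2;
    private static final int THERMAL_SEVERE = 3;
    private static final int LOW_BATTERY_PERCENT = 20;

    static InferenceTier choose(DeviceState s) {
        // Hard limits first: severe heat or memory pressure means no heavy local work.
        if (s.thermalLevel >= THERMAL_SEVERE || s.lowMemory) {
            return InferenceTier.DEFER_OR_CLOUD;
        }
        // Soft limits: moderate heat or low battery while unplugged reduces the workload.
        boolean lowBattery = s.batteryPercent < LOW_BATTERY_PERCENT && !s.charging;
        if (s.thermalLevel >= THERMAL_MODERATE || lowBattery) {
            return InferenceTier.REDUCED_LOCAL;
        }
        return InferenceTier.FULL_LOCAL;
    }
}
```

Keeping the policy a pure function of a snapshot makes it easy to unit test and to re-evaluate whenever a thermal, battery, or memory callback fires.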
LLM, RAG, and UI Integration
- Streaming local LLM output: from token generation to incremental Compose rendering
- Context-window engineering: prompt compression and conversation state machines
- Local RAG on Android: from vector databases to knowledge-augmented inference
- Prompt engineering for on-device inference: token budgets and few-shot templates
- Compose UI architecture for local AI chat: streaming rendering and multi-turn state
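To make the idea of a token budget concrete, here is a minimal sketch of one common context-window strategy: always keep the system prompt, then keep as many of the most recent turns as fit. The `ContextWindow` class and its methods are hypothetical, and `estimateTokens` is a crude whitespace word count standing in for the model's real tokenizer, which this sketch assumes is not available.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper that trims conversation history to a token budget. */
final class ContextWindow {

    /** Crude stand-in for a tokenizer: counts whitespace-separated words. */
    static int estimateTokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /**
     * Returns the system prompt followed by the newest turns that fit in the budget.
     * Turns are expected oldest-first; the oldest ones are dropped first.
     */
    static List<String> fit(String systemPrompt, List<String> turns, int budget) {
        int remaining = budget - estimateTokens(systemPrompt);
        Deque<String> kept = new ArrayDeque<>();
        // Walk from newest to oldest and stop at the first turn that does not fit,
        // so the kept history stays contiguous.
        for (int i = turns.size() - 1; i >= 0; i--) {
            int cost = estimateTokens(turns.get(i));
            if (cost > remaining) {
                break;
            }
            kept.addFirst(turns.get(i));
            remaining -= cost;
        }
        List<String> prompt = new ArrayList<>();
        prompt.add(systemPrompt);
        prompt.addAll(kept);
        return prompt;
    }
}
```

Dropping whole turns is the simplest policy. Summarizing or compressing older turns is an alternative when earlier context still matters.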
Production Governance
- Hybrid edge-cloud AI inference: model routing and offline fallback
- Dynamic model delivery and version management on Android
- Concurrent inference scheduling: singleton engines, priority queues, and backpressure
- Model security: encrypted storage, TEE inference, and IP protection
- Realtime video-stream inference: from CameraX frame callbacks to GPU processing
- Multimodal local AI: Gemini Nano multimodality and real-time Compose interaction
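The scheduling ideas named above (a single engine, a priority queue, and backpressure) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a production scheduler: `InferenceScheduler` is a hypothetical class, higher numbers mean higher priority, and `runNext()` is assumed to be called from the one thread that owns the inference engine.

```java
import java.util.PriorityQueue;

/** Hypothetical bounded priority scheduler for a single inference engine. */
final class InferenceScheduler {

    private static final class Job implements Comparable<Job> {
        final int priority;   // higher runs first
        final long sequence;  // preserves submission order within a priority
        final Runnable work;

        Job(int priority, long sequence, Runnable work) {
            this.priority = priority;
            this.sequence = sequence;
            this.work = work;
        }

        @Override
        public int compareTo(Job other) {
            int byPriority = Integer.compare(other.priority, priority);
            return byPriority != 0 ? byPriority : Long.compare(sequence, other.sequence);
        }
    }

    private final PriorityQueue<Job> queue = new PriorityQueue<>();
    private final int capacity;
    private long nextSequence = 0;

    InferenceScheduler(int capacity) {
        this.capacity = capacity;
    }

    /** Returns false when the queue is full so the caller can drop, retry, or fall back. */
    synchronized boolean submit(int priority, Runnable work) {
        if (queue.size() >= capacity) {
            return false; // backpressure: refuse rather than grow without bound
        }
        queue.add(new Job(priority, nextSequence++, work));
        return true;
    }

    /** Runs the highest-priority pending job, if any; intended for the single engine thread. */
    boolean runNext() {
        Job job;
        synchronized (this) {
            job = queue.poll();
        }
        if (job == null) {
            return false;
        }
        job.work.run(); // run outside the lock so submitters are not blocked by inference
        return true;
    }
}
```

Rejecting work at submission time keeps memory bounded and lets the caller decide what to do, for example skipping a camera frame or routing a request to the cloud.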
Next Step
For resource pressure, frame stability, and tracing methods, continue with Android Performance. For streaming and chat UI, continue with Jetpack Compose. For release gates and model governance, continue with Mobile Engineering.