Android Hybrid AI Routing and Offline Fallback: End-to-end On-device and Cloud Inference Scheduling
When building on-device AI features, we hit a classic failure mode: an intelligent cutout feature worked perfectly on Wi-Fi, but QA tried it on the subway and the whole page froze into an ANR. The root cause was that the routing logic only checked whether the device was online. It ignored network quality and model complexity. With a 200 MB model, simply uploading an image over a weak network took 30 seconds.
The core problem was coarse decision-making. A routing policy cannot be a simple if-else. It needs a multidimensional decision engine that understands context and dynamically weighs tradeoffs.
The four dimensions of a routing decision engine
A routing engine is essentially a scoring function. It takes the current context as input and returns an inference target: local, cloud, or hybrid. The engine I designed uses four dimensions:
Routing score = W1 * model fit + W2 * network quality + W3 * latency budget + W4 * cost constraint
Each dimension is normalized to 0-1, and the weights can be adjusted by business scenario.
Model fit reflects what the on-device model can cover. For example, the device may have a 50 MB general classification model that handles 80% of common cases, but fine-grained recognition still needs a larger cloud model as a backstop. Model fit should not be binary. You can use the on-device model’s output confidence directly as a signal: when confidence falls below a threshold such as 0.7, trigger a cloud review automatically.
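data class RoutingContext(
    val inputSizeBytes: Long,
    val taskType: TaskType,
    val latencyBudgetMs: Long,
    val modelConfidence: Float = 0f
)

fun computeRoute(ctx: RoutingContext, networkScore: Float): InferenceTarget {
    // Weak network: never route to the cloud, whatever the model fit is.
    if (networkScore < 0.3f) return InferenceTarget.LOCAL
    // Confidence at or above 0.7 counts as a full fit; below it, scale linearly.
    val matchScore = if (ctx.modelConfidence >= 0.7f) 1.0f else ctx.modelConfidence / 0.7f
    // A tighter latency budget scores higher, which favors local inference.
    val latencyScore = (1f - ctx.latencyBudgetMs / 5000f).coerceIn(0f, 1f)
    val composite = matchScore * 0.4f + networkScore * 0.3f + latencyScore * 0.3f
    return when {
        composite > 0.7f -> InferenceTarget.LOCAL
        composite > 0.3f -> InferenceTarget.CLOUD_FALLBACK
        else -> InferenceTarget.CLOUD_ONLY
    }
}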
The 0.4/0.3/0.3 weights were not guessed. We stress-tested for three weeks, ran 5,000 inference requests under different network conditions, and derived the best ratio from P95 latency and success rate. They cover model fit, network quality, and latency budget; the cost-constraint term is left out of the snippet above. In production, the weights should be configurable and adjusted through remotely delivered parameters.
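One way to make the weights configurable is sketched below. It assumes remote parameters arrive as a string map; `RoutingWeights`, `fromRemote`, and the key names such as `w_model_fit` are hypothetical and not part of the original engine. The cost weight defaults to 0 so that the defaults reproduce the 0.4/0.3/0.3 ratio.

```kotlin
// Hypothetical container for the four weights in the routing-score formula.
data class RoutingWeights(
    val modelFit: Float = 0.4f,
    val network: Float = 0.3f,
    val latency: Float = 0.3f,
    val cost: Float = 0f
) {
    // Scale the weights so they sum to 1, which keeps the composite score in 0..1.
    fun normalized(): RoutingWeights {
        val sum = modelFit + network + latency + cost
        if (sum <= 0f) return RoutingWeights() // Degenerate config: use the defaults.
        return RoutingWeights(modelFit / sum, network / sum, latency / sum, cost / sum)
    }

    // Weighted sum of the four dimension scores, each expected in 0..1.
    fun score(modelFitScore: Float, networkScore: Float, latencyScore: Float, costScore: Float): Float =
        modelFit * modelFitScore + network * networkScore +
            latency * latencyScore + cost * costScore

    companion object {
        // Key names are assumptions. Missing, malformed, or negative values fall back to defaults.
        fun fromRemote(params: Map<String, String>): RoutingWeights {
            val defaults = RoutingWeights()
            fun read(key: String, fallback: Float): Float =
                params[key]?.toFloatOrNull()?.takeIf { it >= 0f && it.isFinite() } ?: fallback
            return RoutingWeights(
                read("w_model_fit", defaults.modelFit),
                read("w_network", defaults.network),
                read("w_latency", defaults.latency),
                read("w_cost", defaults.cost)
            ).normalized()
        }
    }
}
```

Normalizing after parsing means a bad remote push can skew the ratio but cannot push the composite score outside 0..1, so the 0.7 and 0.3 routing thresholds keep their meaning.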
Network awareness: more than checking whether a network exists
Most apps rely on an isConnected() check. For on-device and cloud collaboration, that is nowhere near enough. What you need is a network-quality curve, not a boolean.
On Android, there are three useful ways to collect network-quality data:
// 1. ConnectivityManager provides bandwidth estimates
val cm = context.getSystemService(ConnectivityManager::class.java)
val caps = cm.getNetworkCapabilities(cm.activeNetwork)
val downStream = caps?.linkDownstreamBandwidthKbps ?: 0 // Estimated downstream bandwidth
val upStream = caps?.linkUpstreamBandwidthKbps ?: 0 // Estimated upstream bandwidth
// 2. Active probing: RTT for a small request
suspend fun probeRtt(): Long {
    val start = SystemClock.elapsedRealtime()
    // httpClient is the app's existing HTTP client; only the timing matters, not the response.
    httpClient.head("https://api.example.com/ping")
    return SystemClock.elapsedRealtime() - start
}
// 3. Historical statistics with a sliding window
class NetworkHistory(val windowMs: Long = 60_000) {
    // Not thread-safe: call record() and avg() from one thread, or add synchronization.
    private val samples = ArrayDeque<Pair<Long, Float>>()

    fun record(quality: Float) {
        val now = SystemClock.elapsedRealtime()
        samples.addLast(now to quality)
        // Remove expired samples
        while (samples.isNotEmpty() && samples.first().first < now - windowMs) {
            samples.removeFirst()
        }
    }

    // Average quality over the window; 0 when there are no samples.
    fun avg(): Float = if (samples.isEmpty()) 0f
        else samples.map { it.second }.average().toFloat()
}
After cross-checking those three signals, the system outputs a 0-1 network-quality score. When the score is below 0.3, the routing engine should not send a cloud request even if the model fit is high. It should go directly to local inference or a fallback strategy.
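A minimal sketch of how the three signals could be fused into one score follows. The normalization bounds (20 Mbps as "full" bandwidth, 50-1000 ms as the RTT range) and the 0.3/0.5/0.2 blend are illustrative assumptions, not measured values from this project; `NetworkSignals`, `networkScore`, and `cloudAllowed` are hypothetical names.

```kotlin
// Inputs mirror the three signals above. Zero or null means the signal is unavailable.
data class NetworkSignals(
    val downstreamKbps: Int,   // e.g. from linkDownstreamBandwidthKbps; 0 if not reported
    val rttMs: Long?,          // from an active probe; null if the probe failed
    val historyAvg: Float?     // sliding-window average; null if there are no samples yet
)

fun networkScore(s: NetworkSignals): Float {
    // Illustrative normalization bounds (assumptions; tune per product).
    val bandwidth = if (s.downstreamKbps > 0) (s.downstreamKbps / 20_000f).coerceIn(0f, 1f) else null
    val rtt = s.rttMs?.let { (1f - (it - 50f) / 950f).coerceIn(0f, 1f) }
    val history = s.historyAvg?.coerceIn(0f, 1f)

    // Pair each available signal with its blend weight; unavailable signals are skipped.
    val parts = listOfNotNull(
        bandwidth?.let { it to 0.3f },
        rtt?.let { it to 0.5f },
        history?.let { it to 0.2f }
    )
    if (parts.isEmpty()) return 0f // No evidence at all: treat the network as unusable.

    // Weighted average over whatever signals are present.
    var weighted = 0f
    var weightSum = 0f
    for ((value, weight) in parts) {
        weighted += value * weight
        weightSum += weight
    }
    return weighted / weightSum
}

// Below 0.3 the routing engine should avoid cloud requests entirely.
fun cloudAllowed(score: Float): Boolean = score >= 0.3f
```

Renormalizing by the weights of the available signals keeps the score comparable when one source is missing, for example when the bandwidth estimate is not reported.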
A real-world pitfall: linkDownstreamBandwidthKbps returns 0 on some vendor-customized Android ROMs, so you need to fall back to active probing. Also, when the device switches from Wi-Fi to 5G cellular, system callbacks can lag by 1-2 seconds. Requests sent during that window are likely to fail, so freeze the routing decision queue as soon as a network-switch event arrives.
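The freeze could look roughly like the sketch below. `RoutingGate` and its methods are hypothetical names, and the 2-second window is an assumption chosen to cover the 1-2 second callback lag. The clock is injected so the class has no Android dependency; on a device, `SystemClock.elapsedRealtime()` would be the natural choice.

```kotlin
// Hypothetical gate that parks routing decisions for a short window after a network switch.
class RoutingGate(
    private val freezeMs: Long = 2_000, // Assumed window covering the 1-2 s callback lag
    private val now: () -> Long = { System.nanoTime() / 1_000_000 }
) {
    @Volatile private var frozenUntil = Long.MIN_VALUE
    private val pending = ArrayDeque<() -> Unit>()

    // Call from the network-switch callback, e.g. a ConnectivityManager.NetworkCallback.
    fun onNetworkSwitch() {
        frozenUntil = now() + freezeMs
    }

    fun isFrozen(): Boolean = now() < frozenUntil

    // Run the decision immediately, or park it until the freeze window ends.
    @Synchronized
    fun submit(decision: () -> Unit) {
        if (isFrozen()) pending.addLast(decision) else decision()
    }

    // Call from a delayed task or timer; runs parked decisions once the window has passed.
    @Synchronized
    fun drainIfReady(): Int {
        if (isFrozen()) return 0
        var ran = 0
        while (pending.isNotEmpty()) {
            pending.removeFirst().invoke()
            ran++
        }
        return ran
    }
}
```

Parked decisions are re-run rather than dropped, so they evaluate against the network state that exists after the switch has settled.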
On-device degradation and offline disaster recovery
Networks are sometimes unavailable. Offline handling is not about asking “what do we do without the cloud?” It is about where the boundary of on-device capability is, and how to make degradation invisible to the user.
We designed a three-tier on-device capability model:
| Level | Capability | Typical latency | Scenarios |
|---|---|---|---|
| L1 Full capability | Full on-device large-model inference | 200-500 ms | Offline or on Wi-Fi, high-quality on-device model |
| L2 Medium capability | Quantized model or simplified inference path | 50-150 ms | Unstable mobile network |
| L3 Basic capability | Rule engine plus cache fallback | <10 ms | Fully offline, low-end devices |
The key is that the transition between L2 and L3 must be invisible to the user. Take intelligent photo editing as an example. L2 can still produce reasonable results with a quantized model, while L3 uses preset filters and templates. The user sees “edit complete”; only the quality differs. We A/B tested the quality gap: 90% of users could not distinguish L2 from L1, and 60% could not distinguish L3 from L2.
class DegradationManager(
    private val deviceCap: DeviceCapability,
    private val networkState: StateFlow<NetworkQuality>
) {
    fun currentLevel(): DegradationLevel {
        // Read the StateFlow once so both checks see the same sample.
        val score = networkState.value.score
        if (score > 0.5f && deviceCap.isHighEnd) {
            return DegradationLevel.L1
        }
        if (deviceCap.ramMb >= 4096 && score > 0.2f) {
            return DegradationLevel.L2
        }
        return DegradationLevel.L3
    }

    fun selectExecutor(task: InferenceTask): ModelExecutor = when (currentLevel()) {
        DegradationLevel.L1 -> LocalLargeModel.smallDispatcher()
        DegradationLevel.L2 -> QuantizedModel // 8-bit quantization, model size reduced by 75%
        DegradationLevel.L3 -> RuleEngineCache
    }
}
Quantized-model load time also matters. The L1 model stays resident in memory, while the L2 quantized model is loaded on demand. The first load has a 200-400 ms cold-start cost. We use a simple warm-up strategy: when network quality is below 0.5 for three consecutive samples, preload the quantized model in the background without occupying the main thread.
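The warm-up rule can be sketched as a small streak counter. `WarmUpTrigger` is a hypothetical name, and the `preload` callback is assumed to hand the actual model loading to a background thread (for example a coroutine on `Dispatchers.IO`) rather than doing the work inline.

```kotlin
// Hypothetical trigger: fires once when quality stays below the threshold
// for the required number of consecutive samples.
class WarmUpTrigger(
    private val threshold: Float = 0.5f,
    private val requiredSamples: Int = 3,
    private val preload: () -> Unit // Should dispatch to a background thread, never block the caller
) {
    private var lowStreak = 0
    private var preloaded = false

    // Feed every new network-quality sample. Returns true if this sample triggered the preload.
    @Synchronized
    fun onSample(quality: Float): Boolean {
        // Any sample at or above the threshold breaks the streak.
        lowStreak = if (quality < threshold) lowStreak + 1 else 0
        if (preloaded || lowStreak < requiredSamples) return false
        preloaded = true
        preload()
        return true
    }

    // Call when the quantized model is evicted so a later weak-network streak can preload it again.
    @Synchronized
    fun reset() {
        preloaded = false
        lowStreak = 0
    }
}
```

Requiring consecutive samples avoids paying the load cost for a single noisy reading, and the `preloaded` flag keeps a long weak-network period from triggering repeated loads.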
Request queues and concurrent scheduling
Request-queue management is another easy place to make mistakes in hybrid inference. At any moment, there may be multiple inference requests: the user rapidly switches filters, scrolls through AI-generated content, and a background task precomputes the next frame. These requests have different priorities. A single FIFO queue slows down interactive response.
sealed class InferencePriority : Comparable<InferencePriority> {
    data object CRITICAL : InferencePriority() // Current user gesture response
    data class HIGH(val deadlineMs: Long) : InferencePriority() // Visible area
    data class MEDIUM(val page: String) : InferencePriority() // Preload
    data object LOW : InferencePriority() // Background computation

    private val ordinal: Int get() = when (this) {
        CRITICAL -> 0; is HIGH -> 1; is MEDIUM -> 2; LOW -> 3
    }

    override fun compareTo(other: InferencePriority) =
        ordinal.compareTo(other.ordinal)
}
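// Order by priority explicitly; without a comparator, PriorityQueue requires
// InferenceRequest itself to implement Comparable and fails at runtime otherwise.
// PriorityQueue is not thread-safe, so confine access to one thread or guard it with a lock.
val requestQueue = PriorityQueue<InferenceRequest>(compareBy<InferenceRequest> { it.priority })

suspend fun schedule(request: InferenceRequest) {
    requestQueue.add(request)
    // A CRITICAL request cancels queued lower-priority requests for the same task
    // to avoid duplicate inference work.
    if (request.priority == InferencePriority.CRITICAL) {
        requestQueue.removeAll { it.taskId == request.taskId &&
            it.priority > InferencePriority.CRITICAL }
    }
}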
CRITICAL priority also needs deduplication. When the user scrolls quickly, the same task can accumulate multiple queued requests. Keep only the latest one to avoid wasting GPU resources.
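A minimal sketch of latest-only deduplication follows. `QueuedRequest` and `LatestOnlyQueue` are hypothetical stand-ins, with priority simplified to an integer where a lower value is more urgent. As a simplification, a newer request for a task replaces the older one regardless of priority.

```kotlin
// Hypothetical request type; a larger seq means a newer request.
data class QueuedRequest(val taskId: String, val priority: Int, val seq: Long)

// Keeps at most one pending request per taskId: the most recent one.
class LatestOnlyQueue {
    private val latest = LinkedHashMap<String, QueuedRequest>()

    @Synchronized
    fun offer(request: QueuedRequest) {
        // Remove first so the re-inserted entry moves to the end of the insertion order.
        latest.remove(request.taskId)
        latest[request.taskId] = request
    }

    // Returns the most urgent pending request (lowest priority value).
    // Ties go to the entry that was inserted earliest.
    @Synchronized
    fun poll(): QueuedRequest? {
        val next = latest.values.minByOrNull { it.priority } ?: return null
        latest.remove(next.taskId)
        return next
    }

    @Synchronized
    fun size(): Int = latest.size
}
```

Because superseded requests are dropped before they are ever polled, a fast scroll that enqueues the same task many times costs at most one inference.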
Tradeoffs behind several decisions
Looking back at the whole architecture, three tradeoffs matter most.
Should routing decisions live on the client or the server? I chose the client. The server has more global information, but it adds another network round trip, which is a poor latency tradeoff. A client can make a local decision from a 60-second sliding window of historical data, which covers most scenarios. The cost is slightly heavier client logic: roughly 200 KB of additional package size, which is acceptable.
The cost of evaluating model fit. Using on-device model confidence as a routing signal means running an additional inference before each request. On a Snapdragon 8 Gen 2, we measured this extra cost at about 15-30 ms. That is an order of magnitude cheaper than a failed cloud request, so it is completely acceptable.
How users perceive fallback quality. Technically, L3 can switch in seamlessly, but product still needs to decide whether L3 results should be labeled as “offline mode.” Our conclusion was not to label them. Labeling actually amplified the user’s perception that quality was reduced; leaving it unlabeled produced a lower complaint rate. That was a counterintuitive finding.
If you are building a similar on-device and cloud collaboration system, spend the early effort on solid network-quality evaluation. It is the foundation of the whole routing engine. No matter how clever the other strategies are, an inaccurate foundation ruins them. Quantized-model loading strategy is also worth planning up front; do not wait until device-compatibility issues show up in production.