Android Hybrid AI Routing with Offline Fallback and Cloud Inference | Kai

When building on-device AI features, we hit a classic failure mode: an intelligent cutout feature worked perfectly on Wi-Fi, but when QA tried it on the subway the whole page froze into an ANR. The root cause was that the routing logic only checked whether the device was online. It ignored network quality and model complexity. The feature relied on a 200 MB model, and over a weak network just uploading the image took 30 seconds.

The core problem was coarse decision-making. A routing policy cannot be a simple if-else. It needs a multidimensional decision engine that understands context and dynamically weighs tradeoffs.

The four dimensions of a routing decision engine

A routing engine is essentially a scoring function. It takes the current context as input and returns an inference target: local, cloud, or hybrid. The engine I designed uses four dimensions:

Routing score = W1 * model fit + W2 * network quality + W3 * latency budget + W4 * cost constraint

Each dimension is normalized to 0-1, and the weights can be adjusted by business scenario.

Model fit reflects what the on-device model can cover. For example, the device may have a 50 MB general classification model that handles 80% of common cases, but fine-grained recognition still needs a larger cloud model as a backstop. Model fit should not be binary. You can use the on-device model’s output confidence directly as a signal: when confidence falls below a threshold such as 0.7, trigger a cloud review automatically.

data class RoutingContext(
    val inputSizeBytes: Long,
    val taskType: TaskType,
    val latencyBudgetMs: Long,
    val modelConfidence: Float = 0f
)

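fun computeRoute(ctx: RoutingContext, networkScore: Float): InferenceTarget {
    // Below 0.3 the link is too poor for a cloud round trip, so stay on device.
    if (networkScore < 0.3f) return InferenceTarget.LOCAL

    // Model fit: full credit at or above the 0.7 confidence threshold, scaled down below it.
    val matchScore = if (ctx.modelConfidence >= 0.7f) 1.0f else ctx.modelConfidence / 0.7f

    // Latency term: tighter budgets score higher; budgets of 5 s or more score 0.
    // The cost-constraint term is omitted in this sample.
    val latencyScore = (1f - ctx.latencyBudgetMs / 5000f).coerceIn(0f, 1f)

    val composite = matchScore * 0.4f + networkScore * 0.3f + latencyScore * 0.3f

    return when {
        composite > 0.7f -> InferenceTarget.LOCAL
        composite > 0.3f -> InferenceTarget.CLOUD_FALLBACK
        else -> InferenceTarget.CLOUD_ONLY
    }
}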

The 0.4/0.3/0.3 weights were not guessed. Over three weeks of stress testing, we ran 5,000 inference requests under different network conditions and derived the ratio from P95 latency and success rate. The sample above leaves out the cost-constraint term, which is why it carries only three weights. In production, the weights should be configurable and adjusted through remotely delivered parameters.
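One way to keep the weights adjustable is a small value class that is filled from remote parameters and normalized before use. The sketch below is only an illustration under stated assumptions: RoutingWeights and routingScore are hypothetical names, the cost weight defaults to 0 to mirror the three-term sample above, and how the values are fetched is left out.

```kotlin
// Hypothetical container for remotely delivered weights; names are illustrative.
data class RoutingWeights(
    val modelFit: Float = 0.4f,
    val network: Float = 0.3f,
    val latency: Float = 0.3f,
    val cost: Float = 0f
) {
    // Rescale so the weights sum to 1, which keeps the composite score in 0-1
    // even if a remote config delivers values on a different scale.
    fun normalized(): RoutingWeights {
        require(modelFit >= 0f && network >= 0f && latency >= 0f && cost >= 0f) {
            "Weights must not be negative"
        }
        val sum = modelFit + network + latency + cost
        require(sum > 0f) { "At least one weight must be positive" }
        return RoutingWeights(modelFit / sum, network / sum, latency / sum, cost / sum)
    }
}

// Weighted sum of the four dimensions; each input is clamped to 0-1 first.
fun routingScore(
    weights: RoutingWeights,
    modelFit: Float,
    networkQuality: Float,
    latencyTerm: Float,
    costTerm: Float
): Float {
    val w = weights.normalized()
    return w.modelFit * modelFit.coerceIn(0f, 1f) +
        w.network * networkQuality.coerceIn(0f, 1f) +
        w.latency * latencyTerm.coerceIn(0f, 1f) +
        w.cost * costTerm.coerceIn(0f, 1f)
}
```

Normalizing on the client means a misconfigured remote value can shift the balance between dimensions but cannot push the score outside the range the thresholds expect.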

Network awareness: more than checking whether a network exists

Most apps rely on an isConnected() check. For on-device and cloud collaboration, that is nowhere near enough. What you need is a network-quality curve, not a boolean.

On Android, there are three useful ways to collect network-quality data:

// 1. ConnectivityManager provides bandwidth estimates
val cm = context.getSystemService(ConnectivityManager::class.java)
val caps = cm.getNetworkCapabilities(cm.activeNetwork)
val downStream = caps?.linkDownstreamBandwidthKbps ?: 0  // Estimated downstream bandwidth
val upStream = caps?.linkUpstreamBandwidthKbps ?: 0      // Estimated upstream bandwidth

// 2. Active probing: RTT for a small request
suspend fun probeRtt(): Long {
    val start = SystemClock.elapsedRealtime()
    // Only the round-trip time matters, so the response is discarded.
    httpClient.head("https://api.example.com/ping")
    return SystemClock.elapsedRealtime() - start
}

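// 3. Historical statistics with a sliding window
class NetworkHistory(val windowMs: Long = 60_000) {
    private val samples = ArrayDeque<Pair<Long, Float>>()

    fun record(quality: Float) {
        samples.addLast(SystemClock.elapsedRealtime() to quality)
        evictExpired()
    }

    // Evict before averaging as well, so stale samples do not linger
    // when record() has not been called for a while.
    fun avg(): Float {
        evictExpired()
        return if (samples.isEmpty()) 0f
        else samples.map { it.second }.average().toFloat()
    }

    // Remove samples older than the window. Not thread-safe:
    // call from a single thread or add locking.
    private fun evictExpired() {
        val cutoff = SystemClock.elapsedRealtime() - windowMs
        while (samples.isNotEmpty() && samples.first().first < cutoff) {
            samples.removeFirst()
        }
    }
}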

After cross-checking those three signals, the system outputs a 0-1 network-quality score. When the score is below 0.3, the routing engine should not send a cloud request, regardless of the model-fit score. It should go directly to local inference or a fallback strategy.
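As a rough illustration of how the three signals could be folded into one score, the sketch below normalizes each signal to 0-1 and averages whichever ones are available. The function name, the reference values (5,000 Kbps upstream as "full score", 1,000 ms RTT as "zero score"), and the plain average are all assumptions made for this example, not the values or formula used in the system described here.

```kotlin
// Combine the bandwidth estimate, probed RTT, and sliding-window average into a 0-1 score.
// goodUpstreamKbps and badRttMs are illustrative reference points, not measured values.
fun networkQualityScore(
    upstreamKbps: Int,
    rttMs: Long?,
    historyAvg: Float?,
    goodUpstreamKbps: Float = 5_000f,
    badRttMs: Float = 1_000f
): Float {
    // A bandwidth of 0 is treated as "estimate unavailable" rather than "no bandwidth".
    val bandwidth = if (upstreamKbps > 0) {
        (upstreamKbps / goodUpstreamKbps).coerceIn(0f, 1f)
    } else {
        null
    }
    // Lower RTT is better: 0 ms maps to 1, badRttMs or worse maps to 0.
    val rtt = rttMs?.let { (1f - it / badRttMs).coerceIn(0f, 1f) }
    val history = historyAvg?.coerceIn(0f, 1f)

    val signals = listOfNotNull(bandwidth, rtt, history)
    // With no usable signal at all, report the worst score so routing stays on device.
    if (signals.isEmpty()) return 0f
    return signals.average().toFloat()
}
```

A more pessimistic variant would take the minimum of the instantaneous signals instead of the average; which one behaves better depends on how noisy the bandwidth estimate is on the target devices.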

A real-world pitfall: linkDownstreamBandwidthKbps returns 0 on some vendor-customized Android ROMs, so you need to fall back to active probing. When the device switches from Wi-Fi to 5G, system callbacks can lag by 1-2 seconds. Requests sent during that window are likely to fail, so the routing decision queue should be frozen as soon as a network-switch event arrives.
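The freeze can be modeled as a small time-based gate. The sketch below is a hypothetical helper, not the production code: RoutingGate and the 2-second default are assumptions, and timestamps are passed in so the class has no Android dependency. On a device, the switch event would typically come from a ConnectivityManager.NetworkCallback and the timestamp from SystemClock.elapsedRealtime().

```kotlin
// Hypothetical gate that holds back routing decisions for a short period
// after a network switch. The 2-second default is an assumption.
class RoutingGate(private val freezeMs: Long = 2_000) {
    private var frozenUntilMs = Long.MIN_VALUE

    // Call when a network-switch event arrives, passing a monotonic timestamp.
    @Synchronized
    fun onNetworkSwitch(nowMs: Long) {
        frozenUntilMs = nowMs + freezeMs
    }

    // True while new routing decisions should wait (or be forced to local inference).
    @Synchronized
    fun isFrozen(nowMs: Long): Boolean = nowMs < frozenUntilMs
}
```

While the gate is frozen, the caller can either queue the request until the window ends or route it locally; the second option keeps interactive requests responsive.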

On-device degradation and offline disaster recovery

Networks are sometimes unavailable. Offline handling is not about asking “what do we do without the cloud?” It is about where the boundary of on-device capability is, and how to make degradation invisible to the user.

We designed a three-tier on-device capability model:

Level	Capability	Typical latency	Scenarios
L1 Full capability	Full on-device large-model inference	200-500 ms	Offline or Wi-Fi, high-quality on-device model
L2 Medium capability	Quantized model or simplified inference path	50-150 ms	Unstable mobile network
L3 Basic capability	Rule engine plus cache fallback	<10 ms	Fully offline, low-end devices

The key is that the transition between L2 and L3 must be invisible to the user. Take intelligent photo editing as an example. L2 can still produce reasonable results with a quantized model, while L3 uses preset filters and templates. The user sees “edit complete”; only the quality differs. We A/B tested the quality gap: 90% of users could not distinguish L2 from L1, and 60% could not distinguish L3 from L2.

class DegradationManager(
    private val deviceCap: DeviceCapability,
    private val networkState: StateFlow<NetworkQuality>
) {
    fun currentLevel(): DegradationLevel {
        if (networkState.value.score > 0.5f && deviceCap.isHighEnd) {
            return DegradationLevel.L1
        }
        if (deviceCap.ramMb >= 4096 && networkState.value.score > 0.2f) {
            return DegradationLevel.L2
        }
        return DegradationLevel.L3
    }

    fun selectExecutor(task: InferenceTask): ModelExecutor = when (currentLevel()) {
        DegradationLevel.L1 -> LocalLargeModel.smallDispatcher()
        DegradationLevel.L2 -> QuantizedModel // 8-bit quantization, model size reduced by 75%
        DegradationLevel.L3 -> RuleEngineCache
    }
}

Load time for the quantized model also matters. The L1 model stays resident in memory, while the L2 quantized model is loaded on demand. The first load has a 200-400 ms cold-start cost. We use a simple warm-up strategy: when network quality is below 0.5 for three consecutive samples, preload the quantized model in the background, off the main thread.
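The "three consecutive samples below 0.5" rule can be expressed as a small streak counter. This is a minimal sketch with a hypothetical class name; it only decides when to start the preload, and the actual model loading (for example on a background dispatcher) is left to the caller.

```kotlin
// Hypothetical trigger for the warm-up rule: fire once when network quality
// stays below the threshold for the required number of consecutive samples.
class WarmUpTrigger(
    private val threshold: Float = 0.5f,
    private val requiredStreak: Int = 3
) {
    private var streak = 0
    private var fired = false

    // Returns true exactly once, on the sample that completes the streak.
    fun onSample(quality: Float): Boolean {
        // Any sample at or above the threshold breaks the streak.
        streak = if (quality < threshold) streak + 1 else 0
        if (!fired && streak >= requiredStreak) {
            fired = true
            return true
        }
        return false
    }

    // Re-arm the trigger, for example after the quantized model has been evicted from memory.
    fun reset() {
        streak = 0
        fired = false
    }
}
```

Firing only once avoids kicking off a second load while the first is still in progress; reset() covers the case where the model is later unloaded under memory pressure.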

Request queues and concurrent scheduling

Request-queue management is another easy place to make mistakes in hybrid inference. At any moment, there may be multiple inference requests: the user rapidly switches filters, scrolls through AI-generated content, and a background task precomputes the next frame. These requests have different priorities. A single FIFO queue slows down interactive response.

import java.util.PriorityQueue

sealed class InferencePriority : Comparable<InferencePriority> {
    data object CRITICAL : InferencePriority()    // Current user gesture response
    data class HIGH(val deadlineMs: Long) : InferencePriority()  // Visible area
    data class MEDIUM(val page: String) : InferencePriority()    // Preload
    data object LOW : InferencePriority()         // Background computation

    // Lower ordinal means higher priority.
    private val ordinal: Int get() = when (this) {
        CRITICAL -> 0; is HIGH -> 1; is MEDIUM -> 2; LOW -> 3
    }
    override fun compareTo(other: InferencePriority) =
        ordinal.compareTo(other.ordinal)
}

// PriorityQueue needs an explicit ordering because InferenceRequest is not Comparable.
// InferenceRequest is assumed to expose taskId and priority.
// PriorityQueue is not thread-safe, so access must be confined or synchronized.
val requestQueue = PriorityQueue<InferenceRequest>(compareBy<InferenceRequest> { it.priority })

suspend fun schedule(request: InferenceRequest) {
    requestQueue.add(request)
    // A CRITICAL request cancels queued lower-priority requests for the same task
    // to avoid duplicate inference work.
    if (request.priority == InferencePriority.CRITICAL) {
        requestQueue.removeAll { it.taskId == request.taskId &&
                                it.priority > InferencePriority.CRITICAL }
    }
}

CRITICAL priority also needs deduplication. When the user scrolls quickly, the same task can accumulate multiple queued requests. Keep only the latest one to avoid wasting GPU resources.
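A "keep only the latest" policy can be sketched with an insertion-ordered map keyed by task. The types below are hypothetical stand-ins for this example (the InferenceRequest type used above is not shown in full), and the buffer is not thread-safe as written.

```kotlin
// Hypothetical request type for this sketch; sequence distinguishes successive submissions.
data class PendingRequest(val taskId: String, val sequence: Long)

// Holds at most one pending request per task: a newer submission replaces the older one.
class LatestOnlyBuffer {
    private val latest = LinkedHashMap<String, PendingRequest>()

    // Replaces any queued request for the same task.
    // Returns the request that was dropped, or null if there was none.
    fun submit(request: PendingRequest): PendingRequest? {
        // Remove first so the re-inserted entry moves to the end of the insertion order.
        val dropped = latest.remove(request.taskId)
        latest[request.taskId] = request
        return dropped
    }

    // Removes and returns the entry at the head of the insertion order, or null when empty.
    fun poll(): PendingRequest? {
        val key = latest.keys.firstOrNull() ?: return null
        return latest.remove(key)
    }

    val size: Int get() = latest.size
}
```

Returning the dropped request from submit() gives the caller a chance to cancel any work already started for it, which is where the GPU savings come from.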

Tradeoffs behind several decisions

Looking back at the whole architecture, three tradeoffs matter most.

Should routing decisions live on the client or the server? I chose the client. The server has more global information, but it adds another network round trip, which is a poor latency tradeoff. A client can make a local decision from a 60-second sliding window of historical data, which covers most scenarios. The cost is slightly heavier client logic: roughly 200 KB of additional package size, which is acceptable.

The cost of evaluating model fit. Using on-device model confidence as a routing signal means running an additional inference before each request. On a Snapdragon 8 Gen 2, we measured this extra cost at about 15-30 ms. That is an order of magnitude cheaper than a failed cloud request, so it is completely acceptable.

How users perceive fallback quality. Technically, L3 can switch in seamlessly, but product still needs to decide whether L3 results should be labeled as “offline mode.” Our conclusion was not to label them. Labeling actually amplified the user’s perception that quality was reduced; leaving it unlabeled produced a lower complaint rate. That was a counterintuitive finding.

If you are building a similar on-device and cloud collaboration system, spend the early effort on solid network-quality evaluation. It is the foundation of the whole routing engine. No matter how clever the other strategies are, an inaccurate foundation ruins them. Quantized-model loading strategy is also worth planning up front; do not wait until device-compatibility issues show up in production.
