Android App Resilience: Retries, Graceful Fallbacks, and Multi-Level Cache

During a 2025 peak-traffic event, an ecommerce app I worked on hit a strange failure: when one product-detail API timed out intermittently, the entire detail page went blank, even though stale cached data existed. The so-called resilience strategy was just a try-catch plus a toast. Users saw an error message instead of a graceful fallback.

That incident clarified the point: client-side resilience is not a few if statements. It is a system that covers networking, data, UI, and observability.

Defining Client-Side Resilience

Server-side resilience has mature patterns: multi-region deployment, rate limiting, circuit breakers, and failover. Mobile clients face a different environment: unstable networks, fragmented devices, and processes that can be killed at any time.

I split client resilience into three dimensions:

  • Recoverability: can the request recover automatically without user action?
  • Degradability: can noncritical features step aside when core flows are under stress?
  • Observability: can client failures be quantified and explained instead of relying on user reports?

Recoverability is the baseline, degradability is the product experience layer, and observability makes the system improvable.

Network Layer: Exponential Backoff and Conditional Retry

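A common mistake is to rely on OkHttp’s retryOnConnectionFailure, which is enabled by default, and assume retry is solved. That flag only recovers silently from a narrow set of connection problems, such as stale pooled connections or an unreachable IP address when the host has alternatives. It does not retry server 5xx responses, and it gives no control over backoff or attempt counts. DNS failures, timeouts, TLS handshake issues, and server 5xx responses need explicit policy.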

The retry logic should be conditional:

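import java.io.IOException
import java.net.SocketTimeoutException
import java.net.UnknownHostException
import okhttp3.Interceptor
import okhttp3.Response

class ConditionalRetryInterceptor(
    private val maxRetries: Int = 3,
    private val initialDelayMs: Long = 1000L
) : Interceptor {
    // Only idempotent methods are safe to replay automatically.
    private val retryableMethods = setOf("GET", "HEAD", "OPTIONS")
    private val retryableCodes = setOf(502, 503, 504)

    override fun intercept(chain: Interceptor.Chain): Response {
        val request = chain.request()
        // Applies to both error responses and thrown exceptions.
        val canRetry = request.method in retryableMethods
        var lastException: IOException? = null
        var delayMs = initialDelayMs

        for (attempt in 0..maxRetries) {
            val isLastAttempt = attempt == maxRetries
            try {
                val response = chain.proceed(request)
                // Return successes, non-retryable results, and the final attempt
                // unchanged so the caller still sees the real status code.
                if (!canRetry || isLastAttempt || response.code !in retryableCodes) {
                    return response
                }
                response.close() // Release the connection before retrying.
            } catch (e: IOException) {
                val isTransient = e is SocketTimeoutException || e is UnknownHostException
                if (!canRetry || !isTransient || isLastAttempt) throw e
                lastException = e
            }

            // Blocks the calling OkHttp thread for the duration of the backoff.
            Thread.sleep(delayMs)
            delayMs *= 2
        }
        // Reached only if maxRetries is negative and the loop never ran.
        throw lastException ?: IOException("Max retries exceeded")
    }
}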

Three decisions are important:

  1. Retry only idempotent methods.
  2. Use exponential backoff instead of fixed intervals.
  3. Distinguish recoverable network errors from deterministic failures.
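The backoff decision can be isolated into a pure function, which makes it easy to unit test. The sketch below is an illustration rather than part of the interceptor above: the name backoffDelayMs, the maxDelayMs cap, and the optional random jitter are all assumptions added here. Jitter spreads clients out so they do not retry in lockstep after a shared outage.

```kotlin
import kotlin.random.Random

/**
 * Delay before retry number [attempt] (0-based): initialDelayMs * 2^attempt,
 * capped at maxDelayMs. With jitter enabled, a random value in
 * [0, cappedDelay] is returned instead of the capped delay itself.
 * All parameter names and defaults are illustrative assumptions.
 */
fun backoffDelayMs(
    attempt: Int,
    initialDelayMs: Long = 1000L,
    maxDelayMs: Long = 8000L,
    jitter: Boolean = true,
    random: Random = Random.Default
): Long {
    require(attempt >= 0) { "attempt must be non-negative" }
    require(initialDelayMs >= 0 && maxDelayMs >= 0) { "delays must be non-negative" }
    // Clamp the shift so a large attempt count cannot overflow for typical delays.
    val exponential = initialDelayMs shl attempt.coerceAtMost(20)
    val capped = exponential.coerceAtMost(maxDelayMs)
    return if (jitter) random.nextLong(capped + 1) else capped
}
```

With jitter disabled and the defaults above, attempts 0, 1, and 2 wait 1000, 2000, and 4000 ms, and every later attempt waits the 8000 ms cap.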

Also limit concurrent retries. If every failed request retries aggressively, the retry storm can fill thread pools and damage healthy APIs.
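One way to cap concurrent retries is a shared permit pool that only retries have to pass through. This is a minimal sketch under assumptions: RetryBudget and runIfAllowed are hypothetical names, and the right permit count depends on the app's thread pools and traffic.

```kotlin
import java.util.concurrent.Semaphore

/**
 * Caps how many retries may be in flight at once. First attempts are not
 * limited; only retries need a permit. Hypothetical helper, not an OkHttp API.
 */
class RetryBudget(maxConcurrentRetries: Int) {
    private val permits = Semaphore(maxConcurrentRetries)

    /**
     * Runs [retry] if a permit is free. Otherwise returns null immediately,
     * so the caller can fail fast or fall back to cached data.
     */
    fun <T : Any> runIfAllowed(retry: () -> T): T? {
        if (!permits.tryAcquire()) return null
        try {
            return retry()
        } finally {
            permits.release() // Always hand the permit back, even on failure.
        }
    }

    fun availablePermits(): Int = permits.availablePermits()
}
```

A null result means the budget is exhausted, which is itself a useful signal to record in telemetry.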

Data Layer: Three-Level Cache Fallback

The data layer should not treat “network failed” as “data unavailable”. A useful fallback chain is:

Network -> Memory cache -> DataStore/MMKV -> Room -> Empty state

The order depends on data type, but the principle is stable: return the freshest usable data and mark its freshness explicitly.
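The chain can be sketched as an ordered list of loaders where the first hit wins and carries a tag naming its source. Source, Sourced, and firstAvailable are hypothetical names for illustration, and the loaders are plain lambdas standing in for real network and storage calls.

```kotlin
/** Where a value came from; the names mirror the chain above. */
enum class Source { NETWORK, MEMORY, KEY_VALUE, DATABASE }

data class Sourced<T>(val value: T, val source: Source)

/**
 * Tries each loader in order and returns the first value found, tagged with
 * its source. A loader that throws is treated as a miss. A null result means
 * every level missed and the UI should show the empty state.
 */
fun <T : Any> firstAvailable(loaders: List<Pair<Source, () -> T?>>): Sourced<T>? {
    for ((source, load) in loaders) {
        val value = try {
            load()
        } catch (e: Exception) {
            null // Treat a failing level as a miss and keep going.
        }
        if (value != null) return Sourced(value, source)
    }
    return null
}
```

Because the order is just a list, each data type can supply its own chain without changing the lookup logic.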

Memory Cache: Fastest Path

Memory cache is best for recent detail pages, home modules, and lightweight configuration:

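import android.util.LruCache

// K and V are non-null because LruCache rejects null keys and values.
class MemoryCache<K : Any, V : Any>(maxSize: Int) {
    // maxSize counts entries unless sizeOf() is overridden.
    private val cache = LruCache<K, V>(maxSize)

    fun get(key: K): V? = cache.get(key)
    fun put(key: K, value: V) { cache.put(key, value) }
}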

Memory cache is fast but volatile. It disappears with process death, so it should not be the only fallback for important content.

DataStore or MMKV: Persistent Middle Layer

Small structured records and feature state can live in DataStore or MMKV. This layer is useful for the last successful home response, user preferences, feature flags, and lightweight detail summaries.

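// updateData is a suspend function, so call it from a coroutine.
dataStore.updateData { current ->
    current.toBuilder()
        .setLastHomePayload(payload)
        // Freshness metadata is written together with the payload.
        .setUpdatedAt(System.currentTimeMillis())
        .build()
}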

Always store freshness metadata. Stale-but-usable data is different from fresh data, and the UI should be able to display that distinction.
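Once a timestamp is stored, the read side can classify a record by age. The sketch below assumes a two-threshold scheme that the article does not prescribe: Freshness, classifyFreshness, softTtlMs, and hardTtlMs are illustrative names, and the thresholds themselves are product decisions.

```kotlin
enum class Freshness { FRESH, STALE, EXPIRED }

/**
 * Classifies a cached record by age. Records no older than softTtlMs are
 * fresh, records no older than hardTtlMs are stale but usable, and anything
 * older is expired and should not be shown.
 */
fun classifyFreshness(
    updatedAtMs: Long,
    nowMs: Long,
    softTtlMs: Long,
    hardTtlMs: Long
): Freshness {
    require(softTtlMs <= hardTtlMs) { "soft TTL must not exceed hard TTL" }
    // A device clock that moved backwards yields a negative age; treat it as zero.
    val ageMs = (nowMs - updatedAtMs).coerceAtLeast(0L)
    return when {
        ageMs <= softTtlMs -> Freshness.FRESH
        ageMs <= hardTtlMs -> Freshness.STALE
        else -> Freshness.EXPIRED
    }
}
```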

Room: Final Defense for Structured Data

Room is appropriate when data needs queries, relations, or paging. Product details, order history, and content feeds often benefit from Room as the durable fallback.

@Query("SELECT * FROM product WHERE id = :id")
suspend fun getProduct(id: String): ProductEntity?

When the network fails, the repository can return cached Room data with a degraded state:

sealed interface DataResult<out T> {
    data class Fresh<T>(val value: T) : DataResult<T>
    data class Stale<T>(val value: T, val reason: String) : DataResult<T>
    data class Unavailable(val reason: String) : DataResult<Nothing>
}

This makes fallback state visible to the UI without throwing away usable data.
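A repository function that produces these states might look like the sketch below. It is written with plain lambdas so it runs on its own: fetchRemote, readCache, and writeCache are hypothetical hooks standing in for an HTTP call and a Room DAO, and in a real app they would most likely be suspend functions.

```kotlin
import java.io.IOException

// Same shape as DataResult above, repeated so this sketch compiles on its own.
sealed interface DataResult<out T> {
    data class Fresh<T>(val value: T) : DataResult<T>
    data class Stale<T>(val value: T, val reason: String) : DataResult<T>
    data class Unavailable(val reason: String) : DataResult<Nothing>
}

/**
 * Network first. On an I/O failure, fall back to whatever the cache holds
 * and report why the data is stale or unavailable.
 */
fun <T : Any> loadWithFallback(
    fetchRemote: () -> T,
    readCache: () -> T?,
    writeCache: (T) -> Unit
): DataResult<T> = try {
    val fresh = fetchRemote()
    writeCache(fresh) // Keep the fallback copy up to date.
    DataResult.Fresh(fresh)
} catch (e: IOException) {
    val reason = e.message ?: "network error"
    val cached = readCache()
    if (cached != null) DataResult.Stale(cached, reason) else DataResult.Unavailable(reason)
}
```

Only IOException triggers the fallback here; programming errors still propagate, so the cache does not mask bugs.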

Orchestration: Adaptive Degradation Engine

Retries and caches are local tools. A complete resilience system also needs orchestration.

The degradation engine evaluates network quality, API health, business priority, and device state:

data class DegradationContext(
    val networkQuality: NetworkQuality,
    val apiHealth: ApiHealth,
    val batterySaver: Boolean,
    val businessPriority: Priority
)

For low-priority modules, the engine can skip network refresh and use cached data. For core purchase flows, it can allow longer retry windows and stronger telemetry. For expensive visual features, it can disable animations or high-resolution images under pressure.
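These rules can be expressed as a pure function from context to plan. Everything beyond the four context fields is an assumption made for illustration: the enum values, the DegradationPlan type, the planFor name, and the specific retry counts are not taken from a real engine.

```kotlin
// The enum values are assumptions; the article only names the four fields.
enum class NetworkQuality { GOOD, POOR, OFFLINE }
enum class ApiHealth { HEALTHY, DEGRADED, DOWN }
enum class Priority { CORE, NORMAL, LOW }

// Repeated from above so this sketch compiles on its own.
data class DegradationContext(
    val networkQuality: NetworkQuality,
    val apiHealth: ApiHealth,
    val batterySaver: Boolean,
    val businessPriority: Priority
)

data class DegradationPlan(
    val refreshFromNetwork: Boolean,
    val maxRetries: Int,
    val richVisuals: Boolean
)

fun planFor(ctx: DegradationContext): DegradationPlan {
    val underPressure = ctx.networkQuality != NetworkQuality.GOOD ||
        ctx.apiHealth != ApiHealth.HEALTHY
    val reachable = ctx.networkQuality != NetworkQuality.OFFLINE &&
        ctx.apiHealth != ApiHealth.DOWN
    // Low-priority modules skip network refresh as soon as there is pressure.
    val refresh = reachable &&
        !(underPressure && ctx.businessPriority == Priority.LOW)
    val retries = when {
        !refresh -> 0
        ctx.businessPriority == Priority.CORE -> 3 // Core flows get the longest window.
        underPressure -> 1
        else -> 2
    }
    return DegradationPlan(
        refreshFromNetwork = refresh,
        maxRetries = retries,
        // Animations and high-resolution images are dropped under pressure or battery saver.
        richVisuals = !underPressure && !ctx.batterySaver
    )
}
```

Keeping the decision free of side effects means the whole policy table can be covered by plain unit tests.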

The UI should receive explicit state:

data class PageState(
    val product: Product?,
    val isStale: Boolean,
    val degradedReason: String?
)

Do not hide graceful degradation completely. A subtle “Last updated” label is better than silent inconsistency, especially for commerce, finance, and operations workflows.
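The label itself can come from a small formatting helper. This is a sketch only: staleLabel is a hypothetical name, the wording and time buckets are placeholders, and a production app would use localized string resources instead of hard-coded English.

```kotlin
import java.util.concurrent.TimeUnit

/** Returns a short staleness label, or null when the data is fresh. */
fun staleLabel(isStale: Boolean, updatedAtMs: Long, nowMs: Long): String? {
    if (!isStale) return null
    // Guard against a clock that moved backwards.
    val ageMs = (nowMs - updatedAtMs).coerceAtLeast(0L)
    val minutes = TimeUnit.MILLISECONDS.toMinutes(ageMs)
    val hours = TimeUnit.MILLISECONDS.toHours(ageMs)
    return when {
        minutes < 1 -> "Last updated just now"
        minutes < 60 -> "Last updated $minutes min ago"
        hours < 24 -> "Last updated $hours h ago"
        else -> "Last updated ${TimeUnit.MILLISECONDS.toDays(ageMs)} d ago"
    }
}
```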

Practical Points

First, define retry policy per API category. Login, payment, search, and recommendations have different risk profiles.
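A per-category policy can start as a simple lookup. The categories come from the sentence above, but ApiCategory, RetryPolicy, retryPolicyFor, and every number below are illustrative assumptions, not recommendations.

```kotlin
enum class ApiCategory { LOGIN, PAYMENT, SEARCH, RECOMMENDATION }

data class RetryPolicy(
    val maxRetries: Int,
    val initialDelayMs: Long,
    val retryOnTimeout: Boolean
)

/** Illustrative values only; tune them against real traffic and risk. */
fun retryPolicyFor(category: ApiCategory): RetryPolicy = when (category) {
    // A timed-out payment may still have been processed, so never replay it blindly.
    ApiCategory.PAYMENT -> RetryPolicy(maxRetries = 0, initialDelayMs = 0L, retryOnTimeout = false)
    ApiCategory.LOGIN -> RetryPolicy(maxRetries = 1, initialDelayMs = 1000L, retryOnTimeout = true)
    ApiCategory.SEARCH -> RetryPolicy(maxRetries = 2, initialDelayMs = 500L, retryOnTimeout = true)
    // Recommendations are optional content; give up quickly and use cache.
    ApiCategory.RECOMMENDATION -> RetryPolicy(maxRetries = 1, initialDelayMs = 500L, retryOnTimeout = false)
}
```

The exhaustive when means adding a new category fails compilation until someone decides its policy.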

Second, record fallback source in telemetry: network, memory, persistent cache, database, or empty state. Without this, you cannot evaluate whether resilience is working.
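A minimal in-process counter shows the kind of signal this produces. FallbackSource and FallbackMetrics are hypothetical names, and a real app would forward these counts to its analytics or monitoring SDK rather than keep them in memory.

```kotlin
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

enum class FallbackSource { NETWORK, MEMORY, PERSISTENT_CACHE, DATABASE, EMPTY_STATE }

/** Thread-safe counters for where each page load got its data. */
class FallbackMetrics {
    private val counts = ConcurrentHashMap<FallbackSource, AtomicLong>()

    fun record(source: FallbackSource) {
        counts.computeIfAbsent(source) { AtomicLong() }.incrementAndGet()
    }

    fun count(source: FallbackSource): Long = counts[source]?.get() ?: 0L

    /** Share of recorded loads served from something other than the network. */
    fun fallbackRate(): Double {
        val total = counts.values.sumOf { it.get() }
        if (total == 0L) return 0.0
        return (total - count(FallbackSource.NETWORK)).toDouble() / total
    }
}
```

A rising fallback rate with a flat empty-state count suggests the caches are absorbing an outage; a rising empty-state count suggests they are not.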

Third, test failure paths intentionally. Use mock servers, DNS failure simulation, timeouts, and process death tests. Resilience code that is never exercised tends to rot.
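For logic above the HTTP layer, a deterministic fake is often enough to exercise failure paths. The sketch below uses hypothetical names (FlakySource, retrying) and a retry loop without delays so tests stay fast. It does not replace mock-server tests of the real network stack.

```kotlin
import java.io.IOException

/** Test double that fails a fixed number of times before succeeding. */
class FlakySource<T>(private val failuresBeforeSuccess: Int, private val value: T) {
    var calls = 0
        private set

    fun fetch(): T {
        calls++
        if (calls <= failuresBeforeSuccess) throw IOException("simulated timeout #$calls")
        return value
    }
}

/** Minimal retry loop under test: up to maxRetries + 1 attempts, no backoff. */
fun <T> retrying(maxRetries: Int, block: () -> T): T {
    var last: IOException? = null
    repeat(maxRetries + 1) {
        try {
            return block()
        } catch (e: IOException) {
            last = e // Remember the failure and try again.
        }
    }
    throw last ?: IOException("no attempts made")
}
```

Asserting on the call count as well as the result catches both too few and too many retries.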

Finally, treat stale data as a product decision. Sometimes stale content is better than a blank page. Sometimes it is dangerous. The architecture should support both decisions.

Good client-side resilience does not eliminate failures. It changes how failures appear: from blank pages and user confusion into controlled fallback, explainable states, and measurable recovery.