Android Cold Start Optimization: From Zygote Fork to First Frame

One project had an app that felt fast in local testing, but production P90 cold start was stuck at 3.2 seconds. Instrumentation said the problem happened after Application.onCreate, but the code did not reveal an obvious bottleneck. A complete Perfetto trace finally showed the real cost hidden in a Binder call stack: an unrelated third-party SDK was doing synchronous IPC on the main thread.

That kind of “invisible by intuition, obvious in a trace” case is common in startup optimization. This article does not repeat generic advice such as “lazy-load and initialize asynchronously.” Instead, it starts from trace signals and walks through each cold-start phase.


The four cold-start phases and their trace regions

Cold start begins when the user taps the Launcher icon and ends when the first app frame is drawn. The chain looks like this:

tap event -> Zygote forks the process -> Application initialization ->
Activity creation / layout / drawing -> SurfaceFlinger composes the first frame

In Perfetto, these phases map to different trace markers:

  • Zygote fork: the ZygoteForkChild slice in the zygote64 system process
  • Application initialization: bindApplication -> ActivityThread.handleBindApplication
  • Activity creation: activityStart -> performCreate -> performResume
  • First-frame composition: Choreographer#doFrame plus SurfaceFlinger’s commit slice

When capturing with adb shell perfetto, enable sched, binder_driver, gfx, view, and am. Without them, Binder calls, rendering-pipeline slices, and the ActivityManager launch slices may be missing.

adb shell perfetto \
  -c - --txt \
  -o /data/misc/perfetto-traces/trace.pftrace \
  <<EOF
buffers: { size_kb: 63488 fill_policy: RING_BUFFER }
data_sources: {
  config {
    name: "linux.ftrace"
    ftrace_config {
      ftrace_events: "sched/sched_switch"
      ftrace_events: "power/suspend_resume"
      atrace_categories: "gfx"
      atrace_categories: "view"
      atrace_categories: "binder_driver"
      atrace_categories: "am"
    }
  }
}
duration_ms: 10000
EOF

Load the trace into ui.perfetto.dev, search for the process name, and start with the main-thread slices.


Phase 1: Zygote fork, where app code has limited control

Many startup articles skip this phase. It is worth calling out for the opposite reason: most of this cost is not controlled by business code, so do not spend too much optimization effort here.

Zygote preloads the ART runtime and system classes during system boot. fork() itself uses Copy-on-Write and should be fast in theory. In real traces, ZygoteForkChild often takes 10 to 30 ms, usually because of two factors:

  • GC under memory pressure: when system memory is tight, an allocation-triggered GC (GC_FOR_ALLOC in Dalvik-era logs) can happen around fork time, and the sched tracks show heavy CPU preemption.
  • Binder thread initialization: thread creation in ProcessState::startThreadPool() can be delayed on some devices.

If there is a clear gap between fork and bindApplication, inspect CPU usage in system processes during the same time window. It is often caused by overall device load. On the app side, the only realistic levers are indirect: reduce resident processes and lower memory footprint.


Phase 2: bindApplication, the main battleground

The window from bindApplication to activityStart is where app-side optimization has the most room. In the main-thread track, this region is covered by the ActivityThread.handleBindApplication slice.

ContentProvider is the first trap

Many SDKs use ContentProvider.onCreate() for automatic initialization. Firebase and LeakCanary have both used this pattern. ContentProviders initialize after Application.attachBaseContext and before Application.onCreate, and they all run serially on the main thread.

In a trace, this appears as many child slices inside installContentProviders, each representing some SDK’s initialization logic. One real issue I hit: a map SDK provider read a local config file in onCreate, costing 200 ms on low-end devices.

The investigation is direct: expand installContentProviders in Perfetto, list every provider taking more than 10 ms, and decide whether it is needed. Remove what can be removed, and push the rest toward asynchronous initialization where possible.
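
The triage step above can be sketched in a few lines once the child slices of installContentProviders have been exported from the trace. This is a minimal sketch, not a Perfetto API: ProviderSlice, slowProviders, and the provider names in the usage note are hypothetical, and the 10 ms default simply mirrors the threshold mentioned above.

```kotlin
// Hypothetical record for one child slice found under installContentProviders.
data class ProviderSlice(val providerName: String, val durationMs: Double)

// Returns the providers whose slice exceeds thresholdMs, slowest first,
// so the most expensive candidates are reviewed before the cheap ones.
fun slowProviders(
    slices: List<ProviderSlice>,
    thresholdMs: Double = 10.0
): List<ProviderSlice> =
    slices.filter { it.durationMs > thresholdMs }
        .sortedByDescending { it.durationMs }
```

Given three slices for com.example.MapInitProvider (200 ms), com.example.TinyProvider (3.5 ms), and com.example.AnalyticsProvider (42 ms), the function keeps the map and analytics providers and drops the cheap one.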

Layered initialization in Application.onCreate

The usual recommendation is to “move nonessential SDKs to background threads,” but there is a hidden problem: if a background-initialized SDK is first used on the main thread before it finishes, it may block the main thread on CountDownLatch.await(). That only moves the cost from Application to Activity.
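
A small JVM-only sketch makes the hidden wait measurable. BackgroundInit is a hypothetical wrapper, not part of any SDK: it runs the init block on a worker thread and reports how long the first caller was blocked on the latch.

```kotlin
import java.util.concurrent.CountDownLatch
import kotlin.concurrent.thread

// Hypothetical wrapper around an SDK that is initialized off the main thread.
class BackgroundInit(private val initBlock: () -> Unit) {
    private val ready = CountDownLatch(1)

    // Starts initialization on a worker thread.
    fun start() {
        thread(name = "sdk-init") {
            try {
                initBlock()
            } finally {
                // Release waiters even if init throws, so callers never hang.
                ready.countDown()
            }
        }
    }

    // Blocks the caller until init has finished; returns the wait in milliseconds.
    fun awaitReady(): Long {
        val startNs = System.nanoTime()
        ready.await()
        return (System.nanoTime() - startNs) / 1_000_000
    }
}
```

If an Activity calls awaitReady() right after start(), the returned value is roughly the remaining init time, which is exactly the cost that moved out of Application. Logging that number in debug builds is one way to catch the pattern early.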

A more stable approach is to split initialization by startup phase:

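import android.app.Application
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.launch

// Process-wide scope (an assumed definition; adapt to your own scope).
// SupervisorJob keeps a failure in one child from cancelling its siblings.
val AppScope = CoroutineScope(SupervisorJob() + Dispatchers.Default)

class App : Application() {

    override fun onCreate() {
        super.onCreate()
        // Layer 1: must finish on the main thread because it affects the first frame.
        initCrashReporter()    // Crash capture must be registered early.
        initRouterSync()       // Router table loaded synchronously.

        // Layer 2: one coroutine per SDK so the inits really run in parallel;
        // not required before first frame.
        AppScope.launch(Dispatchers.IO) { initAnalyticsSDK() }
        AppScope.launch(Dispatchers.IO) { initPushSDK() }

        // Layer 3: delay until the main looper is idle (Looper.getQueue() needs API 23+).
        // This still runs on the main thread, so keep the work short.
        mainLooper.queue.addIdleHandler {
            initLocationSDK()
            false // Returning false removes the handler after this single run.
        }
    }
}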

The decision rule is simple: does the first frame depend on this SDK’s return value? If not, it belongs in layer 2. If it is needed only after first-frame interaction, layer 3 through IdleHandler is usually better.
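
Written out as code, the rule might look like the sketch below. InitLayer, SdkProfile, and layerFor are hypothetical names used only for illustration; the two boolean flags are an assumed way of recording the answers to the questions above.

```kotlin
enum class InitLayer { MAIN_THREAD_SYNC, BACKGROUND, IDLE }

// Hypothetical description of how the startup path depends on one SDK.
data class SdkProfile(
    val name: String,
    val firstFrameNeedsResult: Boolean,
    val neededOnlyAfterFirstInteraction: Boolean
)

// Applies the rule in order: first-frame dependency wins, then idle deferral,
// and everything else goes to the background layer.
fun layerFor(sdk: SdkProfile): InitLayer = when {
    sdk.firstFrameNeedsResult -> InitLayer.MAIN_THREAD_SYNC
    sdk.neededOnlyAfterFirstInteraction -> InitLayer.IDLE
    else -> InitLayer.BACKGROUND
}
```

Keeping the classification in one place makes it easier to review when a new SDK is added, since the author has to state which question it answers.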

When validating in a trace, compare wall time and CPU time for Application.onCreate. A large gap means the main thread is waiting on Binder or I/O. A small gap means pure CPU work. Those two cases require different fixes.
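
The comparison can be automated once wall time and CPU time for a slice are available. The sketch below is an assumption-laden illustration: classifySlice and MainThreadCost are hypothetical names, and the 0.3 wait-ratio cutoff is an arbitrary starting point rather than a platform constant.

```kotlin
enum class MainThreadCost { WAITING, CPU_BOUND }

// Classifies a slice by the share of wall time the thread was not running.
// waitRatioThreshold is an assumed cutoff; tune it against your own traces.
fun classifySlice(
    wallMs: Double,
    cpuMs: Double,
    waitRatioThreshold: Double = 0.3
): MainThreadCost {
    require(wallMs > 0.0 && cpuMs >= 0.0) { "wallMs must be positive and cpuMs non-negative" }
    // Fraction of the slice spent off-CPU (Binder, I/O, locks, preemption).
    val waitRatio = ((wallMs - cpuMs) / wallMs).coerceIn(0.0, 1.0)
    return if (waitRatio > waitRatioThreshold) MainThreadCost.WAITING else MainThreadCost.CPU_BOUND
}
```

For example, a 400 ms slice with 120 ms of CPU time is classified as WAITING, while 400 ms with 380 ms of CPU time is CPU_BOUND. The first points toward Binder or I/O on the main thread, the second toward reducing or moving computation.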


Phase 3: Activity creation to first measure/layout

The window from performCreate to the first Choreographer#doFrame usually bottlenecks in three places.

Layout hierarchy is too deep

Inflation time grows roughly with the number of Views in the tree, and deep nesting adds measure/layout cost on top. In traces, LayoutInflater.inflate duration is a direct signal of layout complexity. Any inflate over 50 ms deserves attention. Common fixes:

  • ViewStub: delay inflation for Views not visible on the first screen
  • AsyncLayoutInflater: inflate on a background thread, then switch to the main thread to addView

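AsyncLayoutInflater has limitations: Views constructed during inflation must not create a Handler or call Looper.myLooper(), because the inflation thread has no Looper, and it does not support a custom LayoutInflater.Factory/Factory2 or layouts that contain fragments. Layouts that cannot be inflated this way fall back to main-thread inflation, which defeats the purpose. In practice, I usually prefer ViewStub plus explicit control because it is more predictable.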

Synchronous SharedPreferences reads

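Reading SharedPreferences in Activity.onCreate is common, but the first getSharedPreferences() for a file starts loading it from disk on a background thread (SharedPreferencesImpl.startLoadFromDisk), and the first read, such as getString(), blocks until that load finishes. On the main thread, that is effectively a synchronous wait on disk I/O. In a trace, the main thread shows up as blocked while the SharedPreferencesImpl-load thread runs.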

Alternatives: move to Jetpack DataStore’s Flow API, or warm up SharedPreferences during the asynchronous phase of Application.onCreate so Activity reads hit memory.

Binder call backlog

This is the easiest startup cost to miss. Activity launch already involves many Binder calls for window tokens, window registration, permission checks, and other framework work. Those calls are unavoidable. If business code adds extra Binder calls in onCreate, the cost can become substantial.

In Perfetto, use the Binder slices enabled by the binder_driver category to inspect each Binder transaction’s duration and caller. A typical case I saw: an SDK called PackageManager.getInstalledPackages() in onCreate. That API enumerates every installed package visible to the caller and returns the whole list over Binder (on Android 11 and later the result is also subject to package-visibility filtering), and the Binder return time reached 80 ms.


Phase 4: first-frame composition, from VSYNC to pixels

The final mile of the rendering pipeline runs from Choreographer#doFrame to SurfaceFlinger commit.

SurfaceFlinger has its own process track in Perfetto. Find the app’s Layer name, usually something like SurfaceView[package] or com.xxx.MainActivity#0, and inspect which vsync cycle first completes latchBuffer.

If the window from doFrame to latchBuffer spans noticeably more vsync cycles than a steady-state frame on the same device, the first frame is being delayed. Common causes:

  • Expensive work in onDraw, such as Bitmap decoding or many Path calculations
  • Repeated measure/layout, often from nested requestLayout
  • GPU composition timeout, such as a hardware layer missing GPU cache
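
To turn the doFrame-to-latchBuffer window into a number, a rough sketch is to divide its length by the vsync period. This is an approximation under stated assumptions: vsyncCyclesSpanned is a hypothetical helper, the timestamps are assumed to come from the trace in nanoseconds, and a fixed refresh rate stands in for the real VSYNC timestamps, which a stricter analysis would read from the trace.

```kotlin
import kotlin.math.ceil

// Number of whole vsync periods needed to cover the interval from the start of
// doFrame to latchBuffer, assuming a constant refresh rate.
fun vsyncCyclesSpanned(
    doFrameStartNs: Long,
    latchBufferNs: Long,
    refreshRateHz: Double = 60.0
): Int {
    require(latchBufferNs >= doFrameStartNs) { "latchBuffer must not precede doFrame" }
    require(refreshRateHz > 0.0) { "refresh rate must be positive" }
    val periodNs = 1_000_000_000.0 / refreshRateHz
    return ceil((latchBufferNs - doFrameStartNs) / periodNs).toInt()
}
```

At 60 Hz a 12 ms window fits in one period and a 40 ms window needs three. The same 12 ms window needs two periods at 120 Hz, which is why the refresh rate has to be part of the comparison.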

First-frame Bitmap loading is a frequent issue. If the first screen contains images, predecode and cache them on a background thread from Application, then let Activity read from memory cache. That skips disk I/O and decode time on the first frame.


Closing the optimization loop

Startup optimization is not a one-off project. A sustainable workflow records cold-start traces in CI before each release on fixed low-end, mid-range, and high-end devices. A script extracts key slice durations into monitoring.

I usually gate on two regression metrics:

  1. bindApplication duration, which reflects SDK initialization quality
  2. First-frame doFrame wall time, which reflects layout and rendering quality

Looking only at “total cold-start time” hides phase-level regressions. A release can make Application faster and Activity slower, with the total hiding the responsible phase. Segment-level metrics point directly to the owner.
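
A segment-level gate can be sketched as a comparison of per-phase durations against a baseline. The names here are hypothetical (Regression, findRegressions, and the phase keys in the usage note), and the 10% tolerance is an assumed default, not a recommendation from any tool.

```kotlin
// One phase whose current duration exceeds the allowed baseline budget.
data class Regression(val phase: String, val baselineMs: Double, val currentMs: Double)

// Flags every phase present in both maps whose current duration is more than
// `tolerance` (as a fraction) above its baseline. Durations are in milliseconds.
fun findRegressions(
    baseline: Map<String, Double>,
    current: Map<String, Double>,
    tolerance: Double = 0.10
): List<Regression> =
    baseline.mapNotNull { (phase, baseMs) ->
        // Phases missing from the current run are skipped rather than guessed.
        val curMs = current[phase] ?: return@mapNotNull null
        if (curMs > baseMs * (1 + tolerance)) Regression(phase, baseMs, curMs) else null
    }
```

With a baseline of bindApplication = 400 ms and firstDoFrame = 120 ms, and a current run of 340 ms and 180 ms, the totals are both 520 ms, yet findRegressions reports firstDoFrame. That is the case a total-only metric would miss.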

For tooling, Perfetto’s command-line traceconv can convert a trace to JSON, and scripts can parse the slice tree automatically. That is far more efficient than manual UI inspection. If you still rely mainly on Android Studio Profiler for startup work, moving to Perfetto UI gives you much denser signal.
