Inside Android TTS: From TextToSpeech API to On-Device Vocoders

While building an in-car voice assistant, we hit a problem: TTS time to first audio (TTFA) was 300 ms slower than a competing product. After tracing the call chain, we found that the bottleneck was not the network or model inference. The engine service binding path alone consumed more than 100 ms, and every synthesis request repeated setup work.

The Android TTS framework exposes one unified API to apps. The client library handles engine selection and binding, while synthesis itself runs in the engine’s Service, usually in a separate process. Without understanding this bridge, it is hard to reduce latency.

The real cost of TextToSpeech initialization

Calling new TextToSpeech(context, listener) does three things:

  1. Uses TtsEngines to query the list of registered engines on the system.
  2. Chooses a target engine based on the user’s default setting or locale matching rules.
  3. Uses bindService to bind across processes to the engine’s TextToSpeechService.

The core code path is initTts() in TextToSpeech.java. The listing below is a condensed illustration rather than the AOSP source: the real method takes no arguments and tries the requested engine, the user’s default engine, and the highest-ranked installed engine in turn.

// Simplified TextToSpeech initialization path
private int initTts(Context ctx, OnInitListener listener) {
    // 1. Get engine information
    mEnginesHelper = new TtsEngines(ctx);
    String engine = mEnginesHelper.getDefaultEngine();

    // 2. Build the Service Intent
    Intent intent = new Intent(TextToSpeech.Engine.INTENT_ACTION_TTS_SERVICE);
    intent.setPackage(engine);

    // 3. Cross-process binding, the real source of latency
    boolean bound = ctx.bindService(intent, mServiceConnection,
        Context.BIND_AUTO_CREATE);

    // 4. bindService() only starts the bind. The connection completes later in
    //    onServiceConnected(), after which onInit(status) is dispatched.
    return bound ? SUCCESS : ERROR;
}

Two details matter. First, the instance is usable only after onInit(SUCCESS): calling setLanguage or speak immediately after construction can fail because the service connection is not yet established. Second, every new TextToSpeech instance triggers its own bindService, which can cost 80-120 ms on low-end devices.
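One way to avoid the speak-before-onInit failure is to hold requests until initialization completes. The following is a minimal sketch in plain Java, not framework code: InitGate is a hypothetical helper, and the Consumer stands in for a call such as tts.speak(...).

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

/** Buffers utterances until the engine reports that initialization succeeded. */
final class InitGate {
    private final Queue<String> pending = new ArrayDeque<>();
    private final Consumer<String> speaker; // stands in for tts.speak(...)
    private boolean ready = false;

    InitGate(Consumer<String> speaker) {
        this.speaker = speaker;
    }

    /** Call this from OnInitListener.onInit(status) when status is SUCCESS. */
    synchronized void onInitSuccess() {
        ready = true;
        String text;
        while ((text = pending.poll()) != null) {
            speaker.accept(text); // flush in arrival order
        }
    }

    /** Speaks immediately if the engine is ready, otherwise queues the text. */
    synchronized void speak(String text) {
        if (ready) {
            speaker.accept(text);
        } else {
            pending.add(text);
        }
    }

    synchronized int pendingCount() {
        return pending.size();
    }
}
```

Requests made before the callback are kept in order and replayed once the engine is ready, so callers do not need to know whether the bind has finished.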

Engine plugins: an ecosystem built on one intent-filter

How does the framework discover third-party engines? Through a Service declaration in AndroidManifest.xml. The built-in Pico TTS and third-party engines such as Google TTS and iFlytek use the same registration model:

<!-- Service declaration for a third-party TTS engine -->
<service
    android:name=".MyTtsService"
    android:exported="true"
    android:label="@string/engine_name">
    <intent-filter>
        <action android:name="android.intent.action.TTS_SERVICE" />
    </intent-filter>
    <meta-data
        android:name="android.speech.tts"
        android:resource="@xml/tts_engine_info" />
</service>

TtsEngines.getEngines() finds every Service declaring this action through PackageManager.queryIntentServices. The tts_engine_info.xml resource itself is small: in the platform it names the engine’s settings Activity, not voice properties. Supported languages are reported at runtime through the service callbacks onIsLanguageAvailable, onLoadLanguage, and onGetLanguage, and the output sample rate is declared per utterance when the engine starts writing audio. Engine priority is not based on user ratings. It comes from the “preferred engine” setting in system settings, stored in Settings.Secure.TTS_DEFAULT_SYNTH.

<!-- res/xml/tts_engine_info.xml (the Activity name is illustrative) -->
<tts-engine xmlns:android="http://schemas.android.com/apk/res/android"
    android:settingsActivity="com.example.tts.EngineSettingsActivity" />
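The selection rule can be modeled in a few lines. This is a simplified sketch under my own assumptions, not the framework’s code: EngineChooser is a hypothetical name, and the installed list stands in for the discovered engines, already sorted by whatever ranking the system applies.

```java
import java.util.List;

/** Simplified model of how a client might pick the engine package to bind to. */
final class EngineChooser {
    /**
     * Returns the package name to bind to, or null if no engine is installed.
     * Order: explicitly requested engine, user default, first installed engine.
     */
    static String choose(String requested, String userDefault, List<String> installed) {
        if (requested != null && installed.contains(requested)) {
            return requested;
        }
        if (userDefault != null && installed.contains(userDefault)) {
            return userDefault;
        }
        return installed.isEmpty() ? null : installed.get(0);
    }
}
```

The point of the fallback chain is that a stale default, for example an engine the user has uninstalled, does not leave the app without speech.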

The key value of this design is that the framework does not care whether the engine uses rule-based synthesis or neural inference. As long as the engine implements the onSynthesizeText callback and writes standard PCM audio, it can plug into the system.

Synthesis callback chain: async text-to-PCM

The call path for speak(text, queueMode, params, utteranceId) is more complex than it appears. The client does not call the engine’s onSynthesizeText directly; the request crosses Binder and then waits in a queue inside the service:

speak() -> runAction()                                  [app process]
        -> ITextToSpeechService.speak()                 [Binder call]
        -> synthesis queue (SynthHandler)               [engine process]
        -> onSynthesizeText(request, callback)
        -> engine internals: text processing -> audio generation
        -> callback.start() / audioAvailable() / done()
        -> AudioTrack playback                          [engine process]

The service processes utterances sequentially on a single synthesis thread. QUEUE_ADD appends to the queue; QUEUE_FLUSH first drops the caller’s pending entries, including the one in progress. Neither mode changes how later items are executed. Playback also happens in the engine process: the TextToSpeechService base class owns the AudioTrack, so PCM data does not travel back to the app.

In my measurements, the interval from entering the engine’s onSynthesizeText to its first audioAvailable() call dominates TTFA. On high-end devices, Google TTS takes roughly 50-80 ms here. Add queue waiting and cross-process overhead, and TTFA can easily exceed 200 ms.
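The two queue modes can be illustrated with a small model. This sketch is plain Java under simplifying assumptions: UtteranceQueue is a hypothetical class, it tracks only utterance IDs, and it ignores interruption of an utterance that is already playing. The constant values match TextToSpeech.QUEUE_FLUSH and TextToSpeech.QUEUE_ADD.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Models a sequential utterance queue with flush/add semantics. */
final class UtteranceQueue {
    static final int QUEUE_FLUSH = 0; // same value as TextToSpeech.QUEUE_FLUSH
    static final int QUEUE_ADD = 1;   // same value as TextToSpeech.QUEUE_ADD

    private final Deque<String> items = new ArrayDeque<>();

    /** QUEUE_FLUSH clears pending items first; QUEUE_ADD only appends. */
    void enqueue(String utteranceId, int queueMode) {
        if (queueMode == QUEUE_FLUSH) {
            items.clear();
        }
        items.addLast(utteranceId);
    }

    /** Returns the next utterance to synthesize, or null if the queue is empty. */
    String next() {
        return items.pollFirst();
    }

    List<String> snapshot() {
        return new ArrayList<>(items);
    }
}
```

For latency work the practical consequence is that QUEUE_ADD makes a new utterance wait behind everything already queued, while QUEUE_FLUSH trades pending speech for a shorter wait.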

On-device neural vocoders: quality close to natural speech

Traditional parametric synthesis, such as formant synthesis, and concatenative synthesis both tend to sound mechanical. On-device neural TTS usually has two stages:

  • Acoustic model: text to Mel spectrogram. Common architectures include Tacotron 2 and FastSpeech, which typically infer a sequence of 80-dimensional Mel frames.
  • Vocoder: Mel spectrogram to audio waveform. Neural vocoders such as the autoregressive WaveRNN and the GAN-based HiFi-GAN offer a good tradeoff between quality and inference speed.

The core on-device deployment tradeoff is model size versus inference latency. I once deployed a HiFi-GAN vocoder with TensorFlow Lite. The original model was 14 MB; INT8 quantization reduced it to 3.5 MB, and single-frame inference dropped from 12 ms to 4 ms:

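// Simplified TFLite vocoder inference.
// Assumed model I/O: float input [1, frames, 80], float output [1, frames * 256]
// with samples in [-1, 1]. Check the shapes of your own export.
import org.tensorflow.lite.Interpreter
import java.io.File

private const val HOP_LENGTH = 256

class HifiGanVocoder(modelPath: String) {
    private val interpreter = Interpreter(File(modelPath))

    fun melToAudio(mel: Array<FloatArray>): ShortArray {
        val sampleCount = mel.size * HOP_LENGTH // mel.size is the frame count
        val output = Array(1) { FloatArray(sampleCount) }
        interpreter.run(arrayOf(mel), output)
        // Clip and scale float samples to 16-bit PCM
        return ShortArray(sampleCount) { i ->
            (output[0][i].coerceIn(-1f, 1f) * Short.MAX_VALUE).toInt().toShort()
        }
    }
}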

Quantization slightly reduced audio quality, with MOS dropping from 4.2 to 4.0. In exchange, inference was stable at 4 ms per frame on Snapdragon 865. With a hop length of 256, each frame covers about 10.7 ms of audio at a 24 kHz sample rate, so 4 ms per frame corresponds to a real-time factor (RTF) of roughly 0.37, comfortably below the real-time threshold of 1.0.
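RTF is the per-frame inference time divided by the audio duration that one frame produces. The arithmetic can be checked with a short sketch. RealTimeFactor is a hypothetical helper, and the 24 kHz sample rate in the usage note is my assumption; substitute the rate your vocoder was trained at.

```java
/** Real-time factor arithmetic for a frame-based vocoder. */
final class RealTimeFactor {
    /** Audio duration in milliseconds produced by one Mel frame. */
    static double frameDurationMs(int hopLength, int sampleRateHz) {
        return 1000.0 * hopLength / sampleRateHz;
    }

    /** Inference time per frame divided by audio time per frame. */
    static double rtf(double inferenceMsPerFrame, int hopLength, int sampleRateHz) {
        return inferenceMsPerFrame / frameDurationMs(hopLength, sampleRateHz);
    }
}
```

Under the 24 kHz assumption, rtf(4.0, 256, 24000) evaluates to 0.375 and rtf(12.0, 256, 24000) to 1.125, which would put the unquantized 12 ms figure above real time.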

Real-time synthesis pipeline: chunked inference plus ring buffer

Long-text synthesis is another challenge for on-device models. Waiting for the full text to be processed before generating audio creates an obvious pause. The solution is sentence-level synthesis plus streaming audio delivery:

Input text -> sentence splitter -> [sentence 1, sentence 2, ..., sentence N]
                    |
            acoustic model inference -> Mel frame sequence
                    |
            vocoder inference -> PCM segment
                    |
            AudioTrack.write() -> immediate playback
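The first stage of this pipeline, the sentence splitter, can be sketched as follows. This is a deliberately naive version under my own assumptions: SentenceSplitter is a hypothetical class, it splits on Latin and CJK full stops, exclamation marks, and question marks, and it only guards against decimal points and runs of punctuation. Abbreviations such as "e.g." would still be split.

```java
import java.util.ArrayList;
import java.util.List;

/** Naive sentence splitter for feeding text to TTS one sentence at a time. */
final class SentenceSplitter {
    // '.', '!', '?' plus the CJK full stop, exclamation mark, and question mark
    private static final String TERMINATORS = ".!?\u3002\uFF01\uFF1F";

    static List<String> split(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            cur.append(c);
            char next = (i + 1 < text.length()) ? text.charAt(i + 1) : ' ';
            boolean decimalPoint = c == '.' && Character.isDigit(next);
            boolean morePunctuation = TERMINATORS.indexOf(next) >= 0;
            if (TERMINATORS.indexOf(c) >= 0 && !decimalPoint && !morePunctuation) {
                flush(cur, out);
            }
        }
        flush(cur, out); // trailing text without a terminator
        return out;
    }

    private static void flush(StringBuilder cur, List<String> out) {
        String sentence = cur.toString().trim();
        if (!sentence.isEmpty()) {
            out.add(sentence);
        }
        cur.setLength(0);
    }
}
```

Each returned sentence can then be handed to the acoustic model independently, so the first sentence starts playing while later ones are still being synthesized.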

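In implementation, two threads and a ring buffer connect the stages. A producer thread handles model inference and audio generation. A consumer thread drains the buffer into an AudioTrack using WRITE_NON_BLOCKING writes:

// Ring-buffer streaming playback (sketch; uses android.media.AudioTrack/AudioFormat).
// ShortRingBuffer is a hypothetical blocking ring buffer; SAMPLE_RATE is assumed.
class StreamingTtsPlayer(bufferSizeMs: Int = 200) {
    private val capacity = SAMPLE_RATE * bufferSizeMs / 1000 // in samples
    private val ringBuffer = ShortRingBuffer(capacity)
    private val track = AudioTrack.Builder()
        .setAudioFormat(AudioFormat.Builder().setSampleRate(SAMPLE_RATE)
            .setEncoding(AudioFormat.ENCODING_PCM_16BIT)
            .setChannelMask(AudioFormat.CHANNEL_OUT_MONO).build())
        .setBufferSizeInBytes(capacity * 2) // 2 bytes per 16-bit sample
        .setTransferMode(AudioTrack.MODE_STREAM)
        .build()

    // Producer side: called from the inference thread; blocks while the buffer is full
    fun onAudioReady(pcm: ShortArray) = ringBuffer.write(pcm)

    // Consumer side: run this loop on a dedicated playback thread
    fun playLoop() {
        val chunk = ShortArray(SAMPLE_RATE / 50) // 20 ms per read
        track.play()
        while (!Thread.currentThread().isInterrupted) {
            val n = ringBuffer.read(chunk) // blocks until samples are available
            var off = 0
            while (off < n) { // a non-blocking write may accept only part of the chunk
                val w = track.write(chunk, off, n - off, AudioTrack.WRITE_NON_BLOCKING)
                if (w < 0) return // AudioTrack error code
                if (w == 0) Thread.sleep(1) else off += w // track buffer full: back off
            }
        }
    }
}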
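The ring buffer between the two threads is the part that is easiest to get wrong, so here is a minimal sketch of one in plain Java. It is an illustration under stated assumptions rather than the code from the project: ShortRingBuffer is a hypothetical class, it assumes a single producer and a single consumer, and it uses intrinsic locks instead of a lock-free design.

```java
/** Fixed-capacity blocking ring buffer for 16-bit PCM samples. */
final class ShortRingBuffer {
    private final short[] data;
    private int readPos = 0; // index of the oldest buffered sample
    private int size = 0;    // number of samples currently buffered

    ShortRingBuffer(int capacity) {
        data = new short[capacity];
    }

    int capacity() {
        return data.length;
    }

    /** Blocks until every sample in src has been stored. */
    synchronized void write(short[] src) throws InterruptedException {
        int off = 0;
        while (off < src.length) {
            while (size == data.length) {
                wait(); // buffer full: wait for the consumer
            }
            int writePos = (readPos + size) % data.length;
            int n = Math.min(src.length - off, data.length - size);
            n = Math.min(n, data.length - writePos); // stay contiguous up to the wrap
            System.arraycopy(src, off, data, writePos, n);
            size += n;
            off += n;
            notifyAll(); // wake a consumer waiting for data
        }
    }

    /** Blocks until at least one sample is available; returns the count read. */
    synchronized int read(short[] dst) throws InterruptedException {
        while (size == 0) {
            wait(); // buffer empty: wait for the producer
        }
        int n = Math.min(dst.length, size);
        n = Math.min(n, data.length - readPos); // stay contiguous up to the wrap
        System.arraycopy(data, readPos, dst, 0, n);
        readPos = (readPos + n) % data.length;
        size -= n;
        notifyAll(); // wake a producer waiting for space
        return n;
    }
}
```

Because write blocks when the buffer is full, the inference thread is throttled to playback speed automatically, which bounds memory use on long texts.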

One pitfall: on Android 10 and below, AudioTrack.getMinBufferSize returned an overly large value on our devices, around 8 KB, which added 50-80 ms of playback startup latency. I eventually hardcoded the buffer to one tenth of a second of audio (sample rate / 10 frames, a 100 ms buffer), and it ran stably on both Qualcomm and MTK platforms.
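Buffer sizes are given in bytes while latency is discussed in milliseconds, so it helps to convert between the two explicitly. The sketch below assumes 16-bit PCM; AudioBufferMath is a hypothetical helper, and the 24 kHz mono figures in the usage note are example values rather than measurements.

```java
/** Converts between buffer sizes in bytes and durations for 16-bit PCM. */
final class AudioBufferMath {
    static final int BYTES_PER_SAMPLE = 2; // 16-bit PCM

    /** Buffer size in bytes that holds durationMs of audio. */
    static int bytesForMs(int sampleRateHz, int channels, int durationMs) {
        return sampleRateHz * channels * BYTES_PER_SAMPLE * durationMs / 1000;
    }

    /** Duration in milliseconds represented by a buffer of the given size. */
    static double msForBytes(int sampleRateHz, int channels, int bytes) {
        return 1000.0 * bytes / (sampleRateHz * channels * BYTES_PER_SAMPLE);
    }
}
```

For example, bytesForMs(24000, 1, 100) gives 4800 bytes for a 100 ms mono buffer at 24 kHz. The same helper sizes a silence prefill: bytesForMs(24000, 1, 20) is 960 bytes.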

Practical tradeoffs

Engine initialization reuse: Do not create a new TextToSpeech instance in every Activity. Keep a single application-level instance and reuse it. This can save more than 80 ms of first-audio latency.

Silent-frame prefill: Before the first real audio segment arrives, write 20-30 ms of silence into AudioTrack. This is not about filling audio; it gives the underlying mixer thread time to warm up. In testing, it reduced playback jitter by 15-20 ms.

Model warm loading: If TTS may be needed immediately after app launch, start loading the model in Application.onCreate. The first construction of a TFLite Interpreter allocates native memory and prepares the operator graph, which can take 200-500 ms. Loading only when the user triggers TTS is too late.
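A simple way to warm-load without blocking startup is to kick off the load on a background thread and join it on first use. The sketch below is plain Java under my own assumptions: WarmLoader is a hypothetical helper, the Supplier would wrap whatever builds the interpreter, and on Android CompletableFuture requires API level 24 or higher.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Starts an expensive load early and hands out the result when it is needed. */
final class WarmLoader<T> {
    private final CompletableFuture<T> future;

    WarmLoader(Supplier<T> loader) {
        // Runs the loader once on the common pool; pass an Executor to
        // supplyAsync(loader, executor) if a dedicated thread is preferred.
        this.future = CompletableFuture.supplyAsync(loader);
    }

    /** Returns the loaded value, blocking only if loading has not finished. */
    T get() {
        return future.join();
    }

    boolean isReady() {
        return future.isDone();
    }
}
```

Created in Application.onCreate, the loader has usually finished by the time the user first triggers speech, so get() returns without waiting.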

Engine choice: My default is to use the system engine first, usually Google TTS, with the in-house on-device model as a fallback. Google TTS naturalness is backed by large-scale training data. It is hard for an in-house model to beat it on short text, while the maintenance cost is orders of magnitude higher.