
2025-09-15

Progressive Audio Streaming in the Browser

Point an <audio> tag at a URL that’s still being generated — a TTS response, a live transcode, anything streamed chunk by chunk — and you get one of two bad outcomes. It stalls, waiting for enough of the file to look “seekable,” or your own code buffers the entire response into a blob before handing it over. Either way the user watches a spinner for as long as the whole clip takes to arrive, even though the first few hundred milliseconds of audio were decodable ages ago.

The <audio> element is built around an assumption: that it knows (or can Range-probe) the whole file up front. A stream with no known length breaks that assumption, and the element degrades to “download it all first.”

The fix isn’t a bigger buffer — it’s a different contract with the browser: MediaSource Extensions (MSE). Instead of giving <audio> a URL and hoping, you give it a MediaSource object as its src, then push chunks into a SourceBuffer yourself as they arrive over fetch. The browser decodes and plays as soon as there’s enough buffered to render — it never needs the total length. This isn’t exotic; per MDN, MSE is the same primitive that “makes it possible to play media on the Web without the use of plugins” — the machinery every adaptive video player has run on for a decade. It’s just rarely pointed at the much smaller surface of audio, where most projects still reach for a plain <audio> and eat the latency.
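One detail that description glosses over: a SourceBuffer refuses a new appendBuffer() call while its previous append is still in flight (its updating flag is true), so chunks arriving from fetch have to be queued and fed in one at a time on updateend. Here is a minimal sketch of that queue. The names AppendTarget and createAppendQueue are invented for illustration and are not part of any library; the interface is just the slice of SourceBuffer the queue relies on.

```typescript
// Structural subset of SourceBuffer that the queue needs.
interface AppendTarget {
  readonly updating: boolean;
  appendBuffer(data: Uint8Array): void;
  addEventListener(type: "updateend", listener: () => void): void;
}

// Serializes appends: at most one appendBuffer() call is in flight at a time.
function createAppendQueue(target: AppendTarget) {
  const pending: Uint8Array[] = [];

  const flush = (): void => {
    if (target.updating) return; // previous append still running
    const next = pending.shift();
    if (next !== undefined) target.appendBuffer(next);
  };

  // Each finished append frees the target for the next queued chunk.
  target.addEventListener("updateend", flush);

  return {
    push(chunk: Uint8Array): void {
      pending.push(chunk);
      flush();
    },
    pendingCount(): number {
      return pending.length;
    },
  };
}

// Browser wiring, shown as comments because it needs a DOM (assumes an
// <audio> element `audio`, a stream URL `url`, and an MP3 stream):
//   const mediaSource = new MediaSource();
//   audio.src = URL.createObjectURL(mediaSource);
//   mediaSource.addEventListener("sourceopen", async () => {
//     const queue = createAppendQueue(mediaSource.addSourceBuffer("audio/mpeg"));
//     const reader = (await fetch(url)).body!.getReader();
//     for (;;) {
//       const { done, value } = await reader.read();
//       if (done) break;
//       queue.push(value);
//     }
//   });
```

A real player would also call endOfStream() on the MediaSource once the last queued chunk has been appended; that step is left out here to keep the sketch short.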

That’s the mechanism behind audio.libx.js. The pieces that matter:

  • Chunked fetch: streamFromUrl() reads the response body as a stream; streamFromResponse() takes a Response you already fetched (e.g. behind auth headers).
  • MediaSource on desktop, ManagedMediaSource on iOS: the library sniffs which one exists and picks it, so you don’t special-case iOS.
  • Format detection: it inspects the byte stream to decide MP3 / WAV / WebM / OGG, because MediaSource.addSourceBuffer() throws on an unsupported MIME string, and a MIME type that doesn’t match the bytes fails at append time, rather than degrading.
  • IndexedDB caching: chunks are persisted as they land, keyed by an audio ID. Replay the same clip, even after a reload, and playFromCache() serves it with zero network. Naive streaming setups skip this and re-fetch on every play, because the whole point was to avoid buffering the full response first.
  • bufferThreshold: require N seconds buffered before playback starts, trading a little startup latency for fewer stutters on a slow link.
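To make the format-detection bullet concrete, here is a minimal magic-byte sniffer. It is a sketch under assumptions, not the library’s detector: sniffAudioFormat, SniffedFormat and hasSignature are names made up for this post, and it only checks the well-known leading signatures of each container.

```typescript
type SniffedFormat = "mp3" | "wav" | "webm" | "ogg" | "unknown";

// True when `bytes` contains `signature` starting at `offset`.
function hasSignature(bytes: Uint8Array, signature: readonly number[], offset = 0): boolean {
  if (bytes.length < offset + signature.length) return false;
  return signature.every((expected, i) => bytes[offset + i] === expected);
}

function sniffAudioFormat(bytes: Uint8Array): SniffedFormat {
  // "RIFF" at 0 and "WAVE" at 8: a WAV file.
  if (hasSignature(bytes, [0x52, 0x49, 0x46, 0x46]) && hasSignature(bytes, [0x57, 0x41, 0x56, 0x45], 8)) {
    return "wav";
  }
  // "OggS": an Ogg page header.
  if (hasSignature(bytes, [0x4f, 0x67, 0x67, 0x53])) return "ogg";
  // EBML header, shared by WebM and Matroska.
  if (hasSignature(bytes, [0x1a, 0x45, 0xdf, 0xa3])) return "webm";
  // "ID3": an ID3v2 tag, which usually precedes MP3 frames.
  if (hasSignature(bytes, [0x49, 0x44, 0x33])) return "mp3";
  // MPEG audio frame sync: 11 set bits. This check is deliberately loose and
  // also matches other MPEG audio layers and ADTS AAC.
  const first = bytes[0] ?? 0;
  const second = bytes[1] ?? 0;
  if (bytes.length >= 2 && first === 0xff && (second & 0xe0) === 0xe0) return "mp3";
  return "unknown";
}
```

The sniffer only needs the first dozen bytes, so it can run on the first chunk of the response before any SourceBuffer is created.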

Minimal usage:

import { createAudioStreamer } from 'audio.libx.js';

const audioElement = document.getElementById('audio') as HTMLAudioElement;

const streamer = createAudioStreamer(audioElement, {
  bufferThreshold: 5,      // start playing after 5s buffered
  enableCaching: true,     // persist to IndexedDB
  enableTrimming: true,    // strip leading/trailing silence
});

const result = await streamer.streamFromUrl('https://example.com/audio.mp3');
await result.onLoaded;  // playable
await result.onEnded;   // done
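For intuition about what a bufferThreshold check involves, here is one way it could be computed from the element’s buffered ranges. This is a sketch, not the library’s implementation: RangesLike mirrors the shape of the DOM TimeRanges object, and secondsBufferedAhead / canStartPlayback are hypothetical names.

```typescript
// Same shape as the DOM TimeRanges object returned by audio.buffered.
interface RangesLike {
  readonly length: number;
  start(index: number): number;
  end(index: number): number;
}

// Seconds of contiguous audio buffered from `currentTime` onward.
function secondsBufferedAhead(buffered: RangesLike, currentTime: number): number {
  for (let i = 0; i < buffered.length; i++) {
    if (buffered.start(i) <= currentTime && currentTime <= buffered.end(i)) {
      return buffered.end(i) - currentTime;
    }
  }
  return 0; // playhead is not inside any buffered range
}

// Start once the threshold is met, or once the stream has ended, so a clip
// shorter than the threshold still plays.
function canStartPlayback(
  buffered: RangesLike,
  currentTime: number,
  thresholdSeconds: number,
  streamEnded: boolean,
): boolean {
  return streamEnded || secondsBufferedAhead(buffered, currentTime) >= thresholdSeconds;
}
```

The streamEnded escape hatch is the part that is easy to forget: without it, a three-second clip never reaches a five-second threshold and never starts.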

The recording side is the mirror image, and it exists for a reason beyond symmetry. If you’re feeding a speech-to-text service, you don’t want to wait for MediaRecorder to finish and hand you one big blob. createAudioRecorder({ enableRealtimeChunks: true, chunkInterval: 500, chunkFormat: 'wav', chunkSampleRate: 16000 }) emits chunks every 500ms in a format most STT APIs actually accept, so you can pipe them over a WebSocket while the user is still talking.
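To show what “a format most STT APIs actually accept” means in bytes, here is a sketch that wraps raw samples in a 16-bit PCM WAV container. It is an illustration, not the recorder’s internals: encodeWavChunk is a hypothetical helper, it assumes mono Float32 samples in the range [-1, 1] that are already at the target sample rate, and whether a given service wants a header on every chunk or only on the first is something to check in that service’s docs.

```typescript
// Wraps mono Float32 samples in a 44-byte RIFF/WAVE header as 16-bit PCM.
function encodeWavChunk(samples: Float32Array, sampleRate: number): Uint8Array {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // size of the fmt chunk
  view.setUint16(20, 1, true); // audio format 1 = uncompressed PCM
  view.setUint16(22, 1, true); // channel count: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  // Clamp to [-1, 1] and scale to the signed 16-bit range.
  samples.forEach((sample, i) => {
    const clamped = Math.max(-1, Math.min(1, sample));
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(44 + i * bytesPerSample, scaled, true);
  });

  return new Uint8Array(buffer);
}
```

At 16000 Hz mono, a 500 ms chunk is 8000 samples, so each chunk comes out at 16044 bytes including the header.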

Now the honest part, because this approach lives and dies on browser quirks.

iOS is its own animal. Safari on iPhone doesn’t expose a plain MediaSource to pages; since iOS 17.1 it offers ManagedMediaSource instead, where the user agent can evict your buffered content whenever it likes. And it won’t even start unless you ask correctly. Straight from MDN:

On Safari, ManagedMediaSource only activates when remote playback is explicitly disabled on the media element (by setting HTMLMediaElement.disableRemotePlayback to true), or when an AirPlay source alternative is provided […] Without either of these, the sourceopen event will not fire.

Miss that and your stream silently never opens. It’s also why the library ships a Media Session integration: once you push your own SourceBuffers, you’ve opted out of the free lock-screen behavior a plain <audio src="..."> gets, and you have to wire navigator.mediaSession.setActionHandler() back yourself.
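The sniff-and-pick step plus the Safari requirement quoted above fit in a few lines. This is a sketch under assumptions: pickSourceKind and prepareElement are hypothetical names, and the order (plain MediaSource first, ManagedMediaSource as the fallback) is my reading of “MediaSource on desktop, ManagedMediaSource on iOS”, not a confirmed description of the library’s logic.

```typescript
// The two globals that may or may not exist, depending on the browser.
interface MediaSourceScope {
  MediaSource?: unknown;
  ManagedMediaSource?: unknown;
}

type SourceKind = "standard" | "managed" | "none";

// Assumed order: prefer plain MediaSource, fall back to ManagedMediaSource.
function pickSourceKind(scope: MediaSourceScope): SourceKind {
  if (typeof scope.MediaSource === "function") return "standard";
  if (typeof scope.ManagedMediaSource === "function") return "managed";
  return "none";
}

// With ManagedMediaSource, remote playback must be disabled (or an AirPlay
// alternative provided), otherwise sourceopen does not fire on Safari.
function prepareElement(audio: { disableRemotePlayback: boolean }, kind: SourceKind): void {
  if (kind === "managed") audio.disableRemotePlayback = true;
}

// In a browser (not executed here):
//   const kind = pickSourceKind(globalThis as unknown as MediaSourceScope);
//   prepareElement(audioElement, kind);
```

Passing the scope in as an argument instead of reading globals directly keeps the decision testable outside a browser.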

MSE support isn’t universal codec support. MediaSource.isTypeSupported() answers per-browser, per-platform, per-OS-codec-pack. MP3 is widely supported; WAV plays almost everywhere through a plain <audio src> but is not one of the registered MSE byte stream formats, so a SourceBuffer may reject it; WebM/Opus is solid on Chrome/Firefox but historically shakier on Safari. Don’t assume: check getCapabilities().supportedMimeTypes at runtime and fall back to a plain non-progressive <audio src> if MSE genuinely isn’t there.
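The runtime check can be sketched against the standard MediaSource.isTypeSupported() rather than the library’s getCapabilities(). chooseMimeType is a hypothetical helper and the candidate list in the comment is only an example; the support check is injected so the same function works where MediaSource does not exist at all.

```typescript
type TypeSupportCheck = (mimeType: string) => boolean;

// Returns the first candidate the browser claims to support, or null when
// MSE is unavailable or nothing matches (the caller then falls back to a
// plain <audio src>).
function chooseMimeType(
  candidates: readonly string[],
  isTypeSupported: TypeSupportCheck | undefined,
): string | null {
  if (!isTypeSupported) return null;
  return candidates.find((mimeType) => isTypeSupported(mimeType)) ?? null;
}

// In a browser (not executed here):
//   const check = typeof MediaSource === "function"
//     ? (m: string) => MediaSource.isTypeSupported(m)
//     : undefined;
//   const mime = chooseMimeType(['audio/webm; codecs="opus"', "audio/mpeg"], check);
//   if (mime === null) audioElement.src = url; // non-progressive fallback
```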

Autoplay policies don’t care that your stream is clever. First playback still needs a user gesture on most browsers, MSE or not. Until you provide one, play() rejects (typically with a NotAllowedError), so sound effects and background audio do nothing unless you handle that rejection.
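A common way to live with that is to attempt playback and, if the browser blocks it, retry on the next user gesture. A minimal sketch, assuming pointerdown as the gesture (click or keydown would work too); playOrWaitForGesture, Playable and GestureSource are names invented here, shaped so that an HTMLAudioElement and document can be passed in.

```typescript
interface Playable {
  play(): Promise<void>;
}

interface GestureSource {
  addEventListener(type: "pointerdown", listener: () => void, options: { once: boolean }): void;
}

// Tries to play immediately; when autoplay is blocked, defers to the next gesture.
async function playOrWaitForGesture(
  audio: Playable,
  gestures: GestureSource,
): Promise<"playing" | "deferred"> {
  try {
    await audio.play();
    return "playing";
  } catch (err) {
    // Browsers reject a blocked play() with a DOMException named NotAllowedError.
    if (err instanceof Error && err.name === "NotAllowedError") {
      gestures.addEventListener(
        "pointerdown",
        () => {
          // Retry inside the gesture; a second failure is ignored here.
          audio.play().catch(() => undefined);
        },
        { once: true },
      );
      return "deferred";
    }
    throw err; // anything else is a real error
  }
}

// In a browser (not executed here):
//   const state = await playOrWaitForGesture(audioElement, document);
//   if (state === "deferred") showTapToPlayHint(); // hypothetical UI helper
```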

IndexedDB has no consistent cross-browser eviction policy you can rely on. You own the cleanup: call cleanup() with a maxAge/maxEntries policy, or you accumulate audio blobs until the browser starts complaining.
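What a maxAge/maxEntries policy boils down to is a pure decision over cache metadata. The sketch below is not the library’s cleanup(): CacheEntry, CleanupPolicy and selectEvictions are hypothetical, and the field names (storedAt, maxAgeMs) are assumptions made for the example.

```typescript
interface CacheEntry {
  id: string;
  storedAt: number; // epoch milliseconds when the clip was cached
}

interface CleanupPolicy {
  maxAgeMs: number;
  maxEntries: number;
}

// Returns the ids to delete: everything older than maxAgeMs, plus whatever
// falls outside the newest maxEntries.
function selectEvictions(
  entries: readonly CacheEntry[],
  policy: CleanupPolicy,
  now: number,
): string[] {
  const newestFirst = [...entries].sort((a, b) => b.storedAt - a.storedAt);
  return newestFirst
    .filter((entry, index) => index >= policy.maxEntries || now - entry.storedAt > policy.maxAgeMs)
    .map((entry) => entry.id);
}

// In a browser, each returned id would then be removed inside a readwrite
// IndexedDB transaction, e.g. store.delete(id) on the object store.
```

Keeping the decision separate from the IndexedDB calls means the policy can be unit-tested without a database.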

None of these are bugs in the approach; they’re the tax for taking manual control of the buffer. For a clip that fits comfortably in memory and has a known length, a plain <audio src> is genuinely the right call — reach for MSE only when the stream is open-ended or you need cache-once-play-forever.

If you’ve built progressive audio in a browser: did you go the MSE route, or find <audio> with byte-range requests good enough — and where did iOS bite you?

Code: Livshitz/audio.libx.js · npm · live demo

References