Point an <audio> tag at a URL that’s still being generated — a TTS response, a live transcode, anything streamed chunk by chunk — and you get one of two bad outcomes. It stalls, waiting for enough of the file to look “seekable,” or your own code buffers the entire response into a blob before handing it over. Either way the user watches a spinner for as long as the whole clip takes to arrive, even though the first few hundred milliseconds of audio were decodable ages ago.
The <audio> element is built around an assumption: that it knows (or can Range-probe) the whole file up front. A stream with no known length breaks that assumption, and the element degrades to “download it all first.”
The fix isn’t a bigger buffer — it’s a different contract with the browser: MediaSource Extensions (MSE). Instead of giving <audio> a URL and hoping, you give it a MediaSource object as its src, then push chunks into a SourceBuffer yourself as they arrive over fetch. The browser decodes and plays as soon as there’s enough buffered to render — it never needs the total length. This isn’t exotic; per MDN, MSE is the same primitive that “makes it possible to play media on the Web without the use of plugins” — the machinery every adaptive video player has run on for a decade. It’s just rarely pointed at the much smaller surface of audio, where most projects still reach for a plain <audio> and eat the latency.
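For orientation, here is roughly what that contract looks like against the raw browser APIs. Treat it as a minimal sketch, not as audio.libx.js internals: the names streamIntoElement and appendChunk are made up for illustration, the MIME type is assumed to be known and accepted by MSE, and retries, aborts, and quota errors are left out.

```typescript
/** Append one chunk and resolve once the SourceBuffer has finished ingesting it. */
function appendChunk(buffer: SourceBuffer, chunk: BufferSource): Promise<void> {
  return new Promise((resolve, reject) => {
    const cleanup = () => {
      buffer.removeEventListener("updateend", onDone);
      buffer.removeEventListener("error", onError);
    };
    const onDone = () => { cleanup(); resolve(); };
    const onError = () => { cleanup(); reject(new Error("appendBuffer failed")); };
    buffer.addEventListener("updateend", onDone);
    buffer.addEventListener("error", onError);
    buffer.appendBuffer(chunk);
  });
}

/** Feed a fetch() body into an <audio> element chunk by chunk (hypothetical helper). */
async function streamIntoElement(
  audio: HTMLAudioElement,
  url: string,
  mime: string, // assumed known up front, e.g. "audio/mpeg"
): Promise<void> {
  if (typeof MediaSource === "undefined" || !MediaSource.isTypeSupported(mime)) {
    throw new Error(`MSE cannot handle ${mime} in this browser`);
  }
  const mediaSource = new MediaSource();
  audio.src = URL.createObjectURL(mediaSource);
  // The SourceBuffer can only be created once the element has opened the source.
  await new Promise<void>((resolve) =>
    mediaSource.addEventListener("sourceopen", () => resolve(), { once: true }),
  );
  const sourceBuffer = mediaSource.addSourceBuffer(mime);

  const response = await fetch(url);
  if (!response.ok || !response.body) {
    throw new Error(`fetch failed with status ${response.status}`);
  }
  const reader = response.body.getReader();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    // One append at a time: appendBuffer throws while a previous append is still updating.
    await appendChunk(sourceBuffer, value);
  }
  mediaSource.endOfStream(); // tells the element the duration is now final
}
```

The element can start playing as soon as the first appended chunks are decodable; nothing in the loop needs the total length.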
That’s the mechanism behind audio.libx.js. The pieces that matter:
- Chunked fetch: streamFromUrl() reads the response body as a stream; streamFromResponse() takes a Response you already fetched (e.g. behind auth headers).
- MediaSource on desktop, ManagedMediaSource on iOS: the library sniffs which one exists and picks it, so you don’t special-case iOS.
- Format detection: it inspects the byte stream to decide MP3 / WAV / WebM / OGG, because MSE throws on a wrong MIME string on the SourceBuffer rather than degrading.
- IndexedDB caching: chunks are persisted as they land, keyed by an audio ID. Replay the same clip (even after a reload) and playFromCache() serves it with zero network. Naive streaming setups skip this: they either re-fetch every time or cache nothing, because the whole point was to avoid buffering the full response first.
- bufferThreshold: require N seconds buffered before playback starts, trading a little startup latency for fewer stutters on a slow link.
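To make the format-detection bullet concrete, here is one way byte sniffing can work. This is a hypothetical sketch based on well-known magic numbers, not the library’s actual detector; sniffAudioMime is an invented name, and a real implementation would also have to cope with ID3 tags in front of non-MP3 data and with codecs inside the container.

```typescript
type SniffedMime = "audio/mpeg" | "audio/wav" | "audio/webm" | "audio/ogg";

/** True if `bytes` contains `magic` at `offset`. */
function hasMagic(bytes: Uint8Array, magic: readonly number[], offset = 0): boolean {
  if (bytes.length < offset + magic.length) return false;
  return magic.every((expected, i) => bytes[offset + i] === expected);
}

/** Guess the container from the first bytes of a stream; null means "unknown". */
function sniffAudioMime(bytes: Uint8Array): SniffedMime | null {
  if (hasMagic(bytes, [0x4f, 0x67, 0x67, 0x53])) return "audio/ogg"; // "OggS"
  if (hasMagic(bytes, [0x1a, 0x45, 0xdf, 0xa3])) return "audio/webm"; // EBML header (WebM/Matroska)
  if (
    hasMagic(bytes, [0x52, 0x49, 0x46, 0x46]) && // "RIFF"
    hasMagic(bytes, [0x57, 0x41, 0x56, 0x45], 8) // "WAVE" at byte 8
  ) {
    return "audio/wav";
  }
  if (hasMagic(bytes, [0x49, 0x44, 0x33])) return "audio/mpeg"; // "ID3" tag, usually MP3

  // Bare MPEG audio frame: 11 sync bits set, and layer bits not the reserved 00
  // (the layer check also keeps ADTS AAC, which uses 00 there, from matching).
  const b0 = bytes[0] ?? 0;
  const b1 = bytes[1] ?? 0;
  if (bytes.length >= 2 && b0 === 0xff && (b1 & 0xe0) === 0xe0 && (b1 & 0x06) !== 0) {
    return "audio/mpeg";
  }
  return null;
}
```

A null result is the signal to stop guessing and fall back to a plain element rather than hand MSE a MIME string it will reject.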
Minimal usage:
import { createAudioStreamer } from 'audio.libx.js';
const audioElement = document.getElementById('audio') as HTMLAudioElement;
const streamer = createAudioStreamer(audioElement, {
bufferThreshold: 5, // start playing after 5s buffered
enableCaching: true, // persist to IndexedDB
enableTrimming: true, // strip leading/trailing silence
});
const result = await streamer.streamFromUrl('https://example.com/audio.mp3');
await result.onLoaded; // playable
await result.onEnded; // done
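The bufferThreshold option in the usage above comes down to a question you can ask of the element’s buffered ranges. The sketch below shows one plausible way to answer it; bufferedAhead and canStartPlayback are hypothetical names, the 0.1 s tolerance is an assumption to cope with streams whose first range starts slightly after zero, and the library may decide this differently.

```typescript
// Structural subset of the DOM TimeRanges interface (what audio.buffered returns),
// declared here so the helpers can be exercised without a browser.
interface TimeRangesLike {
  readonly length: number;
  start(index: number): number;
  end(index: number): number;
}

/** Seconds buffered ahead of `currentTime` in the range that contains it (0 if none). */
function bufferedAhead(
  ranges: TimeRangesLike,
  currentTime: number,
  tolerance = 0.1, // assumed slack for ranges that begin just after the playhead
): number {
  for (let i = 0; i < ranges.length; i++) {
    if (ranges.start(i) - tolerance <= currentTime && currentTime <= ranges.end(i)) {
      return ranges.end(i) - currentTime;
    }
  }
  return 0;
}

/** Start once enough is buffered, or once the stream is complete (short clips). */
function canStartPlayback(
  ranges: TimeRangesLike,
  currentTime: number,
  thresholdSeconds: number,
  streamEnded: boolean,
): boolean {
  return streamEnded || bufferedAhead(ranges, currentTime) >= thresholdSeconds;
}
```

In a browser you would call canStartPlayback(audio.buffered, audio.currentTime, 5, ended) after each append completes. The streamEnded escape hatch matters: a three-second clip never reaches a five-second threshold.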
The recording side is the mirror image, and it exists for a reason beyond symmetry. If you’re feeding a speech-to-text service, you don’t want to wait for MediaRecorder to finish and hand you one big blob. createAudioRecorder({ enableRealtimeChunks: true, chunkInterval: 500, chunkFormat: 'wav', chunkSampleRate: 16000 }) emits chunks every 500ms in a format most STT APIs actually accept, so you can pipe them over a WebSocket while the user is still talking.
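For a sense of what one such chunk contains, here is a sketch of wrapping raw samples in a 16-bit PCM WAV container. It is not the recorder’s own encoder: encodeWavChunk is a hypothetical name, and the code assumes mono Float32 samples in the range -1..1 that have already been resampled to the target rate.

```typescript
/** Wrap mono Float32 samples in a 44-byte RIFF/WAVE header as 16-bit PCM. */
function encodeWavChunk(samples: Float32Array, sampleRate = 16000): ArrayBuffer {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);
  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk length
  view.setUint16(20, 1, true); // format 1 = uncompressed PCM
  view.setUint16(22, 1, true); // channel count (mono assumed)
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    // Clamp, then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i] ?? 0));
    view.setInt16(44 + i * bytesPerSample, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
```

Each returned ArrayBuffer is a complete, self-describing WAV file, so it can go straight into WebSocket.send() without the receiver needing any state from earlier chunks.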
Now the honest part, because this approach lives and dies on browser quirks.
iOS is its own animal. Safari on iPhone doesn’t expose a plain MediaSource to pages at all; what you get instead, from iOS 17.1 onward, is ManagedMediaSource, where the user agent manages memory for you and can evict your buffered content whenever it likes. And it won’t even start unless you ask correctly. Straight from MDN:
On Safari, ManagedMediaSource only activates when remote playback is explicitly disabled on the media element (by setting HTMLMediaElement.disableRemotePlayback to true), or when an AirPlay source alternative is provided […] Without either of these, the sourceopen event will not fire.
Miss that and your stream silently never opens. It’s also why the library ships a Media Session integration: once you push your own SourceBuffers, you’ve opted out of the free lock-screen behavior a plain <audio src="..."> gets, and you have to wire MediaSession.setActionHandler back yourself.
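Put together, the iOS handling might look like the sketch below. It is written against the standard browser APIs rather than the library’s code: pickMediaSource, attachSource, and wireMediaSession are hypothetical names, and preferring plain MediaSource when both constructors exist is an assumption that matches the “MediaSource on desktop, ManagedMediaSource on iOS” behavior described earlier.

```typescript
type MediaSourceCtor = new () => MediaSource;

interface SourceChoice {
  ctor: MediaSourceCtor;
  managed: boolean;
}

/** Pick whichever MSE constructor the environment exposes (plain one first). */
function pickMediaSource(
  scope: Record<string, unknown> = globalThis as unknown as Record<string, unknown>,
): SourceChoice | null {
  const plain = scope["MediaSource"];
  if (typeof plain === "function") return { ctor: plain as MediaSourceCtor, managed: false };
  const managed = scope["ManagedMediaSource"];
  if (typeof managed === "function") return { ctor: managed as MediaSourceCtor, managed: true };
  return null;
}

/** Attach a (Managed)MediaSource to the element and resolve once it opens. */
function attachSource(audio: HTMLAudioElement): Promise<MediaSource> {
  const choice = pickMediaSource();
  if (!choice) return Promise.reject(new Error("No MediaSource implementation available"));
  const source = new choice.ctor();
  if (choice.managed) {
    // Per the MDN note quoted above: without this, `sourceopen` never fires on Safari.
    audio.disableRemotePlayback = true;
  }
  return new Promise((resolve) => {
    source.addEventListener("sourceopen", () => resolve(source), { once: true });
    audio.src = URL.createObjectURL(source);
  });
}

/** Re-create the lock-screen controls that a plain <audio src> would get for free. */
function wireMediaSession(audio: HTMLAudioElement, title: string): void {
  if (!("mediaSession" in navigator)) return;
  navigator.mediaSession.metadata = new MediaMetadata({ title });
  navigator.mediaSession.setActionHandler("play", () => {
    audio.play().catch(() => { /* blocked or interrupted; nothing more to do here */ });
  });
  navigator.mediaSession.setActionHandler("pause", () => audio.pause());
}
```

A production version would also put a timeout on the sourceopen wait, since the failure mode described above is an event that simply never arrives.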
MSE support isn’t universal codec support. MediaSource.isTypeSupported() answers per browser, per platform, and per OS codec pack. MP3 and WAV decode almost everywhere in a plain <audio> element, but a SourceBuffer can be pickier about containers than the element itself; WebM/Opus is solid on Chrome/Firefox but historically shakier on Safari. Don’t assume: check getCapabilities().supportedMimeTypes at runtime and fall back to a plain non-progressive <audio src> if MSE genuinely isn’t there.
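If you are not going through the library’s getCapabilities(), the same runtime check can be done directly with MediaSource.isTypeSupported(). A small sketch, with chooseSupportedMime as a hypothetical helper; note that the default check below only looks at the plain MediaSource global, so on iPhone you would pass a check based on ManagedMediaSource instead.

```typescript
type SupportCheck = (mime: string) => boolean;

/** Default check: ask MSE directly, and answer false where MSE does not exist. */
function mseSupports(mime: string): boolean {
  return typeof MediaSource !== "undefined" && MediaSource.isTypeSupported(mime);
}

/** First candidate MIME type the environment accepts, or null to signal "use plain <audio src>". */
function chooseSupportedMime(
  candidates: readonly string[],
  isSupported: SupportCheck = mseSupports,
): string | null {
  return candidates.find((mime) => isSupported(mime)) ?? null;
}

// Example wiring (browser only):
//   const mime = chooseSupportedMime(['audio/webm; codecs="opus"', "audio/mpeg"]);
//   if (mime === null) audio.src = url; // plain, non-progressive fallback
```

Injecting the check keeps the decision testable and makes the fallback path an explicit branch instead of a thrown exception from addSourceBuffer().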
Autoplay policies don’t care that your stream is clever. First playback still needs a user gesture on most browsers, MSE or not. Until you provide one, play() rejects with a NotAllowedError, and unless you handle that rejection your sound effects and background audio simply never start.
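One common way to handle this is to attempt playback, detect the autoplay rejection, and retry on the next user gesture. The sketch below is an illustration rather than anything from the library; playWhenAllowed and isAutoplayBlock are hypothetical names, and the small interfaces exist only so the logic does not depend on DOM types.

```typescript
// Minimal shapes satisfied by HTMLAudioElement and document respectively.
interface Playable {
  play(): Promise<void>;
}
interface GestureTarget {
  addEventListener(type: string, listener: () => void, options?: { once?: boolean }): void;
}

/** Browsers reject play() with a DOMException named "NotAllowedError" when autoplay is blocked. */
function isAutoplayBlock(error: unknown): boolean {
  return (
    typeof error === "object" &&
    error !== null &&
    (error as { name?: unknown }).name === "NotAllowedError"
  );
}

/** Try to play now; if autoplay is blocked, retry once on the next pointer gesture. */
async function playWhenAllowed(
  audio: Playable,
  gestureTarget: GestureTarget,
): Promise<"playing" | "waiting-for-gesture"> {
  try {
    await audio.play();
    return "playing";
  } catch (error) {
    if (!isAutoplayBlock(error)) throw error; // decode or network errors are not ours to hide
    gestureTarget.addEventListener(
      "pointerdown",
      () => {
        audio.play().catch(() => { /* still blocked; leave it to the UI */ });
      },
      { once: true },
    );
    return "waiting-for-gesture";
  }
}
```

In a page this would be called as playWhenAllowed(audioElement, document), with the "waiting-for-gesture" result used to show a tap-to-play hint.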
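IndexedDB caching has no consistent cross-browser eviction policy. Browsers treat it as best-effort storage and may clear an origin’s data wholesale under storage pressure, but none of them will trim your cache entry by entry. You own the cleanup: call cleanup() with a maxAge/maxEntries policy, or you accumulate audio blobs until the browser starts complaining.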
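The article doesn’t show how cleanup() decides what to delete, so the following is only a sketch of what a maxAge/maxEntries policy can mean in practice. All names here (CacheEntryMeta, CleanupPolicy, selectEntriesToEvict) are hypothetical, and the actual deletion from IndexedDB is left out.

```typescript
interface CacheEntryMeta {
  id: string; // the audio ID the chunks are keyed by
  storedAt: number; // milliseconds since epoch
}

interface CleanupPolicy {
  maxAgeMs?: number;
  maxEntries?: number;
}

/** Return the IDs to delete: everything too old, then the oldest beyond the entry cap. */
function selectEntriesToEvict(
  entries: readonly CacheEntryMeta[],
  policy: CleanupPolicy,
  now: number = Date.now(),
): string[] {
  const evict = new Set<string>();
  const { maxAgeMs, maxEntries } = policy;

  if (maxAgeMs !== undefined) {
    for (const entry of entries) {
      if (now - entry.storedAt > maxAgeMs) evict.add(entry.id);
    }
  }
  if (maxEntries !== undefined) {
    // Newest first, so everything past the cap is the oldest surviving data.
    const survivors = entries
      .filter((entry) => !evict.has(entry.id))
      .sort((a, b) => b.storedAt - a.storedAt);
    for (const entry of survivors.slice(maxEntries)) evict.add(entry.id);
  }
  return Array.from(evict);
}
```

Keeping the policy as a pure function separates the decision from the storage calls, so the IndexedDB side reduces to deleting the returned IDs in one transaction.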
None of these are bugs in the approach; they’re the tax for taking manual control of the buffer. For a clip that fits comfortably in memory and has a known length, a plain <audio src> is genuinely the right call — reach for MSE only when the stream is open-ended or you need cache-once-play-forever.
If you’ve built progressive audio in a browser: did you go the MSE route, or find <audio> with byte-range requests good enough — and where did iOS bite you?
Code: Livshitz/audio.libx.js · npm · live demo
References
- MDN — Media Source Extensions API: developer.mozilla.org
- MDN — ManagedMediaSource (Safari/iOS activation rules): developer.mozilla.org
- MDN — MediaRecorder: developer.mozilla.org
- MDN — Autoplay guide for media and Web Audio APIs: developer.mozilla.org