NoCloud Media

Video tool

Auto-Generate Subtitles for Video

Drop a video. Get an SRT file. The transcription model runs locally — your audio never leaves this tab.

How it works

  1. Drop a video or audio file

    Anything FFmpeg can decode — MP4, MOV, WebM, MKV, AVI, MP3, WAV, M4A. The file stays on your device.

  2. Wait for the speech model (first run only, ~75 MB)

    We download Whisper-tiny.en the first time you use this tool. After that it's cached, so later runs skip the download and the model loads almost immediately. The model runs entirely in your browser.

  3. Transcribe in your browser

    FFmpeg extracts the audio and passes it to Whisper, which returns the text in segments, each with a start and end timestamp. We format those segments as a standards-compliant SRT file.

  4. Download the .srt

    Use it in any video player that supports external subtitles, upload it to YouTube, or burn it into the picture with our subtitle burn-in tool.
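
The formatting step above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the tool's actual code: the `Segment` shape and the function names `toSrtTimestamp` and `segmentsToSrt` are hypothetical, and it assumes segment times arrive in seconds.

```typescript
// Hypothetical shape for one transcribed segment; times are assumed to be in seconds.
interface Segment {
  start: number;
  end: number;
  text: string;
}

// Left-pad a non-negative integer with zeros to the given width.
function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm (comma before the milliseconds).
function toSrtTimestamp(seconds: number): string {
  const totalMs = Math.max(0, Math.round(seconds * 1000));
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Build the SRT text: cues numbered from 1, separated by a blank line.
function segmentsToSrt(segments: Segment[]): string {
  return segments
    .map(
      (seg, i) =>
        `${i + 1}\n${toSrtTimestamp(seg.start)} --> ${toSrtTimestamp(seg.end)}\n${seg.text.trim()}\n`
    )
    .join("\n");
}
```

Rounding to whole milliseconds before splitting into hours, minutes, and seconds avoids timestamps such as `00:00:59,1000` that can appear when each field is rounded separately.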

Why use Auto-subtitles?

Your audio never leaves your tab — no API key, no third-party service, no per-minute pricing. Whisper runs locally via WebAssembly the same way our other tools do.

Standards-compliant SRT output — works in VLC, mpv, YouTube, Vimeo, Premiere, DaVinci Resolve, or our own subtitle burn-in tool.

Free with no caps. Paid alternatives (Veed, Descript, OpenAI's Whisper API) charge per minute or require a monthly subscription. Ours runs an open-source Whisper model on your own hardware.

Common use cases

  • Caption a podcast episode for accessibility before publishing
  • Generate a transcript of a recorded lecture or interview
  • Add subtitles to a video before sharing on social platforms
  • Get a starting-point transcript that you'll edit by hand for accuracy
  • Caption a screen recording for a tutorial or how-to video
  • Generate SRT for a private family video without uploading it anywhere

About MP4 and SRT

SRT (SubRip) is one of the most widely supported subtitle formats. Each cue is a sequence number, a time range like '00:00:05,000 --> 00:00:08,500', and one or more lines of text, with a blank line separating cues. Our auto-subtitles tool uses Whisper-tiny.en, the English-only variant of OpenAI's open-source speech recognition model, running entirely in your browser via WebAssembly. The model is a ~75 MB download on the first run and is cached afterwards. Accuracy is good on clear English speech and lower on heavy accents, low-quality audio, or domain-specific jargon. The SRT output is a starting point you can refine in any text editor.
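
Because the SRT is meant to be edited by hand, it can be useful to check that an edited file still follows the cue structure described above. The sketch below parses SRT text into cue objects and skips blocks that don't match; the `Cue` type and the `parseSrt` name are illustrative assumptions, and it handles only the plain cue layout (no styling tags or position hints).

```typescript
// Hypothetical parsed form of one cue; times are in milliseconds.
interface Cue {
  index: number;
  startMs: number;
  endMs: number;
  text: string;
}

// Matches a time range such as "00:00:05,000 --> 00:00:08,500".
const TIME_RANGE =
  /^(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})$/;

function toMs(h: number, m: number, s: number, ms: number): number {
  return ((h * 60 + m) * 60 + s) * 1000 + ms;
}

function parseSrt(srt: string): Cue[] {
  const cues: Cue[] = [];
  // Normalize Windows line endings, then split on the blank lines between cues.
  const blocks = srt.replace(/\r\n/g, "\n").trim().split(/\n{2,}/);
  for (const block of blocks) {
    const lines = block.split("\n");
    const match = TIME_RANGE.exec((lines[1] ?? "").trim());
    // A valid cue needs a number line, a time range, and at least one text line.
    if (!match || lines.length < 3) continue;
    const n = match.slice(1).map(Number);
    cues.push({
      index: Number(lines[0]),
      startMs: toMs(n[0], n[1], n[2], n[3]),
      endMs: toMs(n[4], n[5], n[6], n[7]),
      text: lines.slice(2).join("\n"),
    });
  }
  return cues;
}
```

Comparing the number of parsed cues with the number of blocks in the file is a quick way to spot a time range that was damaged during editing.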

Frequently asked questions

Is my video uploaded to a server?
No. NoCloud Media transcribes your video entirely in your browser using WebAssembly. The Whisper speech model also runs locally — both your audio and the transcribed text stay on your device.
What's the maximum file size I can transcribe?
It depends on your browser's available memory. Files under 500 MB transcribe smoothly on most devices. Long videos (1 hour+) may exhaust memory on phones; desktop browsers with plenty of RAM handle them better. Audio-only files are much lighter on memory than video.
How accurate is it?
Whisper-tiny.en is the smallest Whisper model, so it trades some accuracy for size and speed. It holds up well on clear English speech in normal acoustic conditions. Expect more errors on heavy accents, noisy backgrounds, technical jargon, or quiet recordings. Treat the output as a starting point: for clear audio, cleaning up an hour of video often takes 10-15 minutes.
What languages does it support?
Today: English only. We use the English-only Whisper-tiny.en model because it's smaller, faster, and slightly more accurate on English than the multilingual variant. Multilingual support is on the roadmap.
Why does the first run take so long?
The first time you use this tool, we download Whisper's speech recognition model (about 75 MB). After that it's cached in your browser, and later runs start in under a second. The cached copy does not expire on its own; later visits reuse it unless you clear browser data or the browser evicts it to free up storage.
Which browsers are supported?
Chrome, Edge, Firefox, and Safari 15+. We require WebAssembly, SharedArrayBuffer, and cross-origin isolation, all standard in modern browsers.
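
The three requirements above can be checked up front before any model download starts. The following is a minimal sketch under stated assumptions: `BrowserEnv` and `missingFeatures` are hypothetical names, and the environment is passed in as an argument so the check can be exercised outside a browser. In browsers, `WebAssembly` is a namespace object, `SharedArrayBuffer` is a constructor, and `crossOriginIsolated` is a boolean global that is true only when the page is served with the appropriate `Cross-Origin-Opener-Policy` and `Cross-Origin-Embedder-Policy` headers.

```typescript
// Hypothetical view of the globals this check cares about.
interface BrowserEnv {
  WebAssembly?: unknown;
  SharedArrayBuffer?: unknown;
  crossOriginIsolated?: boolean;
}

// Return the names of required features that are not available in `env`.
function missingFeatures(env: BrowserEnv): string[] {
  const missing: string[] = [];
  // WebAssembly is exposed as a namespace object, not a constructor.
  if (typeof env.WebAssembly !== "object" || env.WebAssembly === null) {
    missing.push("WebAssembly");
  }
  // SharedArrayBuffer is a constructor function when it is exposed.
  if (typeof env.SharedArrayBuffer !== "function") {
    missing.push("SharedArrayBuffer");
  }
  // crossOriginIsolated must be exactly true; undefined means unsupported or not isolated.
  if (env.crossOriginIsolated !== true) {
    missing.push("cross-origin isolation");
  }
  return missing;
}
```

In a page, this could be called as `missingFeatures(globalThis as unknown as BrowserEnv)`, showing an "unsupported browser" message when the returned list is not empty.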

Related tools