Video tool
Auto-Generate Subtitles for Video
Drop a video. Get an SRT file. The transcription model runs locally — your audio never leaves this tab.
How it works
1. Drop a video or audio file
Anything FFmpeg can decode: MP4, MOV, WebM, MKV, AVI, MP3, WAV, M4A. The file stays on your device.
2. Wait for the speech model (first run only, ~75 MB)
We download Whisper-tiny.en the first time you use this tool. After that it's cached in your browser, so later runs skip the download.
3. Transcribe in your browser
FFmpeg extracts the audio and feeds it to Whisper, which returns per-segment timestamps. We format them as a standards-compliant SRT file.
4. Download the .srt
Use it in any video player that supports external subtitles, upload it to YouTube, or burn it into the picture with our subtitle burn-in tool.
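The formatting stage of step 3 can be illustrated with a minimal TypeScript sketch. It assumes the speech model hands back segments with start and end times in seconds; the `Segment` shape and the function names `toSrtTime` and `segmentsToSrt` are hypothetical and are not taken from the tool's actual code.

```typescript
// Hypothetical segment shape: times in seconds plus the recognized text.
interface Segment {
  start: number;
  end: number;
  text: string;
}

// Left-pad a non-negative integer with zeros to the given width.
function pad(n: number, width: number): string {
  let out = String(n);
  while (out.length < width) out = "0" + out;
  return out;
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm (comma before milliseconds).
function toSrtTime(seconds: number): string {
  const totalMs = Math.max(0, Math.round(seconds * 1000));
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Build the SRT text: empty segments are skipped, the remaining cues are
// numbered from 1, and cues are separated by a blank line.
function segmentsToSrt(segments: Segment[]): string {
  return segments
    .filter((seg) => seg.text.trim().length > 0)
    .map(
      (seg, i) =>
        `${i + 1}\n${toSrtTime(seg.start)} --> ${toSrtTime(seg.end)}\n${seg.text.trim()}\n`
    )
    .join("\n");
}
```

For example, `segmentsToSrt([{ start: 5, end: 8.5, text: "Hello there." }])` yields a single cue numbered 1 with the time range `00:00:05,000 --> 00:00:08,500`.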
Why use Auto-subtitles?
Your audio never leaves your tab — no API key, no third-party service, no per-minute pricing. Whisper runs locally via WebAssembly the same way our other tools do.
Standards-compliant SRT output — works in VLC, mpv, YouTube, Vimeo, Premiere, DaVinci Resolve, or our own subtitle burn-in tool.
Free with no caps. Paid alternatives (Veed, Descript, OpenAI's Whisper API) charge per minute or require a monthly subscription. Ours runs an open-source Whisper model on your own hardware.
Common use cases
- Caption a podcast episode for accessibility before publishing
- Generate a transcript of a recorded lecture or interview
- Add subtitles to a video before sharing on social platforms
- Get a starting-point transcript that you'll edit by hand for accuracy
- Caption a screen recording for a tutorial or how-to video
- Generate SRT for a private family video without uploading it anywhere
About MP4 and SRT
SRT (SubRip) is one of the most widely supported subtitle formats. Each cue is a sequence number, a time range like '00:00:05,000 --> 00:00:08,500', and one or more lines of text, with a blank line separating cues. Our auto-subtitles tool uses Whisper-tiny.en, the English-only variant of OpenAI's open-source speech recognition model, running entirely in your browser via WebAssembly. The model is a ~75 MB download on the first run and is cached afterwards. Accuracy is good on clear English speech and lower on heavy accents, low-quality audio, or domain-specific jargon. The SRT output is a starting point you can refine in any text editor.
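When refining the file by hand, a small check can confirm that an edited cue still follows the layout described above. The following TypeScript sketch parses one cue; `Cue`, `parseSrtTime`, and `parseCue` are hypothetical names chosen for illustration. It is deliberately stricter than some players: it rejects a dot as the millisecond separator and anything trailing the end time.

```typescript
// One parsed cue; times are in seconds.
interface Cue {
  index: number;
  start: number;
  end: number;
  text: string;
}

// Parse "HH:MM:SS,mmm" into seconds, or return null if the stamp is malformed.
function parseSrtTime(stamp: string): number | null {
  const m = /^(\d{2,}):([0-5]\d):([0-5]\d),(\d{3})$/.exec(stamp.trim());
  if (m === null) return null;
  return (
    Number(m[1]) * 3600 + Number(m[2]) * 60 + Number(m[3]) + Number(m[4]) / 1000
  );
}

// Parse a single cue block: sequence number, time range, then text lines.
function parseCue(block: string): Cue | null {
  const [first = "", second = "", ...rest] = block.trim().split(/\r?\n/);
  // The first line must be a plain integer and at least one text line must follow.
  if (!/^\d+$/.test(first.trim()) || rest.length === 0) return null;
  // The second line must contain exactly one "-->" between two timestamps.
  const [rawStart = "", rawEnd = "", ...extra] = second.split("-->");
  if (extra.length > 0) return null;
  const start = parseSrtTime(rawStart);
  const end = parseSrtTime(rawEnd);
  if (start === null || end === null || end < start) return null;
  return { index: Number(first), start, end, text: rest.join("\n") };
}
```

A cue whose time range was accidentally saved with dots, as in WebVTT (`00:00:05.000`), comes back as `null`, which makes that common editing slip easy to spot.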
Frequently asked questions
- Is my video uploaded to a server?
- No. NoCloud Media transcribes your video entirely in your browser using WebAssembly. The Whisper speech model also runs locally — both your audio and the transcribed text stay on your device.
- What's the maximum file size I can transcribe?
- It depends on your browser's available memory. Files under 500 MB transcribe smoothly on most devices. Long videos (1 hour+) may exhaust memory on phones; desktop browsers with plenty of RAM handle them better. Audio-only files are much lighter on memory than video.
- How accurate is it?
- Whisper-tiny.en does well on clear English speech in normal acoustic conditions, but it is the smallest Whisper model, so the larger Whisper models are more accurate. Expect more errors on heavy accents, noisy backgrounds, technical jargon, or quiet recordings. The output is a starting point you can refine in 10-15 minutes for an hour of video.
- What languages does it support?
- Today: English only. We use the English-only Whisper-tiny.en model because, at this model size, it tends to be more accurate on English than the multilingual variant. Multilingual support is on the roadmap.
- Why does the first run take so long?
- The first time you use this tool, we download Whisper's speech recognition model, about 75 MB. After that it's cached in your browser and subsequent runs start in under a second. The cached copy doesn't expire on its own; later visits reuse it unless you clear browser data or the browser evicts it to free up storage.
- Which browsers are supported?
- Chrome, Edge, Firefox, and Safari 15.2+. We require WebAssembly, SharedArrayBuffer, and cross-origin isolation, all available in current versions of these browsers.
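The browser requirements in the last answer could be checked up front with a feature-detection sketch like the one below. `SupportReport` and `checkSupport` are hypothetical names, and this is not the tool's own check; `crossOriginIsolated` is the standard global that reports whether the page is cross-origin isolated. The `env` parameter defaults to the global object and exists only so the function can be exercised with a stand-in object.

```typescript
// Result of the capability check, one flag per requirement.
interface SupportReport {
  webAssembly: boolean;
  sharedArrayBuffer: boolean;
  crossOriginIsolated: boolean;
  supported: boolean;
}

// Inspect an environment object (the global object by default) for the
// three capabilities the FAQ lists.
function checkSupport(
  env: Record<string, unknown> = globalThis as unknown as Record<string, unknown>
): SupportReport {
  const wasm = env["WebAssembly"];
  const webAssembly = typeof wasm === "object" && wasm !== null;
  const sharedArrayBuffer = typeof env["SharedArrayBuffer"] === "function";
  const crossOriginIsolated = env["crossOriginIsolated"] === true;
  return {
    webAssembly,
    sharedArrayBuffer,
    crossOriginIsolated,
    supported: webAssembly && sharedArrayBuffer && crossOriginIsolated,
  };
}
```

Calling `checkSupport()` on page load and showing a message when `supported` is false would let an unsupported browser fail early, before the model download starts.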