Audio File to Text — Upload Any Audio, Get a Transcript
Not all audio lives on YouTube. Podcast episodes, recorded meetings, lecture recordings, interview files, local video downloads — these don't have a URL to paste. INDXR.AI's Audio Upload tab accepts any audio or video file and runs it through the same AI transcription pipeline used for YouTube videos.
The result is a full text transcript, stored in your library, exportable in every format INDXR.AI supports.
Supported Formats and Limits
| Format | Notes |
|---|---|
MP3 | Most common audio format, fully supported |
MP4 | Video file — audio track extracted automatically |
WAV | Uncompressed audio, typically larger files |
M4A | Apple audio format, common from iOS recordings |
OGG | Open format, common from browser and web recordings |
FLAC | Lossless audio, larger files |
WEBM | Web video format, common from browser recordings |
Maximum file size: 500MB. Most web upload tools cap at 50–100MB. INDXR.AI's upload pipeline handles larger files without that constraint — we've tested uploads of over 200MB and 210+ minutes of audio without issues.
If you need to split very large files, FFmpeg handles this cleanly: ffmpeg -i large_file.mp3 -t 3600 part1.mp3 -ss 3600 part2.mp3 splits a file at the 1-hour mark.
How the Transcription Works
Once uploaded, the file is processed through AssemblyAI Universal-3 Pro — the same model used when INDXR.AI transcribes YouTube videos without captions. The model handles a wide range of audio conditions:
Languages: 99+ languages with automatic detection. You don't need to specify the language; the model identifies it from the audio.
Audio quality: The model is trained on real-world audio including phone recordings, conference recordings with background noise, and studio-quality content. YouTube's auto-captions achieve 60–80% accuracy on challenging audio — AssemblyAI Universal-3 Pro consistently reaches 94–96%+ on clean recordings.
Punctuation: Unlike auto-captions, AssemblyAI output includes proper punctuation and sentence boundaries. This matters for readability, for SRT/VTT timing quality, and for downstream uses like RAG export where sentence detection affects chunk quality.
Processing time: Approximately 1 minute of processing time per 10 minutes of audio for most files. You receive a live elapsed timer during processing, and the job continues on the server even if your connection drops briefly.
Credit Cost
Audio Upload transcription costs 1 credit per minute of audio, with a minimum of 1 credit. The credit is charged based on the actual audio duration detected after upload, not the file size.
| Audio length | Credits | Cost at Basic (€6.99/500cr) | Cost at Plus (€13.99/1,200cr) |
|---|---|---|---|
| Under 1 minute | 1 credit | €0.01 | €0.01 |
| 10 minutes | 10 credits | €0.14 | €0.12 |
| 30 minutes | 30 credits | €0.42 | €0.35 |
| 1 hour | 60 credits | €0.84 | €0.70 |
| 2 hours | 120 credits | €1.68 | €1.40 |
What You Get After Transcription
The transcript appears in your library alongside any YouTube transcripts you've extracted. From there:
Export in any format: TXT plain, TXT with timestamps, Markdown with YAML frontmatter, SRT, VTT, CSV, JSON, or RAG-optimized JSON. SRT and VTT output is resegmented to 3–7 second blocks at 42 characters per line — the professional subtitle standard.
Edit in the rich-text editor: Correct errors, add formatting, and annotate the transcript. Edits are saved separately from the original, so you can always revert.
Generate an AI summary: An AI-generated summary with key points and action items is available for any transcript in your library, at 3 credits.
Export as RAG JSON: For podcasts, lectures, or interviews you want to make searchable via a vector database — enable RAG JSON export to get chunked, metadata-rich output ready for LangChain, LlamaIndex, or direct vector database ingestion. See YouTube Transcripts for RAG Pipelines for the full pipeline.
Common Use Cases
Podcast transcription: Upload an episode MP3 directly. Export as Markdown for show notes, TXT for a newsletter, or RAG JSON to build a searchable podcast archive.
Recorded lecture notes: A 90-minute lecture recording becomes a searchable, editable transcript. Export as Markdown for an Obsidian note, or as CSV for analysis alongside other course materials.
Interview transcription: Upload a recorded interview. The transcript is accurate enough to quote from directly — useful for journalists, researchers, and user researchers.
Downloaded YouTube videos: If you've downloaded a YouTube video that has no captions, extract the audio and upload it to get a transcript the same way as a live YouTube URL.
For YouTube videos (where you have a URL rather than a downloaded file), use the YouTube Transcript Generator instead — it handles auto-caption extraction for free and AI transcription when captions aren't available. For credit packages, see the pricing page.
Frequently Asked Questions
- Is there a file size limit?
- 500MB maximum per upload. This covers approximately 8+ hours of audio at typical recording bitrates. For larger files, split them first using a tool like FFmpeg or Audacity.
- Does it work for video files as well as audio?
- Yes. MP4 and WEBM video files are supported — INDXR.AI extracts the audio track automatically. You don't need to extract audio yourself before uploading.
- How accurate is the transcription?
- AssemblyAI Universal-3 Pro achieves 94–96%+ accuracy on clean recordings. For challenging audio — heavy accents, significant background noise, overlapping speakers — accuracy varies but typically outperforms YouTube's auto-captions under the same conditions.
- Can I transcribe in languages other than English?
- Yes. 99+ languages are supported with automatic detection. The transcript will be in the language spoken in the audio.
- Does the file get stored on INDXR.AI's servers?
- The audio file is processed and then discarded. Only the resulting transcript text is stored in your library. INDXR.AI does not retain uploaded audio files after transcription is complete.
- What's the difference between audio upload and YouTube AI Transcription?
- The transcription pipeline is identical — both use AssemblyAI Universal-3 Pro and cost 1 credit per minute. The difference is the source: YouTube AI Transcription downloads the audio from a YouTube URL automatically, while audio upload lets you bring your own file from any source.



