Nvidia Releases High-Speed Parakeet AI Speech Recognition Model, Claims Top Spot on Leaderboard

Nvidia's new Parakeet-TDT-0.6B-v2 speech recognition model has achieved top ranking on the Open ASR Leaderboard, offering high speed and accuracy.

Nvidia has entered the open-source speech recognition arena with Parakeet-TDT-0.6B-v2, an automatic speech recognition (ASR) model now hosted on Hugging Face.

Made available around May 1, the model quickly distinguished itself by securing the premier position on the Hugging Face Open ASR Leaderboard. It achieved this rank with a 6.05% average Word Error Rate (WER), a measure of transcription inaccuracy. This performance places it slightly ahead of other recently prominent open models, such as Microsoft’s Phi-4-multimodal, which held the leading spot in February with a 6.14% WER. Nvidia is distributing Parakeet-TDT-0.6B-v2 under the permissive CC-BY-4.0 license, facilitating its use in commercial applications.
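Word Error Rate counts the word-level substitutions, insertions, and deletions needed to turn a model's transcript into the reference, divided by the reference length. As a rough illustration (not Nvidia's or the leaderboard's evaluation code), a minimal WER function might look like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance, computed with a single rolling row.
    row = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, row[0] = row[0], i
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            prev, row[j] = row[j], min(row[j - 1] + 1,  # insertion
                                       row[j] + 1,      # deletion
                                       prev + cost)     # substitution/match
        row[0] = i
    return row[len(hyp)] / len(ref)

# One substitution ("the" -> "a") across six reference words: WER = 1/6.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A 6.05% average WER therefore means roughly six word-level errors per hundred reference words, averaged over the leaderboard's test sets.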

Architecture and Speed Optimizations

Beyond its accuracy ranking, Nvidia highlights the model's processing speed. Company benchmarks suggest the model can process an hour of audio in roughly one second on appropriate hardware, corresponding to a high Inverse Real Time Factor (RTFx) of 3380.
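RTFx is simply the ratio of audio duration to wall-clock processing time, so the two figures above are consistent, as this small sketch shows:

```python
def inverse_real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """RTFx = audio duration / processing time; higher means faster than real time."""
    return audio_seconds / processing_seconds

# An RTFx of 3380 implies one hour of audio takes about 3600 / 3380 seconds.
print(round(3600 / 3380, 2))  # ~1.07 seconds
```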

This speed is linked to its architecture: a FastConformer encoder paired with a Token-and-Duration Transducer (TDT) decoder. The TDT approach, as detailed by Nvidia, aims to accelerate inference by predicting text tokens and their durations simultaneously, reducing computational overhead from predicting numerous ‘blank’ tokens common in other methods.

Further speed enhancements reportedly stem from optimizations using NVIDIA TensorRT and FP8 quantization. Additionally, the model’s full attention mechanism allows it to handle long audio inputs, up to 24 minutes, in one go.

Performance Across Benchmarks and Conditions

The 6.05% average WER leads the Hugging Face leaderboard for open models, though other systems such as OpenAI’s Whisper v3 still demonstrate lower error rates on some broader evaluations.

Parakeet-TDT-0.6B-v2’s 600 million parameters represent a relatively compact size compared to models like Whisper v3 (1.5B parameters). Nvidia’s testing across standard benchmarks revealed varied results: low WERs on LibriSpeech (1.69% test-clean, 3.19% test-other) contrast with higher rates on datasets like AMI meeting recordings (11.16%).

The model shows decent noise robustness, with average WER increasing to 8.39% at a challenging signal-to-noise ratio (SNR) of 5 dB. Performance on simulated 8kHz telephony audio (6.32% WER) was only slightly worse than on standard 16kHz audio (6.05% WER). Key features include automatic punctuation, capitalization, word-level timestamps, and a noted ability for song-to-lyrics transcription.
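SNR is a logarithmic measure of how much signal power exceeds noise power. Purely as an illustration of how such noisy test conditions are typically constructed (this is not Nvidia's evaluation pipeline), one can compute the amplitude scale needed to mix noise into clean speech at a target SNR:

```python
import math

def noise_scale_for_snr(signal_power: float, noise_power: float, target_snr_db: float) -> float:
    """Amplitude scale for a noise signal so the mix reaches the target SNR in dB."""
    # SNR_dB = 10 * log10(P_signal / P_noise); solve for the required noise power.
    required_noise_power = signal_power / (10 ** (target_snr_db / 10))
    return math.sqrt(required_noise_power / noise_power)

# At 5 dB SNR, the speech carries about 10**(5/10) ~ 3.16x the noise power --
# audibly noisy conditions, which is why WER rises there.
print(round(10 ** (5 / 10), 2))  # 3.16
```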

Training Data and Availability

The model was developed using the Nvidia NeMo toolkit, the company’s platform for building various AI models. Its training began with initialization from a wav2vec self-supervised learning checkpoint pretrained on LibriLight data. Subsequent training used Nvidia’s extensive Granary dataset (~120,000 hours of English speech), which combines human-transcribed sources (like LibriSpeech, Fisher Corpus, Mozilla Common Voice 8.0, VCTK, VoxPopuli) with pseudo-labeled data from YouTube Commons and YODAS.

Nvidia plans a public release of the underlying Granary dataset after the Interspeech 2025 conference. While not specified for this version, previous Parakeet models, such as Parakeet-TDT 1.1B, involved collaboration with Suno.ai, which recently released its 4.5 AI music generation model.

Parakeet-TDT-0.6B-v2 is optimized for Nvidia GPUs across architectures like Ampere, Hopper, Volta, Blackwell, and Turing (T4), but can reportedly load with only 2GB RAM. Its speed and permissive license make it an attractive option for developers. Nvidia states no personal data was used in training and provides standard ethical notes on the model card.

Markus Kasanmascheff
Markus has been covering the tech industry for more than 15 years. He holds a Master's degree in International Economics and is the founder and managing editor of Winbuzzer.com.