Fun-ASR 1.5 Review: Best Open-Source AI Speech-to-Text Model (2026)

⏱️ 30-Second Verdict: Fun-ASR 1.5 is an open-source AI speech recognition model released April 20, 2026 by Alibaba’s Tongyi Lab. It achieves 16.72% industry WER versus Whisper-large-v3’s 33.39%, supports 31 languages, 7 Chinese dialect groups, and runs free via ModelScope or via Alibaba Cloud Bailian API at $0.008/min.

Fun-ASR 1.5 is an open-source speech recognition model developed by Alibaba’s Tongyi Lab, released April 20, 2026 via ModelScope and Alibaba Cloud’s Bailian platform. Built on a Mixture of Experts (MoE) architecture and trained on tens of millions of hours of real speech, it targets the gap that competing ASR tools have never closed: production-accurate recognition for Chinese dialects and code-switched multilingual input.

Two open-source variants are available. Fun-ASR-Nano (800M parameters) covers Chinese, English, Japanese plus 7 major Chinese dialect groups. Fun-ASR-MLT-Nano (800M) extends that to 31 languages. The full proprietary Fun-ASR model scales to 7.7B parameters and is available via enterprise API on Alibaba Cloud.


How Does AI Speech-to-Text Work?

Every AI speech recognition model follows the same fundamental pipeline:

  1. Audio preprocessing – raw audio is sliced into 25ms frames and converted to mel-spectrograms (frequency maps over time)
  2. Encoder – a neural network maps audio features to language tokens
  3. Decoder – predicts the most likely text sequence using beam search
  4. Post-processing – adds punctuation, normalizes numbers and dates, boosts custom hotwords
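
Step 1, the framing stage, can be sketched in a few lines. The 25ms frame length matches the pipeline above; the 10ms hop and 16kHz sample rate are common defaults, not Fun-ASR-specific values:

```python
def frame_audio(samples, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Slice a 1-D list of audio samples into overlapping frames."""
    frame_len = sample_rate * frame_ms // 1000  # 400 samples at 16 kHz
    hop_len = sample_rate * hop_ms // 1000      # 160 samples at 16 kHz
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames

# One second of 16 kHz audio yields 98 overlapping 25 ms frames.
frames = frame_audio([0.0] * 16000)
print(len(frames), len(frames[0]))  # 98 400
```

Each frame is then converted into one mel-spectrogram column before reaching the encoder.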

Fun-ASR’s MoE twist means step 2 isn’t monolithic. A gating network analyzes the first 200ms of audio and activates 2–4 specialized “expert” sub-networks matching the detected language or dialect. Shanghainese Wu audio triggers tone-contour experts trained on 200,000+ hours of Shanghai-region speech. English audio activates stress-timing modules instead. This selective routing reduces computation by ~40% versus a single large transformer while improving accuracy through specialization.
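
The routing behavior described above can be sketched as a toy top-k gate. The scores are stand-ins for a real gating network's output, and the expert labels are illustrative only:

```python
def route(gate_scores, k=2):
    """Return the indices of the top-k experts to activate."""
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical experts: 0 = Mandarin, 1 = Wu tone-contour,
# 2 = English stress-timing, 3 = noise-robust fallback.
print(route([0.61, 0.87, 0.04, 0.12], k=2))  # [0, 1]
```

In a real MoE layer the unselected experts are simply never evaluated, which is where the ~40% compute saving comes from.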

Text normalization is a post-processing layer that converts spoken numbers, dates, and phone numbers into formatted text automatically – no manual scripting needed. Auto-punctuation inserts commas and periods based on prosodic analysis rather than acoustic pauses alone.
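
The idea behind text normalization can be shown with a deliberately tiny example. A production normalizer handles dates, currency amounts, and phone numbers; this sketch only maps standalone digit words:

```python
WORD_TO_DIGIT = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

def normalize_digits(text):
    """Replace standalone digit words with numerals, leaving other words alone."""
    return " ".join(WORD_TO_DIGIT.get(w, w) for w in text.split())

print(normalize_digits("call five five five one two three four"))
# call 5 5 5 1 2 3 4
```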

Fun-ASR 1.5 Key Features

Dialect and language coverage:
– 31 languages in the MLT-Nano variant (European, East Asian, Southeast Asian, Middle Eastern)
– 7 major Chinese dialect groups: Wu, Cantonese, Min (Hokkien), Hakka, Gan, Xiang, Jin
– 26 regional accents with dialect-specific character output (e.g., “侬” for Shanghainese “you”)
– Code-switching recognition – mixed Cantonese-English auto-detected, no language pre-tagging needed

Recognition accuracy:
– 56.2% reduction in character error rate vs. previous version for dialect scenarios
– 5 dialects at 90%+ accuracy
– Classical poetry recognition: 97% character-level accuracy (tonal errors change meaning)
– Far-field / high-noise environments (conference rooms, vehicles): 93% accuracy

Production features:
– Real-time streaming at 280ms CPU / 120ms GPU latency
– Hotword customization via JSON vocabulary files
– Music background lyric recognition
– Auto punctuation + text normalization (numbers, dates, amounts, phone numbers)
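
Hotword vocabularies like the JSON files mentioned above can be handled as follows. The schema and field names (`word`, `weight`) here are hypothetical, so check the FunASR documentation for the exact format:

```python
import json

# Hypothetical hotword file contents; the schema is illustrative only.
raw = '{"hotwords": [{"word": "voir dire", "weight": 3.0}, {"word": "plaintiff", "weight": 3.0}]}'

def load_hotwords(blob):
    """Parse a hotword JSON blob into (word, weight) pairs."""
    return [(e["word"], e["weight"]) for e in json.loads(blob)["hotwords"]]

print(load_hotwords(raw))  # [('voir dire', 3.0), ('plaintiff', 3.0)]
```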

Benchmarks: Fun-ASR vs. Whisper, GLM-ASR, SenseVoice

| Model | Type | Industry WER | Languages | Dialect Support |
|---|---|---|---|---|
| Fun-ASR (full, 7.7B) | Closed API | 12.70% | 30+ | Yes, 7 groups |
| Fun-ASR-Nano | Open-source | 16.72% | 31 | Yes, 7 groups |
| GLM-ASR-Nano | Open-source | 26.13% | Chinese only | None |
| Whisper-large-v3 | Open-source | 33.39% | 99 | None |
| SenseVoice | Open-source | ~22–28%* | 50+ | Limited |

*SenseVoice community estimates; no published WER.

Fun-ASR-Nano roughly halves Whisper-large-v3’s error rate on industry WER (16.72% vs 33.39%). On a 10,000-word transcript that means 1,672 errors versus 3,339 – roughly cutting correction time in half. On the LibriSpeech test-clean English benchmark, Fun-ASR-Nano records 1.76% WER versus Whisper’s 2.8%.
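
WER itself is just word-level Levenshtein distance divided by the reference word count, which is easy to verify with a short sketch:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

# One substitution ("sat" -> "sit") plus one deletion ("the") = 2 errors / 6 words.
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```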

The gap is wider on code-switching. A Cantonese-English mixed recording produces 18.3% WER through Fun-ASR-Nano versus 41.7% through Whisper-large-v3, because Whisper has no tone-contour models for Cantonese.

Where Whisper still wins: 99 languages vs. Fun-ASR’s 31. For low-resource languages like Swahili or Icelandic, Whisper remains the only viable open-source option.

Where GLM-ASR fits: Zhipu AI’s GLM-ASR-Nano processes Chinese at 180ms CPU latency (vs Fun-ASR’s 280ms) – useful when sub-200ms matters more than accuracy. But it collapses dialects into Mandarin approximations with 35–40% error rates. For deeper context on Zhipu’s model strategy, see our GLM-5 Turbo analysis.

SenseVoice runs 5x faster than Fun-ASR through quantization and adds emotion detection (anger, joy, sadness classification). Better for call center sentiment analysis than accurate transcription.


Real-World Use Cases

Conference and interview transcription: Fun-ASR processes a 90-minute meeting in ~4 minutes on GPU at $0.08 compute cost vs. $135–$180 for human transcription services, at 93% far-field accuracy.

Legal depositions: Hotword customization injects domain-specific terms (“plaintiff”, “voir dire”) with 3x probability weighting, reducing named entity errors from 8.4% to 2.1% on legal transcription benchmarks without model retraining.

Educational live streaming: Streaming mode at 280ms latency supports real-time subtitles. Classical literature platforms use the 97% poetry accuracy for recitation feedback apps.

Government hotlines: Regional dialect support lets Shanghai residents speak Shanghainese instead of switching to Mandarin, reducing call handling time by ~40% in municipal service centers.

Video subtitle generation and voice input: The Typeless AI Voice Keyboard review covers how similar voice-to-text pipelines power consumer apps – Fun-ASR’s 16.72% WER crosses the threshold where voice input is genuinely faster than typing (approximately 1 error per 6 words).

How to Set Up Fun-ASR 1.5

Option 1: Local inference (free, open-source)

Install via pip; the model weights download automatically from ModelScope on first run:

```bash
pip install funasr
```

Requirements: Python 3.8+, 8GB RAM, PyTorch 2.0+. GPU acceleration (CUDA 11.8+, 6GB VRAM) reduces inference from 4.2 sec/min to 0.9 sec/min of audio. The official GitHub repository includes 47 example scripts covering batch processing, real-time streaming, and hotword injection.
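
A minimal transcription sketch using FunASR’s `AutoModel` interface. The model ID below is a placeholder (check the ModelScope page for the exact Fun-ASR 1.5 name), and the optional `model` argument exists so the function can be exercised or mocked without downloading weights:

```python
def transcribe(audio_path, model=None):
    """Transcribe one audio file; loads FunASR lazily on first use."""
    if model is None:
        from funasr import AutoModel             # pip install funasr
        model = AutoModel(model="Fun-ASR-Nano")  # placeholder model ID
    result = model.generate(input=audio_path)
    return result[0]["text"]

# With FunASR installed:
#   print(transcribe("meeting.wav"))
```

Passing the loaded model in as an argument also lets batch jobs reuse it across files instead of reloading per call.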

Option 2: Alibaba Cloud Bailian API (enterprise)

Metered billing at $0.008/min – about 38% cheaper than Azure Speech Services ($0.013/min) – while achieving 12.70% WER vs. Azure’s 18.3% on Chinese benchmarks. Suitable for organizations requiring SLA guarantees and processing 10,000+ hours monthly.
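
The pricing comparison is simple arithmetic worth sanity-checking from the per-minute rates quoted above:

```python
def monthly_cost(hours, rate_per_min):
    """API cost in dollars for a volume of audio at a per-minute rate."""
    return hours * 60 * rate_per_min

fun_asr = monthly_cost(10_000, 0.008)  # $4,800 for 10k hours
azure = monthly_cost(10_000, 0.013)    # $7,800 for 10k hours
print(f"{fun_asr:.0f} vs {azure:.0f}: {1 - fun_asr / azure:.1%} cheaper")
```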

Choosing the right variant:
– General multilingual work → Fun-ASR-MLT-Nano (31 languages, free)
– Chinese + dialects only → Fun-ASR-Nano (same accuracy, slightly faster)
– Production scale with SLA → Full 7.7B model via Alibaba Cloud API

Model details are on the Hugging Face model card and the Fun-ASR technical report on arXiv.

Is Fun-ASR 1.5 Worth Using?

Use it if you’re building anything that needs Chinese dialect recognition, code-switched multilingual transcription, or simply the best open-source WER available. The 16.72% industry WER is a genuine step change, not a benchmark game.

The limitations are real but narrow: 31 languages vs. Whisper’s 99 means low-resource language coverage is a gap. 280ms CPU latency falls short of GLM-ASR’s 180ms for ultra-fast voice assistant apps.

The unique value is the dialect breakthrough: production-ready Cantonese, Shanghainese, Hokkien, and Hakka recognition in a single open-source model, deployable on-premise without cloud API dependencies. No other open-source ASR tool delivers this combination.

For the broader AI model landscape in 2026, the Gemini AI Agent review provides context on how specialized models are outpacing general-purpose systems in targeted benchmarks – a trend Fun-ASR 1.5 exemplifies clearly.

Fun-ASR 1.5 is the strongest open-source speech recognition model available for East Asian languages today. If Chinese dialect transcription is a requirement for your product, the evaluation starts and ends here.

✅ Pros:

  • 16.72% industry WER – roughly half the error rate of Whisper-large-v3
  • 7 Chinese dialect groups including Cantonese, Wu, Min, Hakka in one open-source model
  • MoE architecture auto-detects language and dialects without pre-tagging
  • Free on ModelScope under MIT license — full commercial use allowed
  • Real-time streaming at 280ms CPU / 120ms GPU latency for live applications

❌ Cons:

  • 31 languages vs Whisper’s 99 — low-resource language coverage is limited
  • 280ms CPU latency is slower than GLM-ASR-Nano’s 180ms for voice assistant apps
  • Full 7.7B model is proprietary and requires Alibaba Cloud API access
  • No emotion detection unlike SenseVoice for call center analytics
  • Requires Python 3.8+, 8GB RAM and CUDA setup for GPU acceleration

Frequently Asked Questions

What is Fun-ASR 1.5?

Fun-ASR 1.5 is an open-source AI speech recognition model developed by Alibaba’s Tongyi Lab, released April 20, 2026. It uses a Mixture of Experts (MoE) architecture trained on tens of millions of hours of real speech data, supports 31 languages and 7 Chinese dialect groups, and achieves 16.72% industry Word Error Rate — roughly half the error rate of Whisper-large-v3.

How does AI speech-to-text work?

AI speech-to-text converts audio waveforms into text through a multi-stage pipeline: audio is split into 25ms frames, transformed into mel-spectrograms, processed by an encoder neural network that maps acoustic features to language tokens, then a decoder predicts the most likely text sequence. Post-processing adds punctuation, normalizes numbers and dates, and applies custom hotwords. Fun-ASR adds a MoE gating layer that routes audio to language-specific expert modules for higher accuracy on dialects.

Is Fun-ASR better than Whisper?

For Chinese, Japanese, and other East Asian languages, yes. Fun-ASR-Nano achieves 16.72% industry WER vs Whisper-large-v3’s 33.39% — a 2:1 accuracy advantage. On Cantonese-English code-switching specifically, Fun-ASR records 18.3% WER versus Whisper’s 41.7%. For low-resource languages like Swahili or Icelandic, Whisper’s 99-language coverage makes it the better choice.

How do I install Fun-ASR?

Install via pip: `pip install funasr`. Model weights download automatically from ModelScope on first run. Requirements: Python 3.8+, PyTorch 2.0+, 8GB RAM minimum. For GPU acceleration: CUDA 11.8+ with 6GB VRAM reduces inference from 4.2 sec/min to 0.9 sec/min. Full code examples are on the GitHub repository at github.com/FunAudioLLM/Fun-ASR.

Is Fun-ASR free to use?

The Fun-ASR-Nano and Fun-ASR-MLT-Nano variants are fully open-source under MIT license on ModelScope and Hugging Face — free for commercial use including fine-tuning and on-premise deployment. The full 7.7B parameter model is proprietary and available via Alibaba Cloud Bailian API at $0.008 per minute of audio.
