Qwen 3.5 Omni: Alibaba’s AI Model Can Now Hear, Watch, and Clone Your Voice

In brief

Alibaba’s Qwen 3.5 Omni brings genuine real-time omnimodal AI to the frontier race.
Native audio-visual processing beats stitched multimodal pipelines in speed and coherence.
Voice cloning, semantic interruption, and vibe coding signal a shift toward fully interactive AI agents.

Alibaba just dropped its most ambitious AI upgrade yet.

The company's Qwen team released Qwen 3.5 Omni on Sunday, a new version of its "omnimodal" AI that simultaneously processes text, images, audio, and video and talks back in real time across 36 languages. The release puts the model on the same battlefield as the latest state-of-the-art foundation models.

1/10 🚀 Qwen3.5-Omni is here! Scaling up to a native omni-modal AGI.
Meet the next generation of Qwen, designed for native text, image, audio, and video understanding, with major advances in both quality and real-time interaction.
A standout feature:
Audio-Visual Vibe… pic.twitter.com/fWWyTl9cPY

— Tongyi Lab (@Ali_TongyiLab) March 30, 2026

"Omni" isn't just a marketing buzzword here. Most AI models you interact with are primarily text-in, text-out systems. Some handle images, some handle voice. Qwen 3.5 Omni handles all of them natively, at the same time, without the need to convert everything to text through third-party tools.

The new model comes in three sizes (Plus, Flash, and Light), all supporting a 256,000-token context window, which is small by today’s standards. It was trained on more than 100 million hours of audio-visual data, a scale that puts it in a different weight class from most competitors.

Qwen 3.5 Omni is an evolution of Qwen 3 Omni Flash, Alibaba's previous omnimodal model, released in December 2025. That version already impressed with its ability to process video and audio simultaneously: it could handle image editing instructions combining multiple visual inputs in ways competitors couldn't, and it streamed voice responses with latency as low as 234 milliseconds.

It was also the first model to attempt an alternative to Google’s NotebookLM. It achieved something, but the quality was not on par with Google’s offering.

Qwen 3.5 Omni takes all of that and adds a longer context window, better reasoning, a much wider language library, and a set of real-time interaction features the previous generation didn't have.

The headline upgrade is what happens when you actually talk to it. Qwen 3.5 Omni now supports semantic interruption: it can tell the difference between you saying "uh-huh" mid-sentence and actually wanting to cut in, so it won't stop mid-thought every time someone coughs in the background. That makes spoken interaction more seamless.

A new technique called ARIA, short for Adaptive Rate Interleave Alignment, also fixes a subtle but persistent annoyance: AI systems that garble numbers or unusual words when reading aloud. ARIA dynamically syncs text and speech to keep output natural and accurate.

Then there's voice cloning. Users can upload a voice sample and have the model adopt that voice in its responses, a feature that puts Qwen directly in competition with ElevenLabs and other dedicated voice tools. We weren’t able to access this feature, though, because for now it is only available via API.

On multilingual voice stability benchmarks, Qwen 3.5 Omni Plus beat ElevenLabs, GPT-Audio, and MiniMax across 20 languages. The model also now supports real-time web search, meaning it can answer questions about breaking news or live market data without pretending it already knows.

The team is also highlighting what it calls "Audio-Visual Vibe Coding": the model can watch a screen recording or video of a coding task and write functional code based purely on what it sees and hears, no text prompt required. It's a small preview of how AI assistants might eventually operate inside your workflow rather than alongside it.

To understand what "omnimodal" actually means in practice, we ran a quick test: we fed both Qwen 3.5 Omni and ChatGPT 5.4 in “thinking” mode the same YouTube Short, a clip of Dastan President (Dastan is Decrypt’s parent company) and commentator Farokh discussing breaking news. Qwen 3.5 Omni processed the video natively and returned a full analysis in about one minute: who was speaking, what they were discussing, and a substantive comment on the topic based on its own knowledge of the subject area.

ChatGPT 5.4, which is not omnimodal, had to make do with what it had. It extracted frames from the video, ran them through a vision model, used Whisper to transcribe the audio, and applied an OCR tool to read embedded subtitles: three separate processes stitched together to approximate what Qwen 3.5 Omni does in a single pass. The result took nine minutes, and that's under ideal conditions: a well-lit video with clean audio and burned-in subtitles. Real-world content rarely offers all three.

In our quick tests across multiple inputs, the model also handled prompts in Spanish, Portuguese, and English without issue, switching languages mid-conversation without losing context.

On standard benchmarks, Qwen 3.5 Omni Plus outperformed Gemini 3.1 Pro on general audio understanding, reasoning, and translation tasks, and matched it on audio-visual comprehension. Speech recognition now covers 113 languages and dialects, up from 19 in the previous generation.

This is Alibaba's second major AI release in six weeks. In February, it launched Qwen 3.5, a text-and-vision model that matched or beat frontier models on reasoning and coding benchmarks. That launch was part of a streak that has also included Qwen Deep Research and a lineup of tools rivaling OpenAI and Google. Qwen 3.5 Omni extends that momentum into fully multimodal territory, at a time when every major AI lab is racing to build systems that handle the full spectrum of human communication, not just words on a screen.

The model is available now via Alibaba Cloud's API and can be tested directly at Qwen Chat or through Hugging Face's online demo.