Vidu Q1 Review: Shengshu Tech's Underrated Reference-Image Video Model

Why Vidu Q1 deserves a spot in your video stack

If you live inside the Western generative-video bubble, your mental model probably looks like this: Sora at the top, Veo 3 nipping at its heels, Runway Gen-3 and Luma Dream Machine in the prosumer tier, Kling occasionally showing up in your X feed when somebody posts a viral kung fu clip. Vidu likely doesn't appear at all. That's a mistake.

Vidu is built by Shengshu Technology, a Beijing outfit spun out of Tsinghua University's machine learning group. Their pitch since day one has been a Diffusion-Transformer hybrid (they called the architecture U-ViT and were among the first to publish it, before DiT was famous), and the lineage shows. Q1 is the current flagship, and it does one thing better than almost anything I've used outside the major labs: it actually keeps the subject consistent when you give it reference images.

That sounds boring until you've spent two days trying to make Sora produce the same character in two different shots, or watched Runway turn your protagonist into a vaguely-related cousin between cuts. Vidu Q1's reference-to-video pipeline is the closest thing to a working "character + setting + object" multi-reference system that's actually shipping at consumer prices. For storyboarding, brand work, and any narrative content where continuity matters, that single capability outweighs a lot of the polish you get from heavier models.

The reference-image trick that makes it different

Most video models accept either text alone or a single first-frame image. Vidu Q1's headline feature, which Shengshu markets as "Reference to Video," lets you upload up to seven reference images representing distinct entities (a person, a costume, a prop, a location) and then write a prompt that composes them into a scene. The model attempts to preserve the identity of each reference and animate them together.

In practice this means you can drop in three character portraits and a location photo, then prompt a dialogue scene. You can provide a product shot plus a brand-style background plate and generate ad b-roll. You can hand it a costume reference and a face reference separately, and get them combined.

This is closer to ControlNet-style conditioning than the loose "vibe" of prompt-only video. It is not perfect. Faces drift more than the marketing implies, hands still betray the model's diffusion roots, and complex multi-character interactions can get muddled. But compared to the alternative (training a LoRA, faking continuity in post, or just praying), it's a real workflow shift.
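To make the multi-reference idea concrete, here is a minimal Python sketch of how such a job could be assembled and validated before it is sent anywhere. The `ReferenceRequest` class, the payload field names (`prompt`, `references`, `duration_s`, `resolution`), and the file names are hypothetical illustrations, not Vidu's documented API schema. Only the seven-image cap comes from the feature description above.

```python
from dataclasses import dataclass

# Cap taken from the feature description above (up to seven reference images).
MAX_REFERENCES = 7


@dataclass
class ReferenceRequest:
    """Hypothetical container for a multi-reference generation job."""

    prompt: str
    reference_paths: list[str]
    duration_s: int = 5
    resolution: str = "1080p"

    def to_payload(self) -> dict:
        """Validate the job and return a plain dict (field names are assumed)."""
        if not self.reference_paths:
            raise ValueError("at least one reference image is required")
        if len(self.reference_paths) > MAX_REFERENCES:
            raise ValueError(f"at most {MAX_REFERENCES} reference images allowed")
        return {
            "prompt": self.prompt,
            # Number the references so the prompt can say "reference 1", etc.
            "references": [
                {"index": i, "path": path}
                for i, path in enumerate(self.reference_paths, start=1)
            ],
            "duration_s": self.duration_s,
            "resolution": self.resolution,
        }


# Three character portraits plus a location photo (hypothetical file names).
request = ReferenceRequest(
    prompt="The three people from references 1-3 talk in the cafe from reference 4.",
    reference_paths=["character_a.jpg", "character_b.jpg", "character_c.jpg", "cafe.jpg"],
)
payload = request.to_payload()
```

Numbering the references explicitly mirrors how the prompts in this review refer to "reference 1", "reference 2", and so on; whatever the real schema looks like, keeping that mapping unambiguous is the part worth copying.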

Hands-on: five prompts I actually ran

I spent about a week pushing Q1 through the web studio and API. Here are five prompts that exposed how the model behaves. All ran at 1080p, mostly 5-second clips because that's the sweet spot for cost and quality.

Prompt 1, straight text-to-video, low complexity:

A neon-lit ramen shop on a rainy Tokyo backstreet at 2am.
Steam rising from a bowl on the counter.
Slow dolly-in. Anamorphic lens flares. Cinematic, shallow depth of field.
Duration: 5s

This is the kind of prompt every video model handles competently now. Vidu produced a clean clip, slight shimmer in the steam particles, reasonable bokeh. Comparable to Runway Gen-3 Alpha. Sora and Veo 3 still feel a tier above on micro-detail like rain interaction with surfaces, but the gap is narrower than the marketing suggests.

Prompt 2, reference-to-video with a single character reference:

[reference: portrait_of_woman_in_red_jacket.jpg]
The woman from the reference walks through a crowded
Hong Kong wet market, looking left and right, holding a paper bag.
Handheld documentary style, 35mm look, natural lighting.
Duration: 5s

This is where Q1 earns its keep. The character's face held together across the full clip, jacket color stayed correct, and the market environment was busy and plausible. The same workflow in Runway with image conditioning gave noticeable face drift by the end of the clip. Vidu kept it. It wasn't perfect: there was a moment where her hand morphed weirdly when it crossed in front of the bag, but the identity preservation was clearly in a different class.

Prompt 3, multi-reference composition:

[reference 1: man_in_business_suit.jpg]
[reference 2: golden_retriever_puppy.jpg]
[reference 3: minimalist_white_office.jpg]
The man from reference 1 sits at his desk in the office
from reference 3, holding the puppy from reference 2 in his lap.
He smiles at the camera. The puppy looks around curiously.
Duration: 5s

This is the demo Shengshu loves to show, and across my testing it worked about two-thirds of the time. This particular prompt took three generations to get a clean composite: the first attempt put the puppy at the wrong scale, and the second made the office walls inherit a subtle jacket texture from the suit reference, a known failure mode in multi-reference conditioning where features bleed between references. When it works, no other consumer model gets close. When it doesn't, you re-roll and pay again.
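Because every re-roll is billed, the success rate feeds straight into the real cost of a usable clip. The sketch below works through that arithmetic under two assumptions that are mine, not Shengshu's: each generation succeeds independently with the same probability (the two-thirds figure above), and each attempt costs 0.40 USD, the upper end of the API range quoted in the pricing section.

```python
def expected_attempts(success_rate: float) -> float:
    """Mean number of generations until the first usable clip (geometric model)."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return 1.0 / success_rate


def chance_of_success_within(success_rate: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - success_rate) ** attempts


def expected_cost_per_keeper(cost_per_attempt_usd: float, success_rate: float) -> float:
    """Expected spend to obtain one usable clip, counting failed re-rolls."""
    return cost_per_attempt_usd * expected_attempts(success_rate)


SUCCESS_RATE = 2 / 3        # "about two-thirds of the time"
COST_PER_ATTEMPT_USD = 0.40  # assumed: upper end of the quoted API range

attempts = expected_attempts(SUCCESS_RATE)                             # about 1.5
cost = expected_cost_per_keeper(COST_PER_ATTEMPT_USD, SUCCESS_RATE)    # about 0.60 USD
within_three = chance_of_success_within(SUCCESS_RATE, 3)               # about 0.96
```

Under those assumptions a clean multi-reference clip costs roughly one and a half times the sticker price, and budgeting three attempts per shot covers the large majority of cases.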

Prompt 4, stress test on motion:

A capoeira fighter performing a sequence of kicks and ground spins
in an empty warehouse. Wide shot. Camera holds steady.
Dust motes in shafts of light. Realistic physics, fast motion.
Duration: 5s

Fast human motion is still the Achilles' heel of every diffusion video model. Vidu Q1 produced something watchable but with the classic limbs-duplicate-at-peak-velocity artifact you also see in Kling and to a lesser extent in Runway. Sora and Veo 3 handle this category meaningfully better. If your work is action-heavy, Vidu is not the right tool.

Prompt 5, stylized anime-leaning:

Anime style, Studio Ghibli inspired. A young girl in a yellow raincoat
runs through a flooded rice paddy at sunset, splashing with each step.
Soft watercolor textures. Wide shot, cinematic. Warm orange light.
Duration: 5s

Stylized output is a Vidu strength. The Asian model labs have spent a lot of time tuning anime and semi-realistic stylized output, and it shows. Color palette held, movement felt hand-drawn rather than uncannily smooth, water splashes had visible illustrative strokes. Midjourney's still-image quality is obviously better, but Midjourney doesn't do video at all yet, and Vidu's stylized output is meaningfully ahead of Runway and Pika in this specific lane.

Pricing in USD

Shengshu prices Vidu by credits, and credits per second of generation depend on resolution, quality preset, and whether you're using the reference-to-video pipeline (which costs slightly more than text-to-video). Translating to USD as of this writing:

Free tier: enough credits for roughly 4-6 short clips per month, watermarked
Standard plan: around 8 USD per month, gives you roughly 80 credits, enough for 16-20 standard 5-second 1080p clips
Pro plan: around 20-25 USD per month, roughly 220 credits, plus higher-priority queueing
API pay-as-you-go: works out to roughly 0.20-0.40 USD per 5-second 1080p clip, with reference-to-video at the higher end
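The plan figures above imply a per-clip price, and it helps to work it out explicitly. The sketch below uses only the numbers quoted in this list; the one assumption it adds is that a standard 5-second 1080p clip costs the same number of credits on the Pro plan as on Standard, which the list does not state.

```python
def per_clip_usd(price_usd: tuple[float, float], clips: tuple[float, float]) -> tuple[float, float]:
    """Return (cheapest, priciest) USD per clip from price and clip-count ranges."""
    low_price, high_price = price_usd
    fewest_clips, most_clips = clips
    return (low_price / most_clips, high_price / fewest_clips)


# Standard plan: about 8 USD, about 80 credits, 16-20 clips.
standard_per_clip = per_clip_usd((8.0, 8.0), (16, 20))   # (0.40, 0.50)
credits_per_clip = (80 / 20, 80 / 16)                    # (4.0, 5.0) credits

# Pro plan: 20-25 USD, about 220 credits.
# Assumption: same credits per clip as on the Standard plan.
pro_clips = (220 // credits_per_clip[1], 220 // credits_per_clip[0])  # (44.0, 55.0)
pro_per_clip = per_clip_usd((20.0, 25.0), pro_clips)

print(f"Standard: {standard_per_clip[0]:.2f}-{standard_per_clip[1]:.2f} USD per clip")
print(f"Pro:      {pro_per_clip[0]:.2f}-{pro_per_clip[1]:.2f} USD per clip")
```

On these figures the Standard plan lands at about 0.40-0.50 USD per clip, slightly above the pay-as-you-go API range, so the subscription mostly buys convenience rather than a lower unit price.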

Compared to Western alternatives:

Runway Gen-3 Alpha Turbo: roughly 0.50 USD per second of video, so a 5-second clip is about 2.50 USD, several times more expensive than Vidu
Luma Dream Machine: 0.30-0.40 USD per second on the paid tiers
Sora: bundled into ChatGPT Pro at 200 USD per month, no real per-clip equivalent
Veo 3: gated through Google's tiers, generally only available bundled, but per-clip cost is in the same range as Runway

For volume work, Vidu is two to five times cheaper than the comparable Western tier. That ratio matters when you're iterating, because you will iterate. The first prompt rarely lands.
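One way to feel that ratio is to fix a budget and count clips. The sketch below takes 5 USD, the price of two 5-second Runway clips at the figure quoted above, and asks how many 5-second clips it buys elsewhere. All prices are the approximate figures from this section converted to cents per clip; the Vidu Standard range of 40-50 cents is derived from 8 USD for 16-20 clips, and Sora and Veo 3 are left out because no per-clip price is quoted for them.

```python
# Approximate (cheapest, priciest) cost of one 5-second clip, in US cents,
# taken from the figures quoted in this section.
PRICE_PER_CLIP_CENTS = {
    "Runway Gen-3 Alpha Turbo": (250, 250),  # 0.50 USD/s * 5 s
    "Luma Dream Machine": (150, 200),        # 0.30-0.40 USD/s * 5 s
    "Vidu Standard plan": (40, 50),          # 8 USD / 16-20 clips
    "Vidu API": (20, 40),                    # quoted per-clip range
}


def clips_for_budget(budget_cents: int, price_range_cents: tuple[int, int]) -> tuple[int, int]:
    """Return (fewest, most) whole clips the budget buys across the price range."""
    cheapest, priciest = price_range_cents
    return (budget_cents // priciest, budget_cents // cheapest)


BUDGET_CENTS = 500  # the price of two Runway clips at the quoted rate

results = {
    name: clips_for_budget(BUDGET_CENTS, prices)
    for name, prices in PRICE_PER_CLIP_CENTS.items()
}
for name, (fewest, most) in results.items():
    print(f"{name}: {fewest}-{most} clips for {BUDGET_CENTS / 100:.2f} USD")
```

Integer cents are used throughout so the clip counts are exact rather than subject to floating-point rounding.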

Strengths and weaknesses, honestly

Strengths first. The reference-to-video pipeline is the real thing and worth the trial alone. Stylized output (anime, semi-realistic illustration, watercolor, ink) is genuinely ahead of the Western mid-tier. Pricing is aggressive enough to be a working production tool rather than an experiment. The web studio is fast, even from outside China once you account for the latency hit. Camera control prompts (dolly, pan, push, orbit) are interpreted reasonably well, better than Pika in my testing.

Weaknesses, also real. Fast complex motion still produces artifacts. Hands and fine articulation are weak points, as with most diffusion video. English prompts work, but you can feel the model was tuned more on Chinese-language training data: it sometimes interprets English prompts a beat differently than you expect, and switching to more direct, less idiomatic phrasing helps. Q1 itself has no lip-sync feature of the kind Sora and Veo 3 are now landing. Multi-character scenes with interaction (a handshake, a hug, two people talking) are where things get glitchy fastest. And the model's understanding of Western pop culture references (specific characters, brands, locations) is noticeably thinner than Sora's.

Content moderation is the other thing to flag. Chinese-developed models, including Vidu, have stricter automated moderation than most Western platforms. Anything political-adjacent, certain historical references, anything the system reads as suggestive, and most violence will get blocked or silently neutered. If your creative work brushes against any of those edges, the friction will frustrate you. For brand, product, lifestyle, narrative fiction, and stylized work, you'll rarely hit it.

Best use cases for Western creators

The matches are clearer than the gaps:

Storyboard and previz work where character continuity across shots matters more than final polish
Product and brand video, where you have a hero asset (the product) and want it composited into varied scenes consistently
Stylized short-form content for social, especially anime-leaning or illustrative styles
Narrative shorts where you want a recurring character across multiple clips and don't want to train a LoRA
High-volume A/B testing of video creative for ads, where Vidu's cost-per-clip lets you generate ten variations for the price of two on Runway

It's a poor fit for action-heavy footage, photorealistic talking-head video where lip-sync matters, or anything that depends on tight cultural specificity to the West.

Accessing Vidu from outside China

The good news is Vidu has an English-language web product (vidu.com / vidu.studio depending on the route you land on) and an English API, so you don't strictly need a Chinese phone number, payment method, or VPN to use it. That puts it ahead of several other Chinese AI products in accessibility.

A few practical paths:

Direct web access through the official English studio, paid by international card. This works, but the latency penalty from Western locations to their inference clusters is real: expect generation times to feel 20-40% slower than the platform's claims, mostly from queue and transit overhead rather than actual compute time
Direct API access through their developer portal, with the same latency caveats; fine for batch work where you don't need sub-minute response
Third-party API gateways: at the time of writing, Vidu is not on Replicate's main catalog and not on OpenRouter (OpenRouter is text-LLM focused). Some aggregators like fal.ai, Pollo AI, and similar generative-media gateways have integrated Vidu, with varying markups (typically 10-25% over direct pricing) but better Western-region routing
Bundled access through Chinese-AI-aggregator platforms aimed at international users, useful if you want to A/B Vidu against Kling, Hailuo, and Wan in one dashboard, less useful if you're already committed to one model

For production use I'd recommend going direct to the API once you're past evaluation. The aggregator markups add up, and the direct API is genuinely accessible. For evaluation, the free tier on the web studio is enough to know within an afternoon whether the reference-to-video workflow fits your needs.
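To put a number on "the aggregator markups add up," the sketch below applies the 10-25% markup range quoted above to a month of production volume. The volume of 1,000 clips and the 40-cent direct price (the upper end of the quoted API range) are assumptions chosen for illustration, not measured figures.

```python
def gateway_overhead_cents(clips: int, direct_price_cents: int, markup_pct: int) -> int:
    """Extra spend, in whole cents, from buying through a gateway at a given markup."""
    direct_total = clips * direct_price_cents
    return direct_total * markup_pct // 100


MONTHLY_CLIPS = 1_000       # assumed production volume
DIRECT_PRICE_CENTS = 40     # assumed: upper end of the quoted API range
MARKUP_RANGE_PCT = (10, 25)  # typical gateway markup quoted above

direct_total_cents = MONTHLY_CLIPS * DIRECT_PRICE_CENTS
low_overhead, high_overhead = (
    gateway_overhead_cents(MONTHLY_CLIPS, DIRECT_PRICE_CENTS, pct)
    for pct in MARKUP_RANGE_PCT
)

print(f"Direct:   {direct_total_cents / 100:.2f} USD per month")
print(f"Overhead: {low_overhead / 100:.2f}-{high_overhead / 100:.2f} USD per month")
```

At that volume the gateway costs an extra 40-100 USD a month on a 400 USD direct bill, which is the trade to weigh against better Western-region routing.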

Bottom line

Vidu Q1 is not the best video model in the world. Sora and Veo 3 are clearly more capable on raw quality, motion, and cultural fluency. But "best" is rarely the question that matters. The question is what's the best tool for the specific job, and Vidu Q1 is the best tool I've used for "I need this character to look like the same person in shot 4 as in shot 1, and I need this at a price that lets me iterate."

If you're a Western creator, marketer, or developer who has never tried a Chinese video model because you assumed they were a tier behind or hard to access, Q1 is the one to start with. The reference-to-video workflow is the standout feature, the pricing is genuinely competitive, and the access friction is lower than that of its competitors from the same region.

Skip it if your work is action-heavy or lip-sync-dependent, if you brush against content moderation edges regularly, or if you need maximum photorealism and don't care about per-clip cost. Pick it up if you're doing brand, narrative, stylized, or character-consistency work; you'll probably end up using it more often than you expected.