Happy Horse Prompting Guide: Best Practices for creating videos on fal

We benched a couple hundred prompts on Happy Horse — Alibaba's text-to-video model, available on fal — to figure out what kinds of prompts actually land. The short answer: most shots only need about twenty words. Subject, action, setting, one cinematography cue. Past that, faces drift toward a generic average, hands lose geometry, gait flattens — extra detail eats into the same budget the model has for everything else.

The default prompt template

[Subject] [does action] in [setting], [time of day], [one atmosphere or camera cue].

Twenty words gives the model enough to commit and not enough to dilute. A few that hit the length and worked:

"A young woman in a red coat walks down a wet city street at night, neon reflections."
"A 1965 cherry-red Mustang convertible drives along a winding California coastal highway at midday."
"An orange tabby cat coiled on a velvet sofa leaps to a tall oak bookshelf."

Skip the wardrobe novel. Skip the lighting recipe. The first sentence does most of the work.

Why brevity wins for video prompts

Long prompts dilute. Every extra detail eats into the same budget — faces drift toward a generic average, hands lose geometry, gait flattens. Most obvious on human motion: "A child runs through a field" produced cleaner running biomechanics than the same scene padded with hair detail, dust kick-up, and arm-swing notes. The padded version had shorter strides, less weight on the feet, a slightly puppety quality.

Same scene, three prompt lengths. The middle one is the cleanest.

6 words — "A person walks down a street." Generic gait, wandering camera.

About 20 words — Specific enough, not over-loaded.

About 200 words — Wardrobe, lens, lighting, mood all stacked. Gait flattens out.

Same pattern on animal motion. Three words gave us the most convincing jump.

3 words — "A cat jumps." The arc lands.

Cinematic version — Slow-motion cue, lighting detail, dust mote. More flair, less convincing arc.

Anti-slop rules: words to avoid in AI video prompts

Words to drop: beautiful, stunning, amazing, gorgeous, masterpiece, epic, breathtaking, insane detail, ultra detailed, hyperrealistic.

Words that pay off instead: overcast daylight, wet asphalt, neon pink and cyan reflections in puddles, warm amber backlight on her shoulder, 35mm telephoto, shallow depth of field, single hard top-down key, deep falloff to black, sodium vapor street lamps, mid-afternoon sun on chrome.

Hedging adjectives drag the output toward a model-default look and cost you the specifics you wrote elsewhere. Pick one strong cinematography cue — a lens, a lighting recipe, or a camera move — and stop. Stacking five mostly cancels them out.

Stacked synonyms don't push harder either. "Crimson, scarlet, ruby, deep red" doesn't crank saturation. Pick one color and move on.

Most negative cues are wasted words. "No people in frame" earns its keep when there's a real risk a person appears. "No camera shake" rarely helps because the model isn't adding shake unless you ask for it. Cut every negative that doesn't address a concrete risk.

Bare director or DP name references almost never moved the needle. "Roger Deakins cinematography" alone barely registered across the bench. Describe the look in technique terms: "backlit silhouette, soft natural haze diffusing the dawn sun, restrained cool desaturated palette, slow tracking dolly behind." That beats the name by a clear margin.

When camera language earns extra prompt length

Long prompts pay off in one situation: when the shot leans on camera language.

Happy Horse is unusually good at camera moves. Steadicam pushes, slow dolly-ins, lateral orbits with parallax, helicopter aerials, locked-off framing under wind. If the shot depends on how the camera behaves, give the model room to read the direction. Put the camera cue at the end of the prompt — that's where it gets the most weight.

No camera language — Wandering camera, generic walk.

With tracking and telephoto cues — A real tracking shot lands, with the right lens compression.

Even when you spend the words, keep the camera language tight. Two or three concrete cues outrun a paragraph of decoration.

Two structures for long prompts: shot lists and markdown sections

Plain prose at long length confuses the model. When you really do have more than a sentence of content, use one of these.

Shot list with timecodes

For multi-beat shots, label every beat and pin it to a time range. The model reads the timing and stages the action.

Shot 1 (wide establishing, 0-1s): The camera pulls into the rain-slicked Manhattan side street at night; neon storefront signs glow on both sides. Shallow puddles on the pavement. Empty.

Shot 2 (mid tracking, 1-4s): The young woman in the deep crimson wool peacoat enters frame from the right, hands in pockets, walking briskly away from camera. The camera tracks alongside her at her pace; warm amber backlight skims her shoulder, cool blue ambient fills the shadows.

Shot 3 (slow push-in close, 4-5s): A slow dolly-in onto her face. Her breath is visible in the cold air, calm expression, raindrops in her hair.

This was the strongest multi-beat structure in our tests. Each beat reads cleanly. Camera moves land in the right windows. The kitchen three-beat — lift kettle, pour, set down — hit all three actions in shot-list form and collapsed into one confused motion when we wrote it as plain prose.

Plain prose three-beat — Beats blur into one continuous motion.

Shot list with timecodes — Three distinct beats land in the right windows.

Use concrete framing terms in each shot label: "wide establishing," "mid tracking," "slow push-in close," "low-angle wide," "macro close-up."

Markdown sections for single-take prompts

When the shot is one continuous take but you need to specify a lot of axes, split the prompt into headed sections. The cleanest single-shot output across the bench came from this structure.

## Subject
A young woman in her late twenties wearing a deep crimson wool peacoat, hands tucked in the coat pockets, breath faintly visible in the cold air.

## Action
Walks briskly down a rain-slicked Manhattan side street at night, her boots clicking on wet asphalt, the camera tracking smoothly alongside her at her own pace.

## Setting
Rain-slicked Manhattan side street, night. Neon storefront signs, shallow puddles, scattered drifting steam from a manhole cover behind her.

## Camera
Steady tracking shot, 35mm telephoto, shallow depth of field, sharp on her face, soft bokeh background.

## Lighting
Warm amber backlight skimming her shoulder, cool blue ambient filling the shadows, neon pink and cyan reflections in the puddles.

## Mood
Cinematic, intimate, contemplative.

Markdown sections — Same length as a paragraph version, but the axes don't bleed into each other.

With sections, action stays separate from lighting. Camera direction doesn't bleed into the wardrobe description. Roughly the same total length as a paragraph, with cleaner parsing.

Only use this when you have content for most sections. Empty headers hurt. If you only have a subject and a camera move, write twenty words and walk away.

Skip booru tags, JSON, and weighted parentheses

Stick with prose in English. Comma-separated keyword lists (booru style), JSON objects, weighted parentheses, and Mandarin all underperformed against the same content written as a normal English sentence. Reach for shot lists or markdown sections only when prose actually runs out of room.

Prompt patterns that work for Happy Horse

Camera moves

Plain English camera direction lands. Steadicam glides, slow dolly-ins, locked-off wide shots with a subject moving through frame, helicopter aerials over coastlines.

Tracking shot at her pace — 35mm telephoto, smooth follow.

Steadicam chase — Stabilized glide through a rainy alley.

Atmospheric lighting recipes

Blue hour, neon noir with mist and puddle reflections, single hard top-down key with deep falloff. The model picks up these cues and gives back convincing color and contrast.

Blue hour alley — Deep cyan twilight, far end warming with amber pubs.

Neon noir alley — Pink and cyan neon, mist, puddle reflections.

Vehicles and large rigid objects

Chrome highlights, reflections, and metallic paint render cleanly.

1965 Mustang coastal drive — Chrome and reflections render cleanly.

Cloth and fabric in wind

Capes, flags, hair on a windswept cliff. Secondary motion holds across the duration when the cloth is the main feature.

Cape on cliff at sunset — Long cape rippling in heavy ocean wind.

Fire and embers

Flames render with the right warmth and embers trace real arcs into the dark sky. If the prompt asks for an orbit or rising camera the framing pulls back, which can read as the fire shrinking in late frames. That's the camera move, not the model losing the flame.

Bonfire crackling at night — Strong flame and ember motion. Camera rises in the second half; the fire holds while the framing pulls back.

Wide establishing shots

Drone aerials and vast landscapes carry the shot on framing alone.

Aerial coastline drone — Drone glide and gradual altitude rise read as expected.

Mirrors and reflections

The reflection stays geometrically consistent with the source figure across the take.

Woman turning to mirror — Vintage dressing room. The reflection holds during the turn.

Short legible text

Book titles in window displays, signage with two or three words, simple labels. The model rendered "THE STARS BELOW" exactly as written. Long signage and dense paragraphs in-frame still hallucinate.

Bookstore window with embossed gold title — Title renders correctly. Slow dolly past holds focus.

Style anchors with visual translation

"Wong Kar-wai aesthetic" works when you also write out the visual elements — saturated greens and reds, slow motion, telephoto compression. The same anchor with no supporting visual cues did almost nothing.

Prompt patterns that struggle

Multi-step sequences in plain prose

"First X, then Y, then Z" tends to compress into a single motion. Use the shot-list format with timecodes for multi-beat work.

Slow-motion at extreme speeds

Cues like "1000fps slow-motion" rarely produce dramatic time dilation. The model slows the action somewhat. It does not freeze droplets in midair the way the prompt suggests. If you want time dilation, write it as a normal slow shot and accept what you get.

Wardrobe specifics under heavy motion

The bench rendered "yellow sundress" and "striped t-shirt" prompts as plain white tees once the children started running. Wardrobe details survive in static or slow shots; expect drift in fast action.

Copy-paste prompt library for fal

Drop these into fal and tweak the bracketed fields.

Twenty-word single-character beat

[Subject in one phrase] [does action] in [setting], [time of day], [one atmosphere cue].

The first sentence does most of the work.

Twenty-word walk shot — Coherent gait, neon reflections, atmospheric mood.

Single-character beat with camera move (40 to 60 words)

[Subject and one wardrobe detail] [does action] in [setting]. [Camera move + lens]. [Lighting cue]. [Mood word].

Spend the extra words on the camera and the lighting. Lens and light are what the model picks up cleanly.

Walk with tracking and telephoto — Tracking shot lands; bokeh and lens compression both read.

Multi-beat action with shot list

Shot 1 ([framing], 0-Xs): [setup beat]
Shot 2 ([framing], X-Ys): [action beat]
Shot 3 ([framing], Y-Zs): [resolution beat]

Use concrete framing terms in each shot label: "wide establishing," "mid tracking," "slow push-in close," "low-angle wide," "macro close-up."

Kitchen three-beat — Lift kettle, pour, set down. Three distinct beats from one prompt.

Atmospheric establishing shot (under 25 words)

[Setting] at [time of day], [one weather or atmosphere cue], [one composition cue].

Skip the subject for pure environment shots. Atmosphere carries the frame.

Neon noir alley — Mist, neon, puddle reflections. Mood from a sentence.

Continuous take with markdown sections

## Subject
[Subject and wardrobe]

## Action
[What happens, including motion direction]

## Setting
[Location, time of day, weather, props]

## Camera
[Camera move, lens, depth of field, framing]

## Lighting
[Key light direction, color temperature, contrast]

## Mood
[Two or three mood words]

Skip a section if you have nothing concrete for it. Empty headers hurt.

Walk in markdown sections — Same content as a long paragraph; the sections keep the parsing clean.

Working checklist before you generate

A quick pass before you submit on fal:

Subject and action in the first sentence?
Under thirty words, or a real reason to go longer?
If long: shot-list timecodes or markdown sections, not plain prose?
Exactly one strong cinematography cue, not five?
Cut every adjective that wasn't specific?
If a camera move is the point of the shot, is it the last thing in the prompt?
Multi-beat action: shot-list format?
Plain English prose, not booru tags, JSON, or weighted parentheses?

Highlights: three reference shots

Three from the bench that followed the rules above end-to-end.

Twenty-word car shot — 1965 Mustang on coastal highway. One subject, one camera move, one mood.

Cape on cliff, medium prose — Subject + secondary motion

sunset rim light.

Aerial coast, markdown sections — Long form earned its keep. Altitude, banking, lens compression all land.

Most shots land on the first generation if you stick to the templates above. If you're on the fifth or sixth generation of the same shot, the prompt is doing too much.