In an AI video generation field currently defined by models such as Sora, Kling, and Runway, Google's Veo 3 stands out for its unique ability to generate natively synchronized audio and video. It not only renders high-fidelity footage but also pairs it with fitting dialog, sound effects, and background music. Despite its relatively high cost, Veo 3 is undoubtedly one of the most technically comprehensive video generation models on the market today.
This article takes an in-depth look at prompt engineering for Veo 3, covering techniques from basic structure to advanced audio control. Mastering these methods can significantly improve video quality and reduce the cost of repeated trial and error. The core prompting principles also apply to other major video generation models.
Core Components of a Prompt
Precise, specific prompts are the foundation for getting the video you want. A well-structured prompt usually contains the following two types of key information:
1. Description of core content
This section defines the "what" and "where" of the video.
- Subject. The main character of the video. This can be one or more characters, animals or objects. Their physical appearance should be described as specifically as possible, such as ethnicity, hairstyle, dress, etc.
- Scene. The environment in which the subject is located, e.g., indoors, on a city street, in a forest, by the sea, etc.
- Action. An action being performed by the subject, such as walking, jumping, talking, or manipulating an object.
2. Audiovisual style settings
This section defines the "feel" and "presentation" of the video.
- Style. The overall artistic style of the video, e.g., cinematic, anime, claymation, Ghibli style.
- Camera Movement. Describe how the camera moves, such as pushing in (dolly in), pulling out (dolly out), panning (pan), or following the subject (tracking shot). Professional camera terms can greatly enhance the cinematic feel of your video.
- Composition. The framing of the shot, such as close-up, medium shot, or long shot. You can directly reuse the well-established composition keywords from MidJourney.
- Mood/Lighting. Describe the lighting and color tones of the image, such as warm tone, cool tone, eerie glow, or golden hour.
The following two examples show how strongly the level of detail in a prompt affects the generated result.
Simple prompt:
A man answers a rotary phone
Detailed prompt:
A shaky dolly zoom goes from a far away blur to a close-up cinematic shot of a desperate man in a weathered green trench coat as he picks up a rotary phone mounted on a gritty brick wall, bathed in the eerie glow of a green neon sign. The zoom reveals the tension and the desperation etched on his face as he struggles to talk on the phone. The shallow depth of field focuses on his furrowed brow and the black rotary phone, blurring the background into a sea of neon colors and indistinct shadows, creating a sense of urgency and isolation.
A detailed prompt not only defines the action but also builds mood, lighting, and a sense of narrative, resulting in video clips of far higher quality.
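The component list above can be treated as a small template. The following is a minimal sketch, not an official Veo 3 schema: the `PromptSpec` class, its field names, and the ordering of the parts are all assumptions made for illustration. It only shows how filling in each component turns a bare action into a fuller prompt string.

```python
from dataclasses import dataclass


@dataclass
class PromptSpec:
    """Hypothetical container for the prompt components described above."""
    subject: str
    action: str
    scene: str
    composition: str = ""
    camera: str = ""
    mood: str = ""
    style: str = ""

    def render(self) -> str:
        # Core content first: who is doing what, and where.
        core = f"{self.subject} {self.action} {self.scene}"
        # Audiovisual settings are optional; empty ones are skipped.
        extras = [self.composition, self.camera, self.mood, self.style]
        parts = [core] + [e for e in extras if e]
        # Normalize trailing periods so each part becomes one sentence.
        return ". ".join(p.strip().rstrip(".") for p in parts) + "."


spec = PromptSpec(
    subject="A desperate man in a weathered green trench coat",
    action="picks up a rotary phone",
    scene="mounted on a gritty brick wall",
    composition="Close-up with shallow depth of field",
    camera="Shaky dolly zoom from a far away blur",
    mood="Eerie glow of a green neon sign",
    style="Cinematic",
)
print(spec.render())
```

Leaving the optional fields empty reproduces the simple prompt, which makes it easy to compare generations with and without each component.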
Define the visual style of the video
By default, videos generated by Veo 3 lean toward a clean, professional commercial or cinematic look. To get a distinctive visual style, you must specify it explicitly in the prompt.
The following examples use the same core description, but apply different style directives.
Original core prompt:
A bearded man in a flannel shirt and weathered jeans sits cross-legged beside a flickering campfire, its amber light casting soft, dancing shadows across the pine-needle-strewn ground of a quiet forest clearing. Across from him, just beyond the edge of the firelight, stands a massive grizzly bear, calm and still, its fur catching the warm glow, eyes reflecting the flames with eerie intelligence. The two shake hands, like they’re old friends.
Adding "In the style of [style name]" at the beginning of the prompt above can produce very different results, for example: LEGO, Claymation, South Park, Simpsons, Pixar animation, 8-bit retro, Graphic novel, Origami, Blueprint, Anime, or Marble.
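When comparing many styles against one core description, it can help to generate the prompt variants programmatically. This is a rough sketch under two assumptions: the helper name `with_style` is hypothetical, and the choice to separate the style prefix from the core prompt with a period is mine (the article only says the phrase goes at the beginning). The core prompt here is a shortened stand-in for the campfire example.

```python
# A few of the style names listed in the article.
STYLES = ["LEGO", "Claymation", "Pixar animation", "8-bit retro", "Origami"]

# Shortened stand-in for the full campfire prompt above.
CORE_PROMPT = (
    "A bearded man in a flannel shirt sits beside a flickering campfire "
    "and shakes hands with a massive grizzly bear, like they're old friends."
)


def with_style(core_prompt: str, style: str) -> str:
    """Prefix the core prompt with a style directive (separator is an assumption)."""
    return f"In the style of {style}. {core_prompt}"


# One prompt per style, all sharing the same core description.
variants = {style: with_style(CORE_PROMPT, style) for style in STYLES}

for style, prompt in variants.items():
    print(f"[{style}] {prompt}")
```

Keeping the core description fixed means any difference between the resulting clips can be attributed to the style directive alone.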
Controlling Camera Movement
Camera movement is the cornerstone of the language of video. Veo 3 supports a wide range of standard camera commands, most commonly:
- eye level: eye-level shot
- high angle: high-angle shot
- worm’s eye: low-angle shot looking up (worm's-eye view)
- dolly shot: push-in or pull-out shot (the camera physically moves)
- zoom shot: zoom shot (zooming in or out)
- pan shot: panning shot (the camera rotates horizontally in place)
- tracking shot: the camera follows the subject
For example, use "Zoom in" to enlarge the frame, or "Left to right pan" to pan the camera from left to right.
Generate popular selfie-style videos
Selfie-style videos are popular for their authenticity and immersion. To generate realistic selfie videos in Veo 3, combine the following three core elements:
- "A selfie video of...": declares the video type as a selfie directly.
- "holds the camera at arm’s length. His arm is clearly visible in the frame.": describes the arm as visible in the frame, a key detail that enhances realism.
- "occasionally looking into the camera": having the character glance at the camera from time to time makes them appear more lively and natural.
Example:
A selfie video of a travel blogger exploring a bustling Tokyo street market. She’s wearing a vintage denim jacket and has excitement in her eyes. The afternoon sun creates beautiful shadows between the vendor stalls. She’s sampling different street foods while talking, occasionally looking into the camera before turning to point at interesting stalls. The image is slightly grainy, looks very film-like. She speaks in a British accent and says: “Okay, you have to try this place when you visit Tokyo. The takoyaki here is absolutely incredible, and the vendor just told me it’s been in his family for three generations.” She ends with a thumbs up.
Increase the diversity of generated results
Unlike image models such as MidJourney, Veo 3 tends to converge on very similar results when a simple prompt is run multiple times. For example, generating "a woman laughs" several times may yield videos that are extremely similar in character, attire, and scene.
The only way to break this homogeneity and obtain more diverse results is to increase the detail and complexity of the prompt, i.e., to follow the full structure presented in the first section.
For example, by adding scene and mood details, very different results can be obtained:
Prompt 1 (office scene):
a woman laughs long and loudly, she’s in an office meeting and she’s embarrassed afterwards
Prompt 2 (home scene):
a woman laughs quietly, she’s at home watching a tv show
Ensure character consistency
Maintaining character consistency across multiple videos is key to creating narrative content.
Preferred Solution: Image-to-Video
The most reliable method is to take advantage of Veo 3's support for image input. The recommended workflow is to first use a dedicated image tool (such as MidJourney's omni reference or Flux.1's Kontext mode) to generate a consistent character design sheet, and then feed it to Veo 3 as a visual reference.
Alternative: Text Prompts
If you don't use a reference image, you can take advantage of Veo 3's tendency to generate similar results from the same prompt. The trick is to provide an extremely detailed and consistent description of the character's physical characteristics in every prompt.
The following two prompts contain the same character description and generate characters that differ very little.
Prompt 1:
John, a man in his 40s with short brown hair, wearing a blue jacket and glasses, looking thoughtful, he says: Hello, I am also John, and I look kind of the same as that guy over there (no subtitles!). He is in a bright light room.
Prompt 2:
John, a man in his 40s with short brown hair, wearing a blue jacket and glasses, looking thoughtful, he says: Hello, my name is John, I am a character invented for this blog post (no subtitles!)
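Because the text-only approach depends on repeating the character description verbatim, one way to avoid accidental drift is to store the description once and reuse it. The sketch below assumes a hypothetical helper `character_prompt`; the phrasing pattern ("he says: ... (no subtitles!)") is copied from the two prompts above rather than from any documented Veo 3 syntax.

```python
# Single source of truth for the character's appearance.
JOHN = (
    "John, a man in his 40s with short brown hair, wearing a blue jacket "
    "and glasses, looking thoughtful"
)


def character_prompt(character: str, line: str, setting: str = "") -> str:
    """Build a prompt that reuses the exact same character description."""
    # The pronoun is hard-coded to match the example character.
    prompt = f"{character}, he says: {line} (no subtitles!)"
    if setting:
        prompt += f" {setting}"
    return prompt


clip_1 = character_prompt(
    JOHN,
    "Hello, I am also John, and I look kind of the same as that guy over there",
    setting="He is in a bright light room.",
)
clip_2 = character_prompt(
    JOHN,
    "Hello, my name is John, I am a character invented for this blog post",
)
print(clip_1)
print(clip_2)
```

Any change to the character's look is then made in one place and propagates to every clip in the series.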
Advanced video generation techniques in the Flow platform
Veo 3 is integrated into Google's Flow platform, which offers some unique advanced features.
- Specify Start and End Frames. The user can upload a start image and an end image, and Veo 3 automatically generates a transition video between the two, which is well suited to creating dynamic transitions.
- Extend and Jump to. These are two ways of extending a video. Extend continues generating content from the last frame of the current video and is suitable for extending a story linearly. Jump to extracts a character from the video and places them in a new scene, which is suitable for creative videos in which a character crosses between scenes.
- Ingredients to Video. This is a powerful fusion feature that lets users upload multiple reference images (e.g., a character, an object, a background), which Veo 3 merges into a single generated video. This feature is currently available only to Ultra subscribers ($250/month).
Strategies for Audio Prompts
Veo 3's core strength is audio generation. Here is how to control the audio content precisely.
Generating character dialog
1. Specifying exact lines
You can write the complete lines your character should say directly in the prompt. Be careful, though: Veo 3 limits the length of a single generation (usually 8 seconds). Lines that are too long result in fast, unnatural speech; lines that are too short may result in long stretches of silence or the character uttering meaningless filler words.
- Example of a long line.
John, a man in his 40s with short brown hair, wearing a blue jacket and glasses, looking thoughtful, he says: You have given me a really long prompt, and I have to speak very quickly and unnaturally to try and fit all these words into just 8 seconds, I’m going to be out of breath at the end of this, phew.
- Example of a short line.
John, a man in his 40s with short brown hair, wearing a blue jacket and glasses, looking thoughtful, he says: Hello, I’m John.
2. Setting a goal and letting the AI write the lines
A more efficient approach is to provide no specific lines and instead set a scene and a goal, letting Veo 3 generate the dialog on its own. This approach tends to yield more natural results.
- The AI writes its own joke.
a standup comic tells an awkward joke at a music festival, sounds of distant bands, noisy crowd, ambient background of a busy festival field (no studio audience)
- The exact joke is specified.
a standup comic tells an awkward joke at a music festival: You know what’s great about music festivals? Watching 20,000 people pretend they knew this band before today while filming vertical videos they’ll never watch.
Scenarios where letting the AI write the dialog works well include stand-up comedy, two-person discussions, phone arguments, and characters telling stories.
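When you do write exact lines, a quick word-count estimate can flag dialog that is unlikely to fit the roughly 8-second clip mentioned earlier. This is only a heuristic sketch: the speaking rate of 2.5 words per second and the 40% "too short" threshold are assumptions chosen for illustration, not values published for Veo 3, and the function names are hypothetical.

```python
CLIP_SECONDS = 8.0       # typical single-generation length cited in the article
WORDS_PER_SECOND = 2.5   # assumption: rough conversational English pace
MIN_FILL = 0.4           # assumption: under 40% of the clip counts as too short


def estimate_seconds(line: str, wps: float = WORDS_PER_SECOND) -> float:
    """Estimate speaking time from a whitespace-based word count."""
    return len(line.split()) / wps


def check_line(line: str, clip_seconds: float = CLIP_SECONDS) -> str:
    """Classify a line as 'too long', 'too short', or 'ok' for one clip."""
    seconds = estimate_seconds(line)
    if seconds > clip_seconds:
        return "too long"
    if seconds < clip_seconds * MIN_FILL:
        return "too short"
    return "ok"


long_line = (
    "You have given me a really long prompt, and I have to speak very quickly "
    "and unnaturally to try and fit all these words into just 8 seconds, "
    "I'm going to be out of breath at the end of this, phew."
)
short_line = "Hello, I'm John."
medium_line = "Hello, my name is John, I am a character invented for this blog post"

for text in (long_line, short_line, medium_line):
    print(check_line(text), "-", round(estimate_seconds(text), 1), "s")
```

The thresholds are worth tuning against your own generations, since accents, pauses, and delivery style all change the real speaking time.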
Challenges and Current Status of Generating Chinese Speech
Generating high-quality Chinese speech with Veo 3 is currently still a challenge.
- In Flow: the platform currently accepts only English prompts. A workaround for generating Chinese speech is to write the lines in Hanyu Pinyin and explicitly add the instruction "in Mandarin Chinese". Even then, the generated speech usually only resembles Mandarin in intonation and mouth shape and is not standard Mandarin.
- In Gemini: Gemini accepts multilingual input, so Chinese text can be written directly in the prompt. However, its backend model (currently mostly Veo 3 Fast) still handles Chinese unsatisfactorily.
Objectively speaking, because of differences in training data and tokenization, some Chinese-developed models (such as ByteDance's 即梦) currently show stronger Chinese speech generation.
How to avoid generating subtitles
Veo 3's training data contains a large number of subtitled videos, so generated results often come with subtitles as well. To suppress this, try the following two approaches:
- Put the spoken line after a colon (:) rather than inside quotation marks (""). Text inside quotation marks is more likely to be interpreted by the model as a subtitle to display.
- Explicitly add "no subtitles" at the end of the prompt.
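Both subtitle-suppression approaches can be applied mechanically to an existing prompt. The following sketch assumes a hypothetical helper `suppress_subtitles`; it removes straight or curly double quotes around spoken text and appends a "(no subtitles)" note, mirroring the parenthetical form used in the example prompts earlier. It is a text transformation only and makes no guarantee about what the model renders.

```python
import re

# Matches text wrapped in straight (") or curly (“ ”) double quotes.
QUOTED = re.compile(r'["\u201c]([^"\u201c\u201d]*)["\u201d]')


def suppress_subtitles(prompt: str) -> str:
    """Drop quotation marks around dialog and append a no-subtitles note."""
    # Keep the spoken words, remove only the surrounding quote characters.
    cleaned = QUOTED.sub(r"\1", prompt).strip()
    # Avoid adding the note twice if the prompt already contains it.
    if "no subtitles" not in cleaned.lower():
        cleaned += " (no subtitles)"
    return cleaned


print(suppress_subtitles('She says: "Hello there." She waves.'))
print(suppress_subtitles("He says: \u201cHi.\u201d"))
```

The function is idempotent, so it can be run safely over prompts that have already been cleaned.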
Generate Music
Music generation is relatively simple. You can describe the musical style, instrumentation, and tempo in detail in the prompt, or just give a general direction (e.g., "dramatic orchestral music") and let Veo 3 compose the music on its own.