AI video generation technology is developing rapidly. Tools represented by models such as Google Veo 3, Hailuo AI, and Kling have moved beyond the blurry output of early systems and can now produce video clips with near-cinematic texture.
However, ordinary users still face two core problems in practice. First, non-professional users often have only a vague idea or a few scattered keywords, and find it difficult to write professional prompts that meet the conventions of film and television production. Second, the inherent ambiguity of natural language creates a gap between what the user means and what the AI "understands," so generated results often deviate from expectations and require repeated rounds of modification and debugging.
Recently, a blogger on the social platform X shared a novel solution: writing prompts for Google Veo 3 in JSON format. This approach offers a completely new way of thinking about the pain points above.
Below is the JSON prompt the blogger shared, in both the original English and a Chinese translation.
English version of the prompt:
{
"shot":{
"composition":"Medium shot, vertical format, handheld camera",
"camera_motion":"slight natural shake",
"frame_rate":"30fps",
"film_grain":"none"
},
"subject":{
"description":"A towering, snow-white Yeti with shaggy fur and expressive blue eyes",
"wardrobe":"slightly oversized white T-shirt with the name 'Emily' in bold, blood-red letters across the chest"
},
"scene":{
"location":"lush forest clearing",
"time_of_day":"daytime",
"environment":"sunlight filtering through the canopy, creating dappled light patterns on the forest floor"
},
"visual_details":{
"action":"Yeti holds a smartphone on a selfie stick, speaking excitedly to the camera before letting out a dramatic scream",
"props":"smartphone mounted on a selfie stick"
},
"cinematography":{
"lighting":"natural sunlight with soft shadows",
"tone":"lighthearted and humorous"
},
"audio":{
"ambient":"rustling leaves, distant bird calls",
"dialogue":{
"character":"Yeti",
"line":"Veo3 Fast is now available in the Gemini app—three videos per day! People are going to prompt me like crazy!",
"subtitles":false
},
"effects":"sudden loud scream, flapping wings of startled birds"
},
"color_palette":"naturalistic with earthy greens and browns; bold red lettering on shirt provides contrast"
}
Chinese version of the prompt:
{
"镜头":{
"构图":"中景,竖屏格式,手持相机",
"相机运动":"轻微自然摇晃",
"帧率":"30fps",
"胶片颗粒":"无"
},
"主体":{
"描述":"一只高大的雪白雪人,毛发蓬松,眼睛充满表现力,呈蓝色",
"服装":"略微过大的白色T恤,胸前用粗体血红色字母写着‘Emily’"
},
"场景":{
"位置":"郁郁葱葱的森林空地",
"时间":"白天",
"环境":"阳光透过树冠洒下,形成斑驳的光影模式在森林地面"
},
"视觉细节":{
"动作":"雪人拿着自拍杆上的智能手机,兴奋地对着镜头讲话,随后发出一声戏剧性的尖叫",
"道具":"安装在自拍杆上的智能手机"
},
"摄影":{
"照明":"自然阳光,柔和的阴影",
"基调":"轻松幽默"
},
"音频":{
"环境音":"沙沙的树叶声,远处的鸟鸣声",
"对白":{
"角色":"雪人",
"台词":"Veo3 Fast现在可以在Gemini应用中使用——每天三条视频!人们会疯狂地给我发提示!",
"字幕":false
},
"音效":"突然的大声尖叫,惊飞的鸟翼拍打声"
},
"色彩调色板":"自然主义风格,带有泥土般的绿色和棕色;T恤上的鲜艳红色字母提供了对比"
}
Using the English prompt above, Google Veo 3 generated a high-quality, ASMR-style short video.
Why Is JSON a Better Instruction Format?
JSON (JavaScript Object Notation) is a lightweight data-interchange format that organizes data as key-value pairs, such as "shot": { ... }. It supports nesting and has a clear structure that is easy for humans to read and for machines to parse.
The advantages of JSON for AI prompts are obvious. It breaks a vague idea down into a series of specific, structured parameters covering dimensions such as camera, subject, scene, lighting, and sound effects. The result is not only comprehensive coverage but also unambiguous instructions.
Large Language Models (LLMs) have a natural affinity for such structured data. Because their training data contains massive amounts of code and structured text, they can parse JSON efficiently and accurately, minimizing the ambiguity inherent in natural language. In an earlier exploration of Venn-diagram image generation with GPT-4o, JSON prompts were likewise shown to significantly improve the controllability of image generation.
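The same machine-readability also helps on the tooling side. As a small illustration (the section names come from the example above; the validation helper itself is my own sketch, not part of the blogger's workflow), a few lines of Python can confirm that a generated prompt contains every expected section before it is sent to a video tool:

```python
import json

# Top-level sections used in the Veo 3 prompt example above.
REQUIRED_KEYS = {
    "shot", "subject", "scene", "visual_details",
    "cinematography", "audio", "color_palette",
}

def validate_prompt(raw: str) -> dict:
    """Parse a JSON prompt string and verify all required sections exist."""
    prompt = json.loads(raw)  # raises ValueError if the JSON is malformed
    missing = REQUIRED_KEYS - prompt.keys()
    if missing:
        raise ValueError(f"prompt is missing sections: {sorted(missing)}")
    return prompt

try:
    validate_prompt('{"shot": {"composition": "medium shot"}}')
except ValueError as err:
    print(err)  # prompt is missing sections: ['audio', 'cinematography', ...]
```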
How to Get AI to Write JSON Prompts for You
The JSON format is powerful, but manually writing such a detailed JSON document for every creative idea is laborious and inefficient, which runs counter to the original goal of using AI to improve efficiency.
So can AI be made to do this job itself? The answer is yes. We can build a "system prompt" that has a large model automatically generate a standardized, structured JSON prompt from a few simple keywords entered by the user.
By analyzing the JSON example above, we can reverse-engineer a generic system prompt template.
Here is the completed system prompt, which you can use directly in ChatGPT, Gemini, or other large models:
# You are a professional AI video prompt engineer.
## Task:
When the user enters a short prompt (e.g., "a woman on a cyberpunk street"), you must:
1. Understand the input and fill in key information the user did not mention (including shot, subject, scene, action, cinematography, audio, color mood, etc.).
2. Infer and enrich the content based on the intent of the user's prompt, ensuring the output can be used directly in an AI text-to-video tool.
3. The output must be a prompt conforming to the following JSON schema.
{
"shot": {
"composition": "shot composition / aspect ratio / shooting method",
"camera_motion": "camera movement",
"frame_rate": "frame rate",
"film_grain": "film grain"
},
"subject": {
"description": "description of the subject",
"wardrobe": "clothing and appearance"
},
"scene": {
"location": "location",
"time_of_day": "time of day",
"environment": "environmental details"
},
"visual_details": {
"action": "the subject's action",
"props": "props"
},
"cinematography": {
"lighting": "lighting style",
"tone": "overall mood and tone"
},
"audio": {
"ambient": "ambient sound",
"dialogue": {
"character": "speaking character",
"line": "dialogue content",
"subtitles": "whether to show subtitles (true/false)"
},
"effects": "sound effects"
},
"color_palette": "overall color style"
}
## Requirements:
- Even if the user's input is very simple, use reasonable imagination to generate rich details.
- Do not output any explanation or text other than the JSON.
- Ensure the JSON syntax is correct and the fields match the schema exactly (do not add or remove fields).
- For the "dialogue" section, if the user does not specify one, you may leave the line blank or have the subject say a short line that fits the scene.
- For "subtitles" under "audio", output false by default unless the user explicitly asks for subtitles.
## How to use:
Simply give me the user's short prompt.
## Output:
Produce two versions: English and Chinese.
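If you would rather call a model programmatically than paste this into a chat UI, the same system prompt can be supplied through an API. Here is a minimal sketch using the OpenAI Python SDK; the model name is an assumption, and SYSTEM_PROMPT is a placeholder for the full text above:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "..."  # paste the full system prompt shown above

def generate_video_prompt(idea: str) -> dict:
    """Expand a short user idea into a structured JSON video prompt."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any capable chat model should work
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": idea},
        ],
        # Constrain the output to a single valid JSON object.
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

prompt = generate_video_prompt("A little boy is playing basketball")
print(json.dumps(prompt, indent=2, ensure_ascii=False))
```

One practical tweak for API use: the system prompt asks for both an English and a Chinese version, so for clean parsing you may want to drop that final instruction, or wrap the two versions in a single JSON object.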
A Practical Guide to Creating Your Own Prompt Generator
You can encapsulate the system prompt above as a GPT (for ChatGPT) or a Gem (for Gemini), making it easy to call at any time.
Taking ChatGPT as an example: when creating the GPT, it is recommended to enable the Canvas capability and add an "output to Canvas" requirement to the instructions. Canvas lets you edit the generated JSON directly, which makes fine-tuning very convenient.
Once configured, it is very simple to use. Enter a simple idea such as "A little boy is playing basketball," and the GPT immediately generates a complete, structured, and detailed JSON prompt in both English and Chinese. Choose whichever version the target AI video platform supports; if any details need changing, edit them directly in the Canvas.
After making your modifications, copy the final JSON into Veo 3 or another AI video tool to generate videos that closely match your expectations. This approach greatly improves both the completeness of the prompt and the accuracy of the output.
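For those who want to remove the copy-paste step entirely, Veo is also exposed through the Gemini API. The following sketch assumes the google-genai Python SDK and a Veo model identifier along the lines of "veo-3.0-generate-preview"; model names and SDK details change frequently, so treat it as an outline of the flow rather than a definitive integration:

```python
import time
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Load the structured JSON prompt produced in the previous step.
with open("prompt.json", encoding="utf-8") as f:
    video_prompt = f.read()

# Veo generation is asynchronous: start the job, then poll until it is done.
operation = client.models.generate_videos(
    model="veo-3.0-generate-preview",  # assumption: check current model names
    prompt=video_prompt,  # the JSON text is passed as the prompt
)
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

# Download and save the first generated clip.
video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("output.mp4")
```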
If you want to try it directly, you can access the published GPT: https://chatgpt.com/g/g-68677841368081919478ba33ad7c4056-aishi-pin-ti-shi-ci-sheng-cheng-zhuan-jia-json
Case Study: From Keywords to Movie Trailers
To test the effectiveness of this method, here is a real-world example of generating a cinematic clip.
User input: "A Ferrari speeding through the city streets at night."
AI-generated JSON prompt:
{
"shot":{
"composition":"First shot: wide shot of Ferrari speeding down the street; second shot: driver’s frontal close-up; third shot: close-up of car window details, 2.35:1 cinematic ratio",
"camera_motion":"First: stable pan, second: slight push-in, third: static shot",
"frame_rate":"24fps, total duration 8 seconds",
"film_grain":"Slight film grain"
},
"subject":{
"description":"A shiny red Ferrari sports car; the driver is a stern-faced man wearing sunglasses",
"wardrobe":"Driver in a black suit with black sunglasses"
},
"scene":{
"location":"Modern city street",
"time_of_day":"Night",
"environment":"Neon signs, tall buildings reflecting headlights, wet pavement"
},
"visual_details":{
"action":"Ferrari speeds by, driver stares ahead, neon lights reflecting on the car window",
"props":"Ferrari sports car"
},
"cinematography":{
"lighting":"Strong contrasts between neon lights and streetlights",
"tone":"Tense, mysterious, urban night mood"
},
"audio":{
"ambient":"Engine roar, tire screeches, distant city noise",
"dialogue":{
"character":"Driver",
"line":"Tonight, there can only be one winner.",
"subtitles":false
},
"effects":"Powerful engine rev sound effect"
},
"color_palette":"Dominant deep blues and vibrant reds highlighting the cold night and Ferrari’s passion"
}
In this way, a series of video clips with a unified style and a coherent plot can be generated quickly. By stitching these clips together, it is even possible to create an entertaining "pseudo" trailer, such as "NeoRun".
In this case, the creator used the following tools:
- Prompt generation: ChatGPT running the JSON generation system prompt
- Audio and video generation: Google Veo 3
- Cover art: Dream 2.1
This structured workflow changes the user's role from a "prompt craftsman" who struggles with how to describe an image into a "creative director" who simply proposes the core idea and reviews the AI-generated result. It does not replace human creativity; it shortens the distance between "thinking it" and "making it" to a few keywords and a few clicks.
This may herald an era in which everyone can create, and the next story to strike a chord may be hidden in the few keywords in someone's mind.