AiHubMix provides a unified video generation API that is compatible with the OpenAI Sora interface format and supports models from multiple vendors on the backend.
Kling image-to-video supports an end frame and multi-image reference (up to 4 images).
How the reference image is passed for image-to-video depends on the vendor. Alibaba Tongyi Wanxiang takes it in the input_reference parameter. Doubao Seedance takes it in the extra_body.content array, which supports image, video, and audio reference types. Kling uses image / image_tail / image_list; see the Kling section below for details.
Once the status is completed, call this endpoint to download the MP4 video file.
Response: Returns the video binary stream directly (Content-Type: video/mp4).
Tip: The seconds parameter for all models is always passed as a string (e.g. "8").
Example
curl -X POST https://aihubmix.com/v1/videos \
  -H "Authorization: Bearer $AIHUBMIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sora-2",
    "prompt": "A cinematic drone shot soaring over a misty mountain range at sunrise, golden light filtering through the clouds",
    "seconds": "8",
    "size": "1280x720"
  }'
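The same request can be assembled from Python. The following is a minimal sketch using only the standard library; the helper name `build_create_request` is hypothetical, and the function only builds the request object without sending it. It converts `seconds` to a string, in line with the tip above.

```python
import json
import urllib.request

# Base URL taken from the curl example above.
BASE_URL = "https://aihubmix.com/v1"


def build_create_request(model, prompt, seconds, size, api_key):
    """Build (but do not send) a POST /v1/videos request.

    Hypothetical helper for illustration; not part of any official SDK.
    """
    payload = {
        "model": model,
        "prompt": prompt,
        "seconds": str(seconds),  # the API expects a string such as "8"
        "size": size,
    }
    return urllib.request.Request(
        f"{BASE_URL}/videos",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would submit the task; the shape of the JSON response is not shown here because the section does not document it.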
720p (default), 1080p, 4k (4K only for Veo 3+), or pixel format such as 1280x720, 1920x1080
Aspect Ratio: 16:9 (default), 9:16
Image-to-Video: Supported; pass the first-frame image via input_reference (Veo 3.1). When used, seconds is fixed at "8".
Example
curl -X POST https://aihubmix.com/v1/videos \
  -H "Authorization: Bearer $AIHUBMIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "veo-3.1-generate-preview",
    "prompt": "A tranquil Japanese garden, cherry blossom petals slowly drifting down, koi swimming in the pond, with the melodious sound of wind chimes in the background",
    "seconds": "8",
    "size": "1280x720"
  }'
Tip: Veo supports native audio generation; you can describe sound effects in the prompt, such as “the sound of birds chirping in the background” or “a piano melody”.
Note: wan2.6 supports only 720P and 1080P; wan2.5 supports 480P, 720P, and 1080P; wan2.2 supports only 480P and 1080P.
Example
curl -X POST https://aihubmix.com/v1/videos \
  -H "Authorization: Bearer $AIHUBMIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "wan2.6-t2v",
    "prompt": "A winding stream flows through an autumn forest, golden fallen leaves drifting on the water surface, sunlight casting dappled light and shadow through the leaves",
    "seconds": "5",
    "size": "1920x1080"
  }'
Tip: wan2.5 and above generate videos with sound by default (automatic dubbing); Chinese prompts work better.
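The resolution note for the Wan models can be checked on the client before submitting a task. This is a rough sketch: the tier sets are copied from the note above, while the helper name `wan_supports` and the assumption that the family is the part of the model name before the first hyphen (as in `wan2.6-t2v`) are illustrative only.

```python
# Resolution tiers per Wan model family, copied from the note above.
WAN_RESOLUTIONS = {
    "wan2.6": {"720P", "1080P"},
    "wan2.5": {"480P", "720P", "1080P"},
    "wan2.2": {"480P", "1080P"},
}


def wan_supports(model: str, tier: str) -> bool:
    """Return True if the model's family lists the given resolution tier.

    Assumes names follow the "<family>-<variant>" pattern, e.g. "wan2.6-t2v".
    """
    family = model.split("-", 1)[0]  # "wan2.6-t2v" -> "wan2.6"
    return tier.upper() in WAN_RESOLUTIONS.get(family, set())
```

The function works on tier names such as `720P`; mapping a pixel size such as `1920x1080` to a tier is left out because the section does not spell out that mapping.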
Supported, pass the image URL or base64 via input_reference
Supported Aspect Ratios and Corresponding Resolutions
| Aspect Ratio (size) | Actual Resolution |
| --- | --- |
| 16:9 or 1920x1080 | 1920×1088 |
| 9:16 or 1080x1920 | 1088×1920 |
| 4:3 or 1664x1248 | 1664×1248 |
| 3:4 or 1248x1664 | 1248×1664 |
| 1:1 or 1440x1440 | 1440×1440 |
| 21:9 or 2176x928 | 2176×928 |
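Because the output size can differ from the requested one (for example, 1920x1080 yields 1920×1088), a client that allocates buffers or validates downloads may want to look up the actual resolution. The sketch below simply encodes the table above; the names `ACTUAL_RESOLUTION`, `PIXEL_ALIASES`, and `actual_resolution` are hypothetical.

```python
# Actual output resolution (width, height) per aspect ratio, from the table above.
ACTUAL_RESOLUTION = {
    "16:9": (1920, 1088),
    "9:16": (1088, 1920),
    "4:3": (1664, 1248),
    "3:4": (1248, 1664),
    "1:1": (1440, 1440),
    "21:9": (2176, 928),
}

# Pixel-format values that the table lists as equivalent to each ratio.
PIXEL_ALIASES = {
    "1920x1080": "16:9",
    "1080x1920": "9:16",
    "1664x1248": "4:3",
    "1248x1664": "3:4",
    "1440x1440": "1:1",
    "2176x928": "21:9",
}


def actual_resolution(size: str) -> tuple:
    """Map a size value (ratio or pixel format) to the actual resolution.

    Raises KeyError for values that do not appear in the table.
    """
    ratio = PIXEL_ALIASES.get(size, size)
    return ACTUAL_RESOLUTION[ratio]
```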
Example
curl -X POST https://aihubmix.com/v1/videos \
  -H "Authorization: Bearer $AIHUBMIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jimeng-3.0-pro",
    "prompt": "A young woman in Hanfu dances gracefully amid a bamboo forest, her long dress flowing in the wind, with a faint morning mist in the background",
    "seconds": "5",
    "size": "16:9"
  }'
Defaults to true; set to false to generate a silent video
Watermark (watermark): Defaults to false
Multimodal Reference: Supports image, video, and audio
Reference Types Supported by extra_body.content
| Type | type Value | role Value | Description |
| --- | --- | --- | --- |
| Reference Image | image_url | reference_image | Visual/style reference image |
| Reference Video | video_url | reference_video | Camera movement/composition reference video |
| Reference Audio | audio_url | reference_audio | Background music audio file |
Example
Seedance 2.0 / 2.0 Fast
curl -X POST "https://aihubmix.com/v1/videos" \
  -H "Authorization: Bearer $AIHUBMIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "doubao-seedance-2-0-260128",
    "prompt": "Use the first-person POV framing from Video 1 throughout, and use Audio 1 as the background music for the entire clip. Create a first-person fruit tea commercial featuring the Seedance brand limited-edition apple fruit tea, \"Ping Ping An An.\" Opening frame: Image 1. From a first-person perspective, your hand picks a dew-covered Aksu red apple, accompanied by a crisp, satisfying bite-like tapping sound. Seconds 2–4: Fast-paced cuts. Your hand drops freshly cut apple chunks into a shaker, adds ice and tea base, then shakes vigorously. The sound of ice clinking and shaking syncs with upbeat percussion. Background voiceover: \"Freshly cut, freshly shaken.\" Seconds 4–6: First-person close-up of the finished drink. The layered fruit tea is poured into a clear cup. Your hand gently squeezes a creamy topping across the surface. A pink label is placed on the cup. The camera pushes in to highlight the rich texture and layering. Seconds 6–8: First-person hand holding the drink. You raise the fruit tea from Image 2 toward the camera, as if offering it directly to the viewer. The label is clearly visible. Background voiceover: \"Take a refreshing sip.\" Final frame: Freeze on Image 2. All background voiceovers should be in a female voice.",
    "extra_body": {
      "content": [
        { "type": "image_url", "image_url": { "url": "https://ark-project.tos-cn-beijing.volces.com/doc_image/r2v_tea_pic1.jpg" }, "role": "reference_image" },
        { "type": "image_url", "image_url": { "url": "https://ark-project.tos-cn-beijing.volces.com/doc_image/r2v_tea_pic2.jpg" }, "role": "reference_image" },
        { "type": "video_url", "video_url": { "url": "https://ark-project.tos-cn-beijing.volces.com/doc_video/r2v_tea_video1.mp4" }, "role": "reference_video" },
        { "type": "audio_url", "audio_url": { "url": "https://ark-project.tos-cn-beijing.volces.com/doc_audio/r2v_tea_audio1.mp3" }, "role": "reference_audio" }
      ],
      "ratio": "16:9",
      "duration": 11,
      "watermark": false
    }
  }'
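The entries in extra_body.content follow a regular pattern: the type value, the nested key, and the role all derive from the media kind. The sketch below builds such entries programmatically. The helper names `reference` and `build_extra_body` are hypothetical, and the order (images, then videos, then audio) merely mirrors the example above; whether order matters to the API is not stated in this section.

```python
def reference(kind: str, url: str) -> dict:
    """Build one extra_body.content entry for kind "image", "video" or "audio"."""
    if kind not in ("image", "video", "audio"):
        raise ValueError(f"unsupported reference kind: {kind!r}")
    key = f"{kind}_url"  # e.g. "image_url"
    return {"type": key, key: {"url": url}, "role": f"reference_{kind}"}


def build_extra_body(images=(), videos=(), audios=(), **native):
    """Assemble extra_body from reference URLs plus native options.

    Extra keyword arguments (e.g. ratio, duration, watermark) are copied
    through unchanged, as in the example above.
    """
    content = [reference("image", u) for u in images]
    content += [reference("video", u) for u in videos]
    content += [reference("audio", u) for u in audios]
    return {"content": content, **native}
```

For instance, `build_extra_body(images=[url1, url2], videos=[url3], audios=[url4], ratio="16:9", duration=11, watermark=False)` would reproduce the structure of the extra_body object in the curl example.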
Kling supports four types of capabilities: text-to-video, image-to-video, multi-image reference video, and OmniVideo multimodal. All are invoked through the unified /v1/videos endpoint, and the gateway automatically routes to the corresponding Kling endpoint based on “model name + input form”, with no need for the caller to differentiate.
Frame: 16:9 / 9:16 / 1:1 (required for omni pure text-to-video and video reference; defaults to 16:9 if omitted)
| Parameter | Type | Description |
| --- | --- | --- |
| cfg_scale | float | Prompt relevance in [0, 1], default 0.5 (not supported by kling-v2.x) |
| image | string | Image-to-Video: single image, as an image URL or Base64 (Base64 without the data:image/...;base64, prefix) |
| image_tail | string | Image-to-Video: end-frame image (optional) |
| image_list | array | Multi-image Reference: array of image URLs, up to 4 images |
| sound | string | omni: on/off, whether to generate native audio, default off |
| video_list | array | omni: reference video [{ "video_url": "...", "refer_type": "feature" }]; refer_type takes feature (video reference) / base (video editing) |
Unsupported or unmapped key parameters will raise an explicit error rather than being silently dropped. Other native Kling parameters can be placed in extra_body to pass through to the upstream.
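Two details of the Kling image parameters are easy to get wrong: Base64 input must not carry the data-URI prefix, and image_list accepts at most 4 images. The following sketch shows one way to enforce both on the client side. All function names are hypothetical, and treating image_tail as an option of the single-image form only is an assumption drawn from the parameter descriptions above.

```python
import re

# Matches a leading data-URI prefix such as "data:image/png;base64,".
_DATA_URI_PREFIX = re.compile(r"^data:image/[^;]+;base64,")


def kling_image_value(image: str) -> str:
    """Return the value unchanged, minus any data-URI prefix on Base64 input."""
    return _DATA_URI_PREFIX.sub("", image)


def image_to_video_fields(image: str, tail=None) -> dict:
    """Fields for single-image input, with an optional end frame."""
    fields = {"image": kling_image_value(image)}
    if tail is not None:
        fields["image_tail"] = kling_image_value(tail)
    return fields


def multi_image_fields(urls) -> dict:
    """Fields for multi-image reference; the table above caps this at 4 URLs."""
    urls = list(urls)
    if not 1 <= len(urls) <= 4:
        raise ValueError("image_list takes between 1 and 4 image URLs")
    return {"image_list": urls}
```

The returned dictionaries are meant to be merged into the request body next to model and prompt.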
Three asynchronous steps: submit to get video_id → poll GET /v1/videos/{video_id} until status is completed → GET /v1/videos/{video_id}/content to download the MP4. Status values: in_progress / completed / failed.
Video output usually takes 1–3 minutes; result video URLs are cleaned up after 30 days, so transfer and save them promptly.
Delete task: Kling has no delete endpoint; DELETE /v1/videos/{video_id} returns 501 not_supported.
Billing: charged by model × mode × duration × capability (with or without reference video / audio); no charge for failed generation, and queries and downloads are not billed.
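The polling step of the three-step flow described above can be sketched as a small loop. To keep the example independent of any HTTP client, the status lookup is injected as a callable. `fetch_status` is a hypothetical function that is assumed to wrap GET /v1/videos/{video_id} and return the parsed JSON as a dict with a "status" field. The 5-second interval and 10-minute timeout are illustrative defaults, chosen loosely around the 1–3 minute generation time mentioned above.

```python
import time


def wait_for_video(fetch_status, video_id, interval=5.0, timeout=600.0,
                   sleep=time.sleep):
    """Poll until the task is completed; raise if it fails or times out.

    fetch_status(video_id) is assumed to return a dict with a "status" key
    whose value is one of: in_progress, completed, failed.
    """
    waited = 0.0
    while True:
        task = fetch_status(video_id)
        status = task.get("status")
        if status == "completed":
            return task  # caller can now GET /v1/videos/{video_id}/content
        if status == "failed":
            raise RuntimeError(f"video task {video_id} failed")
        if waited >= timeout:
            raise TimeoutError(f"video task {video_id} still {status!r} after {waited:.0f}s")
        sleep(interval)
        waited += interval
```

Injecting `sleep` makes the loop easy to exercise in tests without real delays; in production code the default `time.sleep` is used.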
What is the difference between seconds and duration?
The two have the same meaning, both representing the video duration. The API supports both parameter names (except Sora, which only accepts seconds). We recommend using seconds consistently.
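A client that accepts either name can normalize to seconds before sending the request. This is a minimal sketch with a hypothetical helper name; it applies to top-level parameters only and also converts the value to a string, as the API expects for seconds.

```python
def normalize_duration(params: dict) -> dict:
    """Return a copy of params that uses "seconds" (as a string) only.

    If both names are present, the existing "seconds" value wins.
    """
    out = dict(params)
    if "duration" in out:
        duration = out.pop("duration")
        out.setdefault("seconds", duration)
    if "seconds" in out:
        out["seconds"] = str(out["seconds"])
    return out
```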