
Quick Start

Video generation is an asynchronous operation. The whole process is divided into three steps:
1. Submit task → get video_id
2. Poll status → wait until status becomes completed
3. Download video → get the MP4 file
Minimal Example
# Step 1: Submit the video generation task
curl -X POST https://aihubmix.com/v1/videos \
  -H "Authorization: Bearer $AIHUBMIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "wan2.6-t2v",
    "prompt": "A cat playing jazz on a piano, warm lighting, cinematic shot",
    "seconds": "5",
    "size": "1280x720"
  }'

# Example response:
# {
#   "id": "eyJtb2RlbCI6IndhbjI...",
#   "object": "video",
#   "status": "in_progress",
#   "model": "wan2.6-t2v",
#   "duration": 5,
#   "width": 1280,
#   "height": 720,
#   ...
# }

# Step 2: Poll the status (query every 15 seconds until status is completed)
curl https://aihubmix.com/v1/videos/{video_id} \
  -H "Authorization: Bearer $AIHUBMIX_API_KEY"

# Step 3: Download the video
curl https://aihubmix.com/v1/videos/{video_id}/content \
  -H "Authorization: Bearer $AIHUBMIX_API_KEY" \
  --output video.mp4

API Overview

| Endpoint | Method | Path | Description |
| --- | --- | --- | --- |
| Create Video | POST | /v1/videos | Submit a video generation task |
| Query Status | GET | /v1/videos/{video_id} | Query task status and progress |
| Download Video | GET | /v1/videos/{video_id}/content | Download the generated MP4 video |
| Delete Task | DELETE | /v1/videos/{video_id} | Delete a video task |

Base URL: https://aihubmix.com
Authentication: Bearer Token
Authorization: Bearer $AIHUBMIX_API_KEY

Supported Models

Text-to-Video

| Vendor | Model Name | Features |
| --- | --- | --- |
| OpenAI | sora-2 | Standard video generation, supports audio-video sync |
| OpenAI | sora-2-pro | High-quality version, more refined and stable visuals |
| Google | veo-3.1-generate-preview | Latest Veo 3.1, native audio, supports 4K |
| Google | veo-3.1-fast-generate-preview | Veo 3.1 fast version, faster generation speed |
| Google | veo-3.0-generate-preview | Veo 3.0, high-fidelity video |
| Google | veo-2.0-generate-001 | Veo 2.0, stable version |
| Alibaba | wan2.6-t2v | Latest Tongyi Wanxiang, audio-video sync |
| Alibaba | wan2.5-t2v-preview | Tongyi Wanxiang 2.5, optimized for Chinese |
| Alibaba | wan2.2-t2v-plus | Tongyi Wanxiang 2.2 |
| ByteDance | jimeng-3.0-pro | Jimeng 3.0 Pro, 1080P HD |
| ByteDance | jimeng-3.0-1080p | Jimeng 3.0 1080P |
| ByteDance | doubao-seedance-2-0-260128 | Professional-grade multimodal creative video model Seedance 2.0 |
| ByteDance | doubao-seedance-2-0-fast-260128 | Seedance 2.0 fast version |
| Kuaishou | kling-v3, kling-v2-6, kling-v2-5-turbo, kling-v2-1 | Kling text-to-video / image-to-video, newer versions support 3–15 seconds |
| Kuaishou | kling-v3-omni, kling-video-o1 | Kling OmniVideo multimodal, supports reference video, native audio, multi-shot |

Image-to-Video

| Vendor | Model Name | Features |
| --- | --- | --- |
| Alibaba | wan2.6-i2v | Latest Tongyi Wanxiang image-to-video |
| Alibaba | wan2.5-i2v-preview | Tongyi Wanxiang 2.5 image-to-video |
| Alibaba | wan2.2-i2v-plus | Tongyi Wanxiang 2.2 image-to-video |
| ByteDance | doubao-seedance-2-0-260128 | Multimodal reference inputs, supports image/video/audio |
| ByteDance | doubao-seedance-2-0-fast-260128 | Seedance 2.0 fast version |
| Kuaishou | kling-v1-6, etc. | Kling image-to-video, supports end frame and multi-image reference (up to 4 images) |
How the reference image is passed depends on the vendor: Alibaba Tongyi Wanxiang uses the input_reference parameter; Doubao Seedance uses the extra_body.content array, which accepts image, video, and audio references; Kling uses image / image_tail / image_list. See the Kling section below for details.

API Details

Request Headers

Authorization: Bearer $AIHUBMIX_API_KEY
Content-Type: application/json

Create a Video Generation Task

POST /v1/videos

Request Body

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Model name, e.g. wan2.6-t2v, sora-2 |
| prompt | string | Yes | Video description text |
| seconds | string | No | Video duration (seconds), always passed as a string, e.g. "5", "8" (see per-model details) |
| size | string | No | Resolution, format widthxheight, e.g. 1920x1080 (supported values vary by model) |
| input_reference | string/object | No | Reference image (image-to-video), supports URL or base64 |
Response formats vary slightly across models, but every response includes the id (used as video_id) and status fields. The status field alone is enough to track task progress.

Example Response (Tongyi Wanxiang / Veo / Jimeng AI)

{
  "id": "eyJtb2RlbCI6IndhbjI...",
  "object": "video",
  "created": 1772460274,
  "model": "wan2.6-t2v",
  "status": "in_progress",
  "prompt": "A cat watching the rain on a windowsill",
  "duration": 5,
  "width": 1920,
  "height": 1080,
  "url": null,
  "error": null
}
Example Response (Sora)
{
  "id": "eyJtb2RlbCI6InNvcmEtMi...",
  "object": "video",
  "created_at": 1772451930,
  "status": "queued",
  "model": "sora-2",
  "progress": 0,
  "prompt": "A cinematic drone shot over mountains",
  "seconds": "8",
  "size": "1280x720"
}

Common Status Values

| Status | Description |
| --- | --- |
| queued | Queued (Sora-specific) |
| in_progress | Generating |
| completed | Generation complete, ready to download |
| failed | Generation failed |
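
Because the two example response shapes above name their fields differently (created vs created_at, duration/width/height vs seconds/size), client code may find it convenient to flatten them first. The following is a minimal sketch based only on the fields shown in those examples; normalize_video is a hypothetical helper, not part of the API.

```python
from typing import Any, Dict, Optional


def normalize_video(resp: Dict[str, Any]) -> Dict[str, Optional[Any]]:
    """Flatten either example response shape into one dict (hypothetical helper)."""
    # Wan / Veo / Jimeng style: "created", "duration", "width", "height".
    # Sora style: "created_at", "seconds" (string), "size" ("WxH").
    created = resp.get("created", resp.get("created_at"))

    duration = resp.get("duration")
    if duration is None and resp.get("seconds") is not None:
        duration = int(resp["seconds"])

    width, height = resp.get("width"), resp.get("height")
    if (width is None or height is None) and resp.get("size"):
        w, _, h = str(resp["size"]).lower().partition("x")
        if w.isdigit() and h.isdigit():
            width, height = int(w), int(h)

    return {
        "id": resp["id"],          # present in every response
        "status": resp["status"],  # present in every response
        "created": created,
        "duration": duration,
        "width": width,
        "height": height,
    }
```

Only id and status are treated as guaranteed; every other key falls back to None when a model omits it.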

Query Video Status

GET /v1/videos/{video_id}
Poll this endpoint to check whether the task is complete. We recommend querying every 15 seconds.

Example Response (Generation Complete - Tongyi Wanxiang)

{
  "id": "eyJtb2RlbCI6IndhbjI...",
  "object": "video",
  "status": "completed",
  "model": "wan2.5-t2v-preview",
  "duration": 5,
  "width": 1920,
  "height": 1080,
  "url": "https://aihubmix.com/v1/videos/eyJtb2RlbCI6IndhbjI.../content",
  "error": null
}

Example Response (Generation Complete - Sora)

{
  "id": "eyJtb2RlbCI6InNvcmEtMi...",
  "object": "video",
  "created_at": 1772451930,
  "status": "completed",
  "completed_at": 1772452114,
  "expires_at": 1772538330,
  "model": "sora-2",
  "progress": 100,
  "prompt": "A cinematic drone shot over mountains",
  "seconds": "8",
  "size": "1280x720"
}
All models use status == "completed" to determine the completion state, then call the /content endpoint to download.
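
The polling rule above can be wrapped in a small loop that also gives up after a deadline. This is a sketch under stated assumptions: wait_for_video is a hypothetical helper, the 600-second timeout is an arbitrary choice rather than a documented limit, and fetch_status stands for any function that performs GET /v1/videos/{video_id} and returns the parsed JSON.

```python
import time
from typing import Any, Callable, Dict


def wait_for_video(
    fetch_status: Callable[[str], Dict[str, Any]],
    video_id: str,
    interval: float = 15.0,   # recommended polling interval
    timeout: float = 600.0,   # assumption: not a documented limit
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Poll until the task is completed or failed, or the deadline passes."""
    deadline = clock() + timeout
    while True:
        data = fetch_status(video_id)
        status = data.get("status")
        if status == "completed":
            return data
        if status == "failed":
            raise RuntimeError(f"video {video_id} failed: {data.get('error')}")
        if clock() >= deadline:
            raise TimeoutError(f"video {video_id} still {status!r} after {timeout}s")
        # queued / in_progress: wait before asking again
        sleep(interval)
```

Passing sleep and clock as parameters keeps the loop testable without real waiting; in normal use the defaults are enough.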

Download Video Content

GET /v1/videos/{video_id}/content
Once the status is completed, call this endpoint to download the MP4 video file. Response: Returns the video binary stream directly (Content-Type: video/mp4).
curl https://aihubmix.com/v1/videos/{video_id}/content \
  -H "Authorization: Bearer $AIHUBMIX_API_KEY" \
  --output my_video.mp4
Note: Download links are usually valid for 24 hours, so download and save the video promptly.
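
For large files it can help to stream the response to disk instead of holding it in memory. The sketch below uses only the Python standard library; download_video and its opener parameter are hypothetical names introduced for illustration, and the URL is the /content endpoint described above.

```python
import shutil
import urllib.request
from typing import Callable

BASE_URL = "https://aihubmix.com"


def download_video(
    video_id: str,
    api_key: str,
    dest_path: str,
    opener: Callable = urllib.request.urlopen,
) -> int:
    """Stream the MP4 to dest_path and return the number of bytes written."""
    request = urllib.request.Request(
        f"{BASE_URL}/v1/videos/{video_id}/content",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    # Copy in chunks so the whole file never has to sit in memory.
    with opener(request) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
        return out.tell()
```

Call it only after the status is completed, for example download_video(video_id, api_key, "my_video.mp4").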

Delete a Video Task

This endpoint is used to delete an already-created video task.
DELETE /v1/videos/{video_id}

Per-Model Parameter Details

OpenAI Sora

| Parameter | Supported Values |
| --- | --- |
| Model | sora-2, sora-2-pro |
| Duration (seconds) | "4" (default), "8", "12" |
| Resolution (size) | 720x1280 (default), 1280x720, 1024x1792, 1792x1024 |
| Image-to-Video | Supported, pass the image via input_reference |
Tip: The seconds parameter for all models is always passed as a string (e.g. "8").
Example
curl -X POST https://aihubmix.com/v1/videos \
  -H "Authorization: Bearer $AIHUBMIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sora-2",
    "prompt": "A cinematic drone shot soaring over a misty mountain range at sunrise, golden light filtering through the clouds",
    "seconds": "8",
    "size": "1280x720"
  }'

Google Veo

| Parameter | Supported Values |
| --- | --- |
| Model | veo-3.1-generate-preview (recommended), veo-3.1-fast-generate-preview (fast), veo-3.0-generate-preview, veo-2.0-generate-001 |
| Duration (seconds) | Veo 3/3.1: "4", "6", "8"; Veo 2: "5"~"8" (default "8") |
| Resolution (size) | 720p (default), 1080p, 4k (4K only for Veo 3+), or pixel format such as 1280x720, 1920x1080 |
| Aspect Ratio | 16:9 (default), 9:16 |
| Image-to-Video | Supported, pass the first-frame image via input_reference (Veo 3.1); when used, seconds is fixed at "8" |
Example
curl -X POST https://aihubmix.com/v1/videos \
  -H "Authorization: Bearer $AIHUBMIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "veo-3.1-generate-preview",
    "prompt": "A tranquil Japanese garden, cherry blossom petals slowly drifting down, koi swimming in the pond, with the melodious sound of wind chimes in the background",
    "seconds": "8",
    "size": "1280x720"
  }'
Tip: Veo supports native audio generation; you can describe sound effects in the prompt, such as “the sound of birds chirping in the background” or “a piano melody”.

Tongyi Wanxiang

| Parameter | Supported Values |
| --- | --- |
| Text-to-Video Models | wan2.6-t2v (recommended), wan2.5-t2v-preview, wan2.2-t2v-plus |
| Image-to-Video Models | wan2.6-i2v (recommended), wan2.5-i2v-preview, wan2.2-i2v-plus |
| Duration (seconds) | Varies by model (see below), default "5" |
| Resolution (size) | See the table below; both x and * separators are accepted (e.g. 1920x1080 or 1920*1080) |
| Image-to-Video | Pass the image URL or base64 via input_reference |
Duration Supported by Each Model
| Model | seconds Allowed Values | Default |
| --- | --- | --- |
| wan2.6-t2v / wan2.6-i2v | "2"~"15" (any integer value) | "5" |
| wan2.5-t2v-preview / wan2.5-i2v-preview | "5" or "10" | "5" |
| wan2.2-t2v-plus / wan2.2-i2v-plus | "5" (fixed) | "5" |
Supported Resolutions (width*height)
| Clarity | Available Resolutions |
| --- | --- |
| 480P | 832x480, 480x832, 624x624 |
| 720P | 1280x720 (default), 720x1280, 960x960, 1088x832 (4:3), 832x1088 (3:4) |
| 1080P | 1920x1080, 1080x1920, 1440x1440, 1632x1248 (4:3), 1248x1632 (3:4) |
Note: wan2.6 supports only 720P and 1080P; wan2.5 supports 480P, 720P, and 1080P; wan2.2 supports only 480P and 1080P.
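
The resolution table and the per-version note above can be checked on the client before submitting a task. The sketch below simply transcribes those two lists; parse_size and wan_size_supported are hypothetical helpers, and the gateway remains the authority on what is actually accepted.

```python
from typing import Tuple

# Resolution tiers copied from the table above.
WAN_RESOLUTIONS = {
    "480P": {(832, 480), (480, 832), (624, 624)},
    "720P": {(1280, 720), (720, 1280), (960, 960), (1088, 832), (832, 1088)},
    "1080P": {(1920, 1080), (1080, 1920), (1440, 1440), (1632, 1248), (1248, 1632)},
}
# Tiers per model family, copied from the note above.
WAN_TIERS = {
    "wan2.6": ("720P", "1080P"),
    "wan2.5": ("480P", "720P", "1080P"),
    "wan2.2": ("480P", "1080P"),
}


def parse_size(size: str) -> Tuple[int, int]:
    """Accept both '1920x1080' and '1920*1080'."""
    width, sep, height = size.lower().replace("*", "x").partition("x")
    if not sep or not width.isdigit() or not height.isdigit():
        raise ValueError(f"unrecognised size: {size!r}")
    return int(width), int(height)


def wan_size_supported(model: str, size: str) -> bool:
    """Return True if the table lists this size for the model's family."""
    family = model[:6]  # e.g. "wan2.6" from "wan2.6-t2v"
    tiers = WAN_TIERS.get(family)
    if tiers is None:
        raise ValueError(f"not a known Wan model: {model!r}")
    return any(parse_size(size) in WAN_RESOLUTIONS[tier] for tier in tiers)
```

For example, wan_size_supported("wan2.6-t2v", "832x480") is False because 480P is not listed for wan2.6.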
Example
curl -X POST https://aihubmix.com/v1/videos \
  -H "Authorization: Bearer $AIHUBMIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "wan2.6-t2v",
    "prompt": "A winding stream flows through an autumn forest, golden fallen leaves drifting on the water surface, sunlight casting dappled light and shadow through the leaves",
    "seconds": "5",
    "size": "1920x1080"
  }'
Tip: wan2.5 and above generate videos with sound by default (automatic dubbing); Chinese prompts work better.

Jimeng AI

| Parameter | Supported Values |
| --- | --- |
| Model | jimeng-3.0-pro (recommended), jimeng-3.0-1080p |
| Duration (seconds) | "5" or "10" (default "5") |
| Resolution (size) | Supports aspect ratio format or pixel format |
| Image-to-Video | Supported, pass the image URL or base64 via input_reference |
Supported Aspect Ratios and Corresponding Resolutions
| Aspect Ratio (size) | Actual Resolution |
| --- | --- |
| 16:9 or 1920x1080 | 1920×1088 |
| 9:16 or 1080x1920 | 1088×1920 |
| 4:3 or 1664x1248 | 1664×1248 |
| 3:4 or 1248x1664 | 1248×1664 |
| 1:1 or 1440x1440 | 1440×1440 |
| 21:9 or 2176x928 | 2176×928 |
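
Note that the output height for 16:9 is 1088 rather than the requested 1080 (and likewise the width for 9:16), which matters if downstream code assumes exact dimensions. A minimal lookup, transcribed from the table above, might look like this; jimeng_output_resolution is a hypothetical helper.

```python
from typing import Tuple

# Aspect ratio -> actual output resolution, copied from the table above.
JIMENG_OUTPUT = {
    "16:9": (1920, 1088),
    "9:16": (1088, 1920),
    "4:3": (1664, 1248),
    "3:4": (1248, 1664),
    "1:1": (1440, 1440),
    "21:9": (2176, 928),
}
# Pixel-format values listed alongside each ratio.
JIMENG_ALIASES = {
    "1920x1080": "16:9",
    "1080x1920": "9:16",
    "1664x1248": "4:3",
    "1248x1664": "3:4",
    "1440x1440": "1:1",
    "2176x928": "21:9",
}


def jimeng_output_resolution(size: str) -> Tuple[int, int]:
    """Return the (width, height) the table lists for a requested size."""
    ratio = JIMENG_ALIASES.get(size, size)
    try:
        return JIMENG_OUTPUT[ratio]
    except KeyError:
        raise ValueError(f"size not listed for Jimeng: {size!r}") from None
```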
Example
curl -X POST https://aihubmix.com/v1/videos \
  -H "Authorization: Bearer $AIHUBMIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jimeng-3.0-pro",
    "prompt": "A young woman in Hanfu dances gracefully amid a bamboo forest, her long dress flowing in the wind, with a faint morning mist in the background",
    "seconds": "5",
    "size": "16:9"
  }'

Doubao Seedance

| Parameter | Supported Values |
| --- | --- |
| Model | doubao-seedance-2-0-260128, doubao-seedance-2-0-fast-260128 |
| Resolution (resolution) | "480p", "720p" (default) |
| Duration (duration) | Integer, range 4~15, or -1 (model decides automatically) |
| Aspect Ratio (ratio) | "adaptive" (default, auto-adapts), "16:9", "9:16", "1:1", "4:3", "3:4", "21:9" |
| Audio (generate_audio) | Defaults to true; set to false to generate a silent video |
| Watermark (watermark) | Defaults to false |
| Multimodal Reference | Supports image, video, and audio |
Reference Types Supported by extra_body.content
| Type | type Value | role Value | Description |
| --- | --- | --- | --- |
| Reference Image | image_url | reference_image | Visual/style reference image |
| Reference Video | video_url | reference_video | Camera movement/composition reference video |
| Reference Audio | audio_url | reference_audio | Background music audio file |
Example
Seedance 2.0 / 2.0 Fast
curl -X POST "https://aihubmix.com/v1/videos" \
  -H "Authorization: Bearer $AIHUBMIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "doubao-seedance-2-0-260128",
    "prompt": "Use the first-person POV framing from Video 1 throughout, and use Audio 1 as the background music for the entire clip. Create a first-person fruit tea commercial featuring the Seedance brand limited-edition apple fruit tea, \"Ping Ping An An.\"\n\nOpening frame: Image 1. From a first-person perspective, your hand picks a dew-covered Aksu red apple, accompanied by a crisp, satisfying bite-like tapping sound.\n\nSeconds 2–4: Fast-paced cuts. Your hand drops freshly cut apple chunks into a shaker, adds ice and tea base, then shakes vigorously. The sound of ice clinking and shaking syncs with upbeat percussion. Background voiceover: \"Freshly cut, freshly shaken.\"\n\nSeconds 4–6: First-person close-up of the finished drink. The layered fruit tea is poured into a clear cup. Your hand gently squeezes a creamy topping across the surface. A pink label is placed on the cup. The camera pushes in to highlight the rich texture and layering.\n\nSeconds 6–8: First-person hand holding the drink. You raise the fruit tea from Image 2 toward the camera, as if offering it directly to the viewer. The label is clearly visible. Background voiceover: \"Take a refreshing sip.\"\n\nFinal frame: Freeze on Image 2.\n\nAll background voiceovers should be in a female voice.",
    "extra_body": {
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://ark-project.tos-cn-beijing.volces.com/doc_image/r2v_tea_pic1.jpg"
          },
          "role": "reference_image"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://ark-project.tos-cn-beijing.volces.com/doc_image/r2v_tea_pic2.jpg"
          },
          "role": "reference_image"
        },
        {
          "type": "video_url",
          "video_url": {
            "url": "https://ark-project.tos-cn-beijing.volces.com/doc_video/r2v_tea_video1.mp4"
          },
          "role": "reference_video"
        },
        {
          "type": "audio_url",
          "audio_url": {
            "url": "https://ark-project.tos-cn-beijing.volces.com/doc_audio/r2v_tea_audio1.mp3"
          },
          "role": "reference_audio"
        }
      ],
      "ratio": "16:9",
      "duration": 11,
      "watermark": false
    }
  }'
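
When the request is assembled in code rather than by hand, the entries of extra_body.content can be generated from the type/role pairs in the table above. This is a sketch only: reference and seedance_payload are hypothetical helpers, and the option names passed through (ratio, duration, watermark) are the ones used in the example request.

```python
from typing import Any, Dict, Iterable, Tuple

# kind -> (type value, role value), copied from the reference-type table.
_REFERENCE_KINDS = {
    "image": ("image_url", "reference_image"),
    "video": ("video_url", "reference_video"),
    "audio": ("audio_url", "reference_audio"),
}


def reference(kind: str, url: str) -> Dict[str, Any]:
    """Build one entry of extra_body.content."""
    type_key, role = _REFERENCE_KINDS[kind]
    # The URL is nested under a key named after the type value.
    return {"type": type_key, type_key: {"url": url}, "role": role}


def seedance_payload(
    model: str,
    prompt: str,
    references: Iterable[Tuple[str, str]],
    **options: Any,
) -> Dict[str, Any]:
    """Assemble the request body; options go into extra_body unchanged."""
    return {
        "model": model,
        "prompt": prompt,
        "extra_body": {
            "content": [reference(kind, url) for kind, url in references],
            **options,
        },
    }
```

Serializing the result with json.dumps (or passing it as json= to an HTTP client) escapes quotation marks and line breaks inside a long prompt automatically.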

Kling

Kling offers four capabilities: text-to-video, image-to-video, multi-image reference video, and OmniVideo multimodal. All are invoked through the unified /v1/videos endpoint. The gateway routes each request to the corresponding Kling endpoint based on the model name and the inputs supplied, so callers do not need to choose an endpoint themselves.
| Capability | Models |
| --- | --- |
| Text-to-Video / Image-to-Video | kling-v1, kling-v1-5, kling-v1-6, kling-v2-1, kling-v2-5-turbo, kling-v2-6, kling-v3 |
| Multi-image Reference | kling-v1-6 |
| OmniVideo Multimodal | kling-video-o1, kling-v3-omni |
Parameters
| Parameter | Type | Description |
| --- | --- | --- |
| model | string | Required, kling-*, determines capability and version |
| prompt | string | Text prompt |
| negative_prompt | string | Negative prompt |
| mode | string | Generation mode: std (720P) / pro (1080P) / 4k, default std |
| duration / seconds | string | Duration (seconds); older models 5/10, newer models 3~15, default 5 |
| aspect_ratio | string | Frame: 16:9 / 9:16 / 1:1 (required for omni pure text-to-video and video reference; defaults to 16:9 if omitted) |
| cfg_scale | float | Prompt relevance [0, 1], default 0.5 (not supported by kling-v2.x) |
| image | string | Image-to-Video: single image, image URL or Base64 (Base64 without the data:image/...;base64, prefix) |
| image_tail | string | Image-to-Video: end-frame image (optional) |
| image_list | array | Multi-image Reference: array of image URLs, up to 4 images |
| sound | string | omni: on/off, whether to generate native audio, default off |
| video_list | array | omni: reference video [{ "video_url": "...", "refer_type": "feature" }]; refer_type takes feature (video reference) / base (video editing) |
Unsupported or unmapped key parameters will raise an explicit error rather than being silently dropped. Other native Kling parameters can be placed in extra_body to pass through to the upstream.
Example
curl https://aihubmix.com/v1/videos \
  -H "Authorization: Bearer $AIHUBMIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kling-v1-6",
    "prompt": "An orange cat running on a sunlit grassy meadow",
    "mode": "std",
    "duration": "5"
  }'
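
The image parameter expects Base64 without the data:image/...;base64, prefix, which differs from the data URL form accepted by input_reference elsewhere. The sketch below shows one way to prepare the value; kling_image_value and kling_image_from_bytes are hypothetical helpers, not part of the API.

```python
import base64
import re

# Matches a leading "data:image/<subtype>;base64," prefix.
_DATA_URL_PREFIX = re.compile(r"^data:image/[^;]+;base64,")


def kling_image_value(image: str) -> str:
    """Return a value suitable for Kling's `image` field."""
    if image.startswith(("http://", "https://")):
        return image  # URLs are passed through unchanged
    # Strip the data URL prefix if present; bare Base64 is left as-is.
    return _DATA_URL_PREFIX.sub("", image, count=1)


def kling_image_from_bytes(data: bytes) -> str:
    """Encode raw image bytes as bare Base64 (no prefix)."""
    return base64.b64encode(data).decode("ascii")
```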
Notes
  • Three asynchronous steps: submit to get video_id → poll GET /v1/videos/{video_id} until status is completed → GET /v1/videos/{video_id}/content to download the MP4. Status values: in_progress / completed / failed.
  • Video output usually takes 1–3 minutes; result video URLs are cleaned up after 30 days, so transfer and save them promptly.
  • Delete task: Kling has no delete endpoint; DELETE /v1/videos/{video_id} returns 501 not_supported.
  • Billing: charged by model × mode × duration × capability (with or without reference video / audio); no charge for failed generation, and queries and downloads are not billed.

Complete Invocation Examples

import os
import time

import requests

# Read the key from the environment, as the curl examples do
API_KEY = os.environ["AIHUBMIX_API_KEY"]
BASE_URL = "https://aihubmix.com"
HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

# Step 1: Create the video generation task
response = requests.post(
    f"{BASE_URL}/v1/videos",
    headers=HEADERS,
    json={
        "model": "wan2.6-t2v",
        "prompt": "A desert under a starry sky, a meteor streaking across the night sky, the glow of a distant campfire flickering in the breeze",
        "seconds": "5",
        "size": "1920x1080"
    }
)
response.raise_for_status()  # Surface HTTP errors (e.g. an invalid key) before parsing
result = response.json()
video_id = result["id"]
print(f"Task created, video_id: {video_id}")

# Step 2: Poll the status
while True:
    status_response = requests.get(
        f"{BASE_URL}/v1/videos/{video_id}",
        headers=HEADERS
    )
    status_data = status_response.json()
    current_status = status_data["status"]
    print(f"Current status: {current_status}")

    if current_status == "completed":
        print("Video generation complete!")
        break
    elif current_status == "failed":
        error_msg = status_data.get("error") or {}
        if isinstance(error_msg, dict):
            error_msg = error_msg.get("message", "Unknown error")
        # Stop here: a failed task has no video to download
        raise SystemExit(f"Generation failed: {error_msg}")

    time.sleep(15)  # Query every 15 seconds

# Step 3: Download the video
video_response = requests.get(
    f"{BASE_URL}/v1/videos/{video_id}/content",
    headers=HEADERS
)
with open("output.mp4", "wb") as f:
    f.write(video_response.content)
print(f"Video saved as output.mp4 ({len(video_response.content) / 1024 / 1024:.1f} MB)")

FAQ

How long does video generation take?

Video generation usually takes 1-5 minutes, depending on the model, resolution, and duration. We recommend setting a 15-second polling interval.

How do I use the input_reference parameter?

input_reference is used in image-to-video scenarios and supports three ways of passing input:
// Method 1: Pass the image URL directly
"input_reference": "https://example.com/image.jpg"

// Method 2: Pass a base64-encoded image (object format)
"input_reference": {
  "mime_type": "image/jpeg",
  "data": "<BASE64_ENCODED_IMAGE>"
}

// Method 3: Pass a data URL
"input_reference": "data:image/jpeg;base64,<BASE64_ENCODED_IMAGE>"
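
Methods 2 and 3 can be produced from raw image bytes with the standard library. The helpers below are hypothetical and only illustrate the two encodings shown above; the default MIME type of image/jpeg is an assumption and should match the actual file.

```python
import base64
from typing import Dict


def input_reference_object(data: bytes, mime_type: str = "image/jpeg") -> Dict[str, str]:
    """Method 2: object format with mime_type and Base64 data."""
    return {
        "mime_type": mime_type,
        "data": base64.b64encode(data).decode("ascii"),
    }


def input_reference_data_url(data: bytes, mime_type: str = "image/jpeg") -> str:
    """Method 3: a single data URL string."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

Typical usage reads the file in binary mode, for example input_reference_object(open("image.jpg", "rb").read()).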

What are the differences in the seconds parameter across models?

| Model | Allowed Values | Default |
| --- | --- | --- |
| Sora (sora-2 / sora-2-pro) | "4", "8", "12" | "4" |
| Veo 3/3.1 (veo-3.1-generate-preview, etc.) | "4", "6", "8" | "8" |
| Veo 2 (veo-2.0-generate-001) | "5"~"8" | "8" |
| Tongyi Wanxiang wan2.6 | "2"~"15" | "5" |
| Tongyi Wanxiang wan2.5 | "5", "10" | "5" |
| Tongyi Wanxiang wan2.2 | "5" (fixed) | "5" |
| Jimeng AI (jimeng-3.0-pro, etc.) | "5", "10" | "5" |
| Doubao Seedance (doubao-seedance-2-0-*) | integer duration, 4~15 or -1 | 5 |
| Kling new versions (kling-v2-x / kling-v3, etc.) | "3"~"15" | "5" |
| Kling old versions (kling-v1 / kling-v1-5 / kling-v1-6) | "5", "10" | "5" |
Tip: The seconds parameter for all models is always passed as a string (e.g. "8"), and the API handles it automatically.
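
A client-side pre-check can catch an unsupported duration before a task is submitted. The sketch below transcribes the table above into prefix rules; seconds_allowed is a hypothetical helper, the prefix matching is an assumption about model naming, and Doubao Seedance is left out because it takes an integer duration rather than seconds.

```python
from typing import Optional

# (model-name prefix, allowed values in seconds), copied from the table above.
_SECONDS_RULES = [
    ("sora-2", {4, 8, 12}),
    ("veo-3", {4, 6, 8}),
    ("veo-2", set(range(5, 9))),       # "5"~"8"
    ("wan2.6", set(range(2, 16))),     # "2"~"15"
    ("wan2.5", {5, 10}),
    ("wan2.2", {5}),
    ("jimeng-", {5, 10}),
    ("kling-v1", {5, 10}),
    ("kling-v2", set(range(3, 16))),   # "3"~"15"
    ("kling-v3", set(range(3, 16))),
]


def seconds_allowed(model: str, seconds: str) -> Optional[bool]:
    """True/False if the table covers the model, None if it does not."""
    if not isinstance(seconds, str):
        raise TypeError('seconds must be a string such as "8"')
    for prefix, allowed in _SECONDS_RULES:
        if model.startswith(prefix):
            return seconds.isdigit() and int(seconds) in allowed
    return None  # model not covered by this sketch
```

A None result means the sketch has no rule for the model, not that the value is invalid.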

What are the differences in the size parameter format across models?

| Model | Supported size Values |
| --- | --- |
| Sora | 1280x720, 720x1280, 1024x1792, 1792x1024 |
| Veo | pixel format (1280x720, etc.) or resolution labels (720p, 1080p, 4k) |
| Tongyi Wanxiang | pixel format, both x and * accepted (e.g. 1920x1080 or 1920*1080) |
| Jimeng AI | aspect ratio format (16:9, 9:16, etc.) or pixel format |
| Doubao Seedance | aspect ratio format ("adaptive", "16:9", "9:16", etc.) |
| Kling | does not use size; uses mode (std/pro/4k controls clarity) + aspect_ratio (16:9/9:16/1:1 controls frame) |

What is the difference between seconds and duration?

The two have the same meaning, both representing the video duration. The API supports both parameter names (except Sora, which only accepts seconds). We recommend using seconds consistently.
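
To follow the recommendation of using seconds consistently, a request body can be normalized before it is sent. This is a minimal sketch; with_seconds is a hypothetical helper and only touches top-level keys, so Doubao Seedance's integer duration inside extra_body is left untouched.

```python
from typing import Any, Dict


def with_seconds(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy that uses only `seconds`, always as a string."""
    body = dict(payload)  # shallow copy; the caller's dict is not modified
    if "duration" in body:
        duration = body.pop("duration")
        # An explicit `seconds` wins if both names were supplied.
        body.setdefault("seconds", str(duration))
    if "seconds" in body:
        body["seconds"] = str(body["seconds"])
    return body
```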

How do I write better prompts?

  • Describe specific scenes: include subject, action, environment, lighting, atmosphere
  • Specify camera language: such as “close-up”, “aerial shot”, “push-in shot”, “slow motion”
  • Describe style: such as “cinematic”, “documentary style”, “animation style”
  • Chinese models work better with Chinese prompts: Tongyi Wanxiang is optimized for Chinese
  • Veo supports audio descriptions: you can describe sounds in the prompt, such as “birds chirping” or “a piano melody”

How do I handle a failed task?

When status is failed, the error field in the response contains error information:
{
  "status": "failed",
  "error": {
    "message": "Video generation failed due to content policy violation",
    "type": "video_generation_error"
  }
}

Common failure reasons include: content violations, prompt too long, unsupported image format, etc. Adjust based on the error message and retry.

Last updated: 2026-06-01