Tencent’s Hunyuan-GameCraft transforms single images into interactive gaming videos

Summary

Tencent has released Hunyuan-GameCraft, an AI system that generates interactive videos from individual images.

Unlike standard video generators that produce fixed clips, GameCraft lets users steer the camera in real time with WASD or the arrow keys, allowing free movement through the generated scenes. The system is built on Tencent’s open-source text-to-video model HunyuanVideo, and Tencent says it delivers especially smooth, consistent camera motion.

Video: Tencent

The framework supports three axes of translation (forward/backward, left/right, and up/down) plus two axes of rotation for looking around. Camera roll is left out, a movement Tencent says is rarely used in games. An action encoder translates keyboard input into numerical values the video generator understands, and it also factors in speed, so motion scales with how long a key is held.
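To make the idea concrete, here is a minimal sketch of how a key press and its hold time could be turned into a continuous camera-motion vector and embedded for the generator. The class, dimensions, and layer sizes are illustrative assumptions, not Tencent's actual implementation.

```python
# Illustrative sketch of an action encoder: a key press plus its hold time
# becomes a small camera-motion vector, which an MLP projects into the
# generator's embedding space. Names, dimensions, and layer sizes here are
# assumptions for illustration, not Tencent's implementation.
import torch
import torch.nn as nn

# Translation keys only (forward/back, left/right); yaw and pitch would come
# from look controls, and camera roll is deliberately unsupported.
KEY_TO_AXIS = {"w": 0, "s": 0, "a": 1, "d": 1}
KEY_SIGN = {"w": 1.0, "s": -1.0, "a": -1.0, "d": 1.0}

class ActionEncoder(nn.Module):
    """Projects a 5-D camera-motion vector into the video model's embedding space."""
    def __init__(self, action_dim: int = 5, embed_dim: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(action_dim, 256), nn.SiLU(), nn.Linear(256, embed_dim)
        )

    def forward(self, action: torch.Tensor) -> torch.Tensor:
        return self.mlp(action)

def key_press_to_action(key: str, hold_seconds: float) -> torch.Tensor:
    """Longer presses mean larger displacement along the chosen axis."""
    action = torch.zeros(5)  # [forward/back, left/right, up/down, yaw, pitch]
    action[KEY_TO_AXIS[key]] = KEY_SIGN[key] * hold_seconds
    return action

encoder = ActionEncoder()
embedding = encoder(key_press_to_action("w", hold_seconds=0.8))
print(embedding.shape)  # torch.Size([1024])
```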

Hunyuan-GameCraft’s architecture: caption, image/video, noisy input, mask, and action inputs are encoded by a 3DVAE and an action encoder, then processed by double- and single-stream DiT blocks; during continuous generation, a binary mask separates history frames from the frames still to be generated for actions such as W, A, S, and D. | Image: Tencent

Hybrid training for long, consistent videos

To keep video quality high over longer sequences, GameCraft uses a training technique called Hybrid History-Conditioned Training. Instead of generating everything at once, the model creates each new video segment step by step, drawing on earlier segments. Each video is broken into roughly 1.3-second chunks. A binary mask tells the system which parts of each frame already exist and which still need to be generated, helping the model stay both consistent and flexible.

Comparison of conditioning strategies on the same game scene for W, S, and A inputs: a training-free approach (a) shows clear quality loss, pure history clip conditioning (b) shows control issues, and Hunyuan-GameCraft’s Hybrid History Conditioning (c) gives the best results. | Image: Tencent

According to Tencent, training-free approaches often result in visible quality drops, while pure history conditioning hurts responsiveness to new input. The hybrid strategy combines both, producing videos that stay smooth and consistent while reacting instantly to user input, even during long sessions.
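In code, the chunkwise conditioning can be pictured roughly like this: latents from already generated frames are concatenated with fresh noise for the next chunk, and a binary mask marks which positions already exist. The shapes and names below are assumptions for illustration, not the model's real interface.

```python
# Rough sketch of chunkwise history conditioning: previously generated latent
# frames are concatenated with fresh noise for the next ~1.3-second chunk, and
# a binary mask marks which positions already exist (1) versus which still
# need to be generated (0). Shapes are assumptions; the real model works on
# 33-frame segments in its own latent space.
import torch

def build_chunk_input(history_latents: torch.Tensor, chunk_len: int,
                      latent_channels: int = 16, spatial=(45, 80)):
    noise = torch.randn(chunk_len, latent_channels, *spatial)  # frames to generate
    latents = torch.cat([history_latents, noise], dim=0)
    mask = torch.zeros(latents.shape[0], 1, *spatial)
    mask[: history_latents.shape[0]] = 1.0
    return latents, mask

# In the hybrid training scheme, the amount of history passed in would be
# varied (e.g. a single frame vs. a whole previous clip) so the model stays
# both consistent over time and responsive to new key presses.
history = torch.randn(8, 16, 45, 80)      # previously generated latent frames
latents, mask = build_chunk_input(history, chunk_len=9)
print(latents.shape, mask.shape)          # (17, 16, 45, 80) and (17, 1, 45, 80)
```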

Training on more than a million gameplay videos

GameCraft was trained on over a million gameplay recordings from more than 100 AAA titles, including Assassin’s Creed, Red Dead Redemption, and Cyberpunk 2077. Scenes and actions were automatically segmented, filtered for quality, annotated, and given structured descriptions.

Hunyuan-GameCraft’s data pipeline: raw recordings are split by scene and action, filtered for quality (rejecting oversaturated or overly dark clips and UI overlays), annotated with interactions, and given structured captions. | Image: Tencent
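As a small illustration of the quality-filtering step, a check along the following lines could reject overly dark or oversaturated clips; the thresholds and function are placeholders, not Tencent's actual tooling.

```python
# Minimal, illustrative quality filter in the spirit of the described pipeline:
# reject clips that are overly dark or oversaturated. Thresholds are assumptions.
import numpy as np

def passes_quality_filter(frames: np.ndarray,
                          min_brightness: float = 0.15,
                          max_saturation: float = 0.9) -> bool:
    """frames: uint8 array of shape (T, H, W, 3) in RGB."""
    rgb = frames.astype(np.float32) / 255.0
    brightness = rgb.mean()                                # overly dark clips fail here
    maxc, minc = rgb.max(axis=-1), rgb.min(axis=-1)
    saturation = np.where(maxc > 0, (maxc - minc) / np.maximum(maxc, 1e-6), 0).mean()
    return brightness >= min_brightness and saturation <= max_saturation

clip = (np.random.rand(16, 90, 160, 3) * 255).astype(np.uint8)  # dummy clip
print(passes_quality_filter(clip))
```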

Developers also created 3,000 motion sequences from digital 3D objects. Training ran in two phases across 192 Nvidia H20 GPUs for 50,000 iterations. In head-to-head tests with Matrix-Game, GameCraft cut interaction errors by 55 percent. It also delivered better image quality and more precise control than specialized camera control models like CameraCtrl, MotionCtrl, and WanX-Cam.

To make GameCraft practical, Tencent added a Phased Consistency Model (PCM) that speeds up video generation. Instead of running every step of the typical diffusion process, PCM skips intermediate steps and jumps straight to plausible final frames, boosting inference speed by 10 to 20 times.
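The speedup comes from replacing many small denoising steps with a few large jumps. The sketch below shows the general few-step, consistency-style sampling pattern; it is a generic illustration of the idea rather than the actual PCM code, and the noise schedule and stand-in model are made up for the example.

```python
# Generic illustration of few-step, consistency-style sampling: the model jumps
# straight to a clean estimate from a noisy input, so a handful of evaluations
# replaces ~50 iterative denoising steps. Not the actual PCM implementation.
import torch

@torch.no_grad()
def sample_few_step(model, shape, sigmas=(80.0, 10.0, 1.0, 0.0)):
    x = torch.randn(shape) * sigmas[0]
    for sigma, next_sigma in zip(sigmas[:-1], sigmas[1:]):
        x0 = model(x, sigma)                        # direct estimate of the clean sample
        x = x0 + torch.randn_like(x0) * next_sigma  # re-noise to the next, lower level
    return x

dummy_denoiser = lambda x, sigma: x / (1.0 + sigma)   # stand-in for the video model
frames = sample_few_step(dummy_denoiser, shape=(1, 16, 33, 45, 80))
print(frames.shape)  # three model calls instead of ~50
```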

GameCraft reaches an effective generation rate of 6.6 frames per second, with input response times under five seconds. The output itself plays back at 25 fps and is produced in 33-frame segments at 720p resolution. This balance of speed and quality makes interactive control practical.
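The two rates fit together: a 33-frame chunk covers about 1.3 seconds of 25 fps video, and if one chunk takes roughly the stated five seconds to generate, the effective throughput works out to about 6.6 frames per second.

```python
# Back-of-the-envelope check of the reported numbers, assuming one generation
# pass per 33-frame chunk takes about the stated five seconds.
frames_per_chunk = 33
playback_fps = 25
generation_seconds_per_chunk = 5.0

chunk_duration = frames_per_chunk / playback_fps                  # ~1.32 s of video per chunk
effective_fps = frames_per_chunk / generation_seconds_per_chunk   # ~6.6 generated frames per second
print(f"{chunk_duration:.2f} s of video per chunk, {effective_fps:.1f} fps effective")
```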

The full code and model weights are available on GitHub, and a web demo is in the works.

GameCraft joins a growing field of interactive AI world models. Tencent’s earlier Hunyuan World Model 1.0 can generate 3D scenes from text or images but is limited to static panoramas. Competitors include Google DeepMind’s Genie 3 and the open-source Matrix-Game 2.0 from Skywork.
