Tencent’s HunyuanWorld-Voyager can generate a spatially consistent 3D scene from a single photo, without relying on traditional 3D modeling pipelines. The system combines RGB and depth data with a memory-efficient “world cache” to produce video sequences that reflect user-defined camera movement.
With Voyager, users upload a photo and specify a camera path through the scene. Voyager then generates a continuous video simulating the camera’s motion, aiming to simplify the creation of virtual 3D environments without extensive modeling or technical setup.
At its core is joint RGB and depth (RGB-D) video generation: every generated frame comes with an aligned depth map. The depth channel lets Voyager estimate distances in the scene and avoid common errors when objects are viewed from unusual angles.
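To make the role of depth concrete: per-pixel metric depth lets any frame be lifted into 3D and reprojected into a new viewpoint, so geometry stays anchored as the camera moves. The NumPy sketch below illustrates only the underlying pinhole math; the intrinsics and pose are generic placeholders, not Voyager's actual pipeline.

```python
import numpy as np

def unproject(depth, K):
    """Lift a depth map (H, W) into 3D camera-space points (H, W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - K[0, 2]) * depth / K[0, 0]   # pinhole model: X = (u - cx) * Z / fx
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.stack([x, y, depth], axis=-1)

def reproject(points, K, R, t):
    """Project 3D points into a second camera given rotation R and translation t."""
    cam = points.reshape(-1, 3) @ R.T + t  # move points into the new camera frame
    uv = cam @ K.T                         # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]          # perspective divide -> pixel coordinates
```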
Memory for 3D worlds
Voyager’s “world cache” stores previously seen and newly generated regions of the scene, updating as the camera moves. When hidden parts of the environment come back into view, the system restores them from the cache instead of regenerating them from scratch. Redundant data is pruned to keep memory use in check, which keeps long camera paths stable and geometrically consistent.
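Tencent doesn't spell out the cache internals here, but the described behavior maps onto a familiar pattern: a spatial store that merges new observations, drops near-duplicates, and can be queried to restore regions that re-enter view. A simplified Python sketch under those assumptions (all names hypothetical, not Voyager's implementation):

```python
import numpy as np

class WorldCache:
    """Toy world cache: stores 3D points with colors, prunes near-duplicates.
    A simplified illustration, not Voyager's actual data structure."""

    def __init__(self, voxel_size=0.05):
        self.voxel_size = voxel_size      # grid cell size for duplicate detection
        self.points = {}                  # voxel index -> (xyz, rgb)

    def _key(self, p):
        return tuple(np.floor(p / self.voxel_size).astype(int))

    def insert(self, xyz, rgb):
        """Merge new observations; points falling in occupied voxels are dropped."""
        for p, c in zip(xyz, rgb):
            k = self._key(p)
            if k not in self.points:      # keep first observation per voxel
                self.points[k] = (p, c)

    def query(self, center, radius):
        """Return cached points near the camera, e.g. to restore occluded regions."""
        return [(p, c) for p, c in self.points.values()
                if np.linalg.norm(p - center) < radius]
```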
Tencent trained Voyager on a large dataset of real videos and Unreal Engine scenes, each labeled with estimated camera poses and metric depth. This approach helped the model learn how cameras move through real spaces and how objects look from different angles.
Benchmark performance and direct 3D output
Tencent says Voyager scored well across multiple categories on the WorldScore benchmark, including camera control and spatial consistency. A practical benefit of generating RGB and depth video together is that the system can output 3D reconstructions directly, such as point clouds or 3D Gaussian splats, with less need for post-processing.
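Since every generated frame carries aligned depth, and the camera path is user-specified and therefore known, assembling a point cloud is mostly per-frame back-projection plus accumulation. A minimal sketch, assuming known intrinsics K and camera-to-world poses (both illustrative assumptions, not Voyager's published interface):

```python
import numpy as np

def fuse_point_cloud(frames, K):
    """Accumulate per-frame RGB-D back-projections into one world-space cloud.

    frames: list of (rgb (H,W,3), depth (H,W), c2w (4,4) camera-to-world pose).
    Voyager predicts depth jointly with RGB, so no separate depth-estimation
    pass is needed before this step.
    """
    all_xyz, all_rgb = [], []
    for rgb, depth, c2w in frames:
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        valid = depth > 0                     # skip pixels without depth
        z = depth[valid]
        x = (u[valid] - K[0, 2]) * z / K[0, 0]
        y = (v[valid] - K[1, 2]) * z / K[1, 1]
        cam = np.stack([x, y, z, np.ones_like(z)], axis=-1)
        world = (cam @ c2w.T)[:, :3]          # camera frame -> world frame
        all_xyz.append(world)
        all_rgb.append(rgb[valid])
    return np.concatenate(all_xyz), np.concatenate(all_rgb)
```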
Tencent reports that Voyager can also reconstruct 3D objects from single images, estimate depth in video, and transfer styles while preserving geometric structure. The code and inference weights are publicly available; Tencent lists 60 GB of GPU memory as the minimum for 540p output.
Building on HunyuanWorld 1.0
Voyager is designed to complement HunyuanWorld 1.0. HunyuanWorld 1.0 focused on semantic, layered 3D mesh representations with mesh export and interactivity, but struggled with exploration range and occluded areas. Voyager addresses both issues with RGB-depth coupling and the world cache, making longer, more consistent camera paths possible. The two systems are meant to work together: HunyuanWorld 1.0 is the better fit for exporting meshes, while Voyager focuses on stable video and 3D scene generation. HunyuanWorld 1.0 has been available in a “Lite” version since August; Voyager is now being released.
Competing systems target different use cases
Other systems take different approaches. Google’s Genie 3 targets interactive worlds where users trigger “world events” via text. Google says scene consistency lasts a few minutes, but access is currently limited to a research preview.
Mirage 2 from Dynamics Lab also offers browser-based interactive demos with keyboard and text input. While these systems focus on live gameplay, interactivity, and robot training, Voyager is aimed at video production and 3D content pipelines.