VLK


VLK addresses perception-based humanoid loco-manipulation by generating vision-language-kinematics (VLK) synthetic data from reconstructed scenes. The pipeline uses 3D Gaussian Splatting to reconstruct metric-scale indoor environments, synthesizes navigation and object-interaction trajectories from privileged scene information, and renders the paired egocentric observations. It produces 48,000 paired trajectories without human intervention. A VLK policy trained on this data predicts short-horizon whole-body kinematic trajectories, which a whole-body tracker converts into actions on the physical Unitree G1. VLK is evaluated on navigation and single-object transport tasks.
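The trajectory-synthesis step can be illustrated with a minimal sketch. The source does not say how VLK plans its trajectories, so everything below is an assumption: the reconstructed scene is taken to be rasterized into a 2D occupancy grid, A* search on that grid stands in for "planning with privileged scene information", and planar positions stand in for whole-body kinematic states. The resulting path is sliced into short-horizon target windows of the kind a policy could be trained to predict. `astar`, `to_metric`, `make_samples`, `Sample`, the grid, the resolution, and the horizon are all hypothetical. Rendering is omitted; each sample only records the pose at which its paired egocentric view would be rendered.

```python
from __future__ import annotations

import heapq
from dataclasses import dataclass


def astar(grid: list[list[int]], start: tuple[int, int],
          goal: tuple[int, int]) -> list[tuple[int, int]] | None:
    """4-connected A* on a hypothetical occupancy grid (0 = free, 1 = occupied).

    Full knowledge of the grid plays the role of privileged scene information.
    Returns the list of (row, col) cells from start to goal, or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])

    def h(cell: tuple[int, int]) -> int:
        # Manhattan distance: admissible and consistent for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]
    g_cost = {start: 0}
    came_from: dict[tuple[int, int], tuple[int, int]] = {}
    while open_heap:
        _, cost, cur = heapq.heappop(open_heap)
        if cost > g_cost[cur]:
            continue  # stale heap entry; a cheaper route was already found
        if cur == goal:
            path = [cur]
            while cur in came_from:
                cur = came_from[cur]
                path.append(cur)
            return path[::-1]
        r, c = cur
        for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nb
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < g_cost.get(nb, float("inf")):
                    g_cost[nb] = new_cost
                    came_from[nb] = cur
                    heapq.heappush(open_heap, (new_cost + h(nb), new_cost, nb))
    return None


def to_metric(path: list[tuple[int, int]], res: float) -> list[tuple[float, float]]:
    """Convert (row, col) cells to metric (x, y) positions; `res` is meters per cell."""
    return [(c * res, r * res) for r, c in path]


@dataclass
class Sample:
    pose: tuple[float, float]          # where the paired egocentric view would be rendered
    future: list[tuple[float, float]]  # next `horizon` positions, relative to `pose`


def make_samples(waypoints: list[tuple[float, float]], horizon: int) -> list[Sample]:
    """Slice a trajectory into short-horizon targets, padding with the final waypoint."""
    samples = []
    n = len(waypoints)
    for t in range(n):
        x0, y0 = waypoints[t]
        future = []
        for k in range(1, horizon + 1):
            x, y = waypoints[min(t + k, n - 1)]
            future.append((x - x0, y - y0))
        samples.append(Sample(pose=(x0, y0), future=future))
    return samples


if __name__ == "__main__":
    grid = [
        [0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0],
    ]
    path = astar(grid, (0, 0), (2, 0))
    if path is not None:
        for s in make_samples(to_metric(path, res=0.5), horizon=3):
            print(s.pose, s.future)
```

Predicting a short window relative to the current pose, rather than the whole path, mirrors the "short-horizon" targets described above: at execution time only the next few states need to be handed to a tracker, and the prediction can be refreshed as new observations arrive.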