The generative AI industry has been flooded with video models over the past six months. Companies are rushing to showcase videos created by neural networks, but behind the beautiful imagery often lies complete absurdity: models generate non-existent streets, buildings, and objects that live a life of their own. Naver, the South Korean internet giant known for its search engine and mapping service, appears to have found a way to curb these AI hallucinations. Its Seoul World Model (SWM) relies not on fabricated data but on the real geometry of the city, extracted from the company's own 1.2 million Street View panoramas. In theory, this should let the AI generate video that stays tethered to reality.
The main difference between SWM and most competitors, which produce entirely synthetic worlds, is its reliance on real data. When a user specifies coordinates, desired camera movement, and a text prompt, the model accesses Naver Map's proprietary panorama database. Using the nearest Street View images as reference points, SWM constructs the video step by step. Researchers from Naver and Naver Cloud claim this is the first approach of its kind that grounds video generation in the physical world. This method ensures better visual and temporal consistency, eliminating surprises like unexpected objects appearing in the frame or roads leading nowhere.
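Naver has not published SWM's retrieval code, but the step described above, finding the Street View panoramas nearest to the requested coordinates to serve as generation references, can be sketched in a few lines. Everything here (the `Panorama` record, the database list, the function names) is a hypothetical illustration, not Naver's API:

```python
import math
from dataclasses import dataclass

@dataclass
class Panorama:
    lat: float
    lon: float
    image_id: str

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two lat/lon points."""
    R = 6_371_000  # mean Earth radius, meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

def nearest_panoramas(db: list[Panorama], lat: float, lon: float, k: int = 3) -> list[Panorama]:
    """Pick the k panoramas closest to the requested coordinates.

    A real system would use a spatial index (e.g. an R-tree) instead of
    a full sort, but the idea is the same: the generator is conditioned
    on whatever reference imagery exists near the query location.
    """
    return sorted(db, key=lambda p: haversine_m(p.lat, p.lon, lat, lon))[:k]
```

In SWM's pipeline, the images returned by a lookup like this would be fed to the video model alongside the camera trajectory and text prompt, anchoring each generated frame to real geometry.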
Working with real data presents its own challenges. Street View images are essentially frozen moments in time: the cars and pedestrians captured in a frame have nothing to do with the dynamic scene the model is meant to generate. Without specialized training, SWM risked simply copying random objects from the reference images. Naver's engineers addressed this with "cross-temporal alignment": the model is deliberately trained on images of the same location captured at different times. This teaches SWM to distinguish permanent elements of the urban environment, such as buildings, from transient ones, like parked cars. In Naver's reports, this mechanism is described as the model's most effective component. To bridge the gaps between images captured every 5-20 meters, the model also uses simulations to fill in missing viewpoints and visual anchors.
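The core of cross-temporal alignment is data construction: pairing captures of the same location from different dates, so that across each pair the buildings match while the cars and pedestrians differ. The toy sketch below shows only that pairing step; the record layout and function name are assumptions for illustration, not Naver's implementation:

```python
import itertools
from collections import defaultdict

def cross_temporal_pairs(captures: list[tuple[str, str, str]]) -> list[tuple[str, str]]:
    """Build training pairs for cross-temporal alignment.

    captures: (location_id, capture_date, image_id) records.
    Returns pairs of image_ids taken at the SAME location on DIFFERENT
    dates. Trained on such pairs, a model sees persistent structure
    (buildings, roads) agree across the pair while transient objects
    (parked cars, pedestrians) disagree -- the signal that lets it
    learn not to copy transient content from reference images.
    """
    by_location: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for location_id, date, image_id in captures:
        by_location[location_id].append((date, image_id))

    pairs = []
    for items in by_location.values():
        # All unordered combinations of captures at this location,
        # keeping only those taken on different dates.
        for (d1, img_a), (d2, img_b) in itertools.combinations(sorted(items), 2):
            if d1 != d2:
                pairs.append((img_a, img_b))
    return pairs
```

A loss computed over such pairs would then reward the model for reproducing shared (permanent) content and penalize it for hallucinating pair-specific (transient) content; that training objective is outside the scope of this sketch.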
Testing revealed that SWM surpasses six existing video models in image quality and temporal consistency. Most interestingly, the model demonstrated an impressive ability to generalize: without any additional training, SWM successfully generated videos in unfamiliar cities, including Busan, South Korea, and Ann Arbor, USA. This opens up broad prospects for industries requiring accurate and realistic world representations, from VR/AR and virtual world creation to complex simulations for training and testing.
What this means for business right now: Naver has introduced not just another generative model, but a solution that could fundamentally alter the approach to creating realistic digital content. Relying on real data instead of pure fabrication is the next logical step in technological development, crucial for anyone involved in visualization, from game developers to metaverse creators.