Now, you’ve probably heard of large language models, machine learning models, image generation models, and so on … but “world model” may be a new one. To help explain the concept, we sat down with Googlers Shlomi Fruchter and Jack Parker-Holder.
Congratulations on the launch of Project Genie! What were your roles on the team?
Shlomi: Jack and I lead Genie development together. I mostly focus on our next-gen video and world models and work with the team to investigate new improvements.
Jack: I am a researcher and co-lead Genie. My job is mostly about coming up with new opportunities for our models and then making sure there is a team, roadmap and plan to make it happen.
What exactly is Project Genie?
Jack: Project Genie is a tool where you can create your own world with characters and environments and explore them in real time. For example, travel to an alien planet or dive underwater with sea creatures. Whatever you can think of.
Shlomi: The worlds that we typically want to simulate are variants of the world we live in because that is what we know and care about. Genie predicts what will happen based on the mechanics of such a variant world: “OK, if I’m going to walk into the room that looks like the picture you gave of your room, what will it look like when I walk around? What will the mirror look like? How will the light reflect off the hardwood floor?” All the environmental dynamics – if water is spilled or it rains – the model simulates end-to-end without a game engine running in the background. And you can actually interact. If there’s a ball on the floor, you can actually bump into it and it starts rolling, which is what you’d expect to happen in reality. When the model does a good job, it looks realistic.
Is Genie the first world model?
Jack: There are actually lots of earlier papers on world models, but one that helped put the idea on the map came out in 2018 from what was then called Google Brain. Google Brain was our deep learning and AI research team, and it is now part of Google DeepMind. That paper was by David Ha and Jürgen Schmidhuber; it was the first time anyone trained a world model from a visual domain. This is what really popularized the term “world model” in the developer community.
What is the difference between a world model and, for example, a large language model?
Shlomi: Think of it like this: A language model tries to predict the next word. From that, it learns a representation of the language. Later it can learn to have a full conversation with a person and maybe even think through a math problem. Similarly, a world model tries to predict what will happen next in the world based on the sequence of actions that an agent performs. Basically, it is simulating an entire environment, moment by moment, in response to an agent. Through this simple task, the model learns a representation of the world.
So a world model predicts a world based on the environments it’s been trained on. And not just the world, but how things react in that world. Is that right?
Shlomi: Yes. An important part of what happens in a world model is what we call “observation.” When we use this word, it has a narrow definition: visual observation. Observation, more generally, does not have to be visual – you can observe how something sounds, feels, or smells. But at this point we’re talking about visuals.
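To make the prediction loop Shlomi describes a little more concrete, here is a minimal sketch in Python. It is a toy illustration only and not how Genie works: the "world" is a hypothetical five-cell corridor, the observation is just the agent's position rather than an image, and the names `ToyWorldModel` and `true_step` are invented for this example. The model sees (observation, action, next observation) examples, then simulates a sequence of actions moment by moment without consulting the real environment.

```python
from collections import Counter, defaultdict


class ToyWorldModel:
    """Toy stand-in for a world model: counts which observation followed
    each (observation, action) pair and predicts the most frequent one."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def update(self, obs, action, next_obs):
        # Learn from one observed transition.
        self.counts[(obs, action)][next_obs] += 1

    def predict(self, obs, action):
        seen = self.counts.get((obs, action))
        if not seen:
            return obs  # assumption: an unseen situation leaves the world unchanged
        return seen.most_common(1)[0][0]

    def rollout(self, obs, actions):
        # Simulate step by step, feeding each prediction back in as the
        # next observation. The real environment is never consulted here.
        trajectory = [obs]
        for action in actions:
            obs = self.predict(obs, action)
            trajectory.append(obs)
        return trajectory


def true_step(position, action, size=5):
    """Hypothetical ground-truth environment: a corridor with walls at both ends."""
    delta = {"left": -1, "right": 1}[action]
    return min(max(position + delta, 0), size - 1)


# "Training": show the model every transition once.
model = ToyWorldModel()
for position in range(5):
    for action in ("left", "right"):
        model.update(position, action, true_step(position, action))

# "Imagination": predict what happens for a sequence of actions.
imagined = model.rollout(0, ["right", "right", "left", "right"])
print(imagined)  # [0, 1, 2, 1, 2]
```

A model like Genie would replace the lookup table with a learned neural network and the integer position with visual frames, but the interface has the same shape: an observation and an action go in, a predicted next observation comes out.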
Got it. How do you prompt Genie?
Jack: The best way to start prompting Genie is with an image or images – we often use Nano Banana for this – and some text. You can do text only, but it’s more fun to use an image. For example, you can upload a photo of a dog on the beach and the text can describe the dynamics of the scene – perhaps something like how the sea is choppy.
What could we use world models for?
Jack: One application is training AI agents to learn how to do things in the real world. Giving them access to our actual world would be dangerous and expensive, but if we could simulate it, it would give us a testing ground. Another is education: you could use a world model to teach a class about science and history. Imagine 35 kids in a class who aren’t paying attention. Suddenly the teacher brings up a world model on the board: “OK, we’re going to walk around ancient Rome. What are we going to do? Let’s go and ask that person what’s going on.” We can work on making the model more historically accurate and turn the lesson into an interactive experience. For science, you can explore underwater diving – we already have examples of this.
