Interview examines how xAI built Grok Imagine and what comes next for video models

An interview with former xAI engineer Ethan He offers a look at how frontier video models are built, what slowed development, and why the next step may be less about generating clips and more about building video agents that can plan, edit and iterate.

He, who worked on NVIDIA's Cosmos world model effort before joining xAI, described a rapid build cycle for Grok Imagine, xAI's video generation system. According to He, the company started with no dedicated infrastructure, data pipeline or model for the product, then shipped an early version in roughly three months with a small team.

Building a video system from scratch

He said his prior work on Cosmos gave him a clearer sense of what the xAI project required. He also pointed to team quality as a major factor in the speed of development, saying that a tight group of capable engineers kept communication overhead low and let the team move quickly.

The interview framed Grok Imagine as part of a broader shift in generative media. He argued that as video models improve in realism, consistency and instruction following, the next leap may not come from raw generation quality alone. Instead, he said future systems may resemble software agents that can plan a creative task, generate output, review it, make edits and repeat the process.

That idea echoes changes seen in AI coding tools, where progress has moved from one-shot output toward systems that can reason across multiple steps, debug problems and coordinate work. In the interview, He suggested video could follow a similar path.

Data, pipelines and the hidden costs

The discussion went beyond model architecture to the less visible parts of the stack. He described the importance of data quality, training pipelines and small implementation details, saying that even minor bugs can have outsized effects on model performance. The conversation also touched on the challenges of storing and moving large video datasets, which can add substantial cost to training and iteration.

He discussed several technical ingredients behind frontier video systems, including variational autoencoders, diffusion transformers, audio-video alignment and inference optimizations. The interview also covered synthetic captions, which are used to help train image and video models, and the tradeoffs involved in compressing temporal information while still keeping models fast enough for interactive use.
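The temporal compression tradeoff can be illustrated with rough arithmetic. The sketch below is not drawn from the interview: the compression factors, patch size and clip dimensions are hypothetical, and the frame-count formula is a simplification. It shows only that the number of latent tokens a diffusion transformer must attend over shrinks as the autoencoder compresses more frames into each latent step.

```python
import math


def latent_tokens(frames, height, width, t_comp, s_comp, patch=2):
    """Rough token count for one clip after autoencoder compression.

    t_comp and s_comp are assumed temporal and spatial compression factors;
    patch is an assumed transformer patch size. All values are illustrative.
    """
    latent_frames = math.ceil(frames / t_comp)
    tokens_high = height // (s_comp * patch)
    tokens_wide = width // (s_comp * patch)
    return latent_frames * tokens_high * tokens_wide


# Compare two hypothetical temporal compression factors on the same clip.
for t_comp in (4, 8):
    n = latent_tokens(frames=120, height=720, width=1280, t_comp=t_comp, s_comp=8)
    # Full self-attention compares every token pair, so cost grows with n * n.
    print(f"t_comp={t_comp}: {n} tokens, {n * n} attention pairs")
```

Under these assumptions, doubling the temporal compression halves the token count and cuts the attention pair count to a quarter, which is why heavier compression helps interactive speed even though it leaves the autoencoder less room to preserve motion detail.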

A recurring theme was speed. He said iteration rate matters more than meetings, especially in a field where model quality can improve quickly through repeated experiments and pipeline fixes. He also noted that some of the biggest gains can come from debugging training data rather than redesigning the whole model.

Beyond video generation

The conversation expanded into world models, real-time generation and future interfaces. He defined world models as systems that can support interactive, long-horizon behavior, and said the most interesting versions will need to operate in real time. He also pointed to generative user interfaces as a possible direction for future products, in which user intent would be rendered directly as pixels rather than passing through traditional web layers.

Another thread in the interview was the role of language models. He argued that video systems may ultimately depend more on LLMs and agentic orchestration than on diffusion alone, especially if they are expected to handle planning, prompt rewriting and complex multi-step tasks. That view led to his broader conclusion that the next major milestone may be a "video agent" rather than simply a better text-to-video model.
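The plan, generate, review and revise cycle He described can be sketched as a simple control loop. This is a minimal illustration, not xAI's design: every name here (`run_video_agent`, `Draft`, the `toy_*` stand-ins) is hypothetical, and the stand-ins replace what would in practice be calls to a language model and a video model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Draft:
    prompt: str
    revision: int = 0
    notes: List[str] = field(default_factory=list)


def run_video_agent(
    task: str,
    plan: Callable[[str], str],
    generate: Callable[[str], Draft],
    review: Callable[[Draft], List[str]],
    revise: Callable[[Draft, List[str]], Draft],
    max_rounds: int = 3,
) -> Draft:
    """Plan once, generate, then review and revise until no issues remain."""
    prompt = plan(task)           # e.g. an LLM rewriting the user's request
    draft = generate(prompt)      # e.g. a call to a video model
    for _ in range(max_rounds):
        issues = review(draft)    # e.g. an LLM critiquing the clip
        if not issues:
            break
        draft = revise(draft, issues)
    return draft


# Toy stand-ins so the loop can run without any model behind it.
def toy_plan(task: str) -> str:
    return f"shot list for: {task}"


def toy_generate(prompt: str) -> Draft:
    return Draft(prompt=prompt)


def toy_review(draft: Draft) -> List[str]:
    # Pretend the first two drafts each have one problem.
    return ["fix lighting"] if draft.revision < 2 else []


def toy_revise(draft: Draft, issues: List[str]) -> Draft:
    return Draft(draft.prompt, draft.revision + 1, draft.notes + issues)


result = run_video_agent(
    "a dog surfing", toy_plan, toy_generate, toy_review, toy_revise
)
print(result.revision, result.notes)
```

The cap on rounds is a design choice in this sketch: it bounds the cost of the loop when the reviewer never reports a clean draft.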

He also said that after working on video systems, he shifted his attention toward language models and context management, suggesting that self-managed memory and long-term context could become important frontiers in their own right.

The interview presents a picture of an industry still racing through the basics of video generation while already looking toward systems that can reason, revise and coordinate creation end to end.