Yoko Li says visual AI is moving from pixels to code

Visual AI has long been measured by how good its images and videos look. But Andreessen Horowitz researcher Yoko Li says the next major step is not better pixels. It is code.

In a recent essay, Li argued that many design and graphics workflows are better served by models that generate editable representations such as HTML, SVG, React components, Lottie files or 3D scene descriptions. Those outputs can be rendered into pixels, but they also remain open to revision, reuse and integration with software tools.

The distinction matters, Li wrote, because a finished image is often only the first draft in professional creative work. Designers, animators and 3D artists usually need the source structure behind the visual, not just a screenshot or render. A logo may need a path edited. A user interface may need responsive behavior or accessibility checks. A 3D object may need geometry, materials and part hierarchy that can be changed later.

Two different approaches

Li separates visual generation into two broad stacks. The first is pixel-native generation, where systems create images or video directly. That approach remains strong for tasks that prioritize realism, texture or atmosphere.

The second is code-native generation, where a model produces a structured artifact that is executed by another engine. In that setup, the model is not the final renderer. It is the author of a program that can later be displayed in a browser, animation player, modeling tool or game engine.

According to Li, this code-first method creates a more practical workflow for many production tasks because the result can be inspected and edited at the source level. If spacing is off in a layout, for example, a system can change CSS rather than regenerate an image. If a motion curve is wrong, it can adjust animation timing rather than start over.
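
To make the spacing example concrete, here is a minimal Python sketch of a source-level edit. It is not from Li's essay: the function name set_css_property and the sample stylesheet are hypothetical, and the regex-based editor only handles flat CSS rules, with no nesting, comments or media queries. The point is only that a single declaration changes while the rest of the artifact stays intact.

```python
import re

def set_css_property(css: str, selector: str, prop: str, value: str) -> str:
    """Return `css` with `prop` set to `value` in the rule for `selector`.

    Toy editor (an illustrative assumption, not a real CSS parser):
    it only understands flat rules of the form `selector { ... }`.
    """
    rule_re = re.compile(r"(" + re.escape(selector) + r"\s*\{)([^}]*)(\})")
    match = rule_re.search(css)
    if match is None:
        raise ValueError(f"no rule for selector {selector!r}")

    body = match.group(2)
    # The lookbehind keeps "gap" from matching inside "row-gap".
    decl_re = re.compile(r"(?<![\w-])" + re.escape(prop) + r"\s*:\s*[^;]*;?")
    new_decl = f"{prop}: {value};"

    if decl_re.search(body):
        # Replace the existing declaration in place.
        body = decl_re.sub(lambda _m: new_decl, body, count=1)
    else:
        # Append the declaration if the rule does not have it yet.
        body = body.rstrip() + f" {new_decl} "

    return css[:match.start()] + match.group(1) + body + match.group(3) + css[match.end():]


# Hypothetical stylesheet: only the gap in .card is touched.
stylesheet = ".card { padding: 8px; gap: 4px; }\n.title { color: red; }"
fixed = set_css_property(stylesheet, ".card", "gap", "12px")
```

A pixel-native system has no equivalent of this operation: correcting the same spacing would mean generating a new image and hoping everything else stays the same.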

That turns generation from a one-shot task into a loop: the model generates code, renders it, inspects the result and revises the underlying artifact. In Li's view, the loop gives visual AI a way to spend extra inference time on targeted corrections rather than on sampling many separate images.
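
The generate, render, inspect and revise cycle can be sketched in a few lines of Python. Everything here is a stand-in chosen for illustration: the artifact is a plain dictionary rather than real code, render is a toy one-dimensional layout rather than a browser, and revise is a fixed rule where a real system would call a model. The names Box, render, inspect_layout, revise and refine are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Box:
    x: float
    width: float

def render(artifact: dict) -> list[Box]:
    """Stand-in renderer: place items left to right, separated by the gap."""
    boxes, x = [], 0.0
    for width in artifact["widths"]:
        boxes.append(Box(x, width))
        x += width + artifact["gap"]
    return boxes

def inspect_layout(boxes: list[Box], max_width: float) -> float:
    """Return how far the layout overflows the container (0.0 if it fits)."""
    right_edge = max(b.x + b.width for b in boxes)
    return max(0.0, right_edge - max_width)

def revise(artifact: dict, overflow: float) -> dict:
    """Stand-in for a model edit: shrink the gap to absorb the overflow."""
    n_gaps = max(1, len(artifact["widths"]) - 1)
    new_gap = max(0.0, artifact["gap"] - overflow / n_gaps)
    return {**artifact, "gap": new_gap}

def refine(artifact: dict, max_width: float, max_steps: int = 5) -> tuple[dict, int]:
    """Render, inspect and revise until the layout fits or steps run out."""
    for step in range(max_steps):
        overflow = inspect_layout(render(artifact), max_width)
        if overflow == 0:
            return artifact, step
        artifact = revise(artifact, overflow)
    return artifact, max_steps


# Three 100-unit items with a 40-unit gap overflow a 340-unit container.
draft = {"widths": [100, 100, 100], "gap": 40}
final, steps_used = refine(draft, max_width=340)
```

Each pass through the loop changes one field of the same artifact, which is the contrast Li draws with sampling a fresh image on every attempt.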

Why 3D may matter most

Li said product design and 2D graphics are the clearest early use cases, but she sees 3D as a particularly important frontier.

A rendered chair image may look convincing, she noted, but it is still only an image. For use in games, simulations or 3D editors, the model must produce a consistent structure with the right shape, hierarchy and behavior. Doors should open, wheels should spin and joints should move correctly. That raises the bar from appearance to function.
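
The gap between appearance and function is easier to see with a structured asset in hand. The Python sketch below is an assumption-laden illustration, not a format used by any tool named in the essay: the Part and Joint classes, the "hinge" and "fixed" joint kinds and the find_problems check are all hypothetical. It shows the kind of test a rendered image cannot pass or fail, namely whether the parts that should move are actually modeled as movable.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Joint:
    kind: str             # "hinge" (rotates) or "fixed" (hypothetical kinds)
    min_deg: float = 0.0
    max_deg: float = 0.0

@dataclass
class Part:
    name: str
    joint: Optional[Joint] = None            # how the part attaches to its parent
    children: list["Part"] = field(default_factory=list)

def find_problems(root: Part, movable: set[str]) -> list[str]:
    """List parts that should move but are missing or cannot rotate."""
    problems, seen = [], set()
    stack = [root]
    while stack:
        part = stack.pop()
        seen.add(part.name)
        if part.name in movable:
            joint = part.joint
            if joint is None or joint.kind != "hinge":
                problems.append(f"{part.name}: expected a hinge joint")
            elif joint.max_deg <= joint.min_deg:
                problems.append(f"{part.name}: hinge has no range of motion")
        stack.extend(part.children)
    # Report movable parts that never appeared in the hierarchy.
    for name in sorted(movable - seen):
        problems.append(f"{name}: part missing from hierarchy")
    return problems


# A cabinet whose door is modeled as fixed would look fine in a render.
cabinet = Part("cabinet", children=[Part("door", Joint("fixed"))])
issues = find_problems(cabinet, movable={"door"})
```

Two cabinets could render to identical images while only one of them has a door that opens; a check like this operates on the hierarchy, not the picture.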

She pointed to emerging work in 3D reconstruction and articulated asset generation as examples of systems that depend on a render-and-revise loop. These tools rely on engines such as Blender or other simulators as feedback environments, allowing agents to test whether generated assets behave as intended.

A market built around runtimes

Li also argued that the market for visual code generation may organize around the runtime where content is rendered. That could mean the browser for web design, an SVG renderer for vector art, a Lottie player for motion graphics, or a 3D engine for assets and scenes.

The underlying idea is that the renderer becomes part of the feedback system. Instead of judging only the output image, the model can inspect how the source behaves in context and then make targeted edits.

Li said that creates an opening for new products that do not just generate attractive visuals but own the full loop from generation to inspection to revision. In her view, the biggest gains in visual AI may come not from producing pixels directly, but from learning to write the code that creates them.