Researchers connected to Stability AI have introduced a new method for improving image layer decomposition without relying on paired training data. The system, called Stable-Layers, uses reinforcement learning and feedback from a vision-language model to fine-tune a pretrained image layering model.
The work focuses on a task that breaks an image into editable layers, a capability useful for image manipulation and compositing. According to the researchers, existing models can struggle with consistent separation between objects, may duplicate parts of the image across layers, and sometimes produce blank or artifact-filled outputs. Stable-Layers is designed to address those problems while training only on unlabeled images.
Stable-Layers starts from Qwen-Image-Layered, a pretrained layer decomposition model. The researchers then apply Flow-GRPO, a reinforcement learning approach, together with LoRA adaptation to refine the model. For each input image, the system samples multiple candidate decompositions, then scores them using a vision-language model judge.
That reward design is central to the method. The team said that if a vision-language model evaluates each sample on its own, its scores tend to cluster too closely together, giving the learning algorithm little useful variation. To solve that, they built a two-stage scoring pipeline.
First, the vision-language judge assigns structured scores to each candidate across five criteria tied to editing quality. Then, in a grid-based calibration step, the judge compares the candidates side by side and rescores them relative to one another. The resulting feedback is used to update the policy with group-relative advantages.
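As a rough illustration of what "group-relative advantages" means, the sketch below normalizes the rewards of all candidates sampled for one image against that group's own mean and standard deviation, which is the usual GRPO-style formulation. The article does not give the paper's exact formula, so the function name `group_relative_advantages`, the `eps` term, and the example scores are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize one group's rewards to zero mean and roughly unit spread.

    `rewards` holds one judge score per candidate decomposition sampled
    for the same input image. `eps` (an assumed constant) avoids division
    by zero when every candidate receives the same score.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Candidates the judge can tell apart produce a clear ranking signal.
spread = group_relative_advantages([1.0, 2.0, 3.0])

# Candidates that all receive the same score produce no signal at all.
tied = group_relative_advantages([7.0, 7.0, 7.0])
```

In this sketch, candidates scored above the group mean get positive advantages and those below get negative ones. When every candidate receives the same score, all advantages are zero and the update has nothing to learn from, which is one way to read the clustering problem the researchers describe.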
The researchers say this lets the model improve without the paired examples often needed for supervised training. Instead, Stable-Layers learns from the judge’s assessments of unlabeled images.
In the researchers’ evaluation, Stable-Layers produced decompositions with stronger separation between layers, fewer empty outputs, and fewer artifact-heavy regions. The team also reported lower per-layer reconstruction error than the base model.
Visual comparisons in the project materials show the original Qwen-Image-Layered system often leaving the first layer degenerate or repeating the composite image across multiple foreground layers. By contrast, Stable-Layers more often isolates distinct semantic elements and creates cleaner alpha masks, with less color bleeding into transparent areas.
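To make the role of alpha masks concrete, the sketch below flattens a stack of RGBA layers for a single pixel using the standard Porter-Duff "over" operator. It is not taken from the paper: the straight (non-premultiplied) alpha convention, the back-to-front layer order, and the helper names `over` and `flatten` are assumptions chosen for illustration.

```python
def over(top, bottom):
    """Composite one straight-alpha RGBA pixel over another ('over' operator).

    Each pixel is an (r, g, b, a) tuple of floats in the range 0.0 to 1.0.
    """
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    out_a = ta + ba * (1.0 - ta)
    if out_a == 0.0:
        # Both pixels are fully transparent, so the color is undefined.
        return (0.0, 0.0, 0.0, 0.0)

    def blend(t, b):
        return (t * ta + b * ba * (1.0 - ta)) / out_a

    return (blend(tr, br), blend(tg, bg), blend(tb, bb), out_a)


def flatten(layers):
    """Flatten a back-to-front list of RGBA pixels into a single pixel."""
    result = (0.0, 0.0, 0.0, 0.0)
    for pixel in layers:
        result = over(pixel, result)
    return result


# An opaque blue background under a half-transparent red foreground.
mixed = flatten([(0.0, 0.0, 1.0, 1.0), (1.0, 0.0, 0.0, 0.5)])

# A fully transparent layer that still carries stray white color.
hidden = flatten([(0.0, 0.0, 1.0, 1.0), (1.0, 1.0, 1.0, 0.0)])
```

Under these assumptions, stray color in a fully transparent region has no effect on the flattened result, but it can become visible once an editor moves the layer or raises its alpha. That is one reason cleaner alpha masks matter for editing.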
The reported gains were consistent across the tested layer counts. The largest improvements were in the first predicted layer, which the researchers describe as the dominant layer in the decomposition.
Image layer decomposition can make editing workflows more flexible by separating a picture into components that can be manipulated individually. A model that better identifies foreground objects, backgrounds, and occluded regions could be useful for restoration, compositing, and other editing tasks.
The Stable-Layers approach also reflects a broader trend in AI research, where models are increasingly trained or tuned using machine-generated feedback rather than hand-labeled datasets. In this case, the researchers argue that reinforcement learning with a vision-language judge can improve a visual editing model even when paired supervision is unavailable.
The project is described in a paper titled "Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored RL," which lists authors Ciara Rowles, Reshinth Adithyan, Nikhil Pinnaparaju, Vikram Voleti, and Mark Boss. The paper is available as an arXiv preprint.