Compositional Scene Understanding through Inverse Generative Modeling

TU Delft, Harvard University
ICML 2025

Abstract

We explore how generative models can be used not only to synthesize visual content but also to understand the properties of a scene given a natural image. We formulate scene understanding as an inverse generative modeling problem, where we seek the conditional parameters of a visual generative model that best fit a given natural image. To enable this procedure to infer scene structure from images substantially different from those seen during training, we further propose to build the visual generative model compositionally from smaller models over pieces of a scene. We illustrate how this procedure enables us to infer the set of objects in a scene, generalizing robustly to new test scenes with more objects and objects of new shapes. We further illustrate how it enables us to infer global scene factors, likewise generalizing robustly to new scenes. Finally, we show how this approach can be applied directly to existing pretrained text-to-image generative models for zero-shot multi-object perception.


Method

Compositional generative modeling for training. In the visual domain, given a set of conditioning concepts $\boldsymbol{c}^1, \ldots, \boldsymbol{c}^K$, we construct a generative model that accurately represents the probability distribution $p(\boldsymbol{x}| \boldsymbol{c}^1, \boldsymbol{c}^2, \ldots, \boldsymbol{c}^K)$ over the space of images $\boldsymbol{x}$. To enable effective compositional generalization to a larger number of visual concepts, we further factorize $p(\boldsymbol{x}|\boldsymbol{c}^1, \ldots, \boldsymbol{c}^K) \propto \prod_{k=1}^{K} p(\boldsymbol{x}|\boldsymbol{c}^{k})$, with each factor $p(\boldsymbol{x}|\boldsymbol{c}^k)$ represented by a diffusion model $\epsilon_\theta(\boldsymbol{x}^t, t|\boldsymbol{c}^k)$, and train the composition of score functions with the denoising diffusion objective $\mathcal{L}_{\theta} = \mathbb{E}_{\boldsymbol{x}, \boldsymbol{\epsilon}, t}\|\boldsymbol{\epsilon} - \sum_{k=1}^K \epsilon_\theta(\boldsymbol{x}^t, t|\boldsymbol{c}^k)\|^2$.
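Below is a minimal PyTorch-style sketch of this training objective. The names `eps_model` (the shared concept-conditioned denoiser $\epsilon_\theta$) and `noise_schedule` (returning the usual $\sqrt{\bar{\alpha}_t}$ and $\sqrt{1-\bar{\alpha}_t}$ coefficients) are placeholder interfaces for illustration, not the paper's actual implementation.

```python
import torch

def composed_denoising_loss(eps_model, x0, concepts, noise_schedule):
    """Denoising loss for a composition of concept-conditioned score functions.

    A single denoiser eps_model(x_t, t, c) is evaluated once per concept and the
    predictions are summed to form the composed noise estimate.
    noise_schedule(t) is assumed to return (sqrt_alpha_bar_t, sqrt_one_minus_alpha_bar_t),
    broadcastable to the shape of x0.
    """
    batch = x0.shape[0]
    t = torch.randint(0, noise_schedule.num_steps, (batch,), device=x0.device)
    eps = torch.randn_like(x0)

    sqrt_ab, sqrt_1mab = noise_schedule(t)
    x_t = sqrt_ab * x0 + sqrt_1mab * eps          # forward diffusion q(x_t | x_0)

    # Composed prediction: sum of the per-concept noise estimates.
    eps_hat = sum(eps_model(x_t, t, c_k) for c_k in concepts)

    return ((eps - eps_hat) ** 2).mean()
```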

Inverse generative modeling for inference. Once the generative model is trained, we infer the set of visual concepts that maximizes the log-likelihood of an observed image $\boldsymbol{x}$ by solving $\hat{\boldsymbol{c}}^1, \ldots, \hat{\boldsymbol{c}}^K = \operatorname{argmin}_{\boldsymbol{c}^1, \ldots, \boldsymbol{c}^K} \mathbb{E}_{\boldsymbol{\epsilon}, t}\|\boldsymbol{\epsilon} - \sum_{k=1}^K \epsilon_\theta(\boldsymbol{x}^t, t|\boldsymbol{c}^k)\|^2$.
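In practice this argmin can be approximated by stochastic gradient descent on a Monte Carlo estimate of the objective. The sketch below reuses the placeholder `eps_model` and `noise_schedule` interfaces from the training snippet and uses Adam as an example optimizer; it is one possible approximation, not necessarily the exact procedure or hyperparameters used in the paper.

```python
import torch

def infer_concepts(eps_model, x0, init_concepts, noise_schedule,
                   num_iters=500, lr=1e-2):
    """Invert the generative model: optimize the concept parameters so the
    composed denoiser best explains the observed image x0 (lowest denoising error)."""
    concepts = [c.clone().requires_grad_(True) for c in init_concepts]
    opt = torch.optim.Adam(concepts, lr=lr)

    for _ in range(num_iters):
        # Monte Carlo sample of (noise, timestep) for the expectation.
        t = torch.randint(0, noise_schedule.num_steps, (x0.shape[0],), device=x0.device)
        eps = torch.randn_like(x0)
        sqrt_ab, sqrt_1mab = noise_schedule(t)
        x_t = sqrt_ab * x0 + sqrt_1mab * eps

        eps_hat = sum(eps_model(x_t, t, c_k) for c_k in concepts)
        loss = ((eps - eps_hat) ** 2).mean()

        opt.zero_grad()
        loss.backward()
        opt.step()

    return [c.detach() for c in concepts]
```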

Figure: Overview of the approach.

Results

Below, we show additional examples demonstrating how inverse generative modeling, coupled with compositionality, not only infers concepts from images but also generalizes effectively to scenes more complex than those seen during training.

Infer Object Locations

Our approach can infer local factors (such as object coordinates) and the number of objects in a test image, and it generalizes effectively to scenes containing more, and more complex, objects than those seen during training.



In-distribution Object Discovery. We train our model on CLEVR images containing 3-5 objects. On the left, given an in-distribution image (also containing 3-5 objects), our approach accurately identifies object coordinates. On the right, we show that our approach can determine the number of objects by selecting the count with the lowest denoising error (a minimal sketch of this selection follows below).
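The object-count selection can be viewed as model comparison under the denoising objective: each hypothesized number of objects K is scored by the average denoising error of its best-fit concepts, and the K with the lowest error is chosen. The sketch below assumes the placeholder `eps_model` / `noise_schedule` interfaces from the Method section and a dictionary mapping each candidate K to concepts already optimized for it; it illustrates the idea rather than the paper's exact procedure.

```python
import torch

@torch.no_grad()
def select_num_objects(eps_model, x0, candidate_concept_sets, noise_schedule,
                       num_samples=64):
    """Pick the number of objects whose optimized concepts give the lowest
    average denoising error on the observed image x0.

    candidate_concept_sets: dict mapping K -> list of K optimized concept vectors.
    """
    errors = {}
    for K, concepts in candidate_concept_sets.items():
        total = 0.0
        for _ in range(num_samples):
            t = torch.randint(0, noise_schedule.num_steps, (x0.shape[0],), device=x0.device)
            eps = torch.randn_like(x0)
            sqrt_ab, sqrt_1mab = noise_schedule(t)
            x_t = sqrt_ab * x0 + sqrt_1mab * eps
            eps_hat = sum(eps_model(x_t, t, c_k) for c_k in concepts)
            total += ((eps - eps_hat) ** 2).mean().item()
        errors[K] = total / num_samples
    best_K = min(errors, key=errors.get)
    return best_K, errors
```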


Out-of-distribution Object Discovery. Object perception results on out-of-distribution images: CLEVR images with 6-8 objects (Left) and CLEVRTex images with 6-8 objects (Right). Our model is trained on CLEVR images containing 3-5 objects. At inference time, given an out-of-distribution image that differs substantially from the training data, our approach still infers object positions accurately. In contrast, all baseline models predict object locations that deviate significantly from the ground truth.


Infer Facial Attributes

Our approach can also infer global factors, such as facial attributes, from a test image and reliably generalize to images that differ substantially from training data.



In-Distribution and Out-of-Distribution Facial Feature Prediction. Facial feature prediction results for in-distribution (Left) and out-of-distribution (Right) CelebA images. Our model is trained on female faces from CelebA. During inference, our model can accurately predict facial features consistent with the ground truth for both in-distribution female faces and out-of-distribution male faces.


Infer Object Categories

Finally, we show how our approach can leverage pretrained diffusion models, such as Stable Diffusion, for zero-shot multi-object perception tasks without requiring any additional training.
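As an illustration of how such a zero-shot setup could look with the Hugging Face diffusers library, the sketch below scores a candidate set of per-object text prompts by the denoising error of their summed noise predictions on the VAE latents of a real image; candidate sets with lower error better explain the image, so different object-category hypotheses can be compared. The model ID, prompt handling, and the plain summed composition follow the objective above and are assumptions, not the paper's exact implementation.

```python
import torch
from diffusers import StableDiffusionPipeline

@torch.no_grad()
def prompt_set_denoising_error(pipe, image_latents, prompts, num_samples=32):
    """Score a candidate set of per-object prompts by the denoising error of the
    composed (summed) noise predictions on the latents of an observed image."""
    device = image_latents.device

    # Encode each candidate prompt with the pipeline's text encoder.
    embeds = []
    for p in prompts:
        tokens = pipe.tokenizer(p, padding="max_length",
                                max_length=pipe.tokenizer.model_max_length,
                                truncation=True, return_tensors="pt").input_ids.to(device)
        embeds.append(pipe.text_encoder(tokens)[0])

    total = 0.0
    num_train_steps = pipe.scheduler.config.num_train_timesteps
    for _ in range(num_samples):
        t = torch.randint(0, num_train_steps, (image_latents.shape[0],), device=device)
        noise = torch.randn_like(image_latents)
        noisy = pipe.scheduler.add_noise(image_latents, noise, t)
        # One concept per object prompt; sum their noise predictions.
        eps_hat = sum(pipe.unet(noisy, t, encoder_hidden_states=e).sample for e in embeds)
        total += ((noise - eps_hat) ** 2).mean().item()
    return total / num_samples

# Usage sketch (assumed setup): image_latents would come from encoding the observed
# image with pipe.vae.encode(...) and scaling by pipe.vae.config.scaling_factor, e.g.
# pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")
```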



Zero-Shot Multi-Object Perception on Real-World Images.


Related Projects

Check out a list of our related papers on compositional generation and energy-based models. A full list can be found here!


We present an unsupervised approach to discover generative concepts from a collection of images. We show how such generative concepts can accurately represent the content of images, be recombined and composed to generate new artistic and hybrid images, and be used as a representation for downstream classification tasks.


We propose new samplers, inspired by MCMC, to enable successful compositional generation. Further, we propose an energy-based parameterization of diffusion models which enables the use of new compositional operators and more sophisticated, Metropolis-corrected samplers.

We propose COMET, which discovers and represents concepts as separate energy functions, enabling us to represent both global concepts as well as objects under a unified framework. COMET discovers energy functions through recomposing the input image, which we find captures independent factors without additional supervision.

We present a method to compose different diffusion models together, drawing on the close connection between diffusion models and EBMs. We illustrate how compositional operators enable us to compose multiple sets of objects together as well as generate images subject to complex text prompts.

The visual world around us can be described as a structured set of objects and their associated relations. In this work, we propose to represent each relation as an unnormalized density (an energy-based model), enabling us to compose separate relations in a factorized manner. We show that such a factorized decomposition allows the model to both generate and edit scenes that have multiple sets of relations more faithfully.

We present a set of compositional operators that enable EBMs to exhibit zero-shot compositional visual generation, enabling us to compose visual concepts (through operators of conjunction, disjunction, or negation) together in a zero-shot manner. Our approach enables us to generate faces given a description ((Smiling AND Female) OR (NOT Smiling AND Male)) or to combine several different objects together.