Language-Free Generative Editing from One Visual Example

Computer Vision Lab, University of Würzburg

CVPR 2026
Text–image misalignment in diffusion latent space. Text-guided diffusion models rely on language supervision, which often fails to capture appearance-level transformations such as rain, leading to semantic but visually misaligned directions. Our method, Visual Diffusion Conditioning (VDC), instead learns a vision-centric conditioning signal directly from paired visual examples, uncovering the correct transformation direction within the latent space. By steering the diffusion process along this aligned path, VDC achieves faithful and realistic edits, bridging the gap between text semantics and visual representations.

Overview

Text-guided diffusion models have advanced image editing by enabling intuitive control through language. However, despite their strong capabilities, we find, surprisingly, that state-of-the-art methods struggle with simple, everyday transformations such as rain or blur. We attribute this limitation to weak and inconsistent textual supervision during training, which leads to poor alignment between language and vision. Existing solutions often rely on extra finetuning or stronger text conditioning, but suffer from high data and computational requirements.
We contend that the capability for diffusion-based editing is not lost but merely hidden from text. We show that visual features such as rain patterns and image blur are still recognized in the diffusion visual space even when they are not accessible through text. Through visual examples, we can find the correct conditioning to access and manipulate these features.
We introduce Visual Diffusion Conditioning (VDC), a training-free framework that learns visual conditioning signals directly from visual examples for precise, language-free image editing. Given a paired example—one image with and one without the target effect—VDC derives a visual condition that captures the transformation and steers generation through a novel condition-steering mechanism. An accompanying inversion-correction step mitigates reconstruction errors during DDIM inversion, preserving fine detail and realism. Across diverse tasks, VDC outperforms both training-free and fully fine-tuned text-based editing methods.
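The role of the inversion-correction step can be illustrated with a toy 1-D sketch. Everything below (the linear alpha-bar schedule, the scalar `eps_model`, the blend weight) is an illustrative assumption, not the paper's actual implementation: plain DDIM inversion accumulates reconstruction error because sampling re-evaluates the noise prediction at slightly different latents than inversion did, while blending each sampling step toward the recorded inversion trajectory shrinks that error.

```python
import math

T = 20
# Illustrative noise schedule: alpha_bar decreases from 1 to 0.05.
alpha_bar = [1 - 0.95 * t / T for t in range(T + 1)]

def eps_model(x, t):
    # Toy scalar noise predictor standing in for the UNet; the small
    # nonlinearity makes inversion not exactly invertible.
    return 0.3 * x + 0.05 * math.sin(x)

def ddim_step(x, t_from, t_to):
    # Deterministic DDIM update (eta = 0) from timestep t_from to t_to.
    e = eps_model(x, t_from)
    x0 = (x - math.sqrt(1 - alpha_bar[t_from]) * e) / math.sqrt(alpha_bar[t_from])
    return math.sqrt(alpha_bar[t_to]) * x0 + math.sqrt(1 - alpha_bar[t_to]) * e

def invert(x0):
    # Run DDIM backwards (0 -> T), recording the full trajectory.
    traj = [x0]
    x = x0
    for t in range(T):
        x = ddim_step(x, t, t + 1)
        traj.append(x)
    return traj

def sample(xT, traj=None, blend=0.5):
    # Run DDIM forwards (T -> 0). If an inversion trajectory is given,
    # nudge each intermediate latent toward it (the correction step).
    x = xT
    for t in range(T, 0, -1):
        x = ddim_step(x, t, t - 1)
        if traj is not None:
            x = (1 - blend) * x + blend * traj[t - 1]
    return x

x0 = 1.0
traj = invert(x0)
plain = abs(sample(traj[-1]) - x0)         # reconstruction error, no correction
fixed = abs(sample(traj[-1], traj) - x0)   # with inversion correction
```

Even in this toy setting, uncorrected round-trip reconstruction drifts from the input, and pulling the sampling trajectory toward the stored inversion trajectory reduces that drift — the same failure mode and remedy the correction step targets in latent space.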

Observations


- Text-Visual alignment behavior in T2I Diffusion

The internal representations of Stable Diffusion fail to accurately capture the semantics of degradations such as “rain” or “haze”. Attention maps under text-based conditioning remain object-centric and do not correspond to degradation-specific visual attributes. Omitting text and utilizing visual examples re-aligns attention toward true visual cues, recovering meaningful features that correspond to rain streaks and hazy regions.

- Similar visual features can be accessed with same condition

We assume that visual features related to the semantics of degradations such as “rain” or “haze” are already learned but cannot be accessed through text. Our VDC framework finds the condition that accesses these visual attributes and uses it to process other images, achieving one-shot diffusion adaptation for degradation and editing tasks such as DeRain and DeBlur. When we visualize conditions optimized from different visual examples, conditions from the same task form clear clusters, showing that similar visual features (e.g., rain, blur) are recognized without textual dependency. Additionally, we observe only small variance in results when the visual example is changed.





Figure: Variance in performance when changing the visual example.
Figure: t-SNE visualization of optimized conditions.
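The clustering behavior can be illustrated with synthetic stand-ins for the optimized conditions (the vectors below are simulated, not actual VDC conditions): each task is modeled as a shared direction plus a small per-example perturbation, and within-task cosine similarity comes out well above between-task similarity, which is what the t-SNE clusters reflect.

```python
import math
import random

random.seed(0)
DIM = 64

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cos(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical stand-ins: each task has a shared direction in condition
# space, and each visual example perturbs it slightly.
rain_dir = unit([random.gauss(0, 1) for _ in range(DIM)])
blur_dir = unit([random.gauss(0, 1) for _ in range(DIM)])

def condition(base, noise=0.05):
    return unit([b + random.gauss(0, noise) for b in base])

rain = [condition(rain_dir) for _ in range(5)]
blur = [condition(blur_dir) for _ in range(5)]

intra = sum(cos(a, b) for a in rain for b in rain) / 25  # same-task pairs
inter = sum(cos(a, b) for a in rain for b in blur) / 25  # cross-task pairs
```

With nearly orthogonal task directions in high dimensions, `inter` stays near zero while `intra` stays close to one — conditions optimized from different examples of the same task land in the same neighborhood.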

- Condition steering more effective for content preservation

In our experiments, we notice that directly optimizing the generation condition (Cg) produces low-fidelity outputs, since it works by regenerating a new image rather than altering the existing one. Instead, we optimize Condition Steering (Cs), which guides the unconditional diffusion trajectory toward the desired edit rather than generating a new image. This focuses the optimization on the edit itself, avoiding entanglement with example content and reducing artifacts.

Methodology



Proposed VDC framework. (a) Given a real image, we first invert it through DDIM and apply the learned steering condition Cs to guide sampling toward the desired visual feature (e.g., removing rain) while preserving content and quality. (b) A lightweight Condition Generator produces per-step steering embeddings from token indices, representing the target visual feature. These conditions modulate the diffusion outputs through weighted score blending, enabling training-free visual editing without textual prompts.
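A minimal sketch of the steering mechanism, under stated assumptions: the scalar `score` function stands in for the UNet noise prediction, and the per-step embedding table is a hypothetical stand-in for the Condition Generator. Real VDC operates on latent-space noise predictions; only the blending structure is the point here.

```python
T = 10

class ConditionGenerator:
    # Maps a timestep index to a per-step steering embedding.
    # The embedding is a single scalar here for illustration; in practice
    # these values would be learned during condition optimization.
    def __init__(self, T):
        self.table = [0.1 * (t + 1) / T for t in range(T)]

    def __call__(self, t):
        return self.table[t]

def score(x, t, c=None):
    # Toy noise prediction; the condition shifts the prediction.
    # (t is unused in this toy model but kept for signature realism.)
    base = 0.5 * x
    return base if c is None else base + c

def steered_score(x, t, gen, weight=2.0):
    # Weighted blending of the unconditional and condition-steered
    # predictions, playing the role normally taken by classifier-free
    # guidance: steer the unconditional trajectory, don't regenerate.
    e_u = score(x, t)
    e_c = score(x, t, gen(t))
    return e_u + weight * (e_c - e_u)

gen = ConditionGenerator(T)
out = steered_score(1.0, 3, gen)
```

The blended score costs the same two model evaluations per step as classifier-free guidance, which is why swapping CFG for steering adds no inference overhead.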

Results


- VDC offers efficient and fast condition optimization with zero inference overhead

VDC entails a one-time task optimization; once learned, the condition is applied instantly to any new image. While we report 30 minutes on a single GPU for peak-fidelity condition optimization, our ablation below shows that VDC outperforms OmniGen in just 10 steps (∼2 minutes). Unlike prior works requiring millions of samples, VDC defines a task from a single pair, yielding a significant efficiency gain. Inference incurs zero overhead, as VDC replaces CFG, leaving latency determined by the underlying diffusion model.
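The one-time optimization can be sketched in miniature. The `edit` operator, learning rate, and squared-error loss below are illustrative assumptions; the real method optimizes the steering condition through the diffusion sampler rather than a closed-form edit.

```python
def edit(x, c):
    # Hypothetical edit operator: the condition c shifts an image statistic x.
    return x + c

# One paired visual example (e.g., a rainy image and its clean counterpart).
x_degraded, x_clean = 0.8, 0.2

c, lr = 0.0, 0.3
for _ in range(20):               # few-step optimization, as in the ablation
    pred = edit(x_degraded, c)
    grad = 2 * (pred - x_clean)   # d/dc of the loss (pred - x_clean)**2
    c -= lr * grad

# Once optimized, the condition applies instantly to any new input,
# with no additional per-image cost.
new_image = 0.9
edited = edit(new_image, c)
```

The optimized condition captures the transformation itself (here, "subtract 0.6"), so applying it to an unseen input is a single forward pass: this is the zero-overhead amortization the paragraph above describes.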

- VDC allows one-shot editing adaptations

With just one visual example, VDC yields clean results surpassing other methods, including special-purpose restoration methods. Text- and example-based methods struggle with complex edits due to misalignment or degradation priors.

- VDC generalizes to real data from synthetic examples

Our method optimizes the condition to access already-learned visual attributes related to the semantics of degradations such as “rain” or “noise”, which the base model acquired from real data. This makes it easier for our method to extend to out-of-distribution data, allowing it to work on real images. We compare our method to state-of-the-art All-in-One Image Restoration (IR) methods on real-image DeRain. Our method generalizes to real data where prior works fail.



- VDC can extend to a general editing method

We center our benchmark on fine-detail edits, global adjustments, and image restoration tasks—categories where existing methods often struggle due to visual–text misalignment. Nonetheless, our approach is a general editing framework: it extracts the transformation from a given example and applies it to a new input.


However, VDC prioritizes structural fidelity over non-rigid flexibility to prevent hallucinations, which limits large structural changes. VDC resolves this by supporting textual control: as shown below, VDC handles visual patterns (DeRain) while text drives semantic shifts (e.g., bears→cats) and non-rigid edits (e.g., closing eyes).

Contacts

Omar Elezabi: omar.elezabi@uni-wuerzburg.de

BibTeX