KV-Edit: Training-Free Image Editing for Precise Background Preservation

1 Shenzhen International Graduate School, Tsinghua University 2 Institute of Artificial Intelligence (TeleAI), China Telecom

We propose KV-Edit to address the challenge of background preservation in image editing, thereby enhancing the practicality of AI editing. Rather than designing complex mechanisms, we achieve impressive results by simply preserving the key-value pairs of the background. Our method effectively handles common semantic editing operations, including adding, removing, and changing objects.

🖼 Results

💭 Abstract

Background consistency remains a significant challenge in image editing tasks. Despite extensive developments, existing works still face a trade-off between maintaining similarity to the original image and generating content that aligns with the target. Here, we propose KV-Edit, a training-free approach that uses the KV cache in DiTs to maintain background consistency: background tokens are preserved rather than regenerated, eliminating the need for complex mechanisms or expensive training, and new content is generated within user-provided regions so that it integrates seamlessly with the background. We further analyze the memory consumption of the KV cache during editing and optimize the space complexity to O(1) using an inversion-free method. Our approach is compatible with any DiT-based generative model without additional training. Experiments demonstrate that KV-Edit significantly outperforms existing approaches in terms of both background and image quality, even surpassing training-based methods.
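For intuition, here is a minimal sketch of how a user-provided edit mask could split image tokens into an editable foreground set and a preserved background set. This is not the released code; the function name, the patch size, and the assumption of a row-major flattened latent grid are illustrative.

```python
import torch
import torch.nn.functional as F

def split_tokens(mask, patch_size=2):
    """Downsample a binary edit mask (H, W) to token resolution and
    return foreground / background token indices.

    Illustrative assumption: latent tokens form an (H/ps) x (W/ps) grid,
    flattened in row-major order.
    """
    tok_mask = F.max_pool2d(mask[None, None].float(), kernel_size=patch_size)
    tok_mask = tok_mask.flatten() > 0
    fg_idx = tok_mask.nonzero(as_tuple=True)[0]      # tokens to regenerate
    bg_idx = (~tok_mask).nonzero(as_tuple=True)[0]   # tokens whose KV is cached
    return fg_idx, bg_idx
```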

⭐️ Pipeline

We implement a KV cache in the DiT-based generative model: it stores the key-value pairs of background tokens during the inversion process and concatenates them with foreground content during denoising. Since background tokens are preserved rather than regenerated, KV-Edit strictly maintains background consistency while generating new content that integrates seamlessly with the background. A rough code sketch of this attention step is given below.
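The following PyTorch-style sketch illustrates the denoising-time attention under these assumptions; it is not the official implementation, and names such as `kv_cache`, `q_fg`, `k_fg`, and `v_fg` are placeholders.

```python
import torch
import torch.nn.functional as F

def attention_with_bg_cache(q_fg, k_fg, v_fg, kv_cache, step):
    """Attend foreground queries over regenerated foreground tokens plus
    cached background tokens (illustrative sketch).

    q_fg, k_fg, v_fg: (B, H, N_fg, D) projections of foreground tokens
                      at the current denoising step.
    kv_cache:         dict mapping step -> (k_bg, v_bg), each of shape
                      (B, H, N_bg, D), stored during inversion.
    """
    k_bg, v_bg = kv_cache[step]             # background KV saved at inversion
    k = torch.cat([k_fg, k_bg], dim=2)      # foreground + background keys
    v = torch.cat([v_fg, v_bg], dim=2)      # foreground + background values
    # Only foreground queries are updated; background tokens are never regenerated.
    return F.scaled_dot_product_attention(q_fg, k, v)
```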

🗒 Quantitative Comparison

Comparison with previous methods on PIE-Bench. VAE* denotes the inherent reconstruction error introduced by direct VAE reconstruction. P2P and MasaCtrl are DDIM-based methods, while RF Inversion and RF Edit are flow-based. BrushEdit and FLUX Fill represent training-based methods. NS indicates that no steps are skipped during inversion. RI indicates the addition of the reinitialization strategy. Bold and underlined values denote the best and second-best results, respectively.

BibTeX

If you find our work useful, please cite our paper:

@misc{
}