🔥🔥🔥 News!!
- Jul 09, 2025: We updated the Step1X-Edit model and released it as Step1X-Edit-v1p1:
  - Added support for text-to-image (T2I) generation tasks.
  - Improved image editing quality and instruction-following performance.
Quantitative evaluation on GEdit-Bench-EN (Full set). G_SC, G_PQ, and G_O refer to the metrics evaluated by GPT-4.1, while Q_SC, Q_PQ, and Q_O refer to the metrics evaluated by Qwen2.5-VL-72B. To facilitate reproducibility, we have released the intermediate results of our model evaluations.
| Models | G_SC ⬆️ | G_PQ ⬆️ | G_O ⬆️ | Q_SC ⬆️ | Q_PQ ⬆️ | Q_O ⬆️ |
| --- | --- | --- | --- | --- | --- | --- |
| Step1X-Edit (v1.0) | 7.13 | 7.00 | 6.44 | 7.39 | 7.28 | 7.07 |
| Step1X-Edit (v1.1) | 7.66 | 7.35 | 6.97 | 7.65 | 7.41 | 7.35 |
- Apr 25, 2025: We released the inference code and model weights of Step1X-Edit.
- Apr 25, 2025: We made our technical report publicly available.

Step1X-Edit: a unified image editing model that performs impressively on a wide range of genuine user instructions.
🧩 Model Usage
Install the diffusers package with the following commands:
git clone -b step1xedit https://github.com/Peyton-Chen/diffusers.git
cd diffusers
pip install -e .
Here is an example of using the Step1XEditPipeline class to edit images:
import torch
from diffusers import Step1XEditPipeline
from diffusers.utils import load_image

# Load the Step1X-Edit v1.1 weights in bfloat16 and move the pipeline to GPU.
pipe = Step1XEditPipeline.from_pretrained(
    "stepfun-ai/Step1X-Edit-v1p1-diffusers", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

print("=== processing image ===")
image = load_image("examples/0000.jpg").convert("RGB")
# Editing instruction (Chinese): "Put a pendant with a ruby around this girl's neck."
prompt = "给这个女生的脖子上戴一个带有红宝石的吊坠。"

# Run the edit: 28 denoising steps, guidance scale 6.0, and a fixed seed for reproducibility.
image = pipe(
    image=image,
    prompt=prompt,
    num_inference_steps=28,
    true_cfg_scale=6.0,
    generator=torch.Generator().manual_seed(42),
).images[0]
image.save("0000.jpg")
The results will look like:

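The v1.1 release also advertises text-to-image (T2I) generation (see the Jul 09, 2025 news item above). The snippet below is a minimal sketch of what a T2I call might look like with the same pipeline; whether T2I is triggered simply by omitting the image argument is an assumption here, so check the pipeline's docstring for the supported calling convention.

```python
import torch
from diffusers import Step1XEditPipeline

pipe = Step1XEditPipeline.from_pretrained(
    "stepfun-ai/Step1X-Edit-v1p1-diffusers", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

# Assumption: calling the pipeline without a reference image runs plain
# text-to-image generation instead of editing.
result = pipe(
    prompt="A portrait of a girl wearing a ruby pendant",
    num_inference_steps=28,
    true_cfg_scale=6.0,
    generator=torch.Generator().manual_seed(42),
).images[0]
result.save("t2i_example.jpg")
```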
Model Introduction

Framework of Step1X-Edit. Step1X-Edit leverages the image understanding capabilities of MLLMs to parse editing instructions and generate editing tokens, which are then decoded into images with a DiT-based network. For more details, please refer to our technical report.
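To make that data flow concrete, here is an illustrative sketch of the two-stage design described above; all class and method names are placeholders, not the actual Step1X-Edit implementation.

```python
# Illustrative pseudocode of the Step1X-Edit data flow: an MLLM parses the
# (image, instruction) pair into editing tokens, and a DiT-based diffusion
# network decodes those tokens into the edited image.
# All names below are placeholders, not the real API.
class Step1XEditSketch:
    def __init__(self, mllm, dit):
        self.mllm = mllm  # multimodal LLM: (image, instruction) -> editing tokens
        self.dit = dit    # DiT-based diffusion decoder: editing tokens -> image

    def edit(self, image, instruction, num_steps=28):
        # 1. The MLLM reads the reference image and the natural-language
        #    instruction and emits a sequence of editing tokens.
        editing_tokens = self.mllm.encode(image=image, text=instruction)
        # 2. The DiT network denoises latents conditioned on the editing tokens
        #    and decodes the result into the edited image.
        return self.dit.generate(condition=editing_tokens, steps=num_steps)
```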
We release GEdit-Bench, a new benchmark grounded in real-world usage. Carefully curated to reflect actual user editing needs and a wide range of editing scenarios, it supports more authentic and comprehensive evaluation of image editing models. Partial results on the benchmark are shown below:

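If you want to browse GEdit-Bench yourself, the snippet below is a minimal sketch using the datasets library; the dataset id and the split name are assumptions, so consult the GEdit-Bench release for the actual repository name and schema.

```python
from datasets import load_dataset

# Hypothetical dataset id and split; check the GEdit-Bench release for the
# actual Hugging Face repository name and schema.
bench = load_dataset("stepfun-ai/GEdit-Bench", split="train")
print(bench)
for sample in bench.select(range(3)):
    # Inspect the available fields (e.g. instruction, source image) per sample.
    print(sample.keys())
```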
Citation
@article{liu2025step1x-edit,
  title={Step1X-Edit: A Practical Framework for General Image Editing},
  author={Shiyu Liu and Yucheng Han and Peng Xing and Fukun Yin and Rui Wang and Wei Cheng and Jiaqi Liao and Yingming Wang and Honghao Fu and Chunrui Han and Guopeng Li and Yuang Peng and Quan Sun and Jingwei Wu and Yan Cai and Zheng Ge and Ranchen Ming and Lei Xia and Xianfang Zeng and Yibo Zhu and Binxing Jiao and Xiangyu Zhang and Gang Yu and Daxin Jiang},
  journal={arXiv preprint arXiv:2504.17761},
  year={2025}
}