CLIPSym: Delving into Symmetry Detection with CLIP
Abstract
CLIPSym builds on the pre-trained CLIP vision-language model and enhances symmetry detection with a rotation-equivariant decoder and semantic-aware prompt grouping, outperforming existing methods on standard datasets.
Symmetry is one of the most fundamental geometric cues in computer vision, and detecting it has been an ongoing challenge. With the recent advances in vision-language models, e.g., CLIP, we investigate whether a pre-trained CLIP model can aid symmetry detection by leveraging the additional symmetry cues found in natural image descriptions. We propose CLIPSym, which leverages CLIP's image and language encoders together with a rotation-equivariant decoder, based on a hybrid of Transformer and G-Convolution, to detect rotation and reflection symmetries. To fully utilize CLIP's language encoder, we develop a novel prompting technique called Semantic-Aware Prompt Grouping (SAPG), which aggregates a diverse set of frequent object-based prompts to better integrate semantic cues into symmetry detection. Empirically, we show that CLIPSym outperforms the current state-of-the-art on three standard symmetry detection datasets (DENDI, SDRW, and LDRS). Finally, we conduct detailed ablations verifying the benefits of CLIP's pre-training, the proposed equivariant decoder, and the SAPG technique. The code is available at https://github.com/timyoung2333/CLIPSym.
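To make the two main ideas concrete, below is a minimal, hypothetical PyTorch sketch of (a) a C4 (90°) rotation-equivariant lifting convolution of the kind a G-Convolution decoder builds on, and (b) an SAPG-style aggregation of object-based prompt embeddings. All names here (`C4LiftingConv`, `EquivariantDecoderHead`, `sapg_embedding`) are illustrative assumptions, not the authors' actual API, and the text encoder is assumed to follow an open_clip-style interface; consult the linked repository for the real implementation.

```python
# Hypothetical sketch of the components described in the abstract; names
# and interfaces are assumptions, NOT the authors' actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class C4LiftingConv(nn.Module):
    """Lifts a feature map to the C4 group by convolving with four
    rotated copies of one learned kernel; the weight sharing is what
    yields equivariance to 90-degree rotations."""

    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.02)

    def forward(self, x):                      # x: (B, C, H, W)
        outs = []
        for r in range(4):                     # rotate the kernel, not the input
            w = torch.rot90(self.weight, r, dims=(2, 3))
            outs.append(F.conv2d(x, w, padding=self.weight.shape[-1] // 2))
        return torch.stack(outs, dim=1)        # (B, 4, out_ch, H, W)


class EquivariantDecoderHead(nn.Module):
    """Toy decoder head: averaging over the group axis makes the symmetry
    heatmap invariant to 90-degree rotations of the input features."""

    def __init__(self, in_ch, hidden=64):
        super().__init__()
        self.lift = C4LiftingConv(in_ch, hidden)
        self.out = nn.Conv2d(hidden, 1, kernel_size=1)

    def forward(self, feats):
        g = self.lift(feats).mean(dim=1)       # pool the 4 rotated responses
        return self.out(F.relu(g))             # (B, 1, H, W) symmetry logits


def sapg_embedding(clip_model, tokenizer, object_names, device="cpu"):
    """SAPG-style aggregation as we read it from the abstract: encode a
    set of frequent object-based prompts with CLIP's text encoder and
    average them into one semantic conditioning vector."""
    prompts = [f"a photo of a symmetric {o}" for o in object_names]
    tokens = tokenizer(prompts).to(device)
    with torch.no_grad():
        emb = clip_model.encode_text(tokens).float()
    emb = F.normalize(emb, dim=-1)
    return emb.mean(dim=0)                     # (D,) grouped prompt embedding
```

Averaging the four rotated-kernel responses is the simplest way to obtain rotation invariance from a lifted representation, which matches the abstract's motivation for an equivariant decoder; the actual model uses a Transformer/G-Convolution hybrid rather than this toy head.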
Community
The work introduces a novel framework for detecting reflection and rotation symmetries in images by leveraging pre-trained vision-language models (VLMs), specifically CLIP. The core hypothesis is that the representations CLIP learns from pre-training on vast image-text pairs encode symmetry cues that can benefit this task.
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- FIX-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text (2025)
- Axis-level Symmetry Detection with Group-Equivariant Representation (2025)
- Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception (2025)
- CLIP-IN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions (2025)
- ANPrompt: Anti-noise Prompt Tuning for Vision-Language Models (2025)
- Towards Fine-Grained Adaptation of CLIP via a Self-Trained Alignment Score (2025)
- CapRecover: A Cross-Modality Feature Inversion Attack Framework on Vision Language Models (2025)