Paper Details
- Authors: Akhil Dudhipala, Pavan Pativada, and Rahul Karne
- Computer Science and Engineering
- Paper ID: MIJRDV4I40013
- Volume: 04
- Issue: 04
- Pages: 111-120
- ISSN: 2583-0406
- Publication Year: 2025
Abstract
Open-vocabulary semantic segmentation allows models to label pixels with arbitrary category names that go beyond the classes seen during training. With the advent of powerful vision foundation models (e.g., CLIP, SAM), training-free approaches have emerged that segment novel concepts without any model fine-tuning. This paper presents a gap analysis of these methods from 2022 to 2024. It identifies recurring gaps such as coarse spatial localization, static and unstructured prompting, lack of temporal consistency on video, and slow inference. We then summarize techniques proposed to address these gaps, including CLIP calibration methods, diffusion-augmented segmentation, dynamic prompting with LLMs, and hierarchical classifiers. Although recent methods show strong zero-shot segmentation quality and approach real-time performance on isolated image categories, significant problems remain: segmentations riddled with holes, mis-specified prompts, and latency too high for interactive use. We conclude with a vision of practical, safe, efficient, and scalable open-vocabulary segmentation that is responsive to prompts, flexible, compact in model size, and temporally smooth.
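To make the training-free pipeline discussed in the abstract concrete, the sketch below shows one common way such methods combine foundation models: SAM proposes class-agnostic masks, and CLIP scores each masked region against an arbitrary text vocabulary. This is a minimal, hedged illustration rather than the paper's specific method; the checkpoint path, image file, and label list are illustrative placeholders, and it assumes OpenAI's `clip` package and Meta's `segment_anything` are installed.

```python
# Minimal sketch: training-free open-vocabulary segmentation with SAM + CLIP.
# Assumptions: `clip` (OpenAI) and `segment_anything` (Meta) are installed;
# "sam_vit_b.pth", "street.jpg", and the label list are hypothetical placeholders.
import numpy as np
import torch
import clip
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. SAM generates class-agnostic region masks (no fine-tuning involved).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth").to(device)
mask_generator = SamAutomaticMaskGenerator(sam)

# 2. CLIP serves as the open-vocabulary classifier over an arbitrary label set.
clip_model, preprocess = clip.load("ViT-B/32", device=device)
labels = ["a photo of a dog", "a photo of a bicycle", "a photo of grass"]
with torch.no_grad():
    text_feat = clip_model.encode_text(clip.tokenize(labels).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

image = np.array(Image.open("street.jpg").convert("RGB"))
for mask in mask_generator.generate(image):
    seg = mask["segmentation"]                     # boolean HxW region mask
    ys, xs = np.where(seg)
    crop = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    crop_in = preprocess(Image.fromarray(crop)).unsqueeze(0).to(device)
    with torch.no_grad():
        img_feat = clip_model.encode_image(crop_in)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        scores = (img_feat @ text_feat.T).softmax(dim=-1)
    # Assign the best-matching open-vocabulary label to this region.
    print(labels[scores.argmax().item()], float(scores.max()))
```

Cropping each region to a bounding box before CLIP scoring is the simplest choice but loses mask shape information, which is one source of the coarse localization noted in the gap analysis.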
Keywords
Open-Vocabulary Segmentation, Training-Free Methods, Vision Foundation Models, CLIP and SAM, Zero-Shot Image and Video Understanding.
Cite this Publication
Akhil Dudhipala, Pavan Pativada, and Rahul Karne (2025), Training-Free Open-Vocabulary Segmentation Using Vision Foundation Models: A Gap Analysis. Multidisciplinary International Journal of Research and Development (MIJRD), Volume: 04 Issue: 04, Pages: 111-120. https://www.mijrd.com/papers/v4/i4/MIJRDV4I40013.pdf