Compositional Reasoning in VLMs
A hard negative fine-tuning method to improve compositional reasoning in CLIP
Overview
Vision-language models (VLMs) have achieved remarkable success on zero-shot recognition tasks. However, their ability to perform compositional reasoning (understanding the relationships between objects, attributes, and spatial configurations) remains limited. For example, a model may fail to distinguish "a red book on a blue table" from "a blue book on a red table".
While recent fine-tuning methods aim to address this, our analysis reveals that improvements often stem from exploiting simplistic negative captions, rather than encouraging genuine reasoning.
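To make this concrete, the sketch below shows a caption-only (image-blind) baseline: if a benchmark's negative captions are ungrammatical or implausible, a plain language model can often pick the true caption without ever seeing the image, so gains on that benchmark need not reflect visual compositional reasoning. This is an illustration only, not our analysis code; it uses GPT-2 from Hugging Face `transformers` as a stand-in fluency scorer.

```python
# Illustrative image-blind baseline: score captions by language-model fluency alone.
# If this baseline already separates positives from negatives, the benchmark is
# "hackable" without any compositional (or even visual) reasoning.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def caption_nll(caption: str) -> float:
    """Average negative log-likelihood under GPT-2 (lower = more fluent)."""
    ids = tokenizer(caption, return_tensors="pt").input_ids
    return lm(ids, labels=ids).loss.item()

def blind_choice(positive: str, negative: str) -> bool:
    """True if the image-blind baseline already prefers the positive caption."""
    return caption_nll(positive) < caption_nll(negative)

# A scrambled, simplistic negative is easy to reject without the image,
# whereas a fluent attribute-swapped negative is not.
print(blind_choice("A cat is sleeping on a red sofa.",
                   "Sofa a red on sleeping is cat a."))
```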
Our Approach: High-quality Targeted Caption Generation
We propose a new fine-tuning framework called High-quality Targeted Caption Generation (HTCG). The key idea is to fine-tune CLIP on semantically coherent positive and hard-negative captions that are specifically crafted to teach the model the fine-grained structure of visual scenes.
This method not only encourages better alignment between text and image, but also builds the capacity for true compositional understanding.
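As a rough illustration (not the exact HTCG objective or training code), the sketch below shows how hard-negative captions can be folded into CLIP's contrastive loss: each image is contrasted against the other captions in the batch plus one targeted negative caption of its own. The model and processor classes are from Hugging Face `transformers`; batch construction and the caption-generation step are assumed.

```python
# Sketch of hard-negative contrastive fine-tuning for CLIP (illustrative, not HTCG itself).
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def hard_negative_clip_loss(images, pos_captions, neg_captions):
    """images: list of PIL images; pos/neg_captions: one string per image."""
    n = len(images)
    # Encode images and all captions (positives first, then hard negatives).
    inputs = processor(text=pos_captions + neg_captions, images=images,
                       return_tensors="pt", padding=True, truncation=True)
    img_emb = F.normalize(model.get_image_features(pixel_values=inputs["pixel_values"]), dim=-1)
    txt_emb = F.normalize(model.get_text_features(input_ids=inputs["input_ids"],
                                                  attention_mask=inputs["attention_mask"]), dim=-1)

    logit_scale = model.logit_scale.exp()
    # Image-to-text logits over 2n captions: the last n are hard negatives that
    # differ from the positives only in composition (word order, attribute
    # binding, relations), so the model cannot solve them by fluency alone.
    logits_i2t = logit_scale * img_emb @ txt_emb.t()      # (n, 2n)
    # Text-to-image logits use only the n positive captions.
    logits_t2i = logit_scale * txt_emb[:n] @ img_emb.t()  # (n, n)

    targets = torch.arange(n)
    return (F.cross_entropy(logits_i2t, targets) + F.cross_entropy(logits_t2i, targets)) / 2
```

In a full training loop this loss is backpropagated through both encoders; what distinguishes targeted approaches from the simplistic negatives critiqued above is the quality of the hard negatives, i.e. how fluent and plausible they remain while changing the scene's composition.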
Key Results
Our experiments show:
- Improved performance on challenging compositional reasoning benchmarks, including SOTA performance on SugarCrepe at the time of writing.
- Targeted, high-quality captions yield clearer gains than brute-force dataset scaling.
- A promising pathway toward VLMs that reason more like humans do—by understanding the structure of what they see.
| Model | ColorSwap | SugarCrepe | Winoground | ARO | CREPE |
|---|---|---|---|---|---|
| ViT-B-32 | 0.137 | 0.765 | 0.080 | 0.588 | 0.648 |
| NegCLIP | 0.183 | 0.837 | 0.080 | 0.801 | 0.303 |
| CLoVe | 0.186 | 0.845 | 0.065 | 0.829 | 0.416 |
| CE-CLIP | 0.133 | 0.856 | 0.0525 | 0.763 | 0.346 |
| GNM-CLIP | 0.126 | 0.787 | 0.103 | 0.571 | 0.173 |
| TSVLC | 0.107 | 0.769 | 0.0675 | 0.835 | 0.359 |
| DAC-SAM | 0.122 | 0.866 | 0.080 | 0.725 | 0.902 |
| HTCG | 0.159 | 0.897 | 0.070 | 0.666 | 0.777 |
Table: Comparison of fine-tuning methods on challenging compositional reasoning benchmarks (ColorSwap, SugarCrepe, Winoground) and hackable benchmarks (ARO, CREPE). All methods start from the OpenAI CLIP ViT-B-32 base model, shown in the first row as the zero-shot baseline. Metrics vary by benchmark (text score or group score).
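For context on the metric note above, the snippet below sketches how per-example text, image, and group scores are conventionally computed for paired benchmarks such as Winoground and ColorSwap (two images and two captions per example). This illustrates the standard convention, not this project's evaluation code.

```python
# Winoground/ColorSwap-style scoring for one example, given a 2x2 similarity
# matrix s[img][cap] from any VLM (image i paired with caption i is correct).
def paired_scores(s):
    # Text score: each image must prefer its own caption.
    text = s[0][0] > s[0][1] and s[1][1] > s[1][0]
    # Image score: each caption must prefer its own image.
    image = s[0][0] > s[1][0] and s[1][1] > s[0][1]
    # Group score: both conditions hold simultaneously.
    return {"text": text, "image": image, "group": text and image}
```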
More Details
The full write-up can be found here. This project was done in collaboration with Patrick Lutz and Adivk Vyas at Boston University.