Compositional Reasoning in VLMs
A hard negative fine-tuning method to improve compositional reasoning in CLIP
Overview
Vision-language models (VLMs) have achieved remarkable success on zero-shot recognition tasks. However, their ability to perform compositional reasoning (understanding the relationships between objects, attributes, and spatial configurations) remains limited. For example, a model may fail to distinguish "a red book on a blue table" from "a blue book on a red table".
While recent fine-tuning methods aim to address this, our analysis reveals that improvements often stem from exploiting simplistic negative captions, rather than encouraging genuine reasoning.
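To make this concrete, the sketch below shows a caption-only (image-blind) baseline: if a benchmark's negative captions are ungrammatical or implausible, a plain language model can often pick the true caption without ever seeing the image, so gains on that benchmark need not reflect visual compositional reasoning. This is an illustration only, not our analysis code; it uses GPT-2 from Hugging Face `transformers` as a stand-in fluency scorer.

```python
# Illustrative image-blind baseline: score captions by language-model fluency alone.
# If this baseline already separates positives from negatives, the benchmark is
# "hackable" without any compositional (or even visual) reasoning.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def caption_nll(caption: str) -> float:
    """Average negative log-likelihood under GPT-2 (lower = more fluent)."""
    ids = tokenizer(caption, return_tensors="pt").input_ids
    return lm(ids, labels=ids).loss.item()

def blind_choice(positive: str, negative: str) -> bool:
    """True if the image-blind baseline already prefers the positive caption."""
    return caption_nll(positive) < caption_nll(negative)

# A scrambled, simplistic negative is easy to reject without the image,
# whereas a fluent attribute-swapped negative is not.
print(blind_choice("A cat is sleeping on a red sofa.",
                   "Sofa a red on sleeping is cat a."))
```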
Our Approach: High-quality Targeted Caption Generation
We propose a new fine-tuning framework called High-quality Targeted Caption Generation (HTCG). The key idea is to fine-tune CLIP on semantically coherent positive and hard-negative captions that are specifically crafted to teach the model the fine-grained structure of visual scenes.
This method not only encourages better alignment between text and image, but also builds the capacity for true compositional understanding.
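As a rough illustration (not the exact HTCG objective or training code), the sketch below shows how hard-negative captions can be folded into CLIP's contrastive loss: each image is contrasted against the other captions in the batch plus one targeted negative caption of its own. The model and processor classes are from Hugging Face `transformers`; batch construction and the caption-generation step are assumed.

```python
# Sketch of hard-negative contrastive fine-tuning for CLIP (illustrative, not HTCG itself).
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def hard_negative_clip_loss(images, pos_captions, neg_captions):
    """images: list of PIL images; pos/neg_captions: one string per image."""
    n = len(images)
    # Encode images and all captions (positives first, then hard negatives).
    inputs = processor(text=pos_captions + neg_captions, images=images,
                       return_tensors="pt", padding=True, truncation=True)
    img_emb = F.normalize(model.get_image_features(pixel_values=inputs["pixel_values"]), dim=-1)
    txt_emb = F.normalize(model.get_text_features(input_ids=inputs["input_ids"],
                                                  attention_mask=inputs["attention_mask"]), dim=-1)

    logit_scale = model.logit_scale.exp()
    # Image-to-text logits over 2n captions: the last n are hard negatives that
    # differ from the positives only in composition (word order, attribute
    # binding, relations), so the model cannot solve them by fluency alone.
    logits_i2t = logit_scale * img_emb @ txt_emb.t()      # (n, 2n)
    # Text-to-image logits use only the n positive captions.
    logits_t2i = logit_scale * txt_emb[:n] @ img_emb.t()  # (n, n)

    targets = torch.arange(n)
    return (F.cross_entropy(logits_i2t, targets) + F.cross_entropy(logits_t2i, targets)) / 2
```

In a full training loop this loss is backpropagated through both encoders; what distinguishes targeted approaches from the simplistic negatives critiqued above is the quality of the hard negatives, i.e. how fluent and plausible they remain while changing the scene's composition.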
Key Results
Our experiments show:
- Improved performance on challenging compositional reasoning benchmarks, including SOTA performance on SugarCrepe at the time of writing.
- Targeted, high-quality captions yield clearer gains than brute-force dataset scaling.
- A promising pathway toward VLMs that reason more like humans do—by understanding the structure of what they see.
| Model | ColorSwap | SugarCrepe | Winoground | ARO | CREPE |
|---|---|---|---|---|---|
| ViT-B-32 | 0.137 | 0.765 | 0.080 | 0.588 | 0.648 |
| NegCLIP | 0.183 | 0.837 | 0.080 | 0.801 | 0.303 |
| CLoVe | 0.186 | 0.845 | 0.065 | 0.829 | 0.416 |
| CE-CLIP | 0.133 | 0.856 | 0.0525 | 0.763 | 0.346 |
| GNM-CLIP | 0.126 | 0.787 | 0.103 | 0.571 | 0.173 |
| TSVLC | 0.107 | 0.769 | 0.0675 | 0.835 | 0.359 |
| DAC-SAM | 0.122 | 0.866 | 0.080 | 0.725 | 0.902 |
| HTCG | 0.159 | 0.897 | 0.070 | 0.666 | 0.777 |
Table: Comparison of fine-tuning methods on challenging compositional reasoning benchmarks (ColorSwap, SugarCrepe, Winoground) and hackable benchmarks (ARO, CREPE). All methods start from the OpenAI CLIP ViT-B-32 base model, shown in the first row as the zero-shot baseline. Metrics vary by benchmark (text score or group score).
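For context on the metric note above, the snippet below sketches how per-example text, image, and group scores are conventionally computed for paired benchmarks such as Winoground and ColorSwap (two images and two captions per example). This illustrates the standard convention, not this project's evaluation code.

```python
# Winoground/ColorSwap-style scoring for one example, given a 2x2 similarity
# matrix s[img][cap] from any VLM (image i paired with caption i is correct).
def paired_scores(s):
    # Text score: each image must prefer its own caption.
    text = s[0][0] > s[0][1] and s[1][1] > s[1][0]
    # Image score: each caption must prefer its own image.
    image = s[0][0] > s[1][0] and s[1][1] > s[0][1]
    # Group score: both conditions hold simultaneously.
    return {"text": text, "image": image, "group": text and image}
```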
More Details
The full write-up can be found here. This project was done in collaboration with Patrick Lutz and Adivk Vyas at Boston University.