CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions

In this project we construct the CLEVR-Ref+ dataset, by repurposing and augmenting the CLEVR dataset from VQA towards referring expressions. This synthetic diagnostic dataset will complement existing real-world ones:

clevr-ref+-overview

There are two main reasons we did this:

One is to test and diagnose state-of-the-art referring expression models at a more detailed level than the detection accuracy or segmentation IoU, including our proposed IEP-Ref model that explicitly captures compositionality.
The other is to perform step-by-step inspection of the model’s reasoning process. We are especially interested in whether the neural modules in Neural Module Networks indeed perform the intended functionalities when trained end-to-end. According to our experiments, the answer is quite positive:

clevr-ref+-inspection

Download Links

[Paper]

[CLEVR-Ref+-v1.0 Dataset (16GB)]

[CLEVR-Ref+-CoGenT-v1.0 Dataset (19GB)]

[Code for CLEVR-Ref+ Dataset Generation]

[Code for IEP-Ref Model]