Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models

Accepted to TMLR 02/2024

1University of North Carolina, Chapel Hill, 2University of California, Los Angeles, 3Allen Institute for Artificial Intelligence (AI2)

Illustration of our evaluation framework for probing cross-task inconsistency in unified models via contrast sets. We build candidate answers for multiple tasks which correspond to different semantic understandings of an image (e.g., if the object is a keyboard or laptop), and check whether the model's preferred answers across tasks match the same semantic understanding.


As general purpose vision models get increasingly effective at a wide set of tasks, it is imperative that they be consistent across the tasks they support. Inconsistent AI models are considered brittle and untrustworthy by human users and are more challenging to incorporate into larger systems that take dependencies on their outputs.

Measuring consistency between very heterogeneous tasks that might include outputs in different modalities is challenging since it is difficult to determine if the predictions are consistent with one another. As a solution, we introduce a benchmark dataset, CoCoCON, where we use contrast sets created by modifying test instances for multiple tasks in small but semantically meaningful ways to change the gold label, and outline metrics for measuring if a model is consistent by ranking the original and perturbed instances across tasks. We find that state-of-the-art systems suffer from a surprisingly high degree of inconsistent behavior across tasks, especially for more heterogeneous tasks.

Finally, we propose using a rank correlation-based auxiliary objective computed over large automatically created cross-task contrast sets to improve the multi-task consistency of large unified models, while retaining their original accuracy on downstream tasks.
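To make the idea of a rank correlation-based objective concrete, here is a minimal sketch that compares how two tasks rank the same contrast set. It assumes Spearman's rank correlation (the text above does not say which correlation is used), assumes no tied scores, and uses hypothetical names (`ranks`, `spearman`, `consistency_penalty`). It is not differentiable as written, so an actual training objective would need a smooth relaxation of the ranking step.

```python
from typing import List, Sequence


def ranks(scores: Sequence[float]) -> List[int]:
    """Return 1-based ranks, where rank 1 is the highest score (assumes no ties)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(task_a: Sequence[float], task_b: Sequence[float]) -> float:
    """Spearman rank correlation between two tasks' likelihoods over one contrast set."""
    n = len(task_a)
    if n != len(task_b) or n < 2:
        raise ValueError("need two score lists of equal length >= 2")
    ra, rb = ranks(task_a), ranks(task_b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def consistency_penalty(task_a: Sequence[float], task_b: Sequence[float]) -> float:
    """Hypothetical auxiliary term: 0 when the rankings agree, 2 when fully reversed."""
    return 1.0 - spearman(task_a, task_b)


# Illustrative log-likelihoods for [gold, contrast 1, contrast 2]; values are made up.
captioning = [-1.0, -2.0, -3.0]
vqa_agree = [-0.5, -1.5, -4.0]
vqa_reversed = [-4.0, -1.5, -0.5]
```

With these made-up scores, `consistency_penalty(captioning, vqa_agree)` is zero because both tasks order the candidates identically, while the reversed ordering gives the maximum penalty.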

What is Cross-Task Inconsistency?


Dataset Construction


Step-by-step demonstration of the automated pipeline for generating contrast sets for CoCoCON. Contrast sets generated from this pipeline for the validation split of COCO are subjected to manual filtering and then used to prepare the CoCoCON benchmark.

Examples from CoCoCON benchmark


For each example, we show the relevant image (left), the ground truth caption, VQA question, or image generation prompt for the image with the perturbed concept in green (middle), the set of perturbations used to generate alternative answers and predictions from Unified-IO XL for VQA (V), image generation (G) and localization (L) (right columns). ✅ and ❌ indicate scenarios where the model predictions for captioning and the corresponding task for that particular contrast set are consistent and inconsistent respectively. ‘-’ denotes a lack of localization annotations for the given sample.

How to evaluate consistency between heterogeneous output modalities?


We use the unified model's likelihoods for the gold and contrast labels of each task to rank the outputs of an anchor task (image captioning) and a target task (VQA, localization, or text-to-image generation). If the model ranks the gold output higher for both tasks, or ranks the contrast output higher for both tasks, it is consistent; otherwise it is inconsistent. With this method, we avoid having to compare heterogeneous modalities directly, e.g., a bounding box vs. a caption.
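The ranking check described above can be sketched in a few lines. This is an illustrative sketch rather than the benchmark's implementation: the function names are hypothetical, the scores are assumed to be log-likelihoods already produced by the model, and ties are arbitrarily treated as preferring the contrast output.

```python
from typing import Iterable, Tuple

# (gold score, contrast score) for one task on one contrast pair.
ScorePair = Tuple[float, float]


def prefers_gold(scores: ScorePair) -> bool:
    """True if the model assigns a strictly higher likelihood to the gold output."""
    gold, contrast = scores
    return gold > contrast


def is_consistent(anchor: ScorePair, target: ScorePair) -> bool:
    """Consistent when both tasks prefer gold, or both prefer the contrast output."""
    return prefers_gold(anchor) == prefers_gold(target)


def percent_consistency(pairs: Iterable[Tuple[ScorePair, ScorePair]]) -> float:
    """Percentage of (anchor, target) pairs on which the two tasks agree."""
    results = [is_consistent(anchor, target) for anchor, target in pairs]
    if not results:
        raise ValueError("no contrast pairs given")
    return 100.0 * sum(results) / len(results)


# Made-up log-likelihoods: (captioning scores, target-task scores) per contrast pair.
examples = [
    ((-1.0, -2.0), (-0.5, -3.0)),  # both prefer gold: consistent
    ((-2.0, -1.0), (-3.0, -0.5)),  # both prefer contrast: consistent
    ((-1.0, -2.0), (-3.0, -0.5)),  # captioning prefers gold, target prefers contrast
    ((-1.0, -2.0), (-0.2, -0.9)),  # both prefer gold: consistent
]
```

Note that the second example counts as consistent even though both tasks are wrong, which is why consistency is measured separately from accuracy.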


We evaluate pretrained models on the CoCoCON benchmark. (a) % consistency of Unified-IO XL and OFA-HUGE models for varying difficulty (k) and all tasks in CoCoCON, (b) % consistency (k=1) values for different sizes of Unified-IO models and (c) comparison of % accuracy with % consistency (k=1) values for all sizes of OFA models and our OFACon model.



  • Models are more inconsistent across tasks of diverse modalities. Unified-IO and OFA demonstrate higher inconsistency in image captioning vs. text-to-image generation than in image captioning vs. visual question answering (VQA).
  • Models are inconsistent on hard as well as easy contrast sets. CoCoCON contains contrast sets of varying difficulties. While models are more consistent on easier contrast sets, the consistency is far from 100%, especially across tasks of varying modalities.
  • Larger multi-task models are more accurate as well as consistent. Larger versions of Unified-IO and OFA models demonstrate higher accuracy as well as consistency on CoCoCON.
  • Models are more accurate than consistent. This suggests that when models make mistakes for one task they rarely make the same kind of mistakes on the other tasks, which is what would allow a model to achieve high consistency independently of accuracy.
  • Models capable of performing more tasks are more inconsistent. Unified-IO models exhibit more inconsistency than OFA models that are trained to perform only a subset of the tasks comprising Unified-IO's capabilities.


