# CrossCheck-Bench
**Repository Path**: ByteDance/CrossCheck-Bench
## Basic Information
- **Project Name**: CrossCheck-Bench
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-11-18
- **Last Updated**: 2026-01-12
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
**CrossCheck-Bench: Diagnosing Compositional Failures in Multimodal Conflict Resolution**

Baoliang Tian¹\*, Yuxuan Si¹,²\*, Jilong Wang¹,³\*, Lingyao Li¹, Zhongyuan Bao¹, Zineng Zhou¹, Tao Wang¹†, Sixu Li¹, Ziyao Xu¹, Mingze Wang¹, Zhouzhuo Zhang¹, Zhihao Wang¹, Yike Yun¹, Ke Tian¹, Ning Yang³†, Minghui Qiu¹

¹ByteDance, ²Zhejiang University, ³Institute of Automation, Chinese Academy of Sciences (\*Equal contribution)
[Code](https://github.com/bytedance/CrossCheck-Bench)

**AAAI 2026 (Oral)**
---
## 🚀 Introduction
🔥 We will open-source the full CrossCheck-Bench dataset, benchmark suite, and evaluation toolkit. Stay tuned!
Multimodal Large Language Models (MLLMs) demonstrate impressive reasoning and perception abilities. However, their **compositional robustness under conflicting multimodal signals** remains underexplored. Real-world scenarios frequently present contradictions between text and images, requiring models to identify the more reliable modality or explicitly resolve the inconsistency.
**CrossCheck-Bench** is introduced to systematically diagnose **compositional failures** in MLLMs under multimodal conflicts. The benchmark consists of:
- **Structured multimodal conflict categories**
- **Compositional reasoning tasks under contradictory cues**
- **Human-verified conflict annotations**
- **Robust evaluation protocol and metrics** (see the metric sketch below)
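
The evaluation toolkit is not yet released, so nothing below reflects the official protocol. Purely as an illustration of the kind of metric a conflict-resolution benchmark reports, this sketch computes overall and per-conflict-type accuracy; the field names (`conflict_type`, `label`, `prediction`) are assumptions, not the released schema.

```python
from collections import defaultdict

def per_conflict_accuracy(records):
    """Aggregate accuracy overall and broken down by conflict type.

    `records` is a list of dicts with hypothetical keys:
    conflict_type, label (ground truth), prediction (model output).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r["conflict_type"]] += 1
        hits[r["conflict_type"]] += int(r["prediction"] == r["label"])
    per_type = {t: hits[t] / totals[t] for t in totals}
    overall = sum(hits.values()) / max(sum(totals.values()), 1)
    return overall, per_type

# Toy usage with made-up records:
records = [
    {"conflict_type": "attribute", "label": "red", "prediction": "red"},
    {"conflict_type": "attribute", "label": "blue", "prediction": "red"},
    {"conflict_type": "spatial", "label": "left", "prediction": "left"},
]
overall, per_type = per_conflict_accuracy(records)
print(f"overall={overall:.2f}", per_type)
```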
Our experiments reveal significant failure modes across state-of-the-art MLLMs, including:
- Over-reliance on textual cues
- Incorrect visual grounding
- Multi-hop reasoning breakdowns
- Failure on conflict-sensitive attributes
CrossCheck-Bench provides the first comprehensive diagnostic tool for understanding these weaknesses.
---
## 📊 Benchmark Details
### 📝 Dataset Overview
CrossCheck-Bench includes **diverse multimodal conflict scenarios**, covering:
- Attribute conflicts
- Logical inconsistencies
- Text vs. image contradictions
- Spatial and relational conflicts
- Multi-entity compositional conflicts
- Instruction-override conflicts
Each sample contains the following fields (an illustrative record follows the list):
- A conflicting multimodal input (image + text)
- Metadata on the conflict type
- Ground-truth resolution label
- Reasoning trace (optional)
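
The official data format has not been published. Purely as an illustration of the four fields above, a record might look like the following; every key and value here is an assumption, not the released schema.

```python
# Hypothetical CrossCheck-Bench record; field names are assumptions,
# not the official schema (which has not been released yet).
sample = {
    "image": "images/000123.jpg",            # visual input
    "text": "The cat on the sofa is blue.",  # caption contradicting the image
    "conflict_type": "attribute",            # metadata on the conflict category
    "label": "image",                        # ground-truth resolution: trust the image
    "reasoning_trace": [                     # optional step-by-step rationale
        "The caption claims the cat is blue.",
        "The image shows an orange cat.",
        "The attributes conflict; the image is the reliable source.",
    ],
}
```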
---
## 🔧 Construction Pipeline
### ✨ Pipeline
The benchmark is constructed via a multi-stage pipeline (a skeletal sketch follows the list):
1. **Template-based conflict generation**
2. **LLM-assisted conflict mutation**
3. **Human verification**
4. **Consistency filtering**
5. **Compositional augmentation**
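
The construction code is unreleased; the skeleton below only shows how the five stages could compose. Every function is a hypothetical placeholder standing in for the paper's actual tooling.

```python
# Hypothetical placeholders for the five pipeline stages; none of these
# reflect the unreleased implementation, only how the stages chain together.

def inject_conflict_from_template(seed):
    # Stage 1: pair the seed caption with a contradicting template.
    return {**seed, "text": f"NOT({seed['text']})", "conflict_type": "attribute"}

def mutate_conflict_with_llm(sample):
    # Stage 2: in the real pipeline an LLM rewrites/diversifies the conflict;
    # here we pass the sample through unchanged.
    return [sample]

def human_verify(sample):
    # Stage 3: stands in for a human annotator's accept/reject decision.
    return True

def filter_inconsistent(samples):
    # Stage 4: drop duplicate or inconsistently annotated samples.
    return samples

def augment_compositionally(samples):
    # Stage 5: combine atomic conflicts into compositional ones.
    return samples

def build_benchmark(seed_samples):
    candidates = []
    for seed in seed_samples:
        candidates.extend(mutate_conflict_with_llm(inject_conflict_from_template(seed)))
    verified = [s for s in candidates if human_verify(s)]
    return augment_compositionally(filter_inconsistent(verified))

seeds = [{"image": "images/000123.jpg", "text": "the cat is blue"}]
print(build_benchmark(seeds))
```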
---
## 🛠️ Usage
🔥 Code is coming soon. Stay tuned!
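
Until the toolkit lands, a plain-Python loader is all that would be needed if the data ships as JSONL (an assumption; the release format and file name are unannounced):

```python
import json

# Hypothetical file name; the actual release format and path are unannounced.
with open("crosscheck_bench.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(f"loaded {len(records)} samples")
```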