# CrossCheck-Bench
**Repository Path**: ByteDance/CrossCheck-Bench
## Basic Information
- **Project Name**: CrossCheck-Bench
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-11-18
- **Last Updated**: 2026-01-12
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
**CrossCheck-Bench: Diagnosing Compositional Failures in Multimodal Conflict Resolution**

Baoliang Tian¹\*, Yuxuan Si¹,²\*, Jilong Wang¹,³\*, Lingyao Li¹, Zhongyuan Bao¹, Zineng Zhou¹, Tao Wang¹†, Sixu Li¹, Ziyao Xu¹, Mingze Wang¹, Zhouzhuo Zhang¹, Zhihao Wang¹, Yike Yun¹, Ke Tian¹, Ning Yang³†, Minghui Qiu¹

¹ByteDance, ²Zhejiang University, ³Institute of Automation, Chinese Academy of Sciences (\*Equal contribution)
[Code](https://github.com/bytedance/CrossCheck-Bench)

**AAAI 2026 (Oral)**
---
## 🚀 Introduction
🔥 We will open-source the full CrossCheck-Bench dataset, benchmark suite, and evaluation toolkit. Stay tuned!
Multimodal Large Language Models (MLLMs) demonstrate impressive reasoning and perception abilities. However, their **compositional robustness under conflicting multimodal signals** remains underexplored. Real-world scenarios frequently present contradictions between text and images, requiring models to identify the more reliable modality or explicitly resolve the inconsistency.
**CrossCheck-Bench** is introduced to systematically diagnose **compositional failures** in MLLMs under multimodal conflicts. The benchmark consists of:
- **Structured multimodal conflict categories**
- **Compositional reasoning tasks under contradictory cues**
- **Human-verified conflict annotations**
- **Robust evaluation protocol and metrics** (see the metric sketch below)
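
The evaluation toolkit is not yet released, so nothing below reflects the official protocol. Purely as an illustration of the kind of metric a conflict-resolution benchmark reports, this sketch computes overall and per-conflict-type accuracy; the field names (`conflict_type`, `label`, `prediction`) are assumptions, not the released schema.

```python
from collections import defaultdict

def per_conflict_accuracy(records):
    """Aggregate accuracy overall and broken down by conflict type.

    `records` is a list of dicts with hypothetical keys:
    conflict_type, label (ground truth), prediction (model output).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r["conflict_type"]] += 1
        hits[r["conflict_type"]] += int(r["prediction"] == r["label"])
    per_type = {t: hits[t] / totals[t] for t in totals}
    overall = sum(hits.values()) / max(sum(totals.values()), 1)
    return overall, per_type

# Toy usage with made-up records:
records = [
    {"conflict_type": "attribute", "label": "red", "prediction": "red"},
    {"conflict_type": "attribute", "label": "blue", "prediction": "red"},
    {"conflict_type": "spatial", "label": "left", "prediction": "left"},
]
overall, per_type = per_conflict_accuracy(records)
print(f"overall={overall:.2f}", per_type)
```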
Our experiments reveal significant failure modes across state-of-the-art MLLMs, including:
- Over-reliance on textual cues
- Incorrect visual grounding
- Multi-hop reasoning breakdowns
- Failure on conflict-sensitive attributes
CrossCheck-Bench provides the first comprehensive diagnostic tool for understanding these weaknesses.
---
## 📊 Benchmark Details
### 📝 Dataset Overview
CrossCheck-Bench includes **diverse multimodal conflict scenarios**, covering:
- Attribute conflicts
- Logical inconsistencies
- Text vs. image contradictions
- Spatial and relational conflicts
- Multi-entity compositional conflicts
- Instruction-override conflicts
Each sample contains the following fields (an illustrative record follows the list):
- A conflicting multimodal input (image + text)
- Metadata on the conflict type
- Ground-truth resolution label
- Reasoning trace (optional)
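
The official data format has not been published. Purely as an illustration of the four fields above, a record might look like the following; every key and value here is an assumption, not the released schema.

```python
# Hypothetical CrossCheck-Bench record; field names are assumptions,
# not the official schema (which has not been released yet).
sample = {
    "image": "images/000123.jpg",            # visual input
    "text": "The cat on the sofa is blue.",  # caption contradicting the image
    "conflict_type": "attribute",            # metadata on the conflict category
    "label": "image",                        # ground-truth resolution: trust the image
    "reasoning_trace": [                     # optional step-by-step rationale
        "The caption claims the cat is blue.",
        "The image shows an orange cat.",
        "The attributes conflict; the image is the reliable source.",
    ],
}
```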
---
## 🔧 Construction Pipeline
### ✨ Pipeline
The benchmark is constructed via a multi-stage pipeline (a skeletal sketch follows the list):
1. **Template-based conflict generation**
2. **LLM-assisted conflict mutation**
3. **Human verification**
4. **Consistency filtering**
5. **Compositional augmentation**
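
The construction code is unreleased; the skeleton below only shows how the five stages could compose. Every function is a hypothetical placeholder standing in for the paper's actual tooling.

```python
# Hypothetical placeholders for the five pipeline stages; none of these
# reflect the unreleased implementation, only how the stages chain together.

def inject_conflict_from_template(seed):
    # Stage 1: pair the seed caption with a contradicting template.
    return {**seed, "text": f"NOT({seed['text']})", "conflict_type": "attribute"}

def mutate_conflict_with_llm(sample):
    # Stage 2: in the real pipeline an LLM rewrites/diversifies the conflict;
    # here we pass the sample through unchanged.
    return [sample]

def human_verify(sample):
    # Stage 3: stands in for a human annotator's accept/reject decision.
    return True

def filter_inconsistent(samples):
    # Stage 4: drop duplicate or inconsistently annotated samples.
    return samples

def augment_compositionally(samples):
    # Stage 5: combine atomic conflicts into compositional ones.
    return samples

def build_benchmark(seed_samples):
    candidates = []
    for seed in seed_samples:
        candidates.extend(mutate_conflict_with_llm(inject_conflict_from_template(seed)))
    verified = [s for s in candidates if human_verify(s)]
    return augment_compositionally(filter_inconsistent(verified))

seeds = [{"image": "images/000123.jpg", "text": "the cat is blue"}]
print(build_benchmark(seeds))
```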
---
## 🛠️ Usage
🔥 Code is coming soon. Stay tuned!
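
Until the toolkit lands, a plain-Python loader is all that would be needed if the data ships as JSONL (an assumption; the release format and file name are unannounced):

```python
import json

# Hypothetical file name; the actual release format and path are unannounced.
with open("crosscheck_bench.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(f"loaded {len(records)} samples")
```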