The cue-conflict benchmark has become a de facto standard for evaluating the visual cues on which neural networks rely. However, its current stylization-based instantiation often produces ambiguous shape and texture signals, leading to empirically unstable evaluations. This raises a fundamental question: “Does the benchmark reflect genuine perceptual bias, or is it confounded by cue-construction artifacts?”
While cue-conflict offers a principled way to disentangle visual features, we argue that its current instantiation introduces artifacts and ambiguities that hinder meaningful interpretation:
Stylization does not reliably produce cleanly separated cues: the generated images exhibit imperfect disentanglement and spurious correlations between texture and shape signals.
Stylization offers no explicit control over the relative contribution of shape and texture, leading to a cue imbalance that prevents fair preference measurement.
Relative preference cannot serve as a standalone measure, as it conflates directional preference with absolute cue sensitivity.
Evaluation in a restricted label space can distort the model's prediction, overestimating cue usage and obscuring its true perceptual behavior.
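The restricted-label-space issue can be illustrated with a toy example (the numbers below are hypothetical, chosen only for illustration): forcing the argmax over a small shape/texture label subset yields a confident-looking cue "preference" even when the model's actual top prediction lies outside that subset.

```python
import numpy as np

# Toy logits over a full label space of 10 classes (hypothetical values).
logits = np.array([3.0, 1.0, 0.5, 2.9, 0.2, 0.1, 0.0, 2.5, 0.3, 0.4])

# Suppose the cue-conflict protocol only admits the shape label (3) and
# the texture label (7) as possible answers.
restricted = [3, 7]

full_pred = int(np.argmax(logits))                          # unrestricted prediction
restricted_pred = restricted[int(np.argmax(logits[restricted]))]

# full_pred is class 0, which is neither cue label: the restricted argmax
# still reports a "shape preference" (class 3) that the model never chose.
```

The same mechanism applies at benchmark scale: every sample is counted toward shape or texture, even when neither cue dominates the model's full ranking.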
Beyond Cue-conflict: the REFINED-BIAS Benchmark
Instead of only diagnosing the problem, we design both a dataset and metrics that make cue analysis faithful and comparable.
We define shape and texture based on human perception rather than model heuristics, and generate cues that faithfully capture these characteristics.
Shape: a non-repeating geometric structure that encompasses both the global outline of an object and its distinctive local substructures.
Texture: a visual pattern that repeats consistently within patches of various sizes.
We propose a novel metric that evaluates how prominently the correct shape and texture labels appear in the model’s full prediction ranking.
We compute the reciprocal ranks of the ground-truth shape and texture labels and average them over the dataset:
\[ \mathrm{Shape\text{-}Sens} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{r_{\text{shape},i}}, \qquad \mathrm{Texture\text{-}Sens} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{r_{\text{texture},i}}. \]
Unlike conventional MRR, the ranking is computed directly from the classification logits. Based on these sensitivities, we then define the relative bias for shape and texture.
Here, \( N \) denotes the total number of samples, and \( r_{\text{shape}, i} \) and \( r_{\text{texture}, i} \) represent the ranks of the correct shape and texture labels for the \( i \)-th sample in the model’s prediction ranking, respectively.
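The computation above can be sketched in a few lines of NumPy. The function name and the final normalization into a relative bias are our own illustrative choices, not an official implementation:

```python
import numpy as np

def cue_sensitivities(logits, shape_labels, texture_labels):
    """Mean reciprocal rank of the ground-truth shape and texture labels,
    ranked directly by classification logits (a sketch of Shape-Sens and
    Texture-Sens as described in the text)."""
    logits = np.asarray(logits)                 # shape (N, C)
    n, c = logits.shape
    order = np.argsort(-logits, axis=1)         # classes in descending-logit order
    # ranks[i, k] = 1-based rank of class k in sample i's prediction ranking
    ranks = np.empty_like(order)
    ranks[np.arange(n)[:, None], order] = np.arange(1, c + 1)
    shape_sens = np.mean(1.0 / ranks[np.arange(n), shape_labels])
    texture_sens = np.mean(1.0 / ranks[np.arange(n), texture_labels])
    return shape_sens, texture_sens

# One natural way to turn the sensitivities into a relative bias
# (our assumption; the paper's exact normalization may differ):
def shape_bias(shape_sens, texture_sens):
    return shape_sens / (shape_sens + texture_sens)
```

Because the ranks come from the full logit vector, a model whose top prediction is neither cue label is penalized on both sensitivities instead of being forced into a spurious preference.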
We assess whether the outcomes agree with intuition and remain plausible given existing knowledge. To this end, we first evaluate the dataset itself, comparing against the cue-conflict benchmark across diverse pre-trained models and training strategies. We then turn to assessing the correctness of the revised metric.
To ensure that shape–texture bias can be reliably assessed, we evaluate both our dataset and metric using models with a fixed ResNet-50 architecture trained under diverse strategies. These strategies naturally encourage different levels of shape or texture reliance, allowing us to test whether our benchmark can consistently detect such variations.
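A minimal sanity check of this protocol can be simulated with synthetic "models" whose shape reliance we control directly (everything below is a toy construction, not the actual ResNet-50 evaluation): a benchmark that detects cue reliance should score the more shape-reliant model higher.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_reciprocal_rank(logits, labels):
    # 1-based rank of each sample's ground-truth label, by descending logit
    order = np.argsort(-logits, axis=1)
    ranks = (order == np.asarray(labels)[:, None]).argmax(axis=1) + 1
    return float(np.mean(1.0 / ranks))

def dummy_model(shape_strength, shape_labels, n_classes=16):
    """Toy stand-in for a checkpoint: random logits plus a controllable
    boost on the true shape label (hypothetical, for illustration only)."""
    n = len(shape_labels)
    logits = rng.normal(size=(n, n_classes))
    logits[np.arange(n), shape_labels] += shape_strength
    return logits

shape_labels = rng.integers(0, 16, size=500)
weak = mean_reciprocal_rank(dummy_model(0.5, shape_labels), shape_labels)
strong = mean_reciprocal_rank(dummy_model(3.0, shape_labels), shape_labels)
# A sound benchmark should rank the strongly shape-reliant model above the weak one.
```

In the real evaluation, the dummy models would be replaced by ResNet-50 checkpoints trained under the different strategies, with logits computed on the benchmark images.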