REFINED-BIAS: On the Reliability of Cue Conflict and Beyond

Pum Jun Kim1,†, Seung-Ah Lee1,†, Seongho Park2, Dongyoon Han3,*, Jaejun Yoo1,*
1Ulsan National Institute of Science & Technology
2College of Medicine, Hanyang University
3NAVER AI Lab
† Equal contribution. * Corresponding authors.

The cue-conflict benchmark has become a de facto standard for evaluating which visual cues neural networks rely on in their decision-making. But does it truly reflect the underlying biases?

🤔 Where Does Cue-conflict Fall Short?

While cue-conflict offers a principled way to disentangle visual features, we argue that its current instantiation introduces artifacts and ambiguities that hinder meaningful interpretation:

Problem 1. Blurred Cue Distinction

The cues generated by stylization are inherently ambiguous, making them unclear not only to humans but also to models.

Problem 2. Unequally Mixed Cues

Stylization cannot control the relative contributions of shape and texture cues to form a balanced 50:50 mixture, which is necessary to measure true cue preference rather than cue dominance arising from unequal perceptibility.

Problem 3. Distorted Model Predictions

Restricting evaluation to a preselected class set prevents the benchmark from capturing true cue utilization: predictions that fall outside the evaluated classes are simply ignored, distorting the interpretation of model behavior.
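
As a concrete illustration (with an invented 5-class model and class subset, not the benchmark's actual protocol), the following sketch contrasts restricting the argmax to a preselected class set with reading the model's full prediction:

```python
import numpy as np

# Hypothetical 5-class logits and a preselected 3-class benchmark subset.
logits = np.array([1.0, 0.2, 3.5, 0.4, 0.1])
subset = [0, 1, 3]

restricted = subset[int(np.argmax(logits[subset]))]  # argmax over subset only
full = int(np.argmax(logits))                        # argmax over all classes
print(restricted, full)  # 0 2 -> the model's true top prediction is discarded
```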

Problem 4. Cross-Model Bias Misrepresentation

The cue-conflict metric fails to distinguish models that differ substantially in how strongly they rely on shape and texture cues, as it normalizes bias by relative proportions while ignoring large differences in overall cue sensitivity.
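
To see the failure concretely, recall that the conventional cue-conflict shape bias is the fraction of shape decisions among all shape-or-texture decisions. Consider two hypothetical models whose cue-conflict accuracies are (shape 60%, texture 30%) and (shape 20%, texture 10%):

$$\text{bias}_S^{(A)}=\frac{0.60}{0.60+0.30}\approx 0.67,\qquad \text{bias}_S^{(B)}=\frac{0.20}{0.20+0.10}\approx 0.67$$

Both models receive an identical shape-bias score even though Model A correctly resolves three times as many cue-conflict samples; the normalization discards this difference in overall cue sensitivity.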

Beyond Cue-conflict: The REFINED-BIAS Benchmark

Instead of only diagnosing the problems, we design both a dataset and metrics that make cue analysis faithful and comparable.

⚖️ [Dataset] Pure cues with balanced cue contributions

We define shape and texture based on human perception rather than model heuristics, and generate cues that faithfully capture these characteristics.

Shape

A non-repeating geometric structure that encompasses both the global outline of an object and its distinctive local substructures.

Texture

A visual pattern that repeats consistently within image patches of various sizes.

Data construction

📊 [Metric] True cue utilization for fair model comparison

We propose a novel metric that evaluates how prominently the correct shape and texture labels appear in the model’s full prediction ranking.

Redefined Bias (RB)

We compute the mean reciprocal ranks of the ground-truth shape and texture labels and denote them \( \mathrm{RB}_S \) and \( \mathrm{RB}_T \), respectively. Unlike the conventional mean reciprocal rank (MRR), the ranking is computed directly from the classification logits.

$$\mathrm{RB}_S=\frac{1}{N}\sum^{N}_{i=1}\frac{1}{r_{\text{shape},i}},\quad \mathrm{RB}_T=\frac{1}{N}\sum^{N}_{i=1}\frac{1}{r_{\text{texture},i}}$$

Here, \( N \) denotes the total number of samples, and \( r_{\text{shape}, i} \) and \( r_{\text{texture}, i} \) represent the ranks of the correct shape and texture labels for the \( i \)-th sample in the model’s prediction ranking, respectively.
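
Because the ranking comes directly from the logits, \( \mathrm{RB}_S \) and \( \mathrm{RB}_T \) are straightforward to compute; below is a minimal NumPy sketch (function and variable names are ours, not an official implementation):

```python
import numpy as np

def refined_bias(logits, shape_labels, texture_labels):
    """Mean reciprocal rank of the ground-truth shape and texture labels
    in each sample's full prediction ranking (RB_S, RB_T)."""
    n, c = logits.shape
    # Rank classes per sample by descending logit; rank 1 = top prediction.
    order = np.argsort(-logits, axis=1)
    ranks = np.empty_like(order)
    ranks[np.arange(n)[:, None], order] = np.arange(1, c + 1)
    rb_s = np.mean(1.0 / ranks[np.arange(n), shape_labels])
    rb_t = np.mean(1.0 / ranks[np.arange(n), texture_labels])
    return rb_s, rb_t

# Toy usage: 2 samples over 4 classes.
logits = np.array([[2.0, 0.5, 1.0, -1.0],
                   [0.1, 3.0, 0.2, 0.0]])
print(refined_bias(logits, shape_labels=np.array([0, 1]),
                   texture_labels=np.array([2, 0])))  # (1.0, ~0.417)
```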

📃 Highlight Results

We evaluate whether the outcomes are consistent with our intuition and remain plausible given existing knowledge. To this end, we first evaluate the dataset itself against the cue-conflict benchmark, using diverse pre-trained models covering an extensive range of training strategies. We then focus on assessing the correctness of the revised metric.

🔍 Learning Strategies and Hypothesis

To ensure that shape–texture bias can be reliably assessed, we evaluate both our dataset and metric using models with a fixed ResNet-50 architecture trained under diverse strategies. These strategies naturally encourage different levels of shape or texture reliance, allowing us to test whether our benchmark can consistently detect such variations.

Shape Augmentation: Models are trained on conflicting-cue images in which texture labels are incorrect but shape labels remain correct (a minimal training sketch follows this list).
Contrastive Learning: Models align representations across texture variations, promoting reliance on shape cues.
Texture Distortion: Introduced noise disrupts texture information while preserving semantic structure.
Mixed Augmentation: Mixing or masking encourages models to use both shape and texture cues, with a slight bias toward shape.
Adversarial Training: Models are optimized to resist imperceptible noise without explicitly modifying shape or texture cues.
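
As an example of how one such strategy translates into training code, here is a minimal PyTorch sketch of the shape-augmentation idea; the pipeline producing stylized cue-conflict images is assumed, and the helper name and dummy batch below are ours, standing in for it:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

def shape_augmentation_step(model, optimizer, stylized_images, shape_labels):
    """One step of shape-augmentation training: the cue-conflict image is
    supervised with its shape (content) label, so texture cues are misleading
    and the model is pushed toward shape reliance."""
    criterion = nn.CrossEntropyLoss()
    optimizer.zero_grad()
    loss = criterion(model(stylized_images), shape_labels)
    loss.backward()
    optimizer.step()
    return loss.item()

model = resnet50(num_classes=1000)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
images = torch.randn(8, 3, 224, 224)   # stand-in for stylized images
labels = torch.randint(0, 1000, (8,))  # their shape labels
print(shape_augmentation_step(model, optimizer, images, labels))
```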

🖇️ References

  1. Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., & Brendel, W. (2019). ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations.
  2. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).
  3. Li, Y., Yu, Q., Tan, M., Mei, J., Tang, P., Shen, W., ... & Xie, C. (2020). Shape-texture debiased neural network training. arXiv preprint arXiv:2010.05981.
  4. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 9650-9660).
  5. Chen, T., Kornblith, S., Swersky, K., Norouzi, M., & Hinton, G. E. (2020). Big self-supervised models are strong semi-supervised learners. Advances in Neural Information Processing Systems, 33, 22243-22255.
  6. Chen, X., Xie, S., & He, K. (2021). An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 9640-9649).
  7. Hendrycks, D., Mu, N., Cubuk, E. D., Zoph, B., Gilmer, J., & Lakshminarayanan, B. (2019). AugMix: A simple data processing method to improve robustness and uncertainty. arXiv preprint arXiv:1912.02781.
  8. Hendrycks, D., Basart, S., Mu, N., Kadavath, S., Wang, F., Dorundo, E., ... & Gilmer, J. (2021). The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 8340-8349).
  9. Hendrycks, D., Zou, A., Mazeika, M., Tang, L., Li, B., Song, D., & Steinhardt, J. (2022). PixMix: Dreamlike pictures comprehensively improve safety measures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16783-16792).
  10. Modas, A., Rade, R., Ortiz-Jiménez, G., Moosavi-Dezfooli, S. M., & Frossard, P. (2022). PRIME: A few primitives can boost robustness to common corruptions. In European Conference on Computer Vision (pp. 623-640). Cham: Springer Nature Switzerland.
  11. Müller, P., Braun, A., & Keuper, M. (2023). Classification robustness to common optical aberrations. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3632-3643).
  12. Wightman, R., Touvron, H., & Jégou, H. (2021). ResNet strikes back: An improved training procedure in timm. arXiv preprint arXiv:2110.00476.
  13. Salman, H., Ilyas, A., Engstrom, L., Kapoor, A., & Madry, A. (2020). Do adversarially robust ImageNet models transfer better? Advances in Neural Information Processing Systems, 33, 3533-3545.