CLEVR-XAI: A benchmark dataset for the ground truth evaluation of neural network explanations

The work "CLEVR-XAI: A benchmark dataset for the ground truth evaluation of neural network explanations", supported by the iToBoS project, has been published.

Technical information

  • Title: CLEVR-XAI: A benchmark dataset for the ground truth evaluation of neural network explanations
  • Authors: Leila Arras [1], Ahmed Osman [1], Wojciech Samek.
  • [1] Department of Artificial Intelligence, Fraunhofer Heinrich Hertz Institute, 10587 Berlin, Germany
  • Timeline: received 31 January 2021; revised 23 September 2021; accepted 6 November 2021; available online 14 November 2021; version of record 6 December 2021.

Abstract

The rise of deep learning in today’s applications has entailed an increasing need to explain a model’s decisions beyond prediction performance, in order to foster trust and accountability. Recently, the field of explainable AI (XAI) has developed methods that provide such explanations for already trained neural networks. In computer vision tasks, such explanations, termed heatmaps, visualize the contributions of individual pixels to the prediction. So far, XAI methods and their heatmaps have mainly been validated qualitatively via human-based assessment, or evaluated through auxiliary proxy tasks such as pixel perturbation, weak object localization, or randomization tests. Due to the lack of an objective and commonly accepted quality measure for heatmaps, it has been debatable which XAI method performs best and whether explanations can be trusted at all. In the present work, we tackle this problem by proposing a ground-truth-based evaluation framework for XAI methods, built on the CLEVR visual question answering task. Our framework provides a (1) selective, (2) controlled and (3) realistic testbed for the evaluation of neural network explanations. We compare ten different explanation methods, resulting in new insights about the quality and properties of XAI methods, sometimes contradicting conclusions from previous comparative studies. The CLEVR-XAI dataset and the benchmarking code can be found at https://github.com/ahmedmagdiosman/clevr-xai.
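
To make the idea of ground-truth evaluation concrete, below is a minimal Python sketch of two metrics of this kind: relevance mass accuracy (the share of positive relevance falling inside the ground-truth mask) and relevance rank accuracy (the fraction of the K highest-relevance pixels lying inside the mask, with K equal to the mask size). The function names and array conventions here are illustrative assumptions, not the CLEVR-XAI benchmarking code itself.

```python
import numpy as np

def relevance_mass_accuracy(heatmap: np.ndarray, gt_mask: np.ndarray) -> float:
    """Share of positive relevance that falls inside the ground-truth mask.

    heatmap -- 2D array of per-pixel relevance scores from an XAI method
    gt_mask -- 2D boolean array marking the pixels relevant to the question
    Returns a value in [0, 1]; higher means a better-focused explanation.
    """
    pos = np.clip(heatmap, 0.0, None)        # keep positive evidence only
    total = pos.sum()
    if total == 0.0:                         # degenerate all-zero heatmap
        return 0.0
    return float(pos[gt_mask].sum() / total)

def relevance_rank_accuracy(heatmap: np.ndarray, gt_mask: np.ndarray) -> float:
    """Fraction of the K highest-relevance pixels inside the mask (K = mask size)."""
    k = int(gt_mask.sum())
    if k == 0:
        return 0.0
    top_k = np.argsort(heatmap, axis=None)[-k:]  # flat indices of the K largest values
    return float(gt_mask.ravel()[top_k].sum() / k)
```

Averaging such scores over all questions in the benchmark yields a single number per XAI method, which is what makes a quantitative comparison of heatmaps possible.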

Acknowledgments

This work was supported by the German Federal Ministry of Education and Research (BMBF) under grants 01IS18025A, 01IS18037I, 031L0207C and 01IS21069B. Furthermore, this work has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 965221.

Find out more at ScienceDirect.