Annotated Bibliography and Resources

This list is incomplete; we are still adding to it.

In this section, we provide pointers to papers relevant to the content of the tutorial. Due to time and space limitations, we were not able to discuss every relevant paper in the tutorial, so here we provide an expanded overview of the literature. We are happy to update this page with pointers to material we may have missed. We proceed according to the agenda discussed in the tutorial.

Introduction and Motivation

  • Towards a rigorous science of interpretable machine learning. Doshi-Velez and Kim 2017 link
  • The Mythos of Model Interpretability. Lipton 2017 link
  • Transparency: Motivations and Challenges. Weller 2017 link
  • Explaining Explanations: An Overview of Interpretability of Machine Learning. Gilpin et al. 2019 link
  • Interpretable machine learning: definitions, methods, and applications. Murdoch et al. 2019 link
  • Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. Rudin 2019 link

Approaches for Post hoc Explainability

In this section, we provide pointers to the approaches discussed in the tutorial.

Local Explanations

Feature Importance Approaches

  • LIME: “Why Should I Trust You?” Explaining the Predictions of Any Classifier. Ribeiro et al. 2016 link
  • SHAP: A Unified Approach to Interpreting Model Predictions. Lundberg and Lee, 2017 link
  • ANCHORS: Anchors: High-Precision Model-Agnostic Explanations. Ribeiro et al. 2018 link
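
To make the local-surrogate idea behind these methods concrete, here is a rough NumPy sketch in the spirit of LIME (not the authors' implementation): perturb the instance, query the black box, and fit a proximity-weighted linear model whose coefficients act as feature importances. The Gaussian perturbations, exponential kernel, function names, and toy black box are all illustrative assumptions.

```python
import numpy as np

def lime_style_explain(model, x, num_samples=500, sigma=0.5, seed=0):
    """Fit a proximity-weighted linear surrogate around x (LIME-style sketch)."""
    rng = np.random.default_rng(seed)
    # 1. Perturb the instance and query the black box.
    Z = x + rng.normal(scale=sigma, size=(num_samples, x.shape[0]))
    y = np.array([model(z) for z in Z])
    # 2. Weight samples by proximity to x (exponential kernel).
    weights = np.exp(-np.sum((Z - x) ** 2, axis=1) / (2.0 * sigma ** 2))
    # 3. Weighted least squares for the surrogate's coefficients.
    A = np.hstack([Z, np.ones((num_samples, 1))])    # intercept column
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[:-1]    # per-feature importances (intercept dropped)

# Hypothetical black box: feature 0 dominates its output.
black_box = lambda z: 3.0 * z[0] + 0.1 * z[1]
phi = lime_style_explain(black_box, np.array([1.0, 1.0]))
```

Because the toy black box is itself linear, the surrogate recovers its coefficients; on a real model the coefficients describe only the local behavior around x.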

Feature Importance Approaches: Saliency Maps / Feature Attributions

  • Input-Gradient: How to Explain Individual Classification Decisions. Baehrens et al. 2009 link and Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. Simonyan et al. 2014 link
  • SmoothGrad: SmoothGrad: removing noise by adding noise. Smilkov et al. 2017 link
  • Integrated Gradients: Axiomatic Attribution for Deep Networks. Sundararajan et al. 2017 link
  • Gradient-Input: Not Just a Black Box: Learning Important Features Through Propagating Activation Differences. Shrikumar et al. 2016 link and Towards better understanding of gradient-based attribution methods for Deep Neural Networks. Ancona et al. 2018 link
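
The gradient-based attributions above can be illustrated on a toy differentiable model. The sketch below is illustrative only (the sigmoid model, weights, noise scale, and step counts are assumptions, not any paper's implementation): it computes the vanilla input-gradient, gradient-times-input, SmoothGrad as an average of gradients at noisy copies of the input, and a Riemann approximation of Integrated Gradients with a zero baseline.

```python
import numpy as np

# Toy differentiable model: f(x) = sigmoid(w . x); weights are illustrative.
w = np.array([2.0, -1.0, 0.5])
f = lambda x: 1.0 / (1.0 + np.exp(-(w @ x)))
grad_f = lambda x: f(x) * (1.0 - f(x)) * w   # analytic gradient of sigmoid(w.x)

x = np.array([1.0, 1.0, 1.0])

# Input-Gradient (vanilla saliency): sensitivity of the output to each input.
saliency = grad_f(x)

# Gradient * Input: elementwise product, attributing the score to features.
grad_times_input = grad_f(x) * x

# SmoothGrad: average gradients over Gaussian-perturbed copies of the input.
rng = np.random.default_rng(0)
smooth = np.mean([grad_f(x + rng.normal(scale=0.1, size=x.shape))
                  for _ in range(50)], axis=0)

# Integrated Gradients (zero baseline): (x - 0) times the average gradient
# along the straight path from the baseline to x (Riemann approximation).
alphas = np.linspace(0.0, 1.0, 51)
integrated = x * np.mean([grad_f(a * x) for a in alphas], axis=0)
```

A useful sanity check is the completeness axiom of Integrated Gradients: the attributions sum (approximately, given the Riemann discretization) to f(x) minus the model's output at the baseline.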

Saliency map techniques that derive relevance via a modified back-propagation process:

  • Guided BackProp: Striving for Simplicity: The All Convolutional Net. Springenberg et al. 2015 link
  • DeConvNet: Visualizing and Understanding Convolutional Networks. Zeiler and Fergus 2014 link
  • Layer-wise Relevance Propagation (LRP) family: see link for a discussion of this family of methods. link

For an overview of additional saliency methods we refer to:

  • Overview Article: Toward Interpretable Machine Learning: Transparent Deep Neural Networks and Beyond. Samek and Montavon 2020 link

Prototype / Example Based Approaches

  • Influence Functions: Understanding Black-box Predictions via Influence Functions. Koh and Liang 2017 link
  • Influence Functions and Non-convex Models: Influence Functions in Deep Learning Are Fragile. Basu et al. 2020 link
  • Representer Points: Representer Point Selection for Explaining Deep Neural Networks. Yeh et al. 2018 link
  • TracIn: Estimating Training Data Influence by Tracing Gradient Descent. Pruthi et al. 2020 link
  • Activation Maximization: Visualizing Higher-Layer Features of a Deep Network. Erhan et al. 2009 link
  • Feature Visualization: Feature Visualization. Olah et al. 2017 link
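
Activation maximization, as referenced above, can be sketched on a toy linear "unit": gradient ascent on the input to maximize the unit's activation, under a norm constraint so the input stays bounded. The unit, step size, and constraint below are illustrative assumptions; real feature-visualization work (e.g., Olah et al.) uses trained networks plus much richer regularizers and image priors.

```python
import numpy as np

# Toy "neuron": activation a(x) = w . x, with an assumed weight vector.
w = np.array([0.5, -2.0, 1.0])

def activation_maximization(steps=200, lr=0.1, seed=0):
    """Gradient ascent on the input to maximize the unit's activation,
    re-projecting onto the unit sphere so the input stays bounded."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=w.shape)         # random starting input
    for _ in range(steps):
        x = x + lr * w                   # gradient of w . x w.r.t. x is w
        x = x / np.linalg.norm(x)        # norm constraint (crude regularizer)
    return x

x_star = activation_maximization()       # aligns with w / ||w||
```

For this linear unit the optimum is simply the input aligned with the weight vector; the interesting behavior in the papers above comes from applying the same recipe to hidden units of deep networks.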


Counterfactual Explanations

  • Minimum Distance Counterfactuals: Counterfactual Explanations without Opening the Black Box: Automated Decisions and the GDPR. Wachter et al. 2017 link
  • Feasible and Least Cost Counterfactuals: Actionable Recourse in Linear Classification. Ustun et al. 2019 link
  • Causally Feasible Counterfactuals: Preserving Causal Constraints in Counterfactual Explanations for Machine Learning Classifiers. Mahajan et al. 2020 link and Model-Agnostic Counterfactual Explanations for Consequential Decisions. Karimi et al. 2020 link
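
As a rough illustration of minimum-distance counterfactual search in the spirit of Wachter et al. (not their implementation), the sketch below runs gradient descent on a squared prediction loss plus a distance penalty, for a toy logistic model with an analytic gradient. The weights, target score, and hyperparameters are assumptions; feasibility and causal constraints from the other papers above are omitted.

```python
import numpy as np

# Toy classifier p(y=1|x) = sigmoid(w . x + b); weights are illustrative.
w, b = np.array([1.5, -2.0]), 0.0
predict = lambda x: 1.0 / (1.0 + np.exp(-(w @ x + b)))

def find_counterfactual(x0, target=0.5, lam=0.1, lr=0.1, steps=1000):
    """Gradient descent on (f(x') - target)^2 + lam * ||x' - x0||^2."""
    x = x0.copy()
    for _ in range(steps):
        p = predict(x)
        # chain rule: d/dx sigmoid(w.x + b) = p * (1 - p) * w
        grad = 2.0 * (p - target) * p * (1.0 - p) * w + 2.0 * lam * (x - x0)
        x -= lr * grad
    return x

x0 = np.array([-1.0, 1.0])       # predicted ~0.03: firmly on the negative side
x_cf = find_counterfactual(x0)   # a nearby input pushed toward the boundary
```

The distance penalty `lam` trades off how close the counterfactual stays to the original input against how close its score gets to the decision boundary; the feasibility-oriented papers above replace this plain Euclidean penalty with actionability or causal constraints.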