Annotated Bibliography and Resources

This list is incomplete; we are still adding to it.

In this section, we provide pointers to papers relevant to the content of the tutorial. Due to time and space limitations, we were not able to discuss all relevant papers in the tutorial, so here we provide an expanded overview of the relevant literature. We are happy to update this material with pointers to work we may have missed. We proceed according to the agenda discussed in the tutorial.

Introduction

  • Towards a rigorous science of interpretable machine learning. Doshi-Velez and Kim 2017 link
  • The Mythos of Model Interpretability. Lipton 2017 link
  • Transparency: Motivations and Challenges. Weller 2017 link
  • Explaining Explanations: An Overview of Interpretability of Machine Learning. Gilpin et al. 2019 link
  • Interpretable machine learning: definitions, methods, and applications. Murdoch et al. 2019 link
  • Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. Rudin 2019 link

Approaches for Post hoc Explainability

In this section, we provide pointers to the approaches discussed in the tutorial.

Local Explanations

Feature Importance Approaches

  • LIME: “Why Should I Trust You?” Explaining the Predictions of Any Classifier. Ribeiro et al. 2016 link (a minimal local-surrogate sketch follows this list)
  • SHAP: A Unified Approach to Interpreting Model Predictions. Lundberg and Lee 2017 link
  • ANCHORS: Anchors: High-Precision Model-Agnostic Explanations. Ribeiro et al. 2018 link
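
As a rough illustration of the perturbation-based family above, the sketch below fits a LIME-style local linear surrogate for a tabular input. The function name, the Gaussian neighborhood sampling, the exponential proximity kernel, and the ridge surrogate are illustrative assumptions; the actual LIME procedure additionally uses an interpretable representation and feature selection.

    import numpy as np
    from sklearn.linear_model import Ridge

    def lime_style_explanation(predict_proba, x, class_idx, n_samples=1000, kernel_width=1.0, seed=0):
        """Fit a locally weighted linear surrogate to a black-box classifier around x.

        predict_proba: callable mapping an (n, d) array to (n, n_classes) probabilities.
        x: 1-D array of shape (d,). Returns one local importance weight per feature.
        """
        rng = np.random.default_rng(seed)
        # Sample the neighborhood of x with Gaussian perturbations.
        X = x + rng.normal(scale=1.0, size=(n_samples, x.shape[0]))
        # Query the black box on the perturbed points.
        y = predict_proba(X)[:, class_idx]
        # Weight neighbors by proximity to x (exponential kernel on Euclidean distance).
        weights = np.exp(-np.linalg.norm(X - x, axis=1) ** 2 / kernel_width ** 2)
        # The surrogate's coefficients act as local feature importances.
        surrogate = Ridge(alpha=1.0).fit(X, y, sample_weight=weights)
        return surrogate.coef_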

Feature Importance Approaches: Saliency Maps / Feature Attributions

  • Input-Gradient: How to Explain Individual Classification Decisions. Baehrens et al. 2009 link and Deep Inside Convolutional Networks: Visualizing Image Classification Models and Saliency Maps. Simonyan et al. 2014 link.
  • SmoothGrad: SmoothGrad: removing noise by adding noise. Smilkov et al. 2017 link
  • Integrated Gradients: Axiomatic Attribution for Deep Networks. Sundararajan et al. 2017 link
  • Gradient-Input: Not Just a Black Box: Learning Important Features Through Propagating Activation Differences. Shrikumar et al. 2016 link and Towards better understanding of gradient-based attribution methods for Deep Neural Networks. Ancona et al. 2018 link (a minimal sketch of these gradient-based attributions follows this list)
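
The PyTorch sketch below illustrates the gradient-based attributions referenced above for a single input and target class. The function names, the zero baseline, the noise scale, and the sample/step counts are illustrative defaults, not values prescribed by the papers.

    import torch

    def input_gradient(model, x, class_idx):
        """Vanilla saliency: gradient of the class score with respect to the input."""
        x = x.clone().detach().requires_grad_(True)
        model(x.unsqueeze(0))[0, class_idx].backward()
        return x.grad

    def gradient_times_input(model, x, class_idx):
        """Gradient * Input: elementwise product of the input with its vanilla gradient."""
        return x * input_gradient(model, x, class_idx)

    def smoothgrad(model, x, class_idx, n_samples=50, sigma=0.1):
        """SmoothGrad: average input-gradients over Gaussian-perturbed copies of x."""
        grads = [input_gradient(model, x + sigma * torch.randn_like(x), class_idx)
                 for _ in range(n_samples)]
        return torch.stack(grads).mean(dim=0)

    def integrated_gradients(model, x, class_idx, baseline=None, steps=50):
        """Integrated gradients: average gradients along the straight-line path from a
        baseline to x (Riemann approximation), scaled elementwise by (x - baseline)."""
        baseline = torch.zeros_like(x) if baseline is None else baseline
        grads = torch.stack([
            input_gradient(model, baseline + k / steps * (x - baseline), class_idx)
            for k in range(1, steps + 1)])
        return (x - baseline) * grads.mean(dim=0)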

Saliency map techniques that derive relevance via a modified back-propagation process.

  • Guided BackProp: Striving for simplicity: The all convolutional net. Springenberg et al. 2015 link (a guided-backprop sketch follows this list)
  • DeConvNet: Visualizing and understanding convolutional networks. Zeiler and Fergus 2014 link
  • Layer-wise Relevance Propagation (LRP) family: see the link for a discussion of this family of methods. link
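
A minimal sketch of the modified-backpropagation idea, using guided backprop as the example: backward hooks clamp the gradient at every ReLU so that only positive gradient signal reaches the input. It assumes the model uses standard (non in-place) nn.ReLU modules and is meant to convey the rule rather than reproduce the original implementation.

    import torch
    import torch.nn as nn

    def guided_backprop(model, x, class_idx):
        """Guided backprop: the standard ReLU backward already zeroes positions whose
        forward input was negative; clamping the hooked gradient additionally discards
        negative gradient signal coming from higher layers."""
        handles = [m.register_full_backward_hook(
                       lambda mod, grad_in, grad_out: (grad_in[0].clamp(min=0),))
                   for m in model.modules() if isinstance(m, nn.ReLU)]
        x = x.clone().detach().requires_grad_(True)
        model.eval()
        model(x.unsqueeze(0))[0, class_idx].backward()
        for handle in handles:
            handle.remove()  # restore the model's normal backward behavior
        return x.grad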

For an overview of additional saliency methods, we refer to:

  • Overview Article: Toward Interpretable Machine Learning: Transparent Deep Neural Networks and Beyond. Samek et al. 2020 link

Prototype / Example Based Approaches

  • Influence Functions: Understanding Black-box Predictions via Influence Functions. Koh and Liang 2017 link
  • Influence Functions and Non-convex Models: Influence Functions in Deep Learning are Fragile. Basu et al. 2020 link
  • Representer Points: Representer Point Selection for Explaining Deep Neural Networks. Yeh et al. 2018 link
  • TracIn: Estimating Training Data Influence by Tracing Gradient Descent. Pruthi et al. 2020 link (a TracIn-style influence sketch follows this list)
  • Activation Maximization: Visualizing Higher-Layer Features of a Deep Network. Erhan et al. 2009 link
  • Feature Visualization: Feature Visualization. Olah et al. 2017 link
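
As an illustration of the influence-style methods in this list, the sketch below computes a TracIn-style score for a single training/test pair: a sum over saved checkpoints of learning-rate-weighted dot products between per-example loss gradients. The checkpoint and learning-rate bookkeeping shown here is a simplifying assumption; the paper also discusses practical approximations (e.g., restricting to the last layer).

    import torch

    def tracin_influence(model, checkpoints, learning_rates, loss_fn,
                         x_train, y_train, x_test, y_test):
        """TracIn-style influence of one training example on one test example:
        sum over checkpoints of lr * <grad loss(train), grad loss(test)>.
        x_*: single input tensors (no batch dim); y_*: 0-dim label tensors."""

        def loss_grad(x, y):
            loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
            return torch.autograd.grad(loss, [p for p in model.parameters() if p.requires_grad])

        score = 0.0
        for state_dict, lr in zip(checkpoints, learning_rates):
            model.load_state_dict(state_dict)  # replay an intermediate training checkpoint
            g_train = loss_grad(x_train, y_train)
            g_test = loss_grad(x_test, y_test)
            score += lr * sum((gt * ge).sum() for gt, ge in zip(g_train, g_test)).item()
        return score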

Counterfactuals

  • Minimum Distance Counterfactuals: Counterfactual Explanations without Opening the Black Box: Automated Decisions and the GDPR. Wachter et al. 2017 link (a minimal counterfactual-search sketch follows this list)
  • Feasible and Least Cost Counterfactuals: Actionable Recourse in Linear Classification. Ustun et al. 2019 link
  • Causally Feasible Counterfactuals: Preserving Causal Constraints in Counterfactual Explanations for Machine Learning Classifiers. Mahajan et al. 2020 link and Model-Agnostic Counterfactual Explanations for Consequential Decisions. Karimi et al. 2020 link
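
A minimal sketch of the minimum-distance counterfactual idea in the spirit of Wachter et al.: gradient descent on the input to flip a differentiable classifier's prediction while staying close to the original point. The cross-entropy prediction term and the plain L1 distance are simplifications of the paper's squared prediction loss and MAD-weighted distance, and feasibility or causal constraints (as in the later papers above) are omitted.

    import torch

    def minimum_distance_counterfactual(model, x, target_class, lam=0.1, steps=500, lr=0.01):
        """Search for a counterfactual x_cf near x that the classifier assigns to
        target_class by minimizing
            cross_entropy(model(x_cf), target_class) + lam * ||x_cf - x||_1."""
        x_cf = x.clone().detach().requires_grad_(True)
        target = torch.tensor([target_class])
        optimizer = torch.optim.Adam([x_cf], lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()
        for _ in range(steps):
            optimizer.zero_grad()
            prediction_loss = loss_fn(model(x_cf.unsqueeze(0)), target)
            distance = (x_cf - x).abs().sum()  # L1 proximity to the original input
            (prediction_loss + lam * distance).backward()
            optimizer.step()
        return x_cf.detach()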