Making interpretations useful (CDEP) 🔨

Regularizes interpretations (computed via contextual decomposition) to improve neural networks. Official code for Interpretations are useful: penalizing explanations to align neural networks with prior knowledges (ICML 2020 pdf).

Note: this repo is actively maintained. For any questions please file an issue.



  • fully-contained data/models/code for reproducing and experimenting with CDEP
  • the src folder contains the core code for running and penalizing contextual decomposition
  • in addition, we run experiments on 4 datasets, each of which are located in their own folders
    • notebooks in these folders show demos for different kinds of text


ISIC skin-cancer classification - using CDEP, we can learn to avoid spurious patches present in the training set, improving test performance!

The segmentation maps of the patches can be downloaded here

ColorMNIST - penalizing the contributions of individual pixels allows us to teach a network to learn a digit's shape instead of its color, improving its test accuracy from 0.5% to 25.1%

Fixing text gender biases - CDEP can help to learn spurious biases in a dataset, such as gendered words

using CDEP on your own data

using CDEP requires two steps:

  1. run CD/ACD on your model. Specifically, 3 things must be altered:
    • the pred_ims function must be replaced by a function you write using your own trained model. This function gets predictions from a model given a batch of examples.
    • the model must be replaced with your model
    • the current CD implementation doesn't always work for all types of networks. If you are getting an error inside of, you may need to write a custom function that iterates through the layers of your network (for examples see
  2. add CD scores to the loss function (see notebooks)

related work

  • ACD (ICLR 2019 pdf, github) - extends CD to CNNs / arbitrary DNNs, and aggregates explanations into a hierarchy
  • PDR framework (PNAS 2019 pdf) - an overarching framewwork for guiding and framing interpretable machine learning
  • TRIM (ICLR 2020 workshop pdf, github) - using simple reparameterizations, allows for calculating disentangled importances to transformations of the input (e.g. assigning importances to different frequencies)
  • DAC (arXiv 2019 pdf, github) - finds disentangled interpretations for random forests


  • feel free to use/share this code openly
  • if you find this code useful for your research, please cite the following:
  title={Interpretations are useful: penalizing explanations to align neural networks with prior knowledge},
  author={Rieger, Laura and Singh, Chandan and Murdoch, William and Yu, Bin},
  booktitle={International Conference on Machine Learning},

Deep Explanation Penalization

Code for using CDEP from the paper "Interpretations are useful: penalizing explanations to align neural networks with prior knowledge"

Deep Explanation Penalization Info

⭐ Stars 93
🔗 Source Code
🕒 Last Update 5 months ago
🕒 Created 3 years ago
🐞 Open Issues 0
➗ Star-Issue Ratio Infinity
😎 Author laura-rieger