We use models1 as epistemological tools to gain understanding about the world or a particular phenomenon. Machine learning models in particular are becoming ever more ubiquitous. However, due to the complexity of such models, often referred to as black boxes, the knowledge we are able to extract seems limited. To overcome this problem, various approaches aim at making the model explainable or interpretable. In the literature these terms are often used interchangeably. In the following2 I will argue, firstly, that they are not the same but rather represent two different approaches, and secondly, that interpretable AI is epistemologically more relevant and therefore more desirable. This is because, at best, it resembles human reasoning.

The black-box problem

The black-box problem arises more or less from the nature of how current machine learning models work, especially Deep Neural Networks. Machine learning algorithms are able to learn correlations between input data and output data by themselves. In other words, machine learning maps input data to output data representing the target phenomenon by approximating a mathematical function. This function is iteratively optimized through so-called backpropagation and usually improves with the number of data samples the algorithm ‘sees’ or processes. The exact details are of less concern here. Instead, it should be highlighted that the developer of the algorithm often does not know how the algorithm works (Sullivan, 2019, p. 10). That is because of the algorithm’s aforementioned self-optimization. Nevertheless, the developer constructs the rough architecture of the model, for example a neural network with a certain number of layers. Besides, they set up the learning environment by deciding which training data to provide to the algorithm, and also whether a certain classification outcome is desired, i.e. supervised learning, or not, i.e. unsupervised learning (Sullivan, 2019, p. 14). So it is not that developers are completely in the dark; it is rather the inner workings of the algorithm or model that remain hidden, hence the name black box.
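The core idea of approximating a function by iterative error-driven optimization can be sketched in a few lines. The toy below is deliberately transparent, a single linear model fitted by gradient descent, and is only an illustration of the principle, not of the deep networks discussed here:

```python
# Minimal sketch: a model 'learns' a mapping from inputs to outputs by
# iteratively adjusting its parameters to reduce its prediction error,
# which is the principle behind backpropagation-based training.

def train(samples, steps=1000, lr=0.01):
    """Fit y = w * x + b by gradient descent on the squared error."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        for x, y in samples:
            error = (w * x + b) - y  # prediction minus target
            w -= lr * error * x      # step against the gradient w.r.t. w
            b -= lr * error          # step against the gradient w.r.t. b
    return w, b

# The 'phenomenon': y = 2x + 1, observed only through data samples.
data = [(x, 2 * x + 1) for x in range(-5, 6)]
w, b = train(data)
print(round(w, 2), round(b, 2))  # approaches 2.0 and 1.0
```

With two parameters the learned function is fully inspectable; the black-box problem emerges once millions of such parameters interact in deep architectures.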

Due to algorithmic complexity the outcome or output may ‘look’ promising, but the reasons why the algorithm derived a particular conclusion may differ from our expectations or our reasoning. Famous examples of this are the identification of wolves by the occurrence of snow in the images or the detection of tanks depending on the time of day (Guidotti et al., 2019, p. 4). To us humans, these reasons are not satisfying grounds for classifying the images accordingly. The two examples also show that algorithms are very good at picking up subtle regularities, which are often not conceivable or understandable to us humans due to their numerical complexity (Zednik, 2019, p. 16). This leaves the developers to ‘wonder’ how the algorithm arrived at or computed a certain output given its input. In this case, the algorithm is also called opaque and thus there is an epistemic lack, namely that we cannot (fully) relate input to output or cause to effect (Zednik, 2019, p. 5).

Two approaches to the black-box problem

Much research is being done on how to solve or ‘open’ the black-box and thus make the model transparent (Doshi-Velez & Kim, 2017, p. 2). There are mainly two approaches, as presented in figures 1 and 2 (Doran et al., 2017, p. 4). The first approach accepts the black-box and tries to solve and understand it by making it transparent through another model, i.e. post-hoc (Rudin, 2019, p. 1). The second approach, by contrast, tries to avoid the black-box problem in the first place, i.e. ad-hoc, by aiming to develop a simple and understandable model.

Figure 1: The black-box problem can be separated into the problem of explaining how the model arrived at a certain outcome (Black Box Explanation) and into the problem of directly designing a transparent classifier (Transparent Box Design). The other distinctions within the ‘Black Box Explanation’ are not relevant here. Source: Guidotti et al. (2019, p. 11)
Figure 2: Doran et al. (2017, p. 5) distinguish between interpretable and comprehensible models. The former is consistent with what I call explainable AI in that it provides (mathematical) explanations, while the latter is more consistent with what I call interpretable AI, in that its reasoning is similar to human reasoning

Although these two approaches may be acknowledged by researchers and machine learning practitioners, related terms such as explainability, interpretability or comprehensibility are often used interchangeably (Brennen, 2020, pp. 2–4). As Arrieta et al. put it, “It appears from the literature that there is not yet a common point of understanding on what interpretability or explainability are” (Arrieta et al., 2019, p. 85). Moreover, this lack of agreement also extends to the “formal technical meaning” (Lipton, 2017, p. 2).

In the following I want to work towards a common understanding by arguing that explainability and interpretability are not the same, but rather represent the aforementioned approaches respectively. The epistemological value of closing the knowledge gap of the black-box therefore differs greatly between the explainable AI approach and the interpretable AI approach. In other words, the difference lies in what we are able to learn about the phenomenon and about the model itself. Work has already been done in this field (Arrieta et al., 2019; Brennen, 2020; Guidotti et al., 2019) and the following is to be understood as a more general overview.

Explanation and Interpretation

Before I move on, I want to present definitions of the terms explanation and interpretation respectively. According to the Oxford English Dictionary, an explanation is “A statement or account that makes something clear; a reason or justification given for an action or belief” (Doran et al., 2017, p. 2). On the other hand, an interpretation or “interpretability [is] the ability to explain or to present in understandable terms to a human” (Doshi-Velez & Kim, 2017, p. 2); and “for a system to be interpretable, it must produce descriptions that are simple enough for a person to understand using a vocabulary that is meaningful to the user” (Gilpin et al., 2019, p. 2). These definitions are obviously very similar, and I do not want to dissect their exact linguistic differences. But what we can see from them is that an interpretation includes an explanation, whereas an explanation needs further interpretation.3

If we look at a machine learning model and take the perspective of the algorithm, we would not ask ‘what can we say about the model’, but ‘what can the model tell us about a phenomenon’. In other words, we can either imagine how the algorithm explains its actions using its own mathematical language, which we humans have to ‘interpret’, or how the algorithm interprets what the phenomenon means to us humans and accordingly ‘explains’ it in understandable terms. Having this conception in mind, I will take a closer look at explainable AI in the following section.

Explainable AI

As mentioned earlier, explainable AI is concerned with providing post-hoc explanations about the functioning of the model under scrutiny (Arrieta et al., 2019, p. 85). This is done by using another model, which is why providing an explanation can also be seen as a form of modelling (Mittelstadt et al., 2019, p. 279).

Zednik (2019) provides a framework to “solve the black box problem”. In doing so, he identifies different stakeholders within the machine learning ecosystem, for example the decision subjects, the developers, or the examiners. As each stakeholder has different questions, each is concerned with a different level of the model’s functioning. Accordingly, the questions can be about the why, the how or the what, and are addressed using different techniques (Zednik, 2019, p. 8). Therefore, Zednik also refers to many black-box problems rather than just a single one, and defines opaqueness as an “agent-relative” characteristic (Zednik, 2019, p. 5). This also means that there are several explanations for one and the same model.

Problems with explanations

Although some explanations provide insights into the black-box and might be helpful in some way or the other, there are still some challenges to consider (Rudin, 2019, pp. 2–5).

First of all, there are no standardized criteria for a good explanation and thus “there is no agreement on what an explanation is” (Guidotti et al., 2019, p. 36). The problem is compounded by the fact that, as Zednik points out, different stakeholders ask different questions about the system. So what might seem a sufficient explanation to the developer might not be helpful to the decision subject. Due to the multitude of explanations for the functioning of the model or even for the occurrence of an event, some of which are more appropriate than others, we can therefore not be sure that the provided explanation is sufficient or satisfactory (Doshi-Velez and Kim, 2017, p. 3; Mittelstadt et al., 2019, p. 280). In addition, algorithms often handle high-dimensional variables that are not understandable to us humans (Zednik, 2019, p. 16). And since the explanation is given post-hoc, we can ascribe the meaning of the data or outcome only afterwards. In other words, the current process attempts to translate incomprehensible machine language into humanly understandable language. So again, how can we be sure that this translation process is correct or even complete? Certainly completeness could always be a problem, and even humans are selective when they provide an explanation (Mittelstadt et al., 2019, p. 284). Besides, a full explanation might sometimes not even be helpful because it would be too difficult to understand, i.e. it would not be interpretable (Doran et al., 2017, p. 6). Nevertheless, the point is that “Explanations are often not reliable, and can be misleading” (Rudin, 2019, p. 1).

Furthermore, it is rarely the case that the models providing the explanations are tested or validated (Mittelstadt et al., 2019, p. 282). So, for example, the very same heatmap4 that tries to explain why the model assigns an image to a certain class could be used to classify the same image with a completely different class (see Figure 3). This again shows not only that there are several different explanations, but also that a single explanation can be ambiguous.

Figure 3: The heatmap does not provide any insight into why the image is classified either as husky or as flute. Source: Rudin (2019, p. 5)
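To make concrete how such post-hoc heatmaps can be produced at all, one widely used family of techniques is occlusion: mask parts of the input, re-run the model, and record how much the output score changes. The sketch below uses a hypothetical `toy_model` as a stand-in for any black-box classifier; it is an illustration of the technique, not of the saliency method used in Figure 3:

```python
# Hedged sketch of occlusion-based saliency: a pixel's 'importance' is
# the drop in the model's score when that pixel is masked out. The toy
# model here is hypothetical and stands in for an arbitrary black box.

def toy_model(pixels):
    # Stand-in black box: scores an image by summing its pixel values.
    return sum(sum(row) for row in pixels)

def occlusion_heatmap(model, pixels):
    """Importance of each pixel = score drop when that pixel is zeroed."""
    base = model(pixels)
    heat = []
    for i, row in enumerate(pixels):
        heat_row = []
        for j, _ in enumerate(row):
            occluded = [r[:] for r in pixels]  # copy, then mask one pixel
            occluded[i][j] = 0
            heat_row.append(base - model(occluded))
        heat.append(heat_row)
    return heat

image = [[0, 3], [1, 0]]
print(occlusion_heatmap(toy_model, image))  # [[0, 3], [1, 0]]
```

Note that the heatmap is itself the output of another model of the model, which is exactly why its faithfulness needs validation.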

All in all, this does not mean that explanations per se are useless, but that one must remain critical. Especially since it might not always be clear whether we are learning about the workings of the model or about the phenomenon through the findings in the data (Guidotti et al., 2019, p. 5). Certainly, these two tasks may be closely connected. But in the case of the algorithm that classified wolves on the basis of the occurrence of snow, the explanation did not provide any useful knowledge about the phenomenon. Maybe just that the essence of a wolf is not snow. But why, then, do we again try to understand how the algorithm perceives the world and not vice versa? It should be epistemologically more desirable to learn about the phenomenon through the model instead of learning about the model itself.

When explanations are not necessary

Sullivan (2019) presents an interesting middle course, so to speak, between explainable AI and interpretable AI. I want to illustrate it briefly before I move on to interpretable AI.

In contrast to Zednik, she does not focus on how a model can be explained, but rather on how much knowledge is available about the target phenomenon that the model tries to describe. She argues that the relevant factor in understanding the outcome of the model is not its black-box characteristic or its opaqueness, but rather the “link certainty” or uncertainty, that is, the extent to which the model can be connected to the phenomenon through empirical evidence. So a model might correlate an individual’s facial expression with their sexual orientation, but there is no evidence that one causes the other, so there is great link uncertainty (Sullivan, 2019, p. 21). In contrast, the more scientific background knowledge is available, the more the model can be justified in relation to the phenomenon, i.e. the link certainty is stronger. This means we can more or less comprehend how the input maps to the output, but the black-box, that is the inner workings of the model, could still remain opaque.

As a result, if the link certainty is strong enough, an “Explanation is not necessary [because] the problem is sufficiently well-studied and validated in real applications that we trust the system’s decision, even if the system is not perfect” (Doshi-Velez & Kim, 2017, p. 3). In this case, one can ask oneself why one needs a model in the first place and what epistemic value it has. As Sullivan argues, the model might still reveal hidden and unknown correlations (Sullivan, 2019, p. 20).

One problem with Sullivan’s view, however, is that if we accept the black-box, the outcome computed by the model may be based on different assumptions than the available scientific knowledge suggests, as the above-mentioned examples of identifying wolves or tanks illustrate. Although the model could reveal correlations, these could be less helpful or perhaps even harmful due to conflicting views of the world held by machines and humans. So the question remains as to what the machine regards as relevant features.

Interpretable AI

A concept closely related to, yet distinct from, interpretable AI is causal inference (Krishnan, 2020, p. 499; Lipton, 2017, p. 3; Murdoch et al., 2019, p. 2). Current machine learning models, however, reason differently from humans (Khemlani & Johnson-Laird, 2019) and, as Pearl puts it, “Data do not understand causes and effects; humans do” (Pearl & Mackenzie, 2018, p. 26). Therefore, building on Sullivan’s account, the goal of interpretable AI, in contrast to explainable AI, is to develop a model ad-hoc with our understanding of causal relationships or our “mental representation” (Pearl, 2019, p. 2).

In this way we can overcome the problem of translating incomprehensible machine language into understandable human language. Furthermore, we do not need to worry whether the relevant features actually relate to the target phenomenon, since the model has a reasoning process that is “qualitatively similar to that of humans” (Chen et al., 2019, p. 2). Figure 4 illustrates such an interpretable AI model. So even though we might not fully understand the inner workings of the model, that is its mathematical operations, i.e. the how, we would still know what the model is looking for by knowing why a certain output is computed (Mittelstadt et al., 2019, p. 280). Interestingly, as Zednik also mentions, most of the stakeholders in the machine learning ecosystem are not concerned with ‘looking into’ the black-box when looking for an explanation. Rather, they ‘look outside’ to see how the model relates to the environment (Zednik, 2019, pp. 10–11). This is similar to Sullivan’s link certainty.

Figure 4: Illustration of an interpretable AI model with a mental representation, or here knowledge base, which ideally equals ours. Based on this, the machine has a similar reasoning process to ours. Source: Doran et al. (2017, p. 7)

However, a trade-off between accuracy and interpretability or explainability is often mentioned. This makes it appear that we have to opt for black-boxes if we want to achieve good (predictive) results, which seems an undesirable choice. However, this trade-off is not necessarily true (Arrieta et al., 2019, p. 100; Rudin, 2019, p. 2). Instead, simple models can achieve promising results too. In order to do so, one necessary step is to reduce the number of features and variables, also known as dimensionality reduction (Vellido et al., 2012, pp. 165–166). This requires background knowledge about the relationship between relevant features or parameters of the model and the target phenomenon. Feature selection, a task performed by humans, was a common approach in AI until neural networks became more powerful and were able to learn ‘relevant’ features independently. So one aim of interpretable AI is to reduce complexity by incorporating human knowledge instead of ‘machine knowledge’ (Doran et al., 2017, pp. 4–5).
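The kind of human-guided feature selection described above can be sketched very simply, for instance as a filter method that keeps only the features most correlated with the target. The feature names and data below are illustrative, not drawn from the cited works:

```python
# Minimal sketch of filter-based feature selection: rank candidate
# features by the absolute value of their correlation with the target
# and keep only the top k, reducing the model's dimensionality.

def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def select_features(columns, target, k=1):
    """Keep the k features most strongly correlated with the target."""
    ranked = sorted(columns, key=lambda name: -abs(correlation(columns[name], target)))
    return ranked[:k]

# Hypothetical measurements: one informative feature, one noise feature.
features = {
    "wingspan": [1.0, 2.0, 3.0, 4.0],
    "noise":    [0.3, 0.1, 0.4, 0.2],
}
target = [2.0, 4.0, 6.0, 8.0]
print(select_features(features, target, k=1))  # ['wingspan']
```

Which candidate features enter such a ranking in the first place is exactly where human background knowledge about the phenomenon comes in.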

One example is provided by Chen et al., who built a neural network, named “this looks like that”, which classifies images, for example an image of a bird, based on pre-selected features, or “prototypes”, such as wings, legs or the head. If the model finds these features, it may classify the image accordingly (see figure 5). In that sense, the model has a mental representation of how we perceive the world and thus, as the authors claim, “we have defined a form of interpretability in image processing (this looks like that) that agrees with the way humans describe their own reasoning in classification” (Chen et al., 2019, p. 9). Moreover, with this kind of model it would be possible to intervene in the learning environment and change the usual cause-effect relationship to produce new, epistemologically relevant questions, one of them being counterfactual questions (Pearl & Mackenzie, 2018, p. 3).

Figure 5: The ‘this looks like that’ neural network classifies images based on already known features or prototypes. Here too the model possesses a mental representation. Source: Chen et al. (2019, p. 2)
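The spirit of such prototype-based reasoning can be sketched in a few lines: an input is scored by its similarity to human-meaningful prototypes of each class, and those similarity scores double as the model’s justification. The prototype vectors and feature names below are made up for illustration; in the actual network, prototypes are learned patches in a latent space:

```python
# Hedged sketch of prototype-based classification: classify an input by
# which class prototype it most resembles; the per-class similarity
# scores themselves serve as the human-readable justification.

def similarity(a, b):
    """Negative squared distance: higher means 'this looks more like that'."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))

def classify(features, prototypes):
    """Return the best-matching class and the full score table."""
    scores = {label: similarity(features, proto)
              for label, proto in prototypes.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Illustrative prototypes: (wing-likeness, beak-likeness, fur-likeness).
prototypes = {"bird": (0.9, 0.8, 0.1), "wolf": (0.1, 0.0, 0.9)}
label, scores = classify((0.8, 0.7, 0.2), prototypes)
print(label)  # bird
```

Because the prototypes are defined in terms meaningful to us, inspecting the score table answers the why without requiring us to decode the model’s internal arithmetic.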

Challenges of Interpretable AI

The previous discussion is not intended to claim that interpretable AI is flawless or that it should be used in every situation. Certainly, teaching the algorithm a mental representation might not solve the black-box problem in certain cases, as humans or different stakeholders also have different mental representations depending on context or background knowledge. Nevertheless, it would be advantageous, because we can study these mental representations: they are based on humanly understandable language or meaning, and therefore we do not have to try to understand the machine first.

However, interpretable models require more knowledge and time to develop (Rudin, 2019). This is because “A model can be explained, but the interpretability of the model is something that comes from the design of the model itself” (Arrieta et al., 2019, p. 85). For example, it is difficult to train an algorithm that recognizes a chair regardless of its angle, although Deep Neural Networks are promising for solving this task (Buckner, 2018, p. 5346). And time itself should not be the reason to stop this endeavour.


Explainable AI and interpretable AI are two distinct approaches to the black-box problem with different epistemic characteristics. Depending on the use-case of the model, one should be aware of this fact, since in some cases ‘explanations’ might be sufficient, but in others we require a meaningful ‘interpretation’. One could argue that through explanations models become successively interpretable. This might be true, but we should nevertheless aim at interpretable AI from the beginning in order to avoid unknown and undesirable consequences that may result from the outcome or prediction of the model.

By designing interpretable AI, this would mean a shift in the current ethos of “prediction over knowledge” (Jones, 2018, p. 674), to a preference of knowledge. Besides, by providing the model with a mental representation, it seems as if we are going back to building expert systems (Forsythe, 1993). However, now with a different conception of knowledge, which is less formal and rule-based than it was years before (Ensmenger, 2011).

Although I propose that interpretable AI is epistemologically more relevant and therefore more desirable, because explainable AI only seems to provide us with “working knowledge” (Stevens, 2017), that is, knowledge about how to use and apply a model, I do not claim that all models must be interpretable. There are certainly cases, such as non-critical decision-making processes, where explainable AI is sufficient.


  • Arrieta, A. B., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., Garcia, S., Gil-Lopez, S., Molina, D., Benjamins, R., Chatila, R. & Herrera, F. (2019). Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58, 82–115. https://doi.org/10.1016/j.inffus.2019.12.012
  • Brennen, A. (2020). What do people really want when they say they want “explainable ai?” we asked 60 stakeholders. Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, 1–7. https://doi.org/10.1145/3334480.3383047
  • Buckner, C. (2018). Empiricism without magic: Transformational abstraction in deep convolutional neural networks. Synthese, 195(12), 5339–5372. https://doi.org/10.1007/s11229-018-01949-1
  • Chen, C., Li, O., Tao, C., Barnett, A. J., Su, J. & Rudin, C. (2019). This looks like that: Deep learning for interpretable image recognition. Advances in Neural Information Processing Systems. https://arxiv.org/abs/1806.10574
  • Doran, D., Schulz, S. & Besold, T. R. (2017). What does explainable AI really mean? a new conceptualization of perspectives. Advances in Neural Information Processing Systems. https://arxiv.org/abs/1710.00794
  • Doshi-Velez, F. & Kim, B. (2017). Towards a rigorous science of interpretable machine learning. https://arxiv.org/abs/1702.08608
  • Ensmenger, N. (2011). Is chess the drosophila of artificial intelligence? a social history of an algorithm. Social Studies of Science, 42 (1), 5–30. https://doi.org/10.1177/0306312711424596
  • Forsythe, D. E. (1993). Engineering knowledge: The construction of knowledge in artificial intelligence. Social Studies of Science, 23 (3), 445–477. http://www.jstor.org/stable/370256
  • Gilpin, L. H., Bau, D., Yuan, B. Z., Bajwa, A., Specter, M. & Kagal, L. (2019). Explaining explanations: An overview of interpretability of machine learning. http://arxiv.org/abs/1806.00069
  • Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F. & Pedreschi, D. (2019). A survey of methods for explaining black box models. ACM Computing Surveys, 51(5), 1–42. https://doi.org/10.1145/3236009
  • Jones, M. L. (2018). How we became instrumentalists (again). Historical Studies in the Natural Sciences, 48(5), 673–684. https://doi.org/10.1525/hsns.2018.48.5.673
  • Khemlani, S. & Johnson-Laird, P. N. (2019). Why machines don’t (yet) reason like people. KI - Künstliche Intelligenz, 33(3), 219–228. https://doi.org/10.1007/s13218-019-00599-w
  • Krishnan, M. (2020). Against interpretability: A critical examination of the interpretability problem in machine learning. Philosophy & Technology, 33 (3), 487–502. https://doi.org/10.1007/s13347-019-00372-9
  • Lipton, Z. C. (2017). The mythos of model interpretability. Retrieved November 6, 2020, from http://arxiv.org/abs/1606.03490
  • Mittelstadt, B., Russell, C. & Wachter, S. (2019). Explaining explanations in AI. Proceedings of the Conference on Fairness, Accountability, and Transparency - FAT* ’19, 279–288. https://doi.org/10.1145/3287560.3287574
  • Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R. & Yu, B. (2019). Interpretable machine learning: Definitions, methods, and applications. Proceedings of the National Academy of Sciences, 116(44), 22071–22080. https://doi.org/10.1073/pnas.1900654116
  • Pearl, J. (2019). The limitations of opaque learning machines. In J. Brockman (Ed.), Possible minds: Twenty-five ways of looking at ai (pp. 13–19). Penguin Books.
  • Pearl, J. & Mackenzie, D. (2018). The book of why: The new science of cause and effect (1st). Basic Books, Inc.
  • Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. http://arxiv.org/abs/1811.10154
  • Stevens, H. (2017). A feeling for the algorithm: Working knowledge and big data in biology. Osiris, 32(1), 151–174. https://doi.org/10.1086/693516
  • Sullivan, E. (2019). Understanding from machine learning models. The British Journal for the Philosophy of Science. https://doi.org/10.1093/bjps/axz035
  • Vellido, A., Martın-Guerrero, J. D. & Lisboa, P. J. G. (2012). Making machine learning models interpretable. Computational Intelligence, 10.
  • Zednik, C. (2019). Solving the black box problem: A normative framework for explainable artificial intelligence. Philosophy & Technology. https://doi.org/10.1007/s13347-019-00382-7

  1. Throughout this paper I will use the terms model and algorithm synonymously. ↩︎

  2. For the most part, I present examples of computer vision, as they lend themselves to visualising the problem. But in general this discussion is applicable to all decision-making algorithms. In addition, I am only concerned with machine learning systems, which means that in the ‘hard sciences’ or other areas where models are used, the following statements may not apply. ↩︎

  3. A complete yet complex explanation might be correct, but cannot be interpreted. While an interpretation carries the risk of oversimplification and thus incompleteness of the explanation (Gilpin et al., 2019, p. 2). ↩︎

  4. Heatmaps or saliency maps are used to highlight areas of an image that were relevant to the algorithm’s decision; For further clarification see Zednik (2019, p. 11). ↩︎