The Causal Uncertainty of Modelling with Big Data

Big Data, or data-intensive science, is often perceived as leading to more knowledge by processing more data (boyd & Crawford, 2012, p. 663). However, this is questionable. We face causal uncertainty in Big Data modelling because it is becoming increasingly difficult to represent complex phenomena adequately, and ironically, this is precisely because of the increase in data. We always face uncertainty, and I do not claim that we have to overcome it. Rather, we should be aware of the pitfalls of constructing models based on Big Data if we want to become “more certain”.

In contrast, Pietsch (2016) argues in his paper for the causal nature of modelling in Big Data. As he puts it: “data-intensive science essentially aims at identifying causal structure” (p. 138). For him, this is because we apply so-called “eliminative induction”. But the more data we have, the more possibilities we have to eliminate. At the same time, by eliminating too much information, we run the risk of oversimplifying or reducing the phenomenon, which is not helpful because the assumed causal relationships could just as well be one correlation among many. Therefore, the challenge in Big Data modelling is to find the right balance between simplicity and complexity in order to represent the phenomenon adequately.

Data-intensive science and Big Data

Data-intensive science is often referred to as “the death of theory” (Anderson, 2008). Although Anderson made this claim to provoke (Frické, 2015, p. 652), we see in various fields how scientists no longer formulate hypotheses derived from theory and then test them, but derive causal relationships, or simply correlations, directly from data that represents the phenomenon (Kitchin, 2014, p. 4). Thus, it is claimed that theories are “born from the data” (Kitchin, 2014, p. 2). Consequently, the way science operates and knowledge is produced is entering a “fourth paradigm” that is more exploratory (Hey et al., 2009, p. xxx).

As the name ‘data-intensive science’ suggests, data is required, and this data is often referred to as Big Data. Big Data describes data in terms of variety, velocity and volume, in short the 3 V’s (Kitchin, 2014, p. 1). But Big Data also refers to computational methods and the use of software (Symons & Alvarado, 2016, p. 2). Here the definition overlaps with that of data-intensive science, which rather describes the scientific process consisting of data collection, data storage and data analysis (Pietsch, 2016, p. 140). There might be conceptual differences between the two terms ‘data-intensive science’ and ‘Big Data’, but since they are closely intertwined, I will not distinguish between them. Rather, I will focus on the interplay of data collection and data analysis, which is based on the variety, velocity and volume of the data.

Data Analysis with Machine Learning

In order to analyze large amounts of data, which can easily exceed 100 GB, software is necessary (Symons & Alvarado, 2016, p. 3). Specifically, machine learning algorithms that rely on statistical approaches are helpful in detecting patterns within the data set. For these algorithms to function, that is, to approximate a mathematical function that adequately maps inputs to outputs, features are needed. These features are contained within the collected data, which represents the phenomenon (or hypothesis) under investigation (Pietsch, 2016, p. 140). Thus, it is claimed that

Data-intensive models [...] are quite complex because little of the original data is actually discarded. (Pietsch, 2016, p. 139)

Certainly, these algorithms are complex for various reasons, but not for the reason Pietsch provides. Data themselves are “algorithmically random strings of digits” (McAllister, 2003, p. 639), and due to computational constraints on time or economic resources, it is not possible or practical to consider all “original data” or all features within the data. Besides, the more features are considered, the more data is needed to sufficiently differentiate each of these features. This is also known as the curse of dimensionality (Bishop, 2006, p. 35). Consequently, we only want features that are relevant, which conversely means that some, or even much, of the information within the data has to be discarded. How much information is discarded varies from case to case and depends on whether a scientist selects the features “by hand” or whether the algorithms themselves extract relevant information.[1] Either way, this task requires tacit or background knowledge, or even just assumptions, about the interaction between the collected data and the phenomenon under investigation (Frické, 2015, p. 654).
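The curse of dimensionality mentioned above can be made concrete with a minimal sketch (the numbers are illustrative, not taken from the cited sources): if every feature axis is split into ten bins, the number of bins that must be populated with data grows exponentially with the number of features.

```python
# Illustrative sketch of the curse of dimensionality (numbers are
# invented): covering a d-dimensional feature space at a fixed
# resolution requires exponentially many cells, and hence
# exponentially many data points to fill them.

def cells_needed(d, bins_per_axis=10):
    """Grid cells required to cover a d-dimensional feature space."""
    return bins_per_axis ** d

for d in (1, 3, 10):
    print(f"{d} feature(s) -> {cells_needed(d):,} cells to fill")
```

With one feature, ten data points could in principle touch every cell; with ten features, ten billion cells would have to be covered, which is why adding features without adding data leaves most of the space empty.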

Furthermore, there is no single mathematical function or algorithm that perfectly maps input to output. This is because these algorithms are trained on historical data from which predictions are derived, so any not-yet-seen data point could be an outlier. Therefore, the overall aim is to find a function that generalizes well. On the one hand, we do not want to learn the training data one-to-one, because then the algorithm (or model) would no longer represent the phenomenon but the data (i.e. overfitting), which means that predictions would not be possible or satisfactory. On the other hand, we need a sufficient number of features, otherwise there is a risk that the phenomenon will not be adequately represented (i.e. underfitting). So knowledge is not only needed for the selection of relevant features, but also for the selection of the algorithm and for deciding which parameters or weights of the algorithm have to be adjusted (Frické, 2015, pp. 654–5). As we can see, theory, when understood as tacit knowledge and assumptions, is not (yet?) “dead”.
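The overfitting half of this trade-off can be sketched with a toy example (the data points are invented for illustration): a model that simply memorizes its training data has zero training error but fails on unseen inputs, while a plain line fitted by ordinary least squares still yields a sensible prediction.

```python
# Toy illustration of overfitting vs. generalization. The training
# data roughly follows y = x, plus noise; the values are invented.
train = [(0, 0.1), (1, 0.9), (2, 2.1), (3, 2.9)]
test_x = 4  # an unseen input, with true value near 4.0

# "Overfitted" model: a lookup table that memorizes the training data.
memory = dict(train)
def memorizer(x):
    return memory[x]  # raises KeyError for any x it has not seen

# Simple model: a line fitted by ordinary least squares.
n = len(train)
mx = sum(x for x, _ in train) / n
my = sum(y for _, y in train) / n
slope = (sum((x - mx) * (y - my) for x, y in train)
         / sum((x - mx) ** 2 for x, _ in train))
intercept = my - slope * mx
def linear(x):
    return intercept + slope * x

print(round(linear(test_x), 2))  # close to 4.0, despite never seeing x = 4
```

The memorizer represents the data, not the phenomenon: it is perfect on the four training points and useless everywhere else, which is exactly the failure of prediction described above.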

Inductive Risk

As we can see, data analysis, or rather modelling with machine learning, involves value judgements made by the scientist. Douglas (2000) calls this inductive risk. Scientists could increase the weights of certain features because they think these features are more relevant than others, or they could even judge entire data sets as useful or not. In doing so, the threshold for obtaining either a false positive or a false negative result is also changed. That raises questions about the explanatory power of the model and how adequately it represents the phenomenon. The latter point depends heavily on which data is collected.
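The threshold judgement can be illustrated with a hypothetical sketch (the scores and labels below are invented): shifting a classifier's decision threshold trades false positives against false negatives, and where to set it is a value-laden choice, not a purely technical one.

```python
# Hypothetical classifier outputs: a score per case and the true label
# (1 = condition present). All values are invented for illustration.
scores = [0.2, 0.4, 0.45, 0.6, 0.7, 0.9]
labels = [0,   0,   1,    0,   1,   1  ]

def confusion(threshold):
    """Count false positives and false negatives at a given threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp, fn

print(confusion(0.5))  # stricter threshold: fewer false alarms, more misses
print(confusion(0.3))  # lenient threshold: fewer misses, more false alarms
```

Which error is worse, a false alarm or a missed case, depends on the stakes of the application (e.g. a missed disease versus an unnecessary treatment), which is precisely the inductive risk Douglas describes.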

Data Collection

Velocity

Various sensors and instruments allow us to collect data in or near real-time (Kitchin, 2014, p. 1). Yet machine learning algorithms rely on historical data, which generally says nothing about the future. Predictions are without doubt possible, but again these predictions rely on past behaviour and only represent a likelihood that a certain event will occur. This may be satisfactory for certain fields, e.g. medicine, where the causal relationship between symptoms and disease does not change drastically. But in the social sciences, human beings are not that “static”; rather, they evolve over time. In that regard, it is questionable to what extent the (outdated) data is useful in the first place. So, instead of hoping that the data alone will provide novel insights, we humans have to find new ideas on how to combine “old” data. But that bears dangers, such as reinforcing biases. Further challenges exist as well, such as standardizing data from different sources (Leonelli, 2014, p. 4).

Variety

For example, during a medical examination, doctors “construct a coherent ‘picture’ of a patient from heterogeneous pieces of information” (van Baalen & Boon, 2014, p. 438), which in itself is a difficult task. But now technologies such as fitness trackers like the FitBit or the Apple Watch, each of which tracks different things, add even more information. As a result, data comes from a “broad range of heterogeneous sources” (Boumans & Leonelli, 2020, p. 80). Not only do the sources expand, but so do the stakeholders, i.e. the patient now has the possibility to contribute data themselves. This is certainly promising[2], but it also adds more dimensions, not all of which can be computed. We should also bear in mind that data collection is limited by what the technology can do. In this regard, choosing an instrument for data collection automatically implies a decision on which data to collect and which not to collect. This is also known as sampling bias (boyd & Crawford, 2012, p. 669). This may sound trivial, but Twitter users, for example, do not adequately represent all people. In addition, political or economic factors, which are often beyond the scientist’s control, also influence the data collection process (Leonelli, 2014, p. 7).

Nevertheless, these data points themselves contain so-called metadata, that is, data about the data[3]. A photograph taken by a smartphone, for example, contains information about time, location, resolution and more. From the information about time one could derive frequency patterns, or from the location it could be possible to infer demographic facts about the person using the smartphone.[4] So, on the one hand, metadata provides useful information to contextualize data (Boumans & Leonelli, 2020, p. 93), but on the other hand it again adds more dimensions to consider. As a result, we are faced with the problem that

Every one of those sources is error-prone, and [...] I think we are just magnifying that problem [when we combine multiple data sets]. (Bollier, 2010, p. 13)

Volume

As we can see, today’s technologies and sensors allow us to collect various data that were not available to us before, which results in huge amounts of data. In that sense, it seems as if we are able to acquire the ‘Big Picture’ and to understand certain phenomena better. So one might claim that more data provides a clearer and better understanding and thus leads to new insights (boyd & Crawford, 2012, p. 663). This is certainly true in some cases. But, as should be clear by now, the increasing amount of data also introduces a certain complexity in finding the relevant features that account for causal relations. In other words, the problem is not finding correlations across data sources or within a data set, but that we may find too many correlations (Canali, 2016, p. 5). Consequently, scientists have to be careful with the choices they make.

Data Preparation

Data collection is a complex task due to the almost “infinite” possibilities. Even if we collect and store all possible data, we still have to prepare it for analysis because data itself has no meaning. Data preparation in this context is understood as modelling, i.e. as representing the phenomenon, which rests heavily on data interpretation.[5] As boyd and Crawford (2012, p. 667) put it: “All researchers are interpreters of data”.

Reductionism

Although scientists face an “endless” amount of choices, Big Data or data-intensive science is very useful in areas that are very confined and theoretically grounded. In these areas algorithms often perform better than humans, especially in terms of speed (Boon, 2020, p. 6).[6] In the medical domain, for example, studies suggest that algorithms for detecting skin cancer from medical images are more effective than humans (Bjerring & Busch, 2020, p. 20). However, when dealing with more complex phenomena, as in the social sciences, data-intensive science could have harmful outcomes, like a biased facial recognition system. Pietsch discusses the phenomenon of political micro targeting:

with respect to the microtargeting [...] it is not necessary to have data on all voters, but rather data suffices on a smaller number of individuals that are representative of the possible variations that can occur in the population. (Pietsch, 2016, p. 141)

These “individuals that are representative”, however, often do not represent the entire population as desired, but instead only a very limited social group. As a result, the complex lives of humans are often reduced to some universal assumptions (Cukier & Mayer-Schoenberger, 2013). Similarly, Boumans and Leonelli (2020, p. 96) stress the importance of defining the “richness, variability and multiplicity of features” in plant phenomics. Otherwise, without context-dependent information the data cannot be situated and thus loses its usefulness outside of the context in which it was collected. As Leonelli puts it, we should

disaggregate the notion of Big Data science as a homogeneous whole and instead pay attention to its specific manifestations across different contexts. (Leonelli, 2014, p. 8)

It seems that we are facing a contradiction. On the one hand, we want to reduce complexity in order to make sure that we are only examining the relevant, or difference-making, features. On the other hand, we want to adequately represent the complexity of the phenomenon in order to avoid reducing it, e.g. human behaviour, to some universal assumptions. Certainly, we always have to compress and reduce information. A map with a 1:1 scale, for example, would be useless. However, a map still represents the phenomenon, i.e. the territory, in a useful manner.[7]

Multiple Comparisons

The danger of reducing or simplifying complex phenomena into their subsystems, however, is that by themselves these might not reveal novel insights, but instead lead to false positives, of which only the positive results are published (Frické, 2015, pp. 659–60). Health tracking apps, for example, have become very popular. These apps take a very individualistic perspective by tracking the user’s behaviour, and they nevertheless produce correct or “positive” results. But the contextual circumstances, such as, broadly speaking, the political and economic system, are thereby not taken into account. This suggests that mental health starts and ends with oneself, which calls the findings of these apps into question.

In contrast, we can also fall prey to seeing correlations between things where there are none, which is called apophenia (Frické, 2015, p. 660). For example, increased ice cream sales might coincide with an increased number of burglaries. But that does not mean that these two events are causally connected. This example is straightforward and easy to understand. But in the case of Big Data, thousands of features are examined, each of them having a different degree of relevance.
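This multiplication of candidate correlations can be simulated in a few lines (all data below is synthetic noise): among a thousand purely random “features”, some will correlate strongly with a random target by chance alone.

```python
# Sketch of apophenia as a multiple-comparisons problem: every
# "feature" here is pure noise, yet the best-correlating one looks
# impressively related to the (equally random) target.
import random

random.seed(0)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

target = [random.random() for _ in range(20)]           # 20 "observations"
best = max(
    abs(pearson([random.random() for _ in range(20)], target))
    for _ in range(1000)                                # 1000 unrelated features
)
print(round(best, 2))  # far larger than any single random pairing would suggest
```

With twenty observations, a single pair of random variables typically shows only a weak correlation, but picking the best out of a thousand candidates all but guarantees a strong-looking one, which is exactly why screening many features demands caution.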

In combination with the proclaimed “end of theory” or theories “born from the data”, apophenia and over-simplification seem dangerous. At first sight, the data may reveal new insights, but there is no theory that confirms these assumptions. This missing link between the claimed correlations and a theoretical underpinning is what Sullivan (2019) calls “link uncertainty”. She gives the example of how scientists correlate facial features and sexual orientation. Similarly, a recent paper claims to have found correlations between facial features and political orientation (Kosinski, 2021). I do not want to judge whether these hypotheses are right or wrong, but they sound rather dubious. Leinweber (2007, p. 10) also warns us of the danger of apophenia: “When doing this kind of analysis it’s important to be very careful what you ask for, because you’ll get it”. In this case, the use of Big Data perhaps even reminds us of Feyerabend’s anarchic perspective that “anything goes”. To counteract this, we need more theories and less data (Frické, 2015, p. 660).

Causality

Although Pietsch argues for the causal nature of Big Data modelling, he also admits that “Data-intensive models lack substantial explanatory power” (Pietsch, 2016, p. 139). It is thus not clear how their causal nature is ensured. Causal relationships are important insofar as they allow us to intervene and alter certain effects. In medicine, for example, the correct treatment can decide between life and death (Bjerring & Busch, 2020). Besides, the “knowledge” we produce, which perhaps rests on false correlations, is often used for further knowledge production, including the development of technology, which in turn is used to collect data, and so on. This could lead to a vicious cycle fuelled by “positive heuristics”, as data alone can refute anomalies (Lakatos, 1968). Pietsch (2016, p. 169) himself also mentions the possible abuse, especially in the social sciences.

So perhaps in that sense we do not need to understand the “why” in terms of physical mechanisms, but rather why this knowledge would be useful: why would we want to identify people’s political orientation based on facial features? As such, the “why” is not about the “truth”, but rather about certain values. Also with regard to the advancements in biotechnology, which allow us to create more and more artificial phenomena, the notion of causality is perhaps shifting: namely, from an explanatory sense, “Why did something happen?”, to a more productive sense, “Why do we want to make something happen?”. The produced effect that we “know” of, e.g. micro targeting, is used to cause another effect. But even to produce a certain effect, we need some minimal understanding of the phenomenon, i.e. the effect, that causes another phenomenon. We cannot stop burglars by stopping the sale of ice cream. So the challenge with Big Data is to find the meaningful correlations, which seems to become an increasingly difficult task as the inductive risk increases.

Conclusion

Data-intensive science rests heavily on the collection, preparation and analysis of data. Within all of these stages, the choices that scientists make play a crucial role. This is because data itself has no meaning. Rather, it has to be assembled and interpreted by the scientist, who faces inductive risk. As we cannot process all data, due to computational limitations, we have to simplify and reduce the amount of information, thereby risking the loss of relevant information. Therefore, the challenge in Big Data modelling is finding the right balance between simplicity and complexity in adequately representing the phenomenon. This applies especially to sciences that deal with complex and evolving phenomena, such as the social sciences. Here, contextual information is relevant and cannot simply be transferred to other contexts by inducing universal assumptions. Thus, the causal relationships claimed to have been found must be taken with a grain of salt. In the end, it seems as if Big Data is shifting the epistemological concept of causality. Perhaps we do not need to know exactly why something happened; instead, we should ask why we want to produce certain knowledge.


  1. The latter approach, as is the case with most deep learning algorithms, brings with it various problems of transparency and explainability that will not be discussed here. ↩︎

  2. See Pierson et al. (2021) for an AI that takes patients’ self-assessment into account. ↩︎

  3. Boumans and Leonelli (2020, pp. 80–90) describe different dimensions of metadata, such as general information about the project or why certain data is considered. ↩︎

  4. These practices clearly raise many privacy concerns. ↩︎

  5. The phenomenon is already being represented to some extent through the collection of data. ↩︎

  6. Better in the sense of finding patterns. Whether these patterns are meaningful or not has to be judged by the scientists. ↩︎

  7. See https://perell.com/essay/expression-is-compression/ ↩︎

References
