The Co-Constitution of Voice Assistants and Language

Our material environment is increasingly constituted by algorithms in the form of smart technologies, which rely on various sensors and techniques such as facial recognition. We are also making our homes “smarter” by connecting devices to the internet, for example smart speakers like the Amazon Echo. Since the environment and the objects around us affect us as human beings, we should ask how these increasingly sophisticated and autonomous systems do so.

Aydin et al. (2019) provide a theoretical framework that describes our “immersive environments”, which actively influence our behaviour. However, technology does not only make the human; we humans make the technology in the first place. Therefore, I want to examine how conversational agents such as Siri or Alexa have been designed and what the consequences of these design decisions are. I argue that these agents unify human characteristics into a single persona, which reduces human diversity and which we should counteract.

Technological Environmentality

As Grosz (1998) shows in her paper, the body and the city have a dialectical relationship and thus co-constitute each other. That is, she rejects the notion that the body is distinct from the city. Instead, “the body [...] does not have an existence separate from the city” (Grosz, 1998, p. 47). In doing so, she emphasises the importance of the environment in shaping the human being. Moreover, she predicted with striking accuracy the impact that technological development would have on cities, and thus on bodies. She mentions how technology will make distance obsolete and how speed will change the way we communicate, which in turn will change the way we live (Grosz, 1998, p. 49). Today we experience these effects through the ubiquity of computer technology. Unlike the more mundane technologies that may have worried Grosz, such as stairs that can limit accessibility or windows that can compromise privacy, today’s “high technologies” like self-learning algorithms take a much more active role in shaping our lives.

To theorize the effects of our material environment in a way that takes today’s advancements into account, Aydin et al. (2019) provide a framework called “Technological Environmentality”. In doing so, the authors build on Postphenomenology (Ihde, 1990) and Material Engagement Theory (MET) (Malafouris, 2013). The first describes how technology mediates our relation to the world, and the second helps us understand how materiality plays a role in being human. Moreover, MET overcomes the distinction between organism and environment, and instead describes how “organism and environment form a necessary unity” (Malafouris, 2020, p. 4). This is similar to Grosz’s view that the body not only makes the city, but the city also makes the body. As Aydin et al. (2019, p. 332) claim, however, neither concept taken on its own is sufficient to analyse emerging technology such as today’s autonomous algorithmic systems. This is because, in short, postphenomenology treats technology as rather passive and unintentional, and MET treats it as too low-tech, which means that the interactivity between artifact and human is not sufficiently addressed (Aydin et al., 2019, pp. 326, 331). Taken together, however, the concepts of agency, primarily from MET, and intentionality, primarily from Postphenomenology, can provide helpful insights. As a result, Technological Environmentality helps us understand that we live in a technological world with which we interact, and that at the same time technology mediates our relation to the world and thus how we perceive ourselves (Aydin et al., 2019, p. 336). In this sense, technology has a double dimension.

However, Aydin et al. perhaps attribute too much agency and intentionality to smart technology, even if this is not meant as technological determinism:

It is an illusion to think that we can control the ecology of invisible and active technologies, just as it is naïve to believe that we can control our natural ecology. (Aydin et al., 2019, p. 332)

Certainly, algorithms actively shape the human being, but we should keep in mind that people such as scientists, designers and engineers create these systems in the first place. Thus our decisions actively shape the technology. Of course, our way of thinking does not exist separately from our environment, which means that the way we think about conversational agents is already influenced by the technology. In the following, in contrast to Grosz and although I do not advocate mind-body dualism, I would like to focus on the more mental and cognitive effects of our technological environment by taking a look at voice assistance systems.

Voice assistants

Recent developments in natural language processing (NLP) and text-to-speech engines enable a new kind of interaction with our technical devices in that they allow users to talk to, or have a conversation with, their device. In this latter respect, the technology differs, for example, from a navigation system, which also uses language to convey information but whose instructions only go in one direction, namely from the system to the user. So-called voice assistants or conversational agents have been on our smartphones for a few years now. Apple’s Siri, for example, was introduced in 2011. But other devices, such as the Amazon Echo smart speaker released in 2014, which uses the Amazon Alexa voice service, are expanding the use cases and thus the prevalence of these voice assistants. Interestingly, at least for now, these devices are intended for use in a more private environment, namely at home (Turk, 2016, p. 16). In that regard, the matter becomes much more sensitive and personal compared to Grosz’s discussion of the city. Nevertheless, voice assistants nicely illustrate how software and algorithms are increasingly embedded in our material environment while remaining “invisible” in the background (Aydin et al., 2019, p. 325).

Anthropomorphism and Personification

Although machines were not yet so sophisticated at the time of Grosz’s writing, she already considered

the “cross-breeding” of the body and machine – that is, whether the machine will take on the characteristics attributed to the human body or whether the body will take on the characteristics of the machine remains unclear (Grosz, 1998, p. 50).

The final outcome of this “cross-breeding” may still remain unclear, but the direction in which it is currently heading seems clearer today. Designers and engineers disguise the nature of voice assistance systems by building human characteristics into them, a practice also called anthropomorphism (Purington et al., 2017, p. 2854). Google Duplex, for example, adds so-called speech disfluencies like “umm” or “aah”, which would not be necessary from a technical point of view. In addition, in order to mimic human behaviour, Duplex does not always respond immediately (Leviathan & Matias, 2018). In the end, the goal is for the user to be able to talk to the voice assistant more or less naturally, as if it were a “friend” (Turk, 2016, p. 16). The developers of Google Duplex also justify their design decisions by saying that they want to “make the conversation experience comfortable” (Leviathan & Matias, 2018).

But maybe we are even getting too comfortable with our devices. The more we, the users, personify the device, for example by giving it a name, the more we tend to interact with it in a sociable, respectful and even emotional way (Purington et al., 2017; Turk, 2016). This may not be bad per se. But since we adapt our language to our surroundings and our peer group (Cekaite & Björk-Willén, 2013), the question arises to what extent machines influence our language (and our behaviour).

Studies indeed suggest that machines can persuade and influence people’s choice of words (Brandstetter & Bartneck, 2017, p. 284). This affects people who have fewer other communication partners more than people who are highly networked (Brandstetter & Bartneck, 2017, p. 286). As such, conversational agents can be considered social actors that elicit reciprocal behaviour (Schneider, 2020, p. 381). But towards whom do we show this reciprocal behaviour? What kind of person do Alexa and Siri represent?

Who is Alexa or Siri?

Submissive

The first feature one notices when interacting with one of these conversational agents is that most of them have a predefined female voice (Habler et al., 2019). By giving the device a gender, we associate it with a certain role that it is supposed to fulfil. As we use these devices to delegate rather mundane and administrative tasks, such as setting a reminder, playing music or looking up information, we might wonder to what extent the choice of a female voice reinforces gender stereotypes (Strengers, 2018, p. 75).

Moreover, we might ask how we carry this task-oriented style of conversation over into interactions with other people. For example, one study found that children tend to use an aggressive tone when talking to Alexa, which they then adopt in conversations with friends (Garg & Sengupta, 2020, p. 17). This is important to consider because conversational agents use rather submissive language that gives the impression of a caretaker, much like a “typical” wife (Woods, 2018, p. 339). It is difficult to say what caused what, but users prefer voice assistants that use “low-status”, i.e. simple, language, which implies submissiveness (Habler et al., 2019, p. 470).

Women are also reported to communicate more passively than men, using, for example, more personal pronouns or verbs (Hannon, 2016, pp. 34–35). This bears the danger of “implicitly connect[ing] female AI personalities with low-status positioning in the human-machine relationship” (Hannon, 2016, p. 35). Interestingly, however, the sound of a female voice itself has no effect on the user’s satisfaction or effectiveness when interacting with Siri or Alexa (Habler et al., 2019, p. 472). This would suggest that what is required is not a genderless voice, which already exists, but a genderless speech pattern. It is therefore important to scrutinize how these systems are designed in order to understand which differences they reinforce.

White

Crucial to the design of any machine learning system is the training data. The training data determines what the voice assistant can understand and thus answer. Research suggests that the speech recognition systems of Apple, Amazon, Google, IBM and Microsoft all have difficulties understanding black people, especially black men. For this social group the average word error rate is 41%, which means that roughly 41 out of every 100 words are transcribed incorrectly by the system, whereas the error rate for white men is 21%. For black women the error rate is 30% and for white women 17% (Koenecke et al., 2020, p. 7685). Apparently, men tend to use more informal language and less clear pronunciation, which is why conversational agents are better at understanding women (Koenecke et al., 2020, p. 7685). But overall these systems perform better on white people. In part, but not primarily, the reason for this gap is that black people use vocabulary that differs from what is known to the system. The main reason, rather, is that the acoustic models have not been trained on sufficiently diverse audio data. Conversely, this means that voice assistants have been trained to understand a particular pronunciation and accent.

Our findings indicate that the racial disparities we see arise primarily from a performance gap in the acoustic models, suggesting that the systems are confused by the phonological, phonetic, or prosodic characteristics of African American Vernacular English rather than the grammatical or lexical characteristics. (Koenecke et al., 2020, p. 7687)
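To make the error rates cited above more concrete: a word error rate is usually defined as the number of word substitutions, deletions and insertions needed to turn the system’s transcript into the reference transcript, divided by the number of words in the reference. The following is only a minimal sketch of that usual definition, not the evaluation pipeline used by Koenecke et al.; the function name and the example sentences are hypothetical and chosen purely for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two transcripts,
    divided by the number of words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: what was said versus what the system transcribed.
spoken = "please set a reminder for ten tomorrow morning"
transcribed = "please set reminder for then tomorrow morning"
print(word_error_rate(spoken, transcribed))  # one deletion + one substitution over 8 words
```

Read this way, an error rate of 41% does not mean that 41% of utterances fail entirely; it means that, averaged over the recordings, about four in ten words have to be corrected before the transcript matches what was actually said.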

Educated

In addition to the acoustic model, it also matters which languages the voice assistant supports. African languages, for example, make up almost one third of all living languages, 2144 out of 7111, and yet only a negligible number of them are represented digitally (Orife et al., 2020, p. 1). This deficiency manifests itself in the fact that no conversational agent speaks an African language. Currently, Amazon Alexa supports eight languages, namely English, French, German, Hindi, Italian, Japanese, Portuguese (Brazilian) and Spanish (Amazon, 2021). Most of these languages, however, were added gradually, such as Spanish or Hindi at the end of 2019. Consequently, the number of people who can use voice assistants is limited by language barriers. This may be far-fetched, but it also means that the group of people who can use these systems is a more or less homogeneous, privileged group. A person living in the Netherlands could theoretically afford an Amazon Echo, but could not use it if they do not speak any of the supported languages. In this sense, not only financial but also educational criteria become a barrier. This exacerbates existing social inequalities and perpetuates a certain educational ideal (Phan, 2019, p. 22).

Unification of diversity

Based on the above, we can say that Siri or Alexa is a well-educated, white and submissive representation of a (female) person. This aligns with how intelligent systems are generally portrayed (Cave & Dihal, 2020). However, this limitation must be overcome considering the different groups of people who use (or want to use) this technology.

I do not want to claim that voice assistants contribute to the loss of languages. But the claim might be more plausible for accents, pronunciation and regional languages. To what extent this is the case would require further research. Nevertheless, language is considered part of one’s identity and culture. So it is not simply the language (or accent) that is lost, but in fact much more (Wallace, 2009). The importance of language was also emphasised by Ludwig Wittgenstein when he said “the limits of my language mean the limits of my world”. In that sense, language also influences the way we think and make sense of the world.

Different languages, for example, have varying numbers of color terms, which affects color cognition (Kay & Regier, 2006). As a result, if we do not have a word to describe something, we are less likely to think about it, although that does not mean that it is impossible per se. Therefore, the more these systems are unified and characterized as one person belonging to a particular social group, the more this reduces human diversity.

Digital conversation

As voice assistants are a fairly new technology, further research, including long-term studies, is necessary to understand how machines and humans co-constitute the development of speech and language and thus our behaviour. For comparison, only now are we beginning to understand the impact of digital communication via social media. As Turkle (2015, p. 13) puts it:

Without conversation, studies show that we are less empathetic, less connected, less creative and fulfilled. We are diminished, in retreat. But to generations that grew up using their phones to text and message, these studies may be describing losses they don’t feel. They didn’t grow up with a lot of face-to-face talk.

Whether exactly the same can be said about conversational agents is not yet certain. Although the current direction of conversational agents that I have outlined sounds very alarming, it should not be understood to mean that we should reject or ban the technology. Instead, we should ask ourselves how we want to build the technological environment in which we live. This also means that we need to scrutinize the values and characteristics that are built into the system and anticipate their consequences, so that we can correct and design voice assistants like Siri and Alexa in a more meaningful way.

But at the same time we should ask ourselves in which situations we want to talk to a machine in the first place. Especially with regard to young children, the possible consequences have to be taken into account, as they often do not understand that they are not talking to another human being but to a non-human agent, which could “appear in real life” (Garg & Sengupta, 2020, p. 21). Furthermore, we should remember that voice assistants do not understand the meaning of what they or we say. They only mimic the ability to hold a conversation, which means that a machine cannot be empathetic and therefore cannot relate to our subjective experience. In the end, it seems as if “we want more from technology and less from each other” (Turkle, 2015, p. 346).

Responsibility of users

Voice assistants clearly constitute part of our material environment and, as autonomous agents, take a much more active role in shaping us. But since these systems are able to learn as we interact with them, we might ask what responsibility each user bears. As the aforementioned research found, women are less often misunderstood by the systems. From this we can infer, or rather speculate, that women conversely provide more useful training data, which would lead the system to pick up on women’s speech patterns and become “feminine”. Perhaps this is too far-fetched, but nevertheless we should consider how each of us interacts with the system. We should certainly not expect changes overnight, as this is a long process. And in the end, the engineers bear the main responsibility for, and have the greatest influence on, the system’s behaviour. But we can and should try to make the workings of these systems more visible (Nicenboim et al., 2020, p. 398), instead of simply letting them become “a constant, invisible, and always-on background” (Aydin et al., 2019, p. 325).

Conclusion

Conversational agents offer a new kind of interaction with our devices, namely bidirectional interaction through speech. In that regard, these systems can be considered social agents that can influence the way we talk and thus think. This is similar to how Grosz identifies the co-constitutive nature of bodies and cities. But today’s technologies play a much more active role in shaping us, as they are “being directed at human beings” (Aydin et al., 2019, p. 334). This directedness is becoming ever more dominant and ubiquitous as more and more algorithms are embedded in our material environment, creating “interactive environments” (Aydin et al., 2019, p. 336). This is precisely why we need to evaluate the decisions that led to the design and operation of the system. Certainly, the way we think about conversational agents is also shaped by them.

In the case of voice assistants, we can see that the selection of training data and other design decisions resulted in a white, well-educated and submissive, i.e. female, representation of a person. Given the increasingly widespread use of these systems, this is alarming for several reasons. On the one hand, people are excluded by language barriers, which reinforces social inequalities and educational ideals. On the other hand, we might lose our accents or perhaps even languages, i.e. part of our identity, through reciprocal behaviour. Furthermore, the current mode of operation reinforces gender stereotypes by having Siri or Alexa take over household tasks. Overall, the personification of some universal ideals diminishes human diversity. Therefore, we should also ask ourselves in which situations we want to talk to an emotionless machine at all instead of a human being.

References

Amazon. (2021). Develop skills in multiple languages. Retrieved January 25, 2021, from https://developer.amazon.com/en-US/docs/alexa/custom-skills/develop-skills-in-multiple-languages.html
Aydin, C., Woge, M. G. & Verbeek, P.-P. (2019). Technological environmentality: Conceptualizing technology as a mediating milieu. Philosophy & Technology, 32, 321–338.
Brandstetter, J. & Bartneck, C. (2017). Robots will dominate the use of our language. Adaptive Behavior, 25(6), 275–288. https://doi.org/10.1177/1059712317731606
Cave, S. & Dihal, K. (2020). The whiteness of AI. Philosophy & Technology, 33, 685–703. https://doi.org/10.1007/s13347-020-00415-6
Cekaite, A. & Björk-Willén, P. (2013). Peer group interactions in multilingual educational settings: Co-constructing social order and norms for language use. International Journal of Bilingualism, 17(2), 174–188. https://doi.org/10.1177/1367006912441417
Cooperrider, K. & Núñez, R. (2016). How we make sense of time. Scientific American Mind, 27(6), 38–43. https://doi.org/10.1038/scientificamericanmind1116-38
Garg, R. & Sengupta, S. (2020). He is just like me: A study of the long-term use of smart speakers by parents and children. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(1). https://doi.org/10.1145/3381002
Grosz, E. (1998). Bodies-cities. In H. J. Nast & S. Pile (Eds.), Places through the body (pp. 42–51). Routledge.
Habler, F., Schwind, V. & Henze, N. (2019). Effects of smart virtual assistants’ gender and language. Proceedings of Mensch Und Computer 2019, 469–473. https://doi.org/10.1145/3340764.3344441
Hannon, C. (2016). Gender and status in voice user interfaces. Interactions, 23(3).
Ihde, D. (1990). Technology and the lifeworld: From garden to earth. Indiana University Press.
Johnson, M. (2008). What makes a body? Journal of Speculative Philosophy, 22(3), 159–169.
Kay, P. & Regier, T. (2006). Language, thought and color: Recent developments. Trends in Cognitive Sciences, 10(2), 51–54. https://doi.org/10.1016/j.tics.2005.12.007
Koenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M., Mengesha, Z., Toups, C., Rickford, J. R., Jurafsky, D. & Goel, S. (2020). Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences, 117(14), 7684–7689. https://doi.org/10.1073/pnas.1915768117
Leviathan, Y. & Matias, Y. (2018, May 8). Google Duplex: An AI system for accomplishing real-world tasks over the phone. Retrieved January 8, 2021, from https://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-conversation.html
Malafouris, L. (2013). How things shape the mind: A theory of material engagement. MIT Press.
Malafouris, L. (2020). How does thinking relate to tool making? Adaptive Behavior, 1–15. https://doi.org/10.1177/1059712320950539
Nicenboim, I., Giaccardi, E., Søndergaard, M. L. J., Reddy, A. V., Strengers, Y., Pierce, J. & Redström, J. (2020). More-than-human design and AI: In conversation with agents. Companion Publication of the 2020 ACM Designing Interactive Systems Conference, 397–400. https://doi.org/10.1145/3393914.3395912
Orife, I., Kreutzer, J., Sibanda, B., Whitenack, D., Siminyu, K., Martinus, L., Ali, J. T., Abbott, J., Marivate, V., Kabongo, S., Meressa, M., Murhabazi, E., Ahia, O., van Biljon, E., Ramkilowan, A., Akinfaderin, A., Öktem, A., Akin, W., Kioko, G., ... Bashir, A. (2020). Masakhane – machine translation for Africa. https://arxiv.org/abs/2003.11529
Phan, T. N. (2019). Amazon Echo and the aesthetics of whiteness. Catalyst: Feminism, Theory, Technoscience, 5(1), 1–39.
Purington, A., Taft, J. G., Sannon, S., Bazarova, N. N. & Taylor, S. H. (2017). “Alexa is my new BFF”: Social roles, user satisfaction, and personification of the Amazon Echo. Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems, 2853–2859. https://doi.org/10.1145/3027063.3053246
Schneider, F. (2020). How users reciprocate to Alexa. In C. Stephanidis, M. Antona & S. Ntoa (Eds.), HCI International 2020 – Late Breaking Posters (pp. 376–383). Springer. https://doi.org/10.1007/978-3-030-60700-5_48
Strengers, Y. (2018). Aesthetic pleasures and gendered tech-work in the 21st-century smart home. Media International Australia, 166(1), 70–80. https://doi.org/10.1177/1329878x17737661
Turk, V. (2016). Home invasion. NewScientist, 16–17.
Turkle, S. (2015). Reclaiming conversation: The power of talk in the digital age. Penguin Random House.
Wallace, L. (2009, November 10). What’s lost when a language dies. Retrieved January 25, 2021, from https://www.theatlantic.com/national/archive/2009/11/whats-lost-when-a-language-dies/29886/
Woods, H. S. (2018). Asking more of Siri and Alexa: Feminine persona in service of surveillance capitalism. Critical Studies in Media Communication, 35(4), 334–349. https://doi.org/10.1080/15295036.2018.1488082