Human-Computer Interaction

The Influence of Embodied Artificial Intelligence Presence on Phishing E-mail Recognition Performance in a Work Environment


This project is already assigned.

Motivation and Potential

Recent technological advances in artificial intelligence (AI) and augmented reality (AR) enable the development of embodied AI agents that can be integrated into real environments such as the workplace and promise intuitive, human-like interaction. The current study aims to find out whether agents built on these technologies may also introduce a side effect known from human social contexts: social facilitation. Social facilitation describes improved performance on simple or already learned tasks when an audience is present, whereas for difficult or not yet learned tasks the opposite effect, social inhibition, applies [1].

One possible application of AI-based systems is the workplace, where they can provide support in administrative, service-related and various other work environments. Research conducted in the work environment can help to better adjust these systems to the needs of users in this setting and to mitigate risks in later stages of development [2]. A task prevalent in many workplaces is checking the email inbox: German employees receive an average of 26 emails per day [3]. In 2023, cyber-attacks were estimated to cause 148 billion € in damages, with 31% of those attacks being related to phishing. Recognizing malicious emails is therefore relevant for many businesses, and email use is widespread. A further argument for this task is the realistic prospect of actually cooperating with an AI agent during the everyday processing of received emails.

Combining this motivation for embodied AI agents with effects known from human social contexts, and using email recognition as a representative work-related task, leads to the following research question: How does the presence of an embodied AI agent displayed by an AR headset influence human detection of phishing emails in the work context?

Agents

Socially Interactive Agents (SIAs) are a means to provide interfaces similar to human-human interaction that can be “more natural to human interactants by equipping the interface with a body that interacts multi-modally […]” [4]. Besides physical human-like agents, such as robots, SIAs can also be virtual by utilizing technologies such as AR, which enhances the real world with virtual objects [5]. In order for SIAs to offer such human-like ways of interaction, disciplines like natural language understanding (NLU) need to be integrated [6]. This is one example of how SIAs can benefit from AI, as recent developments such as neural networks based on the Transformer architecture [7], like OpenAI’s ChatGPT 4 [8], offer NLU that enables interactions more akin to human conversation.

Humans may apply stereotypes from human-human interaction to human-computer interaction, such as gender stereotypes when a computer uses a male or female voice, and treat computers as “social actors” [9]. Even without human-like qualities or a humanoid appearance, users may exhibit social behaviours such as politeness [10] when interacting with a system. Social facilitation and inhibition effects have also been found with embodied voice assistants. In Liu and Pu [11], participants’ response times during easy math tasks increased in the presence of a smart speaker, while they decreased for harder tasks. This influence on response times was similar to a condition in which a human assistant was present.

Social Facilitation and Agents

Existing studies have found social facilitation or social inhibition effects with virtual agents. Zanbaka et al. [12] had participants solve mathematical tasks in the presence of a virtual agent. Depending on the condition, the agent was either displayed using a head-mounted device (HMD) or projected onto a screen. It is unclear if and how the agent was framed. The agent’s behavior, however, was deliberately human-like, including audible noises such as yawning and coughing that participants commented on after the experiment. At least one participant interpreted the coughing as a sign of impatience, which might point towards an unwanted source of perceived evaluation.

A study by Park and Catrambone [13] used anagram, maze and math-related tasks with a virtual agent displayed on a computer monitor. The agent was framed as being AI-operated and “present to learn more about the tasks”. The study found significant differences between the alone and virtual-presence conditions for all tasks, suggesting social facilitation and inhibition effects by virtual agents framed as AI.

Social Facilitation in AR

More recently, Miller et al. [14] measured task performance “in the presence of embodied agents” using a Microsoft HoloLens and an anagram task. The study used a 2x2 within-subjects design requiring participants to solve either easy or hard anagrams in a social or alone condition. Miller et al. [14] found large effect sizes [15] of d = 0.96 for facilitation of task performance on easy anagrams in the social condition and d = 0.83 for inhibition of task performance on hard anagrams. The agent was introduced as a research aide and had a human-like appearance, including an idling animation and jaw movement synchronized to recorded audio.

In summary, related studies found effects of virtual agents facilitating or inhibiting task performance using different display media and framings of the agent. It is unclear, however, whether these findings transfer to more recent technologies and whether they can be replicated in a work environment. To close this gap, this study addresses these aspects in three ways. First, the agent will be framed as being AI-operated, preventing potential assumptions of human operation; accordingly, a questionnaire on AI literacy will also be included. Second, instead of optical see-through (OST) AR, video see-through (VST) AR will be used. Finally, both sample and task will differ from the typical academic setting, using a potentially more diverse sample of office workers with different educational backgrounds and from various age groups, as well as a work-related task as described above.

Based on the findings of Miller et al. [14], which suggest that the presence of a virtual agent can facilitate or inhibit task performance, the following hypotheses are proposed:

The presence of an embodied AI agent will lead to …

… by the participants when compared to an alone condition.

In Miller et al. [14], the agent was matched to the participant’s sex. A study by Liu et al. [16], which used real humans rather than virtual agents, also investigated effects of mixed-sex social conditions. In both experiments conducted by the authors, social facilitation effects were further increased when at least one member of the audience was of the opposite sex, resulting in a further reduction of the time needed for visual search and math tasks. Based on these findings, the following hypothesis is proposed:

The presence of an embodied AI agent with a stereotypical appearance attributed to the opposite sex will lead to …

… by the participants when compared to task performance in the presence of an agent of stereotypical appearance attributed to the same sex as the participant.

Methodology

Pre-Study

As social facilitation and inhibition effects depend on task difficulty, emails must be categorized according to how difficult participants find it to distinguish phishing from non-malicious emails. To achieve this, a pre-study will be conducted.

Material

Phishing and non-malicious emails will be gathered from a variety of sources. As the main study’s sample will consist of German employees, emails appropriate in language, context and culture are needed. They will be sourced from online resources, the researcher’s personal inbox and the cooperating organization’s IT department’s collection of malicious emails. Supplementing the set with emails generated by a large language model (LLM) will also be considered. The number of phishing and non-malicious emails will be balanced, prioritizing a task design suited to the area of research over full representativeness of the work context. The numbers of easy and hard to recognize emails are also intended to be balanced, based on an initial assessment by the researcher. Emails will be displayed in the survey using web technologies such as hypertext markup language (HTML). After the classification task, participants will be asked about their demographic data, their previous experience with computer use and web security, and whether they use personal or organizational email accounts. They will also complete the Meta AI Literacy Scale (MAILS) [17].

Sample

The pre-study will be conducted online using SoSci Survey [18] and will be open to eligible German-speaking participants from the working population. Zanbaka et al. [12] base their easy and difficult math tasks on a study by Foos and Goolkasian [19], which found significant differences between those conditions in reaction time and error rate with a sample size of 26 participants. A similar sample size of 30 participants will be targeted. Participants will be recruited using Prolific [20] and will receive monetary compensation at a rate of £10 per hour, prorated to the duration of the survey.

Procedure

Participants will receive an introduction on how to recognize phishing emails based on public videos from the SECUSO research team [21]. They will then be shown six example non-malicious and phishing emails. After this, participants will perform a binary classification of 60 emails, during which both the time required and the classification accuracy will be measured.
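As an illustration of how the pre-study data could be turned into a difficulty categorization, the following minimal sketch assumes a hypothetical long-format results export (the column names participant, email_id, correct and response_time are chosen here for illustration only). It computes per-email accuracy and median response time and splits emails at the median accuracy into easy and hard items; the actual criteria will be decided after data collection.

```python
import pandas as pd

# Hypothetical long-format pre-study export: one row per participant x email.
# Assumed columns: participant, email_id, correct (0/1), response_time (seconds).
results = pd.read_csv("prestudy_results.csv")

# Aggregate per email: proportion of correct classifications and median time.
per_email = results.groupby("email_id").agg(
    accuracy=("correct", "mean"),
    median_rt=("response_time", "median"),
)

# Simple illustrative split: emails at or above the median accuracy count as
# "easy", the rest as "hard"; the final criteria may also weigh response time.
threshold = per_email["accuracy"].median()
per_email["difficulty"] = ["easy" if a >= threshold else "hard"
                           for a in per_email["accuracy"]]

print(per_email.sort_values("accuracy"))
```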

Main study

Design

The study will use a 2x2x2 (agent x difficulty x audience) mixed design. The agent will be varied between subjects to possess either a stereotypically female or male appearance. The remaining factors are manipulated within subjects, so that participants classify either easy or hard to recognize emails (difficulty) while alone or in the presence of a virtual agent as audience (audience).

Sample

Participants will be recruited from within a single organization located in Germany. The staff consists of female and male workers ranging from 18 to approximately 65 years of age, with professions in fields such as finance, marketing, logistics, software development, graphic design and customer management. If the targeted sample size is not reached within the organization, the sample will be augmented with students recruited through the university’s regular channels. Participants will be screened for epilepsy and are required to be fluent in German. Participants from within the organization can claim the time spent participating in the study as work time, resulting in their usual monetary compensation. Furthermore, six vouchers of 50€ each will be raffled among participants. Students will be able to earn study credits for participating.

A sample size of 50 participants will be targeted for the main study. This estimate is based on Zanbaka et al. [12], who reported a medium effect size of η² = 0.11. Estimation, conversion and adaptation to the design of the planned study were done using G*Power.
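The following minimal sketch illustrates the kind of calculation behind this estimate: it converts the reported η² into Cohen’s f (the effect size metric G*Power expects) and runs a simple between-groups power analysis with statsmodels. It is only an approximation of the actual a priori calculation, which additionally accounts for the repeated-measures factors of the mixed design and therefore arrives at a smaller required sample.

```python
from math import sqrt

from statsmodels.stats.power import FTestAnovaPower

# Effect size reported by Zanbaka et al.: eta^2 = 0.11.
eta_squared = 0.11

# Convert eta^2 to Cohen's f: f = sqrt(eta^2 / (1 - eta^2)).
f = sqrt(eta_squared / (1 - eta_squared))  # roughly 0.35

# Rough between-groups approximation: two agent groups, alpha = .05, power = .80.
# The actual mixed-design calculation (within-subject difficulty and audience
# factors) requires fewer participants than this purely between-groups estimate.
total_n = FTestAnovaPower().solve_power(effect_size=f, alpha=0.05,
                                        power=0.80, k_groups=2)
print(f"Cohen's f = {f:.2f}, approximate total N = {total_n:.0f}")
```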

Setting

The study will be conducted in the organization’s headquarters. The chosen room includes typical office equipment such as chairs, two tables and cabinets; it currently serves as a small conference room. The videoconferencing system will be removed for the days of data acquisition. The experimenter will leave the room once the first block has started. Participants will be told that the experimenter will only enter the room if they press a button on a handheld transceiver to cancel the experiment.

System

Phishing emails will be displayed on a virtual display set to a fixed position in the real room, where a typical physical display would be placed. Visually, this will be similar to Apple Vision Pro’s workspaces [22] or workrooms on Meta’s Quest devices [23]. The agent will sit beside the participant, outside of their personal space. Participants are not allowed to talk to the agent, nor will the agent respond to them. The agent will introduce itself as a new, AI-operated colleague that is still in development. Its appearance will be based on the GenErika and GenErik assets currently being evaluated in a different study by Krop et al. The sex of the agent will be randomly assigned, enabling comparison of same-sex and mixed-sex conditions. In the style of Park and Catrambone [13], participants will be told that the AI agent is present to learn the task and is currently being trained to detect phishing emails. As such, the agent is not engaged in another task and no direct competition takes place, in line with two aspects mentioned in Guerin [24]. Because the agent has to pay attention to the task, no interaction or conversation is possible during the task.

To further underline that the agent is not there to evaluate the participant, an email will be classified as phishing by pressing the trigger on the left-hand controller, whereas for non-malicious emails the trigger on the right-hand controller is pressed. The agent cannot see which trigger is being pressed, and no visual feedback will be shown on the screen. Labels for the left- and right-hand controls will, however, be shown on the screen throughout the task to avoid confusion. Participants will also receive haptic feedback upon button presses.

Technology

The application will be developed in Unity 2022.3.20f1 or newer and displayed on a Quest 3 HMD. The software will run via Quest Link on a laptop equipped with an Nvidia GeForce RTX 3080 Ti Laptop GPU.

Procedure

Participants will be pre-screened regarding epilepsy and are required to have sufficient reading comprehension in German. They will furthermore be asked about existing visual impairments and how these are corrected. Participants will receive the same introductory material on how to recognize phishing emails [21] as provided in the pre-study. They will be asked to familiarize themselves with the material and to fill out the pre-questionnaires and screening questions off site beforehand. The Simulator Sickness Questionnaire (SSQ) [25] will be administered on site directly before the main procedure.

As part of the consent material, participants will be made aware that the collected data will not be available to the organization and that their task performance and performance scores will have no consequences for them. All data will be provided anonymously, and no direct comparisons between individuals’ scores will be made. Participants will, however, be told to try their best to correctly classify as many emails as possible in the given time frame. They will also be asked to provide their name so that it can appear in the emails.

The study will begin with practice runs consisting of three easy and three hard to classify emails. During these runs, the researcher will be present to answer any remaining questions. Before the first of the four blocks starts, the researcher will leave the room. Participants will be given a fixed set of 15 emails per block. All blocks, as well as the practice runs, will make use of AR technology. The four blocks result from the within-subject manipulation of difficulty (easy/hard) and audience (alone/social) and will be presented in random order. Between subjects, the agent will be randomly assigned to be either GenErika (stereotypically female appearance) or GenErik (stereotypically male appearance).
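To make the counterbalancing concrete, the sketch below (an illustrative script, not part of the study software; the function and variable names are hypothetical) draws a random block order from the difficulty x audience crossing for each participant and randomly assigns the between-subject agent.

```python
import random

# The four within-subject blocks result from crossing difficulty and audience.
BLOCKS = [("easy", "alone"), ("easy", "social"),
          ("hard", "alone"), ("hard", "social")]

def assign_conditions(participant_id: int, seed: int = 2024) -> dict:
    """Randomize the block order and assign the between-subject agent."""
    rng = random.Random(seed * 10_000 + participant_id)  # reproducible per participant
    order = BLOCKS.copy()
    rng.shuffle(order)
    agent = rng.choice(["GenErika", "GenErik"])
    return {"participant": participant_id, "agent": agent, "block_order": order}

if __name__ == "__main__":
    for pid in range(1, 5):
        print(assign_conditions(pid))
```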

Participants will not receive direct feedback during the study on whether they classified an email correctly. They can, however, opt in to receive a summary of their performance after all participants have finished the study. As in the pre-study, each email is either phishing or non-malicious, and participants have to perform a binary classification.

Measurements

Task performance will be calculated using the F1-score. Furthermore, the time taken to classify each email will be measured. After the experiment, participants will be asked about demographic data such as age and gender, their prior knowledge of phishing emails and computers, and their average usage time of computers and virtual reality. They will also be asked whether they have access to a personal email account provided by the organization. Participants will be given the MAILS [17] and the SSQ [25]. Perception of the agent will be measured using the Virtual Human Plausibility Scale [26] as well as the Uncanny Valley Index (UVI) [27]. The Networked Minds Measure of Social Presence [28] will be used to measure both social presence and co-presence. All material will be provided in German.
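As a minimal sketch of the performance measure, the snippet below computes the F1-score for one hypothetical block of 15 emails (the example labels are invented for illustration), treating “phishing” as the positive class.

```python
from sklearn.metrics import f1_score

# Hypothetical responses for one participant in one 15-email block:
# 1 = phishing, 0 = non-malicious.
true_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1]  # ground truth
predictions = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1]  # participant answers

# F1 is the harmonic mean of precision and recall for the positive class:
# F1 = 2 * precision * recall / (precision + recall)
score = f1_score(true_labels, predictions, pos_label=1)
print(f"Block F1-score: {score:.2f}")
```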

Tasks & Timelines

Literature


Contact Persons at the University of Würzburg

Philipp Krop (Primary Contact Person)
Human-Computer Interaction Group, University of Würzburg
philipp.krop@uni-wuerzburg.de

Prof. Carolin Wienrich
Psychologie Intelligenter Interaktiver Systeme, XR Hub Würzburg, Universität Würzburg
carolin.wienrich@uni-wuerzburg.de
