We are thrilled to share that WIS Delft will be well-represented at AAAI HCOMP 2021, the premier international conference on Human Computation and Crowdsourcing. Research contributions from members of WIS Delft resulted in the acceptance of 3 full papers, 2 works-in-progress, 2 demonstrations, and 1 blue sky idea.
Here’s a brief overview of the different contributions that will be presented at HCOMP 2021.
Recent research on data quality and the concomitant downstream effects on machine learning models and AI systems has demonstrated that cognitive biases can negatively affect the quality of crowdsourced data. Although this is now well-understood, cognitive biases often go unnoticed. Since the large-scale collection of human annotations across a plethora of tasks entails significant effort and cost, there is unquestionable value in making such data collections reliable and reusable. To facilitate the reuse of crowdsourced data collections, practitioners can benefit from understanding whether and which cognitive biases may be associated with the data. To this end, task requesters (i.e., those who design and deploy tasks to gather labels or annotations online from a distributed crowd) need to ensure that their task workflows and design choices do not trigger cognitive biases in those contributing their input.
Addressing this need in our work led by Tim, we propose the Cognitive Biases in Crowdsourcing Checklist (CBC Checklist) as a practical tool that requesters can use to improve their task designs and appropriately describe potential limitations of collected data. We envision this checklist to be a living document that can be extended by the community as and when new cognitive biases are discovered or understood to affect human input in crowdsourcing tasks. Collaborating on this work was inspiring for our team and we hope this can benefit the community! We invite you to read the paper to learn more about how the CBC Checklist can be used in practice. The paper also provides further analysis motivating the need for such a tool.
In this work led by Petros, we explored the realm of music content annotation on crowdsourcing platforms. Annotating complex music artefacts dictates the need for certain skills and expertise. Traditional methods of participant selection are not designed to capture these kinds of domain-specific skills and expertise. Despite the popularity of music annotation tasks and the need for such input at scale, we have a limited understanding of the distribution of musical properties among crowd workers – more so in the case of auditory perception skills. To address this knowledge gap, we conducted a user study (N = 100) on the Prolific crowdsourcing platform. We asked workers to indicate their musical sophistication through a questionnaire and assessed their music perception skills through an audio-based skill test.
Our goal here was to establish a better understanding of the extent to which crowd workers possess auditory perception skills, beyond their musical education level and self-reported abilities. Our study shows that untrained crowd workers can possess high perception skills for the music elements of melody, tuning, accent, and tempo; skills that can be useful in a wide range of annotation tasks in the music domain. Do read the paper to learn more about our work! A fun fact here is that Petros received formal piano training in a conservatory for about 12 years, where he also studied music theory and harmony for nearly 6 years. Petros took violin lessons for around 1.5 years, but hit the pause button on that to pursue a PhD!
In this work led by Tahir, who is adding the finishing touches to his PhD dissertation, we explored the challenge of dealing with latency in crowd-powered conversational systems (CPCS). Such systems are gaining traction due to their potential utility in a range of application fields where automated conversational interfaces are still inadequate. Long response times negatively affect CPCSs, limiting their potential application. In an attempt to reduce the latency of such systems, researchers have focused on developing algorithms for swiftly hiring workers and facilitating synchronous crowd coordination. Evaluation studies have typically focused on system reaction times and performance measurements, but have so far not examined the effects of extended wait times on users.
The goal of this study, grounded in time perception models, is to explore how effective different time fillers are at reducing the negative impacts of waiting in CPCSs. To this end, we conducted a rigorous simulation-based between-subjects study (N = 930) on the Prolific crowdsourcing platform to assess the influence of different filler types across three levels of delay (8, 16, and 32 s) for Information Retrieval (IR) and stress management tasks. Our results show that asking users to perform secondary tasks (e.g., microtasks or breathing exercises) while waiting for longer periods of time helped divert their attention away from timekeeping, increased their engagement, and resulted in shorter perceived waiting times. For shorter delays, conversational fillers generated more intense immersion and helped shorten the perception of time. Working with Tahir and the broader team from TU/e (Javed and Panos) certainly made time fly!
With the increasing use of anthropomorphic conversational agents across several domains, research on the effects of anthropomorphism in conversational agents is also on the rise. However, prior studies present conflicting results, and little is currently understood about how anthropomorphism can influence end-user perception of conversational agents. This work-in-progress paper, led by Emilija while she was carrying out her Bachelor’s thesis project with us at WIS, attempts to contribute towards filling this gap by analysing whether anthropomorphic visual cues used in conversational agents have an effect on the trust and satisfaction of users.
We carried out a between-subjects experiment to this end, where the use of emojis and profile images at four different levels of anthropomorphism was manipulated in a conversational agent built on Telegram. A total of 120 participants had a conversation with the agent and reported on their experience. Based on our findings, we conclude that individual visual cues, as well as combinations of them, did not have any significant effects on the trust and satisfaction of users, and we discuss future directions of research.
Research can’t get more “fun” than playing a game in the name of science! That’s precisely what we’ve been spending some of our time on in the recent past. In this work led by Agathe and Gaole, and fueled by Andy’s excellent Bachelor project, we designed and developed a game with a purpose (GWAP) that has the potential to help overcome some of the difficult challenges in the field of AI.
Limited contextual understanding and a lack of commonsense knowledge of various types and about diverse topics have proven to be the pitfalls of many real-world AI systems. Games with a Purpose (GWAPs) have been shown to be a promising strategy for efficiently collecting large amounts of data to train AI models. Yet, no GWAP has been proposed to collect specific types of knowledge, such as discriminative, tacit, or expert knowledge. Inspired by the popular game “Guess Who?”, we present FindItOut. In this GWAP, two players compete to find a target concept among several by asking each other questions in turns, using a set of relations and entering natural language inputs, with the aim of discriminating the target concept from the others. The data created by the players is then processed and can be appended to existing knowledge bases to be exploited by AI systems.
Quality control and assurance are among the most important challenges in crowdsourcing. Low-quality and sub-optimal responses from crowd workers have been found to often result from unclear or incomplete task descriptions, especially from novice or inexperienced task requesters. Creating clear task descriptions with adequate information, however, is a complex task for requesters in crowdsourcing marketplaces. To meet this challenge, we present iClarify, a tool that enables requesters to iteratively discover and revise eight common clarity flaws in their task description before deployment on the platform.
A requester can use iClarify to formulate a task description from scratch or to evaluate the clarity of prepared descriptions. The tool employs support vector regression models based on various feature types that were trained on 1332 annotated real-world task descriptions. Using these models, it scores the task description with respect to the eight flaws, and the requester can iteratively edit and evaluate the description until the scores shown by the tool reach a satisfactory level of clarity. We are currently conducting a usability study with both requesters and crowd workers to assess the extent to which the tool is effective in improving task clarity. This work was led by Zahra and Nikhil. Keep an eye out for more research around iClarify; it's coming soon!
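To make the scoring loop concrete, here is a minimal sketch of the general idea: one support vector regression model per clarity flaw, each mapping text features of a task description to a flaw-severity score. The feature set, the flaw names, and the tiny training set below are all illustrative assumptions for this sketch, not the actual iClarify features or models (which were trained on 1332 annotated real-world task descriptions).

```python
# Hypothetical sketch of per-flaw clarity scoring, inspired by iClarify.
# The flaws, features, and training data here are made up for illustration.
import numpy as np
from sklearn.svm import SVR

# Two illustrative flaw names standing in for the eight flaws iClarify covers.
FLAWS = ["incomplete_instructions", "ambiguous_wording"]

def extract_features(description: str) -> np.ndarray:
    """Toy surface features; the real tool uses richer feature types."""
    words = description.split()
    n_words = len(words)
    avg_word_len = sum(len(w) for w in words) / max(n_words, 1)
    n_sentences = max(description.count("."), 1)
    return np.array([n_words, avg_word_len, n_words / n_sentences])

# Tiny synthetic training set: (description, {flaw: severity in [0, 1]}).
train = [
    ("Label the image.",
     {"incomplete_instructions": 0.9, "ambiguous_wording": 0.7}),
    ("Read each sentence. Mark it positive or negative. Skip empty rows.",
     {"incomplete_instructions": 0.2, "ambiguous_wording": 0.3}),
    ("Do the task well.",
     {"incomplete_instructions": 1.0, "ambiguous_wording": 0.9}),
    ("Transcribe the audio clip word by word. Use lowercase. Ignore noise.",
     {"incomplete_instructions": 0.1, "ambiguous_wording": 0.2}),
]

X = np.array([extract_features(d) for d, _ in train])
models = {
    flaw: SVR(kernel="rbf").fit(X, [y[flaw] for _, y in train])
    for flaw in FLAWS
}

def score(description: str) -> dict:
    """Return one predicted severity score per clarity flaw."""
    x = extract_features(description).reshape(1, -1)
    return {flaw: float(m.predict(x)[0]) for flaw, m in models.items()}

print(score("Classify the tweet."))
```

The requester-facing loop would then simply be: edit the description, call `score` again, and repeat until all per-flaw scores are acceptably low.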