I built a chat room for a PhD research project.
Any write-up of this in the PhD thesis itself will never be able to do justice to the drawn-out, agonising, Herculean labour that this simple, whimsical statement turned out to involve, so I thought I’d do that here instead!
So my PhD involves studying the experiences of people using English as a second language in text-based online communication. Chat rooms, blogs, comment threads, instant messaging, social media, even emails, you name it. In order to do this, I thought a good first step would be to talk to these people and ask them about their experiences, bringing them together in small focus groups.
Then I thought: why not do it online? Why not use an online chat room to talk about online communication? Massive benefits of this include the fact that, as opposed to face-to-face focus groups, there is no need to convert video recordings to text transcripts (for which you can expect to spend four hours for each hour of recorded material), and that participants can take part from wherever they have an internet-connected computer, meaning that a) they don’t have to travel to the study and b) they can be located pretty much anywhere in the world, multi-time-zone scheduling permitting.
Next question: what platform to use? I considered a number of possibilities. WhatsApp looked like a good option for several reasons: it gives you the option to email yourself the chat log of any conversation you’re having, providing an instant transcript of the conversation, plus conversations are encrypted as they wing their way through the wires of the web. But you need people’s phone numbers, which I thought was a bit unnecessarily invasive. Other obvious contenders like Facebook Messenger lacked an option to download the chat (it is technically possible, but it involves downloading your entire Facebook dataset, which seemed a bit cumbersome; ditto with Google Hangouts). For research purposes, however, a major flaw of all commercial and web-based options was that you can’t be sure where your data is going, where it is being stored, and what is being done with it.
Ethics is a pretty big deal in academic research these days, and in anything to do with internet research, data security is an important ethical consideration. To satisfy ethics review boards you need to show you’ve given thought to the whereabouts of data collected from research participants at all times – that’s the physical location, so if it’s “in the Cloud” you need to know exactly what that means, which company’s particular subsection of “the Cloud” that is, where the actual computers are on which they store your data, and what laws apply to data storage in that location.
So, reasoned my treacherous brain, we can avoid all of these problematic ethical issues and ensure that the platform has all necessary features like exportable transcripts etc. if we simply… do it ourselves. You know, make a chat room and host it on a University of Nottingham server.
As I look back on this last sentence I’m not sure what on earth I thought I was thinking.
I should mention at this point that I have no background at all in computer science. I did a GCSE in IT in 1998. I had spent some time over the first year of my PhD learning the basics of programming using the excellent codecademy and had successfully learned enough Python to process Twitter data, and this experience had shown me that with perseverance and a great deal of searching on programming advice forums (stackoverflow, w3schools, digital ocean, many others…), it’s possible to figure stuff out and do things with computers that initially seem impossible. So I thought I should be able to do this.
And I was right. Just. But every step of the way threw up new and seemingly intractable challenges that not only required the finding of a solution, but often the hasty acquisition and understanding of some fundamental aspect of computer science and web science. Sometimes it took hours of painstaking online research just to understand what question I needed to ask.
Possibly it was not the most productive use of my time. But I’ve finally arrived at a point now where I have a cast-iron online platform on which I can do some (hopefully) interesting and (definitely) ethically-sound research.
In part II of this post I’ll describe the process, partly as an aide-mémoire for myself (lest we forget…) and partly in the hope that it might be of some use to anyone equally daft enough to try doing it for themselves. It will be full of dreadfully dull technical details, but hopefully I can explain it in such a way as to make it accessible to non-CS people like me.
Due to a fantastic initial response to this call I am no longer looking for further participants. If you have come here interested in taking part, then I’m very grateful for your interest in my research, but I will not be running any more chat sessions.
Are you a non-native speaker of English? Do you use English as a second or foreign language to communicate with people online, or through any form of electronic technology? If so, then we’d like to talk to you about your experience. As part of a research partnership between Cambridge University Press and the University of Nottingham we want to talk to people like you, in order to understand how English language teaching can help people communicate online.
Why? We want to find out:
- What kinds of electronic communication you use – email, chat rooms, instant messaging apps, online community forums… pretty much anything really, the only limit is that we’re interested in text communication – typed on a keyboard or on a phone – rather than video or audio communication.
- What you use these things for, which ones you like or dislike, and which ones you find easier or more difficult to use.
- Whether you have any particular problems communicating with other people using these methods of communication.
What do we want you to do?
We’ll ask you to participate in an online text-chat session with a researcher and three or four other people like you. The chat will last for about an hour, although the whole process will take an hour and a half of your time. We’ll ask you to talk about the things listed above, then fill in an online questionnaire. The whole process will be online, so you don’t have to travel anywhere and can take part from your home computer or anywhere you want. We only ask that you participate using a full computer keyboard, rather than just a mobile phone, so that everyone can contribute equally.
Between January and April 2016 I undertook a study of the use of Twitter for public engagement among members of the University of Nottingham staff. The project was run under the auspices of CaSMa – Citizen-Centric Approaches to Social Media Analysis – a research team at the Horizon Digital Economy Research Institute that explores methods of performing social media research that respect the rights to privacy and ownership of personal data of social media users. Consenting study participants provided data from their Twitter feed by exporting it from a web tool that was designed primarily to allow users to monitor and manage their Twitter activity. Using the graph visualisation software Gephi I created an image of the network of interactions created by the tweeting, retweeting, quoting, mentioning, favouriting and following events in the data, and performed an analysis of hashtag propagation to look for signs of successful public engagement.
It was a challenging project. Designing the data collection in line with CaSMa’s citizen-centric ethos required meeting with each participant in person (and consequently much to-ing and fro-ing between the University of Nottingham’s various campuses and partner organisations), talking them through the web tool and the data collection process, and obtaining their written consent to analyse their data. Once the collection procedure was in place, I had to work out what to do with the data: it was delivered to me in JSON-format text files, and rendering complete data sets required much sorting and parsing of the data structures. The text files presented a series of events: tweets, retweets, favouriting, following, etc. I needed to find, for example, where in the hierarchical text structure I could find the ID of the initiator of a particular event – the person who had written a tweet, liked a tweet, or retweeted a tweet. In the latter two cases I also wanted to know the ID of the person whose tweet had been liked or retweeted. This information was not nested in exactly the same place in every event type, and a considerable amount of time was spent establishing the necessary paths within each event type. Once this was established, I used Python to retrieve the data and compile it into uniform data sets. I had not previously done any programming, and getting to grips with the language was a real learning experience.
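To give a flavour of the "different path for each event type" problem, here is a minimal Python sketch. It is not the code used in the project: the event types, key names and paths (`INITIATOR_PATHS`, `TARGET_PATHS`, `"retweeted_status"` and so on) are assumptions made up for illustration, since the actual export format of the web tool is not shown here.

```python
import json

# Hypothetical paths to the initiator's ID for each event type.
# The real export format is not documented in this post.
INITIATOR_PATHS = {
    "tweet": ["user", "id"],
    "retweet": ["user", "id"],
    "favourite": ["source", "id"],
}

# Hypothetical paths to the ID of the person whose tweet was acted upon.
TARGET_PATHS = {
    "retweet": ["retweeted_status", "user", "id"],
    "favourite": ["target_object", "user", "id"],
}


def dig(record, path):
    """Follow a list of keys into nested dicts; return None if anything is missing."""
    if path is None:
        return None
    for key in path:
        if not isinstance(record, dict) or key not in record:
            return None
        record = record[key]
    return record


def normalise(event):
    """Reduce one event of any known type to a uniform three-field row."""
    kind = event.get("type")
    return {
        "type": kind,
        "initiator": dig(event, INITIATOR_PATHS.get(kind)),
        "target": dig(event, TARGET_PATHS.get(kind)),
    }


# Tiny made-up sample standing in for one participant's export file.
raw = """[
  {"type": "tweet", "user": {"id": 1}},
  {"type": "retweet", "user": {"id": 2}, "retweeted_status": {"user": {"id": 1}}}
]"""
rows = [normalise(event) for event in json.loads(raw)]
```

Keeping the paths in lookup tables rather than in a chain of `if` statements means that each newly discovered event type only needs one new entry per table.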
The code compiled primary user, secondary user, and mentioned user data, and with this I created a network graph visualisation. The final product looked like this:
The layout is determined by a mathematical algorithm, and the colours are the result of a modularity analysis carried out by the software to identify discrete communities based on interactions. Unsurprisingly, most of the communities in the image above are centred on my participants, although the blue, purple and black communities subsume more than one individual participant, and not, in all cases, by the conscious design of the users themselves. Outlying coloured dots that seem to have ‘escaped’ their neighbours represent individuals who bridge two communities (and are consequently located equidistant between the two).
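Before Gephi can lay out and colour a graph like this, the compiled user pairs have to be handed over as a list of edges. One common route is a CSV file with `Source`, `Target` and `Weight` columns, which Gephi's spreadsheet import understands. The sketch below assumes that route and uses made-up user names; it is an illustration, not the project's actual export step.

```python
import csv
import io
from collections import Counter


def edge_list_csv(interactions):
    """Collapse (initiator, target) pairs into weighted edges and return CSV text.

    Pairs with no target (e.g. a plain tweet) are skipped, because they do not
    connect two users.
    """
    weights = Counter((src, dst) for src, dst in interactions if dst is not None)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (src, dst), weight in sorted(weights.items()):
        writer.writerow([src, dst, weight])
    return buffer.getvalue()


# Made-up interactions: user "a" retweeted "b" twice, "c" mentioned "a" once,
# and "d" posted a tweet that involved nobody else.
sample = [("a", "b"), ("a", "b"), ("c", "a"), ("d", None)]
csv_text = edge_list_csv(sample)
```

Repeated interactions between the same two users become a single heavier edge, which is what lets the layout algorithm pull frequent interlocutors closer together.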
Combining this approach with an analysis of hashtags suggested that successful uptake of a hashtag-denoted topic or event can be aided by recruiting partners to help spread the message. However, detecting true public engagement proved challenging. Due to the data collection method, full profile data were only collected on users tweeting or retweeting, and not from users favouriting or following, resulting in profile data for only 60% of users. Consequently, it was not possible to perform a robust analysis of users as ‘inside’ or ‘outside’ the academic community, and to what extent the message was reaching a general ‘public’, or circulating around a more specialised audience. In fact, this consideration raised questions of who constitutes the ‘public’ in public engagement, and whether the concept of a demarcated ‘academia’ is a valid proposition (apologies for all the air-quotes).
Further research could look at finding computational methods to process profile descriptions and produce judgements of the likely affiliation of an individual. However, this would again raise ethical questions, which are going to become more and more salient in future social media research.
On the 16th of March 2016 I participated in the Pathways to STEM outreach event at the central library in Mansfield. Around 300 year 10 pupils from schools in the Mansfield area who had shown interest in STEM subjects at school (Science, Technology, Engineering, Mathematics) were invited to come to the library to meet postgraduate students from the University of Nottingham, who brought along examples of, and activities based on, their research.
Two days of training from the UoN Graduate School a few weeks earlier had resulted in around fifty interested postgrads forming small groups based on shared(ish) research interests, with the aim of creating a short activity for the school pupils to engage with. At this stage of the process, I initially felt a little discouraged, as my very tenuous links to STEM gave me little common ground with the chemists, physicists, biologists and engineers around me. I found myself in a group with an agricultural scientist specialising in efficiency in dairy farming, and an engineer working on developing new, non-invasive ways to accurately measure the heart rate of newborn babies. With such disparate disciplines we opted to share a stand at the event, but develop small activities individually, under the umbrella theme of “What Technology Can Do For Us”.
I found it difficult at first to design a short, fun activity based on my research. I thought that the most salient application of technology within the areas of Applied Linguistics with which I was most familiar was the use of computerised language corpora to study patterns of language in use. My initial ideas were to involve the students in the process of corpus creation, perhaps to create a corpus from samples of their own classwork, and then perform some basic analyses of their language use. However, while this is an intriguing idea, it was too involved for the format of the day.
In the end I settled on the idea of exploring the linguistic phenomenon of collocation. This is the tendency of certain words to ‘attract’ certain other words, and so co-occur together frequently. To put it another way, it is the tendency of language users to have ‘go to’ combinations of words which they can pull out and use with minimal mental effort. The strength of attraction between words varies, but at the high end of the scale are semi-fixed combinations such as ‘torrential rain’ and ‘excruciating pain’. The development of computerised corpora over the past thirty years or so has facilitated the study of this phenomenon, and the strength of attraction can be quantified using various statistical tests that generate scores; the stronger the connection, the higher the score.
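One widely used score of this kind is mutual information (MI), which compares how often two words actually occur together with how often they would be expected to co-occur by chance. The sketch below uses a simplified form of the formula (it ignores the size of the search window around the node word), and the counts in the example are invented purely to show the arithmetic; they are not taken from any real corpus.

```python
import math


def mi_score(pair_count, word1_count, word2_count, corpus_size):
    """Simplified mutual information score for a word pair.

    expected = how often the pair would co-occur if the two words were
    scattered through the corpus independently of each other.
    """
    expected = word1_count * word2_count / corpus_size
    return math.log2(pair_count / expected)


# Invented counts in a toy corpus of 1,024 words.
strong = mi_score(pair_count=8, word1_count=16, word2_count=32, corpus_size=1024)
weak = mi_score(pair_count=2, word1_count=16, word2_count=32, corpus_size=1024)
```

With these toy numbers the pair is expected to co-occur 0.5 times by chance, so eight actual co-occurrences give a score of 4.0, while two give 2.0: the stronger the attraction, the higher the score.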
In addition to being a good example of the application of technology in the study of language, I chose to focus on collocation because it is something that all native speakers of a language intuitively understand. Show any native English speaker a sentence in which the word following ‘torrential’ has been removed, and there is a very good chance indeed that they will supply ‘rain’, or possibly ‘downpour’, to fill the gap. Running with this idea, I thought that I could sell this awareness as a form of mind reading, and the idea for ‘I Can Read Your Mind’ was born.
Using my newly-acquired powers of telepathy to get the students’ attention, I then wanted to explain a little about the use of computerised corpora to study this, and let them try it out for themselves. I decided to give the students an adjective chosen at random, and ask them to guess the five words that collocate most strongly with it. Using an online corpus analysis tool I generated a list of these words from the British National Corpus, a 100 million-word collection of text assembled in the 1990s with the intention of creating a representative sample of modern British English. I wrote a short computer program that compared the students’ five guesses with the top twenty words from the corpus, and scored each guess according to its rank in the top twenty. Their total scores would then be recorded on a leaderboard, to add a competitive element to the activity.
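The scoring idea can be sketched in a few lines of Python. This is a reconstruction rather than the original program: the exact points scheme is an assumption (here the top-ranked word earns as many points as there are words in the list, the last-ranked word earns one, and anything outside the list earns nothing), and the word list is a placeholder rather than real corpus output.

```python
def score_guesses(guesses, ranked_collocates):
    """Score a team's guesses against a ranked list of collocates.

    Assumed scheme: the word at rank 1 is worth len(ranked_collocates) points,
    the last word is worth 1 point, and words not in the list are worth 0.
    Repeated guesses only count once.
    """
    rank_of = {word: index for index, word in enumerate(ranked_collocates)}
    unique_guesses = {guess.strip().lower() for guess in guesses}
    total = 0
    for guess in unique_guesses:
        if guess in rank_of:
            total += len(ranked_collocates) - rank_of[guess]
    return total


def leaderboard(team_scores):
    """Return (team, score) pairs, highest score first."""
    return sorted(team_scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder list standing in for the top collocates of some adjective.
ranked = ["alpha", "beta", "gamma", "delta"]
team_total = score_guesses(["Beta", "zeta", "alpha", "beta "], ranked)
board = leaderboard({"Team A": team_total, "Team B": 3})
```

In this toy run ‘alpha’ earns 4 points and ‘beta’ earns 3 (counted once despite being guessed twice), giving a total of 7.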
My teammates prepared great activities. Shiemaa brought along her cardiac monitor, which allowed the students to see their heart rate in real time simply by holding two light-emitting contacts in their fingertips, and to go home with a printout of the signal, while Emma prepared a fantastic Monopoly-style game in which students took over the running of a dairy farm for a five-year period, applying various technological methods and seeing their effects on feed price and milk yield! By our final preparatory meeting the week before the event, we were confident we had a good stand.
The day went smoothly and enjoyably. In two ninety-minute sessions, the students circulated around the large events room at the library, stopping at the different stands and experiencing a range of scientific and technological works-in-progress, from 3D printers, to disease prevention, to optimised growing conditions for plants, to DNA Jenga. My team’s activities went well, and the students were suitably wowed by my mind-reading powers. I learned several interesting things, notably that my activity was a little unfair, as some words had very obvious collocating nouns (the girls who got the word ‘healthy’, for example, scored very highly with ‘food’, ‘diet’, ‘lifestyle’ etc.), whilst others had a tendency to take rather more obscure and difficult words. Furthermore, it was very hard to predict, without looking at the wordlist, which adjectives would be challenging, so even when I became aware of the problem and started pre-selecting adjectives (previously I was using an online random word generator), I still couldn’t guarantee non-zero scores. Still, most teams seemed to enjoy the activity, and there were several real eye-opening moments. The team of boys who got the adjective ‘nice’ thought they were being jokers when they wrote down the word ‘guy’… but it turned out to be the top word.
The event as a whole was a great success, and the organisers passed on the following feedback from students and teachers:
I just want to thank you firstly for a fantastic event today. I saw every single student engage in an activity and interact with the people who were leading the activities. It was really good to see them enjoy themselves and for some stretch themselves a little.
Thank you for yesterday I had a thoroughly enjoyable and informative afternoon as did my colleagues and more importantly my students.
I learnt lots about how science effects our world
I learnt lots and all the events were cool
I could see science displayed in different ways to how it is done in school
It was a great opportunity for me to apply my research interests and really gave me a fresh perspective on my work.