Big Data Text Analysis
Word association thematic analysis (WATA) identifies themes that occur more often in one subset of texts than another. For example, given a set of Covid-19 tweets it might find that male-oriented themes include sports and news, female-oriented themes include personal safety and mental health, and nonbinary-oriented themes include politics and identity.
How does it work? For any set of texts, you set filters in Mozdeh to split the texts into two non-overlapping sets, based on gender, country, label, sentiment, retweet count, or a query. Mozdeh then finds words that occur more often in one set than in the other. You then read a sample of the texts containing each word to identify its typical context, and use thematic analysis methods to group the words into themes. Full details are given in the following book.
Thelwall, M. (2021). Word association thematic analysis: A social media text exploration strategy. San Rafael, CA: Morgan & Claypool.
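The word comparison described above can be sketched in a few lines of Python. This is only an illustration and not Mozdeh's code: the tokenisation rule, the function names and the choice of an uncorrected Pearson chi-square statistic for the 2x2 table are all assumptions, since this page does not say which test Mozdeh applies.

```python
import re
from collections import Counter

def doc_freq(texts):
    """Count, for each word, how many texts contain it at least once."""
    counts = Counter()
    for text in texts:
        # Assumed tokenisation: lower-case runs of letters, digits, #, @ and '.
        counts.update(set(re.findall(r"[a-z0-9#@']+", text.lower())))
    return counts

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def word_associations(set_a, set_b):
    """List words found in a larger share of set_a texts than set_b texts,
    sorted by descending chi-square value (an assumed measure of association)."""
    if not set_a or not set_b:
        raise ValueError("both sets of texts must be non-empty")
    freq_a, freq_b = doc_freq(set_a), doc_freq(set_b)
    n_a, n_b = len(set_a), len(set_b)
    rows = []
    for word, in_a in freq_a.items():
        in_b = freq_b.get(word, 0)
        if in_a / n_a > in_b / n_b:
            chi2 = chi_square_2x2(in_a, n_a - in_a, in_b, n_b - in_b)
            rows.append((word, in_a / n_a, in_b / n_b, chi2))
    return sorted(rows, key=lambda row: row[3], reverse=True)
```

Calling word_associations(set_a, set_b) and then word_associations(set_b, set_a) gives the words for both directions of the comparison. With many words tested at once, a correction for multiple comparisons would also be needed before calling any difference statistically significant.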
The table below illustrates some WATA studies. Starred papers do not use the name WATA.
Topic | Data | Comparison | Example findings |
Gender differences in reactions to Covid-19 | Tweets mentioning Covid-19 | Female v. male | Females tweet more about safety, males more about politics (Thelwall & Thelwall, 2020). |
Personal experiences of ADHD | Tweets about “my ADHD” | ADHD v. other disorders | The brain is discussed as if it is a separate entity (Thelwall et al., 2021a). |
Evolution of #BlackLivesMatter during Covid-19 | Covid-19 tweets about racism | Tweets in four different periods | The George Floyd killing led to tweets about systematic racism (Thelwall & Thelwall, 2021). |
Self-presentation on Twitter | UK Twitter profiles | Female v. male v. nonbinary | Nonbinary profiles are more likely to mention games and sexuality (Thelwall et al., 2021b). |
Autism on Twitter | US autism tweets during Covid-19 | Autism v. others | Autistic tweeters do not have distinctive reactions to Covid-19 (Thelwall & Thelwall, 2022). |
Gender differences in museum interests | Comments on YouTube museum videos | Female v. male | Females are more explicitly positive about content (Thelwall, 2018c). |
Discussions of bullying in YouTube | Comments on YouTube influencer videos | Bullying v. | Strategies used to address bullying include generalisation (Thelwall & Cash, 2021). |
Interests on Reddit | Reddit posts | Female v. male | Females are more likely to mention doctors in the science subreddit (Thelwall & Stuart, 2019). |
Factors associated with success in SteemIt | SteemIt (like Reddit) posts | Successful v. unsuccessful posts | Financial news is less likely to be rewarded (Thelwall, 2018b). |
Nursing research | Nursing journal articles* | USA v. other countries | Nursing administration and management is not studied in some countries (Thelwall & Mas-Bleda, 2020). |
US research subjects | US journal articles* | Female v. male | Lists of gendered research topics and styles (Thelwall et al., 2019a). |
UK research subjects | UK journal articles* | Female v. male | Lists of gendered research topics and styles (Thelwall et al., 2020). |
Indian research subjects | Indian journal articles* | Female v. male | Lists of gendered research topics and styles (Thelwall et al., 2019b). |
Research quality | UK journal articles | High v. medium v. low quality | Lists of research topics and methods associated with higher or lower scores in the UK Research Excellence Framework 2021 evaluations (Thelwall et al., 2023). |
References to papers using Word Association Thematic Analysis (not necessarily using that term).
Thelwall, M., Abdoli, M., Lebiedziewicz, A. & Bailey, C. (2020). Gender disparities in UK research publishing: Differences between fields, methods and topics. El Profesional de la Información, 29(4), e290415. https://doi.org/10.3145/epi.2020.jul.15
Thelwall, M., Bailey, C., Makita, M., Sud, P. & Madalli, D. (2019b). Gender and research publishing in India: Uniformly high inequality? Journal of Informetrics, 13(1), 118-131.
Thelwall, M., Bailey, C., Tobin, C. & Bradshaw, N. (2019a). Gender differences in research areas, methods and topics: Can people and thing orientations explain the results? Journal of Informetrics, 13(1), 149-169.
Thelwall, M. & Cash, S. (2021). Bullying discussions in UK female influencers’ YouTube comments. British Journal of Guidance and Counselling, 49(3), 480-493. https://doi.org/10.1080/03069885.2021.1901263
Thelwall, M., Kousha, K., Abdoli, M., Stuart, E., Makita, M., Wilson, P. & Levitt, J. (in press). Terms in journal articles associating with high quality: Can qualitative research be world-leading? Journal of Documentation. https://doi.org/10.1108/JD-12-2022-0261
Thelwall, M., Makita, M., Mas-Bleda, A. & Stuart, E. (2021a). “My ADHD hellbrain”: A Twitter data science perspective on a behavioural disorder. Journal of Data and Information Science, 6(1). https://doi.org/10.2478/jdis-2021-0007
Thelwall, M. & Mas-Bleda, A. (2018). YouTube science channel video presenters and comments: Female friendly or vestiges of sexism? Aslib Journal of Information Management, 70(1), 28-46.
Thelwall, M. & Mas-Bleda, A. (2020). How does nursing research differ internationally? A bibliometric analysis of six countries. International Journal of Nursing Practice, 26(6), e12851. https://doi.org/10.1111/ijn.12851
Thelwall, M. & Stuart, E. (2019). She’s Reddit: A source of statistically significant gendered interest information? Information Processing & Management, 56(4), 1543-1558.
Thelwall, M. & Thelwall, S. (2020). Covid-19 tweeting in English: Gender differences. El Profesional de la Información, 29(3), e290301.
Thelwall, M. & Thelwall, S. (2021). Racism discussions on Twitter after George Floyd during Covid-19: A space to address systematic and institutionalized racism? Social Science Research Network. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3764867
Thelwall, M., Thelwall, S. & Fairclough, R. (2021b). Male, female and nonbinary differences in UK Twitter user self-descriptions: A fine-grained systematic exploration. Journal of Data and Information Science, 6(2), 1-27.
Thelwall, M. & Thelwall, S. (2022). Autism Spectrum Disorder on Twitter during Covid-19: Account types, self-descriptions and tweeting themes. Data Science and Informetrics, 2(2), 1-12.
Thelwall, M. (2018a). Social media analytics for YouTube comments: Potential and limitations. International Journal of Social Research Methodology, 21(3), 303-316.
Thelwall, M. (2018b). Can social news websites pay for content and curation? The SteemIt cryptocurrency model. Journal of Information Science, 44(6), 736-751.
Thelwall, M. (2018c). Can museums find male or female audiences online with YouTube? Aslib Journal of Information Management, 70(5), 481-497.
See also (word association analysis, but not full WATA):
Thelwall, M. (2021). World Food Day on Twitter 2009-2020: Driven by UNFAO and aligned campaigns. SSRN
The following spreadsheets give artificial examples of words identified by Mozdeh, with suggested human-assigned contexts and themes for them. These are small-scale examples for training purposes. Most of the papers above have online supplements with longer lists of words with contexts and themes.
The rest of this page gives instructions for getting the data for a WATA study with Mozdeh.
Step 1: Collect your data (tweets, YouTube comments, other) with Mozdeh in the same way as for any other Mozdeh project. Go to the main Mozdeh search screen when you have finished.
Step 2: Decide what type of comparison you are making. If you are comparing the texts matching a filter against the rest (e.g., female-authored tweets against all other tweets) then follow the version of step 3 for a one-vs-remainder word association test (3a). If you are comparing one set of texts against another, but not the rest (e.g., nonbinary-authored tweets against female-authored tweets) then follow the version of step 3 for an A-vs-B word association test (3b).
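The difference between the two comparison types can be illustrated with a short Python sketch. It is not Mozdeh code: the function names and the dictionary representation of a text are hypothetical, and dropping texts that match both filters in the A-vs-B case is an assumption made here to keep the two sets non-overlapping.

```python
def one_vs_remainder(texts, matches):
    """3a: set A is every text passing the filter, set B is all other texts."""
    set_a = [t for t in texts if matches(t)]
    set_b = [t for t in texts if not matches(t)]
    return set_a, set_b

def a_vs_b(texts, matches_a, matches_b):
    """3b: each set has its own filter; texts matching neither are ignored.
    Assumption: texts matching both filters are dropped so the sets cannot overlap."""
    set_a = [t for t in texts if matches_a(t) and not matches_b(t)]
    set_b = [t for t in texts if matches_b(t) and not matches_a(t)]
    return set_a, set_b

# Hypothetical records: each text carries the author's gender.
tweets = [
    {"text": "stay safe", "gender": "female"},
    {"text": "match tonight", "gender": "male"},
    {"text": "new profile", "gender": "nonbinary"},
]
female, rest = one_vs_remainder(tweets, lambda t: t["gender"] == "female")
nonbinary, female_only = a_vs_b(
    tweets,
    lambda t: t["gender"] == "nonbinary",
    lambda t: t["gender"] == "female",
)
```

In the one-vs-remainder case every text ends up in one of the two sets, whereas in the A-vs-B case the male-authored tweet above is in neither.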
Step 3a: Enter filters in the search screen to match your set (e.g., gender, country...), check the Always save mine associations results... option in the Advanced menu and click the Mine Associations for Search and Filters (slow) button. This should produce a file containing a list of words occurring more often in the set matching your filters than in the remaining texts. The file is in the folder that will appear when the procedure is finished. The list is sorted in descending order of statistical significance, and a star at the end of a row means that the difference is statistically significant.
Step 3b: Select the Association mining comparisons tab. Enter two queries, separated by a comma, in the text box, following the instructions below. The two queries specify the two subsets to be compared (or enter one query and select the Male vs. Female option). Here are some examples.
* The queries nonbinary,transgender are an instruction to compare texts containing "nonbinary" with texts containing "transgender".
* The queries <N>,<F> are an instruction to compare texts authored by nonbinary people with texts authored by females.
* The queries <M>{{UK}},<F>{{UK}} are an instruction to compare texts from the UK authored by males with texts from the UK authored by females.
* The queries our{{UK}},our{{USA}} are an instruction to compare texts from the UK containing the word "our" with texts from the USA containing the word "our".
Now click the Compare words matching the above queries (slow) button. This should produce a file containing a list of words occurring more often in texts matching the first query than in texts matching the second (and vice versa). The file is in the folder that will appear when the procedure is finished. The list is sorted in descending order of statistical significance, and a star at the end of a row means that the difference is statistically significant.
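To make the query notation in the four examples concrete, the following Python sketch splits a pair of queries into a keyword, a gender and a country for each side. It is inferred only from those examples (a <M>, <F> or <N> tag, an optional keyword and an optional {{country}} suffix); Mozdeh's real query syntax may allow more than this, and the function names are hypothetical.

```python
import re

# Pattern inferred from the examples on this page, not from Mozdeh itself.
QUERY = re.compile(
    r"^(?:<(?P<gender>[MFN])>)?"          # optional gender tag such as <F>
    r"(?P<keyword>[^<>{}]*)"              # optional keyword such as our
    r"(?:\{\{(?P<country>[^{}]+)\}\})?$"  # optional country such as {{UK}}
)
GENDERS = {"M": "male", "F": "female", "N": "nonbinary"}

def parse_query(query):
    """Split one query into its keyword, gender and country parts."""
    match = QUERY.match(query.strip())
    if match is None:
        raise ValueError(f"unrecognised query: {query!r}")
    return {
        "keyword": match.group("keyword").strip() or None,
        "gender": GENDERS.get(match.group("gender")),
        "country": match.group("country"),
    }

def parse_pair(queries):
    """Split 'first,second' into two parsed queries (assumes exactly one comma)."""
    first, second = queries.split(",")
    return parse_query(first), parse_query(second)

first, second = parse_pair("<M>{{UK}},<F>{{UK}}")
# first  -> {'keyword': None, 'gender': 'male', 'country': 'UK'}
# second -> {'keyword': None, 'gender': 'female', 'country': 'UK'}
```

The parsed parts correspond to the search box, the gender selection box and the country selection box on the search screen.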
Step 4: Configure the filters on the search screen to match the first of the two queries compared, so that all texts to be classified match that query. In the 3a case, keep the filters and/or queries used in step 3a. For the four 3b examples above, the settings would be as follows, in the same order.
* Enter nonbinary in the search box.
* Select nonbinary in the gender selection box.
* Select male in the gender selection box and UK in the country selection box.
* Enter our in the search box and UK in the country selection box.
Step 5: Click on the Save tab. Click the WATA button and select the file created by step 3a or 3b (the version ending in diffp or diffp.txt). In reply to the questions, select column 1 and answer Yes to the question about search screen filters (unless you don't need to use them). This will produce a file that can be loaded into a spreadsheet and that contains 100 randomly selected texts containing each of the first 100 words found by the word association tests (unless you changed the recommended answers).
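The sampling that this step describes can be pictured with the following Python sketch. It is only an illustration of drawing random texts for each word, not Mozdeh's implementation: the function name, the whole-word matching rule and the fixed random seed are assumptions.

```python
import random
import re

def sample_texts_per_word(texts, words, max_words=100, per_word=100, seed=0):
    """For each of the first max_words words, pick up to per_word random texts
    that contain the word as a whole token (case-insensitive)."""
    rng = random.Random(seed)  # fixed seed so the same sample can be redrawn
    samples = {}
    for word in words[:max_words]:
        pattern = re.compile(r"(?<!\w)" + re.escape(word) + r"(?!\w)", re.IGNORECASE)
        matching = [text for text in texts if pattern.search(text)]
        samples[word] = rng.sample(matching, min(per_word, len(matching)))
    return samples

texts = ["That was racist abuse", "Not racist at all", "Racist!", "hello there"]
samples = sample_texts_per_word(texts, ["racist", "hello"], per_word=2)
```

Here samples["racist"] holds two of the three matching texts and samples["hello"] holds the single text containing that word, so rare words simply yield fewer texts.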
Step 6a: If you followed step 3a, skip step 6.
Step 6b: Repeat steps 4 and 5, altering the filters in Step 4 to match those in the second query. For example, in the 3b cases this would be:
* Enter transgender in the search box.
* Select female in the gender selection box.
* Select female in the gender selection box and UK in the country selection box.
* Enter our in the search box and USA in the country selection box.
After Step 5 (or Step 6b, if you followed 3b), you don't need Mozdeh any more. Use the file produced in Step 5 for the thematic analysis stage, to find themes in the words identified by the word association analysis. For example, if one of the top 100 word association words is "racist" then the file will contain 100 texts including the word "racist", and the context in which this word is used can be deduced by reading them. Repeating this for the other words and clustering the word contexts as part of a thematic analysis might put "racist" with an "Abuse" theme or a "Politics" theme, for example. Important: If you followed 3b/6b then you will have two files, one for each of the two queries. Only classify texts for a word from the file for the query that uses the word more often. For example, if the term stupid is used more often by UK females than UK males for the third pair of queries above, then texts containing stupid should be classified from the second file (UK females) and the texts in the first file (UK males) should be ignored.
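The rule for choosing between the two files can be written as a small Python sketch. The data structures are hypothetical: word_shares is assumed to map each word to the share of texts containing it under the first and second query, and each samples dictionary maps a word to the texts sampled for it from the corresponding file.

```python
def texts_to_classify(word_shares, first_file_samples, second_file_samples):
    """For each word, keep only the sampled texts from the file whose query
    uses the word more often; words with equal shares are skipped."""
    chosen = {}
    for word, (share_first, share_second) in word_shares.items():
        if share_first > share_second:
            chosen[word] = first_file_samples.get(word, [])
        elif share_second > share_first:
            chosen[word] = second_file_samples.get(word, [])
    return chosen

# Hypothetical shares and samples for a two-query comparison.
word_shares = {"stupid": (0.01, 0.03), "football": (0.05, 0.02)}
first_file = {"stupid": ["first-file text"], "football": ["first-file match report"]}
second_file = {"stupid": ["second-file text"], "football": ["second-file match report"]}
to_read = texts_to_classify(word_shares, first_file, second_file)
```

In this made-up example, the texts for stupid come from the second file and the texts for football come from the first, because those are the sides on which each word is more common.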
Made by the University of Wolverhampton during the CREEN and CyberEmotions EU projects and updated at the University of Sheffield.