SentiStrength Language Customisation
To customise SentiStrength for a new language, follow the "Modifying SentiStrength input files" section below (takes about 2 days). To customise AND evaluate SentiStrength for the new language, follow all of the instructions below.
Instructions for modifying and evaluating SentiStrength for a new language (allow at least 3 weeks of full-time work; for a pilot study, reduce the number of texts to speed up the process).
- STEP 1: Creating a Gold Standard. Create a random set of at least 1000 texts in the new language of the type that you hope to automatically classify. For example, to gather tweets in your language, you could use the program Webometric Analyst. Have human coders classify these texts, using three independent coders if possible. Each coder gives every text a rating of 1 to 5 for positive sentiment and a rating of -1 to -5 for negative sentiment. Once the labelling is complete, resolve disagreements by giving each text the average score of the three coders. The Windows version of SentiStrength can be used to calculate the inter-coder consistency to check that it is high enough. To create a good gold standard, give the coders the instructions (example pdf) and carry out a couple of pilot exercises with 100 texts, asking the coders to discuss their disagreements on these 100 texts with each other afterwards.
- STEP 2: Modify SentiStrength Input Files: Follow the instructions below to create the new language-specific version of SentiStrength.
- STEP 3: Evaluating the new version of SentiStrength. Evaluate the accuracy against the gold standard from Step 1; correlation is the best measure. A correlation above 0.2 is OK and 0.4 or above is excellent. The Java version of the program can simultaneously calculate the sentiment scores of the texts in the gold standard and compare them against the human scores. To do this, put the gold standard into a single plain text file in the form [pos] [tab] [neg] [tab] [text] on each line and feed it into SentiStrength (Windows or Java version).
- STEP 4: Improving SentiStrength. To improve SentiStrength, compare the SentiStrength predictions with the human classifications in cases where they disagree and then identify the reason for the disagreement via the SentiStrength explanations for its score. For each reason identified, decide whether the files created in STEP 2 should be modified to ensure that texts like the one just misclassified will be classified correctly in future.
- STEP 5: (optional) Repeat Steps 1, 3 and 4 with new gold standards to improve SentiStrength further.
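The gold standard preparation in Steps 1 and 3 can be sketched in Python as follows. This is only an illustration, not part of SentiStrength: the function names average_ratings, gold_standard_line and pearson are hypothetical, and the sketch assumes that fractional averages are acceptable in the gold standard file (if whole numbers are needed, round the averages first). The pearson function is the standard Pearson correlation, which you could use to cross-check the correlation between human and predicted scores.

```python
from statistics import mean


def average_ratings(ratings):
    """Average one text's ratings across coders.

    ratings is a list of (pos, neg) pairs, one pair per coder,
    with pos in 1..5 and neg in -1..-5.
    """
    pos = mean(p for p, _ in ratings)
    neg = mean(n for _, n in ratings)
    return pos, neg


def gold_standard_line(pos, neg, text):
    """Format one text as [pos] [tab] [neg] [tab] [text]."""
    # Collapse tabs and line breaks inside the text so that each text
    # stays on a single line with exactly two tab separators.
    clean_text = " ".join(text.split())
    return f"{pos:g}\t{neg:g}\t{clean_text}"


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Three coders rated the same (made-up) text.
pos, neg = average_ratings([(3, -1), (4, -2), (2, -3)])
line = gold_standard_line(pos, neg, "so happy today")
```

Calculate the correlation separately for the positive and the negative scores, since each text has both a positive and a negative rating.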
Step 2: Modify SentiStrength Input Files
This section contains more information about Step 2, as described in the list above.
EmotionLookupTable.txt
This file should contain a list of words that tend to indicate sentiment, such as “love”, “hate”. Each word should have a sentiment score that indicates the typical polarity of the sentiment expressed by the word and the typical sentiment strength using the following scheme:
-5 Very strong negative sentiment (e.g., excruciating)
-4 Strong negative sentiment (e.g., hate)
-3 Moderate negative sentiment (e.g., dislike)
-2 Mild negative sentiment (e.g., uncomfortable)
2 Mild positive sentiment (e.g., content)
3 Moderate positive sentiment (e.g., happy)
4 Strong positive sentiment (e.g., lover)
5 Very strong positive sentiment (e.g., ecstatic)
Note that this should be for the typical use of the word in informal written text.
A score of 1 (no sentiment) can be given to words that you decide do not have sentiment but might in some contexts. This shows that you have considered the term.
Enter one word per line, followed by a tab, followed by the sentiment score: one of -5, -4, -3, -2, 2, 3, 4, 5, or 1 for a word that you have considered and judged to have no sentiment (note that -1 and 0 are not used).
Please modify the sentiment scores of all the words in the existing file and then add any other words that you can think of.
The star notation: If there are many different word endings for a word that give it the same sentiment meaning, please truncate the word and replace the ending with a star *. This only applies to word endings.
For instance, hate* would match hate, hater, and hated so that separate entries are not needed for all of these terms. But please don’t do this if the star would match unrelated words. For instance, amaz* would match amazing, amazed (positive) but also Amazon (neutral), so this would be wrong.
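To check a starred entry before adding it, you can test it against a list of real words from your language and look for unrelated matches. The Python sketch below is an illustration only: parse_lookup_table, matches and words_matched are hypothetical helper names, and the case-insensitive prefix matching is an assumption about how the star behaves, based on the description above.

```python
def parse_lookup_table(text):
    """Parse 'term<TAB>score' lines into a dict, skipping blank lines."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        term, score = line.split("\t")[:2]
        table[term.strip()] = int(score.strip())
    return table


def matches(entry, word):
    """Return True if a lookup entry matches a word.

    An entry ending in * matches any word that starts with the part
    before the star; any other entry must match the whole word.
    Matching ignores case (an assumption made for this sketch).
    """
    entry, word = entry.lower(), word.lower()
    if entry.endswith("*"):
        return word.startswith(entry[:-1])
    return word == entry


def words_matched(entry, vocabulary):
    """List the vocabulary words that an entry would match."""
    return [word for word in vocabulary if matches(entry, word)]


vocabulary = ["hate", "hater", "hated", "amazing", "amazed", "Amazon"]
print(words_matched("hate*", vocabulary))  # ['hate', 'hater', 'hated']
print(words_matched("amaz*", vocabulary))  # ['amazing', 'amazed', 'Amazon']
```

The second result shows the problem described above: amaz* also picks up the unrelated word Amazon, so amazing and amazed are better entered as separate entries.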
BoosterWordList.txt
This file should contain a list of words that tend to increase or decrease the sentiment of the word that follows, such as “very”, “some”. Each word should have a booster score that indicates the typical increase or decrease in sentiment given by the word using the following scheme:
-2 Large decrease in sentiment (e.g., little)
-1 Moderate decrease in sentiment (e.g., some)
1 Moderate increase in sentiment (e.g., very)
2 Large increase in sentiment (e.g., extremely)
Note that this should be for the typical use of the word in informal written text.
Enter one word per line, followed by a tab, followed by the booster score: one of -2, -1, 1, 2.
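To get a feel for suitable booster scores, it can help to see their effect on the word that follows. The Python sketch below shows one plausible reading of the scheme above and is not SentiStrength's actual algorithm: the names apply_booster and score_tokens are hypothetical, and both the clamping of strengths to the range 1 to 5 and the rule that a booster only affects the very next word are assumptions.

```python
def apply_booster(word_score, booster_score):
    """Shift the strength of a sentiment word, keeping its polarity.

    Assumption: the result is clamped so that its strength stays
    between 1 and 5.
    """
    sign = 1 if word_score > 0 else -1
    strength = abs(word_score) + booster_score
    strength = max(1, min(5, strength))
    return sign * strength


def score_tokens(tokens, emotion_words, booster_words):
    """Score each sentiment word, boosted by an immediately preceding booster."""
    scores = []
    pending = 0  # booster score waiting to be applied to the next word
    for token in tokens:
        if token in booster_words:
            pending = booster_words[token]
        elif token in emotion_words:
            scores.append((token, apply_booster(emotion_words[token], pending)))
            pending = 0
        else:
            pending = 0  # the booster is not followed by a sentiment word
    return scores


emotion_words = {"happy": 3, "dislike": -3, "hate": -4}
booster_words = {"very": 1, "extremely": 2, "some": -1}
print(score_tokens(["very", "happy"], emotion_words, booster_words))
# [('happy', 4)]
print(score_tokens(["some", "dislike"], emotion_words, booster_words))
# [('dislike', -2)]
```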
NegatingWordList.txt
This file should contain a list of words that almost always indicate that the sense of a sentence, word or phrase is negated, such as “not”, “never”, “don’t”. Please add as many terms as you can think of.
Enter one word per line with no extra spaces.
QuestionWords.txt
This file should contain a list of words that almost always indicate that the sentence is a question, such as “how”, “why”, “when”, but not “should” (which only sometimes indicates a question). Please add as many terms as you can think of.
Enter one word per line with no extra spaces.
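Both NegatingWordList.txt and QuestionWords.txt are plain word lists, so a small script can tidy them before use. The Python sketch below is an illustration only: clean_word_list is a hypothetical helper that strips surrounding spaces, drops blank lines and duplicates, and reports entries that still contain internal whitespace so that you can review them by hand.

```python
def clean_word_list(text):
    """Tidy a one-word-per-line list.

    Returns (words, suspicious): words is the cleaned list in its
    original order, and suspicious holds the entries that contain
    internal whitespace and so are not a single word.
    """
    seen = set()
    words = []
    suspicious = []
    for line in text.splitlines():
        word = line.strip()
        if not word or word in seen:
            continue  # skip blank lines and repeated entries
        seen.add(word)
        words.append(word)
        if len(word.split()) > 1:
            suspicious.append(word)
    return words, suspicious


raw = " not\nnever \n\nnot\ndon't\nno way\n"
words, suspicious = clean_word_list(raw)
print("\n".join(words))
```

Writing the cleaned words back to the file, one per line, gives a list with no extra spaces.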
IdiomLookupTable.txt (advanced)
This file should contain a list of common phrases whose sentiment tends to be different from the sentiment of the individual words. An example is the idiom “shock horror”. This means “mildly surprised” and is mildly negative (-2), but the individual words are strongly negative (-4). Each phrase should have a sentiment score that indicates its typical polarity and typical sentiment strength using the same scheme as above:
-5 Very strong negative sentiment (e.g., excruciating)
-4 Strong negative sentiment (e.g., hate)
-3 Moderate negative sentiment (e.g., dislike)
-2 Mild negative sentiment (e.g., uncomfortable)
2 Mild positive sentiment (e.g., content)
3 Moderate positive sentiment (e.g., happy)
4 Strong positive sentiment (e.g., lover)
5 Very strong positive sentiment (e.g., ecstatic)
Note that this should be for the typical use of the phrase in informal written text.
A score of 1 (no sentiment) can be given to phrases that you decide do not have sentiment but might in some contexts. This shows that you have considered the phrase.
Enter one phrase per line, followed by a tab, followed by the sentiment score: one of -5, -4, -3, -2, 2, 3, 4, 5, or 1 for a phrase that you have considered and judged to have no sentiment (note that -1 and 0 are not used).
Please modify the sentiment scores of all the phrases in the existing file (if any) and then add any other common phrases that you can think of.
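Before using the idiom file, it is worth checking that every line follows the format described above. The Python sketch below is an illustration only: check_idiom_line is a hypothetical helper, and the requirement that an idiom has at least two words is an assumption made here to catch single words entered by mistake. The allow_neutral option accepts the score 1 for phrases that you have considered and judged to have no sentiment.

```python
ALLOWED_SCORES = {-5, -4, -3, -2, 2, 3, 4, 5}


def check_idiom_line(line, allow_neutral=False):
    """Validate one 'phrase<TAB>score' line and return (phrase, score).

    Raises ValueError with a short explanation if the line is malformed.
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 2:
        raise ValueError("expected exactly one tab between phrase and score")
    phrase = " ".join(parts[0].split())  # normalise spacing in the phrase
    if len(phrase.split()) < 2:
        raise ValueError("an idiom should contain at least two words")
    try:
        score = int(parts[1].strip())
    except ValueError:
        raise ValueError("the score must be a whole number") from None
    allowed = ALLOWED_SCORES | ({1} if allow_neutral else set())
    if score not in allowed:
        raise ValueError(f"score {score} is not in the allowed scheme")
    return phrase, score


print(check_idiom_line("shock horror\t-2"))  # ('shock horror', -2)
```

Running the check over every line of the file and printing the line number of each failure makes format errors easy to find and fix.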