# Practicum 8 - Nicki Minaj Vaccine Hesitancy


In mid-September, rapper Nicki Minaj posted the [following tweet](https://twitter.com/nickiminaj/status/1437532566945341441): 

```
My cousin in Trinidad won’t get the vaccine cuz his friend got it & became impotent. His testicles became swollen. His friend was weeks away from getting married, now the girl called off the wedding. So just pray on it & make sure you’re comfortable with ur decision, not bullied
```

The vaccine hesitancy expressed by Nicki Minaj (and the way she expressed it with this story) surprised many people. It quickly became a partisan discussion when Tucker Carlson at Fox News ran a segment on her tweet expressing his support for her skepticism, and she subsequently [tweeted it](https://twitter.com/nickiminaj/status/1438248319650656256?lang=en).

The outbreak of conversation started by Nicki Minaj's tweet led to a flurry of emotions. This week, we'll use a dataset of tweets discussing Nicki Minaj and her controversial tweet. In particular, we'll look at the *sentiment* of the tweets that were posted. Sentiment analysis is a way of trying to quantify how emotion is expressed in text. 

-----------------

### Important Note!

Do _**not**_ share this data outside of class! It is a violation of Twitter's Terms of Service to share full tweet data publicly.

-----------------

## Dictionary-Based Sentiment Analysis

### Introduction

There are many, many ways to conduct sentiment analysis. We're going to do what is called *dictionary-based sentiment analysis*. Imagine that we have a set of words and that each word is associated with some amount of happiness. So, for example, if we said that happiness falls on a scale from 1 to 9, we would expect words like "love" and "sunny" to have high happiness scores (close to 9), while we would expect words like "pandemic" and "murder" to have low happiness scores (close to 1). 

To calculate the "sentiment" or "happiness" of a text, we split it into each of its individual words. We then look at each word and check if it's in our sentiment dictionary to see if we have a happiness score for it. If we do, then we'll add that sentiment to the total for the sentence. We then divide the total sentiment by the total number of words that we scored to get the average sentiment of the text. 

For example, say we had a sentiment dictionary with just three words and scores: "coronavirus" with a score of 1.1, "vaccines" with a score of 7.3, and "impotence" with a score of 2.9. And say we had the following tweet:

```
Contrary to Nicki Minaj claims, no, none of the available coronavirus vaccines have been linked to testicular swelling or impotence
```

To get the average sentiment for the tweet, we look at the words and we see that "coronavirus", "vaccines", and "impotence" all appear exactly once. So, we add together their sentiment scores and divide by 3 (the total number of words that we scored): 

```
(1.1 + 7.3 + 2.9) / 3 ≈ 3.77
```

On our scale from 1 to 9 (where 5 is the middle), our dictionary-based sentiment analysis would say that this tweet is relatively unhappy.
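The worked example above can be reproduced in a few lines of Python. This is only a sketch: `toy_scores` holds the three illustrative scores from the example (not real labMT values), and the text is split on whitespace without any cleaning.

```python
# Toy dictionary from the example above (illustrative scores, not labMT)
toy_scores = {"coronavirus": 1.1, "vaccines": 7.3, "impotence": 2.9}

tweet_text = ("Contrary to Nicki Minaj claims, no, none of the available "
              "coronavirus vaccines have been linked to testicular swelling "
              "or impotence")

total = 0.0    # running sum of the scores we find
n_scored = 0   # how many words had a score

for word in tweet_text.lower().split():
    if word in toy_scores:
        total += toy_scores[word]
        n_scored += 1

avg = total / n_scored
print(n_scored)          # number of words that were scored
print(round(avg, 2))     # average sentiment, rounded to two decimals
```

Only the words found in the dictionary count toward the denominator, so the many unscored words in the tweet do not pull the average up or down.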

### Be Wary of Sentiment Analysis (Especially on Short Texts)

Sentiment analysis is an entire branch of computer science (and the field of natural language processing, more specifically) and it is very difficult to conduct accurately. In particular, dictionary-based sentiment analysis is very easy to fool if we're working with short texts like tweets. Consider the following sentence:

```
I'm not happy about my birthday party this week
```

Dictionary-based sentiment analysis looks at _each word individually_. This means that it would see three positive words ("happy", "birthday", "party") and just one negative word ("not"). A naive dictionary-based approach will rate this sentence as positive because it does not understand that "not" is negating the emotions expressed in the rest of the sentence. There are many ways to try to address this issue, but they're beyond the scope of this one assignment. There is one saving grace though: the longer our texts, the less that negations, sarcasm, and other pathologies affect the overall sentiment. The oddities of the language tend to smooth out if we have a lot of text. 
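To see the problem concretely, here is a small sketch that scores the birthday sentence word by word. The scores in `made_up_scores` are invented for illustration and are not taken from labMT.

```python
# Hypothetical scores on the 1-9 scale, invented only for this illustration
made_up_scores = {"not": 3.9, "happy": 8.3, "birthday": 7.8, "party": 7.6}

sentence = "I'm not happy about my birthday party this week"

# Collect the score of every word that appears in the dictionary
scored = []
for word in sentence.lower().split():
    if word in made_up_scores:
        scored.append(made_up_scores[word])

avg = sum(scored) / len(scored)
print(round(avg, 2))
print(avg > 5)  # True means the sentence lands on the "happy" side of the scale
```

Under these assumed scores the single "not" is outvoted by the three positive words, so the sentence comes out well above the midpoint even though a human reader would call it unhappy.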

In the first part of the assignment I'll ask you to calculate the sentiment of individual tweets. In general, I do _**not**_ recommend doing that with dictionary-based sentiment analysis. Instead, I typically recommend calculating the sentiment of a large group of tweets, which is what we'll do in the last part of the assignment.

## 1. Reading Data

### 1a. Sentiment Dictionary

We want to start by reading in our sentiment dictionary. We'll be using the labMT sentiment dictionary, which you can read more about [here](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026752). Words fall on a continuous scale from 1 to 9, where 1 is the least happy, and 9 is the most happy.

Write a function to read a CSV file of word-score pairs (like `labMT-en.csv`) into a Python dictionary where keys are words and values are sentiment scores.
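As a reminder of the building blocks, the sketch below parses a single line in the `word,score` form. The line itself is made up, and it assumes there is no header row and no commas inside the word; check the real file before relying on either assumption.

```python
# A made-up line in the word,score format (not copied from labMT-en.csv)
line = "love,8.42\n"

# strip() removes the trailing newline, split(",") separates the two fields
word, score = line.strip().split(",")

# Everything read from a file is a string, so convert the score to a number
score = float(score)

print(word, score)
```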

In [None]:
def load_sentiment_dict(filename):
    """
    Gets sentiment scores from a CSV file of the form word,score
    
    Parameters
    ----------
    filename: str
        The name of the file to read
        
    Returns
    -------
    word2score: dict
        Dictionary where keys are words and values are sentiment scores
    """

In [None]:
sentiment_f = 'labMT-en.csv'
word2score = load_sentiment_dict(sentiment_f)

### 1b. Tweets

Next, we want to load the tweets. The tweets are in a new file format: `.json`. JSON is a way of storing data as nested dictionaries and lists. Our file stores multiple JSON objects, one on each line. 

I have written a function below that loads the tweets from the JSON file. Notice how it is _very_ similar to other functions we've written for reading files. The only difference is that it uses the `json.loads` function instead of `split`. Remember to import the `json` module too by running the cell.
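Here is a minimal sketch of what `json.loads` does to one line of such a file. The line below is a made-up, trimmed-down record, not a real tweet from the dataset.

```python
import json

# A made-up line shaped like one record in the file
line = '{"id": "123", "text": "hello world", "public_metrics": {"like_count": 2}}\n'

# json.loads turns the JSON string into ordinary Python objects
record = json.loads(line.strip())

print(type(record))                              # a regular Python dict
print(record["text"])                            # top-level key
print(record["public_metrics"]["like_count"])    # key inside a nested dict
```

The result is a plain dictionary, so everything you already know about dictionaries applies to it.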

In [10]:
import json

In [11]:
def load_tweets(filename):
    """
    Gets tweet data from a JSON file
    
    Parameters
    ----------
    filename: str
        The name of the file to read
        
    Returns
    -------
    tweets: list of dicts
        List where each element is a dictionary with the data for one tweet
    """
    tweets = []
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            tweet = json.loads(line.strip())
            tweets.append(tweet)
            
    return tweets

In [12]:
tweets_f = 'vaccine_tweets.json'
tweets = load_tweets(tweets_f)

In [13]:
tweets[0]

{'id': '1437538884209020933',
 'text': "RT @crissles: @NICKIMINAJ your cousin's friend prolly just picked up an STD but please keep going 💀",
 'lang': 'en',
 'author_id': '130224558',
 'created_at': '2021-09-13T22:09:12',
 'conversation_id': '1437538884209020933',
 'possibly_sensitive': False,
 'reply_settings': 'everyone',
 'source': 'Twitter for iPhone',
 'public_metrics': {'retweet_count': 8424,
  'reply_count': 0,
  'like_count': 0,
  'quote_count': 0},
 'referenced_tweets': [{'retweeted': '1437534026324221958'}],
 'user': {'username': 'Vrn_TM',
  'name': 'V™',
  'created_at': '2010-04-06T17:57:36',
  'description': 'This is the Unofficial Twitter Account of VLA 🇭🇹',
  'location': 'Boston, MA',
  'pinned_tweet_id': None,
  'public_metrics': {'followers_count': 99,
   'following_count': 357,
   'tweet_count': 54},
  'url': None,
  'profile_image_url': 'https://pbs.twimg.com/profile_images/1262542770238967810/YIzrsRpx_normal.jpg',
  'verified': False}}

I encourage you to spend a moment looking at the data. Examine the nested dictionary structure and try to make sense of it. Play around with it by entering different keys if you're not sure about how the data is stored. All of the tweet fields are explained [here](https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet), and all of the user fields are explained [here](https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/user).
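For instance, nested values are reached by chaining keys. The sketch below uses `sample_tweet`, a trimmed-down copy of a few fields from the first tweet shown above, so that it runs on its own.

```python
# A trimmed-down copy of a few fields from the first tweet above
sample_tweet = {
    'id': '1437538884209020933',
    'public_metrics': {'retweet_count': 8424, 'reply_count': 0},
    'user': {'username': 'Vrn_TM',
             'public_metrics': {'followers_count': 99}},
}

# One key gets a top-level value
print(sample_tweet['id'])

# Chain keys to reach values inside nested dictionaries
print(sample_tweet['public_metrics']['retweet_count'])
print(sample_tweet['user']['username'])
print(sample_tweet['user']['public_metrics']['followers_count'])

# keys() shows what is available at a given level
print(sorted(sample_tweet['user'].keys()))
```

Note that `public_metrics` appears twice: once for the tweet and once for the user, each at a different level of nesting.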

In [4]:
# "pprint" stands for "pretty print." It's good for visualizing nested dictionary structures.
from pprint import pprint

n_examples = 3
for indx in range(n_examples):
    print(f'Example {indx+1}')
    print('--------------------------------------\n')
    pprint(tweets[indx])
    print('\n\n\n')

Example 1
--------------------------------------

{'author_id': '130224558',
 'conversation_id': '1437538884209020933',
 'created_at': '2021-09-13T22:09:12',
 'id': '1437538884209020933',
 'lang': 'en',
 'possibly_sensitive': False,
 'public_metrics': {'like_count': 0,
                    'quote_count': 0,
                    'reply_count': 0,
                    'retweet_count': 8424},
 'referenced_tweets': [{'retweeted': '1437534026324221958'}],
 'reply_settings': 'everyone',
 'source': 'Twitter for iPhone',
 'text': "RT @crissles: @NICKIMINAJ your cousin's friend prolly just picked up "
         'an STD but please keep going 💀',
 'user': {'created_at': '2010-04-06T17:57:36',
          'description': 'This is the Unofficial Twitter Account of VLA 🇭🇹',
          'location': 'Boston, MA',
          'name': 'V™',
          'pinned_tweet_id': None,
          'profile_image_url': 'https://pbs.twimg.com/profile_images/1262542770238967810/YIzrsRpx_normal.jpg',
          'public_m

## 2. Cleaning Text

Before we can conduct our sentiment analysis, we need to clean our text. Write a function that takes a piece of text and does the following:

1. Splits the text into individual words
2. Lower cases each word
3. Removes any hashtags (words that start with a # symbol, like #freenickiminaj or #vaccines)
4. Removes any handles and mentions (words that start with a @ symbol, like @NICKIMINAJ or @StephenAtHome)
5. Removes any punctuation

The final output should be a _list of words_. Hints:
- Remember, you have the `split` function in your toolkit
- Python has a built-in list of punctuation that you can get by adding the following to your code:

```python
# Note: this can be outside of the function definition (and I recommend it)
import string
punctuation = set(string.punctuation)
```

- You can combine a list of characters into a string like so:

```python
chars = ['r', 'y', 'a', 'n', ' ', 'g' , 'a', 'l', 'l', 'a', 'g', 'h', 'e', 'r']
name = ''.join(chars)
```
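Putting the two hints together, one possible way to strip punctuation from a single word is sketched below. The example word is arbitrary, and this is only one of several reasonable approaches.

```python
import string

punctuation = set(string.punctuation)

word = "claims,"

# Keep only the characters that are not punctuation
kept_chars = []
for ch in word:
    if ch not in punctuation:
        kept_chars.append(ch)

# Glue the remaining characters back into one string
stripped = ''.join(kept_chars)
print(stripped)

# startswith is a handy check for hashtags and handles
print("#vaccines".startswith("#"))
print("@NICKIMINAJ".startswith("@"))
```

Keep in mind that `#` and `@` are themselves in `string.punctuation`, so the order of your steps matters: if you strip punctuation first, you can no longer tell which words were hashtags or handles.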

In [None]:
def clean_text(text):
    """
    Preprocesses text for sentiment analysis by lowering the case, 
    removing hashtags and handles, and removing any punctuation
    
    Parameters
    ----------
    text: str
        The text to clean
        
    Returns
    -------
    cleaned_text: list of strs
        List of strings where each element is a word from the cleaned text
    """

## 3. Sentiment Analysis

### 3a. Sentiment of an Individual Tweet

Write a function that takes a tweet dictionary object (not just text) and gets the average sentiment of the tweet. 

1. Get the text of the tweet
2. Clean the tweet text
3. Use the sentiment dictionary to sum the total sentiment of the tweet
4. Divide the total by the number of words scored to get the average sentiment
5. If the total number of words scored is 0 (i.e. we can't measure sentiment for the tweet with our dictionary because no words in the tweet are in our dictionary), return `None`
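The last step exists because dividing by zero raises an error in Python. A minimal sketch of the guard, using a hypothetical helper named `safe_average` that is not part of the assignment's required functions:

```python
def safe_average(total, n_scored):
    """Hypothetical helper: return the average, or None if nothing was scored."""
    if n_scored == 0:
        # No words were found in the dictionary, so there is nothing to average
        return None
    return total / n_scored


print(safe_average(11.0, 2))
print(safe_average(0.0, 0))
```

Returning `None` rather than 0 keeps "we could not measure this tweet" distinct from "this tweet is extremely unhappy".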

In [None]:
def get_tweet_sentiment(tweet, word2score):
    """
    Calculates the average sentiment for an individual tweet 
    using a sentiment dictionary
    
    Parameters
    ----------
    tweet: dict
        Dictionary representing the data for a tweet
    word2score: dict
        Dictionary where keys are words and values are sentiment scores
        
    Returns
    -------
    avg_sentiment: float or None
        The average sentiment of the tweet, or None if no words were scored
    """

### 3b. Auditing Sentiment of Individual Tweets

Now, write code to get the 5 tweets with the highest sentiment, and the 5 tweets with the lowest sentiment. Your code does not have to be wrapped in a function, but it should print out three things:
1. The tweet ID
2. The _original_ tweet text (not the cleaned text)
3. The sentiment score

Do these match your intuition of what should be lowest and highest? Why are these the tweets with the highest and lowest sentiment scores?

**Note, content warning:** If your code is working properly, one of the lowest sentiment tweets will have a gendered slur in it.


**Hint:** Look at how we used `sorted` a few weeks ago with our baseball leaderboards.
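As a refresher, here is a small sketch of ranking with `sorted` and a key function. The IDs and scores in `scored_tweets` are made up, and the example takes the top and bottom two rather than five to stay short.

```python
# Made-up (tweet_id, score) pairs; None marks a tweet with no scored words
scored_tweets = [('a', 5.2), ('b', None), ('c', 7.9), ('d', 3.1), ('e', 6.4)]

# Drop unscored tweets first: Python cannot compare None with a float
with_scores = []
for pair in scored_tweets:
    if pair[1] is not None:
        with_scores.append(pair)


def get_score(pair):
    """Tell sorted to rank by the second element of each pair."""
    return pair[1]


# reverse=True puts the highest scores first
ranked = sorted(with_scores, key=get_score, reverse=True)

top_two = ranked[:2]       # highest sentiment
bottom_two = ranked[-2:]   # lowest sentiment

print(top_two)
print(bottom_two)
```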

### 3c. Sentiment of a Group of Tweets

Write a function that takes a list of tweets and calculates the average sentiment across _all_ of them together. That is, treat all of the tweets like one single, large text and calculate a single sentiment score across all of them. Remember, the average sentiment is the total sentiment divided by the total number of words that were scored.
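Note that this pooled average is generally not the same as averaging the per-tweet averages. The sketch below shows the difference with two made-up tweets, each summarized as a (total sentiment, number of words scored) pair.

```python
# Made-up per-tweet tallies: (total sentiment, number of words scored)
tallies = [(8.0, 1), (12.0, 4)]

# Average of the per-tweet averages: every tweet counts equally
per_tweet_avgs = []
for total, n in tallies:
    per_tweet_avgs.append(total / n)
mean_of_means = sum(per_tweet_avgs) / len(per_tweet_avgs)

# Pooled average: every scored word counts equally
grand_total = 0.0
grand_n = 0
for total, n in tallies:
    grand_total += total
    grand_n += n
pooled = grand_total / grand_n

print(mean_of_means)
print(pooled)
```

In the pooled version, a tweet with one scored word cannot swing the result as much as a tweet with many scored words, which is the behavior described above.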

In [None]:
def get_corpus_sentiment(tweets, word2score):
    """
    Calculates the average sentiment of a corpus of tweets
    using a sentiment dictionary
    
    Parameters
    ----------
    tweets: list of dicts
         List where each element is a dictionary with the data for one tweet
    word2score: dict
        Dictionary where keys are words and values are sentiment scores
        
    Returns
    -------
    avg_sentiment: float
        The average sentiment of the entire corpus of tweets
    """

In [None]:
avg_sentiment = get_corpus_sentiment(tweets, word2score)
print(avg_sentiment)

In [1]:
text = 'hi my name is ryan #old'

In [2]:
strings = text.split(" ")

In [3]:
print(strings)

['hi', 'my', 'name', 'is', 'ryan', '#old']


In [4]:
for string in strings:
    print(string)

hi
my
name
is
ryan
#old


In [6]:
if 'a' in 'animal':
    print('hi')

hi


In [7]:
for n in [1, 2, 3, 4, 5, 6]:
    if n % 2 == 0:
        continue
        
    print(n)

1
3
5


In [17]:
text = 'Hi my name is Ryan #gradschool'
split_words = text.split()

In [18]:
print(split_words)

['Hi', 'my', 'name', 'is', 'Ryan', '#gradschool']


In [None]:
cleaned_text = []