
STA 4373 – Computational Methods in Statistics

Fall 2022

STA 4373 Assignment 3

Instructions.

In this assignment you’ll analyze a Twitter dataset and create a PDF of your results using the same Quarto template I posted to Canvas. As before, when you turn in the file, the filename should be the group members’ last names separated by dashes and terminated with -3.pdf. For example, if Joe Shmo, Jane Doe, and Mickey Mouse worked together, they would turn in shmo-doe-mouse-3.pdf.

Again, you may use your text and work in groups of up to three. Only one delegate of your team will submit the resulting PDF on Canvas. The PDF should have the names of all collaborators at the top. The main advantage of working in a group is that you can bounce ideas off one another and, hopefully, uncover more interesting features of the data.

You may use the internet to access the text’s webpage, other websites directly linked in this document, and answers to general-purpose questions about data science in R. However, you may not read or use any analyses of this or related datasets you find online. Failure to follow this rule may be considered a violation of this course’s academic integrity policy. If you have any questions about this, please contact me.

Please put a new page break before each question so each question starts on its own page (this will facilitate grading) and never provide output that runs over more than one page if you can help it. Be sure to echo all your code!

Twitter data.

rtweet. The R package rtweet is used to download data from Twitter. To do this yourself, you would need to get a Twitter account and read through the documentation at https://rtweet.info/index.html. This is an easy process, but you won’t need to do it here as I have downloaded the data for you and stored it as an .rds file.

For what follows, please do not use any built-in rtweet functions, for example to extract pieces of data from the downloaded data object. Instead, use purrr verbs.
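As a rough illustration of what working with purrr verbs can look like, the sketch below pulls fields out of a small nested list. The object toy_tweets and its contents are made up for illustration and are not the structure of the downloaded data; the sketch assumes the purrr package is installed.

```r
library(purrr)

# Hypothetical stand-in for a downloaded object: a list of records.
toy_tweets <- list(
  list(id_str = "1", full_text = "first tweet",  favorite_count = 10L),
  list(id_str = "2", full_text = "second tweet", favorite_count = 25L)
)

# map_chr() extracts a character field from every record by name.
texts <- map_chr(toy_tweets, "full_text")

# map_int() does the same for an integer field.
likes <- map_int(toy_tweets, "favorite_count")

texts
likes
```

Each map_*() variant returns an atomic vector of a fixed type, which is what makes the results easy to place in a tibble column.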

Questions.

1. Donald Trump’s Twitter handle is @realDonaldTrump. I recently used get_timeline() to scrape his last 2,000 or so tweets and saved them in a file named dt_tweets.rds. Read in this dataset and store it in an object called tweets. Run the code below to show you’ve succeeded.

tweets |> select(1:4) |> glimpse()

# Rows: 2,382

# Columns: 4

# $ created_at <dttm> 2021-01-08 09:44:28, 2021-01-08 08:46:38, 2021-01-07 18:10~

# $ id         <dbl> 1.347570e+18, 1.347555e+18, 1.347335e+18, 1.346913e+18, 1.3~

# $ id_str     <chr> "1347569870578266115", "1347555316863553542", "134733480405~

# $ full_text  <chr> "To all of those who have asked, I will not be going to the~
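For readers who have not worked with .rds files before, here is a minimal sketch of the round trip in base R. The data frame toy and the temporary file path are invented for illustration; the assignment’s file is dt_tweets.rds.

```r
# Write a small made-up data frame to a temporary .rds file ...
toy <- data.frame(id_str = c("1", "2"), full_text = c("hello", "world"))
path <- tempfile(fileext = ".rds")
saveRDS(toy, path)

# ... and read it back in; readRDS() returns the stored R object.
toy_back <- readRDS(path)

# The object that comes back should match the one that was saved.
identical(toy, toy_back)
```

An .rds file holds a single R object, so the result of readRDS() has to be assigned to a name of your choosing.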

2. Tweets in Twitter must be between 0 and 280 characters long.  Make a histogram of the numbers of characters in text.  Do you notice anything strange?  You don’t have to investigate why this is happening, just identify the oddity.

Note: Don’t use any width column that may or may not be in the dataset; compute the lengths yourself from the tweets.
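One possible way to compute lengths directly from text is sketched below on made-up strings, assuming the stringr package is installed. The vector toy_text is hypothetical and stands in for a text column.

```r
library(stringr)

# Made-up strings standing in for tweet text.
toy_text <- c("short", "a bit longer tweet", "")

# str_length() counts the characters in each string.
n_chars <- str_length(toy_text)
n_chars
```

A vector like n_chars can then be passed to whatever histogram function you prefer.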

3. hms::as_hms() converts a datetime object into an hours/minutes/seconds object (of class hms). Use hms::as_hms() to make a histogram of the times of day at which Trump tweeted. Polish your graphic.

Hint:  Check out this thread!
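To see what hms::as_hms() does to a datetime, consider the small sketch below. The timestamps are made up and are assumed to be in UTC; the time zone of the real data may differ, and the sketch assumes the hms package is installed.

```r
library(hms)

# Made-up timestamps, assumed to be in UTC for this illustration.
stamps <- as.POSIXct(
  c("2021-01-08 09:44:28", "2021-01-07 18:10:00"),
  tz = "UTC"
)

# as_hms() drops the date and keeps only the time of day.
times <- as_hms(stamps)
times

# An hms value is stored as seconds since midnight,
# so it converts to decimal hours with simple arithmetic.
as.numeric(times) / 3600
```

Because the underlying value is numeric, an hms column can be used on a continuous plot axis.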

4. Twitter users can mention users or direct tweets to users using the @ symbol directly followed by a user name, between 1 and 15 (word) characters long (here are the technical rules, but you don’t need to use them). Use string processing to determine the top 20 users Donald Trump mentioned in his tweets by analyzing the text of the tweets.
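The pattern described above can be tried out on a few invented strings before applying it to the real data. This is only a sketch, assuming the stringr package is installed; toy_text and the handles in it are hypothetical.

```r
library(stringr)

# Made-up tweets with invented handles.
toy_text <- c(
  "Thanks @alice and @bob_99 for coming!",
  "Great talk by @alice today",
  "No mentions here"
)

# str_extract_all() returns one character vector of matches per tweet.
mentions <- str_extract_all(toy_text, "@\\w{1,15}")

# Flatten, tabulate, and sort from most to least frequent.
counts <- sort(table(unlist(mentions)), decreasing = TRUE)
head(counts, 20)
```

Note that this simple pattern would also match the part of an email address that follows the @ sign, so it is worth inspecting the matches.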

5. Other Twitter users can like tweets, and these are included in the data structure as favorite_count. Print Trump’s 15 most popular tweets to the screen, formatting them nicely with functions like str_wrap().
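As a small illustration of what str_wrap() does, the sketch below wraps one invented string. It assumes the stringr package is installed; long_text is hypothetical and the width of 30 is an arbitrary choice.

```r
library(stringr)

# An invented string that is too long for a narrow column.
long_text <- "This is an invented tweet that is long enough to need wrapping when printed."

# str_wrap() inserts newlines so lines stay near the target width.
wrapped <- str_wrap(long_text, width = 30)

# cat() prints the newlines as actual line breaks.
cat(wrapped, "\n")
```

Printing with cat() rather than print() matters here, because print() shows the newline characters as \n instead of breaking the line.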

6. A Twitter user has two ways of re-sharing past tweets (their own or someone else’s):  retweeting or quote tweeting.

Retweeting refers to simply re-sharing someone’s past tweet, so that the followers of the retweeter’s account would see the tweet, which may have come from an account that the retweeter’s followers would not have otherwise seen.

Quote tweeting refers to re-sharing someone’s past tweet along with commentary.

In this question we’re only interested in retweets. Unfortunately, there is not a simple binary variable in the dataset that indicates whether a given tweet is a retweet. Nevertheless, we can create one like this:

tweets <- tweets %>%
  mutate(
    "is_retweet" = map_lgl(retweeted_status, ~ !is.na(.x$created_at))
  )

You can confirm this (roughly, at least) by looking at the resulting tibble and Donald Trump’s Twitter page:

tweets %>% select(created_at, is_retweet, is_quote_status, text) %>% print(n = 20)

Explain how the code in the mutate() call above works.

Note: Another way to make the column would be to check whether the text of the tweet begins with something like "RT @\\w{1,15}: "; that’s just not the route we take here.
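The text-based alternative mentioned in the note can be sketched on a few invented strings. This assumes the stringr package is installed; toy_text is hypothetical, and the leading ^ is an addition that anchors the pattern to the start of the string.

```r
library(stringr)

# Made-up tweets; only the first follows the retweet format.
toy_text <- c(
  "RT @alice: something worth sharing",
  "My own thoughts, mentioning @bob",
  "RT is not always a retweet"
)

# "^" anchors the pattern to the start of the string.
looks_like_rt <- str_detect(toy_text, "^RT @\\w{1,15}: ")
looks_like_rt

# A capture group pulls out the handle; non-matching rows give NA.
handles <- str_match(toy_text, "^RT @(\\w{1,15}): ")[, 2]
handles
```

str_match() returns a matrix whose first column is the full match and whose later columns are the capture groups, which is why column 2 is selected.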

7. Which user handles does Trump retweet most (not quote tweet)? List the top 10.

Hint: Re-read the note from the previous question!

8. This is quite the interesting dataset.  Ask one additional question of the dataset and assess it using string processing.