Lab session 7, STATS 220


Data Technologies

The point of this lab is to get started using R and to practice reading text file data into R and calculating simple summaries from data.

Your answer will consist of a file containing R code; you can submit either a plain text file containing R code or a plain text file containing R markdown code. Please DO NOT submit anything other than a plain text file (e.g., DO NOT submit a Word document or a PDF document or an HTML document).

We will work with three CSV files called trump-tweets-num-2010.csv, trump-tweets-num-2011.csv, and trump-tweets-num-2012.csv that contain data on tweets from the Twitter account of Donald Trump (from 2010 to 2012).

Within these files, every row provides information for one of Donald Trump's tweets, mostly about when the tweet was sent (wday is day of the week, min is minutes, and sec is seconds), but also how many times the tweet was retweeted. The first few rows of the file trump-tweets-num-2010.csv are shown in Figure 1.

The data files are available on Canvas.

retweet_count,month,day,wday,hour,min,sec
144,11,30,4,21,42,1
109,11,23,4,16,26,18
112,11,16,4,14,30,23
250,11,14,2,20,55,30
12,11,13,1,16,42,27
14,11,13,1,16,39,7
24,11,13,1,16,30,47
44,11,10,5,14,42,15
55,11,9,4,20,2,3
24,11,2,4,15,32,49
31,10,29,1,15,52,46
69,10,24,3,18,41,32
32,10,24,3,17,20,54
19,10,24,3,15,53,23
26,10,22,1,17,22,23
21,10,18,4,17,11,35
27,10,18,4,15,45,35
34,10,15,1,19,42,7
28,10,11,4,15,20,8

Figure 1: The first few lines of the file trump-tweets-num-2010.csv.

NOTE: You should submit a file containing R code that assigns values to the appropriate symbols. I will run the code in your file and then check the values that have been assigned to the symbols.

NOTE: Your file should ONLY contain valid R code, properly indented, and with comments. You should be able to copy-and-paste your entire file of R code into R and get no errors.

NOTE: You should submit your answers on Canvas.

1. Write an R expression that reads the file trump-tweets-num-2010.csv and assigns the result to the symbol tweets2010.
NOTE: your code can assume that the data file is in the current working directory. The symbol tweets2010 should print like this:
> head(tweets2010)
  retweet_count month day wday hour min sec
1           144    11  30    4   21  42   1
2           109    11  23    4   16  26  18
3           112    11  16    4   14  30  23
4           250    11  14    2   20  55  30
5            12    11  13    1   16  42  27
6            14    11  13    1   16  39   7
> dim(tweets2010)
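
As an illustration of the reading step, here is a minimal sketch that uses the first three data rows of Figure 1 as inline text, so that it runs without the data files. The symbol names csvText and tweetsSample are made up for this example and are not part of the required answer; with the real file in the working directory, the file name would be passed to read.csv() in place of the text argument.

```r
# Sketch only: the header and first three data rows of Figure 1
# stand in for the real file.
csvText <- "retweet_count,month,day,wday,hour,min,sec
144,11,30,4,21,42,1
109,11,23,4,16,26,18
112,11,16,4,14,30,23"

# read.csv() treats the first line as column names by default
# (header = TRUE) and returns a data frame.
tweetsSample <- read.csv(text = csvText)

head(tweetsSample)  # the first few rows
dim(tweetsSample)   # number of rows, then number of columns
str(tweetsSample)   # the type of each column
```

Checking dim() and str() straight after reading is a cheap way to confirm that the header line and the column types were interpreted as expected.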

2. Write an R expression that calculates the maximum value from the file trump-tweets-num-2010.csv and assigns the result to the symbol maxRetweet2010.

The symbol maxRetweet2010 should print like this:

[1] 3813

Some things to think about:
- Why can we just calculate the maximum value for the whole file, rather than having to focus just on the retweet_count column?
- Is this calculation inefficient? Does it matter?
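
One way to explore the first question is to compare the two calculations on a few rows. The sketch below uses values taken from Figure 1; the symbol names (tweetsSample, lowSample, and so on) are made up for this example. It relies on max() accepting a data frame whose columns are all numeric.

```r
# Three rows from Figure 1 (only some of the columns, for brevity).
tweetsSample <- data.frame(
  retweet_count = c(144, 109, 250),
  hour = c(21, 16, 20),
  min = c(42, 26, 55),
  sec = c(1, 18, 30)
)

# Maximum over every value in the data frame ...
maxWhole <- max(tweetsSample)
# ... versus the maximum of just the retweet column.
maxColumn <- max(tweetsSample$retweet_count)
maxWhole == maxColumn

# Two other rows from Figure 1 where every retweet count is small:
# here the whole-frame maximum comes from the min column instead.
lowSample <- data.frame(
  retweet_count = c(12, 14),
  hour = c(16, 16),
  min = c(42, 39),
  sec = c(27, 7)
)
max(lowSample)
max(lowSample$retweet_count)
```

The time columns are bounded (for example, min and sec never exceed 59), so the shortcut only gives the retweet maximum when at least one retweet count is larger than every time value.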

3. Write R code to calculate the largest number of retweets across all three files.
Assign your answer to the symbol maxRetweet. You should get a result that prints like this:
> maxRetweet

[1] 141644

Some things to think about:

- How unusual is this retweet value?
- How would you find out how unusual it is?
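
The sketch below shows one possible shape for this calculation. The three vectors are invented stand-ins for the retweet_count columns of the three files (the values are not the real data), and all symbol names are hypothetical.

```r
# Invented retweet counts standing in for the three yearly files.
retweets2010 <- c(144, 109, 250)
retweets2011 <- c(31, 69, 32)
retweets2012 <- c(55, 24, 1000)

# max() accepts several vectors and returns the largest value overall.
maxAll <- max(retweets2010, retweets2011, retweets2012)

# To judge how unusual the maximum is, compare it with the rest
# of the values.
allRetweets <- c(retweets2010, retweets2011, retweets2012)
median(allRetweets)            # a typical value
summary(allRetweets)           # quartiles, mean, and extremes
mean(allRetweets < maxAll)     # proportion of tweets below the maximum
```

A histogram or boxplot of the combined values would be another reasonable way to see whether the maximum sits far from the bulk of the data.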

4. Write R code to calculate the latest time (before midnight), in seconds, that Donald Trump sent out a tweet.

Assign your answer to the symbol maxTweetTime. You should get a result that prints like this:
> maxTweetTime

[1] 86290

Some things to think about:

- Why did I specify "before midnight"?
- How would you convert this value into hours, minutes, and seconds?
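
The arithmetic involved can be sketched as follows, assuming that a time of day is measured as seconds since midnight. The three rows come from Figure 1; the symbol names are made up for this example. The second half converts the printed value 86290 back into hours, minutes, and seconds using integer division (%/%) and remainder (%%).

```r
# hour, min, and sec for the first three rows of Figure 1.
tweetsSample <- data.frame(
  hour = c(21, 16, 14),
  min = c(42, 26, 30),
  sec = c(1, 18, 23)
)

# Seconds since midnight for each tweet: 3600 seconds per hour,
# 60 seconds per minute.
secondsSinceMidnight <- tweetsSample$hour * 3600 +
  tweetsSample$min * 60 +
  tweetsSample$sec
max(secondsSinceMidnight)

# Converting a number of seconds back into hours, minutes, seconds.
totalSeconds <- 86290
hoursPart <- totalSeconds %/% 3600
minutesPart <- (totalSeconds %% 3600) %/% 60
secondsPart <- totalSeconds %% 60
timeLabel <- sprintf("%02d:%02d:%02d", hoursPart, minutesPart, secondsPart)
timeLabel
```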

 [EXTRA for EXPERTS - NO MARKS]

Write R code that shows the complete row of data for the latest (before-midnight) tweet ...

   retweet_count month day wday hour min sec
86            25     5   5    6   23  33  42

... and write code to produce a message that states the latest time (before midnight), including the date, that Donald Trump sent out that tweet ...

Donald's latest (pre-midnight) tweet was at 23:33:42 on Wednesday 05 May
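
A rough sketch of the row selection and message formatting is given below. The two-row data frame mixes one row from Figure 1 with the row shown above, purely for illustration, and the symbol names are hypothetical. The weekday name is left out of the message because the handout does not say how the wday codes map to day names.

```r
# Two illustrative rows: one from Figure 1 and the row shown above.
tweetsSample <- data.frame(
  retweet_count = c(144, 25),
  month = c(11, 5),
  day = c(30, 5),
  wday = c(4, 6),
  hour = c(21, 23),
  min = c(42, 33),
  sec = c(1, 42)
)

# Seconds since midnight for each row, then the row with the largest value.
tweetSeconds <- with(tweetsSample, hour * 3600 + min * 60 + sec)
latestTweet <- tweetsSample[which.max(tweetSeconds), ]
latestTweet

# month.name is a built-in vector of English month names.
latestMessage <- sprintf(
  "Donald's latest (pre-midnight) tweet was at %02d:%02d:%02d on %02d %s",
  latestTweet$hour, latestTweet$min, latestTweet$sec,
  latestTweet$day, month.name[latestTweet$month]
)
latestMessage
```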
