DS 710 Homework 6 - R assignment

1. Can we detect when a marketing campaign has been successful?

a. On homework 4, you simulated data from the TableFarm salad chain before and after the implementation of a new marketing campaign. Read the combined data (both before and after) into R. (You could do this by saving the data as a .csv file and using read.csv(), or by copying the data into a text file, separating the values by commas, and enclosing the data in c( ... ) to make a vector.) The Homework 4 TableFarm problem is reproduced below.

Average monthly revenue at each store in the TableFarm salad chain is $100,000, with a standard deviation of $12,000. An advertising firm claims they can increase monthly revenue to $120,000, but the standard deviation will be increased as well, to $25,000.

Write Python code to generate three lists of random numbers which model potential revenue: one list with 12 months of revenue using the current mean and standard deviation, another list with 12 months of revenue using the predicted mean and standard deviation, and a third list combining your first two lists. You can assume a normal distribution. Round each number to the nearest $1,000.
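One possible way to approach this simulation is sketched below, using only Python's standard library. The function name simulate_revenue, the variable names, and the seed are assumptions chosen for illustration; they are not part of the assignment.

```python
import random

def simulate_revenue(mean, sd, months=12, rng=random):
    """Draw `months` normally distributed revenues, rounded to the nearest $1,000."""
    return [int(round(rng.gauss(mean, sd), -3)) for _ in range(months)]

rng = random.Random(710)  # arbitrary seed (an assumption) so the run is repeatable

before = simulate_revenue(100_000, 12_000, rng=rng)  # current mean and sd
after = simulate_revenue(120_000, 25_000, rng=rng)   # predicted mean and sd
combined = before + after                            # 24 months, before then after

print(before)
print(after)
print(combined)
```

Writing the combined list to a .csv file (one value per line) would make it easy to load with read.csv() in part a.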

b. Make a scatterplot of the data.  Add a vertical line to mark the month in which the new marketing campaign began, and add a legend to your plot.

c. Make side-by-side boxplots of the revenue before and after implementing the marketing campaign.  Write a few sentences describing and comparing the boxplots, and relating them to the underlying model you used to simulate the data.

d. Based on the way you simulated the data, you know that the marketing campaign was successful; that is, the data after implementing the marketing campaign was simulated from an underlying model with a higher mean than before the marketing campaign.  However, in real life we probably wouldn't know this.  Based on the scatterplot and boxplots, would you be confident in claiming that the marketing campaign was successful?  Why or why not?

e. Write the null and alternative hypotheses for a test of whether the marketing campaign was successful.  (I.e., whether the mean revenue with the marketing campaign is higher than the mean revenue before the marketing campaign.)

f. In a few sentences, explain why a 2-sample, 1-sided t-test is appropriate for testing the hypotheses in part e.

g. Conduct a 2-sample, 1-sided t-test in R.  Include the R output and state your conclusion in the context of the problem.
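For reference, the statistic behind the test in part g can be computed by hand. The sketch below is in Python rather than R and uses made-up revenue figures (in thousands of dollars), not the simulated homework data. It assumes the Welch (unequal-variance) form of the test, which is the default in R's t.test() and matches the different standard deviations in the model. It computes only the t statistic and degrees of freedom, not the p-value.

```python
from math import sqrt
from statistics import mean, variance

# Made-up monthly revenues in thousands of dollars (illustrative only).
before = [98.0, 102.0, 100.0, 96.0, 104.0]
after = [118.0, 130.0, 110.0, 125.0, 117.0]

n1, n2 = len(before), len(after)
v1 = variance(before) / n1   # squared standard error of the "before" mean
v2 = variance(after) / n2    # squared standard error of the "after" mean

# Welch t statistic for H_a: mean(after) > mean(before)
t_stat = (mean(after) - mean(before)) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation to the degrees of freedom
df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

print(t_stat, df)
```

A large positive t statistic is evidence for the 1-sided alternative that revenue is higher with the campaign.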

2. Can we detect an association between chocolate consumption and Nobel prizes? The Homework 4 problems referred to are reproduced below:

Researchers have observed a (presumably spurious) correlation between per capita chocolate consumption and the rate of Nobel prize laureates: see "Chocolate Consumption, Cognitive Function, and Nobel Laureates." In this problem, we will create some sample data to simulate this relationship.

Write Python code to produce a list of 50 ordered pairs (c, n), where c represents chocolate consumption in kg/year/person and n represents the number of Nobel laureates per 10 million population. The values for c should be random numbers (not necessarily integers!) between 0 and 15. You may assume that c and n are related by

n = 0.4 · c - 0.8.

However, it is not possible for a nation to have a negative number of Nobel laureates, so if your predicted value of n is less than 0, replace that value by 0.

Report your values of c and n to 2 decimal places. Print your list of ordered pairs.
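A minimal sketch of this step follows, assuming the c values are drawn uniformly from the interval 0 to 15 (the problem only says "random numbers between 0 and 15") and that n is computed from the unrounded c before both are rounded. The seed and variable names are illustrative.

```python
import random

rng = random.Random(4)  # arbitrary seed (an assumption), only for repeatability

pairs = []
for _ in range(50):
    c = rng.uniform(0, 15)        # chocolate consumption, kg/year/person
    n = max(0.0, 0.4 * c - 0.8)   # replace negative predictions by 0
    pairs.append((round(c, 2), round(n, 2)))

print(pairs)
```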

Problem - Error Term

Our list of data from part (a) is not a good simulation of real-world data, because it is perfectly linear. Starting with the c and n values you generated in part (a), generate new n values, using the following formula:

n_ε = n + ε.

Here ε should be a random variable with normal distribution, mean 0, and standard deviation 1. Using the list of ordered pairs generated in 3(a), create a new list of 50 ordered pairs (c, n_ε).

Again, your simulated data should not predict negative numbers of Nobel laureates. Again, do not generate a new list; make sure to use the list of ordered pairs already generated in 3(a).

Print your new list of ordered pairs.
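The noise step might look like the sketch below. The short list named pairs is a stand-in with illustrative values; in the homework it would be the 50 pairs already generated, reused without drawing new c values. The error term is drawn from a normal distribution with mean 0 and standard deviation 1, and negative results are again replaced by 0.

```python
import random

rng = random.Random(6)  # arbitrary seed (an assumption), only for repeatability

# Stand-in for the existing (c, n) pairs; values are illustrative only.
pairs = [(0.50, 0.00), (2.00, 0.00), (5.25, 1.30), (9.80, 3.12), (14.10, 4.84)]

noisy_pairs = []
for c, n in pairs:
    eps = rng.gauss(0, 1)          # error term: normal, mean 0, sd 1
    n_eps = max(0.0, n + eps)      # no negative numbers of laureates
    noisy_pairs.append((c, round(n_eps, 2)))  # c is reused unchanged

print(noisy_pairs)
```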

a. On homework 4, you simulated data on countries' per-capita chocolate consumption and number of Nobel Prize winners, using an error term ε (representing random "noise"). Read these data into R and make a scatterplot of the number of Nobel Prize winners versus chocolate consumption.

b. Fit a linear model to the data.  What is the equation of the line of best fit?  How does it compare to the theoretical model you used to simulate the data?  Graph the line of best fit with the scatterplot.
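To see what the line of best fit computes, the least-squares slope and intercept can be worked out directly. The sketch below is in Python rather than R (where lm(n ~ c) would do the fitting) and uses four hypothetical points that lie exactly on the clamped model, with no noise added.

```python
# Hypothetical (c, n) pairs lying exactly on n = max(0, 0.4*c - 0.8); no noise.
c_vals = [0.0, 5.0, 10.0, 15.0]
n_vals = [0.0, 1.2, 3.2, 5.2]

c_bar = sum(c_vals) / len(c_vals)
n_bar = sum(n_vals) / len(n_vals)

# Sums of squares and cross-products about the means
s_cc = sum((c - c_bar) ** 2 for c in c_vals)
s_cn = sum((c - c_bar) * (n - n_bar) for c, n in zip(c_vals, n_vals))

slope = s_cn / s_cc                  # least-squares slope
intercept = n_bar - slope * c_bar    # least-squares intercept

print(f"n = {slope:.3f} * c + {intercept:.3f}")
```

For these points the fitted slope is about 0.352 and the intercept about -0.24, rather than 0.4 and -0.8, because the point at c = 0 was replaced by 0 instead of -0.8. This is one reason the fitted line may differ from the theoretical model.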

c. State the null and alternative hypotheses for a test of whether the number of Nobel Prize winners (per 10 million population) is associated with per-capita chocolate consumption.

d. State your conclusion about the hypotheses in part c, in the context of the problem.

e. Graph the diagnostic plots for the regression. Explain what they tell us.

3. In homework 5, you counted the frequencies of letters in two encrypted texts.  In this problem, you will use statistical analysis to identify the language in which the text was written, and decrypt it.

a. Read the letter frequencies from encryptedA into R and attach the data.  Use the following code to make a barplot of the letter frequencies, with the letters listed in order of increasing frequency:  (Here I've assumed that your columns were named "key" and "count".)

encrypt_order = order(count)

barplot( count[encrypt_order], names.arg = key[encrypt_order] )

Be sure you understand what this code does.
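As an aid to understanding, here is a rough Python analogue of the R code above, using a made-up four-letter data set. In R, order(count) returns the positions that would sort count in increasing order; indexing both vectors by those positions keeps each letter attached to its count.

```python
# Made-up data standing in for the "key" and "count" columns.
key = ["a", "b", "c", "d"]
count = [7, 2, 9, 4]

# Analogue of R's order(count): indices that sort count in increasing order.
# (R positions start at 1; Python indices start at 0.)
encrypt_order = sorted(range(len(count)), key=lambda i: count[i])

sorted_counts = [count[i] for i in encrypt_order]  # like count[encrypt_order]
sorted_keys = [key[i] for i in encrypt_order]      # like key[encrypt_order]

print(encrypt_order)
print(sorted_keys, sorted_counts)
```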

b. The file Letter Frequencies.csv contains data on the frequencies of letters in different languages. (Source: https://www.sttmedia.com/characterfrequency-english and https://www.sttmedia.com/characterfrequency-welsh, accessed 21 August 2015. Used by permission of Stefan Trost.) Read these data into R.

c. In a single graphing window, display two bar plots:  A plot on top showing the encrypted frequencies, and a plot below it showing the frequencies of letters in English.  Each plot should be sorted in order of increasing frequency.  Each plot should also have a title telling whether it is from the encrypted text or from plain English.

d. Based on the shape of the plots, do you think it is likely that the encrypted text came from English?  Explain.

e. We want to conduct a hypothesis test to be more precise about whether it is plausible that the text came from English.  To do this, we will pair up each letter in the encrypted text with a letter in English, based on the order of frequency.  So, encryptedA "r" is paired with English "e", encryptedA "c" is paired with English "t", etc.  Then we will test whether the resulting letter frequencies plausibly come from a random sample of English words.

To pair up the letters, sort the vector of counts from the encrypted text in order of increasing frequency, and store it as a new vector.  Then do the same thing with the vector of frequencies from English.

f. To pair up the letters, we need the data (the counts of letters from encryptedA.txt) and the probability model (the theoretical frequencies from Letter Frequencies.csv) to have the same number of letters.  Depending on how you formatted your output from Python, your letter counts may include 20 or 26 letters.  This is due to the fact that some letters did not appear in the encrypted text, so they appeared 0 times.  If necessary, prepend 6 zeroes to the count vector to make it the same length as the theoretical frequencies:

count = c( rep(0, 6), count )

g. State the null and alternative hypotheses for a chi-squared Goodness of Fit test of this question.

h. To satisfy the assumptions of a Goodness of Fit test, we need the expected counts of each category to be greater than or equal to 5.  Find the total number of letters in the encrypted text.  Then multiply this number by the probabilities from Letter Frequencies.csv to get the expected counts. 
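The arithmetic in part h is sketched below in Python with made-up numbers for four categories; the real vectors would have 26 entries. It assumes the model values are probabilities that sum to 1. If the file stores percentages instead, they would need to be divided by 100 first.

```python
# Made-up observed counts and model probabilities, sorted in increasing order.
observed = [3, 12, 25, 60]
probs = [0.02, 0.18, 0.30, 0.50]

total = sum(observed)                   # total number of letters in the text
expected = [total * p for p in probs]   # expected count for each category

# Positions whose expected count is below 5 and would need combining
small = [i for i, e in enumerate(expected) if e < 5]

print(total, expected, small)
```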

i. Combine categories (letters) to get expected counts that are greater than or equal to 5.  For example, if you decided to combine the first two categories, you could use the code

sortEnglish_combined = c( sum(sortEnglish[1:2]), sortEnglish[3:26] )

Combine the same categories in the encrypted counts.
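The same combining idea, written as a Python sketch with shortened, made-up vectors. The helper combine_first is a hypothetical name; it mirrors the R pattern c( sum(x[1:2]), x[3:26] ) and is applied identically to the expected and the observed values so the categories stay paired.

```python
def combine_first(values, k):
    """Merge the first k entries into one category and keep the rest as-is."""
    return [sum(values[:k])] + list(values[k:])

# Made-up sorted vectors (increasing order), shortened to five categories.
sort_english_expected = [1.5, 4.0, 9.5, 30.0, 55.0]
sort_encrypted_counts = [0, 6, 11, 28, 55]

english_combined = combine_first(sort_english_expected, 2)
encrypted_combined = combine_first(sort_encrypted_counts, 2)

print(english_combined)
print(encrypted_combined)
```

Combining changes the number of categories but not the totals, which is a quick check that nothing was dropped.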

j. Use R to conduct the chi-squared Goodness of Fit test. 
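In R, chisq.test() with its p argument carries out this test. To show what it computes, the Python sketch below works out the Goodness of Fit statistic by hand for three made-up categories. It stops at the statistic and degrees of freedom; the p-value would come from the chi-squared distribution.

```python
# Made-up observed and expected counts after combining categories.
observed = [15, 25, 60]
expected = [20.0, 30.0, 50.0]

# Chi-squared statistic: sum over categories of (O - E)^2 / E
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Degrees of freedom: number of categories minus 1
df = len(observed) - 1

print(chi_sq, df)
```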

k. State your conclusion in the context of the problem.

l. Repeat steps h-k for Welsh, and then repeat for both languages for encryptedB. Based on the hypothesis tests, which text do you think came from which language? How confident are you in your assessment?

m. Optional:  Try to decrypt the English text.  Simon Singh's Black Chamber website (https://www.simonsingh.net/The_Black_Chamber/substitutioncrackingtool.html) will automatically substitute letters for you, so you can test different possibilities for what English plaintext letter is represented by each letter in the ciphertext.  Start by substituting the letter E for the most common letter in the ciphertext.  Then use frequencies of letters in the ciphertext, common patterns of letters, and experimentation to determine other substitutions.

