Create another set of columns that indicate the difference


Assignment

Before we can run any type of analysis we need to make sure that the data has been cleaned. For this dataset the analysis techniques we are considering are:

• Regression Analysis

• Structural Equation Modeling

Neither technique handles missing data very well, so we need to make sure that all the missing data has been removed or otherwise addressed.

If the dataset is sufficiently large for the type of analysis we are planning to perform or if the percentage of missing data is relatively small (e.g., 5-10%), then we can just remove the rows with missing data. If this is not the case (our dataset is too small or there is a high percentage of missing data), then we may need to consider imputing the missing values - trying to "guess" at what they "should be".
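The decision rule above can be sketched in a few lines of Python. This is only an illustration, not part of the Excel assignment: the sample rows are made up, "percentage of missing data" is assumed to mean the share of rows with at least one blank, and the 10% cutoff is simply the upper end of the 5-10% guideline.

```python
# Hypothetical sample: each inner list is one respondent; None marks a missing answer.
rows = [
    [1, 2, 3],
    [4, None, 5],
    [2, 2, 1],
    [3, 4, 4],
]

def missing_row_share(rows):
    """Return the fraction of rows that contain at least one missing value."""
    incomplete = sum(1 for row in rows if any(v is None for v in row))
    return incomplete / len(rows)

share = missing_row_share(rows)

# Assumed cutoff: 10%, the upper end of the guideline in the text.
strategy = "drop rows" if share <= 0.10 else "consider imputation"
print(f"{share:.0%} of rows have missing data -> {strategy}")
```

With a real dataset the size of the sample also matters, so the cutoff should be treated as a starting point rather than a rule.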

1. Download the Survey Data dataset (in the Datasets area on Blackboard).

2. Open the dataset in Excel.

3. Look at the Data Dictionary - here you will get some idea as to what the legal values are for each variable.

4. Create a sheet in the Excel workbook and name it DeletedData. This is where we are going to move all of the data that needs to be removed from the dataset.

5. Create a copy of the RawData sheet and rename the copy RemoveMissingValues.

On this sheet, identify the rows with missing values and move those rows to the DeletedData sheet. Be sure to put a heading above this data on the DeletedData sheet indicating why it was moved.
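For readers who want to check their Excel work programmatically, the same split can be sketched in Python. The data here is hypothetical, and treating both `None` and empty strings as "missing" is an assumption about how blanks would appear after export.

```python
def is_missing(value):
    """Treat None and blank strings as missing (an assumption about the export)."""
    return value is None or (isinstance(value, str) and value.strip() == "")

def split_missing(rows):
    """Return (kept, deleted): complete rows and rows with any missing value."""
    kept, deleted = [], []
    for row in rows:
        if any(is_missing(v) for v in row):
            deleted.append(row)
        else:
            kept.append(row)
    return kept, deleted

# Hypothetical stand-in for the RawData sheet.
raw_data = [
    [1, 2, 3],
    [4, None, 5],
    [2, "", 1],
    [5, 5, 4],
]

remove_missing_values, deleted_data = split_missing(raw_data)
print("kept:", remove_missing_values)
print("moved to DeletedData (missing values):", deleted_data)
```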

6. Next, we are going to calculate the frequency of responses for each row. Create a copy of the RemoveMissingValues sheet and name the copy CalculateFrequencies.

a. Add a row to the top of the CalculateFrequencies sheet - so we can insert some titles

b. First we want to check general integrity of the data

i. Count the number of values in each row

ii. Find the minimum value in each row

iii. Find the maximum value in each row

iv. Use conditional formatting to highlight values that are "incorrect": a count that is not equal to 60, a min that is less than 1, or a max that is greater than 5. I use conditional formatting to highlight them because if I were just to scan the numbers, I might miss something.

Another way to do this is to go to the bottom and find the min and max of each of these new columns. The min and max of the count column should both be 60, the min of the min column should be 1 or greater, and the max of the max column should be 5 or less.
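The integrity checks in step b amount to three per-row tests. A minimal Python sketch follows, assuming 60 questions answered on a 1 to 5 scale as stated above; the function name and sample rows are hypothetical.

```python
EXPECTED_COUNT = 60   # number of answers each respondent should have
LOW, HIGH = 1, 5      # legal range of each answer

def integrity_problems(row):
    """Return the list of checks ('count', 'min', 'max') that a row fails."""
    answers = [v for v in row if v is not None]
    problems = []
    if len(answers) != EXPECTED_COUNT:
        problems.append("count")
    if answers and min(answers) < LOW:
        problems.append("min")
    if answers and max(answers) > HIGH:
        problems.append("max")
    return problems

# Hypothetical rows: one clean, one short, one with out-of-range values.
good_row = [3] * 60
short_row = [3] * 59
bad_range_row = [0, 6] + [3] * 58

for name, row in [("good", good_row), ("short", short_row), ("range", bad_range_row)]:
    print(name, integrity_problems(row))
```

An empty list plays the role of a cell with no conditional-formatting highlight.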

c. Add columns to calculate the frequency of each response (i.e., how many times did the 1st respondent answer 1, 2, 3, 4, and 5).
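In Excel this is typically a COUNTIF per response value. As a hedged cross-check, the same per-respondent frequencies can be computed in Python with `collections.Counter`; the short sample row is made up for illustration.

```python
from collections import Counter

SCALE = (1, 2, 3, 4, 5)  # legal response values

def response_frequencies(row, scale=SCALE):
    """Return how many times each value on the scale appears in the row."""
    counts = Counter(row)
    return [counts[value] for value in scale]

# Hypothetical respondent with six answers.
row = [1, 1, 2, 5, 5, 5]
print(response_frequencies(row))  # counts of 1s, 2s, 3s, 4s, 5s
```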

d. Use conditional formatting to create a heat map of the frequencies. I am going to use a 3-color scale with RED at the lower end, GREEN in the middle, and RED at the higher end.
Questionable rows will be all RED.

e. Next, I go to the bottom of the frequencies and calculate the average number of times that each answer has been selected. This is a "typical" response pattern.

f. Create another set of columns that indicate the difference between the frequency and the mean responses. NOTE: you will probably have to edit your conditional formatting rules to make this work.

Here the presence of "RED" means that it could be questionable.
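Steps e and f can be sketched together: average each frequency column, then subtract that average from every respondent's frequencies. The two frequency rows below are hypothetical, and the function names are only for illustration.

```python
def column_means(freq_rows):
    """Average count of each response value across all respondents."""
    n = len(freq_rows)
    return [sum(col) / n for col in zip(*freq_rows)]

def differences_from_mean(freq_rows):
    """Per-respondent difference between each frequency and the column mean."""
    means = column_means(freq_rows)
    return [[f - m for f, m in zip(row, means)] for row in freq_rows]

# Hypothetical frequencies of 1s..5s for two respondents.
freq_rows = [
    [2, 1, 0, 0, 3],
    [0, 0, 6, 0, 0],
]

means = column_means(freq_rows)
diffs = differences_from_mean(freq_rows)
print("typical pattern:", means)
print("differences:", diffs)
```

Large positive or negative differences correspond to the cells that would show up as RED in the heat map.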

g. Create a column that calculates the variance. Use conditional formatting (a 3-color scale with RED for low values, YELLOW for middle values, and GREEN for high values).
The lower the variance, the less variation in the responses. So a variance of 0 means that they gave the same answer to all questions.
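A short Python illustration of the variance check follows. The assignment does not say whether to use the sample or the population variance; this sketch assumes the sample variance (`statistics.variance`), and either choice gives 0 for a respondent who repeats one answer.

```python
import statistics

def row_variance(row):
    """Sample variance of one respondent's answers (assumed variant)."""
    return statistics.variance(row)

# Hypothetical respondents.
straight_liner = [3, 3, 3, 3]   # same answer every time
varied = [1, 2, 3, 4, 5]        # uses the whole scale

print(row_variance(straight_liner))
print(row_variance(varied))
```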

h. Finally, calculate whether they gave a single response (1's, 2's, ...) for at least 100%, 95%, 90%, or 85% of their answers. If a single response was used for greater than or equal to 100%, 95%, 90%, or 85% of the answers, color it RED using conditional formatting.
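One way to express step h outside Excel is to find the most common answer in a row and compare its share against each threshold. This is a sketch with a made-up respondent; integer percentages are used so the comparison avoids floating-point rounding.

```python
from collections import Counter

THRESHOLDS = (100, 95, 90, 85)  # percentages from step h

def single_response_flags(row):
    """Map each threshold to True if one answer makes up at least that share."""
    top_count = Counter(row).most_common(1)[0][1]
    return {pct: top_count * 100 >= pct * len(row) for pct in THRESHOLDS}

# Hypothetical respondent: 57 of 60 answers are a 4 (exactly 95%).
row = [4] * 57 + [2] * 3
flags = single_response_flags(row)
print(flags)
```

A `True` value corresponds to a cell that would be colored RED.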

7. Now we are ready to remove rows due to low variance of the responses. Create another heading on your DeletedData sheet: REMOVED DUE TO LOW VARIANCE OF RESPONSES.

Move all of the responses where the variance is 0, that is, where 100% of the responses are a single answer.
(NOTE: This will cause your DIFFERENCE FROM MEAN data to be recalculated automatically.)
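The removal in step 7 can be sketched as a simple split. A row has zero variance exactly when it contains one distinct value, so this illustration tests for that directly; the sample rows are hypothetical.

```python
def split_zero_variance(rows):
    """Return (kept, removed): rows with some variation and rows with none."""
    kept, removed = [], []
    for row in rows:
        # One distinct value means every answer is identical (variance 0).
        if len(set(row)) == 1:
            removed.append(row)
        else:
            kept.append(row)
    return kept, removed

# Hypothetical respondents.
rows = [
    [3, 3, 3, 3],
    [1, 2, 3, 4],
    [5, 5, 5, 5],
    [4, 4, 2, 4],
]

kept, removed = split_zero_variance(rows)
print("kept:", kept)
print("removed due to low variance of responses:", removed)
```

Any column averages computed from the kept rows change once the removed rows are gone, which is why the difference-from-mean values shift in the workbook.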

[Figure: before we removed the data with 0 variance]

[Figure: after we removed the data with 0 variance]

8. Now you need to use your judgement: do you stop, or do you continue to delete responses with the next-highest single-answer bias? You really have to base this on whether that could be a "reasonable" response. Could someone legitimately feel that way? If the answer is yes, then you stop. If the answer is no, then you keep going. So it's up to you: do you delete the ones with 95% single responses and stop? Do you keep going to 90%?

There is no real "right" answer. But do be careful: it is better to leave some questionable responses in the dataset than it is to delete some legitimate responses.

9. The final step: create a copy of the CalculateFrequencies sheet and name the new sheet ReadyForAnalysis. On the new sheet, delete any calculations that you did and delete any blank rows (left behind by moved data). Now it is ready to load into a tool.

Attachment:- Assignment-Cleaning Survey Data.rar
