MATH2349 Data Preprocessing - Read the species and surveys data sets


Assignment Tasks:

You will use the WHO data set for Tasks 1-5. Read the WHO data using an appropriate function and complete Tasks 1-5.

1- Tidy Task 1:

Use appropriate "tidyr" functions to reshape the WHO data set into the form given below:
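As a rough sketch of this kind of reshaping, the toy example below gathers wide count columns into key/value pairs with "pivot_longer()". The identifier columns ("country", "iso2", "iso3", "year"), the code-style column names, the output names "code" and "value", and all numbers are assumptions for illustration, since the target form is not reproduced here.

```r
library(tidyr)

# Toy stand-in for the wide WHO data; column names and values are assumed.
who_wide <- data.frame(
  country     = c("Afghanistan", "Brazil"),
  iso2        = c("AF", "BR"),
  iso3        = c("AFG", "BRA"),
  year        = c(1997, 1997),
  new_sp_m014 = c(0, 5),
  new_sp_f014 = c(5, 7),
  stringsAsFactors = FALSE
)

# Keep the identifier columns and stack every other column
# into a "code" column (old column name) and a "value" column (count).
who_long <- pivot_longer(
  who_wide,
  cols      = -c(country, iso2, iso3, year),
  names_to  = "code",
  values_to = "value"
)

print(who_long)
```

The older "gather()" function does the same job if the installed tidyr version predates "pivot_longer()".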

2- Tidy Task 2:

The WHO data set is not in a tidy format yet. The "code" column still contains four different variables' information (see variable description section for the details). Separate the "code" column and form four new variables using appropriate "tidyr" functions. The final format of the WHO data set for this task should be in the form given below:
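One possible way to split such a column is two calls to "separate()": first on the underscore, then at a fixed character position. The sketch assumes codes shaped like "new_sp_m014" (prefix, key, then sex and age glued together); the new column names "new", "var", "sex" and "age" are illustrative choices, not names confirmed by the brief.

```r
library(tidyr)

# Toy stand-in for the reshaped WHO data; values are made up.
who_long <- data.frame(
  country = c("Afghanistan", "Afghanistan"),
  year    = c(1997, 1997),
  code    = c("new_sp_m014", "new_rel_f65"),
  value   = c(0, 2),
  stringsAsFactors = FALSE
)

# Step 1: split "code" on underscores into three pieces.
who_step1 <- separate(who_long, code,
                      into = c("new", "var", "sexage"), sep = "_")

# Step 2: split the last piece after its first character,
# so "m014" becomes sex = "m" and age = "014".
who_split <- separate(who_step1, sexage,
                      into = c("sex", "age"), sep = 1)

print(who_split)
```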

3- Tidy Task 3:

The WHO data set is not in a tidy format yet. The "rel", "ep", "sn", and "sp" keys need to be in their own columns as we will treat each of these as a separate variable. In this step, move the "rel", "ep", "sn", and "sp" keys into their own columns. The final format of the WHO data set for this task should be in the form given below:
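Moving keys into their own columns is the reverse of gathering. A minimal sketch with "pivot_wider()" follows, assuming the keys sit in a column called "var" and the counts in a column called "value" (both names are assumptions, as are the toy values).

```r
library(tidyr)

# Toy data: one country/year/sex/age group with one row per key.
who_split <- data.frame(
  country = rep("Afghanistan", 4),
  year    = rep(1997, 4),
  var     = c("sp", "sn", "ep", "rel"),
  sex     = rep("m", 4),
  age     = rep("014", 4),
  value   = c(0, 1, 2, 3),
  stringsAsFactors = FALSE
)

# Each distinct key in "var" becomes a column filled from "value".
who_keys <- pivot_wider(who_split, names_from = var, values_from = value)

print(who_keys)
```

With older tidyr versions, "spread(who_split, key = var, value = value)" is the equivalent call.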

4- Tidy Task 4:

There is one more step to tidy the WHO data set. We have two categorical variables, "sex" and "age". Use "mutate()" to factorise sex and age. For the "age" variable, you need to create labels and also order the variable. The labels would be: <15, 15-24, 25-34, 35-44, 45-54, 55-64, 65>=. The final tidy version of the WHO data set would look like this:
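The factorising step could look roughly like the sketch below. The raw age codes ("014", "1524", and so on) and the sex codes ("m", "f") are assumptions about how the values appear after the earlier tasks; the labels are the ones listed in the task.

```r
library(dplyr)

# Toy data with assumed raw codes for sex and age.
who_tidy <- data.frame(
  sex = c("m", "f", "m"),
  age = c("2534", "014", "65"),
  stringsAsFactors = FALSE
)

# Raw codes in ascending age order, paired with the labels from the task.
age_levels <- c("014", "1524", "2534", "3544", "4554", "5564", "65")
age_labels <- c("<15", "15-24", "25-34", "35-44", "45-54", "55-64", "65>=")

who_tidy <- mutate(
  who_tidy,
  sex = factor(sex, levels = c("m", "f")),
  age = factor(age, levels = age_levels, labels = age_labels, ordered = TRUE)
)

str(who_tidy)
```

Because "age" is an ordered factor, comparisons such as "age < "25-34"" respect the age order rather than alphabetical order.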

5- Task 5: Filter & Select

Drop the redundant columns "iso2" and "new", and filter any three countries from the tidy version of the WHO data set. Name this subset of the data frame as "WHO_subset".
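A short sketch of the drop-and-filter step, using made-up rows and three arbitrarily chosen countries:

```r
library(dplyr)

# Toy tidy data; the rows are illustrative only.
who_tidy <- data.frame(
  country = c("Afghanistan", "Brazil", "China", "India"),
  iso2    = c("AF", "BR", "CN", "IN"),
  iso3    = c("AFG", "BRA", "CHN", "IND"),
  new     = rep("new", 4),
  year    = rep(1997, 4),
  stringsAsFactors = FALSE
)

# Drop the two redundant columns, then keep three countries.
WHO_subset <- who_tidy %>%
  select(-iso2, -new) %>%
  filter(country %in% c("Afghanistan", "Brazil", "China"))

print(WHO_subset)
```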

You will use the surveys and species data sets for Tasks 6-10. Read the species and surveys data sets using an appropriate function. Name these data frames "species" and "surveys", respectively.

6- Task 6: Join

Combine the "surveys" and "species" data frames using the key variable "species_id". For this task, you need to add the species information ("genus", "species", "taxa") to the "surveys" data. Name the combined data frame "surveys_combined".
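Since the goal is to keep every survey record and attach species details to it, a left join is one natural fit. The sketch below uses small made-up data frames; "record_id" and the "XX" species code are hypothetical and only there to show what happens to an unmatched key.

```r
library(dplyr)

# Toy survey records; "XX" has no match in the species table.
surveys <- data.frame(
  record_id  = 1:3,
  species_id = c("NL", "DM", "XX"),
  weight     = c(NA, 40, 12),
  stringsAsFactors = FALSE
)

# Toy species lookup table.
species <- data.frame(
  species_id = c("NL", "DM"),
  genus      = c("Neotoma", "Dipodomys"),
  species    = c("albigula", "merriami"),
  taxa       = c("Rodent", "Rodent"),
  stringsAsFactors = FALSE
)

# Keep all rows of surveys; add genus, species, taxa where the key matches.
surveys_combined <- left_join(surveys, species, by = "species_id")

print(surveys_combined)
```

Rows whose "species_id" is missing from "species" are kept, with NA in the added columns.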

7- Task 7: Calculate

Using the "surveys_combined" data frame, calculate the average weight and hindfoot length of one of the species observed in each month (irrespective of the year). Make sure to exclude missing values while calculating the average.
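A minimal sketch of the monthly averages, assuming the columns are called "month", "weight" and "hindfoot_length" and picking the species code "DM" arbitrarily; the values are made up.

```r
library(dplyr)

# Toy combined data spanning two years and two species.
surveys_combined <- data.frame(
  month           = c(1, 1, 2, 2, 1),
  year            = c(1990, 1991, 1990, 1991, 1990),
  species_id      = c("DM", "DM", "DM", "DM", "NL"),
  weight          = c(40, NA, 44, 46, 150),
  hindfoot_length = c(35, 37, NA, 36, 32),
  stringsAsFactors = FALSE
)

# Group by month only, so all years are pooled together;
# na.rm = TRUE drops missing values before averaging.
dm_monthly <- surveys_combined %>%
  filter(species_id == "DM") %>%
  group_by(month) %>%
  summarise(
    mean_weight   = mean(weight, na.rm = TRUE),
    mean_hindfoot = mean(hindfoot_length, na.rm = TRUE)
  )

print(dm_monthly)
```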

8- Task 8: Missing Values

Select one of the years in the "surveys_combined" data frame and name this subset "surveys_combined_year". Using the "surveys_combined_year" data frame, find the total number of missing values in the "weight" column, grouped by species. Replace the missing values in the "weight" column with the mean weight of each species. Save the imputed data as "surveys_weight_imputed".
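The counting and imputation steps might be sketched as follows, on a made-up one-year subset. Grouping by "species_id" is an assumption about which column identifies the species.

```r
library(dplyr)

# Toy one-year subset; values are illustrative.
surveys_combined_year <- data.frame(
  species_id = c("DM", "DM", "DM", "NL", "NL"),
  weight     = c(40, NA, 44, NA, 150),
  stringsAsFactors = FALSE
)

# Count missing weights per species.
missing_by_species <- surveys_combined_year %>%
  group_by(species_id) %>%
  summarise(n_missing = sum(is.na(weight)))

# Within each species, replace NA with that species' mean weight.
surveys_weight_imputed <- surveys_combined_year %>%
  group_by(species_id) %>%
  mutate(weight = ifelse(is.na(weight), mean(weight, na.rm = TRUE), weight)) %>%
  ungroup()

print(missing_by_species)
print(surveys_weight_imputed)
```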

9- Task 9: Inconsistencies or Special Values

Inspect the "weight" column in the "surveys_weight_imputed" data frame for any further inconsistencies or special values (i.e., NaN, Inf, -Inf). Trace back and explain briefly why you got such a value.
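One likely source of a special value is a species for which every weight is missing: after the NAs are dropped nothing is left to average, and R returns NaN. The base R sketch below illustrates this on made-up numbers; whether the actual data behave this way needs to be checked against the real "surveys_weight_imputed".

```r
# A species whose weights are all missing.
weights_all_missing <- c(NA_real_, NA_real_)

# With na.rm = TRUE the vector becomes empty, and the mean of an
# empty numeric vector is NaN (0 divided by 0).
group_mean <- mean(weights_all_missing, na.rm = TRUE)

# Toy imputed column containing that group mean.
imputed_weight <- c(40, 42, group_mean)

# Flag special values: NaN, Inf and -Inf.
special <- is.nan(imputed_weight) | is.infinite(imputed_weight)

print(group_mean)
print(sum(special))
```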

10- Task 10: Outliers

Using the "surveys_combined" data frame, inspect the variable hindfoot length for possible univariate outliers. If you detect any outliers, use any of the methods outlined in the Module 6 notes to deal with them. Explain briefly the actions that you take to handle outliers.
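The Module 6 notes are not reproduced here, so the sketch below assumes one common approach: flag values outside Tukey's fences (1.5 times the interquartile range beyond the quartiles) and cap them at the nearest fence. The numbers are made up, and other methods from the notes may be equally valid.

```r
# Toy hindfoot lengths with one extreme value and one missing value.
hindfoot_length <- c(30, 32, 33, 34, 35, 36, 37, 70, NA)

# Quartiles and interquartile range, ignoring missing values.
q   <- unname(quantile(hindfoot_length, c(0.25, 0.75), na.rm = TRUE))
iqr <- q[2] - q[1]

# Tukey's fences.
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Flag values outside the fences; NA is never flagged.
is_outlier <- !is.na(hindfoot_length) &
  (hindfoot_length < lower | hindfoot_length > upper)

# Capping: pull flagged values back to the nearest fence; NA stays NA.
hindfoot_capped <- pmin(pmax(hindfoot_length, lower), upper)

print(hindfoot_length[is_outlier])
print(hindfoot_capped)
```

A quick visual check with "boxplot(hindfoot_length)" uses the same fence rule by default.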

