Built a logistic regression model on the training data


Assignment Problem: Confidence Interval, Hypothesis Testing, Data Mining Models

Objectives:

This assignment assesses your understanding of Confidence Interval, Hypothesis Testing, and Data Mining Models.

Question 1: Central Limit Theorem

The Central Limit Theorem states that the sampling distribution of the sample mean has a particular property: regardless of the shape of the population we want to make inferences about (provided it has finite variance), if we draw many samples, the sampling distribution of the sample mean becomes approximately symmetric and bell-shaped (normal) as the sample size grows. Write a program that simulates the Central Limit Theorem for different population distributions (at least 3) and sample sizes (at least 3), giving at least 9 trials in total.
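One possible way to set up the nine trials is sketched below. The three populations (uniform, exponential, binomial), the sample sizes 5, 30 and 100, and the 2000 repetitions per trial are illustrative assumptions, not choices prescribed by the assignment; the helper name `simulate_means` is also hypothetical.

```r
set.seed(42)  # fixed seed so the illustration is reproducible

# Three non-normal populations (an assumed, illustrative choice)
populations <- list(
  uniform     = function(n) runif(n, min = 0, max = 1),
  exponential = function(n) rexp(n, rate = 1),
  binomial    = function(n) rbinom(n, size = 10, prob = 0.2)
)
sample_sizes <- c(5, 30, 100)  # three sample sizes (assumed)
n_reps <- 2000                 # samples drawn per trial (assumed)

# Draw `reps` samples of size n and return the mean of each sample
simulate_means <- function(rdist, n, reps = n_reps) {
  replicate(reps, mean(rdist(n)))
}

# Run the 3 x 3 = 9 trials and plot each sampling distribution
op <- par(mfrow = c(3, 3), mar = c(4, 4, 2, 1))
results <- list()
for (pop in names(populations)) {
  for (n in sample_sizes) {
    means <- simulate_means(populations[[pop]], n)
    results[[paste(pop, n, sep = "_")]] <- means
    hist(means, breaks = 30, freq = FALSE,
         main = sprintf("%s, n = %d", pop, n), xlab = "sample mean")
    # Overlay a normal curve with the same mean and sd for visual comparison
    curve(dnorm(x, mean = mean(means), sd = sd(means)), add = TRUE, lwd = 2)
  }
}
par(op)
```

Reading the grid row by row, the histograms for the skewed populations should look visibly closer to the overlaid normal curve as n increases, and their spread should shrink roughly like 1/sqrt(n).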

Question 2: Hypothesis Testing

(1). A steel-making factory wants to know whether a newly introduced method increases its productivity. The staff recorded 10 productivity results each for the old method and the new method. The results are given as follows. Explain your t-test findings.

Old Method: 78.1 72.4 76.2 74.3 77.4 78.4 76.0 75.5 76.7 77.3
New Method: 79.1 81.0 77.3 79.1 80.0 79.1 79.1 77.3 80.2 82.1

Note that the samples are independent of each other and come from normal distributions N(μ1, σ²) and N(μ2, σ²), where μ1, μ2 and σ² are unknown.

old <- c(78.1,72.4,76.2,74.3,77.4,78.4,76.0,75.5,76.7,77.3)

new <- c(79.1,81.0,77.3,79.1,80.0,79.1,79.1,77.3,80.2,82.1)
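Because the note above states that both samples share a common unknown variance, a pooled-variance two-sample t-test seems the natural fit. The sketch below assumes a one-sided alternative (mean of the old method less than mean of the new method), which is one reading of "can increase its productivity"; the 5% significance level is likewise an assumption. The data vectors are repeated so that the block runs on its own.

```r
old <- c(78.1, 72.4, 76.2, 74.3, 77.4, 78.4, 76.0, 75.5, 76.7, 77.3)
new <- c(79.1, 81.0, 77.3, 79.1, 80.0, 79.1, 79.1, 77.3, 80.2, 82.1)

# H0: mu_old >= mu_new   vs   H1: mu_old < mu_new (new method is better)
# var.equal = TRUE requests the pooled-variance (Student) t-test
tt <- t.test(old, new, alternative = "less", var.equal = TRUE)
print(tt)

alpha <- 0.05  # assumed significance level
if (tt$p.value < alpha) {
  cat("Reject H0: the data support higher productivity under the new method.\n")
} else {
  cat("Fail to reject H0: no evidence of a productivity increase.\n")
}
```

If the equal-variance assumption were in doubt, dropping `var.equal = TRUE` would give Welch's test instead.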

(2). Do the old and new samples truly come from two distributions with the same variance?
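One common way to check this, given that both samples are assumed normal, is an F-test on the ratio of the sample variances. The following is a minimal sketch using `var.test` from base R; the two-sided alternative and the 5% level are assumptions.

```r
old <- c(78.1, 72.4, 76.2, 74.3, 77.4, 78.4, 76.0, 75.5, 76.7, 77.3)
new <- c(79.1, 81.0, 77.3, 79.1, 80.0, 79.1, 79.1, 77.3, 80.2, 82.1)

# H0: sigma_old^2 == sigma_new^2   vs   H1: the variances differ
# The test statistic is F = var(old) / var(new) with (9, 9) degrees of freedom
ft <- var.test(old, new, alternative = "two.sided")
print(ft)

alpha <- 0.05  # assumed significance level
if (ft$p.value < alpha) {
  cat("Reject H0: the variances appear to differ.\n")
} else {
  cat("Fail to reject H0: equal variances are plausible.\n")
}
```

A non-significant result here does not prove the variances are equal; it only means the data do not contradict the pooled-variance assumption used in part (1).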

Question 3: Linear Regression and ANOVA

Use the dataset 'Q3 Data.txt' (which is tab delimited) on different brands of cigarettes - you want to predict CO (Carbon Monoxide output) given the other variables.

1. Fit all seven possible linear models with CO as the dependent variable (i.e. with all possible sets of independent variables except for no independent variables) and summarise the results in a table.
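Seven models correspond to every non-empty subset of three predictors (3 + 3 + 1). The sketch below enumerates the subsets with `combn` and collects a few summary statistics per model. The column names `Tar`, `Nicotine` and `Weight` are assumptions about 'Q3 Data.txt', the helper `fit_all_subsets` is hypothetical, and the data frame is a synthetic stand-in generated only so the code runs without the file.

```r
# Fit every non-empty subset of `predictors` and collect summary statistics
fit_all_subsets <- function(data, response, predictors) {
  subsets <- unlist(
    lapply(seq_along(predictors),
           function(k) combn(predictors, k, simplify = FALSE)),
    recursive = FALSE
  )
  rows <- lapply(subsets, function(vars) {
    fit <- lm(reformulate(vars, response = response), data = data)
    s <- summary(fit)
    data.frame(
      model  = paste(vars, collapse = " + "),
      r2     = s$r.squared,
      adj_r2 = s$adj.r.squared,
      sigma  = s$sigma,          # residual standard error
      aic    = AIC(fit)
    )
  })
  do.call(rbind, rows)
}

# Synthetic stand-in (NOT the real cigarette data); column names are assumed
set.seed(1)
n <- 25
cig <- data.frame(Tar = runif(n, 1, 30), Weight = rnorm(n, 1, 0.1))
cig$Nicotine <- 0.07 * cig$Tar + rnorm(n, 0, 0.1)
cig$CO <- 2 + 0.8 * cig$Tar + rnorm(n, 0, 1)
# With the real file, something like this would replace the lines above:
# cig <- read.delim("Q3 Data.txt")

tab <- fit_all_subsets(cig, "CO", c("Tar", "Nicotine", "Weight"))
print(tab[order(-tab$adj_r2), ], digits = 4)
```

Adjusted R-squared and AIC are included because plain R-squared never decreases when a predictor is added, so it cannot on its own identify the best model.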

2. Identify what you think is the best model for predicting CO and explain why you think it is good.

3. Include a summary of the diagnostic checks that you tried for your best model (Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage).
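The four plots named above correspond to panels 1, 2, 3 and 5 of `plot()` applied to an `lm` object. The sketch below shows them for a placeholder model; the formula `CO ~ Tar`, the column name `Tar`, and the synthetic data are assumptions standing in for whichever model is chosen as best on the real file.

```r
# Synthetic stand-in (NOT the real cigarette data); column names are assumed
set.seed(1)
n <- 25
cig <- data.frame(Tar = runif(n, 1, 30))
cig$CO <- 2 + 0.8 * cig$Tar + rnorm(n)

best <- lm(CO ~ Tar, data = cig)  # placeholder for the chosen best model

# which = 1: Residuals vs Fitted, 2: Normal Q-Q,
#         3: Scale-Location,      5: Residuals vs Leverage
op <- par(mfrow = c(2, 2))
plot(best, which = c(1, 2, 3, 5))
par(op)

# Numeric companions to the plots
sw  <- shapiro.test(residuals(best))  # normality of residuals
cd  <- cooks.distance(best)           # influence of each observation
lev <- hatvalues(best)                # leverage of each observation
cat("Shapiro-Wilk p-value:", format(sw$p.value, digits = 3), "\n")
cat("Largest Cook's distance:", format(max(cd), digits = 3), "\n")
```

The numeric summaries are optional; they simply make it easier to report the diagnostics in a written summary alongside the plots.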

Question 4: Logistic Regression

In this task you are required to predict whether a respondent had an affair using logistic regression. The dataset comes from a survey conducted by Psychology Today in 1969 and contains 601 observations on 9 variables. A detailed description of the data is given below.

affairs

numeric. How often engaged in extramarital sexual intercourse during the past year? 0 = none, 1 = once, 2 = twice, 3 = 3 times, 7 = 4-10 times, 12 = monthly, 12 = weekly, 12 = daily.

gender

factor indicating gender.

age

numeric variable coding age in years: 17.5 = under 20, 22 = 20-24, 27 = 25-29, 32 = 30-34, 37 = 35-39, 42 = 40-44, 47 = 45-49, 52 = 50-54, 57 = 55 or over.

yearsmarried

numeric variable coding number of years married: 0.125 = 3 months or less, 0.417 = 4-6 months, 0.75 = 6 months-1 year, 1.5 = 1-2 years, 4 = 3-5 years, 7 = 6-8 years, 10 = 9-11 years, 15 = 12 or more years.

children

factor. Are there children in the marriage?

religiousness

numeric variable coding religiousness: 1 = anti, 2 = not at all, 3 = slightly, 4 = somewhat, 5 = very.

education

numeric variable coding level of education: 9 = grade school, 12 = high school graduate, 14 = some college, 16 = college graduate, 17 = some graduate work, 18 = master's degree, 20 = Ph.D., M.D., or other advanced degree.

occupation

numeric variable coding occupation according to Hollingshead classification (reverse numbering).

rating

numeric variable coding self rating of marriage: 1 = very unhappy, 2 = somewhat unhappy, 3 = average, 4 = happier than average, 5 = very happy.

# install.packages("AER")

data(Affairs,package="AER")

summary(Affairs)

##     affairs          gender         age         yearsmarried    children
##  Min.   : 0.000   female:315   Min.   :17.50   Min.   : 0.125   no :171
##  1st Qu.: 0.000   male  :286   1st Qu.:27.00   1st Qu.: 4.000   yes:430
##  Median : 0.000                Median :32.00   Median : 7.000
##  Mean   : 1.456                Mean   :32.49   Mean   : 8.178
##  3rd Qu.: 0.000                3rd Qu.:37.00   3rd Qu.:15.000
##  Max.   :12.000                Max.   :57.00   Max.   :15.000
##  religiousness     education       occupation        rating
##  Min.   :1.000   Min.   : 9.00   Min.   :1.000   Min.   :1.000
##  1st Qu.:2.000   1st Qu.:14.00   1st Qu.:3.000   1st Qu.:3.000
##  Median :3.000   Median :16.00   Median :5.000   Median :4.000
##  Mean   :3.116   Mean   :16.17   Mean   :4.195   Mean   :3.932
##  3rd Qu.:4.000   3rd Qu.:18.00   3rd Qu.:6.000   3rd Qu.:5.000
##  Max.   :5.000   Max.   :20.00   Max.   :7.000   Max.   :5.000

1. Data pre-processing (e.g. removal of null values, numeric encoding of factor features, and an 8:2 split into training and test sets).

2. Build a logistic regression model on the training data. Determine which features to use based on an analysis of their p-values.

3. Evaluate the trained model on the test set.
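The three steps could be sketched as follows. Several choices here are assumptions rather than requirements of the assignment: the binary target `had_affair` (affairs > 0), the 0.05 p-value cut-off, the 0.5 classification threshold, and the random seed. If the AER package is not installed, the block falls back to a synthetic stand-in with the same column names (not real survey data) purely so the code still runs.

```r
set.seed(123)
if (requireNamespace("AER", quietly = TRUE)) {
  data(Affairs, package = "AER")
} else {
  # Synthetic stand-in with the same columns (NOT the real survey data)
  n <- 601
  Affairs <- data.frame(
    affairs       = sample(c(0, 1, 2, 3, 7, 12), n, replace = TRUE,
                           prob = c(0.75, 0.05, 0.03, 0.03, 0.07, 0.07)),
    gender        = factor(sample(c("female", "male"), n, replace = TRUE)),
    age           = sample(c(17.5, 22, 27, 32, 37, 42, 47, 52, 57), n, replace = TRUE),
    yearsmarried  = sample(c(0.125, 0.417, 0.75, 1.5, 4, 7, 10, 15), n, replace = TRUE),
    children      = factor(sample(c("no", "yes"), n, replace = TRUE)),
    religiousness = sample(1:5, n, replace = TRUE),
    education     = sample(c(9, 12, 14, 16, 17, 18, 20), n, replace = TRUE),
    occupation    = sample(1:7, n, replace = TRUE),
    rating        = sample(1:5, n, replace = TRUE)
  )
}

# Step 1: pre-processing
df <- na.omit(Affairs)                          # drop rows with missing values
df$had_affair <- as.integer(df$affairs > 0)     # assumed binary target
df$gender     <- as.integer(df$gender == "male")    # numeric encoding
df$children   <- as.integer(df$children == "yes")
df$affairs    <- NULL                           # avoid leaking the target
idx   <- sample(nrow(df), size = floor(0.8 * nrow(df)))  # 8:2 split
train <- df[idx, ]
test  <- df[-idx, ]

# Step 2: full model, then keep predictors with p < 0.05 (assumed cut-off)
full  <- glm(had_affair ~ ., data = train, family = binomial)
pvals <- coef(summary(full))[-1, "Pr(>|z|)"]    # drop the intercept row
keep  <- names(pvals)[pvals < 0.05]
if (length(keep) == 0) keep <- names(pvals)     # nothing significant: keep all
reduced <- glm(reformulate(keep, response = "had_affair"),
               data = train, family = binomial)
print(summary(reduced))

# Step 3: evaluate on the held-out test set
prob <- predict(reduced, newdata = test, type = "response")
pred <- as.integer(prob > 0.5)                  # assumed threshold
conf_mat <- table(actual = test$had_affair, predicted = pred)
accuracy <- mean(pred == test$had_affair)
print(conf_mat)
cat("Test accuracy:", round(accuracy, 3), "\n")
```

Since most respondents report no affair, accuracy at a 0.5 threshold can look reasonable even when few positives are detected, so it may be worth reporting the confusion matrix (or sensitivity and specificity) alongside accuracy.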
