What does an agglomeration schedule tell us in general?


Question 1: Cluster Analysis

The SPSS file "metropolitan areas.sav" contains a data set taken from "Cities - Life in the World's 100 largest metropolitan areas, Population Crisis Committee, Washington, 1990". The data includes information about the following variables:

Population = population in millions

Murders = number of murders per year per 100,000 people

Food = percentage of income spent on food

Pproom = average number of persons living in one room

Water = % of homes with access to water and electricity

Telephone = number of telephones per 100 people

School = % of children completing education to age 18 years

Infant death = infant deaths/100 live births

Noise = ambient noise level on scale 1 (quietest) to 10 (noisiest)

Traffic = traffic flow: average mph of traffic in rush hour

area = area code: 1 = USA, Canada, Europe, Japan, Australia

In order to reduce the complexity of the data, I have conducted a cluster analysis.

a) What does an agglomeration schedule tell us in general? Provide a brief hypothetical example (using the Metropolitan Areas case), outlining the circumstances in which we might be interested in interpreting the agglomeration schedule.

b) When performing the hierarchical cluster analysis, I decided to select a 4 cluster solution. Would you have chosen the same number of clusters? What are the criteria for making this decision?

c) Please briefly summarize the key findings from the K-Means cluster solution. Do you believe it is a good solution? How would you label the clusters? What could be done to improve the cluster solution?

d) As you can see in the dialog box for the K-Means cluster analysis, I did not specify any initial cluster means before performing the analysis. Why does it normally make sense to predetermine these values? What kinds of cluster means would make sense here as an input to the K-means cluster model?

e) Imagine that we obtain data from additional cities that are not currently included in our data set. How can I assign these new observations to one of the clusters identified in our previous analysis?
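
As a rough illustration of the ideas behind parts a), b) and e), the sketch below builds an agglomeration schedule by hand for six made-up cities, looks for the largest jump in the coefficient, and assigns a new city to the nearest cluster centroid. Everything in it is an assumption made for illustration: the city labels, the two standardized scores per city, the choice of average linkage on squared Euclidean distances, and all helper names. It is not the SPSS procedure and does not use the assignment data.

```python
from itertools import combinations

# Hypothetical standardized scores on two variables for six made-up cities.
cities = {
    "A": (1.2, -1.0), "B": (1.0, -1.0), "C": (0.8, -1.3),
    "D": (-1.0, 1.0), "E": (-1.3, 1.2), "F": (2.0, 2.0),
}

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def average_linkage(c1, c2):
    """Mean squared distance over all pairs drawn from the two clusters."""
    total = sum(sq_dist(cities[i], cities[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomeration_schedule(names):
    """Repeatedly merge the two closest clusters and record every stage."""
    clusters = [(n,) for n in names]
    schedule = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_linkage(*pair))
        coeff = average_linkage(c1, c2)
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        schedule.append({
            "stage": len(schedule) + 1,
            "merged": (c1, c2),
            "coefficient": coeff,          # distance at which the merge happened
            "clusters": tuple(clusters),   # solution that exists after this stage
        })
    return schedule

def centroid(members):
    """Mean of each variable over the members of one cluster."""
    pts = [cities[m] for m in members]
    return tuple(sum(coord) / len(pts) for coord in zip(*pts))

schedule = agglomeration_schedule(sorted(cities))
for s in schedule:
    print(f"stage {s['stage']}: merge {s['merged'][0]} + {s['merged'][1]}  "
          f"coefficient={s['coefficient']:.3f}  clusters left={len(s['clusters'])}")

# A large jump between consecutive coefficients means two dissimilar clusters
# were forced together, so one candidate is the solution just before that jump.
jumps = [(schedule[i + 1]["coefficient"] - schedule[i]["coefficient"],
          len(schedule[i]["clusters"]))
         for i in range(len(schedule) - 1)]
biggest_jump, suggested_k = max(jumps)
print("suggested number of clusters:", suggested_k)

# Assign a new, hypothetical city to the cluster with the nearest centroid.
solution = next(s["clusters"] for s in schedule
                if len(s["clusters"]) == suggested_k)
centroids = [centroid(c) for c in solution]
new_city = (1.1, -0.8)
nearest = min(range(len(solution)),
              key=lambda i: sq_dist(new_city, centroids[i]))
new_city_cluster = solution[nearest]
print("new city joins cluster:", new_city_cluster)
```

With these made-up numbers the coefficient stays small for the first three stages and then rises sharply at stage 4, which is the kind of pattern that would argue for stopping at three clusters. The largest-jump rule is only one heuristic; interpretability of the clusters and cluster sizes would normally be weighed as well.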

Question 2: Logistic Regression

A study was done to examine the characteristics of MBA graduates from four top US business schools. From the study, a subset of 100 students was selected. The data sample includes information on each student's profile with respect to:

1. Grade Point Average (GPA)

2. GMAT Score

3. College Major

a. Humanities/Social Science (binary: 1=yes, 0=no)

b. Maths/Engineering (binary: 1=yes, 0=no)

c. Business (binary: 1=yes, 0=no)

4. Gender (1=Male, 2=Female)

5. Work Experience (1=1 year, 2=2 years, ..., 6=more than 6 years)

One of the business schools (variable name: School_B), which is located on the East Coast, has analyzed the data in order to better understand the profile of its MBA students in comparison to students at other top schools. In particular, a logistic regression analysis was performed using a binary variable (attendance=1; non-attendance=0) to predict the probability that a student in the survey attended School_B (instead of one of the other three schools).

The following screenshots display the steps taken when performing the logistic regression analysis in SPSS. The SPSS output report can be found in a separate file called Appendix 2.

a) Based on the SPSS output provided in Appendix 2, is this a good model for predicting whether MBA students in the sample attended School_B? Please justify your answer from a statistical point of view by assessing model fit and overall model significance.

b) According to the output report, the significance level of the Hosmer-Lemeshow test is p=0.713. What does this mean? Is this good or bad news?

c) What types of students does School B attract? What are the most important predictors for attendance of School B?

d) In the output report you can see that GPA is a significant predictor of attendance at School B. Moreover, the exponentiated unstandardized slope coefficient for GPA is Exp(B)=22.794. What does this mean?

e) According to the classification plot at the end of the SPSS output report, does the model seem to be better at predicting "attendance" or "non-attendance" at School B? Would you say that 0.5 is a reasonable cut-off value as a classification threshold?
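
To make the quantities in parts d) and e) concrete, here is a minimal sketch. Only Exp(B)=22.794 comes from the question. The baseline probability of 0.10, the ten outcomes and predicted probabilities, the alternative cut-off of 0.25, the ">=" classification rule and the function names are all hypothetical and are not taken from Appendix 2.

```python
import math

# Exp(B) for GPA as quoted in the question; B is its natural logarithm.
exp_b_gpa = 22.794
b_gpa = math.log(exp_b_gpa)
print(f"B = ln({exp_b_gpa}) = {b_gpa:.4f}")

def shift_probability(p, odds_ratio):
    """Probability after multiplying the odds p / (1 - p) by an odds ratio."""
    new_odds = (p / (1 - p)) * odds_ratio
    return new_odds / (1 + new_odds)

# Hypothetical baseline: a student with a 10% chance of attending School_B.
# A one-unit higher GPA multiplies the odds (not the probability) by Exp(B),
# holding the other predictors constant.
p_after = shift_probability(0.10, exp_b_gpa)
print(f"probability after a one-unit GPA increase: {p_after:.3f}")

# Hypothetical outcomes (1 = attended) and model-predicted probabilities.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
probs = [0.90, 0.70, 0.45, 0.30, 0.60, 0.40, 0.35, 0.20, 0.10, 0.05]

def classification_table(actual, probs, cutoff):
    """Sensitivity, specificity and accuracy when p >= cutoff is classed as 1."""
    predicted = [1 if p >= cutoff else 0 for p in probs]
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, y in pairs if a == 1 and y == 1)
    tn = sum(1 for a, y in pairs if a == 0 and y == 0)
    fp = sum(1 for a, y in pairs if a == 0 and y == 1)
    fn = sum(1 for a, y in pairs if a == 1 and y == 0)
    return {
        "sensitivity": tp / (tp + fn),   # share of attendees predicted correctly
        "specificity": tn / (tn + fp),   # share of non-attendees predicted correctly
        "accuracy": (tp + tn) / len(actual),
    }

table_50 = classification_table(actual, probs, 0.50)
table_25 = classification_table(actual, probs, 0.25)
print("cut-off 0.50:", table_50)
print("cut-off 0.25:", table_25)
```

In this toy data both cut-offs give the same overall accuracy, but 0.50 favours correct prediction of non-attendance while 0.25 favours correct prediction of attendance. That trade-off is what the choice of classification threshold controls.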

Assignment Files -

https://www.dropbox.com/s/szbkh90yj0f8kk6/Assignment%20Files.zip?dl=0

Solution Preview:

This task shows worked examples of correlation and regression. The correlation coefficient is used to test whether a relationship exists between two variables. Once a significant relationship has been found, regression analysis is used to determine the effect of the independent variables on the dependent variable.
