BUSINESS DATA MINING ASSIGNMENT

Problem 1. In the following questions you will consider a k-nearest neighbor classifier using the Euclidean distance metric on a binary classification task. We assign the class of the test point to be the class of the majority of its k nearest neighbors. Note that a point can be its own neighbor.

(a) What value of k minimizes the training set error for this data set? What is the resulting training error? Explain.

(b) Why might too large a value of k be bad in this dataset? Why might too small a value of k also be bad?

(c) What value of k minimizes leave-one-out cross-validation error for this dataset? What is the resulting error?
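For reference, the classifier and the two error measures in Problem 1 can be sketched as follows. This is a minimal Python illustration on made-up points, not the assignment's data set (which comes from the figure); the function names and the toy data are assumptions, and ties in the vote are broken arbitrarily.

```python
import math
from collections import Counter

def knn_predict(train, query, k, skip_index=None):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    # Optionally leave one training point out (used for leave-one-out CV).
    candidates = [(math.dist(x, query), label)
                  for i, (x, label) in enumerate(train) if i != skip_index]
    candidates.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in candidates[:k])
    return votes.most_common(1)[0][0]

def training_error(train, k):
    # Each point may be its own neighbor, as the problem statement allows.
    wrong = sum(knn_predict(train, x, k) != y for x, y in train)
    return wrong / len(train)

def loocv_error(train, k):
    # Each point is classified using all the other points only.
    wrong = sum(knn_predict(train, x, k, skip_index=i) != y
                for i, (x, y) in enumerate(train))
    return wrong / len(train)

# Hypothetical toy data: two well-separated groups of three points each.
toy = [((0.0, 0.0), "-"), ((1.0, 0.0), "-"), ((0.0, 1.0), "-"),
       ((5.0, 5.0), "+"), ((6.0, 5.0), "+"), ((5.0, 6.0), "+")]

print(training_error(toy, k=1))  # 0.0
print(loocv_error(toy, k=3))     # 0.0
print(loocv_error(toy, k=5))     # 1.0
```

On this toy set, k = 5 in leave-one-out mode always outvotes a point's own class (two same-class neighbors against three from the other class), which is one way a too-large k can go wrong.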

Problem 2. (R question) Universal Bank is a relatively young bank growing rapidly in terms of overall customer acquisition. The majority of these customers are liability customers (depositors) with varying sizes of relationship with the bank. The customer base of asset customers (borrowers) is quite small, and the bank is interested in expanding ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise smarter campaigns with better target marketing. The goal is to use k-NN to predict whether a new customer will accept a loan offer. This will serve as the basis for the design of a new campaign.
The file UniversalBank.xls contains data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities, etc.), and the customer response to the last personal loan campaign (personal loan). Among these 5000 customers, only 480 (9.6%) accepted the personal loan that was offered to them in the earlier campaign. Partition the data into training (60%) and test (40%) sets.

(a) Consider the following customer: Age = 40, Experience = 10, Income = 84, Family = 2, CCAvg = 2, Education = 2, Mortgage = 0, Securities Account = 0, CD Account = 0, Online = 1, and Credit Card = 1. Perform a k-NN classification with all predictors except ID and ZIP code using k = 1. How would this customer be classified?

(b) What is the best choice of k?

(c) Show the classification matrix for the test data that results from using the best k.

(d) For the customer described in part (a), what is the predicted class using the best k?
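The 60%/40% partition and the layout of a classification matrix can be sketched as below. The assignment asks for R; this Python version is only an illustration of the bookkeeping, and the function names, the seed, and the six example labels are all assumptions rather than anything taken from UniversalBank.xls.

```python
import random
from collections import Counter

def partition(n_rows, train_frac=0.6, seed=1):
    """Shuffle row indices and split them into training and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = round(n_rows * train_frac)
    return indices[:cut], indices[cut:]

def classification_matrix(actual, predicted):
    """Count (actual, predicted) pairs, e.g. key (1, 0) is a missed acceptor."""
    return Counter(zip(actual, predicted))

train_idx, test_idx = partition(5000)
print(len(train_idx), len(test_idx))  # 3000 2000

# Hypothetical labels, only to show the matrix layout (1 = accepted the loan).
actual    = [0, 0, 1, 1, 0, 1]
predicted = [0, 1, 1, 0, 0, 1]
cm = classification_matrix(actual, predicted)
print(cm[(0, 0)], cm[(0, 1)], cm[(1, 0)], cm[(1, 1)])  # 2 1 1 2
```

Fixing the seed keeps the partition reproducible, so the same test rows are used when comparing different values of k.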

Problem 3. Given the matrix X whose rows represent different data points, you are asked to perform a k-means clustering on this dataset using the Euclidean distance as the distance function. Here k is chosen as 3. All data in X were plotted in the figure below. The centers of 3 clusters were initialized as c1 = (6.2, 3.2) (red), c2 = (6.6, 3.7) (green), c3 = (6.5, 3.0) (blue).

(a) What's the center of the first cluster (red) after one iteration?

(b) What's the center of the second cluster (green) after two iterations?

(c) What's the center of the second cluster (green) after two iterations?

(d) How many iterations are required for the clusters to converge?
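One iteration of k-means, as used in Problem 3, assigns every point to its nearest center and then moves each center to the mean of its assigned points. A minimal Python sketch follows; the four points and two starting centers are made up (they are not the matrix X or the centers c1, c2, c3), and keeping an empty cluster's center unchanged is an assumption.

```python
import math

def kmeans_step(points, centers):
    """One iteration: assign each point to its nearest center, then recompute means."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        clusters[nearest].append(p)
    new_centers = []
    for old, members in zip(centers, clusters):
        if members:
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centers.append(old)  # assumption: an empty cluster keeps its center
    return new_centers

def kmeans(points, centers, max_iter=100):
    """Repeat until the centers stop moving; returns (centers, iterations run)."""
    for iteration in range(1, max_iter + 1):
        updated = kmeans_step(points, centers)
        if updated == centers:
            return centers, iteration
        centers = updated
    return centers, max_iter

# Hypothetical data: two pairs of points and two starting centers.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
start = [(1, 1), (9, 1)]

print(kmeans_step(points, start))  # [(0.0, 1.0), (10.0, 1.0)]
print(kmeans(points, start))       # ([(0.0, 1.0), (10.0, 1.0)], 2)
```

The iteration count returned here includes the final pass that confirms no center moved; other conventions count only the passes that changed something.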

Problem 4. Consider clustering a dataset X = {X1, ..., Xn}. Suppose that for some subset S ⊂ {1, ..., n} of the observations, we observe labels that divide X into L different classes. Can you propose a modification of the K-means algorithm that uses these labels in such a way that all observations from a given class remain in the same cluster, and that allows you to cluster the data into k ≥ L clusters?

Problem 5. In the figure below, there are two clusters A (red) and B (blue), each with four members. The coordinates of each member are labeled in the figure. Compute the distance between two points using Euclidean distance.

(a) Use single-link agglomerative clustering to group the instances in this data set.

(b) Use complete-link agglomerative clustering to group the instances in this data set.

(c) Use average-link agglomerative clustering to group the instances in this data set.

(d) Among the three methods above, which one is robust to noise? Explain.

(e) If two clusters are desired, what data points would be clustered together according to the single-linkage method used in part (a)?
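The three linkage rules in Problem 5 differ only in how the point-to-point Euclidean distances between two clusters are summarized: minimum (single link), maximum (complete link), or mean (average link). A small Python sketch, using made-up coordinates rather than those in the figure:

```python
import math
from itertools import product

def linkage_distances(cluster_a, cluster_b):
    """Single, complete, and average linkage distance between two clusters."""
    pairwise = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]
    return {
        "single": min(pairwise),                    # closest pair
        "complete": max(pairwise),                  # farthest pair
        "average": sum(pairwise) / len(pairwise),   # mean over all pairs
    }

# Hypothetical clusters (not the points from the figure).
a = [(0, 0), (0, 3)]
b = [(4, 0), (4, 3)]

d = linkage_distances(a, b)
print(d["single"], d["complete"], d["average"])
```

Here the four pairwise distances are 4, 5, 5, and 4, so the single-link distance is 4, the complete-link distance is 5, and the average-link distance is 4.5. At each agglomeration step, the pair of clusters with the smallest such distance is merged.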

Problem 6. Consider the following image showing data points belonging to three different clusters (indicated by the colors of the points). Which of the following clustering algorithms will perform well in accurately clustering the given data? Explain.
• k-means
• Single-link
• Complete-link
• None of the above

Problem 7. Considering single-link and complete-link hierarchical clustering, is it possible for a point to be closer to points in other clusters than to points in its own cluster? If so, in which approach will this tend to be observed? Explain.

Problem 8. (Case study) You need to use R software in this case study.

CRISA is a leading market research agency that specializes in tracking consumer purchase behavior in consumer goods (both durable and non-durable). In one major project, CRISA tracks about 30 product categories (e.g., detergents) and, within each category, about 60-70 brands. To track purchase behavior, CRISA has constituted about 50,000 household panels in 105 cities and towns in India, covering about 80% of the Indian urban market. (In addition to this, there are 25,000 sample households selected in rural areas; however, we are working with only urban market data.) The households are carefully selected using stratified sampling. The strata are defined on the basis of socio-economic status and the market (a collection of cities). CRISA has both transaction data (each row is a transaction) and household data (each row is a household), and, for the household data, maintains the following information:

• Demographics of the households (updated annually)
• Possession of durable goods (car, washing machine, etc.; updated annually) and a computed "affluence index" on this basis
• Purchase data of product categories and brands (updated monthly).

CRISA has two categories of clients: (1) Advertising agencies who subscribe to the database services; they obtain updated data every month and use it to advise their clients on advertising and promotion strategies. (2) Consumer goods manufacturers who monitor their market share using the CRISA database.

Key Problems

CRISA has traditionally segmented markets on the basis of purchaser demographics. They would like now to segment the market based on two key sets of variables more directly related to the purchase process and to brand loyalty:

1. Purchase behavior (volume, frequency, susceptibility to discounts, and brand loyalty)

2. Basis of purchase (price, selling proposition)

Doing so would allow CRISA to gain information about what demographic attributes are associated with different purchase behaviors and degrees of brand loyalty, and more effectively deploy promotion budgets.

Better and more effective market segmentation would enable CRISA's clients to design more cost-effective promotions targeted at appropriate segments. Thus, multiple promotions could be launched, each targeted at different market segments at different times of the year. This would result in a more cost-effective allocation of the promotion budget to different market segments. It would also enable CRISA to design more effective customer reward systems and thereby increase brand loyalty.

Measuring Brand Loyalty

Several variables in this case measure aspects of brand loyalty. The number of different brands purchased by the customer is one measure. However, a consumer who purchases one or two brands in quick succession and then settles on a third for a long streak is different from a consumer who constantly switches back and forth among three brands. So, how often customers switch from one brand to another is another measure of loyalty. Yet a third perspective on the same issue is the proportion of purchases that go to different brands: a consumer who spends 90% of his or her purchase money on one brand is more loyal than a consumer who spends more equally among several brands. All three of these components can be measured with the data in the purchase summary worksheet.
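To make the three loyalty perspectives concrete, the sketch below computes them from a chronological purchase list for two made-up households. It is a Python illustration only: the function name, the (brand, amount) input format, and the reading of "brand runs" as the number of consecutive same-brand streaks are assumptions, not definitions taken from the data file. The function expects at least one purchase.

```python
from collections import Counter
from itertools import groupby

def loyalty_measures(purchases):
    """purchases: non-empty list of (brand, amount_spent) in chronological order."""
    brands = [brand for brand, _ in purchases]
    spend = Counter()
    for brand, amount in purchases:
        spend[brand] += amount
    runs = sum(1 for _ in groupby(brands))  # consecutive same-brand streaks
    return {
        "n_brands": len(spend),             # number of different brands bought
        "brand_runs": runs,
        "switches": runs - 1,               # how often the brand changed
        "max_share": max(spend.values()) / sum(spend.values()),
    }

# Hypothetical households: one settles on brand C, the other keeps switching.
settler  = [("A", 10), ("B", 10), ("C", 10), ("C", 10), ("C", 10), ("C", 10)]
switcher = [("A", 10), ("B", 10), ("C", 10), ("A", 10), ("B", 10), ("C", 10)]

print(loyalty_measures(settler))
print(loyalty_measures(switcher))
```

Both households buy three brands, but the settler has 3 brand runs and gives two thirds of its spending to one brand, while the switcher has 6 runs and gives each brand one third.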

Data

The data file is BathSoap.xls. The data in Table 1 below profile each household; each row contains the data for one household.

Though not used in the assignment, two additional datasets were used in the derivation of the summary data.

CRISAPurchaseData is a transaction database in which each row is a transaction. Multiple rows in this dataset corresponding to a single household were consolidated into a single row of household data in CRISASummaryData.

The Durables sheet in the data file contains information used to calculate the affluence index. Each row corresponds to a household, and each column represents a durable consumer good. A 1 in a column indicates that the durable is possessed by the household; a 0 indicates that it is not possessed. This value is multiplied by the weight assigned to the durable item. The sum of all the weighted values of the durables possessed gives the affluence index.
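The affluence index described above is a weighted sum of 0/1 possession indicators. A minimal Python sketch, in which the durable names and weights are hypothetical and not the ones in the Durables sheet:

```python
def affluence_index(possession, weights):
    """Sum of weight * indicator over all durables (indicator is 1 if owned, else 0)."""
    return sum(weights[item] * owned for item, owned in possession.items())

# Hypothetical weights and one hypothetical household row.
weights = {"car": 5, "washing_machine": 3, "television": 2}
household = {"car": 0, "washing_machine": 1, "television": 1}

print(affluence_index(household, weights))  # 5
```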

Questions

1. Use k-means clustering to identify clusters of households based on

(a) The variables that describe purchase behavior (including brand loyalty).

[Variables: # brands, brand runs, total volume, # transactions, value, Avg. price, share to other brands, max to one brand].

(b) The variables that describe basis-for-purchase.

[Variables: Pur-vol-no-promo, Pur-vol-promo-6, Pur-vol-other, all price categories, selling propositions]

[Note: would you use all selling-propositions? Explore the data.]

(c) The variables that describe both purchase behavior and basis of purchase.

Note: How should k be chosen? Think about how the clusters would be used. It is likely that the marketing efforts would support 2-5 different promotional approaches.

Note: How should the percentages of total purchases comprised by various brands be treated? Isn't a customer who buys all brand A just as loyal as a customer who buys all brand B?

2. (a) Select what you think is the best segmentation - explain why you think this is the "best".

(b) Comment on the characteristics (demographic, brand loyalty and basis-for-purchase) of these clusters. (This information would be used to guide the development of advertising and promotional campaigns.)

Attachment:- BUSINESS DATA MINING ASSIGNMENT.rar
