Plot the dendrogram of your hierarchical clustering model


Problem 1

Letter Recognition exercise

One of the earliest applications of predictive analytics was automatic recognition of letters, which is used in tasks like sorting mail at post offices. In this problem, we will build a model that uses attributes of images of four letters in the Roman alphabet - A, B, P, and R - to predict which letter a particular image corresponds to.

In this problem, there are more than two possible classes for each observation, as in the situation in Chapter 8, when D2Hawkeye built a model to classify expected healthcare cost. Such problems are called multi-class classification problems.

The file Letters.csv (available in the Online Companion) contains 3116 observations, each of which corresponds to a certain image of one of the four letters A, B, P and R. The images came from 20 different fonts, which were then randomly distorted to produce the final images; each such distorted image is represented as a collection of pixels, each of which is "on" or "off." For each such distorted image, we have available certain attributes of the image in terms of these pixels, as well as which of the four letters the image is. These variables are described in Table 22.7.

a) To warm up, start by predicting whether or not the letter is "B." First, create a new variable called IsB in your dataset, which takes value "Yes" if the letter is B, and "No" if the letter is not B. Then randomly split your dataset into a training set and a testing set, putting 50% of the data in each set.
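One possible approach in R is sketched below; the object names (letters, train, test), the seed, and the use of the caTools package for the split are assumptions, not requirements.

  # Read the data and create the binary outcome variable
  letters <- read.csv("Letters.csv")
  letters$IsB <- as.factor(ifelse(letters$Letter == "B", "Yes", "No"))

  # Randomly split the data 50/50 into training and testing sets
  library(caTools)
  set.seed(1000)                                 # any seed; it only makes the split reproducible
  split <- sample.split(letters$IsB, SplitRatio = 0.5)
  train <- subset(letters, split == TRUE)
  test  <- subset(letters, split == FALSE)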

i) Before building models, let us consider a baseline method that always predicts the most frequent outcome, which is "not B." What is the accuracy of this baseline method on the test set?
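As a sketch, using the test set created above, the baseline accuracy is simply the proportion of test-set observations that are not a B:

  # Baseline: always predict "No" (not B)
  table(test$IsB)                 # counts of B and non-B letters in the test set
  mean(test$IsB == "No")          # baseline accuracy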

ii) Build a CART tree to predict whether or not a letter is a B, using the training set to build your model. Remember to not use the variable Letter as one of the independent variables in the model, as this is related to what we are trying to predict! Select reasonable parameter values for the model, and justify your parameter choices. What is the accuracy of this CART model on the test set?
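A minimal sketch with the rpart package follows; the cp value is only an illustrative starting point (cross-validation is one way to justify a particular choice), and train and test are the objects created above.

  library(rpart)
  library(rpart.plot)

  # CART model; exclude Letter, since it encodes the outcome directly
  cartB <- rpart(IsB ~ . - Letter, data = train, method = "class", cp = 0.01)
  prp(cartB)                                     # inspect the fitted tree

  # Test-set accuracy from the classification matrix
  predB <- predict(cartB, newdata = test, type = "class")
  confB <- table(test$IsB, predB)
  sum(diag(confB)) / sum(confB)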

iii) Now, build a random forest model to predict whether or not the letter is a B. Again, select reasonable parameter values for the model, and justify your parameter choices. What is the accuracy of the model on the test set?
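A sketch with the randomForest package; the seed and ntree value are illustrative, and the Letter column is dropped from a copy of the training set so it cannot be used as a predictor.

  library(randomForest)

  # Drop the raw outcome variable from a copy of the training data
  trainRF <- train
  trainRF$Letter <- NULL

  set.seed(1000)
  rfB <- randomForest(IsB ~ ., data = trainRF, ntree = 200)   # ntree is an illustrative choice

  # Test-set accuracy
  predRfB <- predict(rfB, newdata = test)
  confRfB <- table(test$IsB, predRfB)
  sum(diag(confRfB)) / sum(confRfB)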

Table 22.7: Variables in the dataset Letters.csv.

Variable Description
Letter The letter that the image corresponds to (A, B, P, or R).
Xbox The horizontal position of where the smallest box covering the letter shape begins.
Ybox The vertical position of where the smallest box covering the letter shape begins.
Width The width of this smallest box.
Height The height of this smallest box.
Onpix The total number of "on" pixels in the character image.
Xbar The mean horizontal position of all of the "on" pixels.
Ybar The mean vertical position of all of the "on" pixels.
X2bar The mean squared horizontal position of all of the "on" pixels in the image.
Y2bar The mean squared vertical position of all of the "on" pixels in the image.
XYbar The mean of the product of the horizontal and vertical position of all of the "on" pixels in the image.
X2Ybar The mean of the product of the squared horizontal position and the vertical position of all of the "on" pixels.
XY2bar The mean of the product of the horizontal position and the squared vertical position of all of the "on" pixels. 
Xedge The mean number of edges (the number of times an "off" pixel is followed by an "on" pixel, or the image boundary is hit) as the image is scanned from left to right, along the whole vertical length of the image.
XedgeYcor The mean of the product of the number of horizontal edges at each vertical position and the vertical position.
Yedge The mean number of edges as the image is scanned from top to bottom, along the whole horizontal length of the image.
YedgeXcor The mean of the product of the number of vertical edges at each horizontal position and the horizontal position.

iv) Compare the accuracy of your CART and Random Forest models. Which one performs better? For this application, do you think interpretability or accuracy is more important?

b) Let us now move on to the problem that we were originally interested in: predicting which of the four letters A, B, P, or R a particular image corresponds to. The variable in our dataset that we will be trying to predict is Letter.

i) In a multi-class classification problem, a simple baseline model is to predict the most frequent class of all of the options for every observation. For this problem, what does the baseline method predict, and what is the baseline accuracy on the testing set? Do you think this simple baseline method is a useful benchmark for this problem? Why or why not?

ii) Now build a classification tree to predict Letter, using the training set to build your model. (Remember not to use the variable IsB in the model, as it is related to what we are trying to predict!) Select reasonable parameter values and justify your parameter choices. What is the test set accuracy of your CART model?

(HINT: When you are computing the test set accuracy using a classification matrix, you want to add everything on the main diagonal and divide by the total number of observations in the test set.)
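A sketch, reusing the rpart package and the train/test split from part (a); converting Letter to a factor is only needed if it was read in as character.

  # Multi-class CART; exclude IsB, which was derived from the outcome
  train$Letter <- as.factor(train$Letter)
  test$Letter  <- as.factor(test$Letter)
  cartLetter <- rpart(Letter ~ . - IsB, data = train, method = "class")

  # Classification matrix and test-set accuracy (diagonal over total)
  predLetter <- predict(cartLetter, newdata = test, type = "class")
  cm <- table(test$Letter, predLetter)
  sum(diag(cm)) / sum(cm)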

iii) Now, build a random forest model to predict Letter, using the training data - again, do not forget to remove the IsB variable. What is the test set accuracy of your random forest model?

iv) Compare the accuracy of your CART and Random Forest models for this problem. Which one would you recommend for this problem? Is your choice different from the model you recommended in part (a)?

Problem 2

Document Clustering exercise

Document clustering, or text clustering, is a very popular application of clustering algorithms. A web search engine, like Google, often returns thousands of results for a simple query. For example, if you type the search term "jaguar" into Google, over 400 million results are returned. This makes it very difficult to browse or find relevant information, especially if the search term has multiple meanings, like this one. If we search for "jaguar," we might be looking for information about the animal, the car, or the Jacksonville Jaguars football team.

Clustering methods can be used to automatically group search results into categories, making it easier to find relevant results. This method is used in the search engines PolyMeta and Helioid, as well as on FirstGov, the official Web portal for the U.S. government. The two most common clustering algorithms used for document clustering are hierarchical clustering and K-means.

In this exercise, we will be clustering articles published on Daily Kos, an American political blog that publishes news and opinion articles written from a progressive point of view. The file DailyKos.csv can be found in the Online Companion for this book and contains data on 3,430 news articles or blog posts from Daily Kos. These articles were posted in 2004, leading up to the United States presidential election. The leading candidates were incumbent President George W. Bush (the Republican candidate) and Senator John Kerry (the Democratic candidate). Foreign policy was a dominant topic of the election, specifically the 2003 invasion of Iraq.

There are 1,545 variables in this dataset; each variable corresponds to a word that appears in at least 50 different articles. For each document, or observation, the variable values are the number of times that word appeared in the document. (If you are familiar with text analytics, this approach is called bag of words.)

a) Start by building a Hierarchical Clustering model to cluster documents using all of the variables in the dataset. Indicate which distance metrics you used for distances between the observations and distances between the clusters.
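For instance, a sketch in R, assuming DailyKos.csv is in the working directory; Euclidean distance between observations and Ward's method for merging clusters are one common choice, not the only defensible one.

  dailykos <- read.csv("DailyKos.csv")

  # Pairwise Euclidean distances between documents (the slow, memory-hungry step)
  kosDist <- dist(dailykos, method = "euclidean")

  # Hierarchical clustering with Ward's minimum-variance linkage
  kosHierClust <- hclust(kosDist, method = "ward.D")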

i) Building a hierarchical clustering model will probably take a significant amount of time on this dataset. Why?

ii) Plot the dendrogram of your hierarchical clustering model. Using the dendrogram and thinking about this particular application, which number of clusters would you recommend? Keep in mind that document clustering would most likely be used by Daily Kos to show readers categories to choose from when trying to decide which articles to read.
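A sketch, continuing from the model above; labels are suppressed because there are thousands of documents, and the k used to outline a candidate cut is purely illustrative.

  plot(kosHierClust, labels = FALSE)
  rect.hclust(kosHierClust, k = 7, border = "red")   # optionally outline a candidate cut; 7 is illustrative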

iii) Assign each observation to a cluster, using the number of clusters you recommended in the previous subproblem. How many observations are in each cluster?
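A sketch using cutree; replace the illustrative value of k with the number of clusters you recommended.

  hierGroups <- cutree(kosHierClust, k = 7)   # k = 7 is an illustrative choice
  table(hierGroups)                           # number of observations in each cluster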

iv) In the previous chapter, we analyzed the centroids of the clusters by looking at the average values of all of the variables in each cluster. We do not want to do that here though, since we have over 1,000 variables! Instead, split your dataset into a dataset for each cluster, using your cluster assignments.

Then, find the six most frequent words in each cluster. If you are using R, and your dataset for the first cluster is called HierCluster1, this can be done with the command:
tail(sort(colMeans(HierCluster1)))
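One way to create the per-cluster datasets is with subset, or more compactly with split; a sketch, assuming the cluster assignments are stored in hierGroups as above.

  # One data frame per cluster (repeat for each cluster number)
  HierCluster1 <- subset(dailykos, hierGroups == 1)
  HierCluster2 <- subset(dailykos, hierGroups == 2)

  # Or compute the six most frequent words in every cluster at once
  lapply(split(dailykos, hierGroups), function(cl) tail(sort(colMeans(cl))))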

Describe each cluster. Is there a cluster that is mostly about the Iraq war? Is there a cluster that is mostly about the Democratic Party? It might be helpful to know that in 2004, Howard Dean was one of the candidates for the Democratic presidential nomination, John Kerry was the candidate who won the Democratic nomination, and John Edwards was John Kerry's running mate (the Democratic vice-presidential nominee).

b) Now cluster the documents using K-means clustering. Choose the same number of clusters that you recommended for Hierarchical clustering.
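A sketch with the built-in kmeans function; the seed, iter.max, and the value of k are placeholders for your own choices.

  set.seed(1000)                               # k-means starts from random centroids
  k <- 7                                       # use the number of clusters you recommended
  kmeansClust <- kmeans(dailykos, centers = k, iter.max = 1000)

  table(kmeansClust$cluster)                   # cluster sizes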

i) How many observations are in each cluster? Is your answer the same as it was with Hierarchical clustering? Why or why not?

ii) Just like you did for Hierarchical clustering, split your dataset into a dataset for each K-means cluster, and analyze the most frequent words in each cluster. Are the clusters similar to the Hierarchical clusters? Can you find a similar Hierarchical cluster for each K-means cluster? Keep in mind that the order of the clusters (which cluster is labeled as 1, which cluster is labeled as 2, etc.) is meaningless - for example, Hierarchical cluster 3 might be very similar to K-means cluster 1.
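A quick way to compare the two solutions is to cross-tabulate the assignments and to repeat the most-frequent-words analysis per K-means cluster; a sketch:

  # Rows: hierarchical clusters; columns: k-means clusters
  table(hierGroups, kmeansClust$cluster)

  # Six most frequent words in each k-means cluster
  lapply(split(dailykos, kmeansClust$cluster), function(cl) tail(sort(colMeans(cl))))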

c) Try repeating this problem with a different number of clusters than you originally selected. How do your results compare between the two selections? Do you prefer one number of clusters over the other? Are you able to make different observations about the data when the number of clusters changes?

Problem 3 (Real-life applications)

1. Write a summary in Word of at least 400 words and not more than 800 words of the paper "Analyzing user preferences using Facebook fan pages" posted on Canvas, explaining the clustering method used and describing the resulting clusters. Do not read the appendix. (Note: SPSS is a statistical software package like R, except that it is not open-source.)

2. Write a summary in Word of at least 500 words of Chapter 14 of the Analytics Edge textbook. Include a brief description of each section, taking particular care to describe the clustering approach in Section 14.3, Defining Peer Groups (among other things), and the Condorcet clustering method (except the "optimal clustering" section, which is starred and is therefore more advanced than the other sections). Also answer the question: how can analytics be used to detect Medicaid fraud?

Article: "Analyzing User Preferences Using Facebook Fan Pages" by Pin Luarn, Hsien-Chih Kuo, Hong-Wen Lin, Yu-Ping Chiu, and Ya-Cing Jhan.

