Write code to perform k-means clustering on the values in the matrix 'dat'


Part 1: K-means Clustering

K-means clustering is an iterative method for partitioning data into k clusters. The algorithm can be described as follows:

0) Determine how many (k) clusters you will search for.

1) Randomly assign each point in your data to one of the k clusters.

2) Once all values have been assigned to a cluster, calculate the mean (centroid) of the values in each cluster.

3) Reassign values to clusters by associating each value in the data set with the nearest mean (by Euclidean distance).

4) Repeat steps 2 and 3 until convergence. Convergence occurs when no values are reassigned to a new cluster.
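The assignment itself is to be done in R, but the steps above can be sketched in Python/NumPy as follows. This is only an illustration of the algorithm, not the required solution; the tiny data set and the deliberately bad initial assignment are invented for the example.

```python
import numpy as np

def kmeans(dat, groups, max_iter=100):
    """Bare-bones k-means following steps 2-4 above.

    Assumes no cluster ever empties out (true for this toy example);
    a robust implementation would guard against that."""
    k = groups.max() + 1
    for _ in range(max_iter):
        # Step 2: centroid (mean) of the points currently in each cluster
        means = np.array([dat[groups == j].mean(axis=0) for j in range(k)])
        # Step 3: reassign each point to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(dat[:, None, :] - means[None, :, :], axis=2)
        new_groups = dists.argmin(axis=1)
        # Step 4: convergence when no point changes cluster
        if np.array_equal(new_groups, groups):
            break
        groups = new_groups
    return groups, means

# Tiny invented data set: two obvious clusters, alternating initial assignment
dat = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                [5, 5], [5, 6], [6, 5], [6, 6]], dtype=float)
init = np.array([0, 1, 0, 1, 0, 1, 0, 1])
groups, means = kmeans(dat, init)
# → groups = [0, 0, 0, 0, 1, 1, 1, 1], means = [[0.5, 0.5], [5.5, 5.5]]
```

Even from a fully mixed starting assignment, the two well-separated clusters are recovered in a couple of iterations, which is the "fairly quick" convergence the requirements describe.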

Task 1 - Write code to perform k-means clustering on the values in the matrix 'dat'.

The true group labels are provided in the vector 'true_groups'. Of course, you can't use that until the very end where you will perform some verification.

Requirements:

1) So that everyone gets consistent results, I have performed the initial assignment of points to clusters.

2) With each iteration, plot the data, colored by their current groupings, and the updated means.

3) Convergence is reached when group assignments no longer change. Your k-means clustering algorithm should reach convergence fairly quickly.

4) Print out a 'confusion' matrix showing how well the k-means clustering algorithm classified the data.

You'll probably want to write a function that calculates distances from the three means. I also highly recommend avoiding loops within each iteration; try to use apply functions instead. My final code takes 15 lines. You don't need to write as efficiently as I do; even a hundred lines will be accepted, as long as it gets the job done correctly.
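For requirement 4, the 'confusion' matrix is just a cross-tabulation of true labels against cluster labels (in R, `table(true_groups, assigned)` does this in one call). A Python sketch with invented labels:

```python
import numpy as np

def confusion(true_groups, assigned, k=3):
    """k x k cross-tabulation: rows = true labels, columns = cluster labels."""
    cm = np.zeros((k, k), dtype=int)
    for t, a in zip(true_groups, assigned):
        cm[t, a] += 1
    return cm

# Invented labels for illustration; one point of group 1 is misclassified
true_groups = np.array([0, 0, 1, 1, 2, 2])
assigned    = np.array([0, 0, 1, 2, 2, 2])
cm = confusion(true_groups, assigned)
# → [[2, 0, 0],
#    [0, 1, 1],
#    [0, 0, 2]]
```

Keep in mind that k-means cluster numbering is arbitrary: a good clustering shows one dominant entry per row, but you may need to permute the columns before the matrix looks diagonal.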

Part 2: EM Algorithm

The Expectation-Maximization algorithm is an iterative algorithm for finding maximum likelihood estimates of parameters when some of the data are missing. In our case, we are trying to estimate the model parameters of a mixture of multivariate Gaussian distributions, but we are missing the cluster assignments of the data points.

The general form of the EM algorithm consists of alternating between an expectation step and a maximization step. In the expectation step, we compute the expected log-likelihood of the joint distribution of the observed data $X$ and the missing values $Z$, where the expectation is taken over $Z$ given $X$ under the current estimates of $\theta$. In the maximization step, we find the values of $\theta$ that maximize this expected log-likelihood.

The great thing is that the solution to the maximization step can often be found analytically (versus having to search for it via a computational method). For example, the estimate of the mean that maximizes the likelihood of the data is just the sample mean.

EM Algorithm for Gaussian Mixtures -

This (brilliant) algorithm can be applied to perform clustering of Gaussian mixtures (among many other applications) in a manner similar to the k-means algorithm. A key difference between the two is that the EM algorithm is probabilistic: while k-means assigns each value to the group with the nearest mean, the EM algorithm calculates the probability that a point belongs to each group.

In the context of a Gaussian mixture, we have the following components:

1) $X$ is our observed data

2) $Z$ is the missing data: the class or cluster to which the observations $X$ belong.

3) The observed data $X$ come from normal distributions defined by the unknown parameters $\Theta$ (the means $\mu$ and covariance matrices $\Sigma$).

4) $Z$ is generated by a categorical distribution based on the unknown class mixing parameters $\alpha$. ($\sum \alpha_i = 1$)

The EM algorithm for Gaussian Mixtures will behave as follows:

1) Begin with some random or arbitrary starting values of $\Theta$ and $\alpha$.

2) E-Step. In the E-step, we will use Bayes' theorem to calculate the posterior probability that an observation $i$ belongs to component $k$.

3) M-step. Based on the estimates of the 'weights' found in the E-step, we will now perform Maximum Likelihood estimation for the model parameters.

This turns out to be fairly straightforward, as the maximum likelihood estimates for a normal distribution are easy to obtain analytically.

For each component, we will find the mean, variance, and mixing proportion based on the data points that are "assigned" to the component. The data points are not actually "assigned" to the components like they are in k-means, but rather the components are given a "weight" or "responsibility" for each observation.

4) Iterate between steps 2 and 3 until convergence is reached.
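Written out, the two steps use the standard Gaussian-mixture updates (a sketch consistent with the notation above; $w_{ik}$ is the weight of component $k$ for observation $i$, and $N_k$ is the column sum of the weight matrix):

```latex
\text{E-step:}\quad
w_{ik} = \frac{\alpha_k \, N(x_i \mid \mu_k, \Sigma_k)}
              {\sum_{j=1}^{K} \alpha_j \, N(x_i \mid \mu_j, \Sigma_j)}

\text{M-step:}\quad
N_k = \sum_{i=1}^{n} w_{ik}, \qquad
\alpha_k = \frac{N_k}{n}, \qquad
\mu_k = \frac{1}{N_k} \sum_{i=1}^{n} w_{ik} \, x_i, \qquad
\Sigma_k = \frac{1}{N_k} \sum_{i=1}^{n} w_{ik} \, (x_i - \mu_k)(x_i - \mu_k)^{\top}
```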

Coding the EM algorithm for Gaussian Mixtures -

Coding the algorithm is a matter of turning the above steps into code.

The package 'mvtnorm' handles multivariate normal distributions. The function 'dmvnorm()' evaluates the density $N(x_i | \mu_k, \Sigma_k)$. It is vectorized, so you can avoid loops when computing the probabilities.

You are dealing with a 1000 x 2 matrix of data points.

A few key things to remember / help you troubleshoot your code:

1) Your matrix of 'weights' will be 1000 x 3. (one row for each observation, one column for each cluster)

2) $N_k$ is a vector of three elements. It is effectively the column sums of the weight matrix $w$.

3) $\alpha$ is a vector of three elements. The elements will add to 1.

4) $\mu$ is a 3 x 2 matrix. One row for each cluster, one column for each x variable.

5) Each covariance matrix $\Sigma_k$ is a 2 x 2 matrix. There are three clusters, so there are three covariance matrices.
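Putting the steps and shapes together, here is a Python/NumPy sketch of the whole EM loop. It is an illustration only, not the required R solution: the density helper is a hand-rolled stand-in for `mvtnorm::dmvnorm`, and the toy example uses two components rather than the assignment's 1000 x 2 data with three clusters (where $w$ would be 1000 x 3, $\mu$ would be 3 x 2, and each $\Sigma_k$ would be 2 x 2).

```python
import numpy as np

def mvn_density(X, mu, Sigma):
    """Density of N(mu, Sigma) at each row of X (NumPy stand-in for dmvnorm)."""
    d = X.shape[1]
    diff = X - mu
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

def em_gmm(X, mu, Sigma, alpha, n_iter=50):
    """EM for a Gaussian mixture; mu is K x d, Sigma is K x d x d, alpha sums to 1."""
    n, K = len(X), len(alpha)
    for _ in range(n_iter):
        # E-step: posterior weight of each component for each point (n x K)
        dens = np.column_stack([alpha[k] * mvn_density(X, mu[k], Sigma[k])
                                for k in range(K)])
        w = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted maximum likelihood updates
        Nk = w.sum(axis=0)            # column sums of w, length K
        alpha = Nk / n                # mixing proportions, sum to 1
        mu = (w.T @ X) / Nk[:, None]  # K x d matrix of means
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (w[:, k, None] * diff).T @ diff / Nk[k]
    return w, alpha, mu, Sigma

# Toy two-component example with invented data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])
w, alpha, mu, Sigma = em_gmm(X,
                             mu=np.array([[1.0, 1.0], [4.0, 4.0]]),
                             Sigma=np.array([np.eye(2), np.eye(2)]),
                             alpha=np.array([0.5, 0.5]))
```

Note how each troubleshooting point above shows up directly: `w` is n x K, `Nk` is its column sums, `alpha` sums to 1, and `mu` has one row per cluster.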

Part 3: Open-ended analysis

For this third part, I will ask you to perform a bit of open ended analysis with real data.

Our data comes from Kaggle. It is a list of all the menu items offered at McDonald's restaurants, along with their nutrition information. Your task is to perform some cluster analysis on the McDonald's menu items.

While the task is open-ended, there are a few things that you will be required to do.

Requirements:

Data Preparation:

1) Remove the following variables, as each is almost 100% collinear with another variable:

- Calories from Fat

- Total Fat (% Daily Value)

- Saturated Fat (% Daily Value)

- Cholesterol (% Daily Value)

- Sodium (% Daily Value)

- Carbohydrates (% Daily Value)

- Dietary Fiber (% Daily Value)

2) Perform your analysis on the scaled data (just run 'scale()'). This prevents variables with large variances from dominating the analysis.

Initial Visualization:

3) Produce between 1 and 3 plots of the menu items and nutrition information. Try to identify variables that would be useful in clustering the data. (You are allowed to use whatever packages you want for this. ggplot2, scatterplot3d, etc.) [Don't spend too much time (over 20 min) on this task. You are not trying to win datafest visualization. Keep it simple.]

PCA as "pre-processing":

4) Perform principal components analysis on the scaled McDonald's data. Use the built-in function 'prcomp()'.

5) Plot your data using the first two principal components

6) Look at the resulting rotation matrix and identify which variables are weighted most heavily in each of the first two principal components.
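To make concrete what 'prcomp()' gives you, here is a Python/NumPy stand-in built on the SVD. The names mirror prcomp's output (`rotation` for `$rotation`, `scores` for `$x`); the toy data are invented, and only centering is applied here since the assignment's data are already scaled.

```python
import numpy as np

def prcomp_like(X):
    """PCA via SVD, mirroring R's prcomp(): returns (rotation, scores)."""
    Xc = X - X.mean(axis=0)              # prcomp centers by default
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    rotation = Vt.T                      # columns = principal directions ($rotation)
    scores = Xc @ rotation               # projected data ($x)
    return rotation, scores

# Toy data: two variables that vary together along the direction (1, 2)
rng = np.random.default_rng(2)
t = rng.normal(size=(200, 1))
X = np.hstack([t, 2 * t]) + rng.normal(0, 0.1, (200, 2))
rotation, scores = prcomp_like(X)
```

Reading the rotation matrix here, the first column should be close to (1, 2)/sqrt(5) up to sign: both variables load on PC1, with the second weighted about twice as heavily. That is exactly the kind of inspection requirement 6 asks for.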

K-means on the PCA matrix

7) Use R's 'kmeans()' function to perform k-means clustering using just the first two components in the PCA matrix. (Just the first two columns of $x in the prcomp output)

8) You will need to decide how many clusters to look for. You can plot $tot.withinss against the number of clusters to see where adding additional clusters no longer meaningfully improves the total within-cluster sum of squares.


9) Plot the results of the k-means analysis on the plot of the first two principal components
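The 'elbow' idea behind requirement 8 can be sketched in Python/NumPy as follows. This is an illustration only: it uses a simplistic mini k-means with a deterministic init (evenly spaced rows) on invented data; in R you would simply call kmeans() for each candidate k and plot the $tot.withinss values.

```python
import numpy as np

def tot_withinss(X, k, n_iter=25):
    """Total within-cluster sum of squares after a basic k-means run
    (what R's kmeans() reports as $tot.withinss)."""
    centers = X[::max(len(X) // k, 1)][:k].astype(float)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        # keep the old center if a cluster happens to empty out
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(k))

# Toy data with three clear clusters: the 'elbow' appears at k = 3
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.3, (50, 2)) for c in (0, 5, 10)])
wss = [tot_withinss(X, k) for k in range(1, 7)]
```

Plotting `wss` against k would show a steep drop up to k = 3 and only marginal improvement after, which is how you read off a reasonable number of clusters.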

Explaining it all

10) Take a look at the clusters formed by your k-means analysis. Pair them up with the names of the menu items. Can you give each cluster a name? For example, you might call one category "sugary drinks and desserts" or something similar.

11) Using the results of your principal components analysis, try to pick the two or three most important variables in the original data. Produce a plot using those variables and the k-means groups.

Attachment:- Assignment Files.rar
