Assignment: Data Mining
Run an exercise on the Vehicle Silhouettes dataset from vehicle.csv, completing this report and providing the commands, output screenshots, and discussion/interpretation as requested. Ensure that all commands are saved in this report AND in an R script.
For Reference: UCI Machine Learning Repository: Vehicle Silhouettes
i. Based on what you have learned this week about k-means clustering, provide a one-paragraph masters-level response describing what you anticipate that the kmeans method will accomplish for the Vehicle Silhouettes data. Be specific about the behavior and output structure of k-means models.
b. Data Pre-Processing: Load the Vehicle Silhouettes data into RStudio using the read.csv command (do not use File > Import Dataset > From CSV in the RStudio GUI, as this uses read_csv(), which results in significantly different variable types).
i. Make a copy of the loaded Vehicle Silhouettes data you just imported and name the copy 'myvehicle'. Keep the original import as you will need both the original and copy to complete this report. Include the command demonstrating this step below.
ii. Remove the variable class from 'myvehicle'. Include the command and answer to the question below.
Why do we need to remove the class variable as part of the data preprocessing steps for k-means clustering?
iii. Run the scale() function on 'myvehicle'. Include the command and answer to the question below. (Note: This command is NOT part of your tutorial. Consult the function help and use the default arguments. Hint: scale() returns its result rather than modifying its input, so you MUST save the scaled output back to 'myvehicle'.)
Why must we scale data as part of the data preprocessing steps for k-means clustering?
iv. What additional data preprocessing steps (if any) did you need to execute? Include the command(s) and output screenshot below.
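The preprocessing steps in part b can be sketched as follows. This is a minimal illustration on a small made-up data frame written to a temporary CSV, not the real vehicle.csv; the column names compactness and circularity and the toy values are assumptions chosen only so the block runs on its own.

```r
# Toy stand-in for vehicle.csv (hypothetical columns and values).
set.seed(1)
toy <- data.frame(
  compactness = round(rnorm(12, mean = 90, sd = 8)),
  circularity = round(rnorm(12, mean = 45, sd = 6)),
  class       = rep(c("bus", "opel", "saab", "van"), times = 3)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Base R import; returns a plain data.frame.
vehicle <- read.csv(csv_path)

# Work on a copy so the original (with its class labels) stays available.
myvehicle <- vehicle

# Drop the label column so only numeric features remain.
myvehicle$class <- NULL

# scale() returns a new matrix, so the result is assigned back.
myvehicle <- scale(myvehicle)

# Quick checks: column means near 0, standard deviations of 1, no NAs.
print(round(colMeans(myvehicle), 10))
print(apply(myvehicle, 2, sd))
print(sum(is.na(myvehicle)))
```

Whether any further steps are needed (for example, handling missing values) depends on what the real file contains, so the NA count here is only a prompt for what to check.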
c. K-Means Clustering - Running the Method (Hint: Record your results with k=4 in the table in part f):
i. Run 'set.seed(12345)' and then run the kmeans method with k=4 and store the output to a variable named 'kc'. Include the command, output screenshot, and discuss the input parameters you used.
ii. Enter 'kc' at the prompt. Provide the output below and then answer the following questions:
How many instances are in each cluster?
What information does the cluster means section provide and how were those numbers obtained?
What is the clustering vector?
What is the sum of squares by clusters and what does it mean?
iii. Run the 'kc$iter' command. Include the command, output screenshot, and explain what the output shows.
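As a rough guide to where each part c answer lives in the fitted object, the sketch below runs kmeans on synthetic data and prints the relevant components. The toy matrix and its feature names f1 to f3 are assumptions standing in for the prepared 'myvehicle', so the numbers will not match the real dataset.

```r
# Synthetic, scaled data with four loose groups (a stand-in for 'myvehicle').
set.seed(1)
offsets <- rep(c(-3, -1, 1, 3), each = 50)
toy <- scale(cbind(
  f1 = rnorm(200, sd = 0.5) + offsets,
  f2 = rnorm(200, sd = 0.5) - offsets,
  f3 = rnorm(200)
))

set.seed(12345)                 # fixes the random choice of initial centers
kc <- kmeans(toy, centers = 4)  # centers = 4 requests k = 4 clusters

print(kc$size)          # number of instances assigned to each cluster
print(kc$centers)       # per-cluster mean of every (scaled) variable
print(head(kc$cluster)) # clustering vector: cluster index for each row
print(kc$withinss)      # within-cluster sum of squares, one value per cluster
print(kc$tot.withinss)  # total of kc$withinss
print(kc$betweenss)     # between-cluster sum of squares
print(kc$iter)          # iterations performed before the algorithm stopped
```

Printing 'kc' itself shows most of these values in one summary; accessing the components individually is convenient when recording results in a table.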
d. K-Means Clustering - Evaluate the Model:
i. Build the cross-tabulation to compare how the method clustered the vehicles from 'myvehicle' to the actual vehicle class from your original import. Include the command, output screenshot, and answer the following questions:
What is the dominant vehicle class in each cluster?
What is the dominant cluster for each vehicle class?
What percentage of vehicles were clustered in agreement with the actual class?
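One possible way to build and read such a cross-tabulation is sketched below. The labels and cluster assignments are hypothetical values, not real results. The agreement figure credits each cluster with its dominant class, which is only one common convention; the assignment may intend a different definition.

```r
# Hypothetical labels and cluster assignments (not real results).
actual  <- c("bus", "bus", "bus", "van", "van", "van",
             "saab", "saab", "opel", "opel")
cluster <- c(1, 1, 2, 3, 3, 3, 2, 2, 4, 4)

# Rows = clusters, columns = actual classes.
xtab <- table(cluster = cluster, class = actual)
print(xtab)

# Dominant class in each cluster (largest count in each row).
dominant_class <- colnames(xtab)[apply(xtab, 1, which.max)]
print(dominant_class)

# Dominant cluster for each class (largest count in each column).
dominant_cluster <- rownames(xtab)[apply(xtab, 2, which.max)]
print(dominant_cluster)

# Agreement: credit each cluster with its dominant class,
# then divide by the total number of instances.
agreement <- sum(apply(xtab, 1, max)) / sum(xtab) * 100
print(agreement)
```

With the real data, the first argument would be the clustering vector from the fitted model and the second the class column of the original import.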
e. K-Means Clustering - Cluster Visualization:
i. Run the 'clusplot(kc)' function to visualize your model. Modify the plot appearance to make your visualization clear and easy to interpret. Unlike previous exercises, your visualization will now be evaluated on clarity and aesthetics in addition to the standard command, output, and interpretation evaluation. Include the full command, output screenshot (zoomed in), and a one-paragraph, masters-level response with your interpretation of your plot.
(Hint: Your interpretation should discuss all of the visualized clusters and should begin to address specific observations (data points) within each that warrant discussion.)
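A minimal plotting sketch follows, assuming clusplot() comes from the cluster package and is given the data plus the clustering vector (to my understanding it does not accept a kmeans object on its own). The two-group toy data and the title are assumptions for illustration only.

```r
library(cluster)  # provides clusplot()

# Toy data: two groups of 20 rows, three features each.
set.seed(1)
toy <- scale(rbind(
  matrix(rnorm(60, mean = 0), ncol = 3),
  matrix(rnorm(60, mean = 4), ncol = 3)
))

set.seed(12345)
kc <- kmeans(toy, centers = 2)

# Pass the data and the clustering vector from the fitted model.
clusplot(toy, kc$cluster,
         color  = TRUE,  # draw each cluster ellipse in its own colour
         shade  = TRUE,  # shade ellipses according to their density
         labels = 4,     # label only the ellipses, not every point
         lines  = 0,     # omit lines between the ellipses
         main   = "k-means clusters (toy data)")
```

Setting labels = 2 instead labels every point, which helps when discussing individual observations but can clutter a plot with many rows.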
f. K-Means Clustering - Experiment with Different K Values (3 Runs Summarized):
i. Completely fill in the table below documenting the results of your experimentation with modifying the k value. You may use any k value other than 4 that is greater than 0. You do not need to provide any commands or output screenshots in this report. However, you will be evaluated on these commands being present in your R script!
k= | Number of Instances in Each Cluster | Between Clusters Sum of Squares | Within Clusters Sum of Squares | Number of Iterations
ii. What effect do you observe that modifying the k values has on the method results? Provide a one-paragraph, masters-level response below:
iii. What is an ideal value of k for the Vehicle Silhouettes data? This is a subjective and open-ended question. Challenge yourself and come up with a creative and well-supported answer for which value you believe is ideal. Provide a one-paragraph, masters-level response below:
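The experiment in part f could be scripted along these lines. The k values 2, 3, and 5 and the random toy matrix are hypothetical choices, so the sketch shows only how the table columns can be collected, not what the real results look like.

```r
# Random, scaled toy data (a stand-in for the prepared 'myvehicle').
set.seed(1)
toy <- scale(matrix(rnorm(150 * 4), ncol = 4))

k_values <- c(2, 3, 5)  # hypothetical choices; any k other than 4 is allowed

# Fit one model per k and gather the quantities the table asks for.
results <- do.call(rbind, lapply(k_values, function(k) {
  set.seed(12345)  # same seed for every run so runs are comparable
  fit <- kmeans(toy, centers = k)
  data.frame(
    k            = k,
    sizes        = paste(fit$size, collapse = ", "),
    betweenss    = fit$betweenss,
    tot_withinss = fit$tot.withinss,
    iterations   = fit$iter
  )
}))
print(results)
```

Plotting tot_withinss against k and looking for the point where the decrease levels off (the "elbow") is one common heuristic for supporting a choice of k, though it is not the only defensible argument.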
i. What differences between k-means clustering and classification methods did you observe? Provide a one-paragraph, masters-level response.
ii. (Not graded) Which part of this exercise did you find the most challenging and what steps did you take to resolve the challenge?
Format your assignment according to the following formatting requirements:
1. The answer should be typed, double spaced, using Times New Roman font (size 12), with one-inch margins on all sides.
2. Include a cover page containing the title of the assignment, the student's name, the course title, and the date. The cover page is not included in the required page length.
3. Also include a reference page. Citations and references should follow APA format. The reference page is not included in the required page length.