Project

Project Title: Data Clustering using K-means

In this project, students are required to cluster Amazon product reviews that belong to four product categories: books, electronic appliances, DVDs, and kitchen appliances. Each category is further divided into positive-sentiment and negative-sentiment reviews. In total, the attached data file "data.txt" contains reviews that belong to 4 × 2 = 8 categories.

The format of the data file is as follows. Each line of the data file corresponds to one review. The first element in the line is the label of the instance (e.g., kitchen-positive indicates that the review is a positive-sentiment review about some kitchen appliance). The remaining elements in the line, separated by spaces, are the unigram and bigram features extracted from the review. Note that the two words in a bigram feature are connected by two underscores. Reviews are represented using binary-valued features (i.e., each feature present in a review appears exactly once in its line).
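The line format described above can be parsed with a short loader. The following is a minimal sketch, not a prescribed solution: the function names `load_instances` and `load_file`, the UTF-8 encoding, and the two sample lines are assumptions made for illustration and do not come from data.txt.

```python
import io


def load_instances(lines):
    """Parse an iterable of text lines into (label, feature_set) pairs."""
    instances = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        # First token is the label; the rest are unigram/bigram features.
        label, features = tokens[0], set(tokens[1:])
        instances.append((label, features))
    return instances


def load_file(path="data.txt", encoding="utf-8"):
    """Load all instances from a file (the encoding is an assumption)."""
    with open(path, encoding=encoding) as handle:
        return load_instances(handle)


# Hypothetical sample in the described format, used instead of the real file.
sample = io.StringIO(
    "kitchen-positive great great__knife knife\n"
    "books-negative boring boring__plot plot\n"
)
instances = load_instances(sample)
```

Storing each review as a set reflects the binary-valued features: only presence matters, so no counts are kept.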

Questions

(1) Write a program to load the data instances to memory from the provided file data.txt.

(2) Implement the k-means clustering algorithm with Euclidean distance to cluster the instances into k clusters. Make sure that you normalize each feature vector to unit L2 length before computing Euclidean distances.

(3) Instead of selecting the mean in a cluster,

i. select the instance that is closest to the mean as the cluster center when performing k-means clustering and

ii. use the k-medoids method to perform the clustering.

(4) Evaluate the clusters obtained in steps 2 and 3 using the cross-validation evaluation method.

(5) Briefly discuss which clustering method is best for this data, and why.
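Questions (2) and (3)(i) can be sketched together as follows. This is one possible reading under stated assumptions, not a reference solution: the names `normalize`, `squared_distance`, `mean_vector`, `kmeans`, and the `snap_to_instance` flag are hypothetical, the initial centers are drawn at random from the instances, an empty cluster simply keeps its previous center, and the four toy reviews are made up for illustration. The k-medoids method of (3)(ii) is not covered here.

```python
import math
import random


def normalize(features):
    """Turn a binary feature set into a sparse vector of unit L2 length."""
    if not features:
        return {}
    # A binary vector with n ones has L2 norm sqrt(n).
    weight = 1.0 / math.sqrt(len(features))
    return {feature: weight for feature in features}


def squared_distance(a, b):
    """Squared Euclidean distance between two sparse vectors (dicts)."""
    keys = set(a) | set(b)
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for key, value in vec.items():
            total[key] = total.get(key, 0.0) + value
    return {key: value / len(vectors) for key, value in total.items()}


def kmeans(vectors, k, iterations=20, seed=0, snap_to_instance=False):
    """Cluster vectors into k groups; return (assignment, centers).

    With snap_to_instance=True, each center is replaced by the cluster
    member closest to the mean, as in question (3)(i).
    """
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)  # assumption: random initial centers
    assignment = None
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(vec, centers[c]))
            for vec in vectors
        ]
        if new_assignment == assignment:
            break  # no change, so the clustering has converged
        assignment = new_assignment
        # Update step: recompute each center from its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if not members:
                continue  # assumption: an empty cluster keeps its old center
            center = mean_vector(members)
            if snap_to_instance:
                center = min(members, key=lambda m: squared_distance(m, center))
            centers[c] = center
    return assignment, centers


# Hypothetical toy reviews: two about one topic, two about another.
docs = [{"a", "b"}, {"a", "b", "c"}, {"x", "y"}, {"x", "y", "z"}]
vectors = [normalize(d) for d in docs]
assignment, centers = kmeans(vectors, k=2)
snapped_assignment, snapped_centers = kmeans(vectors, k=2, snap_to_instance=True)
```

Sparse dictionaries are used because each review contains only a small fraction of all features. Minimizing squared distance gives the same nearest center as minimizing Euclidean distance, so the square root is skipped.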

Submission Instructions

• Submit

(a) the source code for all your programs,

(b) a README file (plain text) describing how to compile/run your code to produce the various results, and

(c) a PDF file providing the answers to all of the above questions.

Compress all of the above files into a single zip/rar file and name it with your registration number.
