Create a graph by coloring each test set point orange or


Assignment

1. R Project for building a Naive Bayes Model: Continuing the Twitter theme, the accompanying input file "Q2in.csv" gives the frequencies of certain words (listed in the first row). Each line corresponds to the data gathered for one business day. The column SP500 indicates whether the S&P 500 index was up or down on that day. This data is only a simulation.

a) Download the data to your computer and read it into a data frame called Q2dat.

b) Print the head and tail of Q2dat to make sure the data was read correctly.
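Parts a) and b) can be sketched as follows. Since Q2in.csv is not reproduced here, the example writes a tiny stand-in file first; the file contents, the column names Buy and Sell, and the labels "up" and "down" are assumptions made only for illustration.

```r
# Stand-in for Q2in.csv: a tiny temporary file with hypothetical columns.
csvPath <- tempfile(fileext = ".csv")
writeLines(c("Buy,Sell,SP500",
             "394,120,up",
             "407,98,down",
             "398,110,up"), csvPath)

# Read the file; the first row supplies the column names.
# stringsAsFactors = TRUE turns the SP500 labels into a factor.
Q2dat <- read.csv(csvPath, header = TRUE, stringsAsFactors = TRUE)

# Inspect the first and last rows to confirm the data was read correctly.
print(head(Q2dat))
print(tail(Q2dat))
```

With the real file, csvPath would simply be the location where Q2in.csv was saved.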

c) Using the sample function, select 80% of the data and assign it to a data frame called Q2datTrain. The remainder of the data should be assigned to a data frame called Q2datTest.
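One way to do the 80/20 split in part c) is sketched below. The data frame is simulated as a stand-in for the real Q2dat (column names and values are hypothetical), and the fixed seed is an assumption added only so the split is reproducible.

```r
set.seed(1)  # assumption: fixed seed so the split is reproducible

# Simulated stand-in for Q2dat (hypothetical columns and values).
Q2dat <- data.frame(
  Buy   = rpois(100, 400),
  Sell  = rpois(100, 100),
  SP500 = factor(sample(c("down", "up"), 100, replace = TRUE))
)

n <- nrow(Q2dat)

# Draw 80% of the row indices without replacement.
trainIdx <- sample(n, size = round(0.8 * n))

Q2datTrain <- Q2dat[trainIdx, ]   # the sampled 80%
Q2datTest  <- Q2dat[-trainIdx, ]  # the remaining 20%
```

Indexing with -trainIdx guarantees that no row appears in both data frames.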

If you have not already installed the naivebayes package, install it. Then load it.

d) Run the naive_bayes function assuming each column is normally distributed. Use Q2datTrain as your training set. Call the resulting model NBmodelNormal. Next, predict whether the stock market will go up or down by applying the model to Q2datTest. Find the empirical error rate of the predicted values.
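A minimal sketch of part d), assuming the naivebayes package is installed. The data is simulated (hypothetical columns Buy and Sell), so the resulting error rate says nothing about the real Q2in.csv. In naive_bayes, usekernel = FALSE models each numeric column with a normal distribution per class.

```r
library(naivebayes)

set.seed(1)  # assumption: fixed seed for reproducibility

# Simulated stand-in for Q2dat (hypothetical columns and values).
n     <- 200
SP500 <- factor(sample(c("down", "up"), n, replace = TRUE))
Q2dat <- data.frame(
  Buy   = rnorm(n, mean = ifelse(SP500 == "up", 410, 390), sd = 10),
  Sell  = rnorm(n, mean = ifelse(SP500 == "up", 95, 105), sd = 10),
  SP500 = SP500
)

trainIdx   <- sample(n, size = round(0.8 * n))
Q2datTrain <- Q2dat[trainIdx, ]
Q2datTest  <- Q2dat[-trainIdx, ]

# Gaussian naive Bayes: each numeric feature is normal within each class.
NBmodelNormal <- naive_bayes(SP500 ~ ., data = Q2datTrain, usekernel = FALSE)

# Predict on the test features only (the response column is dropped).
testFeatures <- Q2datTest[, names(Q2datTest) != "SP500"]
predNormal   <- predict(NBmodelNormal, newdata = testFeatures)

# Empirical error rate: fraction of test days predicted incorrectly.
errNormal <- mean(as.character(predNormal) != as.character(Q2datTest$SP500))
print(errNormal)
```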

e) Repeat part d), but this time make no assumption about the distribution of the data. Call this model NBmodelKern. Compute the error rate and compare it with the result for the normal model in part d).
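In the naivebayes package, dropping the distributional assumption is usually done with usekernel = TRUE, which replaces the normal density with a kernel density estimate for each class and column. Reading part e) this way is an assumption, and the data below is again a simulated stand-in with hypothetical columns.

```r
library(naivebayes)

set.seed(1)  # assumption: fixed seed for reproducibility

# Simulated stand-in for Q2dat (hypothetical columns and values).
n     <- 200
SP500 <- factor(sample(c("down", "up"), n, replace = TRUE))
Q2dat <- data.frame(
  Buy   = rnorm(n, mean = ifelse(SP500 == "up", 410, 390), sd = 10),
  Sell  = rnorm(n, mean = ifelse(SP500 == "up", 95, 105), sd = 10),
  SP500 = SP500
)

trainIdx   <- sample(n, size = round(0.8 * n))
Q2datTrain <- Q2dat[trainIdx, ]
Q2datTest  <- Q2dat[-trainIdx, ]

# Kernel density estimates replace the per-class normal densities.
NBmodelKern <- naive_bayes(SP500 ~ ., data = Q2datTrain, usekernel = TRUE)

testFeatures <- Q2datTest[, names(Q2datTest) != "SP500"]
predKern     <- predict(NBmodelKern, newdata = testFeatures)

# Empirical error rate on the test set.
errKern <- mean(as.character(predKern) != as.character(Q2datTest$SP500))
print(errKern)
```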

f) Repeat part d), but this time turn each column into a non-numerical factor. Specifically, turn each column (other than the response column SP500, of course) of both the Q2datTrain and Q2datTest data frames into a binary variable as follows: if the number in a column is larger than or equal to the median of that column, replace the number with 1; otherwise replace it with 0. For instance, suppose the column under Buy has only five values (394, 407, 398, 409, 373). Then the median is 398, so this sequence of data is replaced by (0, 1, 1, 1, 0). Use the R functions sapply and median to accomplish this. Build a naive Bayes model based on the frequency of zeros and ones in this table (don't forget to make the zeros and ones into a factor). Call this model NBmodelBinary. Test your results on the test data and report the error rate. Compare this error rate to the two previous cases.
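The median thresholding in part f) can be sketched in base R as follows. The helper name binarizeByMedian is hypothetical, and the small data frame reuses the Buy values from the example above; the Sell values and SP500 labels are made up.

```r
# Hypothetical helper: replace every feature column by a 0/1 factor,
# where 1 means "greater than or equal to the column median".
binarizeByMedian <- function(df, response = "SP500") {
  features <- setdiff(names(df), response)
  # sapply returns a 0/1 matrix with one column per feature.
  bin <- sapply(df[features], function(col) as.integer(col >= median(col)))
  # Turn each 0/1 column into a factor that always has both levels.
  out <- as.data.frame(lapply(as.data.frame(bin), factor, levels = c(0, 1)))
  out[[response]] <- df[[response]]
  out
}

# Small stand-in using the Buy values from the example in the text.
small <- data.frame(
  Buy   = c(394, 407, 398, 409, 373),
  Sell  = c(120, 98, 110, 101, 130),
  SP500 = factor(c("up", "down", "up", "up", "down"))
)

smallBinary <- binarizeByMedian(small)
print(smallBinary$Buy)  # 0 1 1 1 0 as a factor
```

Applying the helper separately to Q2datTrain and Q2datTest follows the wording of the assignment, so each data frame is thresholded at its own column medians. The resulting factor columns can then be passed to naive_bayes in the same way as in part d).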

2. R project for building a k-NN model: All answers, including the answer to the last question, should be output by your R script. A simulated data set called blueOrangeIn.csv accompanies this question. The data set has two continuous feature variables X1, X2 and a categorical response variable Y with two values, "Blue" and "Orange". The original data is drawn from random points in a 3 × 3 chessboard colored alternately blue and orange, with some noise and inaccuracy injected into it.

a) Using the read.csv command, read this data into a data frame called Q3dat. Print the head and tail of the Q3dat data frame to make sure it was read correctly.

b) Using the kknn package, build six models for k = 1, 10, 100, 1000, 2500, 3500. For the test set, create a 40 × 40 grid by subdividing the range of X1 and X2 into 40 equally spaced intervals. The 1600 new points will form the test set, which should also be used for graphing the results in the following questions. (Suggestion: Start with only a small portion of the data, say a random subset of 500 rows. Write your program for that small set. Once you are sure it works, run it on the full set. Each run on the full set may take several minutes.)
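The grid for the test set can be built in base R as sketched below. The data is a simulated stand-in for Q3dat, and taking 40 equally spaced values per axis (so that the grid has exactly 1600 points) is one reading of the instruction.

```r
set.seed(1)  # assumption: fixed seed for reproducibility

# Simulated stand-in for the feature columns of Q3dat.
Q3dat <- data.frame(X1 = runif(500, 0, 3), X2 = runif(500, 0, 3))

gridSize <- 40

# 40 equally spaced values spanning the observed range of each feature.
x1Seq <- seq(min(Q3dat$X1), max(Q3dat$X1), length.out = gridSize)
x2Seq <- seq(min(Q3dat$X2), max(Q3dat$X2), length.out = gridSize)

# All 40 x 40 combinations; X1 varies fastest in the result.
testGrid <- expand.grid(X1 = x1Seq, X2 = x2Seq)

# The six values of k requested for the models.
kValues <- c(1, 10, 100, 1000, 2500, 3500)

print(dim(testGrid))
```

Naming the grid columns X1 and X2 lets the grid be passed directly as the test argument of kknn with a formula such as Y ~ X1 + X2.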

c) For each value of k mentioned in part b), create a graph by coloring each test set point orange or blue based on the predicted value. Also draw the boundary between the orange and blue points. You may use the knn2.r file posted as your template.
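A sketch of one such graph for a single k, assuming the kknn package is installed. It does not reproduce the knn2.r template, which is not shown here. The chessboard data is simulated without noise, kernel = "rectangular" (unweighted neighbors) is an assumed choice, and the boundary is drawn as the 0.5 contour of the 0/1 prediction matrix.

```r
library(kknn)

set.seed(1)  # assumption: fixed seed for reproducibility

# Simulated 3 x 3 chessboard as a stand-in for Q3dat (no noise added).
n  <- 500
X1 <- runif(n, 0, 3)
X2 <- runif(n, 0, 3)
Y  <- factor(ifelse((floor(X1) + floor(X2)) %% 2 == 0, "Blue", "Orange"))
Q3dat <- data.frame(X1, X2, Y)

# 40 x 40 test grid over the observed range of the features.
x1Seq    <- seq(min(Q3dat$X1), max(Q3dat$X1), length.out = 40)
x2Seq    <- seq(min(Q3dat$X2), max(Q3dat$X2), length.out = 40)
testGrid <- expand.grid(X1 = x1Seq, X2 = x2Seq)

k    <- 10
fit  <- kknn(Y ~ X1 + X2, train = Q3dat, test = testGrid,
             k = k, kernel = "rectangular")
pred <- fitted(fit)  # predicted class for each grid point

# Color each grid point by its predicted class.
plot(testGrid$X1, testGrid$X2,
     col = ifelse(pred == "Orange", "orange", "blue"),
     pch = 20, cex = 0.6, xlab = "X1", ylab = "X2",
     main = paste("k =", k))

# Boundary: 0.5 contour of the 0/1 indicator of "Orange".
# expand.grid varies X1 fastest, so rows of z follow x1Seq.
z <- matrix(as.integer(pred == "Orange"),
            nrow = length(x1Seq), ncol = length(x2Seq))
contour(x1Seq, x2Seq, z, levels = 0.5, add = TRUE,
        drawlabels = FALSE, lwd = 2)
```

Wrapping the fit and plot in a loop over the six values of k gives one graph per model; on a 500-row subset the larger values of k exceed the number of training rows, so they only make sense on the full data.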

d) Use the validation set technique to find the near-optimal k for the k-NN method for this data. To this end, use values of k from 1 to 901 in jumps of 100, that is, test k = 1, 101, 201, ..., 901. For each k, run 10 experiments in which you choose 1000 random items from the Q3dat data frame as your training set and the remaining items as your test set. For each run, build a k-NN model, test it on the test set, and find the number of misclassified orange items, the number of misclassified blue items, and the total number of misclassified items. Take the average over the experiments and divide, respectively, by the number of orange items, the number of blue items, and the total number of items in the test set. Collect these three error rates, along with the values of k, in vectors. When done, find the best k, that is, the one resulting in the lowest error rate. Also, on the same graphics panel, graph the orange error rate (proportion of orange points incorrectly classified as blue among all orange points), the blue error rate (proportion of blue points incorrectly classified as orange among all blue points), and the total error rate against the values of k. You may use the file "knnValidation.r" as your template.
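The validation loop might look roughly like the sketch below, assuming the kknn package is installed. It does not reproduce knnValidation.r, which is not shown here. The data is a simulated noisy chessboard with 1500 rows, kernel = "rectangular" is an assumed choice, and error rates are computed per run and then averaged, a small simplification of averaging the counts first.

```r
library(kknn)

set.seed(1)  # assumption: fixed seed for reproducibility

# Simulated noisy 3 x 3 chessboard as a stand-in for Q3dat.
n    <- 1500
X1   <- runif(n, 0, 3)
X2   <- runif(n, 0, 3)
lab  <- ifelse((floor(X1) + floor(X2)) %% 2 == 0, "Blue", "Orange")
flip <- runif(n) < 0.1  # flip about 10% of the labels as noise
lab[flip] <- ifelse(lab[flip] == "Blue", "Orange", "Blue")
Q3dat <- data.frame(X1, X2, Y = factor(lab))

kValues <- seq(1, 901, by = 100)
nRuns   <- 10
nTrain  <- 1000

orangeErr <- blueErr <- totalErr <- numeric(length(kValues))

for (i in seq_along(kValues)) {
  rates <- matrix(0, nrow = nRuns, ncol = 3)
  for (r in seq_len(nRuns)) {
    trainIdx <- sample(nrow(Q3dat), nTrain)
    train    <- Q3dat[trainIdx, ]
    test     <- Q3dat[-trainIdx, ]
    pred <- fitted(kknn(Y ~ X1 + X2, train = train, test = test,
                        k = kValues[i], kernel = "rectangular"))
    truth <- as.character(test$Y)
    wrong <- as.character(pred) != truth
    rates[r, ] <- c(sum(wrong & truth == "Orange") / sum(truth == "Orange"),
                    sum(wrong & truth == "Blue") / sum(truth == "Blue"),
                    mean(wrong))
  }
  avg          <- colMeans(rates)  # average over the runs
  orangeErr[i] <- avg[1]
  blueErr[i]   <- avg[2]
  totalErr[i]  <- avg[3]
}

bestK <- kValues[which.min(totalErr)]
print(bestK)

# All three error curves on one panel.
plot(kValues, totalErr, type = "b", col = "black",
     ylim = range(0, orangeErr, blueErr, totalErr),
     xlab = "k", ylab = "error rate")
lines(kValues, orangeErr, type = "b", col = "orange")
lines(kValues, blueErr, type = "b", col = "blue")
legend("topleft", legend = c("total", "orange", "blue"),
       col = c("black", "orange", "blue"), lty = 1, pch = 1)
```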

e) Of the three curves in the last problem, one should be increasing with k, one should be decreasing with k, and one should start high for small k, then hit a minimum and stay roughly flat, and then start moving up again. For each of the first two curves, explain why you observe a monotonically increasing or decreasing curve.
