Assignment -

For this assignment, the first two problems require Hadoop MapReduce jobs, although you need only solve one of them. Each of these problems should have its own folder, and that folder must contain a .txt file giving the command-line invocation for the job. For Java jobs, submit the project directory as well as a jar. The streaming job will require its own folder containing the files for the mapper and reducer. Problems carried out in Spark require only the file that will be submitted through spark-submit; Spark jobs will be implemented in Python, and their key-value output may include parentheses. For problems which do not require MapReduce or Spark, follow the instructions given below and include all work in the main submission zip.

Solve one of problems 1 and 2.

1. The following is a MapReduce exercise. You may use either the Java or Streaming APIs. From the UCI Machine Learning Repository download the compressed files docwords.nytimes.txt.gz and vocab.nytimes.txt.gz; these are part of the Bag of Words data set. Create a file named words_nytimes.txt which is the same as docwords.nytimes.txt but with the first three lines removed. Using the distributed cache, translate the records of the nytimes data set into the form (docid, actual term, term count, max frequency for document). Parentheses should not be part of the output and you may use different delimiters. The actual term is the mapping of a term id as given in the file vocab.nytimes.txt. The input file here is words_nytimes.txt and the file which will be put in the distributed cache is vocab.nytimes.txt. The VM may have difficulty with the entire dataset; if you are having issues, run on only a part of the file.
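For orientation, here is a minimal sketch of what a Streaming mapper for this problem might look like. It assumes the job is launched with -files vocab.nytimes.txt (so the vocabulary sits in the task's working directory via the distributed cache), that each line of words_nytimes.txt has the form "docid termid count", and that term ids are 1-based line numbers of the vocab file; mapper.py is a hypothetical file name, not a prescribed one. It only performs the term-id translation. A reducer (not shown) would receive the records grouped by docid and append the per-document maximum count.

```python
#!/usr/bin/env python3
# mapper.py -- hypothetical sketch for Problem 1 (Hadoop Streaming).
# Assumes: -files vocab.nytimes.txt ships the vocabulary to the task,
# input lines are "docid termid count", and term ids are 1-based line
# numbers of vocab.nytimes.txt.
import sys

# Load the vocabulary: line N of vocab.nytimes.txt is the word for term id N.
vocab = {}
with open("vocab.nytimes.txt") as f:
    for term_id, word in enumerate(f, start=1):
        vocab[term_id] = word.strip()

for line in sys.stdin:
    parts = line.split()
    if len(parts) != 3:
        continue                      # skip malformed lines
    doc_id, term_id, count = parts
    word = vocab.get(int(term_id), "UNKNOWN")
    # Key on docid so all records for one document reach the same reducer,
    # which can then compute the per-document maximum count.
    print(f"{doc_id}\t{word}\t{count}")
```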

2. In this exercise you will implement matrix multiplication as a streaming job using Python. You will do so by executing a secondary sort in such a way that no buffering is required in the reducer. Your reducer may use only O(1) additional memory; for example, you may use a small number of variables storing floats or ints only.
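One way to meet the O(1)-memory requirement is to have the mapper emit a composite key "i,j,k,TAG" (TAG being A or B), with the streaming job configured (for example via KeyFieldBasedPartitioner and the key comparator options) to partition on (i,j) while sorting on the full key, so that the A and B entries for each k arrive at the reducer adjacent, A first. Below is a hedged sketch of such a reducer only; the key layout, the job configuration, and the name reducer.py are assumptions, not a prescribed solution.

```python
#!/usr/bin/env python3
# reducer.py -- hypothetical sketch for Problem 2 (streaming matrix multiply).
# Assumes the mapper emits "i,j,k,TAG<TAB>value" with TAG in {A, B}, and that
# the job partitions on (i,j) and sorts on (i,j,k,TAG), so the matching A and
# B entries for each k are adjacent, A first.  Only O(1) state is kept.
import sys

cur_cell = None      # (i, j) of the output cell currently being summed
total = 0.0          # running dot product for that cell
a_val = None         # the last A value seen
a_k = None           # its k index

def emit(cell, value):
    if cell is not None:
        print(f"{cell[0]},{cell[1]}\t{value}")

for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    i, j, k, tag = key.split(",")
    value = float(value)
    cell = (i, j)

    if cell != cur_cell:             # new output cell: flush the previous one
        emit(cur_cell, total)
        cur_cell, total, a_val, a_k = cell, 0.0, None, None

    if tag == "A":
        a_val, a_k = value, k        # remember A[i][k] until B[k][j] arrives
    elif a_k == k:                   # matching B[k][j] follows its A[i][k]
        total += a_val * value

emit(cur_cell, total)                # flush the final cell
```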

3. In this problem you will build an inverted index for the nytimes data in the following sense: the output will be a term id together with a sorted list of the documents in which the term is found. To be precise, the output will be lines with tab-separated fields where the first field is the term and the subsequent fields are of the form docid:count, where the count is the number of times the term appears in the document. Furthermore, the docid:count data must be sorted, highest to lowest, by count, so the document with the greatest count appears first and the one with the least count appears last. You will implement this in Spark. Your submission will be a file whose lines contain the required data, together with a file giving the code/commands executed. Compress the submission data.
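A minimal PySpark sketch of one possible approach, assuming words_nytimes.txt lines have the form "docid termid count"; the script and output directory names are placeholders.

```python
# inverted_index.py -- minimal sketch for Problem 3; run with spark-submit.
# Assumes words_nytimes.txt lines are "docid termid count".
from pyspark import SparkContext

sc = SparkContext(appName="InvertedIndex")

lines = sc.textFile("words_nytimes.txt")

def parse(line):
    doc_id, term_id, count = line.split()
    return int(term_id), (int(doc_id), int(count))

index = (lines.map(parse)
              .groupByKey()
              # sort each posting list by count, highest first
              .mapValues(lambda docs: sorted(docs, key=lambda dc: -dc[1])))

def fmt(pair):
    term_id, docs = pair
    postings = "\t".join(f"{d}:{c}" for d, c in docs)
    return f"{term_id}\t{postings}"

index.map(fmt).saveAsTextFile("inverted_index_out")
```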

4. For this problem you will need to read about the tf-idf transform in the book Mining of Massive Datasets. The file words_nytimes.txt will be the input. The output will be the same as the input except that the third field, which gives the count of the term in the document, will be replaced by the tf-idf score for the term in the document.

You may solve this using any method you like; however, the tf-idf score must be as defined in the above-mentioned text. You need only submit the output. You must compress the output and include it with your zipped submission.
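As a rough illustration, the sketch below computes tf-idf the way it is commonly defined in Mining of Massive Datasets (TF as the count normalized by the maximum count in the document, IDF as log2 of the number of documents divided by the number of documents containing the term); verify this against the text before relying on it. File names, the output path, and the "docid termid count" input layout are assumptions.

```python
# tfidf.py -- hedged PySpark sketch for Problem 4; run with spark-submit.
# Assumes input lines of words_nytimes.txt are "docid termid count" and the
# MMDS-style definitions TF = count / max count in doc, IDF = log2(N / n_i).
import math
from pyspark import SparkContext

sc = SparkContext(appName="TfIdf")

rows = (sc.textFile("words_nytimes.txt")
          .map(lambda l: l.split())
          .map(lambda p: (int(p[0]), int(p[1]), int(p[2]))))

n_docs = rows.map(lambda r: r[0]).distinct().count()

# maximum term count per document, for the TF normalization
max_in_doc = rows.map(lambda r: (r[0], r[2])).reduceByKey(max)

# number of documents each term appears in, for the IDF
doc_freq = rows.map(lambda r: (r[1], 1)).reduceByKey(lambda a, b: a + b)

tfidf = (rows.map(lambda r: (r[0], (r[1], r[2])))
             .join(max_in_doc)                 # (doc, ((term, cnt), max))
             .map(lambda x: (x[1][0][0], (x[0], x[1][0][1] / x[1][1])))
             .join(doc_freq)                   # (term, ((doc, tf), df))
             .map(lambda x: (x[1][0][0], x[0],
                             x[1][0][1] * math.log2(n_docs / x[1][1]))))

tfidf.map(lambda t: f"{t[0]} {t[1]} {t[2]:.6f}").saveAsTextFile("tfidf_out")
```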

5. The following must be solved using Spark. You will submit your answers together with a file containing the commands you executed. It is recommended that you employ DataFrames for this problem. You may need to make use of AWS if your computer is unable to process the entire data set. When a question asks about particular words, give the id only. Referring to the New York Times dataset mentioned above, answer the following questions (a starter DataFrame sketch follows the list).

(a) How many documents have at least 100 distinct words?

(b) Which document contains the most total words from the vocabulary?

(c) Which document contains the most distinct words from the vocabulary?

(d) Which document, with at least 100 words, has the greatest lexical richness with respect to the vocabulary? By lexical richness we mean the number of distinct words divided by the total number of words.

(e) Which document, with at least 100 words, has the least lexical richness?

(f) Which word from the vocabulary appears the most across all of the documents, in terms of total count?

(g) How many documents have fewer than 50 words from the vocabulary?

(h) What is the average number of total words per document?

(i) What is the average number of distinct words per document?
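As a starting point, the sketch below loads the data into a DataFrame and answers parts (a) and (b); the remaining parts follow the same groupBy/agg pattern on the same DataFrame. The column names, the space delimiter, and the file name are assumptions about words_nytimes.txt.

```python
# nytimes_queries.py -- minimal PySpark DataFrame sketch for Problem 5,
# showing the setup and parts (a) and (b).  Column and file names assumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("NYTimesQuestions").getOrCreate()

df = (spark.read
           .option("delimiter", " ")
           .csv("words_nytimes.txt")
           .toDF("docid", "termid", "count")
           .select(F.col("docid").cast("int"),
                   F.col("termid").cast("int"),
                   F.col("count").cast("int")))

# per-document totals used by most of the parts
per_doc = df.groupBy("docid").agg(
    F.countDistinct("termid").alias("distinct_words"),
    F.sum("count").alias("total_words"))

# (a) number of documents with at least 100 distinct words
print(per_doc.filter(F.col("distinct_words") >= 100).count())

# (b) document with the most total words from the vocabulary
per_doc.orderBy(F.col("total_words").desc()).show(1)
```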

6. Download the file movies.txt.gz and familiarize yourself with its structure. This is a large file and the download may take some time depending on your internet connection. After this you will create a new file called reviews.csv which will have on each line the following:

review id, product id, score, helpfulness score

where the fields are separated by a comma. There should be one line per review in the file. You may solve this exercise in whatever manner you choose. Now carry out the following parts; include the code and/or commands that were executed to answer these questions. You will also submit the compressed output. As in the previous problem, you may use any method you like.
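A hypothetical parsing sketch for building reviews.csv, assuming movies.txt consists of blank-line-separated records of "field: value" lines (as in the SNAP Amazon review dumps), that review/userId serves as the review id, and that the file is latin-1 encoded; check the actual file structure before using these field names.

```python
# make_reviews_csv.py -- hypothetical sketch; field names and encoding are
# assumptions about movies.txt.gz, not taken from the assignment.
import gzip

def emit(rec, out):
    # one line per review: review id, product id, score, helpfulness
    out.write("{},{},{},{}\n".format(
        rec.get("review/userId", ""),
        rec.get("product/productId", ""),
        rec.get("review/score", ""),
        rec.get("review/helpfulness", "")))

with gzip.open("movies.txt.gz", "rt", encoding="latin-1") as src, \
     open("reviews.csv", "w") as out:
    record = {}
    for line in src:
        line = line.strip()
        if not line:                      # blank line separates reviews
            if record:
                emit(record, out)
            record = {}
        elif ": " in line:
            key, value = line.split(": ", 1)
            record[key] = value
    if record:                            # flush the last record
        emit(record, out)
```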

(a) Verify that you have the correct number of reviews in the file you created.

(b) Verify the number of distinct products.

(c) Verify the number of distinct users.

(d) Verify the number of users with 50 or more reviews.

(e) Create a file called mean_rating.csv which has one line per unique reviewer such that each line has the user id and mean score of all their ratings separated by a comma. This file should also be compressed and submitted.
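For part (e), a minimal PySpark DataFrame sketch, assuming the first column of reviews.csv identifies the reviewer and that the score column can be cast to a double; the output path is a placeholder.

```python
# mean_rating.py -- minimal sketch for part (e); column order assumed to be
# review id, product id, score, helpfulness as written above.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("MeanRating").getOrCreate()

reviews = (spark.read.csv("reviews.csv")
                .toDF("user_id", "product_id", "score", "helpfulness")
                .withColumn("score", F.col("score").cast("double")))

# one line per unique reviewer: user id, mean score
(reviews.groupBy("user_id")
        .agg(F.avg("score").alias("mean_score"))
        .write.csv("mean_rating_out"))
```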

Textbook - Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman and Jeffrey D. Ullman.
