The similarity between any two files in our collection


The basic task is to measure the similarity between any two files in our collection. To do this, we will need a suitable universe of words. This will consist of all words in the collection that are (a) more than four letters long, (b) don't occur more than 20 times overall, and (c) don't occur in more than 7 files in the collection. Now we constructor a vector (in the mathematical sense) corresponding to each file. The vector will have as many coordinates as words in the universe -- so there is one coordinate for each word in the universe. If a word occurs in the file, the corresponding coordinate is 1, otherwise it is 0. 

Let us give an example: suppose the universe consists of the five words: apple, grapes, banana, doctor, program. Suppose file1 contains: apple, banana, program. Then the vector for file1 is (1,0,1,0,1). 

We need to normalize each of the vectors so that it has unit length. So each coordinate in the above vector gets divided by the square root of 3. 

The similarity of two files is defined to be the scalar product of the corresponding two vectors. The scalar product of two vectors is obtained by multiplying corresponding components and adding. For example, the scalar product of (2,1,3) and (0,5,6) is 2 * 0 + 1 * 5 + 3 * 6. 

Your task is to write a program that prints the names of the two files with the highest similarity among the files in the collection, and the names of the two files with the lowest similarity 

Request for Solution File

Ask an Expert for Answer!!
Basic Computer Science: The similarity between any two files in our collection
Reference No:- TGS0143977

Expected delivery within 24 Hours