What punctuation can be removed to determine terms, what stop


Assignment

1. Consider this dictionary: {"CAT", "COUNT", "DOG", "DONKEY", "ELEPHANT"}

Term-ID1    Offset
1           ____
2           ____
3           ____
4           ____
5           ____

a) Complete this table assuming "dictionary as a string"
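As a rough illustration of the "dictionary as a string" idea, the sketch below concatenates the terms into one string and records where each term starts. The function names `dictionary_as_string` and `term_at` are hypothetical, and the choice of 0-based offsets is an assumption; your course may count from 1 or store term lengths instead.

```python
def dictionary_as_string(terms):
    """Concatenate terms and record each term's start position (0-based, an assumption)."""
    offsets = []
    position = 0
    for term in terms:
        offsets.append(position)
        position += len(term)
    return "".join(terms), offsets


def term_at(string, offsets, term_id):
    """Recover the term with the given 1-based Term-ID from the string."""
    start = offsets[term_id - 1]
    # A term ends where the next one starts; the last term ends at the end of the string.
    end = offsets[term_id] if term_id < len(offsets) else len(string)
    return string[start:end]


terms = ["CAT", "COUNT", "DOG", "DONKEY", "ELEPHANT"]
string, offsets = dictionary_as_string(terms)
for term_id in range(1, len(terms) + 1):
    print(term_id, offsets[term_id - 1], term_at(string, offsets, term_id))
```

Storing only the start offsets works because each term's end is implied by the next term's start, which is what makes the single-string layout compact.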

b) Create a second dictionary consisting of each word reversed (e.g. CAT -> TAC). Show the dictionary as a string.

Term-ID2    Offset
1           ____
2           ____
3           ____
4           ____
5           ____

c) Complete this table using your reversed dictionary string

d) Using your two dictionaries, show how you can determine the words that satisfy the wildcard query C*T
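One common way to answer a query of the form prefix*suffix is to look up the prefix in the forward dictionary, look up the reversed suffix in the reversed dictionary, and intersect the two result sets. The sketch below follows that idea under some assumptions: the names `prefix_matches` and `wildcard` are hypothetical, the query contains exactly one `*`, and a linear scan stands in for the sorted or tree-based lookup a real index would use.

```python
def prefix_matches(terms, prefix):
    """Return the terms that start with prefix (linear scan, for illustration only)."""
    return {t for t in terms if t.startswith(prefix)}


def wildcard(terms, query):
    """Answer a prefix*suffix query using a forward and a reversed dictionary."""
    prefix, suffix = query.split("*")  # assumes exactly one '*' in the query
    reversed_terms = [t[::-1] for t in terms]

    # Forward dictionary: terms beginning with the prefix.
    forward_hits = prefix_matches(terms, prefix)
    # Reversed dictionary: reversed terms beginning with the reversed suffix,
    # turned back into their original spelling.
    backward_hits = {r[::-1] for r in prefix_matches(reversed_terms, suffix[::-1])}

    # Prefix and suffix must not overlap inside a matching term.
    min_len = len(prefix) + len(suffix)
    return sorted(t for t in forward_hits & backward_hits if len(t) >= min_len)


terms = ["CAT", "COUNT", "DOG", "DONKEY", "ELEPHANT"]
print(wildcard(terms, "C*T"))
```

The intersection step is what makes the second dictionary useful: neither lookup alone can enforce both ends of the pattern.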

Consider the following documents:

Doc1: the wood table

Doc2: they made the wood

Doc3: the table is made of steel

Doc4: wood table or steel table

Using a shingle size of 2, compute the Jaccard coefficient of:

(Doc1, Doc2)

(Doc1, Doc3)

(Doc1, Doc4)

Based upon your results, Doc1 is most similar to ____?
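The computation can be sketched as follows, assuming that a shingle is a run of consecutive words (not characters) and that the Jaccard coefficient is the size of the intersection of two shingle sets divided by the size of their union. The helper names `shingles` and `jaccard` are hypothetical.

```python
def shingles(text, k=2):
    """Return the set of k-word shingles in text (word-level, lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient: |A & B| / |A | B|, taken as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


docs = {
    "Doc1": "the wood table",
    "Doc2": "they made the wood",
    "Doc3": "the table is made of steel",
    "Doc4": "wood table or steel table",
}

doc1 = shingles(docs["Doc1"])
for name in ("Doc2", "Doc3", "Doc4"):
    print(name, jaccard(doc1, shingles(docs[name])))
```

Because shingles are stored in a set, a shingle that occurs twice in a document counts once, which matches the set-based definition of the coefficient.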

2. Using this robots.txt file:

1. Crawl-delay: 10
2. User-agent: crawlerbot
3. Disallow: /includes
4. Disallow: /misc
5. Disallow: /setup
6. Allow: /misc/*.jpg
7. User-agent: *
8. Disallow: /setup

a) What does line 1 mean?

b) What is the difference between line 2 and 7?

c) Should any crawler access the file /setup/help.txt?

d) Should the crawler "mybot" access the file /a/b.htm?
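One way to check answers about access rules is to feed the file to Python's standard-library `urllib.robotparser`. This is only a sketch: the `allowed` helper is a hypothetical name, the line numbers are stripped from the file, and this particular parser matches rules by simple path prefix, so it does not interpret the `*` wildcard in `Allow: /misc/*.jpg` the way some crawlers do.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt file from the question, without the line numbers.
ROBOTS_TXT = """\
Crawl-delay: 10
User-agent: crawlerbot
Disallow: /includes
Disallow: /misc
Disallow: /setup
Allow: /misc/*.jpg
User-agent: *
Disallow: /setup
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Return True if the parsed rules let the given user agent fetch path."""
    return parser.can_fetch(agent, path)


for agent, path in [("mybot", "/a/b.htm"),
                    ("mybot", "/setup/help.txt"),
                    ("crawlerbot", "/setup/help.txt")]:
    print(agent, path, allowed(agent, path))
```

A crawler whose name has no group of its own, such as "mybot", falls back to the `User-agent: *` group, so only the rules under line 7 apply to it.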

3. Consider the following text:

This tree is just one of many older-growth trees in the forest. Forests in Texas, can be over 100 years-old before they are considered "old". Trees can be over 200 years.

a) What punctuation can be removed to determine terms?

b) What stop words can be removed?

c) Which tokens can be converted to lower case?
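The three steps can be sketched as a small pipeline. Everything here is an assumption made for illustration: the helper names `tokenize` and `remove_stop_words` are hypothetical, the stop list is a tiny hand-picked set, not a standard one, and only leading and trailing punctuation is stripped so that hyphenated tokens such as "older-growth" stay whole.

```python
import string

# Hypothetical, deliberately small stop list; real systems use longer lists.
STOP_WORDS = {"this", "is", "just", "one", "of", "in", "the",
              "can", "be", "before", "they", "are"}


def tokenize(text):
    """Split on whitespace, strip surrounding punctuation, and lowercase."""
    tokens = []
    for raw in text.split():
        # Removes commas, periods and quotes at the edges; internal hyphens remain.
        token = raw.strip(string.punctuation)
        if token:
            tokens.append(token.lower())
    return tokens


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in stop_words]


sample = 'Forests in Texas, can be "old".'
print(remove_stop_words(tokenize(sample)))
```

Lowercasing every token is the simplest policy, but it also folds proper nouns such as "Texas"; a more careful design might lowercase only sentence-initial words.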
