The Cranfield collection is a standard IR text collection


The Cranfield collection is a standard IR text collection consisting of 1400 documents from the field of aerodynamics. It is available from the class web page (see the "Links and resources" section).

1. Write a program that preprocesses the collection. The preprocessing stage should include:
a. A function that eliminates SGML tags.
b. A function that tokenizes the text. Pay particular attention to characters that need special handling, as discussed in class (. , - etc.). For this task, please use _your own_ implementation of a tokenizer.
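
The two functions requested above could be sketched as follows. This is only a minimal illustration, not a reference solution: the names `strip_sgml_tags` and `tokenize` are hypothetical, the sample tags in the usage lines are assumed rather than taken from the collection files, and the rules for `.`, `,` and `-` (keep them inside numbers and hyphenated words, otherwise treat them as separators) are one plausible reading of "special handling", since the in-class discussion is not reproduced here.

```python
import re
from typing import List

# Assumption: a tag is anything between '<' and the next '>'.
TAG_RE = re.compile(r"<[^>]+>")


def strip_sgml_tags(text: str) -> str:
    """Replace each tag with a space so neighbouring words stay separated."""
    return TAG_RE.sub(" ", text)


def _joins_token(prev: str, ch: str, nxt: str) -> bool:
    """Decide whether a special character stays inside the current token."""
    if ch == "-":
        # Hyphen between letters/digits, e.g. boundary-layer
        return prev.isalnum() and nxt.isalnum()
    if ch in ".,":
        # Period or comma between digits, e.g. 2.5 or 1,000
        return prev.isdigit() and nxt.isdigit()
    return False


def tokenize(text: str) -> List[str]:
    """Hand-written tokenizer: lowercases and scans character by character."""
    chars = text.lower()
    tokens: List[str] = []
    current: List[str] = []
    for i, ch in enumerate(chars):
        prev = chars[i - 1] if i > 0 else ""
        nxt = chars[i + 1] if i + 1 < len(chars) else ""
        if ch.isalnum() or _joins_token(prev, ch, nxt):
            current.append(ch)
        elif current:
            # Any other character ends the current token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


# Usage on a made-up fragment (tag names are assumed, not from the collection).
sample = "<TITLE>Boundary-layer flow.</TITLE> <TEXT>Mach 2.5, at 1,000 ft.</TEXT>"
print(tokenize(strip_sgml_tags(sample)))
```

Replacing tags with a space rather than an empty string avoids gluing together words that were separated only by markup. Whether hyphenated terms should stay whole or be split is a design choice that the assignment leaves to the class discussion.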
