How do you sample the training data and test data


Assignment: IT Project Description

Choose one of the two tasks below for your final project.

The Jack Bauer family is moving to Pittsburgh and is recruiting a butler to help them make decisions. The tasks are:

1. A house. The Jack Bauer family wants to buy a house. The requirements are:

a) The price is less than 500,000 USD.
b) It has investment potential.
c) It is close to medical centers/hospitals, universities, and supermarkets/malls (Target, Walmart, Whole Foods, Costco, etc.).
d) Traffic conditions in the surrounding area are excellent.

2. Technology setup. Jack Bauer can't live without the Internet and asks for your decision support in choosing a Wi-Fi router. He has narrowed the selection down to three comparable models: NETGEAR Nighthawk AC1900, ASUS AC1900, and Linksys AC1750. He has asked his friends for their opinions and collected some reviews of these routers.

You are required to give a presentation to the Jack Bauer family to help them make the decision for the task you chose.

• For Task 1, we suggest using (but you are not limited to) a decision tree, and searching the web for additional data to prepare a rich and engaging presentation.

• For Task 2, we suggest applying sentiment analysis and searching the web for more data to help them make the decision.

• Data: Pittsburgh property price data and product review data.

* You are encouraged to additionally collect your own data to conduct more solid analysis.

• A visualization system to help:

This is an open project, so feel free to use whatever resources you have!

The Jack Bauer family is looking forward to your presentation!

Task 1 Guide:

1. Whether a house should be recommended is a multi-factor decision that depends on its price, investment potential, traffic, proximity to public services, crime rate, neighborhood, etc. Manually rate several (say, about 20) housing options by discussing them among group members. Determine an overall rating for each housing option based on the factors your group chose to judge by.

1.1 This rating can be somewhat subjective, but the more options you rate, the more objective your analysis becomes in the later training step.

1.2 The suggested scale is 1 to 10, but the actual rating scale is up to you.

1.3 In theory, if you manually rated all 3318 options in the csv data, your job would be done, because you could simply recommend the highest-rated houses to the Jack Bauer family. However, do you have time for that?
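One possible way to turn several factor scores into a single overall rating is a weighted average. The sketch below is only an illustration: the factor names, the weights, and the example scores are all hypothetical, and your group should pick its own.

```python
# Hypothetical factor weights agreed on by the group; they should sum to 1.
WEIGHTS = {
    "price": 0.30,
    "investment": 0.25,
    "proximity": 0.25,
    "traffic": 0.20,
}


def overall_rating(factor_scores, weights=WEIGHTS):
    """Combine per-factor scores (each on a 1 to 10 scale) into one rating."""
    missing = set(weights) - set(factor_scores)
    if missing:
        raise ValueError(f"missing factor scores: {sorted(missing)}")
    total = sum(weights[name] * factor_scores[name] for name in weights)
    return round(total, 2)


# Made-up scores for one house, as a group member might record them.
example_house = {"price": 8, "investment": 6, "proximity": 9, "traffic": 5}
print(overall_rating(example_house))
```

Writing the weights down explicitly makes the rating less subjective, because every house is then judged by the same rule.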

2. The subset of data you manually rated (i.e., labeled) is your training data + test data. Again, the more data you label, the more time you need, but the more useful your trained model will be. You need to find a balance by yourselves.

Select 90% of the labeled data as training data and the remaining 10% as test data, and use the test data to tune your decision tree parameters. You want to train a decision tree model that performs well on your test data.

Things to think about:

2.1 How do you sample the training data and test data?
2.2 Do you need to use all the attributes provided in the csv file? Does the raw data need any preprocessing?
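As one possible answer to question 2.1, the sketch below draws a 90/10 split that keeps the mix of ratings similar in both sets (a simple form of stratified sampling). The row layout, the "rating" key, and the demo data are assumptions made for illustration only.

```python
import random


def stratified_split(rows, label_key="rating", test_every=10, seed=42):
    """Split labeled rows so that every `test_every`-th row, taken in
    rating order, goes to the test set (about 10% when test_every=10)."""
    rng = random.Random(seed)      # fixed seed makes the split repeatable
    ordered = rows[:]
    rng.shuffle(ordered)           # random order within each rating
    ordered.sort(key=lambda row: row[label_key])  # stable sort groups ratings
    train, test = [], []
    for index, row in enumerate(ordered):
        if index % test_every == test_every - 1:
            test.append(row)
        else:
            train.append(row)
    return train, test


# Made-up labeled data: 40 houses with ratings 1 to 5.
labeled = [{"id": i, "rating": 1 + i % 5} for i in range(40)]
train, test = stratified_split(labeled)
print(len(train), len(test))
```

A plain random shuffle also works, but with only a few dozen labeled houses it can easily leave a rating level out of the test set, which is why the rows are grouped by rating first.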

3. Once you have trained a satisfactory decision tree model, apply it to the unlabeled data to label them automatically (that is, to predict their ratings).
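The train-then-predict step can be sketched with a very small regression tree written in plain Python. This is an illustration under stated assumptions, not the required implementation: the column names "price" and "hospital_km" and all the numbers are made up, and in practice you would probably use a library or tool of your choice.

```python
from statistics import mean


def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    center = mean(values)
    return sum((v - center) ** 2 for v in values)


def best_split(rows, features, target):
    """Return the (feature, threshold) that most reduces squared error."""
    best = None
    best_error = sse([r[target] for r in rows])
    for feature in features:
        for threshold in sorted({r[feature] for r in rows}):
            left = [r[target] for r in rows if r[feature] <= threshold]
            right = [r[target] for r in rows if r[feature] > threshold]
            if not left or not right:
                continue
            error = sse(left) + sse(right)
            if error < best_error:
                best_error, best = error, (feature, threshold)
    return best


def build_tree(rows, features, target, max_depth=2):
    """Grow the tree recursively; each leaf stores the mean rating."""
    split = best_split(rows, features, target) if max_depth > 0 else None
    if split is None:
        return {"leaf": mean([r[target] for r in rows])}
    feature, threshold = split
    left = [r for r in rows if r[feature] <= threshold]
    right = [r for r in rows if r[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree(left, features, target, max_depth - 1),
        "right": build_tree(right, features, target, max_depth - 1),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return the stored rating."""
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]


# Made-up training rows (manually rated houses).
training_rows = [
    {"price": 250000, "hospital_km": 1.0, "rating": 9},
    {"price": 300000, "hospital_km": 2.0, "rating": 8},
    {"price": 450000, "hospital_km": 1.5, "rating": 6},
    {"price": 480000, "hospital_km": 6.0, "rating": 3},
    {"price": 350000, "hospital_km": 8.0, "rating": 4},
    {"price": 280000, "hospital_km": 7.0, "rating": 5},
]
tree = build_tree(training_rows, ["price", "hospital_km"], "rating")

# An unlabeled house receives a predicted rating.
unlabeled_house = {"price": 270000, "hospital_km": 1.2}
print(predict(tree, unlabeled_house))
```

The max_depth argument is the kind of parameter you would tune against the test data: a deeper tree fits the training ratings more closely but may generalize worse.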

Finally, make your recommendation to the family and support it with good reasons.

Task 2 Guide:

1. Determine whether to perform sentence-level or review-level sentiment analysis (that is, what is your document 'unit' in the training and testing data?). You can choose either, but your choice must be consistent between the training and testing data.

2. Prepare training data from Amazon (or equivalent websites). You can copy and paste manually to create separate training files. You can also automate this by writing code if you have good programmers in your group.

2.1 How many sentences/reviews need to be in the training data? In general, the more the better.

2.2 The proportion of reviews/sentences for each router model, and the proportions of positive and negative reviews in the training set, also need to be considered carefully.

2.3 Other miscellaneous issues such as preprocessing, capitalization, word stemming, etc.
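The preprocessing mentioned in 2.3 might look like the sketch below, which lowercases the text, splits it into word tokens, drops a few common words, and strips common suffixes. The stopword list and the suffix rules are deliberately crude assumptions for illustration; a real stemmer handles many more cases.

```python
import re

# Hypothetical, deliberately short stopword list.
STOPWORDS = {"the", "a", "an", "is", "it", "this", "and", "to", "of"}


def crude_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in ("ing", "ed", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, remove stopwords, and stem each token."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [crude_stem(token) for token in tokens if token not in STOPWORDS]


# Made-up review sentence.
print(preprocess("The router KEEPS dropping connections, and setup is confusing!"))
```

Whatever preprocessing you choose, apply exactly the same function to the training data and the testing data.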

3. Prepare your testing data based on the txt files given on Blackboard. These files contain sentences/reviews without labels. You need to use your sentiment analysis model to automatically label them.

3.1 You can manually convert the four files into separate testing files (either sentence level or review level). If you have a programmer in your group, automating this process is encouraged.

4. Train your model and apply it to your test data. Finally, determine which router is best based on your analysis.
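As a rough illustration of this last step, the sketch below trains a small Naive Bayes classifier (one common choice for sentiment analysis, not necessarily the one you must use) and ranks routers by their share of positive reviews. All review texts are made up, and "router_a" and "router_b" are placeholder names rather than real products. The training data must contain at least one positive and one negative example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (text, label) pairs, label is 'pos' or 'neg'."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def classify(model, text):
    """Return the label with the higher log-probability (add-one smoothing)."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    scores = {}
    for label in ("pos", "neg"):
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(model["doc_counts"][label] / total_docs)
        for token in tokenize(text):
            if token in model["vocab"]:  # ignore words never seen in training
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
        scores[label] = score
    return max(scores, key=scores.get)


def positive_share(model, docs):
    """Fraction of documents classified as positive."""
    labels = [classify(model, doc) for doc in docs]
    return labels.count("pos") / len(labels)


# Made-up training data.
training = [
    ("great signal and fast speed", "pos"),
    ("easy setup and stable connection", "pos"),
    ("love the strong range", "pos"),
    ("keeps dropping connection", "neg"),
    ("slow speed and terrible support", "neg"),
    ("stopped working after a month", "neg"),
]
model = train_naive_bayes(training)

# Made-up unlabeled reviews for two placeholder routers.
reviews = {
    "router_a": ["fast and stable", "great range", "terrible support"],
    "router_b": ["slow and dropping", "stopped working", "easy setup"],
}
shares = {name: positive_share(model, docs) for name, docs in reviews.items()}
best = max(shares, key=shares.get)
print(best, shares)
```

The share of positive reviews is only one possible ranking criterion; you could also weigh the number of reviews or the topics that the negative reviews complain about.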

Format your assignment according to the following formatting requirements:

1. The answer should be typed, double spaced, using Times New Roman font (size 12), with one-inch margins on all sides.

2. The response should also include a cover page containing the title of the assignment, the student's name, the course title, and the date. The cover page is not included in the required page length.

3. Also include a reference page. Citations and references should follow APA format. The reference page is not included in the required page length.

Attachment: Project-Description.rar
