Project: Lexical normalisation of Twitter data


Overview

The goal of this project is to assess the performance of some spelling correction methods on the problem of tweet normalisation, and to express the knowledge that you have gained in a technical report. This aims to reinforce concepts in approximate matching and evaluation, and to strengthen your skills in data analysis and problem solving.

Deliverables

1. One or more programs, implemented in one or more programming languages, which must:
- Determine the best match(es) for a token, with respect to a reference collection (dictionary)
- Process the data input file(s), to determine the best match for each token
- Evaluate the matches, with respect to the truly intended words, using one or more evaluation metrics
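As a sketch of the first step above, one common approximate matching technique is Levenshtein (edit) distance: the best match for a token is the dictionary word (or words) reachable with the fewest single-character insertions, deletions, and substitutions. The function names and the toy dictionary below are illustrative assumptions, not part of the assignment:

```python
def levenshtein(a: str, b: str) -> int:
    # Dynamic-programming edit distance over two rows of the DP table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def best_matches(token: str, dictionary: list[str]) -> list[str]:
    # Return every dictionary word at minimum edit distance from the token.
    scored = [(levenshtein(token, word), word) for word in dictionary]
    best = min(dist for dist, _ in scored)
    return [word for dist, word in scored if dist == best]
```

For example, `best_matches("makn", ["making", "make", "taken"])` returns `["make"]`, since "make" is one substitution away while the others are two edits away. A brute-force scan like this is fine for small dictionaries; for a full reference collection you would typically prune candidates (e.g. by length or shared prefix) first.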

2. A README that briefly details how your program(s) work(s). You may use any external resources for your program(s) that you wish: you must indicate these, and where you obtained them, in your README. The program(s) and README are required submission elements, but will not typically be directly assessed.

3. A technical report, of 1000-1600 words, which must:
- Give a short description of the problem and data set
- Briefly summarise some relevant literature
- Briefly explain the approximate matching technique(s), and how it is (they are) used
- Present the results, in terms of the evaluation metric(s) and illustrative examples
- Contextualise the system's behaviour, based on the (admittedly incomplete) understanding from the subject materials
- Clearly demonstrate some knowledge about the problem
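For the "results in terms of the evaluation metric(s)" point above, the simplest metric is token-level accuracy: the fraction of tokens whose predicted normalisation equals the truly intended word. A minimal sketch (the function name and example data are assumptions for illustration):

```python
def accuracy(predicted: list[str], intended: list[str]) -> float:
    # Fraction of tokens whose prediction matches the intended word exactly.
    if len(predicted) != len(intended):
        raise ValueError("prediction and gold lists must align token-for-token")
    correct = sum(p == g for p, g in zip(predicted, intended))
    return correct / len(intended)
```

So `accuracy(["make", "sense"], ["make", "sensed"])` gives `0.5`. Accuracy alone can be misleading when most tokens need no correction, so reporting precision and recall over the tokens that were actually changed is a common complement.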

Terms of Use

By using this data, you are becoming part of the research community - consequently, as part of your commitment to Academic Honesty, you must cite the curators of the dataset in your report, as the following publication:

Bo Han and Timothy Baldwin (2011) Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, Portland, USA. pp. 368-378.

Reports that do not cite this work constitute plagiarism, and will be correspondingly assigned a mark of 0.

Please note that the dataset is a sub-sample of actual data posted to Twitter, with almost no filtering whatsoever. Unfortunately, the Internet is a place where freedom of speech is both empowering and harmful: consequently, some of the information expressed in the tweets is undoubtedly in poor taste. We would ask you to please look beyond this to the task at hand, as much as possible. (For example, it is generally not necessary to actually read the tweets themselves.)
