Write a function that loads the contents of a given text


Programming Project Assignment: Data Compression using Huffman Codes

Introduction

This assignment focuses on building a lossless data compression algorithm using Huffman Codes. In 1952, David Huffman developed a way to find a set of optimal prefix codes for a given set of input symbols. Essentially, Huffman Codes are variable-length bit strings which uniquely represent a set of input symbols. The main benefit offered by this scheme is that the more common the input symbol, the fewer the number of bits that are required to represent it. Clearly, this is advantageous when compressing data.

Compressing ASCII Text using Huffman Codes

The algorithm always begins with the source set of symbols to be compressed. For this assignment, we will focus only on using ASCII text but the algorithm is applicable to any set of symbols. Below is an example source string to be compressed:

to be or not to be

The first step is find the frequency of each letter in the source text. These symbol frequencies are a fundamental aspect of the algorithm. Table 1 gives a summary of these frequencies for this example. One approach to compressing this data is to use a block encoding representation that uses the minimum number of bits that permit all the letters to be encoded, i.e. 3 bits per letter for this example. Since there are 18 characters in this string, 54bits (3 x 18) would be required to represent this sentence using block encoding. Using the Huffman Codes shown in Table 2, the same sentence can be represented with just 47bits - a 13% reduction in size.

Letter

Freq

Space

5

o

4

t

3

b

2

e

2

n

1

r

1

Table 1 - Letter Frequencies

The second step of the algorithm is to construct a binary tree using the data in Table 1. The pseudo-code below describes a way to build a binary tree. It makes use of a priority queue to construct the tree, i.e. in this kind of data structure items with higher priority are placed at the front of the queue. In this case the letter frequency is used as the priority indicator with lower letter frequencies having higher priority.

CREATE leaf node for each letter

ADD all leaf nodes to priority queue

LOOP WHILE more than one node in priority queue REMOVE two nodes from priority queue*

CREATE new node with two removed nodes as children

SET new node's priority to sum of two child node priorities ADD new node to priority queue

END LOOP

REMOVE remaining node from priority queue SET node as root

*when more than two items have the same priority, any node pair is permitted

The third step of the algorithm is to traverse the constructed binary tree to build Huffman Codes for each letter. To do this all links within the tree must be labelled 0 or 1. By convention, each left child link is a 0 and each right child link is a 1. Figure 1 gives an example binary tree for the example string.

1887_Figure-1–Example-Binary-Tree.jpg
Figure 1 - Example Binary Tree

Each Huffman Code is found by traversing the tree from the root node to the letter in question. Table 2 shows the result of traversing the tree for each letter in the example string.

Letter

Huffman Code

Space

00

o

11

t

011

b

010

e

100

n

1010

r

1011

Table 2 - Huffman Codes

Assignment Tasks

There are 8 tasks to complete for this assignment worth a total of 100 marks. Each task needs to be completed before the next one can be attempted (except the first task which can be completed at any time). You have been provided with a sign-off sheet to be completed during tutorial sessions by your tutor. Your tutor may give you partial marks for any incomplete tasks and so please try to complete as many as possible. It is therefore essential that you attend the remaining tutorial sessions. After you have completed a task you must ask a tutor to sign it off on your sheet.

NOTE: You should take extra care to keep your sign-off sheet safe because you will need to provide a scan or photograph of it when you submit your assignment to Blackboard.

Task 1

Huffman Coding is one example of a lossless compression algorithm. Write a short 500-750 word academic report describing the similarities and differences between lossless and lossy compression algorithms and under what circumstances you might favour one kind of compression algorithm over the other. Your report should include brief descriptions of one example algorithm that is used for each kind of compression (excluding Huffman Coding). You should try to use diagrams and illustrations where applicable to help your explanations of your chosen compression algorithms. Marks will be given for good UWE Harvard referencing, good grammar and spelling, and a good report structure.

Task 2

Write a function that loads the contents of a given text file into memory. Once loaded into memory you should output the text to the Console Window. You are expected to use the ‘ToCompress.txt' file for this part of the assignment (which is provided for you on Blackboard).

Task 3

The following code represents a letter-frequency pair structure:

struct LetterFrequencyPair
{
char character; int frequency;
};

Using this structure and the lecture slides to help you, develop a Linked List data structure that can be used to store a list of letters-frequency pairs. You should use the following test data to demonstrate to your tutor that you can store letter-frequency pairs in your Linked List:

Character

Frequency

A

2

a

3

{a space character}

5

;

1

1

2

Task 4

For this task, you need to write a function to find the frequency of each letter in the text supplied in the ‘ToCompress.txt' file. Remember this is a lossless algorithm and so your implementation must be case- sensitive and include all punctuation. You will need to make use of the code you produced for Task 2 and Task 3 to complete this task, i.e. a Linked List of letter-frequency pairs. Finally, you should print the letter- frequency pairs you have discovered to the Console Window.

Task 5

The following code represents a binary tree node that can be used to build a Binary Tree:

struct BinaryTreeNode
{
struct LetterFrequencyPair* letter_frequency_pair; struct BinaryTreeNode* leftChild;
struct BinaryTreeNode* rightChild;
};

Write a function that takes two BinaryTreeNode parameters, merges them according to the Huffman Coding Algorithm (i.e. it creates a new node, points the left and right child to the supplied BinaryTreeNode parameters, sets the frequency of the new node to be the sum of the parameter's frequencies and returns the new node). You should demonstrate to your tutor that this function performs correctly by using the following test data:

Character

Frequency

a

2

b

3

NOTE: you will need to initialise the new node objects leftChild and rightChild equal to NULL to indicate that they do not point to anything

Task 6

Write a function that takes a Linked List of BinaryTreeNode objects as a parameter and returns the BinaryTreeNode with the lowest frequency. This BinaryTreeNode should also be removed from the Linked List. You should demonstrate to your tutor that this function performs correctly by using the following test data:

Character

Frequency

x

4

y

2

z

5

w

3

NOTE: you will need to create a Linked List of BinaryTreeNode objects using the four letter-frequency pairs given above

Task 7

Write a function that builds a Huffman Tree making full use of the code produced in the previous 5 tasks (Tasks 2-6). You should make use of the algorithm given in the explanation above when building the Huffman Tree for the supplied ‘ToCompress.txt' file. You should demonstrate to your tutor that your program has correctly built a Huffman Tree for this file.

Task 8

The final task is to create a recursive function that prints out the Huffman Codes using the Huffman Tree created in Task 7. Your function needs to recursively traverse the Huffman Tree whilst building the Huffman Codes as it goes. All Huffman Codes should be output to the Console Window together with the letter they represent, e.g. A: 001

NOTE: you will need to use Dynamic Memory Allocation for your Linked Lists and Huffman Tree. To avoid Memory Leak, you must ensure that all allocated memory is deallocated before the program exits. Extra credit will be given if this is correctly done, i.e. there should be no examples of Memory Leak.

Request for Solution File

Ask an Expert for Answer!!
Computer Engineering: Write a function that loads the contents of a given text
Reference No:- TGS02712617

Expected delivery within 24 Hours