Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit


CS2630: Computer Organization

Homework 2

T9 predictive text

(Data structures in C)


Goals for this assignment

● Write a useful interactive application in C

● Implement a dynamic data structure, managing memory properly with malloc and free


3 Preparation

You should have completed the lab assignments, readings, and knowledge checks through the end of Lab 3. Most relevant are the two lectures on data structures in C (with companion code: liststart.c/listfinish.c) and lab 3’s binary search tree.


4 Submission

● t9.c

● report.pdf


5 What is T9?

In this assignment, you will build programs to implement T9 predictive text, a text input mode available on many old cell phones, automated customer service phone systems, and keypads. Each number from 2-9 on the keypad represent three or four letters, the number 0 represents a space, and 1 represents a set of symbols such as { , . ! ? } etc. In this assignment, we will only use the numbers 2-9, which represent the following letters:

● 2 abc

● 3 def

● 4 ghi

● 5 jkl

● 6 mno

● 7 pqrs

● 8 tuv

● 9 wxyz

Since multiple letters map to a single number, many key sequences represent multiple words. For example, the input 2665 represents "book" and "cool", among other possibilities. Therefore, our T9 will also accept “#” as an input to scroll through words with the same code.

To translate from number sequences to words, we will use a data structure known as a trie.


6 What is a trie?

A trie is a multiway branching structure (tree) that is searched using the prefixes of sequences. As we travel down a path in the trie, we reach word sequences spelled out by the numbers along that path. Classic trie data structures have edges labeled with letters to store prefixes of strings. But for this application, we use a compressed trie that has only 8 possible branches at each node instead of 26. The digits 2-9 represent the 26 letters. A node can represent multiple words so an extra layer of complexity (#) is needed to figure out the string represented by a path.

For more information on trie data structures, read the Wikipedia article. Note that the general concept of a trie as “a multiway branching structure (tree) that is searched using the prefixes of sequences” has many different realizations. Example tries you see online might not be exactly the structure that this T9 project needs. For example, many example tries you might find are searched by the word prefixes, not by a numeric T9 code.

In the above abstract representation of a T9 trie that contains

● jello at code 53556

● rocks at 76257

● socks at 76257#


7 What your program needs to do

Implement in C a program called t9. The command

./t9 DICTIONARY TESTFILE

should read in a dictionary file (DICTIONARY) that contains a list of words. Translate each word in the dictionary into its numeric key sequence, then add the key sequence to your trie, with the word at the end of the path corresponding to the digits. If a word with the same numeric sequence already exists in the trie, add the new word to the trie as a link to a new node with an edge labeled '#' instead of one of the digits 2-9. The words linked from a node by the '#' edges form a linked list of words that have the same numeric code.

For example, if the program reads the set of words "jello","rocks", and "socks" from the dictionary and adds it to an empty trie, the resulting trie should look like the picture above. Once your program has read the dictionary and built the trie containing the words in it, it should start reading the test file (TESTFILE). Each line is a T9 code, and for each one, your program should print out the word corresponding to the sequence of numbers entered. Your program should use the numbers to traverse the trie that has already been created, retrieve the word, and print it to the screen. If the test file line is '#', the program should print the next word in the trie that has the same numeric value, and so forth. The test file might also include a number followed by one or more '#' characters - this should print the same word that would be found by typing the number and individual '#' characters on separate input lines.

As an example, if we run the program using the above trie, and the following test file…

76257

#

53556

#

76257#

76257##

4423

exit

…then the program will print…

rocks

socks

jello

There are no more T9onyms

socks

There are no more T9onyms

Not found in current dictionary.

The program should terminate either when it encounters "exit" or the end of the file.

Make sure your program properly handles the case where there are more "#"s than there are T9onyms for a particular number.

We provide you with two dictionary files, smallDictionary.txt and dictionary.txt. Each of these text files contains a list of words to be used in constructing a trie - the small one primarily for testing, and the large one for making sure your program works on large inputs. Translate each word in the file into its associated T9 key sequence, and add the word to the trie. In the case of multiple words having the same key sequence k, let the first word encountered in the text file be represented by the key sequence k, the next encountered represented by k#, the next k##, etc. For example, 2273 can represent acre, bard, bare, base, cape, card, care, or case. To disambiguate, acre would be represented by 2273, bard by 2273#, bare by 2273##, and so forth. When a user inputs a key sequence, print the appropriate word.

You are not limited to just testing with given dictionary and sample test files!! In fact, making a tiny dictionary file and tiny test file are a great way to test your program early on.


8 What goes in report.pdf

8.1 Question 1: Summary

A brief description (3 sentences at most) of your trie data structure.

8.2 Question 2: Box-and-arrow diagram

A box-and-arrow diagram of your trie data structure when it has the contents:

● jello (at code 53556)

● rocks (at code 76257)

● socks (at code 76257#)

This is a box-and-arrow diagram for your particular trie data structure as you’ve coded it in C, NOT an abstract diagram as shown in Section 6.

Where relevant, make sure your box-and-arrow diagram includes slash (/) in boxes that were set to NULL and question mark (?) in boxes with an uninitialized value. Therefore, there should be no empty boxes.

8.3 Question 3: Lookup example

Give a step-by-step description of how your search algorithm looks for 76257## in your trie, when it has the contents given above.

8.4 Question 4: Debugging story

a) Describe one undesired behavior you encountered your program having as you were working on this project. Specifically, what was the expected behavior and what was the observed behavior?

b) For the undesired behavior above, describe what you did to investigate the potential cause of it.

c) Based on that potential cause, explain what you modified in your code, why you thought this change would fix the behavior, and what the new program behavior was.


9 Implementation tips

Here are some tips to help you write your code.

9.1 Reading files line by line

You are using two data inputs for the t9 program:

● a file for the dictionary

● a file for the T9 codes

The filenames for the both text files are provided to your program as command line arguments.

./t9 DICTIONARY TESTFILE

We have provided an example, readfile.c, of getting the name of a file from the command line and then reading and printing lines from that file. To run it, do

./readfile FILENAME

Note that readfile only takes one argument. In t9.c you’ll need to take two arguments specifying two files. A good first step to make sure you understand how to do this is to modify readfile.c so it takes in two filenames and reads and prints their contents line by line, one after the other.

9.2 Constants

Since you will need to create arrays of characters for words and arrays of nodes for children, it may be helpful to have a constant defining the maximum length of a word and maximum number of children.

● The dictionary.txt only contains words of 8 or fewer characters and smallDictionary.txt contains words of 14 or fewer characters. So, you could just make it 14 to avoid having to change it.

o (FYI: I found this out by running this command “wc -L dictionary.txt” on the command line. It calculates the length of the longest line in a file)

● The maximum number of children for a node is probably between 8 and 11, depending on how you represent your trie.

Here’s how to make a constant in C. Put this near the top of your file.

#define MAX_CHARS_IN_WORD 14

9.3 Trie

Your trie implementation needs to involve struct(s), malloc, and free. When coding the trie, use an object-oriented approach. See the linked list in Week 3’s liststart.c/listfinish.c and the binary search tree in Lab 3’s tree.c for examples of an object-oriented approach to data structures. We are intentionally giving you the freedom to choose the particular way you represent your trie.

To give you some sense of the design space, below are four different ways you could define a trie node, along with a box-and-arrow diagram showing what it looks like.

Each one has different implications on memory usage and on how to manage dynamic memory with malloc and free.

You are not constrained to using one of the approaches below but they likely represent straightforward ways to implement your trie.

The following diagrams are incomplete examples, so boxes having a question mark should not necessarily be assumed to be uninitialized; they could for example contain NULL or a legitimate arrow.

9.3.1 Approach 1

struct TrieNode {

char * word;

struct TrieNode * children;

}

● Array pointed to by word needs to be dynamically allocated

● Array pointed to by children needs to be dynamically allocated

9.3.2 Approach 2

struct TrieNode {

char word[MAX_CHARS_PER_WORD];

struct TrieNode * children;

}

● Array of chars for word is part of the TrieNode

● Array pointed to by children needs to be dynamically allocated

9.3.3 Approach 3

struct TrieNode {

char word[MAX_CHARS_PER_WORD];

struct TrieNode* children[NUM_CHILDREN];

}

● Array of chars for word is part of the TrieNode

● Array of pointers to the children TrieNodes is part of the TrieNode

9.3.4 Approach 4

struct TrieNode {

char * word;

struct TrieNode * children[NUM_CHILDREN];

}

● Array pointed to by word needs to be dynamically allocated

● Array of pointers to the children TrieNodes is part of the TrieNode


10 Compiling the code

Here are the commands you can type to compile the code.

● make t9    compiles t9.c into the executable t9

● make readfile    compiles readfile.c into the executable readfile

● make t9-debug    compiles t9.c with debug symbols on (-g)

● make readfile-debug    compiles readfile.c with debug symbols on (-g)

● make clean    delete the executable files

If you have trouble running the “make” commands, just manually type the particular gcc command you need. They are listed in the Makefile.


11 Coding requirements

Just like with writing, it is unreasonable to expect to code “perfectly” on the first draft. Give yourself time to get the functionality working, and then refactor it until it meets the following requirements. The following requirements only need to be met by the version of your code that you finally submit.

You should not stuff all your code in the main procedure. Instead break up the functionality into reasonable functions. Tip on how to do this: think of the trie data structure as separate from the rest of the program. Just like how in CS2 Data Structures, you implemented data structures independently so they could be used by many different programs.

● Multiple files? No. While this program ought to be organized into multiple .c files, we have not yet covered how to do that properly in C. So please put everything in t9.c.

● Include at least the following comments

○ Comment above every function saying what it does (not how just what), particularly what the arguments and return value mean.

○ Comment at top of the file saying 1. Your name, 2. Semester, 3. what the application does, including the flags.

● Use appropriate function names: the function name should be a terse description of what the function does.

● Use appropriate variable names

○ Longer names for important variables that are used for a long time

○ Shorter names for less interesting variables, such as a loop index

● Your program must not have any compiler warnings when compiled with the gcc flag - Wall. This flag is included in the Makefile, so we recommend you compile with make (as described in the earlier section).

● Your program must be portable. You can typically achieve portability by: 1) not having any compiler warnings AND 2) not using any libraries other than the C standard library. We will grade the homework using the CLAS Linux machines (i.e., fastx), so we recommend you compile and run your program at least once on them. Alternatively to fastx, making sure it works on repl.it is fine.

● Use C standard library functions where possible; do not reimplement operations available in the standard libraries.

● Your program must have no memory leaks and no memory errors (e.g., invalid reads). Run it through valgrind to check (see lab 3 for examples). TIP: using the large dictionary.txt file will be very slow with valgrind. Use smaller dictionary files when testing for memory leaks. The staff will run valgrind on your program.

● The program needs to process dictionary.txt in a reasonable amount of time (less than 1 minute). If it is taking several minutes, you might be doing something algorithmically inefficient.

● Often coding style is enforced by your collaborators when they review your code. However, since this is a solo project, we suggest running the following beautifier on your code. It will fix spacing, indentation, and other basic style issues.

clang-format -i t9.c

This assignment derived from CSE 374 Spring 2015.