Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

Introduction

Autocomplete

Autocomplete highly enhances user experience for two reasons: it minimises the number of

keystrokes needed to specify a string input, while at the same time helps users with a suggestion of useful closely related inputs. Autocomplete typically wants to suggest the most likely strings that

share the preix speciied so far. Typically the preix is extended every time a new keyboard event is  triggered, hinting that any successful autocomplete function should be as fast as the average human typist. Surveys suggest that the average person types 200 characters per minute, which means that   ideally our autocomplete algorithm should not take longer than 0.3 seconds to suggest the correct    autocompletion candidates. If autocomplete takes much longer, users will dislike our solution.

Another important aspect is that more than one candidate should be suggested according to their  likelihood. If we fail suggesting the correct autocompletions users will not like our implementation either.

Autocomplete can be eciently implemented using a radix tree, which is a concrete data structure

that stores and supports lookup of keys. Note that radix trees are like binary search trees but may

have more complex rules for children (e.g. branching is based on bit-level comparison of the strings,  rather than full string comparisons). Radix trees are used as an associative data structure (trie), where each node does not store the full key (stores part of the key). The position of the node in the tree

deines the key associated with the node, where all its equal subtree descendants share the same key preix. A key can have an associated weight to indicate its likelihood.

Autocompletion can be implemented in C using a number of abstract data structures and underlying concrete data structures. Any implementation must support the operations:

insert a new item < key, data > into a tree and

find_and_traverse the tree given a preix, and return (or print) a list of pairs < key, data > where all keys share the same preix.

In getting an understanding of the theory behind the new data structure used in stage 3, you may ind the original paper useful: PATRICIA—Practical Algorithm To Retrieve Information Coded in

Alphanumeric (Morrison, 1968).

Radix Trees

Though there are more compact ways to implement the radix tree data structure (e.g. the table

structure described in the original paper will allow the same functionality, but requires less data to be stored), the approach we recommend is a linked data structure made up of nodes. These nodes will    contain:

An  integer  representing the number of bits which are a common preix to all entries in the chain. If there are no common bits, this will be 0.

A dynamically allocated array of  character values, this will contain the entire common preix.

One pointer to a node called  branchA , this covers the branching structure when the bit following the common preix is 0.

One pointer to a node called  branchB , this covers the branching structure when the bit following the common preix is 1.

One pointer to a list of associated  data  items.

This particular approach assumes a unique termination character is also part of the string (we will use the character where all bits of the character are 0 ( \0 ) for this character).

With this approach, we keep track of the number of bits we have processed through, adding the preixes of those bits to the key constructed so far.

To check if the key has been completed, after processing the bits in the current node, the number of   bits seen can be checked to see if it is a multiple of 8 and the inal character can be checked to see if it contains a  \0 (both conditions are necessary).

To treat the searched key as a preix, the terminating character is ignored, with all descendant nodes of the matching preix position traversed and the associated values of all completed keys returned.

Your Task

Your task involves two new stages and a report. The irst new stage is Stage 2, which implements

Autocomplete behaviour using a sorted array as the concrete implementation. Then the second new  stage will be Stage 3, which uses a radix tree. After completing the two new stages, you will compare their performance in a report. For the report, you will need to devise and carry out a process which   makes use of what you have learned in the subject so far.

For the two new stages, your output to the output ile will be the same format as Assignment 1, but  the output to the terminal will change. To support your analysis, this output is recommended to be   the number of bits compared, the number of characters compared and the number of whole strings compared.

Implementation Details

Example Radix Tree Insertions

The irst insertion is conceptually very simple. The radix tree is initially empty, so the preix

comprises the entire key. The diagram below shows the data structure, with the letters shown to illustrate what the binary values tell us about the key.

Work needs to be done when the second key is added. If we add "Arbory Bar and Eatery" to the radix tree, the two difer in the irst character. D has binary representation 0b01000100, and A has binary  representation 0b01000001, so the irst 5 bits (0b01000) are a shared preix.

Note in this case that the separation of bits is just for visual clarity, the irst byte in the "Domino's Pizza" terminal's preix would contain 100 01101.

The inal type of insertion is the duplicate. For this type of insertion, you should add the new record to the end (or start) of the data  list - the conceptually simplest way to do this is to have a function

which searches for the full key in the radix tree to ind the associated node, and then if found, to add  the new data to the end of that list. This approach keeps the add logic simple - it would be possible to combine these for eciency and code deduplication reasons, but we are most concerned with search performance.

Tips

To implement the Radix Tree, a recommended approach is to write two functions, one to get a

particular bit from a given character (e.g.  getBit(key,  n) ) and a second to extract a stem from a given key (e.g.  splitStem(key,  start,  length) ) which extracts a certain number of bits from a given key. Since the earliest bit in a character is the largest value (i.e.  0b10000000  =  128 ), this

process is otherwise error prone.

A few operators you may not be aware of for working on bits in numbers which may be useful are shown in the following table.

Stage 2 - Sorted Array Autocomplete

In Stage 2, you will implement the  insert and  find_and_traverse  operations using a sorted array   as the concrete data structure. After constructing the sorted array using repeated  insert  operations, you can then use  find_and_traverse to ind a key matching the preix by using a binary search, you should then linearly search around this location to ind adjacent values which also match the preix    until you have found all values matching the preix.

Your  Makefile should produce an executable program called  dict2 . This program should take three command line arguments.

1.  The irst argument will be the stage, for this part, the value will always be 2. 2.  The second argument will be the ilename of the data le.

3.  The third argument will be the ilename of the output le.

Your program should:

Just as in Assignment 1, read data from the data ile speciied in the second command line argument. This may be stored unchanged from  dict1  in earlier implementations.

Construct a Sorted Array, sorted on the trading_name attribute.

Accept (assumed) partial  trading_name s from stdin, search using a binary search and print all records matching the preix to the output ile. You may assume that all queries will be strings  terminated by a newline and will always consist of full characters. Though you can use this to  allow the querying of the entire dataset, you may also assume preixes are at least one

character long and will always begin with a standard ASCII letter for simplicity. Unlike the irst assignment, for this stage, your program should not output any speciic output where no records are found matching the preix.

In addition to outputting the record(s) to the output ile, the number of comparisons should be output to stdout. It is recommended that this show the number of bits, bytes (characters) and   strings compared.

For testing, it may be convenient to create a ile of queries to be searched, one per line, and redirect the input from this ile. Use the UNIX operator  < to redirect input from a ile.

Example Execution

An example execution of the program might be:

make  -B  dict2

#  ./dict2  stage  datafile  outputfile

./dict2  2  dataset_2.csv  output.txt

followed by typing in the queries, line by line, and then ending input once all keys are entered or:

make  -B  dict2

#  ./dict2  stage  datafile  outputfile

./dict2  2  dataset_2.csv  output.txt  < queryfile

Example Output

This is an example of what might be output to the output le after two queries:

McDonalds

-->  census_year:  2021  ||  block_id:  114  ||  property_id:  103235   ||  base_property_id:  103235  ||  buildi -->  census_year:  2020  ||  block_id:  65  ||  property_id:  109301  ||  base_property_id:  109301   ||  buildin -->  census_year:  2020  ||  block_id:  931  ||  property_id:  103908  ||  base_property_id:  103908   ||  buildi Standing  Room

With the following output to stdout:

McDonalds  -->  b472  c59  s12

Standing  Room  -->  b232  c29  s10

Stage 3 - Radix Tree Autocomplete

In Stage 3, you will use a Radix Tree as the concrete data structure to implement your approach.

Your  Makefile should now also produce an executable program called  dict3  . This program should follow the same input and output format at Stage 2.

Your dict3  program should:

Just as Stage 2, read in the dataset, store it in a dictionary and construct an autocomplete dictionary on the stored data.

·   For Stage 3, you should store the data in a Radix Tree of order 2 as described in the example at  the top of this slide and the earlier slides. This tree should be searched by comparing the next    highest order bit in the preix being searched with the associated value in the preix seen so far. In order to align your solution with the provided test cases, your solution should not count the  branching comparison as a bit comparison, but should count the irst bit in the preix following  the branch as a bit comparison (getting this wrong may lead to minor variations). In traversing   the matches,  branchA  should be fully explored before  branchB .

Output should be in the same format as Stage 2. For simplicity, you may make a reasonable assumption about how to count string comparisons, whether or not a string comparison

explicitly occurs. For this reason, some edge cases will deliberately not be checked.

Example Execution

An example execution of the program might be:

make  -B  dict3

#  ./dict3  stage  datafile  outputfile

./dict4  3  dataset_2.csv  output.txt

followed by typing in the queries, line by line, and then ending input once all keys are entered or:

make  -B  dict3

#  ./dict3  stage  datafile  outputfile

./dict3  3  dataset_2.csv  output.txt  < queryfile

Example Output

This is an example of what might be output to the output le after two queries:

McDonalds

-->  census_year:  2021  ||  block_id:  114  ||  property_id:  103235   ||  base_property_id:  103235  ||  buildi -->  census_year:  2020  ||  block_id:  65  ||  property_id:  109301  ||  base_property_id:  109301   ||  buildin -->  census_year:  2020  ||  block_id:  931  ||  property_id:  103908  ||  base_property_id:  103908   ||  buildi Standing  Room

With the following output to stdout:

McDonalds  -->  b74  c20  s1

Standing  Room  -->  b45  c21  s0

Experimentation

You will run various test cases through your program to test its correctness and also to examine the number of key comparisons used when searching key preixes across diferent iles. You will report  on the number of comparisons used by your Stage 2 and Stage 3 implementations for diferent data file inputs and key preixes. You will compare these results with what you expect based on theory (Big-O).

Your approach should be systematic, varying the size of the iles you use and their characteristics (e.g.

sorted/unsorted, duplicates, range of key space, length of preix, etc.), and observing how the number of key comparisons varies given diferent sizes of key preixes. Looking at statistical   measures about the results of your tests may help.

UNIX programs such as  sort (in particular  -R ),  head ,  tail ,  cat  and  awk  may be valuable. You may also ind small C programs (or additional "Stages") useful, the test data you produce and use   does not have to be restricted to the dataset provided.

Additionally, though your C code should run on Ed, you may like to run omine tests or other work outside of Ed and are welcome to do so.

You will write up your indings and submit them along with your Assignment Submission in the Assignment Submission Tab.

Make sure to click the "Mark" button after uploading your report. It is often the case for results to be released and more than a handful of students to ind an unexpected 0 mark for the report due to not technically

submitting it for marking.

You are encouraged to present the information you've collected in appropriate tables and graphs. Select informative data, more is not always better.

In particular, you should present your indings clearly, in light of what you know about the data

structures used in your programs and in light of their known computational complexity. You may ind that your results are what you expected, based on theory. Alternatively, you may ind your results do  not agree with theory. In either case, you should state what you expected from the theory, and if

there is a discrepancy you should suggest possible reasons. You might want to discuss space-time trade-ofs, if this is appropriate to your code and data.

You are not constrained to any particular structure in this report, but a well laid out plan with a strong structure tends to lead to the best results. A more general and helpful guiding structure is shown in

the Additional Support slide, but you may ind it valuable to keep in mind a general low:

Introduction - Summary of data structures and inputs.

Stage 2 and Stage 3:

- Data (data ile, number of keys, characteristics, summary statistics for relevant metrics)

- Comparison with theory

Discussion

Dataset

The sample dataset is the same as Assignment 1, but is reproduced here for completeness.

The dataset comes from the City of Melbourne Open Data website, which provides a variety of data about Melbourne that you can explore and visualise online:

https://data.melbourne.vic.gov.au/

The dataset used in this project is a subset of the Cafes and Restaurants with Seating Capacity dataset.

The processed dataset can be found in the Dataset Download slide. You aren't expected to perform this processing yourself.

The provided dataset has the following 14 ields:

census_year - The year the information was recorded for (integer)

block_id - The city block ID. (integer)

property_id - The ID of the property. (integer)

base_property_id - The ID of the building the business is in. (integer)

building_address - The address of the building. (string)

clue_small_area - The CLUE area of Melbourne that the building is in. (string)

business_address - The address of the business itself. (string)

trading_name - The name of the business. (string)

industry_code - The ID for the category of the business. (integer)

industry_description - The description of the category of the business. (string)

seating_type - The type of seating the record describes. (string)

number_of_seats - The number of seats provided of this type. (integer)

longitude - The longitude (x) of the seating location. (double)

latitude - The latitude (y) of the seating location. (double)

The ields  building_address ,  clue_small_area ,  business_address ,  trading_name ,

industry_description and  seating_type are strings, and could contain spaces or commas. You may assume none of these ields are more than 128 characters long. The ields  census_year ,

block_id ,  property_int ,  base_property_id ,  industry_code and  number_of_seats are integers, you may assume they are always speciied and present. The ields  longitude and  latitude are

double values, you should store them so that they can be printed correctly to 5 decimal places.

This data is in CSV format, with each ield separated by a comma. For the purposes of the assignment, you may assume that:

the input data are well-formatted,

input iles are not empty,

input iles include a header,

⚠ any ield containing a comma will begin with a double quotation mark ( " ) and end with a quotation mark ( " ),

each ield contains at least one character,

·   ields always occur in the order given above, and that

the maximum length of an input record (a single full line of the CSV) is 512 characters.

Where a string contains a comma, the usual CSV convention will be followed that such strings will be  delimited with a quotation mark (always found at the beginning and end of the ield) — these quotes  should not be preserved once the data is read into its ield. You may assume no quotes are present in the processed strings when they are stored in your dictionary.

Smaller samples of these datasets can be found in the Dataset Download slide.

Requirements

The following implementation requirements must be adhered to:

You must write your implementation in the C programming language.

You must write your code in a modular way, so that your implementation could be used in

another program without extensive rewriting or copying. This means that the dictionary

operations are kept together in a separate .c ile, with its own header (.h) ile, separate from the main program, and other distinct modules are similarly separated into their own sections,

requiring as little knowledge of the internal details of each other as possible. Though note that your approach is not expected to implement object oriented principles strictly or deliberately.

Your code should be easily extensible to multiple dictionaries. This means that the functions for interacting with your dictionary should take as arguments not only the values required to

perform the operation required, but also a pointer to a particular dictionary, e.g.

search(dictionary,  value) .

Your implementation must read the input ile once only.

Your program should store strings in a space-ecient manner. If you are using  malloc() to

create the space for a string, remember to allow space for the inal end of string character,‘ \0 ’ ( NULL ).

Your program should store its data in a space-ecient manner. If you construct an index, this index should not contain information it does not need.

Your approach should be reasonably time e历cient.

A full Makeile is not provided for you. The Makeile should direct the compilation of your   program. To use the Makeile, make sure it is in the same directory as your code, and type  make  dict2 and make  dict3 to make the dictionary. You must submit your Makeile with your assignment.

Hints:

• If you haven’t used  make  before, try it on simple programs irst. If it doesn’t work, read the error messages  carefully. A common problem in compiling multiile executables is in the included header iles. Note also that the whitespace before the command is a tab, and not multiple spaces.

• It is not a good idea to code your program as a single ile and then try to break it down into multiple iles. Start by using multiple iles, with minimal content, and make sure they are communicating with each other before starting more serious coding.

Programming Style

Below is a style guide which assignments are evaluated against. For this subject, the 80 character

limit is a guideline rather than a rule — if your code exceeds this limit, you should consider whether your code would be more readable if you instead rearranged it.

/**  ***********************

*  C  Programming  Style  for  Engineering  Computation

*  Created  by  Aidan  Nagorcka-Smith  ([email protected])  13/03/2011

*  Definitions  and  includes

*  Definitions  are  in  UPPER_CASE

*  Includes  go  before  definitions

*  Space  between  includes,  definitions  and  the  main  function .

*  Use  definitions  for  any  constants  in  your  program,  do  not  just  write  them *  in .

*

*  Tabs  may  be  set  to  4-spaces  or  8-spaces,  depending  on  your  editor .  The  code

*  Below  is  ``gnu''  style .  If  your  editor  has  ``bsd''  it  will  follow  the  8-space

*  style .  Both  are  very  standard . */

/**

*  GOOD:

*/


#include  

#include  

#define  MAX_STRING_SIZE  1000

#define  DEBUG  0

int  main(int  argc,  char  **argv)  {

...


/**

*  BAD:

*/


/*  Definitions  and  includes  are  mixed  up  */

#include  

#define  MAX_STING_SIZE  1000

/*  Definitions  are  given  names  like  variables  */

#define  debug  0

#include  

/*  No  spacing  between  includes,  definitions  and  main  function*/

int  main(int  argc,  char  **argv)  {

...


/**  *****************************

*  Variables

*  Give  them  useful  lower_case  names  or  camelCase .  Either  is  fine,


*  as  long  as  you  are  consistent  and  apply  always  the  same  style .

*  Initialise  them  to  something  that  makes  sense . */

/**

*  GOOD:  lower_case

*/


int  main(int  argc,  char  **argv)  {


int  i  =  0;

int  num_fifties  =  0;

int  num_twenties  =  0;

int  num_tens  =  0;


...

/**

*  GOOD:  camelCase

*/


int  main(int  argc,  char  **argv)  {


int  i  =  0;

int  numFifties  =  0;

int  numTwenties  =  0;

int  numTens  =  0;


...

/**

*  BAD:

*/


int  main(int  argc,  char  **argv)  {


/*  Variable  not  initialised  -  causes  a  bug  because  we  didn't  remember  to

*  set  it  before  the  loop  */

int  i;

/*  Variable  in  all  caps  -  we'll  get  confused  between  this  and  constants

*/

int  NUM_FIFTIES  =  0;

/*  Overly  abbreviated  variable  names  make  things  hard .  */

int  nt  =  0


while  (i  < 10) {

...

i++;

}


...


/**  ********************

*  Spacing:

*  Space  intelligently,  vertically  to  group  blocks  of  code  that  are  doing  a

*  specific  operation,  or  to  separate  variable  declarations  from  other  code .


*  One  tab  of  indentation  within  either  a  function  or  a  loop .

*  Spaces  after  commas .

*  Space  between  )  and  { .

*  No  space  between  the  **  and  the  argv  in  the  definition  of  the  main

*  function .

*  When  declaring  a  pointer  variable  or  argument,  you  may  place  the  asterisk

*  adjacent  to  either  the  type  or  to  the  variable  name .

*  Lines  at  most  80  characters  long .

*  Closing  brace  goes  on  its  own  line */

/**

*  GOOD:

*/


int  main(int  argc,  char  **argv)  {


int  i  =  0;


for(i  =  100;  i  >=  0;  i--)  {

if  (i  >  0)  {

printf("%d  bottles  of  beer,  take  one  down  and  pass  it  around,"

"  %d  bottles  of  beer.\n",  i,  i  -  1);

}  else  {

printf("%d  bottles  of  beer,  take  one  down  and  pass  it  around . "

"  We're  empty.\n",  i);

}

}


return  0;

}


/**

*  BAD:

*/


/*  No  space  after  commas

*  Space  between  the  **  and  argv  in  the  main  function  definition

*  No  space  between  the  )  and  {  at  the  start  of  a  function  */ int  main(int  argc,char  **  argv){

int  i  =  0;

/*  No  space  between  variable  declarations  and  the  rest  of  the  function .

*  No  spaces  around  the  boolean  operators  */

for(i=100;i>=0;i--)  {

/*  No  indentation  */

if  (i  >  0)  {

/*  Line  too  long  */

printf("%d  bottles  of  beer,  take  one  down  and  pass  it  around,  %d

bottles  of  beer.\n",  i,  i  -  1);

}  else  {

/*  Spacing  for  no  good  reason .  */


printf("%d  bottles  of  beer,  take  one  down  and  pass  it  around . "

"  We're  empty.\n",  i);


}

}

/*  Closing  brace  not  on  its  own  line  */

return  0;}


/**  ****************

*  Braces:

*  Opening  braces  go  on  the  same  line  as  the  loop  or  function  name

*  Closing  braces  go  on  their  own  line

*  Closing  braces  go  at  the  same  indentation  level  as  the  thing  they  are

*  closing */

/**

*  GOOD:

*/


int  main(int  argc,  char  **argv)  {


...


for(...)  {

...

}


return  0;

}


/**

*  BAD:

*/


int  main(int  argc,  char  **argv)  {


...


/*  Opening  brace  on  a  different  line  to  the  for  loop  open  */

for(...)

{

...

/*  Closing  brace  at  a  different  indentation  to  the  thing  it's

closing

*/

}


/*  Closing  brace  not  on  its  own  line .  */

return  0;}


/**  **************

*  Commenting:

*  Each  program  should  have  a  comment  explaining  what  it  does  and  who  created *  it .

*  Also  comment  how  to  run  the  program,  including  optional  command  line


*  parameters .

*  Any  interesting  code  should  have  a  comment  to  explain  itself .

*  We  should  not  comment  obvious  things  -  write  code  that  documents  itself */

/**

*  GOOD:

*/


/*  change.c

*

*  Created  by  Aidan  Nagorcka-Smith  ([email protected])

13/03/2011

*

*  Print  the  number  of  each  coin  that  would  be  needed  to  make  up  some change

*  that  is  input  by  the  user

*

*  To  run  the  program  type:

*  ./coins  --num_coins  5  --shape_coins  trapezoid  --output  blabla.txt

*

*  To  see  all  the  input  parameters,  type:

*  ./coins  --help

*  Options::

*      --help     &nb