CSE340 Fall 2022 Project 1: Generating a lexical analyzer automatically!!
1 Introduction
I will start with a high-level description of the project in this section. In subsequent sections, I will go into a detailed description of the requirements and how to go about implementing a solution that satisfies them.
The goal of this project is to implement a lexical analyzer automatically for any list of tokens that are specified using regular expressions (if you do not know what a regular expression is, do not worry, it will be defined in this document). The input to your program will have two parts:
1. The first part of the input is a list of tokens separated by commas and terminated with the # (hash) symbol. Each token in the list consists of a token name and a token description. The token description is a regular expression for the token. The list has the following form:
t1_name t1_description , t2_name t2_description , . . . , tk_name tk_description #
2. The second part of the input is an input string which is a sequence of letters and digits and space characters.
Your program will read the list of tokens, represent them internally in appropriate data structures, and then perform lexical analysis on the input string to break it down into a sequence of token and lexeme pairs from the provided list of tokens. The output of the program will be this sequence of tokens and lexemes. If, during the processing of the input string, your program cannot identify a token to match from the list, it outputs ERROR and stops. If the input to the program has a syntax error, then your program should not do any lexical analysis of the input string; instead, it should output a syntax error message and exit. Your program should also output error messages if the same name is used for multiple tokens or if one of the regular expressions generates the empty string. More specifics about the input format and expected output are given in Sections 2 and 3. The remainder of this document is organized as follows.
1. The second section describes the input format.
2. The third section describes the expected output.
3. The fourth section describes the requirements on your solution and the grading criteria.
4. The fifth and largest section is a detailed explanation of how to go about implementing a solution. This section also includes a description of regular expressions.
5. The last section includes general instructions that apply to all programming assignments in this class.
2 Input Format
The input of your program is specified by the following context-free grammar:
input → tokens_section INPUT_TEXT
tokens_section → token_list HASH
token_list → token
token_list → token COMMA token_list
token → ID expr
expr → CHAR
expr → LPAREN expr RPAREN DOT LPAREN expr RPAREN
expr → LPAREN expr RPAREN OR LPAREN expr RPAREN
expr → LPAREN expr RPAREN STAR
expr → UNDERSCORE
Where
CHAR = a | b | . . . | z | A | B | . . . | Z | 0 | 1 | . . . | 9
LETTER = a | b | . . . | z | A | B | . . . | Z
SPACE = ' ' | \n | \t
INPUT_TEXT = " (CHAR | SPACE)* "
COMMA = ','
DOT = '.'
HASH = '#'
ID = LETTER . CHAR*
LPAREN = '('
OR = '|'
RPAREN = ')'
STAR = '*'
UNDERSCORE = '_'
You are provided with a lexer to read the input, but you are asked to write the parser. In the description of regular expressions, UNDERSCORE represents epsilon (more about that later).
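As a sketch of how the expr rules translate into a recursive-descent parser, the following minimal recognizer follows the grammar above. For illustration it scans a raw string directly; in the project you would instead call the provided lexer's GetToken() and compare token types. The names ExprParser and is_valid_expr are illustrative, not part of the provided code.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Minimal sketch of a recursive-descent recognizer for the expr rules
// of the grammar above, scanning a raw string (spaces are skipped).
struct ExprParser {
    std::string s;
    size_t pos = 0;

    void skip() { while (pos < s.size() && s[pos] == ' ') pos++; }
    bool eat(char c) { skip(); if (pos < s.size() && s[pos] == c) { pos++; return true; } return false; }

    // expr -> CHAR | UNDERSCORE
    //       | LPAREN expr RPAREN DOT LPAREN expr RPAREN
    //       | LPAREN expr RPAREN OR  LPAREN expr RPAREN
    //       | LPAREN expr RPAREN STAR
    bool parse_expr() {
        skip();
        if (pos < s.size() && isalnum((unsigned char)s[pos])) { pos++; return true; }  // CHAR
        if (eat('_')) return true;                                                     // UNDERSCORE
        if (!eat('(') || !parse_expr() || !eat(')')) return false;
        skip();
        if (eat('*')) return true;                                 // (expr)*
        char op = (pos < s.size()) ? s[pos] : '\0';
        if (op != '.' && op != '|') return false;                  // syntax error in expression
        pos++;
        return eat('(') && parse_expr() && eat(')');
    }
};

// True when the whole string is one well-formed expr.
bool is_valid_expr(const std::string& text) {
    ExprParser p{text};
    bool ok = p.parse_expr();
    p.skip();
    return ok && p.pos == p.s.size();
}
```

On a syntax error, the project requires printing SYNTAX ERROR IN EXPRESSION OF token_name; a recognizer like this would simply report failure to its caller, which knows the name of the token it just parsed.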
2.1 Examples
The following are examples of input.
1. t1 (a) | (b) , t2 (a) . ((a)*) , t3 (((a) | (b))*) . (c) # "a aa bb aab"
This input specifies three tokens t1, t2, and t3 and an INPUT_TEXT “a aa bb aab".
2. t1 (a) | (b) , t2 ((c)*) . (b) # "a aa bb aad aa"
This input specifies two tokens t1, t2, and an INPUT_TEXT “a aa bb aad aa".
3. t1 (a) | (b) , t2 (c) . ((a)*) , t3 (((a) | (b))*) . (((c) | (d))*)# "aaabbcaaaa"
This input specifies three tokens t1, t2, and t3 and an INPUT_TEXT “aaabbcaaaa”.
4. tok (a) . ((b) | (_)) , toktok (a) | (_), tiktok ((a) . (a)) . (a) # "aaabbcaaaa"
This input specifies three tokens whose names are tok, toktok, and tiktok and an INPUT_TEXT “aaabbcaaaa”. Recall that in the description of regular expressions, underscore represents epsilon, so the regular expression for the token tok is equivalent to (a).((b)|(ε)) and the regular expression for the token toktok is equivalent to (a)|(ε).
Note 1 The code we provided breaks down the input to the program into tokens like ID, LPAREN, RPAREN, and so on. To read the input, the code we provide has an object called lexer and a function GetToken() used to read the input according to the fixed list of tokens given above for the input to the program. Your program will then have to break down the INPUT_TEXT string into a sequence of tokens according to the list of tokens in the input to the program. In order not to confuse the function that you are going to write to break down the INPUT_TEXT with the function GetToken() in the code we provided, you should call your function something else, like my_GetToken(), for example.
3 Output Format
The output will be either SYNTAX ERROR if the input has a syntax error or a message indicating that one or more of the tokens have expressions that are not valid (see below) or a sequence of tokens and their corresponding lexemes according to the list of tokens provided if there are no errors. More specifically, the following are the output requirements.
1. (Syntax Checking: 20 points, but no partial credit) If the input to your program is not in the correct format (not according to the grammar in Section 2), there are two cases to consider:
(a) (10 points) If the first syntax error that your program encounters is while parsing a regular expression, then your program should output the following and nothing else, and stop:
SYNTAX ERROR IN EXPRESSION OF token_name
where token_name is the name of the token that your program parses before parsing the regular expression (such a name must exist because if it is missing, your program must have encountered a syntax error before parsing the expression).
(b) (10 points) If the first syntax error that your program encounters is not in the regular expression of a token, then your program should output the following and nothing else:
SNYTAX ERORR
You should make sure not to print anything before parsing of the input is completed.
2. (Semantic Checking: 15 points) If the input to your program is syntactically correct (no syntax error), but some token name is declared more than once, then for every instance of the token name, except for the first instance, your program should output
Line line_number1: token_name already declared on line line_number2

where line_number1 is the line on which the duplicate token_name appears and line_number2 is the line on which the first appearance of token_name occurs.
3. (Epsilon error: 20 points) If there are no syntax errors and all tokens have distinct names, then if any regular expressions of the tokens in the list of tokens in the input to your program can generate the empty string, then your program should output
EPSILON IS NOOOOOOOT A TOKEN !!! token_name1 token_name2 . . . token_namek

where token_name1, token_name2, ..., token_namek is the list of names of the tokens whose regular expressions can generate the empty string.
4. (Lexical Analysis : 65 points) If there is no syntax error and none of the expressions of the tokens can generate the empty string, your program should do lexical analysis on INPUT_TEXT and produce a sequence of tokens and lexemes in INPUT_TEXT according to the list of tokens specified in the input to your program. Each token and lexeme should be printed on a separate line. The output on a given line will be of the form
t , "lexeme"
where t is the name of a token and lexeme is the actual lexeme for the token t. If during lexical analysis of INPUT_TEXT, a syntax error is encountered then ERROR is printed on a separate line and the program exits.
In doing lexical analysis for INPUT_TEXT, SPACE is treated as a separator and is otherwise ignored.
The total of these points is 120 points. In addition, there are 10 points for programs that compile correctly and are documented (see below). So, the overall total is 130 points which will count as 13% of the course grade.
Note 2 The my_GetToken() that you will write is a general function that takes a list of token representations and does lexical analysis according to those representations. In later sections, I explain how that can be done, so do not worry about it yet, but keep in mind that you will be writing a general my_GetToken() function.
Examples
Each of the following examples gives an input and the corresponding expected output.
1. t1 (a) | (b) , t2 ((a)*) . (a) , t3 (((a) | (b))*) . (((c)*) . (c)) # "a aac bbc aabc"
This input specifies three tokens t1, t2, and t3 and an INPUT_TEXT “a aac bbc aabc”. Since the input is in the correct format and none of the regular expressions generates epsilon, the output of your program should be the list of tokens in the INPUT_TEXT:
t1 , "a"
t3 , "aac"
t3 , "bbc"
t3 , "aabc"
2. t1 (a) | (b) , t2 ((a)*) . (a) , t3 (((a) | (b))*) . (c) # "a aa bbc aad aa"
This input specifies three tokens t1, t2, and t3 and an INPUT_TEXT “a aa bbc aad aa”. Since the input is in the correct format and none of the regular expressions generates epsilon, the output of the program should be
t1 , "a"
t2 , "aa"
t3 , "bbc"
t2 , "aa"
ERROR
Note that doing lexical analysis for INPUT_TEXT according to the list of tokens produces ERROR after the second t2 token because there is no token that starts with 'd'.
3. t1a (a) | (b) , t2bc (a) . ((a)*) , t34 (((a) | (b))*) . ((c) | (d))# "aaabbcaaaa"
This input specifies three tokens whose names are t1a, t2bc, and t34 and an INPUT_TEXT “aaabbcaaaa”. Since the input is in the correct format, the output of your program should be the list of tokens in the INPUT_TEXT:
t34 , "aaabbc"
t2bc , "aaaa"
4. t1 (a) | (b) , t2 ((a)*) . (a) , t3 (a)*, t4 b , t5 ((a) | (b))* # "a aac bbc aabc"
This input specifies five tokens and an INPUT_TEXT “a aac bbc aabc”. Since some of the regular expressions can generate epsilon, the output of your program should be:
EPSILON IS NOOOOOOOT A TOKEN !!! t3 t5
4 Requirements and Grading
You should write a program to produce the correct output for a given input as described above. You will be provided with a number of test cases; note that the provided test cases are only a small subset of the test cases that will be used for grading. In your solution, you are not allowed to use any built-in or library support for regular expressions in C/C++. This requirement will be enforced by checking your code.
The grade is broken down as follows for a total of 130 points
1. Submission compiles and code properly documented 10 points. To get the 10 points,
● the submission must compile and
● every function must have comments and
● every file must have your name.
2. Submission does not compile or some functions have no comments or some submitted file does not have your name: no credit for the submission.
3. Syntax checking: 20 points: 10 points for each category of syntax error messages (There is no partial credit for this part)
4. Semantic checking: 15 points (grade is strictly proportional to the number of test cases that your program successfully passes)
5. EPSILON IS NOOOOOOOT A TOKEN !!! error: 20 points (grade is strictly proportional to the number of test cases that your program successfully passes)
6. Lexical analysis of INPUT_TEXT: 65 points (grade is strictly proportional to the number of test cases that your program successfully passes)
Refer to the general project environment document for information about the compiler that we will use and the execution environment.
Note 3 If your code does not compile on the submission website, you will not receive any points, not even for documentation. Do not wait until the last minute to submit, because there can be unexpected issues when you submit, especially if you are developing your solution in an environment that is not compatible with the environment of the submission website.
5 How to Implement a Solution
The parser for this project is relatively simple, but this is the first time you write a parser. So, it is important that you finish your parser completely before you attempt to do anything else. You should make sure that the parser is correctly parsing and producing syntax error messages when needed.
The main difficulty in the project is in transforming a given list of token names and their regular expression descriptions into a my_GetToken() function for the given list of tokens. This transformation will be done in three high-level steps:
1. Transform regular expressions into REGs. The goal here is to parse a regular expression description and generate a graph that represents the regular expression. The generated graph will have a specific format, and I will describe below how to generate it. I will call it a regular expression graph, or REG for short.
2. Write a function match(r,s,p), where r is a REG, s is a string and p is a position in the string s. The function match will return the longest possible lexeme starting from position p in the string s that matches the regular expression of the graph r.
3. Write a class my_LexicalAnalyzer(list,s), where list is a list of structures of the form {token_name, reg_pointer} and s is an input string. my_LexicalAnalyzer stores the list of structures and keeps track of the part of the input string that has been processed by updating a variable p which is the position of the next character in the input string that has not been processed. The class my_LexicalAnalyzer has a method my_GetToken(). For every call of my_GetToken(), match(r,s,p) is called for every REG r in the list starting from the current position p maintained in my_LexicalAnalyzer. my_GetToken() returns the token with the longest matching prefix together with its lexeme and updates the value of the current position
p. If the longest matching prefix matches more than one token, the matched token that is listed first in the list of tokens is returned.
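The driver loop of step 3 can be sketched as follows. The per-token matcher is abstracted as a std::function so the snippet stands alone; in the project it would be the REG-based match(r,s,p) of step 2. All names here (TokenDef, MyLexicalAnalyzer, the EOF result) are illustrative assumptions, not prescribed.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

struct TokenDef {
    std::string name;
    // match(s, p): length of the longest lexeme of this token starting at p, 0 if none
    std::function<int(const std::string&, int)> match;
};

struct MyLexicalAnalyzer {
    std::vector<TokenDef> list;
    std::string s;      // the INPUT_TEXT
    int p;              // position of the next unprocessed character

    // Returns {token_name, lexeme}; {"ERROR", ""} when no token matches,
    // {"EOF", ""} when the input is exhausted (end-of-input handling is an
    // assumption of this sketch).
    std::pair<std::string, std::string> my_GetToken() {
        while (p < (int)s.size() && s[p] == ' ') p++;   // SPACE is a separator
        if (p >= (int)s.size()) return {"EOF", ""};
        int best_len = 0;
        std::string best_name;
        for (const auto& t : list) {                    // strict '>': earlier token wins ties
            int len = t.match(s, p);
            if (len > best_len) { best_len = len; best_name = t.name; }
        }
        if (best_len == 0) return {"ERROR", ""};
        std::string lexeme = s.substr(p, best_len);
        p += best_len;
        return {best_name, lexeme};
    }
};
```

Using strict greater-than when comparing match lengths implements the tie-breaking rule directly: among tokens with the same longest matching prefix, the one listed first is kept.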
In what follows I describe how a regular expression description can be transformed into a REG
and how to implement the function match(r,s,p). But first, I will give an overview of regular expressions and the sets of strings they represent.
5.1 Set of Strings Represented by Regular Expressions
A regular expression is a compact representation of a set, possibly infinite, of strings. For a given regular expression, we say that the expression can generate a string if the string is in the set represented by the regular expression. We start with a general description, then we give examples.
5.1.1 General description
We start with the simple expressions (the base cases)
● (One-character strings) The regular expression a represents the set of strings {a}, that is the set consisting of only the string "a".
● (Epsilon) The regular expression _ represents the set of strings {ε}, that is, the set consisting of only the string ε (which is the empty string).
For the inductive step (recursion of your parser), there are four cases:
● (Parentheses) If R is a regular expression, the regular expression (R) represents the same set of strings that R represents. The parentheses are used for grouping and to facilitate parsing and do not have a meaning otherwise.
● (Concatenation) If R1 and R2 are regular expressions that represent the sets of strings S1 and S2 respectively, then (R1) . (R2) represents the set of strings that can be obtained by concatenating one string from S1 with one string from S2 (order matters).
● (Union) If R1 and R2 are regular expressions that represent the sets of strings S1 and S2 respectively, then (R1) | (R2) represents the union of the two sets of strings S1 and S2.
● (Kleene star) The last case is the most interesting because it allows an unlimited number of repetitions. If R is a regular expression that represents the set of strings S, then (R)* represents the set of strings that can be obtained by concatenating any number of strings from S, including zero strings (which gives us epsilon).
5.1.2 Examples
1. The set of strings represented by a is
{a}
2. The set of strings represented by b is
{b}
3. The set of strings represented by (a) | (b) is
{a, b}
4. The set of strings represented by ((a) | (b)) . (c) is {ac, bc}
5. The set of strings represented by ((a) | (b)) . ((c) | (d)) is
{ac, ad, bc, bd}
This requires some explanation. The set of strings represented by ((a) | (b)) is {a, b} and the set of strings represented by ((c) | (d)) is {c, d}, so the set of strings represented by ((a) | (b)) . ((c) | (d)) consists of the strings that can be obtained by taking one string from the set {a, b} and one string from the set {c, d} and concatenating them together. The possibilities are
{ac, ad, bc, bd}
6. The set of strings represented by ((c) | (d)) . ((a) | (b)) is {ca, cb, da, db}
7. The set of strings represented by (a)* is
{ε, a, aa, aaa, aaaa, . . .}
8. The set of strings represented by (b)* is
{ε, b, bb, bbb, bbbb, . . .}
9. The set of strings represented by (a) | ((b)*) is
{a, ε, b, bb, bbb, bbbb, . . .}
Figure 1: Regular expression graphs for the base cases
Figure 2: Regular expression graph for an expression obtained using the dot operator
10. The set of strings represented by ((a)*) | ((b)*) is
{ε, a, b, aa, bb, aaa, bbb, aaaa, bbbb, . . .}
11. The set of strings represented by ((a) | (b))* is
{ε, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, bba, bbb, . . .}
5.2 Constructing REGs
The construction of REGs is done recursively. The construction we use is called Thompson's construction. Each REG has one start node and one accept node. For the base cases of epsilon and a, where a is a character of the alphabet, the REGs are shown in Figure 1. For the recursive cases, the constructions are shown in Figures 2, 3, and 4. An example REG for the regular expression ((a)*) . ((b)*) is shown in Figure 5.
5.2.1 Data Structures and Code for REGs
In the construction of REGs, every node has at most two outgoing arrows. This will allow us to use a simple representation of a REG node.
Figure 3: Regular expression graph for an expression obtained using the or operator
Figure 4: Regular expression graph for an expression obtained using the star operator
struct REG_node {
    struct REG_node *first_neighbor;
    char first_label;
    struct REG_node *second_neighbor;
    char second_label;
};
In the representation, first_neighbor is the first node pointed to by a REG node and second_neighbor is the second node pointed to by a REG node. first_label and second_label are the labels of the arrows from the node to its neighbors. If a node has only one neighbor, then second_neighbor will be NULL. If a node has no neighbors, then both first_neighbor and second_neighbor will be NULL.
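Using this representation, the recursive constructions of Figures 1 through 4 might be coded along the following lines. This is a sketch of one common variant of Thompson's construction: the struct is repeated (with default initializers added) so the snippet stands alone, '_' serves as the epsilon arrow label, and the helper names make_char, make_concat, make_union, and make_star are illustrative, not required.

```cpp
// One common variant of Thompson's construction over the REG_node
// representation; '_' is the epsilon arrow label in this sketch.
struct REG_node {
    REG_node* first_neighbor = nullptr;  char first_label  = '\0';
    REG_node* second_neighbor = nullptr; char second_label = '\0';
};
struct REG { REG_node* start; REG_node* accept; };

// Base cases: a single arrow labeled c; pass '_' for epsilon.
REG make_char(char c) {
    REG r{new REG_node, new REG_node};
    r.start->first_neighbor = r.accept;
    r.start->first_label = c;
    return r;
}

// (R1).(R2): epsilon arrow from R1's accept node to R2's start node.
REG make_concat(REG a, REG b) {
    a.accept->first_neighbor = b.start;
    a.accept->first_label = '_';
    return {a.start, b.accept};
}

// (R1)|(R2): new start with epsilon arrows into both sub-REGs;
// both sub-accepts get epsilon arrows into a new accept.
REG make_union(REG a, REG b) {
    REG r{new REG_node, new REG_node};
    r.start->first_neighbor = a.start;   r.start->first_label  = '_';
    r.start->second_neighbor = b.start;  r.start->second_label = '_';
    a.accept->first_neighbor = r.accept; a.accept->first_label = '_';
    b.accept->first_neighbor = r.accept; b.accept->first_label = '_';
    return r;
}

// (R)*: epsilon arrows to enter, to skip entirely, to loop back, and to leave.
REG make_star(REG a) {
    REG r{new REG_node, new REG_node};
    r.start->first_neighbor = a.start;    r.start->first_label  = '_';
    r.start->second_neighbor = r.accept;  r.start->second_label = '_';
    a.accept->first_neighbor = a.start;   a.accept->first_label  = '_';
    a.accept->second_neighbor = r.accept; a.accept->second_label = '_';
    return r;
}
```

Because every construction adds at most two outgoing arrows per node, the invariant assumed by the representation (at most two neighbors) is preserved at every step.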
Figure 5: Regular expression graph for an expression obtained using concatenation and star operators
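One way to implement match(r, s, p) over this graph is to simulate the REG directly: maintain the set of nodes reachable so far, expand that set through epsilon ('_') arrows, consume one input character at a time, and remember the last position at which the accept node was reachable. The following sketch repeats the struct so it stands alone; returning the match length, with -1 for "no match at all" (so a pure epsilon match yields 0), is a convention assumed here, not mandated by the project.

```cpp
#include <set>
#include <string>

struct REG_node {
    REG_node* first_neighbor = nullptr;  char first_label  = '\0';
    REG_node* second_neighbor = nullptr; char second_label = '\0';
};
struct REG { REG_node* start; REG_node* accept; };

// Insert n, plus everything reachable from n via epsilon arrows, into out.
void epsilon_closure(REG_node* n, std::set<REG_node*>& out) {
    if (!n || out.count(n)) return;
    out.insert(n);
    if (n->first_label == '_')  epsilon_closure(n->first_neighbor, out);
    if (n->second_label == '_') epsilon_closure(n->second_neighbor, out);
}

// Length of the longest lexeme starting at position p that r can generate;
// -1 if not even the empty string matches from here.
int match(const REG& r, const std::string& s, int p) {
    std::set<REG_node*> current;
    epsilon_closure(r.start, current);
    int longest = current.count(r.accept) ? 0 : -1;
    for (int i = p; i < (int)s.size(); i++) {
        std::set<REG_node*> next;               // states after consuming s[i]
        for (REG_node* n : current) {
            if (n->first_label == s[i])  epsilon_closure(n->first_neighbor, next);
            if (n->second_label == s[i]) epsilon_closure(n->second_neighbor, next);
        }
        if (next.empty()) break;                // no node can extend the match
        current = next;
        if (current.count(r.accept)) longest = i - p + 1;
    }
    return longest;
}
```

This simulation also gives the epsilon check needed for the EPSILON IS NOOOOOOOT A TOKEN !!! error essentially for free: a token's expression can generate the empty string exactly when the accept node is in the epsilon closure of the start node, i.e. when match returns 0 before consuming any character.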
2022-09-04