Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

CS 2210 Programming Project (Part I)

Lexical Analyzer

In this phase of the project, you will write a lexical analyzer for the CS 2210 programming language, MINI-JAVA. The analyzer will consist of a scanner, written in LEX, and routines to manage a lexical table, written in C. The rest of the compiler project will communicate with these program modules.

Due date

Please read the source code and understand how it works.

Token specification

Figure 1 defines the tokens that must be recognized, with their associated symbolic names. All multi-symbol tokens are separated by blanks, tabs, newlines, comments or delimiters.

Comments are enclosed in /* ... */ and cannot be nested. An identifier is a sequence of (upper or lower case) letters or digits, beginning with a letter. Upper and lower case are not distinguished (i.e. the identifier ABC is the same as Abc. There is no limit on the length of identifiers. However, you may impose limits on the total number of distinct identifiers and string lexemes and on the total number of characters in all distinct identifiers and strings taken together. If defined, these

limits should be defined as follows.
#def ine LIMIT1 500

There should be no other limitation on the number of lexemes that the lexical analyzer will process.

An integer constant is an unsigned sequence of digits representing a base 10 number. A string constant is a sequence of characters surrounded also by single quotes, e.g. ’Hello, world’. Hard-to-type or invisible characters can be represented in character and string constants by escape sequences; these sequences look like two characters, but represent only one. The escape sequences support by the MINI-JAVA language are \n for newline, \t for tab, \ 0 for the single quote and \\ for the backslash. Any other character following a backslash is not treated as escape sequence.

Token attributes

A unique identification of each token (integer aliased with the symbolic token name) must be returned by the lexical analyzer. In addition, the lexical analyzer must pass extra information about some token to the parser. This extra information is passed to the parser as a single value, namely an integer, through a global variable as described below. For integer constants, the numeric value of the constant is passed. In order to allow other passes of the compiler to access the original identifier lexeme the lexical analyzer passes an integer uniquely identifying an identifier (other than reserved words). String constants are treated in the same way, with a unique identifying number being passed. The unique identifying number for both identifiers and string constants should be an index (pointer) into a string table created by the lexical analyzer to record the lexemes. Same identifiers should return the same index.

Implementation

The central routine of the scanner is yylex, an integer function that returns a token number, indicating the type (identifier, integer constant, semicolon, etc.), of the next token in the input stream. In addition to the token type, yylex must set the global variables yyline and yycolumn to the line and column number at which that token appears. In the case of integer and string constants, store the value into the global integer variable yylval. Lex will write yylex for you, using the patterns and rules defined in your lex input file (which should be called lexer.l. Your rules must include the code to maintain yyline, yycolumn and yylval.

Figure 1: Defined Tokens in Mini Java.

In the case of identifiers and string constant, yylval contains a pointer pointing to a string table that contains the real string. The same index should be returned for the same identifier that appear at different places. Similarly the same index are returned for the same string. However, abc and “abc” should return different indice in the string table.

Reserved words may be handled as regular expressions or stored as part of the id table. For example, reserve words may be pre-stored in the string table so your program can determine a reserve word from an identifier by the section of the table in which the lexeme is found. Efficiency should be a factor in the management of the lexical and string table.

You are to write a routine ReportError that takes a message and line and column numbers and reports an error, printing the message and indicating the position of the error. You need only print the line and column number to indicate the position.

The #define mechanism should be used to allow the lexical analyzer to return token numbers symbolically. In order to avoid using token names that are reserved or significant in C or in the parser, the token names have been specified for you in Figure 1.

The parser and the lexical analyzer must agree on the token number to ensure correct communication between them. The token numbers can be chosen by you, as the compiler writer, or, by default, by Yacc (a parser generator to be used in

the next assignment). The default token number for a literal character (i.e. a one-character token) is the numerical value of the character in the local character set. Other names are assigned token numbers starting at 257 in the order they are declared in your Yacc specification. Regardless of how token numbers are chosen, the end-marker must has token number 0 or negative, and thus your lexical analyzer must return a 0 ( or a negative) as a token number upon reaching the end of input.

Temporary driver

In order to test your lexcial analyzer without a parser, you will have to write a simple driver program which calls your lexical analyzer and print each token with its value as the input is scanned. For ease in combining the lexical analyzer and parser in the second assignment, the lexical analyzer function should be put in a file by itself. The following shows the structure of a driver, if you use it, please remember to break from the endless loop after recognize the end of file.
main()
{... while (1) {
switch (yylex()) {
case ICONTnum: ...
}
}
...}

Error handling

Your lexical analyzer should recover from all malformed lexemes, as well as such things as string constants that extend across a line boudary or comments that are never terminated. Specifically, an identifier which starts with a digit is considered to be an error and should be reported.

An example program with output

The program /* Example 1: A hello world program */

program xyz;

class Test {
method void main() {
System.println(’Hello World !!!’);
}
}
The output of Lexical Analyzer
Line Column Token Index in String table
2     13         PROGRAMnum
2     17         IDnum 0
2    18         SEMInum
3    11         CLASSnum 4
3    16         IDnum
3    18         LBRACEnum
4    18         METHODnum
4    23         VOIDnum
4    28         IDnum 9
4    29         LPARENnum
4    30         RPARENnum
4    32         LBRACEnum
5    22         IDnum 14
5    23        DOTnum
5    30        IDnum 21
5    31        LPARENnum
5    48        SCONSTnum 29
5    49        RPARENnum
5    50        SEMInum
6    13        RBRACEnum
7    7          RBRACEnum EOFnum
String Table : xyz test main system println Hello World !!!
End of file

Assignment submission

No submission.