关键词 > COMPX301

COMPX301-24A Assignment 4: Pattern Search

发布时间:2024-06-05

Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

COMPX301-24A Assignment 4: Pattern Search

Due: Friday, 7 June 2024, 11:59pm

This assignment is intended to give you experience building a regular expression (regexp) FSM compiler and corresponding pattern matcher. Your implementation is expected to be written in Java, and compile and run on a Linux machine such as those in the R Block labs.

Students must work in pairs to complete this assignment (otherwise we can't complete marking in time). Students can form their own partnerships, or will be assigned a partner. To that end, voluntary partnerships are to send an email to Dr

Tony Smith before 20 May 2024 with the subject "COMPX301 A4 partners" that includes in the body both the names and ID#s of the partnership, and make both partners recipients (i.e. Cc-ed or author) of that email. Students are encouraged to try a new partnership for this assignment, but certainly may continue with a current partnership if both parties are agreed. Those who do not have a partner by Sunday, or wish to be assigned one anyway, will be put in a partnership by me and posted on the assignment page via a link (similar to assignment three) by Monday morning. Work may be split however the partnership likes, but I will set a default split between the compiler and the searcher (roughly equal difficulty) in case an agreement cannot be made by the partners. A better situation is just to cooperate amicably.

Overview: Implement a regexp pattern searcher using the FSM, deque and compiler techniques outlined in lectures. Your solution must consist of two programs: one called REmake.java and the other called REfind.java.

The first of these accepts a regexp pattern as a command-line argument (typically enclosed within quotes—see "Note"    below), and produce as standard output a description of the corresponding FSM, such that each line of output includes four things (i.e. four fields, comma separated):

1. the state-number,

2. the state type (i.e. either a literal symbol to be matched in double-quotes, or "BR" as a branch-state indicator),

3. an integer indicating one possible next state, and

4. another integer indicating a possible next state.

State Zero is the start state of the entire machine and should branch to whichever state your compiler builds that is the actual start state (as used in lecture).

The second program must accept, as standard input , the output of the first program, then it must execute a search for matching patterns within the text of a file whose name is given as a command-line argument. Each line of the text file

that contains a substring that matches the regexp is output just once, regardless of the number of times the pattern might be satisfied in that line. (Note also we are just interested in searching plain text files, like The Brown Corpus.)

Regexp specification: For this assignment, a wellformed regexp is specified as follows (n.b. a little different than in lecture):

1. any symbol that does not have a special meaning (as given below) is a literal that matches itself

2. . is a wildcard symbol that matches any literal

3. adjacent regexps are concatenated to form a single regexp

4. * indicates closure (zero or more occurrences) on the preceding regexp

5. ? indicates that the preceding regexp can occur zero or one time

6. | is an infix alternation operator such that if r and e are regexps, then r|e is a regexp that matches one of either r or e

7. ( and ) may enclose a regexp to raise its precedence in the usual manner; such that if e is a regexp, then (e) is a regexp and is equivalent to e. e cannot be empty.

8. \ is an escape character that matches nothing but indicates the symbol immediately following the backslash loses any special meaning and is to be interpretted as a literal symbol

9. operator precedence is as follows (from high to low):

o   escaped characters (i.e. symbols preceded by \)

parentheses (i.e. the most deeply nested regexps have the highest precedence)

o   repetition/option operators (i.e. * and ?)

o   concatenation

o   alternation (i.e. | )

10. not required for this assignment, but a challenge to those who are interested, is how you might incorporate ! as a "do not match" operator, such that !e matches only a pattern that does not match the expression e.

You must implement your own parser/compiler, and your own FSM (simulating two-state machines and three-state branching machines) similar to how it was shown to you in lectures, and you must implement your own dequeue to support the search. You do not have to use the exact cluster of arrays as we did in lecture for representing your FSM, but it is probably not a bad choice. And char primitives can sometimes be a nuisance in Java so it is okay, if you wish, to just use a string with one character in it, if you like.

Note that you should make sure you have a good grammar before you start programming, so take time to write out the phrase structure rules that convince you that your program will accept all and only the regular expressions you deem legal. Anything not explicitly covered by this specification may be handled any way you want. For example, you can   decide if a** is a legal expression or not. And it is okay to preprocess the expression prior to compiling it, if such preprocessing simplifies what you are trying to do. For example, you could decide to replace any ** with just * if you want ** to be legal. I also suggest you build and test a parser first before turning it into a compiler.

Observe that REfind can be developed prior to, or in parallel with, REmake simply by working out the states of a valid FSM by hand and testing REfind with that.

Note: Command shells typically parse command-line arguments as regular expressions, and some of the special

characters defined for this assignment are also special characters for the command-line interpreter of various operating system shells. This can make it hard to pass your regexp into the argument vector of your program. You can get around most problems by simply enclosing your regexp command-line argument within double-quote characters, which is what you should do for this assignment. To get a double-quote character into your regexp, you have to escape it by putting a backslash in front of it, and then the backslash is removed by the time the string gets into your program's command-line argument vector. There is only one other situation where Linux shells remove a backslash character from a quoted string,  and that is when it precedes another backslash. For this assignment, it is the string that gets into your program that is the regexp—which may entail some extra backslashes in the argument. (N.b. Windows command prompt shell has a different syntax for parsing regexps than does Linux, so if you develop on a windows box, make sure you make the necessary adjustments for it to run under a linux bash shell on the lab machines.)

Submission: Place only copies of your well-documented, well-formatted source code and any README text file in an otherwise empty folder/directory whose name is your student ID number and the student ID number of your partner separated with an underscore, then compress it and submit through Moodle as per usual. I recommend including your grammar as a set of phrase structure rules in the README file or in the header of your compiler to assist in marking. See your tutor for details.