COMP2212 Programming Language Concepts Coursework Semester 2 2023/24

发布时间：2024-05-20

Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

COMP2212 Programming Language Concepts Coursework

Semester 2 2023/24

Introduction

In this coursework you are required to design and implement a domain speciic programming language for querying simple Graph Data documents. There are many query languages for assorted data formats, the most famous perhaps being SQL for relational databases. You are welcome to research any existing query languages to inspire the design of your own language. If you want, you could stick closely to the syntax of an existing language. Of course, you are also more than welcome togo your own way and be as original and creative as you want: it is YOUR programming language, and your design decisions.

You are required to (1) invent an appropriate syntax and (2) write an interpreter (possibly using Alex and Happy for lexing and parsing). Your overall goal is :

To design and implement a programming language for querying and manipulating simple Graph Data input iles.

For the ive example problems listed below, you will be required to produce a program, in your language, that solves each of the problems. These ive programs, together with the Haskell sources for your interpreter, are the required deliverables for the irst submission.

Please keep a record of all the sources you consult that inluence your design, and include them in your programming language manual. The manual will be required for the second submission, along with programs for ive additional unseen problems, which we will make public after the deadline for theirst submission. You should anticipate that the additional problems will comprise of variations and combinations of the irst ive problems. You will ind more procedural details at the end of this document.

The speciication is deliberately loose. If we haven’t speciied something, it is a design decision for you to make: e.g. how to handle syntax errors, illegal inputs, whether to allow comments, support for syntax highlighting, compile time warnings, any type systems etc. A signiicant part (50%) of the mark will be awarded for these qualitative aspects, where we will also take into account the design of the syntax and reward particularly creative and clean solutions with additional marks. The remaining 50% of the mark will be awarded for correctly solving a number of problems using your programming language. The correctness of your solution will be checked with a number of automated tests. We will release an “automarker” script before the irst deadline which will give you an indication as whether your programs will compile and run against our test harness.

The Input and Output Format

For each problem we will declare the name of one input ile in a simple Graph Data format. The particular variant of Graph Data we will be using is a simpliied version of the input for- mat for Neo4j (see https://neo4j.com/docs/operations-manual/current/tools/neo4j-admin/neo4j- admin-import/import-tool-header-format). A key design point of the Neo4j Graph Data model is that both the nodes of the graph and the edges between nodes (relationships) may contain both typed property data and labelling information.

The input ile name will correspond to a ile in the current working directory with an extension “ .n4j”. For example, if the problem input ile is called “foo” and your interpreter is executed in di- rectory “C:/Home/Users/jr/STQL/” then the input will beat “C:/Home/Users/jr/STQL/foo.n4j” . The input iles will be in a simple Neo4j format as follows:

The overriding data format is that of Comma Separated Value (CSV). That is, each input ile will contain a number of rows with comma separated ields. Input iles contain a number nodes sets followed by collections of relationships between these nodes. Each node set begins with a node header row that describes the property ield names and types for that node set. Then follows all of the entries for that node set, one per row. Similarly, a relationship begins with relationship header row that also describes the property ield names and types for the relationship.

A node header row must be of the form (using regular expression syntax):

:ID, (<name> : <Type>)* (,:LABEL)

that is, we must declare an ID ield with a name <name>. We have zero or more named typed ield declarations representing property values of the node. Finally, we have an optional keyword :LABEL that allows us to tag nodes with a label list. We separate multiple labels with semi-colon.

Subsequent entries in the input ile following a node header must match the format of the previous header row. For example,

:ID, colour:string, age : integer , active : boolean , :LABEL

vehicle1 , "Yellow", 3, true , Car

vehicle2 , "Blue" , 7, true , Car

vehicle3 , "Red" , 11, false , Car

vehicle4 , null, 2, null, Car;Imported

The ield names, that is the <name> entries, must be alpha-numeric only. Types can be one of string, integer, or boolean only. Labels must be alphabetic characters only. The literal values range over string values that are delimited by quotes and contain only alphabetic characters, integer values that match the regexp “[+-]?[0-9]+” and boolean values true and false. The value null is also a valid value of any type.

There may be multiple such node sets in the input ile. Please note that Identiier values for the nodes must be unique across the input ile. An input ile that has two node rows beginning with the same ID value is considered an error, even if it is a duplicate row.

Following the node sets, we have a number of relationship sets. A relationship header row must be of the form:

:START_ID, (<name> : <Type>)* , :END_ID , :TYPE

that is, we state that this is a relationship between a source node with a given start ID and a target node with a given end ID. There are zero or more property ields associated with the relationship and a relationship has exactly one TYPE. Like labels, the TYPE values describe the nature of the relationship and must be alphabetic characters only. Unlike labels, we only ever have a single TYPE. Subsequent entries in the input ile following a relationship header must match the format of the previous header row. For example,

:START_ID, speed : integer , :END_ID, :TYPE

vehicle2 , 30, vehicle3, CrashedInto

vehicle2 , 20 , vehicle1, CrashedInto

vehicle4 , 15, vehicle2, CrashedInto

describes three relationships (edges in the graph). Two of the relationships have a source node vehicle2 with targets vehicle3 and vehicle1. The third relationship has source node vehicle4 and target vehicle2. We use the shorthand e.g. vehicle4 -CrashedInto->vehicle2 to describe the presence of an edge. There may be multiple such relationship sets in the input ile.

Spaces are allowed between data-items in each row and blank rows in the input ile are per- mitted and may be skipped over. You may assume in the stated problems that all input iles will be well-formed in this simple Neo4j input format.

For each stated problem, the output should also be in the same simple Neo4j format The output should always be printed to Standard Out.

Problems

For every problem below, you may assume that we will place a simple Neo4j ile in the same directory where we execute your interpreter. The ile will always be compatible with the simple Neo4j format. You may assume that we will not require you to perform any additional operations on the literal values other than those indicated by the problems given below. For each problem I will provide one example input ile and the expected output for that input. These are listed in the Appendix but will also be available as a zip ile from the module website. We will test your solutions on input iles diferent to these but all inputs used will satisfy the input format above.

Problem 1 - Simple Node Query

Given an input ile named “access.n4j”, representing personnel and guests with access to a building, output a graph that contains all of the nodes in the input graph that have label “Visitor” along with all nodes whose value for ield “age” is less than or equal to 25. The output should be in graph format as above given as a single node set only. The header row need only contain the ID ield, the single “age” property and the label. You should use null values for nodes with no “age” value.

Problem 2 - Simple Relationship Query

Given an input ile named “tasks.n4j”, representing the tasks on a construction job, output a graph that contains the whole of the input graph along with some additional relationship edges as follows. For all nodes nT that are the target of some relationship that has a ield named “priority” with value greater than or equal to 8 and for all nodes labelled “Worker” nS that are the source of some relationship that has a ield value named “available” with value true, include the extra relationship nS -PossiblyAllocated-> nT in the output graph.

Problem 3 - Parametric Queries

Given an input ile named “table.n4j” representing a graph of sports teams that contains data for the season’s results, ind a list of teams that are on the same points as a team that drew with another team that the irst team lost to. That is, ind a set of nodes n such that n、-Beat-> n for some source n、say. For each such n, ind all nodes n、、such that either n、、-DrewWith-> n、 -Beat-> n or n、-DrewWith-> n、、-Beat-> n for some n、. Return as output a graph consisting of nodes n whose ield value named “points” is non-null and equal to the ield value named “points” of n、、. Your output graph can be returned as a single node set with header row just the node ID and the property ield ”points” .

Problem 4 - Graph Filtering

Given a ile named “network.n4j” representing a graph of persons, their friendships and employers, indall personswithirst names beginning with “A”, “B”, or “C”, (as nodes) who have friends older than themselves that don’t work in a cafe. The employer nodes are labelled according to their type of business and all persons have a “irstName” and “age” ields. There are two relationship sets with types “IsFriend” and “WorksFor”. The “IsFriend” relationship is not necessarily symmetric.

You should return the subgraph of the input graph containg all person nodes that have irst names beginning with “A”, “B”, or “C” and older friends that don’t work in a cafe along with all of these friend nodes. Do not include the friends that work in a cafe. You should return the “IsFriend” relationship set between the remaining nodes in the subgraph also.

Problem 5 - Field Updates

Given a iled named “loyalty.n4j” containing a graph of customers of a number of businesses oper- ating a joint loyalty scheme, ind all persons who are customers of any business who recommended another customer of that same business. There are person and business node sets and there are two relationship sets with labels “Recommended” and “CustomerOf” . The “CustomerOf” rela- tionship have a propertyield named “reward” and the business nodes have a propertyield named “bonus”. You should output an modiied graph as follows. Output an updated graph in which for all persons p such that p -CustomerOf-> b for some b and p -Recommended-> q -CustomerOf-> b for some q, we have updated both the reward ield of p -CustomerOf-> b and p -CustomerOf-> b by incrementing them both by the value of the “bonus” ield of the business node b. You should also remove the entire relationship set for Recommended.

First submission - due Thursday April 25th 4pm

You will be required to submit a zip ile containing:

. the sources for the interpreter for your language, written in Haskell

. ive programs, written in YOUR language, that solve the ive problems speciied above. The programs should be in iles named pr1.gql, pr2.gql, pr3.gql, pr4.gql, pr5.gql.

We will compile your interpreter using the command ghc Gql.hs so you will need to include a ile in your zip ile called Gql.hs that contains a main function along with any other Haskell source iles required for compilation. Prior to submission, you are required to make sure that your interpreter compiles on a Unix machine with a standard installation of GHC (Version 8.10.7) or earlier: if your code does not compile then you may be awarded 0 marks. Before submission, we will release a simple “automarking” script that will allow you to check if your code compiles and whether each of your programs passes a basic test.

You can use Alex and Happy for lexing and parsing but make sure that you include the generated Haskell source iles obtained by running the alex and happy commands as well as the Alex and Happy source iles. Alternatively you can use any other Haskell compiler construction tools such as parser combinator libraries. You are welcome to use any other Haskell libraries, as long as this is clearly acknowledged and the external code is bundled with your submission, so that it can compile on a vanilla Haskell installation.

Interpreter spec. Your interpreter is to take a ile name (the program in your language) as a single command line argument. The interpreter should produce output on standard output (stdout) and error messages on standard error (stderr). For each problem, we will test whether your code performs correctly by using a number of tests. We only care about correctness and performance will not be assessed (within reason - our marking scripts will timeout after a generous period of time). You can assume that for the tests we will use correctly formatted input. For example, when assessing your solution for Problem 1 we will run

./Gql pr1.gql

in a directory where we also provide our own versions of graph.n4j. We will then compare the contents of stdout against our expected outputs. Whitespace and formatting is unimportant as long as the output is adheres to the prescribed simple Neo4j output format and contains no other text. We will parse your outputs and compare them against expected outputs as Graphs. For that reason, order of entries in the output ile does not matter.

Second submission - due Thursday May 2nd 4pm

Shortly after the irst deadline we will release a further ive problems. Although they will be diferent from Problems 1-5, you can assume that they will be closely related, and follow the same input/output conventions. You will be required to submit two separate iles.

First, you will need to submit a zip ile containing programs (pr6.gql, pr7.gql, pr8.gql, pr9.gql, pr10.gql) written in your language that solve the additional problems. We will run our tests on your solutions and award marks for solving the additional problems correctly.

Second, you will be required to submit a 5 page report on your language in pdf format that explains the main language features, its syntax, including any scoping and lexical rules as well as additional features such as syntax sugar for programmer convenience, type checking, informative error messages, etc. In addition, the report should explain the execution model for the interpreter, e.g. what the states of the runtime are comprised of, your key data structures used, and how they are transformed during execution. Languages that support strong static typing and type safety with a formal speciication are preferred. This report, together with the ive programs will be evaluated qualitatively and your marks will be awarded for the elegance and lexibility of your solution and the clarity of the report.

Please note: there is only a short period between the irst and second submission. I strongly advise preparing the report well in advance throughout the development of the coursework.

As you know, the coursework is to be done in groups of three. Only one submission per group is required. I don’t need to know who is in which group at the irst submission but as as part of the second submission we will require a declaration of who is in your group and how marks are to be distributed amongst the members of your group. You will receive all feedback and your marks by Friday May 30th. Please ensure that it is the same group member that submits for both the irst and second submissions.

Marks. This coursework counts for 40% of the total assessment for COMP2212. There are a total of 40 marks available. These are distributed between the two submissions as follows:

You will receive the test results of Submission One prior to the second deadline but no marks will be awarded until after Submission Two.

After Submission Two we will award up to 20 marks for the qualitative aspects of your solution, as described in your programming language report. We will also award up to 20 marks for your solutions to the ten problems. For each problem there will be 2 marks available for functional correctness only.

You have the option of resubmitting the interpreter after receiving your testing results from Submission One. This will however incur a 50% penalty on the marks for functional correctness of all ten problems. Therefore, if you decide to resubmit your interpreter in the second submission the maximum possible total coursework mark is capped at 30 marks (20 for the report plus 10 for functional correctness).

Any late submission to either component will be treated as a late submission overall and will be subject to the standard university penalty of 10% per working day late.