Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

comp10002 Foundations of Algorithms

Semester 2, 2022

Assignment 2

Learning Outcomes

In this project, you will demonstrate your understanding of dynamic memory and linked data structures (Chapter 10) and extend your program design, testing, and debugging skills.  You will also learn about the problem of process discovery and implement a simple algorithm for discovering a process model from event data.

Process Discovery

Process mining is a relatively young research discipline that combines studies of inferences from data in data mining and machine learning with process modeling and analysis to discover, monitor, and improve real-world processes. The studied processes get formally captured in process models.

Figure 1 shows an example process model captured in Business Process Model and Notation (BPMN) language.  In the figure, actions are drawn as rectangles with rounded cor- ners. Diamonds represent gateways. An exclusive gateway has the x” marker, while a parallel gateway has the “+” marker inside the diamond shape.  Directed arcs encode control ow dependencies. The circles with thin and thick borderlines de- note the start and end of the process, respectively.

A process model describes a collection of possible exe- cutions, or traces.  A trace is a sequence of events that rep- resent actions encountered when following control ow arcs

Figure 1: A process model.

and gateway routing decisions from the start to the end element of the model.  If execution reaches an ex- clusive gateway, it continues along exactly one of its outgoing arcs.   If execution reaches a parallel gate- way with multiple outgoing arcs, all the outgoing arcs are followed simultaneously.  A parallel gateway with multiple incoming arcs waits” for all the incoming arcs to complete and then passes” the execution to its outgoing arcs.   The fact that several paths in a model are executed simultaneously means that the ac- tions encountered on these paths are independent and can occur in any order.  According to these rules, the model in Figure 1 describes exactly three traces:  (a, b, c, d) , (a, b, e, f), and (a, b, f, e) .  For simplicity, the model in Figure 1 uses abstract action labels.  In practice, actions usually “wear” meaningful labels to de- scribe, for example, a business process or a distributed algorithm.  For instance, the process model from Fig- ure 1 with labels a =“receive loan application ”, b =“assess loan application ”, c =“reject loan application ”, d =“close loan application ”, e =“approve loan application ”, and f =“offer additional products ” describes some bank’s loan application handling process.

In the real-world, different loan applications can be processed by following the same sequence of actions; thus, a trace can be observed multiple times.  The problem of process discovery is a core problem in process mining that, given an event log, that is, a collection of observed traces, aims to automatically construct a process model that describes the traces.  A process model constructed in this way aims to aggregate and generalize information on how processes were executed in the real-world. Process analysts can then use this constructed process model to understand real-world processes, identify interesting and problematic patterns of actions, and use the model as a starting point for redesigning and improving the processes.

In this assignment, you will implement a program that reads an event log from input, analyzes patterns of actions in the input traces, and discovers and reports model fragments that could have produced the traces.

Input Data

Your program should read input from stdin and write output to stdout. The input will list traces, one per input line, where each trace is provided as a comma-separated list of events that terminates either with /n”, that is, one newline character, or EOF, that is, the end of file constant defined in the header file. An event is encoded as a single alphabetic character that when provided as input to the isalpha() function defined in the  header file returns a non-zero integer.

The following le test0 .txt uses fifteen lines to specify fifteen traces, each composed of four events.

1

a,b,c,d

4

a,b,f,e

7

a,b,c,d

10

a,b,e,f

13

a,b,f,e

2

a,b,f,e

5

a,b,c,d

8

a,b,c,d

11

a,b,e,f

14

a,b,c,d

3

a,b,f,e

6

a,b,c,d

9

a,b,c,d

12

a,b,e,f

15

a,b,e,f

In general, input data can contain an arbitrary number of traces, with each trace composed of at least one event.

Stage 0 Reading, Analyzing, and Printing Input Data (10/20 marks)

The first version of your program should read an event log from input, analyze it, and print basic statistics about the event log to the output. The first thirteen lines from the listing below correspond to the output your program should generate for the test0 .txt input le in Stage 0.

1 ==STAGE  0============================ 26 d  =  7 51 258  =  CON(e,f)

2 Number  of  distinct  events:  6 27 e  =  8 52 Number  of  events  removed:  8

3 Number  of  distinct  traces:  3 28 f  =  8 53 256  =  15

4 Total  number  of  events:  60 29 256  =  15 54 257  =  7

5 Total  number  of  traces:  15 30 ===================================== 55 258  =  8

6 Most  frequent  trace  frequency:  7 31 c        d        e        f    256 56 =====================================

7 abcd 32 c        0        7        0        0        0 57 256    257    258

8 a  =  15 33 d        0        0        0        0        0 58 256        0        7        8

9 b  =  15 34 e        0        0        0        4        0 59 257        0        0        0

10 c  =  7 35 f        0        0        4        0        0 60 258        0        0        0

12 e  =  8 37 62 259  =  CHC(257,258)

13 f  =  8 38 257  =  SEQ(c,d) 63 Number  of  events  removed:  0 14 ==STAGE  1============================ 39 Number  of  events  removed:  7 64 256  =  15

15 a        b        c        d        e        f 40 e  =  8 65 259  =  15

16 a        0      15        0        0        0        0 41 f  =  8 66 =====================================

17 b        0        0        7        0        4        4 42 256  =  15 67 256    259

18 c        0        0        0        7        0        0 43 257  =  7 68 256        0      15

19 d        0        0        0        0        0        0 44 ==STAGE  2============================ 69 259        0        0

20 e        0        0        0        0        0        4 45 e        f    256    257 70 -------------------------------------

21 f        0        0        0        0        4        0 46 e        0        4        0        0 71 260  =  SEQ(256,259)


23 256  =  SEQ(a,b) 48 256        4        4        0        7 73 260  =  15

24 Number  of  events  removed:  15 49 257        0        0        0        0 74 ==THE  END============================

25 c  =  7 50 ------------------------------------- 75

Lines 2 and 3 of the output report the number of distinct events and traces in the event log, respectively. The event log contains six distinct events, a through f, and three distinct traces, (a, b, c, d) , (a, b, e, f), and (a, b, c, d) . Lines 4 and 5 of the output report the total number of events and traces in the event log.  As there are fteen traces in the log, each composed of four events, in total, there are sixty events in the event log. Line 6 provides information about the number of times the most frequent trace appears in the event log. Trace (a, b, c, d) is the most frequent trace; it appears seven times in the event log. The subsequent lines of the output should list all most frequent traces of the event log (one occurrence of each trace). If there are multiple traces with the same maximum frequency, they should all be listed, one per line, in lexicographical order, where uppercase characters precede all lowercase characters. Each trace should be printed as a string composed of characters that encode trace events in the order the events appear in that trace. As there is only one most frequent trace in the example input event log, it is printed on line 7; see the abcd string. In the following lines, see lines 8 to 13, your program should report all the events that appear in the event log, one per line, in ascending order of the corresponding ASCII codes, with information on how many times they appear in the event log.

You should not make assumptions about the maximum number of traces and events supplied. Use dynamic memory and data structures of your choice, for example, arrays or linked lists, to store the input event log.

The test input will always follow the proposed format.

Stage 1 Identifying Sequences (16/20 marks)

Extend your program from Stage 0 to identify the most likely sequential (SEQ) patterns of actions based on the directlyfollows relationships between events in the event log. Two events are in the directly follows relationship if there is a trace in the log in which these events appear consecutively. First, your program should compute and print out all the directly follows relationships.  Lines 15 to 21 of the output listing report the directly follows relationships between the events together with their support, that is, the number of times the events follow each other in the traces of the event log, given as a two dimensional matrix. For example, event b directly follows event a once in each trace of the event log. Thus, its support is 15; see 15 in the third column of line 16 of the output; that is, sup(a, b) = 15. Each matrix column uses five characters, and each cell entry is right-justified.


Now, use the directly follows matrix to identify pairs of actions that most likely were arranged in sequences in the model that generated the event log. Given two events x and y, x y, if sup(x, y) > sup(y, x), then there is a chance that the model that generated the event log has actions x and y arranged in a sequence. To measure this chance, compute two values pd (x, y) =  (100 x abs(sup(x, y) - sup(y, x)))/max(sup(x, y), sup(y, x)) and w (x, y) = abs(50 - pd (x, y)) x max(sup(x, y), sup(y, x)); use integer division to obtain the truncated result (for example, 15/4 should result in 3). Here, pd (x, y) is percent difference of sup(x, y) and sup(y, x) that reflects confidence that x is directly followed by y, while w (x, y) is weight that further scales this confidence by the number of times y directly follows x; here, we assume that indeed sup(x, y) > sup(y, x).

Each pair of events (u, v), u v, for which sup(u, v) > sup(v, u) and pd (u, v) > 70 is a candidate for defining a sequential pattern of actions. To select one most likely sequential pattern from the candidates, pick one with the highest weight and encountered rst when traversing the directly follows matrix in row-major order.

Lines 23 to 29 of the output report the identified sequential pattern of actions and summarize information on the abstraction of the pattern in the event log.  The most likely sequential pattern identified according to the proposed rules in the input event log is defined by the pair of events (a, b); note that pd (a, b) = 100 and w (a, b) = 750. Line 23 reports the identified pattern, assigning it the code of 256; note that the code of each subsequently identified pattern of actions should be incremented by one; see lines 38, 51, 62, and 71 of the output.

Next, the identified pattern is abstracted in the event log to prepare it for discovering subsequent patterns. To abstract a pattern identified by a pair of events, rewrite each occurrence of one of these events in the traces of the event log with the abstract event identified by the code of the pattern.  Then, in every trace, replace each subsequence composed of only new abstract events with one such abstract event without changing the order of the other events. For example, the abstraction of pattern 256 in the input event log leads to the below event log; keep the resulting event log in memory and do not write it to a file.

1

256,c,d

4

256,f,e

7

256,c,d

10

256,e,f

13

256,f,e

2

256,f,e

5

256,c,d

8

256,c,d

11

256,e,f

14

256,c,d

3

256,f,e

6

256,c,d

9

256,c,d

12

256,e,f

15

256,e,f

Line 24 of the output reports the number of events removed from the event log as the result of abstracting pattern 256. One event was removed from every trace to result in 15 removed events. Finally, lines 25 to 29 report all distinct events, both original and abstract, and their frequencies in the resulting event log, in the ascending order of their codes, where codes of the original events are ASCII codes of the corresponding characters.  Lines 31

to 43 use the same template to report information on the next identified sequential pattern of actions.

The output of two consecutive patterns is separated by a sequence of “=” characters (see, for example, line 30), and the directly follows matrix is separated from the subsequent information on the identified pattern with a sequence of “-” characters (see, for instance, lines 22 and 37).  The identification of patterns should continue from the resulting log until no pattern can be discovered based on the proposed rules. NB: In Stage 1, only patterns based on two events from the original input log, that is, non-abstract events, should be discovered.