Programming Assignment #4 CS671
1st submission due 7/6/2018
2nd submission due 7/12/2018
The system in this assignment should search for phrases in documents in parallel. There are many ways
to implement this parallelism:
• directly in terms of threads;
• using higher-level abstractions from java.util.concurrent;
• using higher-level abstractions from scala.concurrent.
In this assignment, the system is implemented using the first approach only.
Document Search
Given a target phrase, we'll look in a document for as many consecutive words from the target as
possible. For a match to be valid, the words have to be consecutive in both the document and the
target. Non-letter characters will be ignored and treated as whitespace. For simplicity, both the source
and the target will be converted to lowercase. For example, given the following source document:
How I want a drink, alcoholic of course,
after the heavy lectures involving quantum mechanics!
and the target: “I will take the course after all”, the longest match that can be found is two words:
“course after”. (“I” and “the” are other matches, but they are shorter.)
Technically, this is an instance of the longest common substring problem, for which efficient (but non-trivial) algorithms exist. Instead of solving it as a general instance of a longest common substring
problem, this assignment adopts a different approach. Since the search phrase is assumed to be much
smaller than the document, we can filter out all the document words that do not appear in the target
phrase and group the remaining words accordingly. For instance, given the search above, the document can be formatted as two groups: “i” and “course after the”. (Non-target words like “how”, “want”, “a”, etc.
are treated as whitespace and ignored.) We can then search each group for a longest common
substring. Since the groups are small, a naive algorithm will be good enough.
Specifically, the details of the implementation are as follows:
• Finder.words: produces all the words from a source, as a stream. Non-letter characters are treated like whitespace and letters are converted to lowercase. Words are consecutive sequences of letters separated by whitespace (or non-letters). On the example document above, this produces the strings: "how", "i", "want", "a", "drink", "alcoholic", "of", "course", "after", "the", "heavy", "lectures", "involving", "quantum", and "mechanics".
• Finder.groupWords: builds groups of words that belong to the target. All other words are discarded. Given the target and document above, this produces the lists: List("i") and List("course", "after", "the").
• Finder.longestCommonPrefix: finds the longest common prefix of two lists. Note that this is much easier than finding the longest common substring, as there is no search involved. The longest common prefix of the lists a,b,c,d,e and a,b,x,c,d,e is a,b; the longest common prefix of the lists a,b,c,d,e and x,a,b,c,d,e is empty.
• longestSublist: solves the longest common sublist problem in a naive way. The longest common sublist of two lists L and L’ is calculated by enumerating all the suffixes of L and L’ (including L and L’ themselves) and by taking the two suffixes that produce the longest common prefix. For instance, if L = (a, b, c) and L’ = (b, c, d), the suffixes of L are L1 = (a, b, c), L2 = (b, c), L3 = (c) and L4 = (); the suffixes of L’ are L’1 = (b, c, d), L’2 = (c, d), L’3 = (d) and L’4 = (). The two suffixes that share the longest common prefix are L2 and L’1, with common prefix (b, c). This is the longest common sublist of L and L’. (A sketch of this naive approach is given after the note below.)
Note that the standard method List.tails can be used to iterate over all the suffixes of a list.
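For concreteness, here is a minimal sketch of how these two list helpers could be written, assuming purely list-based signatures (the object name FinderSketch and the exact signatures are illustrative only, not the assignment's actual API):

object FinderSketch {
  // Longest common prefix: walk both lists while their heads agree.
  def longestCommonPrefix[A](xs: List[A], ys: List[A]): List[A] = (xs, ys) match {
    case (x :: xt, y :: yt) if x == y => x :: longestCommonPrefix(xt, yt)
    case _                            => Nil
  }

  // Naive longest common sublist: compare every pair of suffixes
  // (List.tails includes the full list and the empty list) and keep
  // the longest common prefix found among them.
  def longestSublist[A](xs: List[A], ys: List[A]): List[A] = {
    val prefixes = for {
      xSuffix <- xs.tails
      ySuffix <- ys.tails
    } yield longestCommonPrefix(xSuffix, ySuffix)
    prefixes.maxBy(_.length)
  }
}

// FinderSketch.longestSublist(List("a", "b", "c"), List("b", "c", "d"))
// evaluates to List("b", "c"), as in the example above.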
The starting point of a search is a stream of characters read from a document. (The starting point is not the actual contents of the document, so as to enable loading multiple documents in parallel.) A search activity could load the entire stream into a list and perform the search in terms of lists. Instead, the assignment chooses to perform the search directly in terms of streams (which may, under certain circumstances, be more efficient).
Searches are organized in search units. A SearchUnit instance contains a document to search (as a
stream of characters) and a target phrase (as a list of strings). It offers a method SearchUnit.search
that uses the helper functions defined in Finder to implement the search. This is the only search
method needed by the application. Class Finder could actually be private but is left public for testing
and grading purposes.
Finally, as a collection of documents is searched (in parallel), a list of search results is produced that contains the longest match for each document. (If a document contains several matches of the same maximal length, it is not specified which one is returned.) Method SearchSummary.bestSearches finds the longest matches in this list of results and returns them as a set.
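As an illustration, bestSearches essentially amounts to keeping the results whose match length is maximal. The sketch below assumes hypothetical field names (source, words, results); the actual case classes may be shaped differently:

case class SearchResult(source: String, words: List[String])

case class SearchSummary(results: List[SearchResult]) {
  // Keep every result whose match has the maximal length.
  def bestSearches: Set[SearchResult] =
    if (results.isEmpty) Set.empty
    else {
      val best = results.map(_.words.length).max
      results.filter(_.words.length == best).toSet
    }
}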
Using Threads
To search multiple documents in parallel, a SearchUnit object is created for each document and the
goal is to perform multiple calls of the form unit.search() at the same time. This can be achieved by
creating additional threads of control and having different threads run unit.search() concurrently on
different SearchUnit instances.
The number of threads used is an important parameter of such a system: too few threads can result in
an underutilization of computing resources; too many threads can waste memory and put pressure on
the operating system scheduler, resulting in poor performance. In particular, although it would lead to a
clean and simple design, it is not a very good strategy to have the number of threads equal the number
of searching tasks (a search in 1000 documents would use too many threads).
There are mechanisms available to have M tasks processed by N threads (M >> N) in both java.util.concurrent and scala.concurrent. In this assignment, we'll implement a simpler mechanism from scratch. The code for it resides inside package parallelsearch.threads.
Because documents can vary in size, it is not a good strategy to assign M/N tasks to each thread (threads
with shorter documents might run out of work while threads with longer documents are still running).
Instead, the threads themselves need to pick up a new search after they complete their current task.
This requires a shared repository of tasks, which complicates the design (this repository needs to be safe
when accessed by multiple threads).
In this assignment, the repository is implemented as an instance of class WorkDispatcher. This class has a public method getWork() which returns an option: either some work to do, or None when the repository is empty. The method needs to be properly synchronized to ensure that every task is assigned to exactly one thread (no task is lost or processed twice).
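One way to obtain this guarantee is to protect the pending tasks with the dispatcher's own lock. The sketch below assumes the dispatcher is constructed from a sequence of SearchUnit tasks; the actual class may be organized differently:

class WorkDispatcher(tasks: Seq[SearchUnit]) {
  private var remaining: List[SearchUnit] = tasks.toList

  // Called by multiple worker threads; the synchronized block guarantees
  // that each task is handed out exactly once.
  def getWork(): Option[SearchUnit] = synchronized {
    remaining match {
      case head :: tail => remaining = tail; Some(head)
      case Nil          => None
    }
  }
}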
A master thread (an instance of class Master) creates the repository and fills it with the SearchUnit tasks (one per document). It then creates N worker threads (instances of class Searcher). Each worker is given a reference to the repository. The master thread also serves as a repository of search results (instances of SearchResult) and will create a search summary at the end of the search.
The behavior of a worker thread is as follows:
1. Get work from the repository; if None, the thread terminates.
2. Otherwise, process the search unit by running its search method.
3. Store the result from this search by calling method found on the master thread (used here as a
passive repository object).
4. Go to step 1.
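In outline, a worker could look roughly like the sketch below; the constructor parameters are assumptions, and unit.search() is assumed to produce a SearchResult that Master.found accepts:

class Searcher(dispatcher: WorkDispatcher, master: Master) extends Thread {
  override def run(): Unit = {
    var continue = true
    while (continue) {
      dispatcher.getWork() match {
        case Some(unit) => master.found(unit.search()) // steps 2 and 3
        case None       => continue = false            // repository empty: terminate
      }
    }
  }
}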
The behavior of the master thread is:
1. Create a repository of search tasks as an instance of WorkDispatcher.
2. Create N worker threads. Each worker has a reference to the dispatcher (to get work) and to the
master thread (to return results).
3. Start all the workers, which then begin to run searches concurrently.
4. Wait for all the workers to terminate.
5. Create a search summary (with a bestSearches method).
Note that methods Master.found and WorkDispatcher.getWork are called by multiple threads and
need to use proper synchronization.
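Putting the pieces together, the master might be structured roughly as follows; the constructor parameters and the summary field are assumptions made for illustration:

class Master(units: Seq[SearchUnit], numWorkers: Int) extends Thread {
  private var results: List[SearchResult] = Nil
  @volatile var summary: Option[SearchSummary] = None  // filled in at the end

  // Called concurrently by the workers, hence the synchronization.
  def found(result: SearchResult): Unit = synchronized { results ::= result }

  override def run(): Unit = {
    val dispatcher = new WorkDispatcher(units)                          // step 1
    val workers = List.fill(numWorkers)(new Searcher(dispatcher, this)) // step 2
    workers.foreach(_.start())                                          // step 3
    workers.foreach(_.join())                                           // step 4
    summary = Some(SearchSummary(results))                              // step 5
  }
}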
Dealing with Failures
Since the search activities potentially involve I/O and networking, the system needs to be able to cope
with failure. In this assignment, we keep fault tolerance to a minimum and only require that the system
terminates with the results of the successful searches when some searches fail with java.io.IOException. No results are produced for failed searches.
It is straightforward to make the thread-based system achieve this requirement by catching I/O
exceptions and by not calling Master.found for failed tasks.
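For instance, the Searcher sketch given earlier could be adjusted so that a search failing with an IOException simply produces no result while the worker keeps going (still only a sketch; note that only IOException is caught, so InterruptedException is not swallowed):

import java.io.IOException

class Searcher(dispatcher: WorkDispatcher, master: Master) extends Thread {
  override def run(): Unit = {
    var continue = true
    while (continue) {
      dispatcher.getWork() match {
        case Some(unit) =>
          try master.found(unit.search())
          catch { case _: IOException => () } // failed search: no result reported
        case None => continue = false
      }
    }
  }
}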
Notes
• More information about the functions to implement can be found at http://cs.unh.edu/~cs671/Programming/
• Methods Finder.words and Finder.groupWords generate streams from streams. They should work lazily, meaning that they should not evaluate their entire input streams to produce fully evaluated output streams that are, in effect, lists. In particular, these methods should work on infinite streams or on streams that produce several values before they fail (see “lazy” tests in the sample tests). Producing streams lazily is typically easy and boils down to using #:: or #::: to build them. A list-based approach can usually be transformed into a stream-based approach by replacing :: with #:: and ::: with #:::.
• In order to help debug parallelism and fault-tolerance, two utilities are implemented in the parallelsearch package:
o Streams can be slowed down. If s is a stream (e.g., from a URL source) and d is a duration (e.g.,
10 seconds), then s.slow(d) is a stream that contains the same values as s but takes about d
amount of time to produce them.
o Streams can be made to fail: s.failOn(target) is a stream that produces the same values as
stream s but fails with an exception when it encounters value target .
These extensions are made available by importing parallelsearch._; the exception that is thrown can be customized, e.g., stream.failOn(target, new java.io.IOException).
As an example,
import parallelsearch._
import scala.concurrent.duration._
val s = "so far, so good...".toStream.slow(10.seconds).failOn('.')
s foreach print
will start to slowly display the letters 's', 'o', ... until it reaches a period, where it fails with an exception.
• Case classes Search, SearchUnit, SearchResult and SearchSummary are used as parameter types in the thread-based application.
• The thread-based system must terminate gracefully. A worker thread that fails to terminate is likely to block the master, so the problem will be immediately visible.
• For this assignment, fault-tolerance should only focus on java.io.IOException. Code should not blindly catch Exception. In particular, it should not “swallow” InterruptedException, which is used when grading to implement timeouts.
• The package implements two command-line applications, which can be used to try the thread-based system (writing unit tests for such systems is extremely difficult):
o App1 : performs a search on lists and URLs, using a fixed number of threads. 3 For example:
> runMain parallelsearch.threads.App1 3 "of the scala programming
language" http://cs.unh.edu http://scalatest.org http://scala-lang.org
Set(SearchResult(http://scala-lang.org,List(the, scala, programming,
language)))
The other two URLs contain smaller matches: SearchResult(http://cs.unh.edu,List(of, the)) and SearchResult(http://scalatest.org,List(the, scala)).
o App2: performs a search on artificially slowed down sources, with potential I/O errors. Each source is of the form < string>(