Programming Assignment #6 CS671
1 st submission due 7/27/2017
make-up submission due 8/2/2017
This assignment focuses on some more advanced features of Scala, including
• stackable traits and abstract override methods;
• iterators and the difficulties involved in working with them;
• streams and the danger of memory leaks;
• implicit conversions.
Framework description
Document sources
The package you are to develop implements a notion of a document source. A document source is
similar to an iterator but produces its contents either as characters or as words, whichever is most
convenient to the underlying application. As with iterators, words and characters are produced “on the
fly” and are not stored. Correct implementations of document sources should be able to handle large
documents without loading them entirely in memory.
A document source offers two methods:
• chars : produces the contents of the document as an iterator of characters;
• words : produces the contents of the document as an iterator of words.
Application code is expected to use one or the other of these methods, but not both. Since they draw
their contents from the same source, using one after the other has already produced elements would be
unreliable. A typical use of a document source has the form:
for (char <- docSource.chars) {...}
or
for (word <- docSource.words) {...}
In addition to these two methods, a document source can be enriched by adding the so-called
“monadic” methods: foreach , filter , map and flatMap . 1 These work in terms of characters and can
be used to process or transform the document. Like standard iterators, document sources usually
become invalid once one of their methods has been called. Unless stated otherwise, only one of the six
methods chars , words , foreach , filter , map or flatMap can be called on a document source, and
it can only be called once.
The monadic methods are parts of a “rich” document source type. Regular document sources can be
converted into rich sources via implicit conversions. This allows an application to write:
for (c <- docSource) ...
instead of the equivalent:
for (c <- docSource.chars) ...
1 For the sake of simplicity, they have slightly-reduced signatures compared to their counterparts in trait
FilterMonadic.
Document transformations
Document sources can be transformed at the level of characters or at the level of words. These
transformations are handled through stackable traits. Traits that extend the Transformer trait can be
mixed in document sources to introduce processing on both words and characters (filtering,
transformation, etc.).
One difficulty that arises is that documents can be transformed in terms of words and/or characters and
then consumed as words or as characters. For instance, a document can have its characters converted to
uppercase and then fed as words to a document reader that will arrange words in lines. Or a document
can have its accented letters converted, some of its words censored, and then be sent through a socket
as packets of characters. Without knowing how exactly a document is to be transformed, an
implementation runs the risk of introducing cycles: derive the characters from the words (to send
character packets), which are derived from the characters (to convert accented letters), which are
derived from the words (to remove censored words), etc.
This assignment relies on a base trait CharSource , in which a document’s characters are defined in
terms of its words, which are defined in terms of its source (of characters). The characters of the
document always derive from the words and the words always derive from the source. Character-level
transformations are applied to the source—and are therefore reflected in both the words and the
characters of the document—while word-level transformations are applied directly to the words and are
reflected in the characters (but not in the source).
Example
Consider a document that starts as follows:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce vel.
and a class IteratorDocumentSource that produces a document source from it. Method words can
be used to obtain the words from the text:
val doc = new IteratorDocumentSource(lorem)
doc.words // "Lorem ", "ipsum ", "dolor ", "sit ", "amet, ", "consectetur ",
"adipiscing ", "elit. ", "Fusce ", "vel."
Trait PunctuationRemover operates at the character level and removes punctuation. If it is mixed in
class IteratorDocumentSource , method words produces the words without punctuation:
val doc = new IteratorDocumentSource(lorem) with PunctuationRemover
doc.words // "Lorem ", "ipsum ", "dolor ", "sit ", "amet ", "consectetur ",
"adipiscing ", "elit ", "Fusce ", "vel"
Note how the effects of a character-level filter is visible as the level of words.
Trait Capitalization works at the level of words and capitalizes them. If mixed in the class, method
words produces instead:
val doc = new IteratorDocumentSource(lorem) with Capitalization
doc.words // "Lorem ", "Ipsum ", "Dolor ", "Sit ", "Amet, ", "Consectetur ",
"Adipiscing ", "Elit. ", "Fusce ", "Vel."
Both traits can be mixed in together: 2
val doc = new IteratorDocumentSource(lorem) with PunctuationRemover with
Capitalization
2 In this particular case, mixing in the traits in reverse order would produce the same result, but in general, the order in which traits
are mixed in is significant.
doc.words // "Lorem ", "Ipsum ", "Dolor ", "Sit ", "Amet ", "Consectetur ",
"Adipiscing ", "Elit ", "Fusce ", "Vel"
Finally, the resulting document can be processed at the character level using one of the monadic
functions:
val doc = new IteratorDocumentSource(lorem) with PunctuationRemover with
Capitalization map {
case 'u' => 'v'
case c => c
}
doc.words // "Lorem ", "Ipsvm ", "Dolor ", "Sit ", "Amet ", "Consectetvr ",
"Adipiscing ", "Elit ", "Fvsce ", "Vel"
Note how the map method produces a document in which the words reflect all the transformations
applied to the characters and words of the original document, as well as the character-level
transformation due to the application of map.
Trait CharSource
This trait represents the base type that can be modified my mixing in transformers. Its source of
characters is abstract. From it are derived words and from words are derived characters.
There are three types of words in a document:
• regular words: These words start with a non-whitespace character, continues with more non-
whitespace characters and end with zero or more whitespace characters (but no newlines). For
example, "ipsum " and "vel." are regular words.
• newlines: A sequence of one or more consecutive newlines is considered to form a word.
• whitespace: A sequence of one or more whitespace characters can also form a word, but only at
the beginning of a line.
As an illustration, the following document:
Lorem␣ipsum␣dolor␣sit␣amet,␣consectetur␣adipiscing␣elit.␣Vivamus␣porta␣
scelerisque␣dolor.␣Vestibulum␣facilisis,␣lorem␣vel␣porttitor␣porttitor,␣
tortor␣nisi␣egestas␣lorem,␣vel␣tempus␣libero␣nisi␣ac␣neque.
␣␣Quisque␣id␣nibh␣vulputate,␣suscipit␣elit␣ullamcorper,␣ullamcorper␣
lacus.␣␣␣
contains the words:
"Lorem␣", "ipsum␣", "dolor␣", "sit␣", "amet,␣", "consectetur␣", "adipiscing
␣", "elit.␣", "Vivamus␣", "porta␣", "scelerisque␣", "dolor.", "\n",
"Vestibulum␣", "facilisis,␣", "lorem␣", "vel␣", "porttitor␣", "porttitor,␣
", "tortor␣", "nisi␣", "egestas␣", "lorem,␣", "vel␣", "tempus", "\n",
"libero␣", "nisi␣", "ac␣", "neque.", "\n\n", "␣␣", "Quisque␣", "id␣", "nibh
␣", "vulputate,␣", "suscipit␣", "elit␣", "ullamcorper,␣", "ullamcorper␣",
"lacus.␣␣␣".
Method words produces the document’s words from the source; method chars produces the
document’s characters from the words. Note that nothing is removed or modified at this stage: The
concatenation of all the characters from any iterator— source , words or chars —produces the same
string.
Note also that both methods should work lazily and without memory leaks. The contents of the
document should never be loaded entirely in memory.
15 pts.
Transformers
Transformers are implemented as extensions to the trait Transformer :
trait Transformer { self: CharSource =>
def source: Iterator[Char]
def words: Iterator[String]
}
Transformers have a self-type of CharSource , which restricts the type of classes they can be mixed in.
Transformers can override the behavior of source to implement character-level transformations, or the
behavior of words to implement word-level transformations, or both. They have to make sure they rely
on the behavior of the overridden method ( super.source or super.words ) to achieve stackability.
Some transformers are trivial to implement (e.g., Capitalization ) while others can be more difficult
(e.g., LineBreaking ). The transformers to implement are described below.
Trait Capitalization
This is a word-level transformer that capitalizes all the regular words (capitalization has no effect on
whitespace). For instance, " ipsum ." becomes " Ipsum ." and " Lorem " remains " Lorem ".
Trait PunctuationRemover
This is a character-level transformer that removes punctuation marks from the text. Punctuation marks
are any one of these characters:
!"#$%&’()*+,-./:;<=>?@[\]^_‘{|}~
Trait JustWords
This transformation removes the non-regular words (whitespace and newlines) and trims the regular
words. For instance, JustWords applied to the document above produces the words:
"Lorem", "ipsum", "dolor", "sit", "amet,", "consectetur", "adipiscing",
"elit.", "Vivamus", "porta", "scelerisque", "dolor.", "Vestibulum",
"facilisis,", "lorem", "vel", "porttitor", "porttitor,", "tortor",
"nisi", "egestas", "lorem,", "vel", "tempus", "libero", "nisi", "ac",
"neque.", "Quisque", "id", "nibh", "vulputate,", "suscipit", "elit",
"ullamcorper,", "ullamcorper", "lacus.".
Trait Censoring
This trait transforms the words of a document by censoring some of them. It relies on a predicate
isCensored , which is left abstract. Censoring works by replacing the inside of a word by a series of X
marks if the string passes the isCensored test. Note that words, in the context of this library, can
contain non-letter characters like punctuation and whitespace. The inside of a word is defined to be the
longest substring of the word that begins and ends with an alphabetic character. Alphabetic characters
are those that satisfy Java’s Character.isAlphabetic method. 3
For example, the inside of " Lorem " is " Lorem " and the inside of " ipso-facto! " is " ipso-facto ".
If any word that contains at least one "o" is censored, the example text would become:
XXXXX ipsum XXXXX sit amet, XXXXXXXXXXX adipiscing elit. Vivamus XXXXX
scelerisque XXXXX. Vestibulum facilisis, XXXXX vel XXXXXXXXX XXXXXXXXX, XXXXXX
nisi egestas XXXXX, vel tempus XXXXXX nisi ac neque.
Quisque id nibh vulputate, suscipit elit XXXXXXXXXXX, XXXXXXXXXXX lacus.
3 See also \p{Alpha} if you prefer to use regular expressions.
8 pts.
8 pts.
8 pts.
8 pts.
Trait LineBreaking
This trait is used to reformat a document according to a desired line width. Words are trimmed to
remove leading and trailing whitespace. Single newlines are ignored, but words that consist of multiple
newlines are kept unchanged, as they represent paragraph breaks. Words are then produced by the
iterator according to the following rules:
• A word is separated from the next word on a line by a single space character. The space is part
of the first word (i.e., words never start with a space).
• Lines have no leading or trailing spaces.
• Lines are as long as they can be without exceeding the maximum line width. 4 In order to achieve
this property, additional, single-newline words are emitted by the iterator. Lines are not
justified.
• No other newlines are added. In particular, no newlines are added around a multiple-newline
word, since it is already starting a new paragraph (hence a new line).
As an illustration, the text above, with a maximum line width of 30, becomes:
Lorem ipsum dolor sit amet,
consectetur adipiscing elit.
Vivamus porta scelerisque
dolor. Vestibulum facilisis,
lorem vel porttitor porttitor,
tortor nisi egestas lorem, vel
tempus libero nisi ac neque.
Quisque id nibh vulputate,
suscipit elit ullamcorper,
ullamcorper lacus.
Note that all the lines are trimmed and, in particular, the trailing spaces after " lacus. " have been
removed.
Class IteratorDocumentSource
This class provides us with a default implementation of the type RichDocumentSource . It can be used
to mix in traits in various combinations and observe the results. Note that this class implements the full
(“rich”) document source interface, including the monadic operators.
Implicit function source2RichSource
This implicit function makes a RichDocumentSource out of a regular DocumentSource , thus adding
to it the monadic functions. It does this by creating an instance of IteratorDocumentSource from
the characters of the given document source.
One more page to go… 
4 The only exception to this rule is the case of a word that is longer than the maximum line width: It is emitted
alone as a line (that is too long).
6 pts.
16 pts.
4 pts.
Notes
• There will be several integration tests (whole-system) worth 27 points. The entire assignment is
worth 100 points.
• More information of the functions to implement can be found at
http://cs.unh.edu/~cs671/Programming/
• The “library” developed in this assignment doesn’t necessarily make much sense as a way to handle
text documents. It is defined as it is for illustration purposes.
• As an example, the Capitalization transformer can be implemented as follows:
trait Capitalization extends Transformer { self: CharSource =>
abstract override def words = super.words map (_.capitalize)
}
• The most difficult transformer to implement is LineBreaking . All the other transformers are
pretty straightforward.
• All transformers have to be implemented lazily, i.e., from iterator to iterator, without expanding
entire documents. Given the limitations of scala.collection.Iterator , the easiest way is to
use streams. Be careful, however, with memory leaks. They are avoided by making sure that the
iterators you implement do not keep a reference to the head of their underlying stream. 5
• It might be a good idea to turn off implicit conversions between document sources while debugging.
The easiest way is to remove function source2RichSource . But remember to put it back before
submitting or the grading code will not compile.
5 Something, apparently, that method Stream.iterator fails to do.
27 pts.