Assessed Coursework MA407
Algorithms and Computation
due: Monday, 23 January, 2017 at 14:00
Instructions
Tasks 1–4 on the following pages are assessed coursework, constituting 25% of the final mark for this course.
It is due on Monday, 23 January, 2017 at 14:00 (week 3 of Lent Term 2017). The tasks ask you, step by step, to complete a Java program consisting of five files in plain text format. One of these files,
Input.java, is given to you entirely. Another file, PhyloTree.java, is given to you with a complete main method, but it is your task to complete this class. It is also your task to write the remaining three files, PhyloData.java, TreeNode.java, and TreeNodeList.java. Make sure these files compile on the LSE computers. Submit the four files PhyloTree.java, PhyloData.java, TreeNode.java, and TreeNodeList.java on the Moodle page for MA407, under the submission link in the Assessed Course-work section. Make sure that each of your files contains your candidate number (see below).
Please only submit a single final version of your four files. Otherwise, you risk having the wrong submis-sion marked. Do not submit any other files.
The deadline is sharp. Late work carries an automatic penalty of 5 deducted marks (out of 100) for every 24 hours that the coursework is late.
Submission of answers to this coursework is mandatory. Without any submission your mark for this course is incomplete, and you may not be eligible for the award of a degree.
The work should be yours only. Plagiarism is considered an assessment offence at the LSE and is, as an instance of academic misconduct, taken very seriously. In a case of suspected plagiarism, the Department will act according to the School’s Regulations on Assessment Offences - Plagiarism. So only submit work that is solely your own. This also means that for developing your solutions you are not allowed to collaborate with other students.
The contents of your work must remain anonymous, so do not write your name in any of the program files.
Instead, you must identify your work with your candidate number. Please insert your candidate number as a comment in the first line in each of your files. You can check your candidate number via ‘LSE for You’.
Coursework MA407 – Phylogenetic Trees This coursework asks you to write a Java program for constructing and analysing (a very specific type of) phylogenetic trees. In biology such trees are used to represent the evolution of collections of living and non-living species. The basic idea is that a species alive at some point may evolve or split into two different species through (possibly several) genetic mutations. In biology different types of phylogenetic trees are used and various auxiliary data may be depicted in such a tree. Here we use the following simplified model. We assume that whenever a mutation (recorded in the tree) occurs then one species splits into two new species (that is, the old species ceases to exist). Hence, our phylogenetic trees have the following form. They are binary trees such that all leaves represent currently living species and each inner node represents a species not living anymore and has exactly two children and one parent (unless the node is the root). The following picture shows one of these phylogenetic trees twice: on the left as it is often depicted in biology, and on the right as we usually depict binary trees.
Luca
Eurybacteria
Endobacteria
Actinobacteria
Eukarya Archaea Chlorobacteria
Luca
Eurybacteria
Endobacteria Actinobacteria
Eukarya Archaea
Chlorobacteria
Moreover, for each inner node in such a tree we also record the time when this species split into its two children. For simplicity this time will just be an integer (without explicit unit) −t, indicating that that the species split t years ago. (So the current time is t = 0.)
Thegoalofthisprojectistoimplementdatastructuresforconstructingandrepresentingphylogenetictrees, and to design algorithms for determining certain properties of such trees. You should start off from the two files PhyloTree.java and Input.java, which are provided on the moodle page for MA407. PhyloTree is the main class of this program. Its main method makes use of the static method read(String file), which is the only method of the class Input. This method reads in the input, consisting of the information about the species in a phylogenetic tree and their relations from a file whose name is given as the first argument to the program. (We did not talk about how this is done in Java, so you do not need to understand the code in Input.java.) The format of the input is explained at the end of the file Input.java. On the moodle page for MA407 you also find several example input files: Data1.txt, Data2.txt, Data3.txt, Data4.txt, DataEmpty.txt. In your program you may assume that there are no errors or inconsistencies in the input files (that is, that the given input file describes a valid phylogenetic tree). The method read then uses the class PhyloData to store the given information in an object of type PhyloTree.
ForthistoworkcorrectlyyouneedtocompletetheclassPhyloTree, andimplementtheclassesPhyloData, TreeNode and TreeNodeList. The following tasks guide you through this process.
Firstly, you should implement a class TreeNode for representing the nodes in a phylogenetic tree (which is a binary tree). Each such node represents a species and has three attributes: an identifier (id) which uniquely identifies the species, a name, and a split time t (as int value) indicating how long ago the species split into two new species. If the species did not split (until present times) then this time should be 0. We remark that in the following, when printing information about this time we will usually write it as a negative number −t (indicating that some species split t time units ago).
Task 1. 20p
Write a class TreeNode in a file TreeNode.java implementing the following API.
(a) Constructor: TreeNode(int id, String name)
Creates a TreeNode with the given id and name, and no children and no parent.
(b) Instance method: void splitInto(TreeNode ch1, TreeNode ch2, int timeAgo)
This method should do nothing if this node has children already. Otherwise, this method inserts ch1 and ch2 as children of this node and sets the split time of this node to timeAgo.
(c) Instance method: boolean isSplit()
Returns true if the species represented by this node has split.
(d) Instance method: int getID()
Returns the id of this species.
(e) Instance method: TreeNode getLeft()
Returns the left child of this node.
(f) Instance method: TreeNode getRight()
Returns the right child of this node.
(g) Instance method: TreeNode getParent()
Returns the parent of this node.
(h) Instance method: int getSplitTime()
Returns the split time of this node.
(i) Instance method: boolean alive(int timeAgo)
Returns true if the species represented by this node was alive at -timeAgo. A species is alive from the time that its parent split until, but not including, the time when the species itself split.
(j) Instance method: boolean equals(TreeNode node)
Returns true if the species represented by this node equals the species represented by the given node. Two species are equal if and only if they have the same id.
(k) Instance method: String toString()
Returns a string representation of the species represented by this node providing information about id, name and split time of this species.
Moreover, in several operations that we later want to perform on phylogenetic trees it will turn out useful to have a linked list data structure which can store nodes of such a tree. It is your next task to implement such a data structure. You can, but do not have to, base your implementation on one of the linked data structures we discussed in class.
Task 2. 20p
Write a class TreeNodeList in a file TreeNodeList.java implementing a list data structure sup-porting the following operations. Items in this data structure should store a value of type TreeNode.
(a) Instance method: void addAsFirst(TreeNode node)
Adds a new item at the start of the list with value node.
(b) Instance method: boolean empty()
Determines if the list is empty.
(c) Instance method: TreeNode extractFirst()
Returns null if the list is empty. Otherwise, returns the value stored in the first item of the list and deletes this item from the list.
(d) Instance method: TreeNode extract(int id)
If a TreeNode with the given id is stored in the list, it is returned and the item storing it is deleted from the list. Otherwise, returns null.
(e) Instance method: String toString()
Returns a string listing all the nodes stored in the list in order, separated by comma.
Also describe in comments in your class which type of linked list you used and why, and what is the worst case running time of your methods addAsFirst, extractFirst, and extract.
Your next task is to implement a class PhyloData which can be used to construct a phylogenetic tree consisting of nodes of type TreeNode. This class contains the following three methods. The method addNewSpecies for creating a new node representing a species, which should later be part of the phy-logenetic tree. A method addNewSplit which is used to add information about splits of species. And a method toPhyloTree which is used to create an object of type PhyloTree representing the phylogenetic tree.
For simplicity, you may assume that this class is used as follows. First, all species (living as well as non-living) that appear in the tree are specified by repeatedly using addNewSpecies. Then, all the infor-mation about when one of these species split into two other of these species is given in repeated calls to addNewSplit. You may assume that these splits are given in chronological order, that is, first the split that occurred the longest time ago, and so on. Finally, the method toPhyloTree is used to return an object of type PhyloTree representing the phylogenetic tree given by the data.
Task 3. 20p
Write a class PhyloData in a file PhyloData.java implementing the following API.
(a) Instance method: void addNewSpecies(int id, String name)
Adds a new species with the given id and name to the phylogenetic tree.
(b) Instance method: void addNewSplit(int time, int id, int id1, int id2)
Adds a new split to the phylogenetic tree, representing the split of the species with the given id into the two species given by id1 and id2 that occurred -time ago.
(c) Instance method: PhyloTree toPhyloTree()
Returns a PhyloTree representing the inserted species and splits.
Observe that PhyloTree has one constructor which takes a node of type TreeNode as parameter and assumes that this node forms the root of an already entirely constructed binary tree representing a phylogenetic tree. Hence you need to decide how to construct this binary tree as part of the three methods described above. Also describe in a comment in this class what your strategy for constructing this binary tree is (and what data structures you use in this process).
Also explain for each of the three methods addNewSpecies, addNewSplit, and toPhyloTree in a comment what their worst case running time is and justify this.
Hint: It might be useful to use a TreeNodeList in the implementation of this class.
Finally, you should implement the class PhyloTree. An instance of this class stores the root (of type TreeNode) of a phylogenetic tree. The class PhyloTree then provides instance methods for inspecting the following properties of this phylogenetic tree. Firstly, we want to determine the most recent common
ancestor of two species, that is, the species among the common ancestors which comes chronologically most recently. Secondly, we want to to determine all species that were alive a certain time ago. Thirdly, we want to determine the closest living species to a given species s, that is, a species that is still alive and different from s but has shortest distance in the tree to s. Here, the distance between two species s and s 0 in the tree is the number of species on the path from s to s 0 . (For example, in the tree depicted on page 2 the path from Actinobacteria to Chlorobacteria contains the nodes Actinobacteria, Eurybacteria, Luca, Chlorobacteria.) In addition the class PhyloTree contains some additional methods to print information about the phylogenetic tree and to find a given species in the tree.
Task 4. 40p
Complete the class PhyloTree in the file PhyloTree.java implementing the following API (which does not mention the main class which is given to you).
(a) Constructor: PhyloTree(TreeNode root)
Creates a new PhyloTree with the given root. This method assumes that all species contained in the phylogenetic tree represented by this PhyloTree are already inserted as children, grand-children, and so on of root.
(b) Instance method: TreeNode find(int id)
Returns the TreeNode representing the species with the given id if it exists in this phylogenetic tree; otherwise returns null.
(c) Instance method: void print()
Prints this phylogenetic tree so that all species contained in the tree get printed, and one can easily read off the information which species split into which other species at what time.
(d) Instance method: TreeNode MostRecentCommonAncestor(TreeNode n1, TreeNode n2)
Returns the node representing the most recent common ancestor of n1 and n2.
(e) Instance method: TreeNodeList alive(int timeAgo)
Returns a list TreeNodeList storing all nodes of this phylogenetic tree which represent species that were alive at time -timeAgo.
(f) Instance method: TreeNode closestLiving(TreeNode node)
Returns the node x in this phylogenetic tree which represents the closest living species to node.
You need to decide how to print the phylogenetic tree in print. I give an example at the end of this document. But you may decide on an alternative way of printing your tree; please describe your strategy in a comment.
You also need to decide how to solve the algorithmic problems presented by the three methods mostRecentCommonAncestor, closestLiving, and alive. Again, describe in comments what your approaches are and justify why they work. Also determine and justify the worst-case running times of your algorithms. The method closestLiving can be implemented to run in time O(n), where n is the number of species in the tree. This is difficult, but if you can, use such a linear time algorithm.
As described at the beginning the main method of PhyloTree uses the method Input.read() to read the information about the phylogenetic tree from a file (given as first input parameter to the program) and stores it in an object of type PhyloTree (by using the methods addNewSpecies, addNewSplit, and toPhyloTree of PhyloData).
The main method then reads the remaining input parameters args[1],args[2],... (in this order) and when it reads args[i] it performs the following operations (and prints information about their result):
• If args[i] is m , then main interprets args[i+1] and args[i+2] as the ids of species in the phy-logenetic tree and determines their most recent common ancestor (using the method mostRecentCommonAncestor of PhyloTree),
• If args[i] is a, then main interprets args[i+1] as a positive integer t and determines the species that were alive at time −t (using the method alive of PhyloTree),
• If args[i] is c then then main interprets args[i+1] as id of a species in the tree and determines the closest living species (using the method closestLiving of PhyloTree).
Once you solved the tasks above the output of your program should look similar to the following:
$ java PhyloTree Data3.txt a 6000 c 2 m 3 1
3000 Ancestral amniote (split -10000)
1000 Reptiles (split -8000)
2000 Diapsids (split -7000)
300 Lepidosaurs (split -3500)
8 Squamates (alive)
7 Tuataras (alive)
400 Archosaurs (split -5000)
2 Crocodilians (alive)
500 Dinosaur and Pterosaur ancestor (split -4000)
200 Dinosaurs (split -2000)
100 Saurischians (split -1500)
5 Saurischian dinosaurs (alive)
6 Birds (alive)
4 Ornithischian dinosaurs (alive)
3 Pterosaurs (alive)
1 Parareptiles (alive)
9 Mammals (alive)
species alive at time -6000: 300 Lepidosaurs (split -3500), 400
Archosaurs (split -5000), 1 Parareptiles (alive), 9 Mammals (alive)
the living species closest to 2 is 3
1000 is the most recent common ancestor of 3 and 1
Remember to explain all your code well with comments and to indent properly. Do not use any Java libraries (other than the classes String or System) in your files (even though Input.java uses some).
END of assessed coursework, 2016/17 Good luck!