COMP4650/6490 Document Analysis – Semester 2 / 2022


This course is an introduction to document analysis from the perspective of interpreting the content of text-based documents, and introduces the key concepts of Natural Language Processing and the Machine Learning that supports it. It considers the “document” and its various genres as a fundamental object for business, government and community. It introduces the broad skills required for processing semi-structured documents such as internet pages, RSS feeds and their accompanying news items, and PDF brochures.

To do this, the course covers four broad areas: (a) information retrieval (IR), (b) machine learning (ML) for Natural Language Processing (NLP), (c) Natural Language Processing (NLP), and (d) NLP in Practice. Basic tasks are covered including content collection and extraction, formal and informal natural language processing, information extraction, information retrieval, classification and analysis. Fundamental probabilistic techniques for performing these tasks, some common software systems, and selecting and applying relevant machine learning algorithms will be covered. The course covers a breadth of common and emerging techniques from statistics and computer science with the aim of understanding where they might be useful and how to use them properly, while understanding limitations. Detailed course content will be made available during the course on the Wattle site.

Learning Outcomes

Upon successful completion of this course, students will:

1. understand the role documents play in business and community, and the various digital resources available for document analysis;

2. have the background theory and practical knowledge necessary to plan and execute a basic document analysis project;

3. be able to differentiate between the basic probabilistic theories of language and document structure, information retrieval, classification and clustering;

4. be able to identify the basic algorithms and software available for probabilistic theories of language and be proficient at using common libraries for natural language processing to perform basic analysis tasks;

5. be able to index a document collection for use in an information retrieval system, and demonstrate advanced knowledge of basic theories and algorithms to determine large-scale named-entity matching and standardization of names within a collection; and

6. be able to extract people, places and entities from within a document and perform automatic summarization of documents.

Course Syllabus

0. Course Administration

1. Introduction to Document Analysis

2. Information Retrieval (IR)

● Introduction to Information Retrieval

● Ranked Retrieval

● Evaluating IR systems

● Web Search

3. Machine Learning (ML) for Natural Language Processing (NLP)

● Supervised Learning: linear and non-linear classification (including Neural Networks), embeddings

● Unsupervised Learning: including self-supervised learning and BERT

4. Natural Language Processing (NLP)

● Semantics

● Parsing

● Language Models

5. NLP in Practice (NLP_IP)

● Information Extraction


Quick Reference

Mode of Delivery: Online delivery via live Zoom lectures

Assumed Knowledge: Programming ability in C, C++, Java, Python or R, and basic mathematical and statistical knowledge, at an undergraduate level

Course Convener: Dr. Dawei Chen, dawei.chen@anu.edu.au

Course Assistance: Dr. Dawei Chen, Dr. Jo Ciucă, Dr. Alex Mathews



Reference Books

Introduction to Information Retrieval. C.D. Manning, P. Raghavan and H. Schütze. Cambridge University Press. 2008. https://nlp.stanford.edu/IR-book/

Speech and Language Processing (3rd ed. draft). Dan Jurafsky and James H. Martin. 2022.



Lectures: online live via Zoom (recordings made available)

Labs: online live and in-person on campus

Assessment Scheme

Assessment components, weighting and learning outcomes

[Table: Assessment Task | Value % | Module Topics | Learning Outcomes]

Note that:

● A late penalty of 5% will be applied to assignments submitted up to 24 hours after the due date. Assignments submitted more than 24 hours late will receive a mark of 0.

● No late submission of the online quizzes will be permitted without a pre-arranged extension.

● For assessment task due dates and time please see the Course Schedule on the Wattle site.

● For assignment details please see the Wattle site where updates will be posted.

● Online quizzes will be offered weekly. Marks for all quizzes will be totalled and scaled to contribute 5% to the overall course mark. All quizzes (except the ungraded practice quiz in week 1) contribute the same amount to the overall course mark.

You must complete each quiz before the due date; once started, you will have 30 minutes to complete it. Quizzes are primarily intended for self-learning. Two attempts are permitted for each quiz, and the questions in the two attempts may differ. Automated feedback on correct answers is given after each attempt.

● The Final exam will be a 3-hour online exam. Detailed information will be provided via the Wattle course site.

Differences in Assessment between COMP4650 and COMP6490

● Additional assessment may be provided for COMP6490 in the form of additional assignment questions.

● Not all assignments will contain additional questions.

● These additional questions will be compulsory for COMP6490.

● Additional questions will be clearly marked as COMP6490 Only.

● If these additional questions are completed by students enrolled in COMP4650, they will not be marked, as they do not contribute towards the grade.

Final course mark

● Students must submit all assessment items.

● The Final Exam is a hurdle assessment. This means that students must achieve a minimum of 50% in the final exam in order to pass the course.

● At least 50% overall is required to pass the course.

● Supplementary assessment will be offered to any student who has

(a) passed the hurdle assessment items AND achieved a final overall mark of at least 45% and less than 50%; OR

(b) achieved between 45% and 49% for the hurdle assessment and, had that assessment item been passed, would otherwise pass the course.

Note: raw marks for assessment components, as well as final overall marks, may be scaled by the convener or as a result of school or college academic review.
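The pass and supplementary rules in this section can be read as a small decision procedure. The following Python sketch is illustrative only and is not an official calculator: the function name `course_outcome` is hypothetical, and it assumes that all assessment items were submitted, that the final exam is the only hurdle, that "between 45% and 49%" means at least 45% and less than 50%, and that any scaling has already been applied to the marks passed in.

```python
def course_outcome(overall: float, exam: float) -> str:
    """Return 'pass', 'supplementary' or 'fail' for marks given as percentages.

    Assumptions (not stated in the outline): all items were submitted, the
    final exam is the only hurdle, and "between 45% and 49%" is read as
    45 <= mark < 50. Any scaling is applied before calling this function.
    """
    passed_hurdle = exam >= 50

    # A pass needs both the hurdle and at least 50% overall.
    if passed_hurdle and overall >= 50:
        return "pass"

    # Case (a): hurdle passed, overall mark at least 45% and less than 50%.
    if passed_hurdle and 45 <= overall < 50:
        return "supplementary"

    # Case (b): hurdle narrowly missed, but the overall mark would be a pass.
    if 45 <= exam < 50 and overall >= 50:
        return "supplementary"

    return "fail"
```

For example, under these assumptions a student with 58% overall and 47% in the final exam would fall under the second supplementary case, while a student with 60% overall and 40% in the final exam would not be eligible.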

Policy on late assessment and re-marking

● A late penalty of 5% will be applied to assignments submitted up to 24 hours after the due date. Assignments submitted more than 24 hours late will receive a mark of 0.

● Extensions to the due date for submission will only be granted if requests are made to the convener at dawei.chen@anu.edu.au well in advance, stating the reasons for requiring the extension, the evidence supporting the request (usually a medical certificate), and the extension period requested.

● Students may consider applying for special consideration (https://www.anu.edu.au/students/program-administration/assessments-exams/special-assessment-consideration). An application form must be completed and lodged online within three business days of the original due date of the assessment task.

● Any appeals or requests for re-consideration regarding an assessment piece must be submitted in writing to the teaching team via the Piazza forum, using a private note directed to all instructors, within two weeks of the assessment results being released. Please tag your request with the appropriate tag. For example, for assignment 1 regrade requests use the tag assignment1-regrade; for clarification use assignment1-marking_clarification. Requests received via email or other means will not be considered.

● Requests for re-grading of assignment questions may result in marks going up, staying the same or going down depending on the outcome of the re-grading. Assessment marks will be fixed two weeks after they have been released.

Student Assistance

● For general questions relating to assessments, labs, etc., please post to the Piazza forum for the benefit of all other students. For personal matters, special consideration and similar queries, please contact the course convener directly.

There are further details below on use of the Piazza forum.

Academic Misconduct

Students are expected to have read the ANU Academic Misconduct Rule before commencement of the course. No group work is permitted in any part of the assessment in this course. Plagiarism will not be tolerated, and University procedures will be applied ruthlessly. Therefore, your contributions are expected to be yours alone, except for work that is clearly and appropriately attributed. You may find this a helpful guide to understanding what constitutes plagiarism and how seriously various violations will be treated:


Every student is expected to be able to explain and defend a submitted assessment item.  The course convener may conduct or initiate an additional interview about any submitted assessment item. If there is a significant discrepancy between the two forms of assessment, it will be automatically treated as a case of suspected academic misconduct.

Turnitin or Moss may be used in this course to check for plagiarism.

Support for Students

The University offers a number of support services for students. Information on these is available online from https://www.anu.edu.au/students.

Course Organisation

Please familiarise yourself with the Wattle course site. You should be able to access this about 24 hours after your enrolment in the course.

You will see that there is a section for each online week of the course. Sections may not be visible until the respective time period commences, to help you pace your way through the course. You are expected to work through the course notes by self-study or in self-organised study groups if you prefer.

Each course topic contains a quiz to test your understanding. All the quizzes are considered mandatory. The content covered is examinable and you will have trouble if you do not keep up with it. Each quiz is automatically marked and contributes to the quiz component of the assessment scheme. The quiz should be attempted as your final learning task for each section.

Each week will include lectures, given by the lecturers, and supervised laboratory sessions.

Communication and getting help

The course Piazza forum is the primary mechanism to raise questions or observations on the course material and this will be monitored very frequently by the course convener, lecturers and tutors.

Please pay attention to course announcements on the Piazza forum as these are sometimes critical for course completion and assessment.

Feedback will be provided for submitted assignments, generally within two weeks of the due date.

Unless you are specifically directed, please do not contact lecturers or course tutors outside scheduled classes as they have been engaged to assist on specific tasks in the course. You may contact the course convener for private or personal matters using the contact information given at the top of this document, but, to repeat, the Piazza forum is to be used as the primary method for engagement with course staff.

Generally, if you find that you do not understand something, or that it might be erroneous, or that something is particularly interesting, many of your co-students will also find it confusing, wrong or interesting, and we can all benefit from your post.

ANU is committed to the demonstration of educational excellence and regularly seeks feedback from students. One of the key ways offered to students to provide feedback is through Student Experience of Learning Support (SELS) surveys. The feedback given in these surveys is anonymous and provides the Colleges, University Education Committee and Academic Board with opportunities to recognise excellent teaching, and importantly, opportunities for improvement. For more information on student surveys at ANU and reports on the feedback provided on ANU courses, see https://unistats.anu.edu.au/surveys/selt/students/ and https://unistats.anu.edu.au/surveys/selt/results/learning/.

Once or twice during the course students may also be asked to complete a survey on specific matters that can inform the remainder of the course or future course design.

Required resources

A laptop or desktop with a reliable internet connection is required for accessing the course material on Wattle and for completing the practicals and assignments. Python and Jupyter Notebook will be used extensively in this course, so being able to install freely available software will be necessary. An alternative is to have access to a laptop or desktop where appropriate software is already installed.

Further details on software used, and instructions, can be found on the Wattle site for the course.

And finally, the convener’s expectations of you as a learner

Kindly refer to the Learning Expectations for students of the Research School of Computer Science, provided on the course Wattle site. Despite our best efforts, there are sure to be some errors or confusing messages slipping through in the course materials and in the course administration, and we encourage you to assist in their resolution or improvement.  Please, if you ask for clarification or correction of administrative matters or the course material, ask it publicly via the Piazza forum or during the lab sessions so that we can share the answer, for the benefit of all of us.