
SI 650 / EECS 549 Information Retrieval

Course Syllabus
Lectures: Wednesdays 1:00PM - 4:00PM
Sections: none this semester (see below)

Class Page https://umich.instructure.com/courses/624245

Instructor: David Jurgens ([email protected]);

Office Hours: Thursday 11:00am-12:00pm (EDT) in 3385 NQ.
GSIs:
Prithvijit Dasgupta,
Katsumi Ibaraki,
Ji Eun Kim,
Lu Xian,
Siyuan Cao

The explosive growth of online textual information (e.g., web pages, email, news articles, social media, and scientific literature) has made it increasingly important to develop tools that help users access, manage, and exploit this huge amount of information. Web search engines such as Google and Bing are good examples of such tools, and they are now an essential part of everyday life. In this course, you will learn the technologies underlying these and other powerful tools for connecting people with information and for accessing and mining unstructured information, especially text. You will learn the basic principles and algorithms of information retrieval, and you will gain hands-on experience using existing information retrieval toolkits to set up your own search engines and improve their search accuracy.

Unlike structured data, which is typically managed with a relational database, textual information is unstructured and poses special challenges because it is difficult to precisely understand natural language and users' information needs. In this course, we will introduce a variety of techniques for accessing and mining text information. The course emphasizes basic principles and practically useful algorithms. Topics to be covered include, among others, text processing, inverted indexes, retrieval models (e.g., vector space and probabilistic models), IR evaluation, text categorization, text filtering, clustering, topic modeling, deep learning, retrieval system design and implementation, web search engines, and applications of text retrieval and mining. This course is designed for graduate students and senior undergraduate students in the School of Information and in Computer Science and Engineering. The course is lecture based. Grading is based on individual programming assignments, class participation, a final exam, and a course project.
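To give a concrete taste of the kind of algorithm the course covers, here is a minimal sketch of TF-IDF weighting and cosine-similarity ranking over a toy collection. This is not course code or any assignment's required approach; the corpus, names, and the simple log(N/df) weighting are illustrative assumptions only.

```python
# Illustrative sketch only: toy TF-IDF weighting and cosine-similarity ranking.
# The corpus, names, and idf = log(N / df) weighting are assumptions for illustration,
# not course-provided code.
import math
from collections import Counter

docs = [
    "information retrieval finds relevant documents",
    "search engines rank documents by relevance",
    "deep learning models can rank search results",
]

def tokenize(text):
    return text.lower().split()

# Document frequency: in how many documents does each term appear?
df = Counter()
for doc in docs:
    df.update(set(tokenize(doc)))
N = len(docs)

def tfidf_vector(text):
    """Map text to a sparse {term: tf * idf} vector over terms seen in the corpus."""
    tf = Counter(tokenize(text))
    return {t: count * math.log(N / df[t]) for t, count in tf.items() if t in df}

def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

query = "rank relevant documents"
q_vec = tfidf_vector(query)
for doc in sorted(docs, key=lambda d: cosine(q_vec, tfidf_vector(d)), reverse=True):
    print(f"{cosine(q_vec, tfidf_vector(doc)):.3f}  {doc}")
```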

Learning Objectives

● Understand how search engines work.

● Understand the limits of existing search technology.

● Learn to appreciate the sheer size of the Web.

● Learn to write code for text indexing and retrieval.

● Learn about the state of the art in IR research.

● Learn to analyze textual and semi-structured data sets.

● Learn to appreciate the diversity of texts on the Web.

● Learn to evaluate information retrieval systems.

● Learn about standardized document collections.

● Learn about text similarity measures.

● Learn about semantic dimensionality reduction.

● Learn about the idiosyncrasies of hyperlinked document collections.

● Learn about web crawling.

● Learn to use existing tools of information retrieval.

● Understand the dynamics of the Web by building appropriate mathematical models.

● Understand how text classification and clustering work.

● Build working systems that assist users in finding useful information on the Web.

Textbooks and Readings

● Required: [Zhai and Massung] ChengXiang Zhai and Sean Massung, Text Data Management: A Practical Introduction to Information Retrieval and Text Mining. ACM and Morgan & Claypool Publishers, July 2016. http://dl.acm.org/citation.cfm?id=2915031 (Free U of M access)

● Required: [Manning et al.] Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008. http://nlp.stanford.edu/IR-book/information-retrieval-book.html

● [Fairness] Solon Barocas, Moritz Hardt, Arvind Narayanan. Fairness and Machine Learning: Limitations and Opportunities. https://fairmlbook.org/pdf/fairmlbook.pdf

Lecture and Sections

This course consists of a weekly three-hour lecture. Lectures will cover topic content and concepts, with occasional interactive activities among students, and we will show demos as well. Lectures are expected to be highly interactive, and questions are always encouraged. Occasionally, we will do in-class programming together.

This class has only in-person lectures for the F22 semester. Class will be held synchronously at the official meeting time. No lecture recordings will be made available.

There are no sections for this course (unlike previous semesters).

Office Hours Policy (and why you should come to them!)

Office hours will be held in a hybrid format this semester. We’ll hold office hours either in the instructor’s office or in a reserved GSI space on the first floor of NQ. This helps accommodate folks who can’t come in person, as well as those who are working or have a difficult commute to in-person office hours.
What are office hours even for?

● Come talk about the homework, and I can help you debug

● Brainstorm ideas about your projects

● Discuss research or the field of Information Retrieval

● Talk about career opportunities in Information Retrieval and beyond

● Talk about grad school, applications, strategy

● Give feedback about the class

● Ask literally any questions you have

● Get advice on classes/career/whatever

Office hours are here to support you and don’t have to be strictly related to the course content if you have questions that you think the instructor or GSIs can answer for you.

Homework and Grading

Assignments

The course will feature five programming assignments that will have you build a search engine from the bottom up with increasingly sophisticated features. You will reuse parts of your code between assignments, so getting familiar with the material is essential. We will provide static data files between assignments so you can catch up if needed (this will make more sense when you see the assignments). All assignments will be submitted to both the autograder and to Canvas. The autograder will check for implementation correctness; the Canvas submission will test more qualitative skills.
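For intuition about the kind of component these assignments have you build, here is a minimal sketch of an inverted index with conjunctive (AND) retrieval. It is not starter code and does not reflect the assignments' required interface; the class and method names are illustrative assumptions.

```python
# Illustrative sketch only: a tiny in-memory inverted index with boolean AND retrieval.
# Not assignment starter code; the class and method names are made up for illustration.
from collections import defaultdict

class ToyInvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of doc ids containing it
        self.docs = {}                    # doc id -> original text

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents that contain every query term (boolean AND)."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result

index = ToyInvertedIndex()
index.add(1, "the quick brown fox")
index.add(2, "the lazy brown dog")
index.add(3, "a quick brown dog")

print(index.search("brown dog"))   # {2, 3}
print(index.search("quick fox"))   # {1}
```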

Final Project

The project is a chance for you to gain much deeper, hands-on knowledge of a topic of your choosing. You’ll develop the project throughout the semester, and it may be done individually or in small teams. An open-ended project falls into one of three categories:

● IR system - develop a workable, useful IR system. Students will be responsible for identifying a vertical domain, implementing a search engine or another type of information retrieval system for the domain, evaluating it, and deploying it either on the Web or as an open-source tool. A general-purpose search engine or Web-scale system is discouraged (due to complexity and training times).

● Research project - Students will take on a research problem, formulate hypotheses, run experiments, and write a technical report in the format of a conference submission.

● Survey paper (in rare conditions) - identify a cutting-edge topic of research in IR and summarize at least 10 recent papers on it, along with a synthesis that compares and contrasts the papers involved.

Exam

The course will have an in-person exam at the regularly scheduled exam time during finals week.

Grading

● Homework assignments: 50%

● Final Exam (take-home): 20%

● Course Project: 30%

○ Proposal: 5%
○ Progress check: 10%
○ Final report: 15%

Grading Scale

Grading follows this scale.
93-100: A
90-93: A-
87-90: B+
83-87: B
80-83: B-
77-80: C+
73-77: C
70-73: C-
67-70: D+
63-67: D
60-63: D-

Below 60: F

A grade of A+ is given only to students who perform exceptionally well on the class project.

There is no curve for this class. That said, usually the majority of students have received As. If you are concerned about your grade, please come see the instructors early (not the last week) while there is still time to improve your performance.

No extra credit or make-up assignments are allowed (e.g., no retroactive work after the end of the semester to bump up grade). However, if a numeric grade is just below the letter grade borderline, the instructor reserves the right to increase the letter grade if the student has done a truly exceptional project (but will never decrease the grade!).

Tentative Schedule:

Week 1 (8/30): Introduction to information retrieval
● Vector spaces and similarity;
● Basic Text Processing
● Probabilities and Statistics
Reading:
● Vannevar Bush, "As We May Think" (1945)
● [Zhai and Massung] Chapter 2, “Background”
● Safiya Umoja Noble’s book Algorithms of Oppression: How Search Engines Reinforce Racism Chapters 1-3
● Ian O’Hara, “Feedback Loops: Algorithmic Authority, Emergent Biases, and Implications for Information Literacy”
● [Zhai and Massung] Chapter 3: Text Data Understanding
● [Fairness] Chapter 1
● Thorsten Brants, "Natural Language Processing in Information Retrieval"

Week 2 (9/6): Building Text Retrieval Systems
● System architecture;
● Boolean models;
● Inverted Indexes;
● Document ranking;
● IR Evaluation;
Reading:
● [Zhai and Massung]: Chapter 5 “Overview of text data access”; Chapter 9 “Search Engine Evaluation”.
● [Manning] Chapter 1 “Boolean retrieval”; Chapter 8 “Evaluation in information retrieval”
● Ulloa, Roberto, Mykola Makhortykh, and Aleksandra Urman. "Algorithm Auditing at a Large-Scale: Insights from Search Engine Audits." arXiv preprint arXiv:2106.05831 (2021).

Week 3 (9/13): Retrieval Models: Vector Space Models
● Vector space models;
● TF-IDF weighting;
● Retrieval axioms;
● Implementation issues;
Reading:
● [Zhai and Massung] Chapter 6: “Retrieval Models”, up to 6.3
● [Manning] Chapter 6 "Scoring, term weighting & the vector space model"
● A. Singhal. Modern information retrieval: a brief overview. IEEE Data Engineering Bulletin, Special Issue on Text and Databases, 24(4), Dec. 2001 (http://singhal.info/ieee2001.pdf).
● Hui Fang, Tao Tao and ChengXiang Zhai. "A Formal Study of Information Retrieval Heuristics". In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 49-56, Sheffield, United Kingdom, 2004. (http://www.ece.udel.edu/~hfang/pubs/sigir04-formal.pdf)

Week 4 (9/20): Retrieval Models: Probabilistic Models
● Okapi/BM25;
● Language models;
● KL-divergence;
● Smoothing;
Reading:
● [Zhai and Massung] Chapter 6: “Retrieval Models”, 6.4
● [Manning] Chapter 11 & 12
● Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?