
Stat 4/6250 Individual Project (Spring 2021)

Published: 2021-05-06



General Rules

1. Academic Honesty: This individual project is an open-book, take-home project. Though discussion is encouraged, each student should complete the project through his/her own effort. Students can read and search related materials from both offline and online sources. Another author’s intellectual contributions (e.g. language, code, figures, thoughts, ideas, expressions etc.) should be properly cited if they appear in your project report. Students are expected to abide by the UGA academic honor code and must not copy from the works of others. This includes published or unpublished articles, Wiki documents, etc. Plagiarism will result in failure and, very likely, will have more severe consequences.


2. Goal: The goal of this project is for students to demonstrate they are capable of taking a problem description and its associated dataset and producing a report explaining statistical analyses that solve the stated problem. There is not necessarily a uniquely correct answer. A successful report should produce not only correct analysis methods, but also explanations that would be comprehensible to someone with only a basic knowledge of statistics.


3. Grading: This individual project contributes 30% of the final score of the Stat 4/6250 course. The grading is based on the overall quality of the report. The following aspects are considered important for a high-quality report: (a) the methodology used should be suitable; (b) the implementation and results should be correct and clear; (c) the explanation should be comprehensible; (d) the mathematics involved should be rigorous; (e) the presentation should be precise and concise; (f) the report format should be correct.


4. Deadline and Submission: The project is due at 06:00 PM on Sunday, May 2nd, 2021. No late submission is allowed. The project should be submitted electronically on eLC in the form of a formal research report. Students can write up the report in any word processing software; however, the submission should contain only a single PDF file.


5. Format: The report should be prepared on A4/US letter sized paper. The main report should be no more than 8 pages including everything (e.g. text, equations, tables, figures etc.) except the list of references. The list of references should be added at the end of the report. The font size should be no smaller than 11 pt and the line spacing should be at least single. The margins should be at least 1 inch on top and bottom and 1.25 inches on left and right. All tables and figures included in the report should be properly numbered. You may number only the equations that are referenced elsewhere in the report. The sections and subsections should also be properly numbered. Remember that you have a limited page budget; please use it wisely when choosing what to include in your report.


6. Implementation and Coding: Students can implement any suitable and reasonable method to solve the problem. For each method, students should give a clear description of its implementation, including details such as how to measure the similarity or how to choose the tuning parameters. The standard of clarity is that someone else can replicate the method based on your description. Implementing methods beyond the scope of the lectures is neither encouraged nor penalized. Students can code in any programming language. Please do not include code in the report. If necessary, you can summarize your program in an algorithm format (like the ones shown in the lecture slides). Again, remember the page limit.


7. Role of Instructor: If any part of the rules or the project problem seems ambiguous to you, please contact the instructor for further clarification. You are encouraged to meet and discuss your final project with the instructor during office hours or by making an appointment.


Problem: Room Occupancy Prediction

The goal of this problem is to build a model to automatically predict whether a room is occupied or not, based on attributes such as room temperature, humidity and light.

        The data consist of three data-sets: a training set (Training.txt), a validation set (Validation.txt) and a test set (Test.txt), i.e. two for training and one for testing. For any model you try, you should use only the training set to fit the model. Any hyper-parameters should be selected on the validation set. The test set should be used only to assess the prediction performance and must not be touched in model training or parameter selection.
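The protocol above can be sketched as follows. This is only a minimal illustration under stated assumptions: the data are synthetic stand-ins generated in the script (not the project files), the classifier is a hand-written k-nearest-neighbour rule chosen purely for brevity, and the names `make_split`, `knn_predict` and `candidate_ks` are hypothetical. It is not a recommended method for the project.

```python
import random

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training points nearest to x."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    votes = sum(train_y[i] for i in order[:k])
    return 1 if 2 * votes > k else 0

def accuracy(train_X, train_y, X, y, k):
    """Share of points in (X, y) that the k-NN rule classifies correctly."""
    hits = sum(knn_predict(train_X, train_y, x, k) == label
               for x, label in zip(X, y))
    return hits / len(y)

def make_split(n, rng):
    """Synthetic stand-in for one data file: two features, binary label."""
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        X.append((rng.gauss(2.0 * label, 1.0), rng.gauss(2.0 * label, 1.0)))
        y.append(label)
    return X, y

rng = random.Random(0)
train_X, train_y = make_split(200, rng)   # used to fit the model
val_X, val_y = make_split(100, rng)       # used to select hyper-parameters
test_X, test_y = make_split(100, rng)     # used once, for final assessment

# Hyper-parameter selection: score each candidate k on the validation set only.
candidate_ks = [1, 3, 5, 7, 9]
val_scores = {k: accuracy(train_X, train_y, val_X, val_y, k)
              for k in candidate_ks}
best_k = max(candidate_ks, key=lambda k: val_scores[k])

# Final assessment: the test set is touched only after k has been fixed.
test_accuracy = accuracy(train_X, train_y, test_X, test_y, best_k)
print("selected k:", best_k, "test accuracy:", round(test_accuracy, 3))
```

The point of the sketch is the order of operations: the test labels play no role in choosing `best_k`, so the reported test accuracy is not biased by the selection step.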

        The variables contained in the data-sets are:

(a) Date: in the form: year-month-day hour:minute:second

(b) Temperature: in Celsius

(c) Relative Humidity: in %

(d) Light: in Lux

(e) CO2: in ppm

(f) Humidity Ratio: a derived quantity from temperature and relative humidity; in kg(water-vapor)/kg(air)

(g) Occupancy: 0 or 1; 0 for not occupied, 1 for occupied status. Ground-truth occupancy was obtained from time stamped pictures of the rooms that were taken every minute.
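As a hedged illustration of how these variables might be read into a program, the sketch below parses a small in-memory sample. The comma delimiter, the header row and the column names are assumptions about the file layout and should be checked against the actual files; the two sample rows are made up for illustration and are not taken from the data. Only the date format (year-month-day hour:minute:second) comes from the description above.

```python
import csv
import io
from datetime import datetime

# Made-up rows for illustration only. The delimiter, header row and
# column names are assumptions, not a description of the real files.
SAMPLE = """Date,Temperature,Humidity,Light,CO2,HumidityRatio,Occupancy
2021-01-01 09:30:00,22.5,30.0,450.0,800.0,0.0050,1
2021-01-01 23:15:00,20.1,28.5,0.0,420.0,0.0041,0
"""

def parse_rows(handle):
    """Convert each record to typed values: datetime, floats, and a 0/1 label."""
    rows = []
    for rec in csv.DictReader(handle):
        rows.append({
            "date": datetime.strptime(rec["Date"], "%Y-%m-%d %H:%M:%S"),
            "temperature": float(rec["Temperature"]),      # Celsius
            "humidity": float(rec["Humidity"]),            # relative humidity
            "light": float(rec["Light"]),                  # Lux
            "co2": float(rec["CO2"]),                      # ppm
            "humidity_ratio": float(rec["HumidityRatio"]),
            "occupancy": int(rec["Occupancy"]),            # 0 or 1
        })
    return rows

rows = parse_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row["date"].hour, row["light"], row["occupancy"])
```

Parsing the date into a `datetime` object makes it straightforward to derive features such as hour of day or day of week, should they prove useful in the exploratory step.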

        The following article (saved as CandanedoFeldheim2016.pdf in the QEP folder) discusses various analyses of this data-set, using some well-known classification methods.

Candanedo and Feldheim (2016), Accurate occupancy detection of an office room from light, temperature, humidity and CO2 measurements using statistical learning models. Energy and Buildings. 112(15), 28–39.

        Here are some suggested (not compulsory) steps you can follow in your own analysis:

(1) Do a literature review on this problem (including but not limited to the aforementioned paper).

(2) Explore the data through visualization tools and exploratory statistics.

(3) Try and compare different classifiers.

(4) Fine-tune one or more classifiers to squeeze out more prediction performance.

(5) If you are satisfied with some method you tried, explain why this method outperforms its competitors in this problem. If you are not satisfied with any of the methods you tried, think about why they do not work well in this problem. What can you do to improve the performance?
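When comparing classifiers as in steps (3) to (5), it may help to look beyond a single accuracy figure. The sketch below is one possible way to summarize binary predictions; the label vector and the two prediction vectors are made up for illustration, and the names `summarize`, `pred_a` and `pred_b` are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def summarize(y_true, y_pred):
    """Accuracy, precision and recall; a ratio with zero denominator is set to 0.0."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Made-up validation labels and predictions from two hypothetical classifiers.
y_val = [1, 1, 0, 0, 1, 0, 0, 1]
pred_a = [1, 0, 0, 0, 1, 0, 1, 1]
pred_b = [1, 1, 0, 0, 0, 0, 0, 0]

summary_a = summarize(y_val, pred_a)
summary_b = summarize(y_val, pred_b)
print("A:", summary_a)
print("B:", summary_b)
```

In this toy example both classifiers reach the same accuracy (0.75), yet A misses one occupied case and raises one false alarm, while B raises no false alarms but misses half of the occupied cases. Which behaviour is preferable depends on the application, which is the kind of reasoning step (5) asks for.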