Homework 2 (Based on Lab 4 Hive)

Big Data


1. Follow Lab 4 on Hive. Import a new dataset to Hive: Crimes_-_2001_to_present.csv. (3 points)

Please note that this is the original crime dataset (1.5G). It is not preprocessed. Thus, you need to pay attention to data delimiters, naming of columns and setting up the correct datatype. Pay attention to address and date fields.

You may encounter error during import process. This exercise is to practice common problems you may encounter during data importing. You will list the following:

1) Problems/errors you have encountered during the import procedure.


2) Describe how you solve these problems. If you wrote a python program to preprocess data, please also upload your python program. You may use any creative ways to solve the import errors.


3) A screenshot showing you successfully import the data.


2. SQL practice (7 points)

Please answer the following questions using SQL. Please paste your SQL query, as well as the result screenshot (or partial result if the result is very long). For some questions, you may further summarize the result in a table.

1) Please find Top 5 primary criminal type in all crimes of Chicago.


2) Please retrieve # of “THEFTand “ROBBERY” in each year.


3) Which district has the highest “ROBBERY” cases?


4) Which 2 months have the highest # of “THEFT” cases?


5) Find the total number of crimes in January 2008 for each Ward and to print out the Ward name.


6) Find all types of location description, and identify those location descriptions that represent “residence” (notice other descriptions may be used to describe residence). Compare # of crimes that occurred in residence location and non-residence location by crime primary type from year 2005 to year 2010.