
Assignment 2 CSC2062

Worth 25% of the module assessment. Assignment is marked out of 100 marks.

Deadline: 11pm Friday, 17th March 2023.

This version: 2023-01-19.

Changelog: None

Introduction

In this assignment, you will:

(a)   Create a dataset of hand-drawn pictures (which you will use for your analyses and experiments in the rest of Assignment 2, and in Assignment 3).

(b)   Perform feature engineering, i.e. calculate features (variables) from the hand-drawn images which may be useful for identifying the handwritten symbols automatically.

(c)    Perform statistical analysis of the datasets, using methods of statistical inference.

(d)   Implement introductory machine learning models that perform classification on the dataset.

When you use a procedure that has an element of randomness, please use the seed value 2062 (this is so that your code gives the same results each time it runs). This assignment must be completed in R. You may not use Microsoft Excel to complete any part of this assignment.
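As a minimal sketch of the seed requirement (the sampling step here is only a hypothetical stand-in for any random procedure), setting the seed immediately before the random call makes the result repeatable:

```r
# Fix the random seed before any step that involves randomness.
set.seed(2062)
first_draw <- sample(1:112, 5)    # hypothetical example: pick 5 of the 112 images

# Resetting the same seed reproduces exactly the same draw.
set.seed(2062)
second_draw <- sample(1:112, 5)

identical(first_draw, second_draw)   # TRUE
```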

Please read carefully the information about the assessment criteria, deliverables, and marking process at the end of this document.

Section 1 (10 marks): Creating a dataset

This section asks you to build a dataset of hand-drawn images for each of 4 living thing objects {cherry, banana, lemon, tree} and 4 nonliving thing objects {envelope, golfclub, pencil, wineglass}. You will create 14 example images for each of the 8 classes, giving a total of 112 images. Each image should be obtained by hand-drawing the image yourself (either with a touch screen or simply using the computer mouse). The quality of the drawing is not very important, as long as the objects can easily be distinguished.

The images for each of the eight objects should follow the same basic canonical form as in the examples presented in Figures 1 and 2, though the images will vary slightly from item to item, simply due to the variability in how you draw them, as shown in the examples. Each drawn object should fit reasonably well in a 45 x 45 box (i.e. do not draw a tiny object in one corner of the 45 x 45 box; this will make your life easier when it comes to doing analyses!).

[Example images for the classes: cherry, banana, lemon, tree]

Figure 1: Examples of drawings for the four living thing objects. Your pictures should have the same basic form.

[Example images for the classes: envelope, golfclub, pencil, wineglass]

Figure 2: Examples of drawings for the four non-living thing objects. Your pictures should have the same basic form.

Each image will be represented by a black & white matrix with 45 rows by 45 columns. In the matrix, the number “1” represents black pixels and “0” represents white pixels. As such, one image can be stored in a plaintext tab-delimited “.tsv” file containing the matrix (and no headers). Below is an example of how the tab-delimited data might appear in a text editor.

Figure 3: Matrix representation of a cherry drawing. The data consist of a tab-delimited file (.tsv) consisting of 45 rows and 45 columns with entries of “1” denoting the black pixels and “0” denoting the white pixels.
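To illustrate the representation, the following sketch writes a toy 45 x 45 matrix to a headerless tab-delimited file and reads it back into R. The diagonal stroke and the temporary file path are assumptions made only so that the example is self-contained.

```r
# Toy image (hypothetical): a diagonal black stroke on a white 45 x 45 canvas.
toy <- matrix(0L, nrow = 45, ncol = 45)
diag(toy) <- 1L

# Store it as a tab-delimited file with no row or column headers.
tsv_path <- tempfile(fileext = ".tsv")
write.table(toy, tsv_path, sep = "\t", row.names = FALSE, col.names = FALSE)

# Read the file back into a plain numeric matrix.
img <- as.matrix(read.table(tsv_path, sep = "\t", header = FALSE))
dimnames(img) <- NULL

dim(img)        # 45 45
sum(img == 1)   # 45 black pixels, one per diagonal cell
```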

You may use whatever means you prefer to obtain the 112 .tsv files, provided they are hand-drawn by you and are saved in the tab-delimited .tsv format specified above. However, it is strongly recommended that you use the software GIMP (http://www.gimp.org). GIMP is available for free for all major PC operating systems, and is also installed on the lab machines and the EEECS virtual machines. Using GIMP, you can create a new image with 45 by 45 points (px), advanced options 1 pixel/pt, color space grayscale, fill with background colour. This will give you a small white square, which you can magnify to e.g. 1600% in order to make it easier to draw on (see Fig. 4).

Figure 4: Creating the blank canvas of size 45 x 45 pixels in the GIMP interface.

To draw on the image, you can select the pencil tool and adjust the brush size to 1 pixel (see Fig. 5).

Figure 5: A cherry drawn on the 45 x 45 grid with pencil size of 1 pixel.

The standard file formats of GIMP are useful to save the images, but we need a more easily readable format. One good option is to export as PGM, type ASCII. This PGM file can be opened in GIMP, but it is also simply a text file that can be opened in a text editor (or read as a text file by R code). The PGM text file has a header consisting of the following four lines:

P2

# CREATOR: ...

45 45

255

The third and fourth lines of the header above specify the pixel array size and the maximum allowed pixel value, respectively. (The images are greyscale, with 0 representing fully black and 255 representing fully white).1

The remaining lines of the file specify the pixel values, with one value on each line; the total number of pixel values should correspond to the specified array size (i.e. 45*45=2025).

For our purposes, a number < 128 will represent a black pixel, while a number >= 128 represents a white one. Such a format can be easily converted into a matrix containing ones and zeros, as presented in Figure 3 above (you must write some R code to do this; reading in the PGM file and writing out the square .tsv file). (As well as creating the square tsv files, you may also want to keep the PGM files, in case you need to inspect the data later on).
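The conversion described above could be sketched roughly as follows. This is one possible approach, not a prescribed solution: the function name pgm_to_matrix, the toy image and the temporary file paths are all assumptions introduced for illustration. The sketch assumes an ASCII (P2) PGM file whose comment lines begin with "#".

```r
# Hypothetical helper: read an ASCII PGM file and return a 0/1 matrix
# (1 = black, 0 = white), using the threshold rule given in the text.
pgm_to_matrix <- function(pgm_path, threshold = 128) {
  lines <- readLines(pgm_path)
  lines <- lines[!grepl("^\\s*#", lines)]      # drop comments such as "# CREATOR: ..."
  tokens <- scan(text = lines, what = character(), quiet = TRUE)
  stopifnot(tokens[1] == "P2")                 # ASCII greyscale format only
  width  <- as.integer(tokens[2])
  height <- as.integer(tokens[3])
  pixels <- as.integer(tokens[-(1:4)])         # skip P2, width, height, max value
  stopifnot(length(pixels) == width * height)
  # PGM lists pixel values row by row, so fill the matrix by row.
  grey <- matrix(pixels, nrow = height, ncol = width, byrow = TRUE)
  (grey < threshold) * 1L
}

# Build a toy PGM file so the sketch is self-contained.
grey_values <- matrix(255L, nrow = 45, ncol = 45)
grey_values[10, 5:40] <- 0L      # a horizontal black stroke in row 10
grey_values[20, 20]   <- 100L    # dark grey: below 128, so treated as black
grey_values[21, 20]   <- 200L    # light grey: treated as white
pgm_path <- tempfile(fileext = ".pgm")
writeLines(c("P2", "# CREATOR: toy example", "45 45", "255",
             as.character(t(grey_values))), pgm_path)

# Convert and write out the square tab-delimited matrix.
img <- pgm_to_matrix(pgm_path)
tsv_path <- tempfile(fileext = ".tsv")
write.table(img, tsv_path, sep = "\t", row.names = FALSE, col.names = FALSE)
```

Filling by row matters here: R fills matrices column by column by default, which would transpose the drawing.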

You shall save each image matrix as a tsv file following the specification above, and using the filename STUDENTNR_LABEL_INDEX.tsv, where STUDENTNR is your student number (e.g. 4012345), LABEL is the name of the object in the image (i.e. one of {cherry, banana, lemon, tree, envelope, golfclub, pencil, wineglass}), and INDEX is a two-digit code from ‘01’ to ‘14’, indexing the set of 14 images you must create for each object class.

Be sure your filenames have exactly this format, with precisely these labels. For example, if your student number is 4012345, then 4012345_lemon_08.tsv would be the eighth image you created for the lemon class.2
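As an illustration of the naming scheme, the full set of 112 filenames can be generated programmatically. This sketch uses the example student number from the text; the variable names are assumptions.

```r
student_nr <- "4012345"   # example student number from the text
labels <- c("cherry", "banana", "lemon", "tree",
            "envelope", "golfclub", "pencil", "wineglass")

# All label/index combinations; the index varies fastest within each label.
grid <- expand.grid(index = 1:14, label = labels, stringsAsFactors = FALSE)

# %02d pads the index to two digits ("01" to "14").
file_names <- sprintf("%s_%s_%02d.tsv", student_nr, grid$label, grid$index)

length(file_names)   # 112
file_names[36]       # "4012345_lemon_08.tsv"
```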

As part of your submission, upload the tsv files that you create in a directory called “images”. Any code you wrote to create the tsv files should be presented in the assignment R markdown notebook (see submission instructions at the end of this document).

It is very important to upload the images in the correct tsv format as these files will be used to verify your calculations in the next section. The .tsv files should be tab-delimited, not comma-delimited or anything else. File encoding should be UTF-8 (not UTF-8-BOM or anything else). You can check the encoding in Notepad++.
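A file can also be sanity-checked from R before uploading. The helper below is a hypothetical sketch (check_tsv is not part of any package): it tests for a UTF-8 byte-order mark, 45 lines, 45 tab-separated fields per line, and entries restricted to 0 and 1.

```r
# Hypothetical helper: returns a named logical vector, one entry per check.
check_tsv <- function(path) {
  first_bytes <- readBin(path, what = "raw", n = 3)
  has_bom <- identical(first_bytes, as.raw(c(0xEF, 0xBB, 0xBF)))
  lines  <- readLines(path, warn = FALSE)
  fields <- strsplit(lines, "\t", fixed = TRUE)
  c(no_bom    = !has_bom,
    rows_ok   = length(lines) == 45,
    cols_ok   = all(lengths(fields) == 45),
    values_ok = all(unlist(fields) %in% c("0", "1")))
}

# Toy data: one correctly tab-delimited file and one comma-delimited file.
good <- matrix(0L, nrow = 45, ncol = 45)
good[5, 5] <- 1L
good_path <- tempfile(fileext = ".tsv")
write.table(good, good_path, sep = "\t", row.names = FALSE, col.names = FALSE)
bad_path <- tempfile(fileext = ".tsv")
write.table(good, bad_path, sep = ",", row.names = FALSE, col.names = FALSE)

check_tsv(good_path)   # every check is TRUE
check_tsv(bad_path)    # cols_ok and values_ok are FALSE
```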

In your report notebook, very briefly (2-3 sentences) explain in your own words how you created the images and obtained the matrices from them.

Section 2 (35 marks): Feature Engineering

Using each 45x45 matrix obtained from an image as described above, you must create an array of characteristics that describe some features of the image. Each feature will be a number (i.e. each feature is a numeric variable). There are 16 features in total.
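As an illustration of how such numeric features can be derived from a matrix, the sketch below computes the first six features defined in the table that follows (nr_pix to rows_with_3p) using row and column sums. The toy image is an assumption made for the example, and this is only one possible way to do the calculation.

```r
# Toy image (hypothetical): six black pixels spread over three rows.
img <- matrix(0L, nrow = 45, ncol = 45)
img[10, 5:7] <- 1L    # row 10 has three black pixels
img[20, 5]   <- 1L    # row 20 has one black pixel
img[30, 6:7] <- 1L    # row 30 has two black pixels

row_counts <- rowSums(img)   # black pixels in each row
col_counts <- colSums(img)   # black pixels in each column

features <- c(
  nr_pix       = sum(img),
  rows_with_1  = sum(row_counts == 1),
  cols_with_1  = sum(col_counts == 1),
  rows_with_2  = sum(row_counts == 2),
  cols_with_2  = sum(col_counts == 2),
  rows_with_3p = sum(row_counts >= 3)
)
features
# nr_pix = 6, rows_with_1 = 1, cols_with_1 = 0,
# rows_with_2 = 1, cols_with_2 = 3, rows_with_3p = 1
```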

Features to be calculated (corresponding to columns of your features output file):

Feature Index | Short Feature Name | Feature Description
- | label | The true name of the object in the image (i.e. one of the eight possible labels). The label is not a true feature, and should not be used as a feature for statistical tests or during model training.
- | index | The index of this image instance (a number from 01 to 14). The index is not a true feature, and should not be used as a feature for statistical tests or during model training.
1 | nr_pix | The number of black pixels in the image.
2 | rows_with_1 | Number of rows with exactly 1 black pixel
3 | cols_with_1 | Number of columns with exactly 1 black pixel
4 | rows_with_2 | Number of rows with exactly 2 black pixels
5 | cols_with_2 | Number of columns with exactly 2 black pixels
6 | rows_with_3p | Number of rows with 3 or more black pixels