
DATA1902 - Informatics: Data and Computation

DATA1902 Revision Questions

Q1 [20 minutes]

You have been given a text file sources.txt containing the following five lines of data:

filename length maxvalue

Anthony.txt 10 5

Anthony2.txt 20 6

Betty.txt 5 10

Annie3_txt 15 15

1(a) [3 minutes]

Write the result of running the following Unix shell commands:

1(a)(i)

% wc -w sources.txt

1(a)(ii)

% grep '^[Aa].*.txt' sources.txt

1(a)(iii)

% tail -n +2 sources.txt | gawk '$2 >= 12'

1(b) [7 minutes]

The file structure used above for sources.txt causes problems for some processing commands in case filenames are allowed to contain the space character. Describe how you would modify the file structure or the processing commands or both, to deal with this situation.
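One possible direction for 1(b), offered as a sketch rather than a model answer: separate the fields with a tab character and tell the processing command to split only on tabs. The file name sources.tsv and the entry "Anthony 2.txt" are hypothetical, introduced only to show a filename that contains a space.

```shell
# Scratch directory so the example does not touch any real files.
workdir=$(mktemp -d)

# Tab-separated variant of sources.txt.
# "Anthony 2.txt" is a hypothetical filename containing a space.
printf '%s\t%s\t%s\n' \
  'filename' 'length' 'maxvalue' \
  'Anthony.txt' 10 5 \
  'Anthony 2.txt' 20 6 \
  'Betty.txt' 5 10 > "$workdir/sources.tsv"

# -F '\t' makes awk split on tabs only, so the space stays inside field 1
# and $2 is still the length.
names=$(tail -n +2 "$workdir/sources.tsv" | awk -F '\t' '$2 >= 12 { print $1 }')

echo "$names"
rm -r "$workdir"
```

Other designs (quoting the filename, or moving it to the last column) are equally defensible; the point is that the delimiter must be a character that cannot occur inside a field.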

1(c) [10 minutes]

Write a shell pipeline to process sources.txt that will output the filename with the largest value for length. For full marks, you should output all filenames that share the largest value; a solution that correctly outputs only one of these could get up to 75% of the points.
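A minimal sketch of one way to approach 1(c), assuming the whitespace-separated layout shown at the start of Q1. It uses plain awk rather than gawk, and the variable names are illustrative only.

```shell
# Recreate the sample sources.txt from Q1 in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/sources.txt" <<'EOF'
filename length maxvalue
Anthony.txt 10 5
Anthony2.txt 20 6
Betty.txt 5 10
Annie3_txt 15 15
EOF

# 1. tail drops the header line.
# 2. sort orders the rows by field 2 (length), numerically, largest first.
# 3. awk records the length on the first sorted row, then prints the
#    filename of every row that has that same length (so ties are kept).
longest=$(tail -n +2 "$workdir/sources.txt" |
  sort -k2,2nr |
  awk 'NR == 1 { max = $2 } $2 == max { print $1 }')

echo "$longest"
rm -r "$workdir"
```

Because the comparison `$2 == max` is applied to every row, any filenames tied for the largest length would all be printed, not just the first.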

Q2 [20 minutes]

2(a) [10 minutes]

You have been given a text file products.csv containing lines of comma-separated data about some household products. The first few lines look like this (note that the first line is a header, and also note that the fields do not themselves contain any commas):

prodID,makerName,energyScore,category

37089,Artem,3,dishwasher

47115,Goldrod,2,fridge

51092,4Star,1,dryer

53490,FASTAr,3,fridge

We would like to find the prodID and makerName for each product in the file whose energyScore is less than 4. Write a shell pipeline that accesses products.csv and prints out the desired information. You do not need to deal with misformatted files or other errors. You are allowed to use AWK, but this is not required.
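One possible pipeline for 2(a), sketched against the sample lines above. The question does not fix an output format, so printing the two fields separated by a comma is an assumption.

```shell
# Recreate the sample lines of products.csv in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/products.csv" <<'EOF'
prodID,makerName,energyScore,category
37089,Artem,3,dishwasher
47115,Goldrod,2,fridge
51092,4Star,1,dryer
53490,FASTAr,3,fridge
EOF

# tail skips the header; awk splits each line on commas and, where
# field 3 (energyScore) is below 4, prints fields 1 and 2.
low_energy=$(tail -n +2 "$workdir/products.csv" |
  awk -F ',' '$3 < 4 { print $1 "," $2 }')

echo "$low_energy"
rm -r "$workdir"
```

Every sample row has an energyScore below 4, so all four prodID,makerName pairs are printed for this input.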

2(b) [5 minutes]

Explain in English the purpose of the following Unix pipeline, and describe in detail how the regular expression operates to achieve this purpose:

% cat products.csv | grep '^[^,]*,[A-Z]*,'

2(c) [5 minutes]

State “Shneiderman’s mantra” for interacting with a visualisation, and explain the meaning of the terms used in the mantra.

Q3 [10 minutes]

Suppose that you have been given a text file unis.csv containing lines of comma-separated data about some universities and the outcomes their graduates report from their education. Here are the data fields (also called data attributes):

Field name | Description

UniName | Abbreviation of the University’s name

State | Abbreviation of the state where the University is mostly located

Employment(2018) | Percentage of 2018 graduates in full-time employment, three months after graduation

Employment(2019) | Percentage of 2019 graduates in full-time employment, three months after graduation

The first few lines look like this (note that the first line is a header, and also note that the fields do not themselves contain any commas):

UniName,State,Employment(2018),Employment(2019)

CQU,QLD,79.1,79.6

Curtin,WA,72.4,71.4

Deakin,VIC,72.8,73.4

Suppose that you are part of a team whose task is to analyse the data in unis.csv to calculate the following: for each state, find how many universities in that state have a score for Employment(2018) which is greater than 75.

Provide a Unix shell pipeline that will perform this calculation. You do not need to deal with misformatted files or other errors. You are allowed to use awk in the pipeline, but this is not required. You should also write an explanation of what is done in each of the commands in the pipeline.
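A hedged sketch of one pipeline that fits this task, run against the three sample rows shown above. The comments stand in for the per-command explanation the question asks for; the variable names are illustrative.

```shell
# Recreate the sample lines of unis.csv in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/unis.csv" <<'EOF'
UniName,State,Employment(2018),Employment(2019)
CQU,QLD,79.1,79.6
Curtin,WA,72.4,71.4
Deakin,VIC,72.8,73.4
EOF

# tail    : skip the header line.
# awk     : split on commas; keep rows whose field 3, Employment(2018),
#           exceeds 75, and print only field 2 (State).
# sort    : bring identical state names together, as uniq requires.
# uniq -c : collapse each run of identical states into "count state".
counts=$(tail -n +2 "$workdir/unis.csv" |
  awk -F ',' '$3 > 75 { print $2 }' |
  sort |
  uniq -c)

echo "$counts"
rm -r "$workdir"
```

With the sample rows only CQU exceeds 75, so the output is a single line giving a count of 1 for QLD. States with no qualifying university do not appear at all in this approach; reporting them with a count of 0 would need a different design.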

Q4 [10 minutes]

Describe one way in which the Bokeh library allows production of a visualisation with which users can interact, and explain how this interactivity can be more helpful for a user than a static plot. In your answer, give an example from your experiences during DATA1902 (lab, project Stage 4, etc.).