Faculty of Engineering, Environment and Computing

5011CEM Big Data Programming Project

Assignment Brief

Module Title

Big Data Programming Project

Individual

Cohort

2021JanMay

Module Code

5011CEM

Coursework Title (e.g. CWK1)

CW1

Hand out date:

04/03/21

Lecturers

Richard Hyde (ML)

Nurudeen Aigoro

???

Due date and time:

Online: 16/04/21 18:00:00

Estimated Time (hrs):

Word Limit*: 2,000

Coursework type:

Project report

Weighting:

10 Credits

Submission arrangement: Online via Aula TurnItIn

File types and method of recording: TurnItIn suitable document, e.g. Word, PDF etc.

Mark and Feedback date: 10/05/21

Mark and Feedback method: Completed rubric and additional feedback via Aula / TurnItIn

Module Learning Outcomes Assessed:

B1: COMPUTATIONAL THINKING: develop and understand algorithms to solve problems; measure and optimise algorithm complexity; appreciate the limits of what may be done algorithmically in reasonable time or at all.

B2: PROGRAMMING: create working solutions to a variety of computational and real-world problems using multiple programming languages chosen as appropriate for the task.

B4: DATA SCIENCE: work with (potentially large) datasets; using appropriate storage technology; applying statistical analysis to draw meaningful conclusions; and using modern machine learning tools to discover hidden patterns.

B5: SOFTWARE DEVELOPMENT: develop a product from the initial stage of requirement / analysis all the way through development to its final stages of testing / evaluation.

B6: PROFESSIONAL PRACTICE: understand professional practices of the modern IT industry which include those technical (e.g. version control / automated testing) but also social, ethical & legal responsibilities.

B7: TRANSFERABLE SKILLS: apply a wide variety of degree-level transferable skills including time management, team working, written and verbal presentation to both experts and non-experts, and critical reflection on own and others' work.

B8: ADVANCED WORK: apply the above to advanced topics selected according to the interests of individual students.

Task and Mark distribution:

The report is graded out of 150 and contributes 10 credits towards the module.

For detailed guidance on mark allocation, see the grading scheme below.

This is also available as a separate Excel document on Aula.

Assessment Overview

Over the course of this module you have been introduced to a range of techniques that may be used for programming a big data project. This assessment allows you to pull together these techniques in a realistic scenario to complete a big data analysis project. Below is a realistic project scenario. By using the techniques presented during class you are to carry out the project and write a final project report for your client.

Project Scenario

You have been approached by a client who analyses atmospheric science and climate model data. They have developed a new analysis technique, but it currently takes too long to run to be of practical use. They have asked you to investigate the use of big data techniques to reduce the processing time.

They have a large volume of data to process, and the analysis needs to be repeated frequently. They have the following basic requirements:

1. Current analysis time is approximately 2.5 hours to analyse the climate model output data for a 1-hour time period.

2. The data for a single day of model output is approximately 250 MB. However, they have over 100 years’ worth of data to analyse, making a total of over 9 TB.

3. Each day, they need to analyse the new data set for that day, so they wish to complete the analysis of the data for a 24-hour period (25 data sets) in under 2 hours.

4. It is not possible to hold all of this data in memory at one time, so the new process should load only 1 hour of data for processing at a time. If parallel processing is to occur, then 1 hour of data per worker can be loaded as needed.
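
As a rough illustration of what these requirements imply, the arithmetic can be sketched as below. Python is used purely for illustration and is not a statement about the language required for the project; all variable names are hypothetical. The sketch assumes that the 2.5-hour figure applies to each of the 25 data sets, that they are currently analysed one after another, that a year has 365 days, and that 1 TB = 1,000,000 MB.

```python
# Figures taken from the client's requirements above.
HOURS_PER_DATASET = 2.5   # current analysis time for 1 hour of model data
DATASETS_PER_DAY = 25     # data sets covering one 24-hour period
TARGET_HOURS = 2.0        # the daily analysis must finish in under this time
MB_PER_DAY = 250          # approximate size of one day of model output
YEARS_OF_DATA = 100

# Assumption: the data sets are analysed one after another at the current speed.
sequential_hours = HOURS_PER_DATASET * DATASETS_PER_DAY

# Overall speed-up needed to bring the daily run down to the target time.
required_speedup = sequential_hours / TARGET_HOURS

# Assumption: 365-day years and 1 TB = 1,000,000 MB.
archive_tb = MB_PER_DAY * 365 * YEARS_OF_DATA / 1_000_000

print(f"Sequential time for one day of data: {sequential_hours} hours")
print(f"Speed-up needed to meet the target:  {required_speedup}x")
print(f"Approximate size of the archive:     {archive_tb} TB")
```

These figures only frame the problem. The speed-up actually achieved depends on measured timings, not on this arithmetic.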

You have been tasked with investigating the use of parallel processing to achieve the analysis speed required, with the following expectations:

1. Test and compare the processing speed of sequential and parallel processing.

2. Extrapolate your findings to indicate the number of processors required to achieve the target processing time.

3. Test how your code responds to common errors, e.g. data that is text instead of numeric, use of NaN in the data as an error code.

4. Run automated tests that allow your client to set the tests running and return later to see the results, without user intervention.
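
One possible shape for an unattended timing comparison is sketched below. It is a minimal illustration in Python, not the expected solution or the required language. The function `analyse_hour` is a hypothetical stand-in that simply sleeps, so a thread pool is enough to show the structure; a real CPU-bound analysis would need process-based workers or the parallel facilities of the chosen environment. All function names and worker counts are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def analyse_hour(hour_index):
    """Hypothetical stand-in for the client's analysis of one hour of data."""
    time.sleep(0.01)  # placeholder workload, not the real analysis
    return hour_index


def time_sequential(hours):
    """Time the analysis of each hour in turn, one after another."""
    start = time.perf_counter()
    for hour in hours:
        analyse_hour(hour)
    return time.perf_counter() - start


def time_parallel(hours, n_workers):
    """Time the same analysis spread across a pool of n_workers."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        list(pool.map(analyse_hour, hours))  # wait for every hour to finish
    return time.perf_counter() - start


def run_comparison(hours, worker_counts):
    """Run every timing without user intervention and collect the results."""
    results = {"sequential": time_sequential(hours)}
    for n_workers in worker_counts:
        results[n_workers] = time_parallel(hours, n_workers)
    return results


# Example run with a small, made-up workload of 8 "hours" of data.
timings = run_comparison(hours=range(8), worker_counts=[2, 4])
for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```

Collecting the timings in a single dictionary keeps the results available for plotting and extrapolation once the run has finished.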

The data has been provided by the European Centre for Medium-Range Weather Forecasts (ECMWF).

Project Deliverables

Your project should deliver the following:

1. Working code that demonstrates:

a. Loading of only the data required for the processing taking place

b. Sequential processing of the data

c. Parallel processing of the data

d. Plots of the comparisons between sequential processing and parallel processing with different numbers of workers

e. Automated testing of your code to deal with pre-defined data error types.

2. A formal project report for your client covering:

a. Comparisons between parallel and sequential data processing

b. Estimated number of processors required to achieve the goal of processing 24-hours of data in under 2 hours.

c. Testing the code to see how it deals with:

i. Text instead of numeric values

ii. NaN values indicating data errors.

iii. Note: it is not necessary to solve these problems to pass, but you should be able to suggest methods of dealing with these problems so that the code will not crash.

d. A summary of the evidence generated during your project and how it helps you arrive at your conclusions

e. Recommendations

f. References

g. Appendices containing:

i. Code flow charts

ii. Gantt chart for your project

iii. Logbook

iv. Specification items

3. VIVA / presentation. You will be expected to present your work in a formal presentation / VIVA. Details of this can be found in the VIVA assessment brief.

This assessment brief covers only parts 1 and 2. The assessment brief for part 3, VIVA, is found in a separate document.

Additional Information

1. You will be provided with NetCDF data files:

a. One complete, correct data file

b. One file containing instrument errors, recorded as NaN.

c. One file containing a data storage error, where the numerical values have been saved as text.

2. You are provided with code files for the analysis technique. You should not edit these files in any way. You are required to run the analysis, for timing purposes, but are not expected to analyse, display, report on, or deal with the results of the analysis in any way.

3. You are expected to define your project by means of a list of 5 SMART specification items. These should be included in an appendix.

4. You are expected to plan the work required for this project and provide a complete Gantt chart, including identifying the critical path. This should be included in an appendix.

5. This is a formal report and it is expected that appropriate formal grammar and language are to be used. For help with formal writing, please contact the Centre for Academic Writing.

Notes:

1. You are expected to use the Coventry University APA style for referencing. For support and advice on this, students can contact the Centre for Academic Writing (CAW).

2. Please notify your registry course support team and module leader for disability support.

3. Any student requiring an extension or deferral should follow the university process as outlined here.

4. The University cannot take responsibility for any coursework lost or corrupted on disks, laptops or personal computers. Students should therefore regularly back up any work and are advised to save it on the University system.

5. If there are technical or performance issues that prevent students submitting coursework through the online coursework submission system on the day of a coursework deadline, an appropriate extension to the coursework submission deadline will be agreed. This extension will normally be 24 hours or the next working day if the deadline falls on a Friday or over the weekend period. This will be communicated via your Module Leader.

6. You are encouraged to check the originality of your work by using the draft Turnitin links on Aula.

7. Collusion between students (where sections of your work are similar to the work submitted by other students in this or previous module cohorts) is taken extremely seriously and will be reported to the academic conduct panel. This applies to both coursework and exam answers.

8. A marked difference between your writing style, knowledge and skill level demonstrated in class discussion, any test conditions and that demonstrated in a coursework assignment may result in you having to undertake a Viva Voce in order to prove the coursework assignment is entirely your own work.

9. If you make use of the services of a proofreader in your work, you must keep your original version and make it available as a demonstration of your written efforts.

10. You must not submit work for assessment that you have already submitted (partially or in full), either for your current course or for another qualification of this university, with the exception of resits, where for the coursework, you may be asked to rework and improve a previous attempt. This requirement will be specifically detailed in your assignment brief or specific course or module information. Where earlier work by you is citable, i.e. it has already been published/submitted, you must reference it clearly. Identical pieces of work submitted concurrently may also be considered to be self-plagiarism.

Marking Rubric

Topic	Total	Section	Marks	Description / Breakdown
Total	150
Report
This is a formal report and it is expected that appropriate formal grammar and language are to be used. Where this is not the case, a penalty of up to 10% may be applied to the marks for the report structure. For help with formal writing, please contact the Centre for Academic Writing.
Report Structure	30	a.	5	Introduction
Max 30 Marks				This should be clear and concise, introduce the project, the aims and how the report is structured
		b.	5	Code description
				Describe the functionality of the code files, what they are used for and how they achieve their tasks, including testing. Do not describe syntax.
		c.	10	Comparisons of parallel and sequential timing
				Detailed explanation of the meaning of the results, how parallel processing achieves higher speeds, detailed analysis and extrapolation to achieving the processing goal, good use of figures and visual aids.
		d.	5	Summary
				Pull together the key information well, present the information clearly
		e.	5	Conclusions and recommendations
				Make clear references to the report content to recommend the number of processors that may be required, describe limitations of the analysis, how different systems may perform differently etc. Additional research, information and understanding of e.g. HPCs, cloud computing etc. will be a benefit.
Report Figures and Diagrams	25	a.		Code flow charts showing processes (appendices):
Max 25 Marks				Sequential processing, parallel processing, testing
		b.		Plot graphs of worker / processing speed and extrapolated graph to show processing in the required time.
		Marks per figure	25	5 marks are allocated for each of the 5 plots and charts in parts a and b (no marks if not present), based on: clarity of main plot, colours, line styles, markers etc., title, axis labels, legend, caption
References and Appendices	30	a.	5	References
Max 30 Marks				Appropriate use of references. References should be in a standard format, e.g. APA 7th Edition. No penalty will be applied for using another common standard, e.g. IEEE. A high number of references are not required for this report but should include as a minimum Matlab, non-standard Matlab toolboxes and the data provider's website.
		b.	5	Gantt Chart
				Complete, detailed, including sub-tasks, critical path identified
		c.	15	SMART targets
				5 SMART targets
		d.	5	Log Book
				Provide a detailed log book, with many detailed entries, total time should add up to ~150 hours or more
Code
Analysis Code	20	a.	10	Sequential processing code
Max 20 Marks		b.	10	Parallel processing code
				Mark allocation for clearly structured, clear and detailed annotations, block or function descriptions, clear variable names, consistent structure and formatting, breaking your code into functions
Automated Testing Code	20	a.	10	Automating all speed comparisons
		b.	10	Automating the 'break tests'.
				Mark allocation for clearly structured, clear and detailed annotations, block or function descriptions, clear variable names, consistent structure and formatting, breaking your code into functions
Results Display Code	15			Displaying the results automatically in the analysis code will be eligible for full marks. Using a separate file with manually entered results will be capped at 5 marks
		a.	5	Comparison plot of sequential vs parallel
		b.	5	Sequential vs parallel on same plot
		c.	5	Mean processing time per datum
				Marks will be allocated for the clarity of the data and graph, colours, symbols etc, title and axis labels and a legend
Version Control	10		10	Marks will be allocated for sufficient versions overall, separation of function and versions for each, detailed commit notes and readme