
SUMMER 2022 RESIT ASSESSMENT for COMP2121 Data Mining, Semester 2 2021/22

Data collection and analysis from Leeds University

English is an international language, used as a first or second language in many countries. SketchEngine.EU gives access to several general English text data-sets, or corpora, such as EnTenTen. Large organisations such as Leeds University may use specialised student education terminology which differs from general English; Birtill et al. (2021) refer to this as “the hidden curriculum”. Your task is to collect a sample of English text data from Leeds University and compare it with general English text data, in order to discover words and phrases which are examples of specialist student education terminology used at Leeds University.

You must use SketchEngine and WebBootCat to collate a sample of 100,000 to 200,000 words of text from the Leeds University website. You will then use SketchEngine tools to compare your Leeds University data with a general English corpus, to identify words and phrases which are characteristic of your Leeds University data-set. You will write up your methods, results and conclusions in a short research report, submitted via Minerva for assessment.
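
As an illustration only, the comparison of a focus corpus against a reference corpus can be sketched in Python using the "simple maths" keyness score that SketchEngine documents for keyword extraction: the ratio of smoothed frequencies per million. The function name, the smoothing value of 1 and the toy word counts below are assumptions made for this sketch; they are not taken from real Leeds University or EnTenTen data, and SketchEngine's own output may differ.

```python
from collections import Counter


def keyness_scores(focus_counts, ref_counts, smoothing=1.0):
    """Score each focus-corpus word by (fpm_focus + n) / (fpm_ref + n).

    fpm = frequency per million tokens; n = smoothing constant (assumed 1.0).
    """
    focus_total = sum(focus_counts.values())
    ref_total = sum(ref_counts.values())
    scores = {}
    for word, count in focus_counts.items():
        focus_fpm = count * 1_000_000 / focus_total
        # Words absent from the reference corpus get a frequency of zero.
        ref_fpm = ref_counts.get(word, 0) * 1_000_000 / ref_total
        scores[word] = (focus_fpm + smoothing) / (ref_fpm + smoothing)
    return scores


# Toy counts invented purely for illustration (hypothetical data).
focus = Counter({"module": 50, "minerva": 30, "the": 920})
reference = Counter({"module": 10, "the": 990})

scores = keyness_scores(focus, reference)
# Highest-scoring words are the most characteristic of the focus corpus.
ranked = sorted(scores, key=scores.get, reverse=True)
for word in ranked:
    print(f"{word}\t{scores[word]:.2f}")
```

Words with a score well above 1 are relatively more frequent in the focus corpus than in the reference corpus. A larger smoothing constant shifts the ranking towards more common words.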

Learning objectives: this exercise will enable you to:

- Investigate theory, methods and terminology used in Data Mining and Text Mining;

- Experience how to apply algorithms, resources and techniques for implementing and evaluating Data Mining and Text Mining in a practical research exercise;

- Summarise and present your achievements in a research conference paper.

In your project, you can use SketchEngine for text data-set collection and analysis. However, you may use other tools for data preparation and modelling, such as WEKA or Python; if so, report on these in your research paper.
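
If you choose Python for data preparation, a minimal sketch might look like the following: it lower-cases and tokenises text exported from your corpus, counts word frequencies, and checks whether a token total falls within the 100,000 to 200,000 word target. The regular expression, the function names and the sample sentence are assumptions made for this sketch. SketchEngine uses its own tokeniser, so its word counts may differ from those produced here.

```python
import re
from collections import Counter

# Assumed tokenisation rule: runs of letters, allowing internal
# apostrophes or hyphens (e.g. "module's", "part-time").
TOKEN_PATTERN = re.compile(r"[a-z]+(?:['-][a-z]+)*")


def tokenize(text):
    """Lower-case the text and return a list of word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def within_target(n_tokens, low=100_000, high=200_000):
    """Check a token count against the sample-size range in the brief."""
    return low <= n_tokens <= high


# Hypothetical sample sentence, not real corpus data.
sample = "Students submit coursework via Minerva. Minerva hosts each module's resources."
tokens = tokenize(sample)
counts = Counter(tokens)

print(len(tokens), "tokens")
print(counts.most_common(3))
print("Within target size:", within_target(len(tokens)))
```

The resulting frequency counts can then be fed into whichever comparison or modelling step you report in your paper.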

Report submission (100% of resit grading):

You must submit a 4-page short research report (DOCX or PDF file) on the collection and analysis of your Leeds University text data-set, covering the CRISP-DM Data Mining phases: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation and Deployment.

You MUST keep to the page limits: 4 pages of main content, plus optional additional page(s) for references.

Marking scheme:

In your research report, we will assess:

1. Business Understanding: state objectives & requirements, and data mining problem definition (0-2)

2. Data Understanding: explain the data collection process and the parameter settings used in SketchEngine (0-6)

3. Data Preparation: how the data was prepared or converted for analysis (0-2)

4. Modelling: how the data was investigated, with example outputs (0-4)

5. Evaluation: evaluation methods, results (0-4)

6. Deployment: conclusions about English language terminology at Leeds University (0-4)

TOTAL: up to 20 marks