CISC7107 Data Mining and Decision Support Systems Assignment 3.0
Due Date: Tuesday 16 May 2023
Time-series Forecasting
Visual Mining
Text Mining
Objectives:
Students will gain experience in using several prediction software programs for time-series forecasting, visual mining and text mining. They will learn how to analyze real-life data with different techniques in each of these three areas, and how to interpret the results.
Tasks to do:
1. Time-series Forecasting
Warm up task (no need to submit)
Find one or more time-series datasets of your choice. Multiple time series are needed for multiple regression and/or correlation analysis.
Based on your chosen time-series dataset, your task is to find the most reliable forecasting model for doing your prediction. You may forecast up to n steps ahead (where n can be anything meaningful).
Run forecasting with at least two software tools, such as Weka and Crystal Ball. (If Crystal Ball does not work on your computer, try others such as Miner3D, RapidMiner or Orange.)
Try forecasting with different popular algorithms of your choice provided by the software programs.
Record the ‘fitting errors’ of your model, such as MAE, MSE and RMSE. Produce forecasting charts for each software tool you tried.
Tabulate the forecasting performances and state your findings (which methods are most accurate, that is, have the lowest errors).
Copy and paste your forecasting graphs into your report, together with the table of forecasting performances.
Analyse the results and draw your own conclusions, in particular discussing the differences between the results obtained by different forecasting methods.
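The fitting errors named above can also be checked by hand outside the tools. The following is a minimal Python sketch, not part of the assignment's required workflow; the function name, the model names and all numbers are made up for illustration.

```python
import math

def forecast_errors(actual, predicted):
    """Return MAE, MSE and RMSE for two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    residuals = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(r) for r in residuals) / len(residuals)
    mse = sum(r * r for r in residuals) / len(residuals)
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}

# Hypothetical hold-out values and forecasts from two models (made-up numbers).
actual = [10.0, 12.0, 11.0, 13.0]
forecasts = {
    "model_a": [11.0, 12.0, 10.0, 15.0],
    "model_b": [10.0, 13.0, 11.0, 12.0],
}

results = {name: forecast_errors(actual, preds) for name, preds in forecasts.items()}

# Print one row per model, lowest RMSE first, as a starting point for the table.
for name, errs in sorted(results.items(), key=lambda kv: kv[1]["RMSE"]):
    print(f"{name:10s} MAE={errs['MAE']:.3f} MSE={errs['MSE']:.3f} RMSE={errs['RMSE']:.3f}")
```

Computing the errors yourself on the same hold-out values makes results from different tools directly comparable, since each tool may report a different subset of measures.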
Sample tables of forecasting performances (examples only!)
What will happen to Chengdu after 365 units of time?
Data source: 1900-2015, m>=3, Common Events
Algorithm | Mag. | Error (MAD)
Double Moving Average | 4.952156667 |
Linear Regression | 6.7158 |
Multilayer Perceptron | 14.3258 |
SMO Regression | 5.6106 |
Random Forest | 5.15 |
Error (MAD) values as listed (only four are given for the five algorithms, so they are not matched to rows here): 0.4023, 1.5373, 0.4091, 0.4215
What will happen to Chengdu after 183 units of time?
Data source: 1900-2015, m>=5, Rare Events
Algorithm | Mag. | Error (MAD)
Double Moving Average | 5.70 | 0.259237162
Linear Regression | fill in yourself | fill in yourself
Multilayer Perceptron | fill in yourself | fill in yourself
SMO Regression | fill in yourself | fill in yourself
Random Forest | fill in yourself | fill in yourself
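For orientation, the double moving average method named in the sample tables can be sketched in a few lines of Python. This is the common textbook formulation (a moving average of the moving average, used to estimate a level and a linear trend); which exact variant a given tool uses is not stated in this handout, and the function names, window size and data below are assumptions for illustration.

```python
def moving_average(values, window):
    """Simple moving average; element i averages values[i : i + window]."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def double_moving_average_forecast(series, window, steps):
    """Forecast `steps` values ahead using a double moving average."""
    if window < 2 or len(series) < 2 * window - 1:
        raise ValueError("series is too short for this window")
    first = moving_average(series, window)    # moving average of the data
    second = moving_average(first, window)    # moving average of the first average
    level = 2 * first[-1] - second[-1]        # estimated current level
    trend = 2 * (first[-1] - second[-1]) / (window - 1)  # estimated slope per step
    return [level + trend * h for h in range(1, steps + 1)]

# Made-up, perfectly linear series: the forecast should continue the line.
series = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
forecast = double_moving_average_forecast(series, window=3, steps=3)
print(forecast)
```

On real data the window size is a tuning choice; trying several windows and comparing the resulting errors is one way to fill in a row of the table.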
Task (need to submit)
Find a time-series dataset that interests you. You are encouraged to try something ‘significant’, e.g. predicting world economic bubbles, major events, pandemics or important disasters. Repeat a process similar to the warm-up task above.
In addition to forecasting the future values, try to find the association rules and/or correlations, if any.
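One simple way to look for a correlation between two time series is the Pearson coefficient, optionally with one series shifted by a lag. The sketch below is only an illustration in plain Python; the function names and the two series are hypothetical, and it does not cover association rules.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sd_x * sd_y)

def lagged_correlation(x, y, lag):
    """Correlate x[t] with y[t + lag] for a non-negative lag."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag == 0:
        return pearson(x, y)
    return pearson(x[:-lag], y[lag:])

# Made-up example: `follow` repeats `lead` two time steps later.
lead = [1, 3, 2, 5, 4, 6, 5, 8]
follow = [2, 2] + lead[:-2]

for lag in range(4):
    print(lag, round(lagged_correlation(lead, follow, lag), 3))
```

A high correlation at some lag suggests, but does not prove, that one series may help to predict the other; it is a starting point for the multiple regression mentioned in the warm-up task.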
2. Visual Mining
Visual mining is about using computer tools to visualize your data in order to reveal special patterns and interesting observations, so that you can make sense of massive data.
Data can be either structured or unstructured, but it needs to be reasonably large in volume.
There is no limit on the tools, algorithms or methods you use, as long as you can extract interesting visual patterns from the data.
This task can be combined with data stream mining (in which you can visualize the input data streams and the output results).
In your report, document the source of the data, a brief description of the data, what you are looking for, how you found it, what you think the results mean, and any explanations and limitations. Both graphics and written results should be documented properly in the report, although there is no specific required writing format.
3. Text Mining
Text mining is the task of extracting relevant information from natural-language text and searching for interesting relationships between the extracted entities. Text classification is one of the basic techniques in the area of text mining. It is one of the more difficult data-mining problems, since it deals with very high-dimensional data sets with arbitrary patterns of sparse data.
First, familiarize yourself with the examples demonstrated in class: Classifying Different Language Texts, Classifying Moods from Online News, and Classifying News of Different Topics. Open the sample files that were demonstrated in class in Weka and, in the Preprocess tab, apply the filter called StringToWordVector to the data. You will notice how the STRING values are converted into a set of attributes that represent the frequency of each word in the strings.
The following boxes show an example of string conversion implemented by this filter.
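As a rough illustration of the same idea outside Weka, the sketch below turns strings into word-count attributes in plain Python. It is not Weka's implementation: the tokenizer, the function names and the sample sentences are assumptions. The `count_words` switch mimics the choice between word counts and simple word presence (in Weka, the filter's outputWordCounts option controls this).

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def to_word_vectors(documents, count_words=True):
    """Return (vocabulary, rows): one attribute per word, one row per document."""
    vocabulary = sorted({tok for doc in documents for tok in tokenize(doc)})
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        if count_words:
            rows.append([counts[word] for word in vocabulary])            # frequencies
        else:
            rows.append([1 if counts[word] else 0 for word in vocabulary])  # presence
    return vocabulary, rows

# Made-up sample strings.
docs = ["the market is up", "the market is down down"]
vocab, rows = to_word_vectors(docs)
print(vocab)
for row in rows:
    print(row)
```

Each distinct word becomes one attribute, which is why text datasets quickly become very high-dimensional and sparse.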
Similar to the dataset mood.arff, create your own dataset with a minimum of 100 records from other online text sources such as news websites, Facebook comments, blogs or Twitter. Classify the records into some meaningful groups, e.g. emotions, good or bad, female or male, young or old, or news categories (local, world, finance, sports, entertainment, etc.). You are free to choose any text sources and any meaningful groups, but for this training dataset, classify the records according to your own judgment. Give the dataset a new name of your choice. After applying the conversion filter StringToWordVector to the dataset, run it under the J48 classification algorithm (or others of your choice), try to optimize the parameters, try with and without text transformation, and record the performance results of each run in your report. Discuss what you observe.
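If you collect your records in a script or spreadsheet, a small helper can write them out as an ARFF file with one string attribute and one nominal class attribute. The sketch below is only one possible layout: the attribute names `text` and `mood_class`, the relation name and the two sample records are hypothetical and are not taken from mood.arff. It assumes that the relation name and the labels contain no spaces or special characters.

```python
def quote_arff(value):
    """Quote a string for ARFF: flatten whitespace, escape backslashes and single quotes."""
    cleaned = " ".join(value.split())  # collapses newlines and repeated spaces
    escaped = cleaned.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"

def build_arff(relation, records, labels):
    """Build ARFF text from (text, label) pairs with a nominal class attribute."""
    lines = [
        f"@relation {relation}",
        "",
        "@attribute text string",
        "@attribute mood_class {" + ",".join(labels) + "}",
        "",
        "@data",
    ]
    for text, label in records:
        if label not in labels:
            raise ValueError(f"unknown label: {label!r}")
        lines.append(f"{quote_arff(text)},{label}")
    return "\n".join(lines) + "\n"

# Two made-up records; the real dataset needs at least 100.
records = [
    ("Team wins the cup final", "sports"),
    ("Bank's shares fall", "finance"),
]
arff = build_arff("my_news", records, ["sports", "finance"])
print(arff)
```

Writing the result to a file (for example with `open("my_news.arff", "w", encoding="utf-8")`) should give a file that can be opened in Weka before applying StringToWordVector.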
Submission:
Submit your experiment report in Excel and all the materials (both datasets in ARFF or any other format + performance results, charts, report and any other file in Excel) as a single zipped file to UMMOODLE by the due date.
Additional Options:
The tasks listed above are the fundamental requirements for passing this assignment. If you want to score a very high mark, consider doing the following tasks, which are more challenging and require more time.
Challenge 1: Try applying some “dimensionality reduction” techniques or “attribute transformation techniques” which are available in Weka. The aim is to improve the accuracy of the text classification model by reducing the attributes, removing bad records or both.
Challenge 2: For your time-series forecasting datasets, how do “data stream mining algorithms” perform in comparison to traditional data mining algorithms in terms of accuracy?
Challenge 3: Is there any relation or correlation that you observe between time-series forecasting and text mining? How might you use text mining to improve or confirm the forecasts from time-series forecasting?
Challenge 4: Quite often in data mining, a single algorithm or technique may not give you complete or good-quality results. Is there any possibility of combining two or three of the methods (TS forecasting, visual mining and text mining, e.g. sentiment analysis of headline news) so as to generate better results than any single method alone?
2023-05-18