Assignment 1

In class, we discussed the workload characterization of the WWW2007 Web server, including usage behaviour, popularity characteristics, client errors, and the geographic distribution of users, among other things. The Web site for the WWW2007 conference (http://www2007.org) is hosted at the University of Calgary and remains online to this date. Your task is to analyze the content hosted on the Web site. This exercise will help you attain skills in data analytics, statistical analysis, graphing, and interpretation of networking data. This exercise is courtesy of Prof. Carey Williamson from the University of Calgary.

The file www2007data.txt contains the output of the Unix command "ls -lR" in the home directory of the WWW2007 Web site (/home/projects/www2007). The output shows information such as the name of each file and directory, the file permissions, the file size, the file modification date, and so on.

Answer all the questions below. For each question, please provide a short explanation, implications of the results, and insights. Connect your explanations and insights with the paper discussed in class and any other papers you have researched. Additional papers researched for your answers should be appropriately cited in the submission document. Present your answers in a properly formatted manner. Use emphasis where applicable to convey the takeaway message. The names of participating and non-participating members should be provided on the cover page of the submission. The submission should be a single document uploaded on Canvas.

1. What measurement approach was used in this work: active or passive measurement?

2. What measurement vantage points were used in this work: edge or core?

3. How many viewpoints are considered in this work: one or more than one?

4. What hardware and software tools are used in this measurement study?

5. Are online or offline analyses performed in this measurement study?

6. Are active and passive measurements performed to measure the same metric in this study?

7. How would you perform a workload characterization study of a server you cannot access (physically or through other means)?

8. How many different regular files (not directories) are stored on the site? What is the aggregate size of these files (in bytes)?

9. What is the largest file on the site? How big is it? How many empty files (0 bytes) are there? What is the smallest non-empty file on the site? How big is it?

10. What is the mean file size on the site? What is the standard deviation of file size? What is the median file size (50th percentile value)? What is the file size distribution's mode (most frequently occurring value)?

11. Plot a graph showing the file size distribution. Make one graph for the empirical probability density function (pdf) and a separate one for the cumulative distribution function (CDF). Use a graph style (e.g., lines, boxes, histogram, scatter plot) and axis scaling (e.g., linear, logarithmic, log-linear, log-log) to convey the distribution effectively.

12. Analyze the file type distribution: File types can be determined heuristically based on the (optional) suffix in the file name (e.g., foo.html, paper127.pdf, painful.doc). Produce a table showing the site’s top 10 known file types in sorted order from most prevalent to least prevalent. Within this table show the number of files of each type, the percentage of files of each type, the number of bytes for each file type, and the percentage of bytes for each file type. If necessary, use a category "Unknown" for any file types that are not easily discernible from the file name suffix. In the table, add a category "Other" for those files not accounted for among the top 10 file types so that the percentages in the table sum correctly to 100%.

13. Plot a graph showing the file size distribution for the PDF versions of the papers and posters in the conference proceedings (i.e., from the subdirectories ./papers and ./posters). Plot a CDF graph with two lines (one for papers, one for posters). Use a graph style and axis scaling to convey the distributions effectively.

14. Calculate (or estimate) the age of each file on the Web site (i.e., the number of days since it was last modified). What is the oldest file on the Web site? How old is it? What is the newest file on the Web site? How old is it? What are the mean, median, and mode for the file age distribution?

15. Plot a CDF graph showing the file age distribution. Use a graph style and axis scaling to convey the distribution effectively.
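
The computations behind questions 8 to 15 can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not a reference solution: file sizes, names, and date strings are assumed to have been extracted from www2007data.txt already, all function names are hypothetical, the population standard deviation is used, English month abbreviations are assumed, and the date on which the listing was taken (listing_date) must be supplied by you because it is not stated here.

```python
import os
import statistics
from collections import Counter
from datetime import date

MONTHS = {m: i for i, m in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), 1)}


def summarize(sizes):
    """Descriptive statistics for a non-empty list of file sizes in bytes."""
    return {
        "count": len(sizes),
        "total": sum(sizes),
        "mean": statistics.mean(sizes),
        "stdev": statistics.pstdev(sizes),  # population form; stdev() is the sample form
        "median": statistics.median(sizes),
        "mode": statistics.mode(sizes),     # first mode found if several tie (Python 3.8+)
    }


def empirical_cdf(values):
    """Return sorted (x, F(x)) pairs, where F(x) is the fraction of values <= x."""
    counts = Counter(values)
    n, seen, points = len(values), 0, []
    for x in sorted(counts):
        seen += counts[x]
        points.append((x, seen / n))
    return points


def type_table(files, top=10):
    """files: iterable of (name, size) pairs.
    Returns rows of (type, n_files, pct_files, n_bytes, pct_bytes)."""
    n_by_type, bytes_by_type = Counter(), Counter()
    for name, size in files:
        # Heuristic: the lower-cased suffix is the type; no suffix means "Unknown".
        ext = os.path.splitext(name)[1].lower().lstrip(".") or "Unknown"
        n_by_type[ext] += 1
        bytes_by_type[ext] += size
    total_n, total_b = sum(n_by_type.values()), sum(bytes_by_type.values())
    known = [t for t, _ in n_by_type.most_common() if t != "Unknown"]
    rows = [(t, n_by_type[t], bytes_by_type[t]) for t in known[:top]]
    if "Unknown" in n_by_type:
        rows.append(("Unknown", n_by_type["Unknown"], bytes_by_type["Unknown"]))
    rest = known[top:]
    if rest:  # fold everything outside the top types into "Other"
        rows.append(("Other", sum(n_by_type[t] for t in rest),
                     sum(bytes_by_type[t] for t in rest)))
    return [(t, n, 100 * n / total_n, b, 100 * b / total_b if total_b else 0.0)
            for t, n, b in rows]


def parse_ls_date(text, listing_date):
    """Convert 'Mon DD YYYY' or 'Mon DD HH:MM' into a date."""
    month, day, last = text.split()
    m, d = MONTHS[month], int(day)
    if ":" not in last:
        return date(int(last), m, d)
    # ls omits the year for recently modified files: assume the most recent
    # occurrence of that month and day that is not after the listing date.
    guess = date(listing_date.year, m, d)
    return guess if guess <= listing_date else date(listing_date.year - 1, m, d)


def age_in_days(text, listing_date):
    """Days between a file's modification date and the listing date."""
    return (listing_date - parse_ls_date(text, listing_date)).days
```

The pairs returned by empirical_cdf can be passed directly to a plotting tool; for file sizes that span several orders of magnitude, a logarithmic x-axis is usually the more readable choice. Ages computed this way have day-level resolution at best, since entries in the "Mon DD YYYY" form carry no time of day.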