Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

Homework 3

Relational Databases & SQL

Statistics 141B 2023

1    Data

We will work with the posts and related data from the StackOverflow site for questions about statistics  Cross Validated, the Stats.stackexchange forum. This is a site where people ask questions about statistics,  generally, and people often provide answers to these questions, and some comment on the question or the  responses. This is a community of users who gain points for providing answers when others vote up (or  down) their answers. Often, there is one“accepted”answer for a question. While this is typically the“best”

answer, there may be others that are better, perhaps posted later.

Some questions come with a financial incentive - a bounty.

Users can also gain badges for certain knowledge and skills.

In addition to the posts (questions and responses), we have comments associated with individual posts. We also have the history of changes to a post.

The SQLite3 database is available from the Box folder https://ucdavis.app.box.com/folder/ 206515800849. It is 2.65 gigabytes.

There are 14 tables in this database: 8 containing information about posts, users, tags, etc. and 6 providing meta-data, specifically text descriptions of different types. Documentation for each of the tables is available at Schema for StackOverflow Databases and an Interactive Schema Diagram showing a super-set of the tables and some of the relationships between them.

The tables with names ending in TypeMap or TypeId map the numerical values to labels for the Post types(question, answer, . . . ), Vote types (up, down, accepted, offensive, . . . ), history of the post actions (edit title, body, tags, closed, reopened, . . . )

You will need to install the RSQLite and DBI R packages.

2    Report Structure

For this assignment, your report will focus on each question separately rather than having a narrative of the overall, high-level process.

For each question,

• include the full text of the question,

• map the (intentionally) casually written question to more precisely identify the relevant variables you need and the rows from the different tables, describing how you are interpreting some of the ambiguous terms,

• interpret the results, justifying with numerical or graphical summaries,

• show the SQL and R code you use to get the relevant data.

3    Questions

Do as much of the computations in SQL and then bring the resulting table back to R to finish the computa-

tions. Not all of the questions can be (easily) answered directly in SQL alone.

Give the answer and show the SQL and R code used to answer each question.

For each question, start by determining which tables are needed for each query.

Answer at least 13 of the first 20 questions and all of the questions in the Required Questions section below.

1. How many users are there?

2. How many users joined since 2020? (Hint: Convert the CreationDate to a year.)

3. How many users joined each year? Describe this with a plot, commenting on any anomalies.

4. How many different types of posts are there in the Posts table? Get the description of the types from the PostTypeIdMap table. In other words, create a table with the description of each post type and the number of posts of that type, and arrange it from most to least occurrences.

5. How many posted questions are there?

6. What are the top 50 most common tags on questions? For each of the top 50 tags on questions, how many questions are there for each tag.

7. How many tags are in most questions?

8. How many answers are there?

9. What’s the most recent question (by date-time) in the Posts table?

• Find it on the stats.exchange.com Web site and provide the URL.

• How would we map a question in the Posts table to the corresponding SO URL?

10. For the 10 users who posted the most questions

• How many questions did they post?

• What are the users’names?

• When did they join SO?

• What is their Reputation?

• What country do they have in their profile?

11. Following from the previous questions, for the 10 users who posted the most questions, how many gold, silver and bronze badges does each of these 10 individuals have?

12. For each of the following terms, how many questions contain that term: Regression, ANOVA, Data Mining, Machine Learning, Deep Learning, Neural Network.

13. Using the Posts and PostLinks tables, how many questions gave rise to a ”related”or ”duplicate”ques- tion?

• And how many responses did these questions get?

• How experienced were the users posting these **questions**.

14. What is the date range for the questions and answers in this database?

15. What question has the most comments associated with it?

• how many answers are there for this question?

16. How many comments are there across all posts?

• How many posts have a comment?

• What is the distribution of comments per question?

17. Is there any relationship between the number of tags on a question, the length of the question, and the number of responses (posts and comments)?

18. Do the people who vote tend to have badges?

19. How many questions were edited by the original poster? by other users?

20.  . How many posts have multiple different people who edit it?

4    Required Questions

21. Compute the table that contains

• the question,

• the name of the user who posted it,

• when that user joined,

• their location

• the date the question was first posted,

• the accepted answer,

• when the accepted answer was posted

• the name of the user who provided the accepted answer.

22. Determine the users that have only posted questions and never answered a question?  (Compute the table containing the number of questions, number of answers and the user’s login name for this group.) How many are there?

23. Compute the table with information for the 75 users with the most accepted answers. This table should include

• the user’s display name,

• creation date,

• location,

• the number of badges they have won,

  the names of the badges (as a single string)

• the dates of the earliest and most recent accepted answer (as two fields)

  the (unique) tags for all the questions for which they had the accepted answer (as a single string)

24. How many questions received no answers (accepted or unaccepted)?  How many questions had no accepted answer?

25. What is the distribution of answers per posted question?

26. What is the length of time for a question to receive an answer? to obtaining an accepted answer?

27. How many answers are typically received before the accepted answer?

5    Useful References and Web Pages

• Cross Validated - Stats.stackexchange site

• Schema for StackOverflow Databases

• Interactive Schema Diagram

• SQLite Documentationwith links to many sets of functions, etc.

SQLite Tutorial

SQLite built-in functions

• SQLite Date and Time Functions