CA675 TF-IDF

Published: 2026-01-05

Requirement

Tasks:

1. Using MapReduce, carry out the following tasks:

2. Acquire the top 250,000 posts by ViewCount (see notes)

3. Using Pig or MapReduce, extract, transform and load the data as applicable

4. Using MapReduce, calculate the per-user TF-IDF (just submit the top 10 terms for each user)

5. Bonus: use Elastic MapReduce to execute one or more of these tasks (if so, provide logs / screenshots)

6. Using Hive and/or MapReduce, get:

The top 10 posts by score

The top 10 users by post score

The number of distinct users who used the word ‘java’ in one of their posts
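
Before writing the Hive or MapReduce jobs, it can help to pin down what each of the three results means on a handful of rows. The following is a minimal local sketch in plain Python, not a Hive or Hadoop implementation. The sample rows, the tuple layout, and the function names are all hypothetical. It also assumes that "top users by post score" means the sum of each user's post scores, and that "the word java" means a case-insensitive whole-word match (so "JavaScript" does not count); the brief does not settle either point.

```python
import re
from collections import defaultdict

# Hypothetical sample rows: (post_id, owner_user_id, score, body).
posts = [
    (1, 10, 50, "How do I read a file in Java?"),
    (2, 11, 70, "Pig vs Hive for ETL"),
    (3, 10, 30, "JavaScript closures explained"),
    (4, 12, 90, "Tuning a java heap"),
    (5, None, 20, "Orphaned post mentioning Java"),
]

def top_posts_by_score(rows, n=10):
    """Return the n rows with the highest score, highest first."""
    return sorted(rows, key=lambda r: (-r[2], r[0]))[:n]

def top_users_by_score(rows, n=10):
    """Sum scores per user (assumption) and return the n highest totals."""
    totals = defaultdict(int)
    for _, user_id, score, _ in rows:
        if user_id is not None:  # skip posts with no owner
            totals[user_id] += score
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

def count_java_users(rows):
    """Count distinct users with at least one post containing the word 'java'."""
    word = re.compile(r"\bjava\b", re.IGNORECASE)
    users = {
        user_id
        for _, user_id, _, body in rows
        if user_id is not None and word.search(body)
    }
    return len(users)

print([p[0] for p in top_posts_by_score(posts, 2)])
print(top_users_by_score(posts, 3))
print(count_java_users(posts))
```

The same definitions carry over to the real jobs: a sort with a limit, a group-by on the owner with a sum, and a distinct count over a filtered set of owners.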

Notes

TF-IDF

The TF-IDF algorithm is used to calculate the relative frequency of a word in a document, as compared to the overall frequency of that word in a collection of documents. This allows you to discover the distinctive words for a particular user or document.

The formula is:

TF(t) = Number of times t appears in the document / Number of words in the document

IDF(t) = log_e(Total number of documents / Number of documents containing t)

The TF-IDF score of the term t is the product of those two:

TFIDF(t) = TF(t) * IDF(t)
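
The formulas above can be checked on a small in-memory example before they are spread across MapReduce stages. This is a minimal sketch in plain Python, not a Hadoop job. It assumes that, for the per-user task, each user's posts are concatenated into one "document"; the sample tokens and the names `tf_idf` and `top_terms` are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs maps a document id (here, a user) to its list of tokens."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        # TF(t) * IDF(t), with the natural logarithm as in the formula.
        scores[doc_id] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores

def top_terms(scores, k=10):
    """Return the k highest-scoring (term, score) pairs per document."""
    return {
        doc_id: sorted(term_scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        for doc_id, term_scores in scores.items()
    }

# Hypothetical sample: one token list per user.
docs = {
    "user_a": ["java", "hadoop", "java", "pig"],
    "user_b": ["hive", "hadoop", "sql"],
    "user_c": ["java", "hive", "hive", "hive"],
}
scores = tf_idf(docs)
for user, terms in top_terms(scores, k=2).items():
    print(user, terms)
```

For `user_a`, "pig" outranks "java" even though "java" appears twice: "pig" occurs in only one of the three documents, so its IDF is log_e(3/1), while "java" occurs in two, giving log_e(3/2). A term that appears in every document gets an IDF of log_e(1) = 0 and therefore a TF-IDF of 0.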

Downloading from Stackoverflow

You can only download 50,000 rows in one query. Here is a query to get the most popular posts:

select top 50000 * from posts where posts.ViewCount > 1000000 ORDER BY posts.ViewCount DESC

To count the number of records in a range:

select count(*) from posts where posts.ViewCount > 15000 and posts.ViewCount < 20000

To retrieve records from a particular range:

select * from posts where posts.ViewCount > 15000 and posts.ViewCount < 20000
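
Collecting 250,000 posts under the 50,000-row limit means running several range queries and combining the results. The sketch below generates the count and fetch queries for a list of ViewCount boundaries. It is only an illustration: the boundary values and the function name `range_queries` are hypothetical, and suitable boundaries have to be found by running the count query first. One deliberate difference from the queries above: the ranges are half-open (`>=` on the lower bound), because with strict inequalities on both sides a post whose ViewCount is exactly 20000 would fall into neither the 15000 to 20000 range nor the next one.

```python
def range_queries(boundaries, count_only=False):
    """Build one query per half-open ViewCount range [low, high)."""
    columns = "count(*)" if count_only else "*"
    queries = []
    for low, high in zip(boundaries, boundaries[1:]):
        queries.append(
            f"select {columns} from posts "
            f"where posts.ViewCount >= {low} and posts.ViewCount < {high}"
        )
    return queries

# Hypothetical boundaries; check each range with the count query first
# and split any range that would return more than 50000 rows.
boundaries = [15000, 20000, 30000]

for query in range_queries(boundaries, count_only=True):
    print(query)
for query in range_queries(boundaries):
    print(query)
```

Because adjacent ranges share no values, the downloaded files can be concatenated without a de-duplication step.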