COMP SCI 3306, COMP SCI 7306 Mining Big Data Semester 1, 2022
计算机作业代考 Compare the two algorithms implemented above on ALL the following datasets: T10I4D100K, T40I10D100K, chess, connect, mushroom
1 Overview 计算机作业代考
This assignment must be done individually. This means all the rules regarding individual submission will apply and the submission must be solely your own work. Therefore, we will not use the groups on MyUni. You will need to submit on the assignment page as an individual.
2 Assignment
Exercise 1 Collaborative Filtering (Exercise 9.3.1) (30 points) (Postgraduate Students Only (COMP SCI 7306))
Table 1 is a utility matrix, representing the ratings, on a 1–5 star scale, of eight items, a through h, by three users A, B, and C. Compute the following from the data of this matrix:
- Treating the utility matrix as boolean, compute the Jaccard distance be tween each pair of users. 计算机作业代考
- Repeat Part (1), but use the cosine distance this time.
- Treat ratings of 3, 4, and 5 as 1 and 1, 2, and blank as 0. Compute the Jaccard distance between each pair of users.
- Repeat Part (3), but use the cosine distance this time.
- Normalize the matrix by subtracting from each non-blank entry the aver age value for its user.
- Using the normalized matrix from Part (5), compute the cosine distance between each pair of users.
Exercise 2 Frequent Itemsets (50 points) 计算机作业代考
For this exercise, you will need to fifirst study the Section 6.4 (up to 6.4.3) of the text-book, Mining Massive Datasets, Leskovec, Rajaraman, Ullman (third edition, 2020), and then follow the instructions below:
- Implement the simple, randomized algorithm given in Section 6.4.1 of the text-book.
- Implement the algorithm of Savasere, Omiecinski, and Navathe (SON al gorithm), as explained in Section 6.4.3.
- Compare the two algorithms implemented above on ALL the following datasets: T10I4D100K, T40I10D100K, chess, connect, mushroom, pumsb, pumsb star; those datasets are available at:
http://fimi.ua.ac.be/data/ Report the observed outcomes and comparisons on individual datasets and in overall. 计算机作业代考
- Perform difffferent experiments on the simple randomized algorithm, using the following sample sizes: 1%, 2%, 5% and 10%. Compare your experi mental results. Additionally compare the results of the above experiments with the results produced by the SON algorithm. Your approach should be as effiffifficient as possible in terms of runtime and memory requirements. Report on challenges that you might have come across during the imple mentation and running the experiments.
Exercise 3 Advertising (Exercise 8.4.1 and more) (10+5+5 points)
Part 1 Explain both the Greedy Algorithm (Section 8.2.2 of the textbook) and Balance Algorithm (Section 8.4.4 of the textbook) and explain what Competi tive Ratio is.
Part 2 Consider Example 8.7. Suppose that there are three advertisers A, B, and C. There are three queries x, y, and z. Each advertiser has a budget of 2. Advertiser A only bids on x, B bids on x and y, and C bids on x, y, and z. Note that on the query sequence xxyyzz, the optimal offiffiffine algorithm would yield a revenue of 6, since all queries can be assigned.
- Show that the greedy algorithm will assign at least 4 of the 6 queries xxyyzz. 计算机作业代考
- Find another sequence of queries such that the greedy algorithm can assign as few as half the queries that the optimal offlfflffline algorithm would assign to that sequence.
3 General assignment submission guidelines 计算机作业代考
Your submissions will need to include the following, at minimum:
- a PDF fifile of your solutions for any theoretical exercises. The solutions should contain a detailed description of how to obtain the results, not just the fifinal results.
- PDF or txt fifile with a brief descriptions of your implementations to un derstand your code.
- Files containing the output results of running your algorithms on the pro vided datasets.
- PDF or txt fifile of your computation times of the algorithms on the pro vided datasets.
- All source fifiles, logs, and all the project fifiles.
- a README.txt fifile containing instructions to run the code, student ID, and email address.
- the submissions that do not follow the above guidlines may lose points accordingly.
Please do not hesitate to reach out using the discussion forum, workshops, or the contact details of the teaching assistants on the home page of MyUni, should you have any questions or concerns.
发表回复
要发表评论,您必须先登录。