Machine Learning 762 – Assignment 1
Worth: 5% of total grade [5 marks]
Due Date: Friday March 13th 2020, 23:59
Signal or No Signal? (5 marks):
There are two data files given: obscuredA.arff and obscuredB.arff. One of them is the real dataset and one of them is a randomized dataset where the class attribute has been randomly shuffled. Can you determine which is which? Most of the marks will be for explaining how you made this determination. Please try to determine which is which using at least two methods and discuss why you think one method is more reliable than the other. You can use whatever tools you wish to do this assignment, but Weka will be taught in tutorial.
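As an illustration of what one such method could look like, the sketch below compares the leave-one-out accuracy of a 1-nearest-neighbour classifier with the accuracy of always predicting the majority class. If the class attribute carries a signal, the classifier should beat the baseline clearly; on a shuffled class attribute it should not. This is only one possible approach, not a prescribed one. The toy data, the choice of classifier and the function names are assumptions made for the example, and loading the .arff files is left out.

```python
import random
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

def loo_1nn_accuracy(features, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    correct = 0
    for i, x in enumerate(features):
        # Find the closest *other* row by squared Euclidean distance.
        nearest = min(
            (j for j in range(len(features)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(x, features[j])),
        )
        if labels[nearest] == labels[i]:
            correct += 1
    return correct / len(features)

# Hypothetical toy data: one numeric attribute that determines the class.
features = [[float(i)] for i in range(20)]
real_labels = ["a" if i < 10 else "b" for i in range(20)]

# Randomised version: same attribute values, class column shuffled.
shuffled_labels = real_labels[:]
random.Random(0).shuffle(shuffled_labels)

for name, labels in (("real", real_labels), ("shuffled", shuffled_labels)):
    print(name,
          "baseline:", majority_baseline(labels),
          "1-NN:", loo_1nn_accuracy(features, labels))
```

The comparison against the baseline matters because accuracy on its own is hard to interpret when the classes are unbalanced.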
In addition you will be given two smaller datasets for each case:
obscuredA-50.arff, obscuredB-50.arff, which contain ½ the number of rows of the original
obscuredA-25.arff, obscuredB-25.arff, which contain ¼ the number of rows of the original
Do your techniques continue to work with the smaller datasets?
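One way to probe this question is a permutation test: compute a statistic on the data as given, recompute it many times with the class labels shuffled, and see how often the shuffled value is at least as large. The sketch below runs such a test at full, ½ and ¼ size. Everything in it is an assumption for illustration: the synthetic two-class data, the statistic (gap between per-class means of one numeric attribute) and the function names. It is not a description of how the “obscured” files behave.

```python
import random

def mean_gap(values, labels):
    """Absolute difference between the per-class means (two classes assumed)."""
    classes = sorted(set(labels))
    groups = [[v for v, l in zip(values, labels) if l == c] for c in classes]
    return abs(sum(groups[0]) / len(groups[0]) - sum(groups[1]) / len(groups[1]))

def permutation_p_value(values, labels, n_permutations=500, seed=0):
    """Estimated share of label shuffles whose gap is at least the observed gap."""
    rng = random.Random(seed)
    observed = mean_gap(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if mean_gap(values, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical synthetic data with a weak signal: class "b" is shifted by 0.4.
rng = random.Random(1)
labels = ["a"] * 100 + ["b"] * 100
values = [rng.gauss(0.0, 1.0) + (0.4 if l == "b" else 0.0) for l in labels]

# Shuffle the rows before cutting, in case the data is ordered.
rows = list(zip(values, labels))
rng.shuffle(rows)

p_values = {}
for fraction in (1.0, 0.5, 0.25):
    subset = rows[: int(len(rows) * fraction)]
    sub_values = [v for v, _ in subset]
    sub_labels = [l for _, l in subset]
    p_values[fraction] = permutation_p_value(sub_values, sub_labels)
    print(fraction, len(subset), p_values[fraction])
```

A small p-value suggests the labels are related to the attribute. With fewer rows the same underlying signal is generally harder to distinguish from chance, so the p-values are worth comparing across the three sizes.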
Now go to the UCI dataset repository https://archive.ics.uci.edu/ml/datasets.php and find 1 dataset for which it is more difficult to decide which version is random and 1 dataset for which it is easier to decide which version is random than was the case with the “obscured” dataset. What happens in these two cases as you reduce the dataset to ½ or ¼ of its current size? You will need to discuss the important features of these datasets and what impact the size has.
To achieve the task above you will need to:
1) Find the datasets
2) Make a random version by shuffling the items in the class attribute – you will need to remove the column with the class variable, shuffle that column and reattach it
3) Make reduced sets of both the random and non-random versions by removing rows – to do this you should shuffle the rows before you cut off 50% of the rows, because the dataset could be ordered in some way.
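Steps 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration rather than a required procedure: it assumes the dataset has already been loaded as a list of rows (each row a list of values), and the function names, the toy data and the fixed seeds are all hypothetical. The same steps can be carried out in Weka or any other tool.

```python
import random

def shuffle_class_column(rows, class_index, seed=0):
    """Return a copy of rows with the class column permuted across rows."""
    rng = random.Random(seed)
    labels = [row[class_index] for row in rows]   # remove the class column
    rng.shuffle(labels)                           # shuffle that column
    return [row[:class_index] + [label] + row[class_index + 1:]
            for row, label in zip(rows, labels)]  # reattach it

def reduce_rows(rows, fraction, seed=0):
    """Shuffle the rows, then keep the first `fraction` of them."""
    rng = random.Random(seed)
    shuffled = rows[:]          # copy, so the original order is untouched
    rng.shuffle(shuffled)
    return shuffled[: int(len(shuffled) * fraction)]

# Hypothetical toy dataset: two attributes and a class in the last column.
rows = [[i, i % 3, "yes" if i % 2 == 0 else "no"] for i in range(8)]

randomised = shuffle_class_column(rows, class_index=2)

# Same seed for both versions, so the same row positions are kept in each.
half_real = reduce_rows(rows, 0.5, seed=1)
half_random = reduce_rows(randomised, 0.5, seed=1)
quarter_real = reduce_rows(rows, 0.25, seed=1)
quarter_random = reduce_rows(randomised, 0.25, seed=1)
```

Shuffling only the class column keeps every attribute value and the class distribution intact, so the randomised file differs from the real one only in the link between attributes and class.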
What you need to turn in:
You need to turn in a .pdf of MAXIMUM 2 pages which includes:
a) Which dataset you think is the real file
b) 2 paragraphs, one on each of the 2 different ways you found to tell which dataset had a signal in it.
c) A paragraph discussing how well these techniques continue to work as the dataset gets smaller. Please indicate the reasons for this behaviour.
d) A paragraph describing each of the two datasets you found in UCI and what you think it is about them that makes it easier or harder to tell whether the data has been randomised. Also include what happens when you reduce the size of the dataset.
e) A final paragraph discussing which method you feel is more reliable and why.
Marking is based on the following material:
0.5 marks for choosing the right dataset
0.5 marks for clearly describing each of the 2 methods used to determine the signal strength
0.5 marks for discussing whether these methods work as the dataset gets smaller
0.5 marks for finding a dataset where it is easier to tell if there is a signal than is the case with “obscured”
0.5 marks for finding a dataset where it is harder to tell if there is a signal than is the case with “obscured”
1.0 marks for discussing why these two datasets behave differently; make sure you discuss the reduction in size as well
0.5 marks for good presentation and English
0.5 marks for your final paragraph with a comparison of the two methods
The datasets can be found at [link not preserved].
The assignment must be submitted to Canvas. Submissions will be run through Turnitin.
Copyright Warning Notice
Copyright © University of Auckland. This material is provided to you for your own use. You may not copy or distribute any part of this material to any other person. Failure to comply with this warning may expose you to legal action for copyright infringement and/or disciplinary action by the University.