# DATA 604, Spring, 2022, Project 1

### 2) (15 pts) Study the role of the split of the training vs. testing data for classification. Specifically:

• Divide each class into two sets: the training set, consisting of N examples for each of the classes 0 through 9, and the testing set, consisting of the remaining 7000 - N examples of each of the classes 0 through 9, where N ranges from 1000 to 6000. Propose and describe a selection algorithm for choosing N out of 7000 images for any integer value of N.

• Design and implement a method to test and estimate your computer's capability to perform numerical computations. Estimate the computational cost of this project and use it to decide on a reasonable number of experiments you can perform. Quantify this information and use it to guide your personal numerical goals, such as the number of different values of N that you will test.

• Use the k-nearest-neighbors classification scheme in the standard Euclidean metric with fixed k = 20 to verify the global success rate of your classifications for each chosen value of N.

• Draw conclusions about the impact of the size of the training set on the performance of the classification scheme. To do this, provide a method for choosing an optimal size of the training set. Describe the notion of optimality that you choose. Substantiate your conclusions with numerical evidence.
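As one possible starting point for the steps above, the sketch below uses only the Python standard library. Everything in it is an assumption made for illustration, not the method the assignment prescribes: the function names (`split_per_class`, `distance_evaluations`, `knn_predict`, `success_rate`), the choice of a seeded uniform random draw as the selection algorithm, the tie-breaking rule in the vote, and the toy 2-D data standing in for the images. The pair count from `distance_evaluations` is a rough proxy for brute-force cost that can be combined with a timing measurement on your own machine.

```python
import heapq
import random
from collections import Counter


def split_per_class(labels, n_train, seed=0):
    """Randomly choose n_train indices per class for training; the rest go to testing."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        if n_train > len(indices):
            raise ValueError(f"class {label} has only {len(indices)} examples")
        rng.shuffle(indices)
        train_idx.extend(indices[:n_train])
        test_idx.extend(indices[n_train:])
    return train_idx, test_idx


def distance_evaluations(n_train, per_class=7000, n_classes=10):
    """Number of (train, test) pairs a brute-force k-NN run has to compare."""
    return (n_classes * n_train) * (n_classes * (per_class - n_train))


def knn_predict(train_x, train_y, query, k=20):
    """Majority vote among the k training points nearest to query (Euclidean)."""
    def sq_dist(i):
        # Squared distance gives the same ordering as Euclidean distance.
        return sum((a - b) ** 2 for a, b in zip(train_x[i], query))

    nearest = heapq.nsmallest(k, range(len(train_x)), key=sq_dist)
    votes = Counter(train_y[i] for i in nearest)
    # Ties go to the class whose neighbor appears first (i.e. is closest).
    return votes.most_common(1)[0][0]


def success_rate(train_x, train_y, test_x, test_y, k=20):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train_x, train_y, q, k) == y
               for q, y in zip(test_x, test_y))
    return hits / len(test_y)


# Toy stand-in for the image data: two well-separated 2-D clusters.
toy_rng = random.Random(1)
toy_labels = [c for c in (0, 1) for _ in range(50)]
toy_points = [[10 * c + toy_rng.uniform(-1, 1), 10 * c + toy_rng.uniform(-1, 1)]
              for c in toy_labels]
train_idx, test_idx = split_per_class(toy_labels, n_train=30)
toy_rate = success_rate([toy_points[i] for i in train_idx],
                        [toy_labels[i] for i in train_idx],
                        [toy_points[i] for i in test_idx],
                        [toy_labels[i] for i in test_idx])
print(toy_rate, distance_evaluations(1000))
```

For N = 1000 the pair count is 10,000 training images times 60,000 testing images, or 600,000,000 distance evaluations. A pure-Python loop like this one is far too slow at that scale, so a real run would likely need vectorized array operations.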

### 3) (15 pts) Study the role of the structure of the split of the training vs. testing data for classification. Specifically:

• Propose a new method of selecting the training set that is different from the one proposed in Part 2. Describe the new selection algorithm.

• Divide each digit class (using this new split method) into two sets: the training set, consisting of N examples for each of the classes 0 through 9, and the testing set, consisting of the remaining 7000 - N examples of each of the classes 0 through 9, where N is the optimal value chosen in Part 2.

• Use the same k-nearest-neighbors classification scheme in the standard Euclidean metric with fixed k = 20 to verify the success rate of your classification for the chosen optimal value of N with the new split of training vs. testing data.

• Draw conclusions about the impact of the structure of the training vs. testing split, based on a comparison of the results of Part 2 and Part 3.
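To make "structure of the split" concrete, here is a minimal sketch of one structured alternative to a uniform random draw: for each class, rank the examples by squared Euclidean distance to the class mean and train on the N most typical ones. This rule is an assumption chosen purely for illustration, as is the hypothetical function name `split_by_centroid_distance`. Other deterministic rules, such as taking the first N images or every m-th image, would satisfy the requirement equally well.

```python
def split_by_centroid_distance(features, labels, n_train):
    """Per class, train on the n_train examples closest to the class mean."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        if n_train > len(indices):
            raise ValueError(f"class {label} has only {len(indices)} examples")
        dim = len(features[indices[0]])
        # Class mean (centroid), computed coordinate by coordinate.
        center = [sum(features[i][d] for i in indices) / len(indices)
                  for d in range(dim)]

        def sq_dist(i):
            return sum((x - c) ** 2 for x, c in zip(features[i], center))

        # sorted() is stable, so equal distances keep their original order.
        ranked = sorted(indices, key=sq_dist)
        train_idx.extend(ranked[:n_train])
        test_idx.extend(ranked[n_train:])
    return train_idx, test_idx


# Tiny 1-D example: class "a" has mean 3.0, class "b" has mean 30.0.
features = [[0.0], [1.0], [2.0], [9.0], [10.0], [20.0], [60.0]]
labels = ["a", "a", "a", "a", "b", "b", "b"]
train_idx, test_idx = split_by_centroid_distance(features, labels, n_train=2)
print(train_idx, test_idx)
```

Because this rule leaves the atypical examples of each class in the testing set, its success rate can differ from that of a random split at the same N, which is the kind of comparison Part 3 asks for.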