Math 10 Final Project, 21 Spring
Due: 11:59 PM, PDT, June 9th 2021
Submission: Upload the .ipynb file to Canvas.
The submission should be a well-organized report: well-structured sections, high-quality figures, and the necessary descriptions written as Markdown in the notebook file, with all code blocks already executed. Submitting only code and/or incomplete results will severely impact your grade. Please include everything in one single .ipynb file; any other formats (.pdf, .doc …) or redundant files are not valid and won’t be graded.
Dataset Downloading:
In the final project, we provide three choices of datasets: 1) Kuzushiji-MNIST (perhaps the easiest) 2) American sign language-MNIST and 3) Kuzushiji-49 (most challenging one).
You can pick one of the datasets and conduct all the tasks described below. The choice won’t affect your basic grades except in Task 7 (see below). Of course, exploring all datasets is especially welcome, and you may receive higher grades in Task 6 if you do so (see below).
The .csv files of the training and test data for all three datasets can be downloaded from Canvas.
More information about the data:
1) Kuzushiji-MNIST: Each sample is a 28x28 grayscale image of an ancient Japanese character written in the cursive Kuzushiji style (http://naruhodo.weebly.com/blog/introduction-to-kuzushiji), flattened into one row of the csv file. The first column of the dataframe is the label (10 categories of different characters). The training set contains 60,000 images and the test set contains 10,000.
Reference: https://github.com/rois-codh/kmnist
2) American sign language-MNIST: Each sample is a 28x28 grayscale image of a hand gesture representing an English letter in American Sign Language (https://en.wikipedia.org/wiki/American_Sign_Language), flattened into one row of the csv file. The first column of the dataframe is the label (26 categories of English letters). The training set contains 27,455 images and the test set contains 7,172.
Reference: https://www.kaggle.com/datamunge/sign-language-mnist
3) Kuzushiji-49: An expansion of Kuzushiji-MNIST with 49 categories covering the whole set of Japanese characters (Hiragana). The training set contains 232,365 images and the test set contains 38,547.
Reference: https://arxiv.org/abs/1812.01718
Tasks and Grading Policy (20pts in the total course grade)
Task 1: Loading the data (2 pts)
Basic Requirements:
1) Use Pandas to load the csv data, and generate X_train, y_train (from the training data file) as well as X_test, y_test (from the test data file) as Numpy arrays. The labels and pixels can be distinguished by the column names of the tabular data.
2) Report the shapes of the above arrays.
3) All the remaining tasks below should be based on this data.
Optional: Use pandas, matplotlib and seaborn to do some data exploration and visualization.
Hint: Since both the training and test data are provided, you don’t have to split the data further. Of course, it would be great if you also set aside a validation set (or use cross-validation).
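As an illustration of requirement 1), the following is a minimal sketch of turning one CSV split into NumPy arrays. It assumes only what is stated above, namely that the first column holds the label and every other column is a pixel. The in-memory toy CSV, its column names, and the helper name load_split are hypothetical stand-ins; the real files have 784 pixel columns.

```python
import io

import numpy as np
import pandas as pd

# Tiny in-memory stand-in for one course CSV file (3 samples, 4 "pixels").
# The column names here are made up for illustration only.
toy_csv = io.StringIO(
    "label,pixel1,pixel2,pixel3,pixel4\n"
    "0,0,255,12,7\n"
    "3,9,0,0,128\n"
    "1,64,64,64,64\n"
)


def load_split(csv_source):
    """Read one CSV split and return (X, y) as NumPy arrays.

    Assumes the first column is the label and all remaining columns
    are pixel intensities, as described in the dataset notes.
    """
    df = pd.read_csv(csv_source)
    y = df.iloc[:, 0].to_numpy()
    X = df.iloc[:, 1:].to_numpy(dtype=np.float64)
    return X, y


X_toy, y_toy = load_split(toy_csv)
print(X_toy.shape, y_toy.shape)
```

The same helper would be called once on the training file and once on the test file; a file path can be passed in place of the StringIO object.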
Task 2: Logistic Regression (5pts)
Basic Requirements:
1) Write the code to implement logistic regression for the classification problem. You can only use basic Python and Numpy (Scipy) functions. Calling functions in Scikit-Learn or other machine learning packages is NOT allowed. (3pts)
2) Detailed docstrings and comments should be included. (1pt)
3) Evaluate and report the performance of logistic regression on the dataset you choose. (1pt)
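Because all three datasets have more than two classes, one possible reading of this task is multinomial (softmax) logistic regression. The sketch below is only one way to set it up with Numpy: full-batch gradient descent on the mean cross-entropy. The function names, the learning rate, the iteration count, and the synthetic toy data are all assumptions made for illustration, not requirements of the task.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax, subtracting the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def fit_softmax_regression(X, y, n_classes, lr=0.1, n_iter=500):
    """Fit multinomial logistic regression by full-batch gradient descent.

    X: (n_samples, n_features) float array.
    y: (n_samples,) integer labels in 0..n_classes-1.
    Returns the weights W (n_features, n_classes) and the bias b (n_classes,).
    """
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]          # one-hot targets
    for _ in range(n_iter):
        P = softmax(X @ W + b)        # predicted class probabilities
        G = (P - Y) / n               # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b


def predict(X, W, b):
    """Return the most probable class index for each row of X."""
    return np.argmax(X @ W + b, axis=1)


# Toy check on three well-separated 2-D blobs (synthetic, not the course data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
y_demo = np.repeat(np.arange(3), 50)
X_demo = centers[y_demo] + 0.3 * rng.standard_normal((150, 2))
W, b = fit_softmax_regression(X_demo, y_demo, n_classes=3)
acc = np.mean(predict(X_demo, W, b) == y_demo)
print(f"training accuracy on toy data: {acc:.3f}")
```

On image data, scaling the pixel values (for example dividing by 255) before fitting usually makes a fixed learning rate easier to choose, and mini-batches may be needed for the larger datasets.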
Task 3: Principal Component Analysis (5 pts)
Basic Requirements:
1) Write the code to implement PCA and return the first n principal components (n as the parameter). You can only use basic Python and Numpy/Matplotlib/Pandas/Seaborn/Scipy functions. Calling functions in Scikit-Learn or other machine learning packages is NOT allowed. (3pts)
2) Detailed document strings and comments should be included. (1pt)
3) Run PCA on the test data and visualize the results with a scatter plot. The true label of each sample should be distinguished by color. (1pt)
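One common way to compute the first n principal components with Numpy alone is through the singular value decomposition of the centered data matrix. The sketch below follows that route; it is an assumption about how the task could be approached, and the function name, return values, and synthetic demo data are hypothetical.

```python
import numpy as np


def pca(X, n_components):
    """Project X onto its first n_components principal components.

    Returns (scores, components, explained_variance), where
    components has shape (n_components, n_features) with one principal
    direction per row, and scores has shape (n_samples, n_components).
    """
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    scores = X_centered @ components.T
    explained_variance = S[:n_components] ** 2 / (X.shape[0] - 1)
    return scores, components, explained_variance


# Synthetic demo: 3-D data stretched mostly along the first axis.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 3)) * np.array([5.0, 1.0, 0.1])
scores, components, explained_variance = pca(X_demo, n_components=2)
print(scores.shape, explained_variance)
```

For the scatter plot, the two columns of scores can be passed to Matplotlib's scatter function with the true labels supplied through its c argument, so that each class gets its own color.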
Task 4: Try other methods by calling functions in Scikit-Learn (4 pts)
Basic Requirements:
1) Try at least one supervised (2pts) and one unsupervised (2pts) method other than logistic regression and PCA on the dataset you choose.
2) Before calling the function, write one paragraph in a Markdown cell to introduce the basic model and algorithm of each method. You’re required to use LaTeX to typeset formulas.
3) If possible, compare the results with logistic regression/PCA. If you use clustering methods, please evaluate the clustering performance by comparing the clusters to the true labels.
Hint: Please choose appropriate methods. For example, using regression models for a classification problem is not appropriate.
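Cluster indices returned by a clustering method are arbitrary, so they cannot be compared to the true labels with plain accuracy. One simple option, sketched below under that assumption, is cluster purity: each cluster is credited with its most common true label. The function name and the toy assignments are hypothetical; purity is only one of several possible measures.

```python
import numpy as np


def cluster_purity(cluster_ids, true_labels):
    """Fraction of samples whose true label is the majority label of their cluster."""
    cluster_ids = np.asarray(cluster_ids)
    true_labels = np.asarray(true_labels)
    correct = 0
    for c in np.unique(cluster_ids):
        members = true_labels[cluster_ids == c]
        # Size of the largest group of identical true labels in this cluster.
        _, counts = np.unique(members, return_counts=True)
        correct += counts.max()
    return correct / true_labels.size


# Toy example: 8 samples assigned to 3 clusters.
cluster_demo = [0, 0, 0, 1, 1, 1, 2, 2]
labels_demo = [5, 5, 7, 7, 7, 7, 3, 5]
purity = cluster_purity(cluster_demo, labels_demo)
print(purity)  # 2 + 3 + 1 = 6 of 8 samples match their cluster's majority label
```

Purity rises trivially as the number of clusters grows, so it is worth reporting alongside a chance-corrected score; Scikit-Learn's adjusted_rand_score and normalized_mutual_info_score in sklearn.metrics are two such scores.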
Task 5: Try another Python machine learning package (3 pts)
Try one of the following packages on the dataset you select. In this task, you may find running in a Kaggle Notebook or Google Colab very helpful, especially for tasks involving GPU computation.
1. Scikit-learn (https://scikit-learn.org/stable/index.html). If you still want to use sklearn in this task, you have to apply another 2 supervised learning algorithms and 1 unsupervised learning algorithm. In this case, the requirement for each algorithm is the same as in Task 4 (code + markdown description of the algorithm + comparison are all required).
2. PyCaret (https://pycaret.org/). Follow the classification workflow (compare + tune + finalize + predict), and report the result on the test dataset.
Hint: In PyCaret, compare_models might be very slow if every model is included. You may use the “include” parameter to pick a few models of interest for comparison. You may also find this post by the original author of PyCaret useful: https://towardsdatascience.com/5-things-you-are-doing-wrong-in-pycarete01981575d2a
3. cuML (https://github.com/rapidsai/cuml). Try one supervised or one unsupervised algorithm, report the result, and compare with the same algorithm (performance or running time) in scikit-learn.
4. Tensorflow (including Keras) or PyTorch (including PyTorch Lightning). Try one supervised or one unsupervised deep learning model; reporting the result is enough.
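For the running-time comparison mentioned under cuML, a small timing helper keeps the measurement consistent across libraries. The sketch below is a generic harness built on the standard library's time.perf_counter. The helper name time_call and the two stand-in workloads are hypothetical and do not call cuML or scikit-learn.

```python
import time

import numpy as np


def time_call(fn, *args, repeats=3, **kwargs):
    """Run fn(*args, **kwargs) `repeats` times; return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result


# Stand-in workloads: two ways of computing the same Gram matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 50))
t_loop, G_loop = time_call(lambda M: np.array([[r @ s for s in M] for r in M]), A)
t_vec, G_vec = time_call(lambda M: M @ M.T, A)
print(f"loop: {t_loop:.4f} s, vectorized: {t_vec:.4f} s")
```

In a real comparison the two lambdas would be replaced by the fit calls of the two libraries on the same data. For GPU code it is worth stating in the report whether the time spent copying data to the device is included.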
Task 6: Organizing your report (1 pt)
This 1 pt will be determined by the overall quality of your report in ipynb form (judged by the teacher when grading). There is no guarantee that you will get the full point for submitting a report that is merely correct rather than nearly perfect. In other words, it’s entirely possible to receive zero in this task if your report only addresses the basic requirements, for example by just copying and pasting code and Markdown from lecture notes/discussion files.
Try to write the descriptions and code (including docstrings and comments) in a well-organized and logical way, and generate high-quality figures. A practical tip is to imagine that you’re writing a thesis instead of a report: you need sections and subsections at different levels, as well as abstract/conclusion/transition paragraphs. In particular, repeating the words of this file and copying the task requirements is neither necessary nor encouraged; I know them well and expect to see your results and the associated descriptions/explanations. Addressing more tasks (e.g. trying more packages in Task 5, analyzing more than one dataset, using cross-validation to tune parameters, comparing different algorithms) will also earn credit toward this 1 pt.
Task 7: Bonus Point (additional 1pt to final grade)
For an exceptional report that displays novelty and insight at the research level, up to 1 additional credit will be added directly to the final grade. This is also important for determining A+ in the final grading, and for future requests for a recommendation letter.
The difference between Task 6 and Task 7 is about novelty. The point in Task 6 is included in the 20 pts of the final project, while Task 7 is a bonus. For Task 6, we expect to see a positive attitude and active effort, while for Task 7 we expect to see something interesting. We suggest you do not attempt it unless you’ve finished and are fully confident about basic Tasks 1-6.
The expectation for Task 7 is VERY high: applying an existing model/package written by other people (whether we have learned it or not) and merely reporting the accuracy (or other metrics) on the datasets is NOT considered interesting or novel.
Some example directions to explore include:
1) Technical Challenges: You may find that Task 2 written in Numpy is very slow for Dataset 3. Can you rewrite the logistic regression code and SGD from scratch with a JIT compiler or GPU acceleration? Possible choices include Numba, PyTorch or Google’s latest JAX (https://github.com/google/jax). Compare your new code with the Numpy version.
2) Theoretical Reflections: For one specific dataset in the project, some models perform significantly better than others. Rather than a hand-waving appeal to the “no-free-lunch theorem”, can you identify the specific reasons why? (e.g. what unique characteristic of this dataset makes kNN perform very well/poorly?) Support your explanations with evidence or observations taken directly from the dataset.
3) Domain Knowledge Insights: In the original paper of the Kuzushiji dataset, the authors state that there might be multiple ways to write one character (the phenomenon is called Hentaigana, and is a typical feature of ancient Japanese). Can you systematically identify Hentaigana in the K-49 dataset with machine learning? Will this prior knowledge lead to more accurate classification?
Of course, you can formulate a problem of your own, as long as it is closely relevant to the dataset. I’m looking forward to reading about the exciting discoveries.
To apply for the bonus point, in addition to the results and code, please also add a separate summary paragraph in the report stating 1) what problem you have solved and 2) why it is important and novel.
Other Requirements/Resources:
- Each student should work on the final project independently, and direct discussion of the content (especially about debugging) with other students/TA/teacher is NOT allowed. Violations of the academic integrity rules will be reported to the department.
- Make sure to submit the final project .ipynb file to Canvas before the deadline. The deadline will not be extended for any reason/excuse.
- Computer/software issues are never a valid excuse for submitting incomplete results, since we have already tested the datasets and basic tasks on a personal laptop satisfying the minimum requirements set by the university, not to mention that we have also introduced free Kaggle and Google Colab resources for running the code in the cloud.
- You may also connect to the university computer lab remotely through this link:
https://www.oit.uci.edu/labs/remote-computer-access/