STATS 415 Final Project
Due by 11:59pm on April 18, 2021
Overview
In this project, you will apply what we have learned in STATS 415 to build predictive models based on a real financial dataset. The main goal is to predict the forward return of a target asset (Asset 1) based on the historical price series of the target asset and two other assets (Asset 2 and Asset 3). You will be given minutely prices of the three assets over a year, which correspond to T = 524,160 rows in the csv file we provided.
The project has two parts, a basic part and an advanced part:
- In the basic part, there are six problems, each worth 10 points and having standard solutions. Two problems require you to submit your outputs to Canvas, which are then assessed by our Online Judge (OJ). The other problems require you to present your analysis and results in your project report, which are graded manually.
- The advanced part is worth 40 points. There you are given full freedom to build your predictive models. You need to submit a prediction function to Canvas; our OJ will then assess and report its performance on a testing dataset that is withheld from you. The ranking of a team depends on the out-of-sample correlation r of its model. You will receive 10 points as long as your team makes a valid submission and will receive full points once r reaches 5%. We will consider giving bonus points to top teams depending on their performance. The specific details of the bonus points will be given after the final project is due.
Everyone within a team receives the same score for the final project and can submit results or code to the OJ for the entire team. Each OJ-graded problem allows three submissions per day per team, and only the highest score will be counted toward the grade. Therefore, please start early and use every opportunity to reach a higher score! During the final project, your team's current ranking, based on your best result so far, will be updated every 24 hours.
Basic part
Backward returns
For any t, h ∈ N+, define the h-min backward return at time t as:
$$ r_b(t, h) := \frac{s(t) - s(\max(t - h, 1))}{s(\max(t - h, 1))}, $$
where s(t) denotes the price at time t. Load final_project.csv in R. Calculate the 3-min, 10-min and 30-min backward returns of all three assets at t = 1, ..., T. Create a dataframe with columns named in the form Asset_i_BRet_h, where i ∈ {1, 2, 3} and h ∈ {3, 10, 30}, such that the column Asset_i_BRet_h corresponds to the time series of the h-min backward returns of Asset i. The resulting dataframe should have 524,160 rows and 9 columns. Export this dataframe to a csv file named bret.csv and submit it to the OJ to verify its correctness. Please round all the entries of the dataframe to four decimal places; the maximum file size to upload is 40MB.
(Hint: Vector/matrix-based calculation is much more efficient than loops in R.)
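As a rough illustration of the hint, the sketch below computes backward returns for a whole price vector with a single indexed subtraction instead of a loop over rows. The helper name `backward_return`, the toy prices, the input column names, and the choice to clamp the lagged index at the first row are all assumptions made for this example, not part of the assignment.

```r
# Hypothetical helper: h-minute backward return of one price vector.
# Assumption: the lagged index is clamped at 1, so early rows compare with s(1).
backward_return <- function(s, h) {
  lag_idx <- pmax(seq_along(s) - h, 1L)  # lagged position for every t at once
  (s - s[lag_idx]) / s[lag_idx]
}

# Toy prices standing in for the real csv (column names are assumptions).
prices <- data.frame(
  Asset_1 = c(100, 101, 102, 103, 104),
  Asset_2 = c(50, 50, 51, 52, 52),
  Asset_3 = c(10, 11, 12, 13, 14)
)

horizons <- c(1, 2)  # the project itself uses 3, 10 and 30
bret <- data.frame(row.names = seq_len(nrow(prices)))
for (i in 1:3) {
  for (h in horizons) {
    col <- paste0("Asset_", i, "_BRet_", h)
    bret[[col]] <- round(backward_return(prices[[paste0("Asset_", i)]], h), 4)
  }
}

# Writing the result out would look like this (left commented in the sketch):
# write.csv(bret, "bret.csv", row.names = FALSE)
```

The only loop here runs over the nine output columns; the work over the T rows is done by vector indexing.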
Rolling correlation
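Given two time series X = {X_t}, 1 ≤ t ≤ T, and Y = {Y_t}, 1 ≤ t ≤ T, the w-min backward rolling correlation between X and Y at time t0 is defined as
$$ \rho_w(t_0) := \mathrm{Cor}\left(\{X_t\}_{\max(1,\, t_0 - w) \le t \le t_0},\ \{Y_t\}_{\max(1,\, t_0 - w) \le t \le t_0}\right), $$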
where Cor is the sample correlation. Calculate the (21 × 24 × 60)-minute (3 weeks) backward rolling correlation of the 3-min backward returns of each pair of the three assets at t = 1, 2, ..., T. Create a dataframe with column names in the form Rho_i_j, which corresponds to the rolling correlation between Asset i and Asset j, where i < j. The resulting dataframe should have 524,160 rows and 3 columns. Export the dataframe to a csv file named corr.csv, and submit it to our OJ to verify its correctness. Please round all the entries of the dataframe to four decimal places; the maximum file size to upload is 15MB.
(Hint: The rolling correlation can be computed in an incremental manner, given that the rolling window is shifted by only one minute at each step.)
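One way to read the hint is to keep running sums of x, y, x², y² and xy, so that each window's correlation comes from differences of cumulative sums rather than from a fresh pass over the window. The sketch below follows that idea. The function name `rolling_cor` is hypothetical, and the window is assumed to run from max(1, t0 - w) to t0 inclusive; adjust the `start` line if your window convention differs.

```r
# Hypothetical helper: backward rolling correlation via cumulative sums.
rolling_cor <- function(x, y, w) {
  n <- length(x)
  # Prepend 0 so that cs[k + 1] holds the sum of the first k values.
  cx  <- c(0, cumsum(x));     cy  <- c(0, cumsum(y))
  cxx <- c(0, cumsum(x * x)); cyy <- c(0, cumsum(y * y))
  cxy <- c(0, cumsum(x * y))

  end   <- seq_len(n)
  start <- pmax(end - w, 1L)   # assumed window: max(1, t0 - w) .. t0
  m     <- end - start + 1     # number of points in each window

  win <- function(cs) cs[end + 1] - cs[start]  # window sum for every t0
  sx <- win(cx); sy <- win(cy)
  cov_xy <- m * win(cxy) - sx * sy
  var_x  <- m * win(cxx) - sx^2
  var_y  <- m * win(cyy) - sy^2
  cov_xy / sqrt(var_x * var_y) # NaN where a window has zero variance
}

# Small check against a direct computation on simulated data.
set.seed(1)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)
rho <- rolling_cor(x, y, w = 10)
direct <- cor(x[20:30], y[20:30])  # window ending at t0 = 30
```

Differences of large cumulative sums can lose some precision, so comparing a few entries against `cor()` on the real data, as done for `direct` above, is a sensible sanity check before rounding to four decimals.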
Linear regression
The h-min forward return at time t is defined as:
$$ r_f(t, h) := \frac{s(\min(t + h, T)) - s(t)}{s(t)}. $$
Fit a linear regression to predict r_f(t, 10) of Asset 1 using r_b(t, 3), r_b(t, 10), r_b(t, 30) of the three assets you calculated in Section 2.1 as features. Hence, you have 9 features in total in your linear model. Use the first 70% of the data as training data and the last 30% as testing data. Are the backward returns of Assets 2 and 3 significant in predicting the forward return of Asset 1? Report the in-sample and out-of-sample correlation between your prediction r̂_f(t, 10) and the true response r_f(t, 10). Also plot the three-week backward rolling correlation between r̂_f(t, 10) and r_f(t, 10). Is this correlation structure stationary over the year?
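The workflow described above (build the response and nine features, split chronologically, fit, then compare in-sample and out-of-sample correlation) might be sketched as follows. The prices are simulated random walks, so the numbers mean nothing. The helper names and the clamping of indices at the first and last rows are assumptions for this example.

```r
set.seed(415)
n <- 2000
# Simulated random-walk prices standing in for the three assets.
prices <- sapply(1:3, function(i) 100 * exp(cumsum(rnorm(n, sd = 1e-3))))

# Hypothetical helpers; index clamping at the series ends is an assumption.
backward_return <- function(s, h) {
  idx <- pmax(seq_along(s) - h, 1L)
  (s - s[idx]) / s[idx]
}
forward_return <- function(s, h) {
  idx <- pmin(seq_along(s) + h, length(s))
  (s[idx] - s) / s
}

# Nine features: three horizons for each of the three assets.
feats <- list()
for (i in 1:3) {
  for (h in c(3, 10, 30)) {
    feats[[paste0("Asset_", i, "_BRet_", h)]] <- backward_return(prices[, i], h)
  }
}
dat <- data.frame(y = forward_return(prices[, 1], 10), feats)

# Chronological split: first 70% for training, last 30% for testing.
n_train <- floor(0.7 * n)
train <- dat[seq_len(n_train), ]
test  <- dat[(n_train + 1):n, ]

fit <- lm(y ~ ., data = train)
coef_table <- summary(fit)$coefficients  # estimates, t statistics, p-values

cor_in  <- cor(fitted(fit), train$y)
cor_out <- cor(predict(fit, newdata = test), test$y)
```

The rows of `coef_table` for the Asset 2 and Asset 3 features are where the significance question would be read off on the real data.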
KNN
Run KNN using the same features and response variable as in Section 2.3 with K = 5, 25, 125, 625, 1000. Use the first 70% of the data as training data and the last 30% as validation data. Plot the training and validation MSE against K. Find the best K based on the validation MSE and generate predictions for the whole year. Report the in-sample and out-of-sample correlation between your prediction and the true response.
Note that here we use the validation approach instead of cross validation (CV) to tune K, because we should not use future data for training and past data for validation.
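To make the validation approach concrete, here is a minimal sketch that tunes K on a chronological split. It uses a brute-force KNN written in base R on a small simulated dataset purely for illustration. The function `knn_predict` is hypothetical, and a brute-force search like this would be far too slow for the full dataset, where a KNN routine from the lab materials would be the practical choice.

```r
# Hypothetical brute-force KNN regression (Euclidean distance), base R only.
knn_predict <- function(X_train, y_train, X_new, k) {
  apply(X_new, 1, function(x0) {
    d <- colSums((t(X_train) - x0)^2)      # squared distances to x0
    mean(y_train[order(d)[seq_len(k)]])    # average of the k nearest responses
  })
}

set.seed(1)
n <- 400
X <- matrix(rnorm(n * 2), n, 2)
y <- sin(X[, 1]) + 0.3 * rnorm(n)

# Chronological split: the past is used for training, the future for validation.
n_train <- floor(0.7 * n)
tr <- seq_len(n_train)
va <- (n_train + 1):n

Ks <- c(1, 5, 25, 125)  # a smaller grid than the one in the assignment
mse <- t(sapply(Ks, function(k) c(
  K     = k,
  train = mean((knn_predict(X[tr, ], y[tr], X[tr, ], k) - y[tr])^2),
  valid = mean((knn_predict(X[tr, ], y[tr], X[va, ], k) - y[va])^2)
)))
best_K <- mse[which.min(mse[, "valid"]), "K"]

# Plotting both curves against K (left commented in the sketch):
# matplot(mse[, "K"], mse[, c("train", "valid")], type = "b", log = "x")
```

Training MSE at K = 1 is zero because each training point is its own nearest neighbour, which is why the choice of K has to rest on the validation column.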
Ridge and LASSO
Consider backward returns over more time horizons. Calculate r_b(t, h) for t = 1, ..., T and h ∈ {3, 10, 30, 60, 120, 180, 240, 360, 480, 600, 720, 960, 1200, 1440} for all three assets. Use these returns as features to fit Ridge and LASSO regression to predict r_f(t, 10) of Asset 1. Use the first 70% of the data as training data and the last 30% as validation data. Use the validation MSE to seek the best tuning parameter in LASSO and Ridge, and generate the corresponding prediction for the whole year. Report the in-sample and out-of-sample correlation between your prediction and the true response.
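The tuning loop can be illustrated with ridge regression, which has a closed-form solution and therefore needs no package. The sketch below is only that: the data are simulated, and the penalty grid and all variable names are assumptions. The penalty scale here will not match the one used by a package such as glmnet. LASSO has no closed form, so in practice both models would typically be fitted with glmnet (alpha = 0 for ridge, alpha = 1 for LASSO), using the same validation logic.

```r
set.seed(2)
n <- 500
p <- 14
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:3] %*% c(0.5, -0.3, 0.2)) + rnorm(n)

n_train <- floor(0.7 * n)
tr <- seq_len(n_train)
va <- (n_train + 1):n

# Standardize with training statistics only, so validation rows stay unseen.
mu  <- colMeans(X[tr, ])
sdv <- apply(X[tr, ], 2, sd)
Xs  <- scale(X, center = mu, scale = sdv)
y_bar <- mean(y[tr])

# Closed-form ridge coefficients on the training rows for a given penalty.
ridge_coef <- function(lambda) {
  A <- crossprod(Xs[tr, ]) + lambda * diag(p)
  solve(A, crossprod(Xs[tr, ], y[tr] - y_bar))
}

lambdas <- 10^seq(-2, 4, length.out = 25)  # assumed grid
val_mse <- sapply(lambdas, function(l) {
  pred <- y_bar + Xs[va, ] %*% ridge_coef(l)
  mean((y[va] - pred)^2)
})
best_lambda <- lambdas[which.min(val_mse)]

# Predictions for all rows with the selected penalty.
pred_all <- drop(y_bar + Xs %*% ridge_coef(best_lambda))
cor_in  <- cor(pred_all[tr], y[tr])
cor_out <- cor(pred_all[va], y[va])
```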
Principal component regression (PCR)
Run PCR with the same features and response as in Section 2.5. Use the first 70% data as training data and the last 30% data as validation data. Use the validation MSE to seek the optimal number of principal components to include in PCR and generate the corresponding prediction for the whole year. Report the in-sample and out-of-sample correlation between your prediction and true response.
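A minimal base-R sketch of this procedure: estimate the principal components on the training rows, project all rows onto them, and choose the number of components by validation MSE. The data are simulated and every name below is an assumption for this example; a PCR routine from the lab materials could be used instead.

```r
set.seed(3)
n <- 500
p <- 10
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("f", seq_len(p))
y <- drop(X[, 1:2] %*% c(0.6, -0.4)) + rnorm(n)

n_train <- floor(0.7 * n)
tr <- seq_len(n_train)
va <- (n_train + 1):n

# Loadings come from the training rows only; all rows are then projected.
pca <- prcomp(X[tr, ], center = TRUE, scale. = TRUE)
Z <- predict(pca, newdata = X)  # score columns PC1, PC2, ...

# Regress y on the first m component scores.
fit_pcr <- function(m) {
  lm(y ~ ., data = data.frame(y = y[tr], Z[tr, seq_len(m), drop = FALSE]))
}
pred_pcr <- function(fit, m, rows) {
  predict(fit, newdata = data.frame(Z[rows, seq_len(m), drop = FALSE]))
}

val_mse <- sapply(seq_len(p), function(m) {
  mean((y[va] - pred_pcr(fit_pcr(m), m, va))^2)
})
best_m <- which.min(val_mse)

best_fit <- fit_pcr(best_m)
cor_in  <- cor(pred_pcr(best_fit, best_m, tr), y[tr])
cor_out <- cor(pred_pcr(best_fit, best_m, va), y[va])
```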
Advanced part
You have tried some basic features and models in the previous problems. Now you are in a position to derive new features from the dataset and develop your own more sophisticated statistical models.
Your task is to write an R function prediction() that takes a dataframe of the past one day of minutely price data of Assets 1, 2 and 3 (i.e., a 1440-by-3 numeric dataframe) as input and returns the prediction of the 10-minute forward return of Asset 1 at the last minute of the input dataframe as output. Conceptually speaking, this prediction function is your estimate f̂ of the regression function. Note that there SHOULD NOT be any model fitting inside this function. Rather, you should train your model on the given data OUTSIDE this function, and extract the fitted model to build prediction().
You should submit two files to our OJ: prediction.R and model.RData. The R script prediction.R includes the function prediction(). Feel free to include other utility functions in prediction.R. The file model.RData includes all the objects you need to build your prediction function, e.g., the optimal choice of tuning parameters, the estimates of the coefficients in the linear model, etc. The size limit for both files is 32MB. The OJ will apply your prediction function to the testing dataset, which covers half a year, and return to you the correlation between your prediction and the true forward return. Your prediction function will be called around 10,000 times, and the total time limit for this is 10 minutes. Therefore, please ensure that your prediction function is both accurate and fast!
In terms of package usage, feel free to use all the packages in the lab materials. Please do not use any package that has not appeared in the lab materials.
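The division of labour between the training script, model.RData and prediction.R might look like the sketch below. Everything model-specific is a placeholder: the `model` list, its coefficient values (invented, not fitted) and the assumption that Asset 1 is the first column of the input are for illustration only. The final lines time 10,000 calls on a simulated day of prices.

```r
# ---- training script (run once, offline) ----
# Placeholder model: made-up coefficients for three backward returns of Asset 1.
model <- list(
  horizons = c(3, 10, 30),
  coef = c(intercept = 0, b3 = -0.01, b10 = 0.02, b30 = 0.005)
)
# save(model, file = "model.RData")

# ---- prediction.R ----
# load("model.RData")  # once, at the top level, not inside prediction()

prediction <- function(df) {
  s <- df[[1]]  # assumed: the first column holds Asset 1 prices
  n <- length(s)
  lagged <- s[n - model$horizons]
  feats <- (s[n] - lagged) / lagged  # backward returns at the last minute
  sum(model$coef * c(1, feats))      # no model fitting happens here
}

# ---- local timing check on simulated prices ----
set.seed(4)
day <- as.data.frame(
  matrix(100 * exp(cumsum(rnorm(1440 * 3, sd = 1e-3))), 1440, 3)
)
elapsed <- system.time(for (k in 1:10000) prediction(day))["elapsed"]
```

Keeping only lookups and arithmetic inside `prediction()` is what makes the repeated calls cheap; any fitting, file loading or package loading inside the function is paid for on every one of the calls.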
Some tips regarding code submissions
- Prior to submitting your code, check whether you can call your prediction function 10,000 times within five minutes on your local machine. If not, please simplify your model or code.
- You don't have to load model.RData or the packages you need inside the prediction function. Instead, load them outside the prediction function. This avoids repeatedly loading the model objects and saves a substantial amount of time.
- It is recommended that you put rm(list=ls()) in the beginning of your script prediction.R to refresh the local environment of the