Theoretic part (20 pts, 5 pts each)
时间序列代考 We aim at predicting compressed daily ridership (5 PCA components values) from 12-hour lag variables. Parameter tuning is required in
Multiple choice questions: please select all that applies and explain your answer.
Question 1 (Autocorrelation). 时间序列代考
The autocorrelation plot of the daily time-series has local peaks at t=7,14,21,28 etc.. How would you interpret that?
A. The time-series reaches its maximum on the days 7,14,21,28...
B. The time-series reaches its minimum on the days 7,14,21,28...
C. The time-series is likely to have a periodic pattern with a period of 7 days
D. The time-series is likely to have 7 periods per day
E. The appropriate AR model for the time-series should have at least 7 terms.
Question 2 (Stationarity).
Which of the following time-series models are stationary:
A. Linear trend
B. AR(1) model
C. White noise
D. Random walk 时间序列代考
E. ARMA(1,2) model
F. ARIMA(1,1,1) model
Question 3 (PCA). 时间序列代考
Which of the following statements regarding the model dimensionality reduction through Principal Component Analysis (PCA) are true:
A. Leading principal components of the features are the most efficient for modeling the output variable.
B. Principal components of the standardized features are uncorrelated and this way less exposed to multicollinearity.
C. The model using principal components of the features can't overfit.
D. Feature selection based on the principal components of the features is often more efficient in preventing overfitting comparered the feature selection over the original features.
E. Principal components are harder to interpret compared to the original features making the PCA regresssion model less interpretable compared to the regression model using original features. 时间序列代考
Question 4 (MapReduce).
What is true about MapReduce:
A. MapReduce is a Python module enabling parallel computing
B. Using MapReduce approach makes the code more suitable for parallel computing.
C. MapReduce code always runs faster compared to the code using more traditional approaches, like loops or list comprehensions.
D. MapReduce code will always efficiently run on multiple cores of you CPU or multiple machines within your cluster if available.
E. Multiprocessing and PySpark efficient alternatives to MapReduce. 时间序列代考
Practice part: Taxi ridership from JFK to other taxi zones prediction. 时间序列代考
This project is an example of applying PCA to predict hourly yellow taxi ridership at the taxi zone level. Modeling taxi ridership at a fine spatial and temporal granularity is challenging due to the low signal-to-noise ratio and high dimensionality. In this case, dimension reduction is essential in feature engineering. This project has five steps: data downloading, data preprocessing, baseline modeling, feature engineering, and RandomForest modeling.
Let's start with data downloading.
1. Data downloading (5pts)
Design a function to download yellow taxi data from 2017-01-01 to 2018-12-31 at https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page.
2. Data Preprocessing (10 pts, 7 for dask, 3 for sanity check) 时间序列代考
Use dask to aggregate all months' records into one dataframe, and aggregate dataset by date and hour to get the ridership from JFK to each taxi zone each hour. The expected output has columns: date, hour, drop-off location 1, drop-off location 2, etc.
- JFK taxi zone id is 132.
- time column should be the pickup time, and ridership is passenger count.
- Try read_csv("*.csv") to read all csv file in a folder
- files in 2017 and 2018 have different columns; apply argument usecols to select desired columns.
- using .compute() function to convert processed dask dataframe to pandas dataframe for further modeling.
2.1 Data loading
2.2 Sanity check 时间序列代考
Then, we need to do some basic sanity checks. It is possible that in a particular hour, there were no yellow taxis dispatched from JFK. Check if each day has 24-hour records and add missing records to the dataframe. The final output should have 17520 rows ( )
3. Time-series exploratory analysis
Apply exploratory analysis over the daily aggregated dataset at first.
3.1 aggregate the ridership from each dropoff location and sum it to get daily records. (3pts) 时间序列代考
3.2 Period detection and report the strongest period length on the 2017 data. (3pts)
Hint: using periodogram or acf plot.
3.3 Trend, seasonality, noise decomposition (using additive model) on 2017 data, period is from question 3.2 (3 pts)
4. Predict the total daily ridership from JFK using ARIMA. 时间序列代考
ARIMA is a common method to predict taxi ridership. Before we predict taxi zone level hourly ridership, let's try to predict the aggregated daily ridership using ARIMA.
4.1 Using adfuller test to test the stability of the aggregated dataset. If not stable, apply differencing method until the p-value from adfuller test is smaller than 0.05. (3pts)
4.2 Find out proper AR and MA terms in an ARIMA model using pacf and acf plots. (4 pts, 2 for each plot)
Hint: positive autocorrelation is usually best treated by adding an AR term to the model and negative autocorrelation is usually best treated by adding an MA term. In general, differencing reduces positive autocorrelation and may even cause a switch from positive to negative autocorrelation.
Identifying the numbers of AR and MA terms: 时间序列代考
- if the pacf plot shows a sharp cutoff and/or the lag-1 autocorrelation is positive then consider adding one or more AR terms to the model. The lag beyond which the PACF cuts off is the indicated number of AR terms.
- if the acf plot displays a sharp cutoff and/or the lag-1 autocorrelation is negative then consider adding an MA term to the model. The lag beyond which the ACF cuts off is the indicated number of MA terms.
- It is generally advisable to stick to models in which at least one of AR and MA term is no larger than 1, i.e., do not try to fit a model such as ARIMA(2,1,2).
4.3 build an ARIMA model using terms from 4.2, training on the first 700 days, forecast on the last 31 days. Print ARIMA model results and plot in-sample and out-of-sample prediction in different colors. (8 pts, 3 for correct terms, 3 for training and summary, 2pts for the plot)
Taxi zone level prediction 时间序列代考
This project aims to predict hourly yellow taxi ridership volume from JFK to each taxi zone. The ARIMA experiment in section 3 forecasts the total ridership amount from JFK. However, based on the reported , this model is not a good fit. ARIMA model has four main shortcomings: 1) they rely heavily on stationarity assumption which does not hold in real-world traffic systems 2) they do not consider spatial and structural dependencies that traffic networks exhibit and forecast each sensor as an individual time series 3) they are unable to model non-linear temporal dynamics 4) they suffer from the curse of dimensionality. Due to the limitation of ARIMA, we need to apply another method to predict taxi zone level ridership. 时间序列代考
5. Feature engineering
Our workflow is first standardizing the dataset, then using PCA to compress the dataset. As we predict future ridership, PCA should be learned from historical data (2017) then apply to the following year (2018). Next, add lag features (PCA components) from the past 12 hours and apply a Random Forest regressor to predict each PCA component's values in the next hour. After we had the PCA component prediction, inverse PCA, and inverse standardization to retrieve taxi ridership prediction in its original scale and dimension, in other words, we are predicting the PCA components instead of taxi zone level ridership and then using the inverse PCA method to reconstruct 时间序列代考
5.1. standardization. (3 pts)
The standardscaler stores information of this standardization process, including the mean and standard deviation values required when converting the prediction back to the raw scale. Split the whole dataset into two parts: 2017 and 2018, standardize each separately.
5.2. PCA 时间序列代考
5.2.1 train PCA on 2017 data. Let's arbitrarily set PCA components as 5, and gamma is None, try kernel ‘linear’, ‘poly’, ‘rbf’, and ‘sigmoid’. Select the transformer which has the lowest mean squared error in data reconstruction (inverse transform). (5 pts)
5.2.2 Apply the selected transformer from 4.2.1 to the standardized 2018 dataset and report the mean squared error between the standardized data and reconstructed data. Hint: fit the PCA on 2017 data and apply it to transform 2018 data.(5pts)
5.3 Add lag (5pts)
add 12 lags of each component from 5.2.2 (compressed 2018 data only). The expected output should have 65 dimensions. In the further modeling step, we will apply the 60 lag variables to predict the 5 components. 时间序列代考
6. RandomForest modeling (23pts)
We aim at predicting compressed daily ridership (5 PCA components values) from 12-hour lag variables. Parameter tuning is required in this section, including min_samples_split, min_samples_leaf, and n_estimators. First 80% days for training, test on the rest 20%. And in the training dataset, validate the model on the bottom 20%.
Using grid_search function in sklearn instead of a for-loop when tuning parameters in a RandomForest. The train, validation, and test datasets should be split in the same way as described above. Hint: To fix train and validation in a grid search, you might need the PredefinedSplit function from sklearn.
6.1 train test split (3pts) 时间序列代考
Please keep in mind that random train test split is not applicable in this case.
6.2 parameter tuning (10pts)
Please search the best parameter set in the following range: min_samples_split: 2 to 10, min_samples_leaf: 2 to 10, and n_estimators equal to 50.
6.3 model performance measurement (10pts)
Prediction results are PCA components instead of taxi zone level ridership. To reconstruct the data back to its original size and scale, we need to inverse PCA and inverse standardization. Report the taxi zone level value. 时间序列代考