Multivariate Analysis: Assignment 2
The University of New South Wales
• Submission instructions will be posted shortly.
• No late assignments will be accepted without a successful application for a Special Consideration.
• For computational and applied exercises, you may use either R or SAS. Include commands used and a reasonable amount of relevant output.
• Use of computer algebra systems is permitted and encouraged, though note that one may not be available during the exams.
1. Consider identifying the neurotic state of an individual referred for psychiatric examination. Three measurements A, B, and C are made on each individual. The mean scores for each of the 3 groups are:
The pooled within-group covariance matrix is:
(a) Discriminant analysis For the following, calculate from the information provided here:
i. Assuming equal misclassification costs and equal priors for the three groups, calculate the linear discriminant scores for classifying each of the three groups.
ii. Based on the above scores, classify the following newly observed individuals:
iii. Suppose that in the population of people administered this examination, 20% are, in fact, “normal”, 40% have anxiety, and 40% have obsession. Show how this changes the linear discriminant scores and classifications of the three individuals.
iv. Consider classifying individuals from the “Anxiety” and “Obsession” groups only. Determine the linear discriminant function and estimate the probabilities of misclassification P(1|2) and P(2|1).
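For reference, one common textbook form of the linear discriminant score used in parts i–iii is sketched below. The notation is an assumption and may differ from the lecture notes: $\bar{x}_i$ denotes the mean vector of group $i$, $S_{\text{pooled}}$ the pooled within-group covariance matrix, and $p_i$ the prior probability of group $i$.

```latex
d_i(x) = \bar{x}_i^{\top} S_{\text{pooled}}^{-1} x
         - \tfrac{1}{2}\, \bar{x}_i^{\top} S_{\text{pooled}}^{-1} \bar{x}_i
         + \ln p_i, \qquad i = 1, 2, 3.
```

An observation $x$ is allocated to the group with the largest score. With equal priors the $\ln p_i$ term is the same for every group and does not affect the classification. For part iv, one plug-in estimate under normality with a common covariance matrix, equal priors, and equal costs is $P(1|2) = P(2|1) = \Phi(-\hat{\Delta}/2)$, where $\hat{\Delta}^2 = (\bar{x}_1 - \bar{x}_2)^{\top} S_{\text{pooled}}^{-1} (\bar{x}_1 - \bar{x}_2)$ and $\Phi$ is the standard normal distribution function.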
(b) Discriminant analysis continued Load the original dataset from neurotic.csv provided. Using R or SAS:
i.–iii. Repeat the corresponding parts of Part (a).
iv. Calculate the in-sample confusion matrix for LDA (assuming equal prior probabilities).
v. Use an appropriate hypothesis test to check that the equal within-group variance assumption required by LDA is satisfied. Report the test statistic, the p-value, and state the conclusion in the context of the problem.
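For those working in R, the LDA fit and the in-sample confusion matrix in part iv might be obtained along the following lines. This is a minimal sketch, not a model answer: it uses simulated stand-in data, and the column names (group, A, B, C), the group labels, and the new individual's values are all assumptions, since the layout of neurotic.csv is not described here.

```r
library(MASS)  # provides lda()

# Simulated stand-in data; NOT the contents of neurotic.csv.
# Column names (group, A, B, C) and group labels are assumptions.
set.seed(1)
n_per_group <- 15
groups <- c("Normal", "Anxiety", "Obsession")
sim <- data.frame(
  group = factor(rep(groups, each = n_per_group), levels = groups),
  A = rnorm(3 * n_per_group, mean = rep(c(2, 4, 5), each = n_per_group)),
  B = rnorm(3 * n_per_group, mean = rep(c(1, 3, 2), each = n_per_group)),
  C = rnorm(3 * n_per_group, mean = rep(c(3, 3, 5), each = n_per_group))
)

# LDA with equal prior probabilities; priors follow the factor level order.
fit <- lda(group ~ A + B + C, data = sim, prior = rep(1 / 3, 3))

# In-sample confusion matrix: rows are observed groups, columns predictions.
pred <- predict(fit)$class
confusion <- table(observed = sim$group, predicted = pred)
print(confusion)

# Classify a hypothetical new individual (these values are made up).
new_obs <- data.frame(A = 3, B = 2, C = 4)
print(predict(fit, newdata = new_obs)$class)
```

Unequal priors, as in Part (a)iii, would be passed through the same prior argument, in the order of the factor levels.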
(c) Support vector machine Fit and tune a support vector machine of your choice for predicting the patient group from the measurements. Report the following for the SVM fit:
i. Selected tuning parameters.
ii. In-sample confusion matrix.
iii. Out-of-sample accuracy estimated by cross-validation.
iv. Predictions for the individuals in 1(a)ii.
(d) Principal component analysis Perform a principal component analysis on the three measurements A, B, and C, ignoring grouping.
i. Report the coefficients for the components, the eigenvalues, and the cumulative variance explained.
ii. How many components are needed to explain at least 90% of the variation in the data?
iii. How many components are needed according to Kaiser’s rule?
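The quantities requested in part (d) can be extracted in R roughly as follows. This is a sketch on simulated stand-in data, and it assumes a PCA of the correlation matrix (scale. = TRUE); whether to standardise the three measurements is a decision for the analysis itself. Kaiser's rule is applied here in its correlation-matrix form, retaining components whose eigenvalue exceeds 1.

```r
# Simulated stand-in data; NOT the contents of neurotic.csv.
set.seed(1)
n <- 45
A <- rnorm(n)
sim <- data.frame(A = A, B = 0.7 * A + rnorm(n, sd = 0.6), C = rnorm(n))

# PCA on the correlation matrix (scale. = TRUE standardises each variable).
pca <- prcomp(sim, center = TRUE, scale. = TRUE)

coefficients <- pca$rotation   # component coefficients (loadings)
eigenvalues <- pca$sdev^2      # variances of the principal components
cum_var <- cumsum(eigenvalues) / sum(eigenvalues)

# Smallest number of components explaining at least 90% of the variance.
k_90 <- which(cum_var >= 0.90)[1]

# Kaiser's rule (correlation-matrix form): keep eigenvalues above 1.
k_kaiser <- sum(eigenvalues > 1)

print(coefficients)
print(rbind(eigenvalue = eigenvalues, cumulative = cum_var))
```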
2. Data on n = 20 consecutive years have been collected on the annual average prices of beef steers (X1) and of hogs (X2) and the annual per capita consumption of beef (X3) and of pork (X4). We are interested in the relationship of livestock prices to meat production. The file price-cons.csv contains the variables Y (year index) and X1, X2, X3, X4. A simple approach would be to calculate U = (X1 + X2)/2 and V = X3 + X4 and then regress U on V.
(a) Canonical correlation A perhaps better procedure would be to construct a (weighted) price index U = a1X1 + a2X2 and a consumption index V = b3X3 + b4X4 and to look at the maximal correlation between U and V. This is the canonical correlation analysis approach.
i. Find and list both the canonical correlations and the related canonical variates (i.e., U and V). Express the canonical variates using the raw coefficients and also by using the standardised coefficients (i.e., coefficients obtained by first standardising the variables involved). Since the prices are in dollar units but the consumption is in pounds, does it make sense to standardise here?
Hint: Recall from the lecture that SAS provides standardised coefficients as a part of its output. In R, they may be obtained by first using the scale() function to standardise the inputs and then performing canonical correlation analysis on those.
ii. Using canonical correlation analysis, formulate and test the hypothesis of independence of the price index and of the consumption index (intuition shows that it must be rejected). Report the test statistic, the p-value, and state the conclusion in the context of the problem.
iii. Is only one canonical variate pair enough (i.e., is the second canonical correlation also significant)?
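The hint in part i, standardising with scale() before running the canonical correlation analysis, might look like the sketch below. It uses base R's cancor() on simulated stand-in data; the helper name standardised_cca is hypothetical, and only the variable names X1 to X4 are taken from the description of price-cons.csv.

```r
# Hypothetical helper: canonical correlation analysis on standardised inputs.
standardised_cca <- function(x, y) {
  cancor(scale(x), scale(y))
}

# Simulated stand-in data; NOT the contents of price-cons.csv.
set.seed(1)
n <- 20
sim <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
sim$X3 <- -0.8 * sim$X1 + rnorm(n, sd = 0.5)
sim$X4 <- -0.5 * sim$X2 + rnorm(n, sd = 0.5)

prices <- sim[, c("X1", "X2")]
consumption <- sim[, c("X3", "X4")]

raw <- cancor(prices, consumption)            # raw coefficients
std <- standardised_cca(prices, consumption)  # standardised coefficients

# The canonical correlations do not change under rescaling of the variables;
# only the coefficients do.
print(raw$cor)
print(std$cor)
print(std$xcoef)  # coefficients for the standardised price variables
print(std$ycoef)  # coefficients for the standardised consumption variables
```

Note that cancor() scales its coefficients so that each canonical variate has unit sum of squares after centring, so the values may differ from those of other software by a constant factor.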
(b) Multivariate linear model Now, suppose that our goal is not correlation but explanation: we wish to model consumption as a function of the prices.
i. Fit a multivariate linear model with the consumption variables as responses and prices as predictors. Report the coefficients, the standard errors, and the estimated variance–covariance matrix of the residuals.
ii. Briefly (in 2–3 sentences), interpret the regression coefficients and their significance.
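In R, a multivariate linear model of this kind can be fitted with lm() by binding the responses together with cbind(). The sketch below uses simulated stand-in data with the variable names X1 to X4 from the description of price-cons.csv. The residual variance–covariance matrix is estimated here by dividing the residual cross-product matrix by the residual degrees of freedom, which is one common convention and an assumption about what the course expects.

```r
# Simulated stand-in data; NOT the contents of price-cons.csv.
set.seed(1)
n <- 20
sim <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
sim$X3 <- 1 - 0.8 * sim$X1 + rnorm(n, sd = 0.5)
sim$X4 <- 2 - 0.5 * sim$X2 + rnorm(n, sd = 0.5)

# Consumption variables as responses, prices as predictors.
fit <- lm(cbind(X3, X4) ~ X1 + X2, data = sim)

print(coef(fit))     # 3 x 2 matrix: intercept and slopes for each response
print(summary(fit))  # per-response tables with standard errors and t-tests

# Estimated variance-covariance matrix of the residuals,
# using the residual degrees of freedom as the divisor.
sigma_hat <- crossprod(residuals(fit)) / df.residual(fit)
print(sigma_hat)
```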