## Computations and R commands

Problem set 2

For all questions, please explain, step by step, all of your computations and R commands.

1. (25 points) Linear Regression Explanation

First, generate 1,000 data points from a normal distribution with mean 0 and standard deviation 1 by typing var1 <- rnorm(1000, 0, 1). Generate a second variable in the same way (call it var2). Run a linear regression of var2 ~ var1, then run summary() to see the results.

Interpret and explain the regression results.

Start with:

set.seed(100)
var1 <- rnorm(1000, 0, 1)
var2 <- rnorm(1000, 0, 1)
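The steps above can be sketched as follows; the variable names follow the starter code, and `fit` is just an illustrative name:

```r
# Generate two independent standard-normal samples (per the starter code).
set.seed(100)
var1 <- rnorm(1000, 0, 1)
var2 <- rnorm(1000, 0, 1)

# Fit the linear regression of var2 on var1 and inspect the results.
fit <- lm(var2 ~ var1)
summary(fit)

# Because the two variables are independent draws, the slope estimate and
# the R-squared should both be close to 0.
coef(fit)
```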

2. (25 points) Data Transformation

The file “rollingsales_manhattan.csv” contains New York City (Manhattan) housing sales data. Analyze sales using a regression with the two predictors “ZIP.CODE” and “YEAR.BUILT”. Hint: SALE.PRICE is the target variable of interest and should hence be on the left-hand side of the regression. Since the sales data is very skewed (you can verify this by plotting the data as a histogram), you should perform a log() transformation of the sales variable first. Interpret your findings.
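A sketch of this workflow, assuming the CSV has the columns named in the hint (SALE.PRICE, ZIP.CODE, YEAR.BUILT). Since the file is not bundled here, a small synthetic data frame stands in for read.csv(); in the real file, SALE.PRICE may be stored as text with "$" and commas and need converting to numeric first.

```r
# In the assignment you would load the real data:
# sales <- read.csv("rollingsales_manhattan.csv")
# Synthetic stand-in so this sketch runs on its own (hypothetical values):
set.seed(1)
sales <- data.frame(
  SALE.PRICE = exp(rnorm(200, mean = 13, sd = 1)),  # right-skewed, like prices
  ZIP.CODE   = sample(c(10001, 10002, 10003), 200, replace = TRUE),
  YEAR.BUILT = sample(1900:2010, 200, replace = TRUE)
)

# The raw prices are heavily right-skewed; a histogram shows this.
hist(sales$SALE.PRICE, breaks = 50)

# Regress the log-transformed price on the two predictors.
fit <- lm(log(SALE.PRICE) ~ ZIP.CODE + YEAR.BUILT, data = sales)
summary(fit)
```

After the log transformation, each coefficient is interpreted as the approximate proportional change in sale price for a one-unit change in the predictor.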

3. (25 points) SVM

We will use the spam dataset which comes with the {kernlab} package. First, we will split the spam data randomly into two halves: one half we will use as the training data, the other half we will use as the test data. The target variable is “type”, which is a binary class with levels spam and nonspam. You can look at the help page for the dataset to find out what the different columns mean (hint: ?spam).

1. Fit a support vector classifier using svm() on the training data. type is the target, and all other variables can be used as predictors (hint: you can use the . notation, which automatically includes all columns of the data.frame as predictors except the target variable).

2. Predict spam/nonspam classes for the data in the test dataset. How does the predicted classification compare with the true classes? What is the classification error?

3. Can you improve the classification accuracy? (Hint: start by exploring different settings for the cost parameter and using different predictors.)

Use the following code fragment to get you started. You may have to run install.packages("kernlab") first.

# install.packages("kernlab")
library(e1071)
library(kernlab)
data(spam)
set.seed(02115)
sample <- sample(c(TRUE, FALSE), nrow(spam), replace = TRUE)
train <- spam[sample, ]
test <- spam[!sample, ]
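Putting the three steps together, one possible sketch (it repeats the starter split so it is self-contained, and assumes the e1071 and kernlab packages are installed; `fit`, `pred`, `conf`, and `err` are illustrative names):

```r
library(e1071)
library(kernlab)
data(spam)

# Starter split: random half for training, the rest for testing.
set.seed(02115)
sample <- sample(c(TRUE, FALSE), nrow(spam), replace = TRUE)
train <- spam[sample, ]
test <- spam[!sample, ]

# 1. Fit a support vector classifier; "type ~ ." uses every other
#    column of the data frame as a predictor.
fit <- svm(type ~ ., data = train)

# 2. Predict on the held-out half and compare with the true classes.
pred <- predict(fit, newdata = test)
conf <- table(predicted = pred, actual = test$type)
print(conf)
err <- mean(pred != test$type)
err

# 3. One knob to explore for higher accuracy is the cost parameter, e.g.
# fit2 <- svm(type ~ ., data = train, cost = 10)
```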

4. (25 points) Least Squares Method

The calculation of the “Sum of Squared Errors” is illustrated as follows.

For the columns x and y, if we use the estimate equation ŷ = 1.5698 + 0.0407x, we can calculate the “Sum of Squared Errors” ∑(y − ŷ)² = 0.0517 + 0.0029 + 0.0152 + 0.0433 = 0.1131.

The estimate equation ŷ = 1.5698 + 0.0407x is not the best linear estimate equation. We would like to find the best linear estimate equation ŷ = a + bx, so that the “Sum of Squared Errors” is the minimum among all choices of a and b. There are many algorithms to find the best choice of a and b. Here we use “brute force” to find the best choice of a and b.

Write R code to find the best a and b using “brute force”, where a takes values in seq(3, 5, 0.1) and b takes values in seq(-0.5, 0.5, 0.01). Use the following code to get you started.

df <- data.frame(x = c(61, 63, 67, 69), y = c(4.28, 4.08, 4.42, 4.17))
a_range <- seq(3, 5, 0.1)
b_range <- seq(-0.5, 0.5, 0.01)

| x  | y    | ŷ = 1.5698 + 0.0407x         | y − ŷ                    | (y − ŷ)²              |
|----|------|-------------------------------|---------------------------|------------------------|
| 61 | 4.28 | 1.5698 + 0.0407·61 = 4.0526  | 4.28 − 4.0526 = 0.2274   | (0.2274)² = 0.0517    |
| 63 | 4.08 | 1.5698 + 0.0407·63 = 4.134   | 4.08 − 4.134 = −0.054    | (−0.054)² = 0.0029    |
| 67 | 4.42 | 1.5698 + 0.0407·67 = 4.2968  | 4.42 − 4.2968 = 0.1232   | (0.1232)² = 0.0152    |
| 69 | 4.17 | 1.5698 + 0.0407·69 = 4.3782  | 4.17 − 4.3782 = −0.2082  | (−0.2082)² = 0.0433   |
|    |      |                               | Sum                       | 0.0517 + 0.0029 + 0.0152 + 0.0433 = 0.1131 |
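The brute-force search itself can be sketched as a double loop over the two grids, keeping the (a, b) pair with the smallest sum of squared errors (`best_a`, `best_b`, and `best_sse` are illustrative names):

```r
# Data and grids from the starter code above.
df <- data.frame(x = c(61, 63, 67, 69), y = c(4.28, 4.08, 4.42, 4.17))
a_range <- seq(3, 5, 0.1)
b_range <- seq(-0.5, 0.5, 0.01)

# Try every (a, b) combination and remember the one with the smallest SSE.
best_a <- NA
best_b <- NA
best_sse <- Inf
for (a in a_range) {
  for (b in b_range) {
    sse <- sum((df$y - (a + b * df$x))^2)
    if (sse < best_sse) {
      best_sse <- sse
      best_a <- a
      best_b <- b
    }
  }
}
c(a = best_a, b = best_b, SSE = best_sse)
```

With these grids and the four data points above, the search settles on a = 3.6 and b = 0.01, giving SSE = 0.0643, which is well below the 0.1131 of the illustrated equation.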
