In a similar vein to the previous post on analysis of variance (ANOVA), let’s shift our focus to another problem-solving approach: linear regression. In essence, this model emphasizes the ability to predict an outcome from a given set of variables. It can also be used to discern whether there are relationships or dependencies among certain quantities and whether one influences the other. For example, you are tasked with inspecting whether the number of sales calls made by a sales representative during a certain period has a bearing on the number of copiers sold. You might suspect that more sales calls result in more copiers sold. Statisticians use a particular vocabulary for the components of a regression analysis: the variable that drives or influences the others is called the *independent variable*, while the resulting variable is referred to as the *dependent variable*. In our example, the sales calls would be labelled the independent variable that allegedly drives the number of copiers sold, which is hence the dependent variable. We will take up this example and work through it using the capabilities of the R statistics package.

## The Sales Representative Example

Let’s turn our attention to the aforementioned example, taken from Lind et al. (2015): is the number of sales calls related to the number of copiers sold? Refer to the listing below.

The sample comprises 15 sales reps with their respective numbers of sales calls and copiers sold. At first glance it looks as if there is a positive relationship between calls and sales, but how strong is this trend actually? To illustrate the relationship graphically we usually resort to a scatter diagram, plotting the number of sales calls (independent variable) on the x-axis and the resulting number of copiers sold (dependent variable) on the y-axis.

Apparently there is a positive, that is, upward relationship between calls and sales, though not as strong as we might have assumed. Since the regression line minimizes the sum of the squared vertical distances to the data points (the least-squares criterion), making it the closest line to all plotted points, some major outliers stand out, especially in the range 80 < x < 100.
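To make the least-squares idea concrete, the slope and intercept can be computed by hand. The sketch below uses only the six sample rows listed further down, so it is an illustration only; the full data set has 15 reps and these numbers therefore differ from the final results:

```r
# Least-squares estimates computed by hand (illustration only: these six
# observations are just an excerpt of the 15-rep sample used later)
x <- c(96, 40, 104, 128, 164, 76)  # sales calls
y <- c(41, 41, 51, 60, 61, 29)     # copiers sold

b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
a <- mean(y) - b * mean(x)                                      # intercept
c(intercept = a, slope = b)
```

The slope formula is simply the sample covariance of x and y divided by the variance of x, which is exactly what lm() computes internally for a single predictor.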

## Scripting with R

Once you have access to your R environment you simply need the data as a CSV file. You may download the data file survey.csv as well as the full working R script on this share.

First, import the CSV file with the following command.

```r
sales_rep = read.csv(file.choose()) # this opens up a file import window
```

Next you can inspect the imported data from various angles.

```r
dim(sales_rep)  # retrieves the dimensions of the R object
str(sales_rep)  # compact display of the R object structure
head(sales_rep) # returns the first rows of an object
```

After executing the *head(sales_rep)* command you should get the following output. Of course, it represents only an excerpt from the sample.

```
       sales_rep sales_call copiers_sold
1   Brian Virost         96           41
2 Carlos Ramirez         40           41
3     Carol Saia        104           51
4      Greg Fish        128           60
5      Jeff Hall        164           61
6  Mark Reynolds         76           29
```
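If you prefer not to work with the CSV file, the excerpt above can also be entered directly as a data frame. Note that this reproduces only the six rows shown here, not the full 15-rep sample, so any results computed from it will differ:

```r
# Build the excerpt shown above by hand; the downloadable survey.csv
# contains all 15 reps, so results based on this subset will differ
sales_rep <- data.frame(
  sales_rep    = c("Brian Virost", "Carlos Ramirez", "Carol Saia",
                   "Greg Fish", "Jeff Hall", "Mark Reynolds"),
  sales_call   = c(96, 40, 104, 128, 164, 76),
  copiers_sold = c(41, 41, 51, 60, 61, 29)
)
```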

Using the next command you tell R to attach the data frame referred to as *sales_rep* to the search path, so that its variables can be addressed by name.

```r
attach(sales_rep)
```
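As a side note, attach() is convenient but can cause name clashes when several objects define the same variable names. The same columns can always be addressed explicitly with the $ operator instead, as in this equivalent sketch:

```r
# Equivalent calls without modifying the search path via attach()
plot(sales_rep$sales_call, sales_rep$copiers_sold)
lm(copiers_sold ~ sales_call, data = sales_rep)
```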

The following command produces the scatter diagram shown above.

```r
plot(sales_call, copiers_sold, pch = 16, cex = 1.3, col = "blue",
     main = "Copier Sales based on Sales Calls",
     xlab = "# sales calls", ylab = "# copiers sold")
```

Now add a regression line to the graph.

```r
abline(lm(copiers_sold ~ sales_call))
```

Next, run the regression of copiers_sold as a function of sales_call.

```r
regression <- lm(copiers_sold ~ sales_call)
```

Display the results.

```r
summary(regression)
```

This produces all the relevant regression figures.

```
Call:
lm(formula = copiers_sold ~ sales_call)

Residuals:
    Min      1Q  Median      3Q     Max
-11.873  -2.861   0.255   3.511  10.595

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  19.9800     4.3897   4.552 0.000544 ***
sales_call    0.2606     0.0420   6.205 3.19e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.72 on 13 degrees of freedom
Multiple R-squared:  0.7476,    Adjusted R-squared:  0.7282
F-statistic: 38.5 on 1 and 13 DF,  p-value: 3.193e-05
```
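The fitted equation can be read off the Coefficients table: copiers_sold ≈ 19.98 + 0.2606 × sales_call. As a quick plausibility check, a rep making 100 calls would be expected to sell about 46 copiers. A minimal sketch of this prediction, using the regression object created above:

```r
# Point prediction from the fitted model for a rep making 100 sales calls
predict(regression, newdata = data.frame(sales_call = 100))

# The same figure computed by hand from the rounded summary coefficients:
19.98 + 0.2606 * 100  # 46.04
```

Since R stores the coefficients at full precision, predict() may differ from the hand calculation in the later decimal places.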

## References

- statisticfun (2014). YouTube video on linear regression.
- Lind, D.A., Marchal, W.G., and Wathen, S.A. (2015). Statistical Techniques in Business and Economics (New York, NY: McGraw-Hill Education).