Lecture Notes
Chapter 5 Correlation and Regression
Modified: 2003-07-23
Repeat after me, "Correlation does not imply causation." "Correlation
does not imply causation." "Correlation does not imply causation."
"Correlation does not imply causation." "Correlation does not imply
causation." "Correlation does not imply causation." Got it? A
correlation is simply a statement of the relationship between two
(bivariate) or more (multivariate) variables. Just because you
discover a relationship does not mean that one variable caused the
other. Regression is a form of prediction. In linear regression,
a line of best fit is computed for a given set of paired scores.
Finally, scatterplots help you visualize the relationships in
bivariate and multivariate data.
- Bivariate Distributions
- So far, we have been looking at distributions involving a
single variable, univariate distributions. Now we look at cases
where two measurements are made from each subject. Of
interest is whether or not those two variables show any
relationship to each other.
- Correlation
- A correlation is a way of describing the relationship
between two or more variables. Several statistical tests
measure this type of relationship including the Pearson
Product-Moment, the Point Biserial, and the Spearman Rank
Order. Each of those tests is suited to a particular data type.
The Pearson Product-Moment is used on interval and ratio data,
the Point Biserial where one variable is continuous and the
other dichotomous, and the Spearman Rank Order when the data
are ordinal. The important thing to remember is that just
because two (or more) variables are highly correlated does not
mean that one caused the other. That may be the case, but
correlation alone cannot prove causality. Recall that
correlations range from a value of -1.00 to +1.00, and
that the higher the absolute value of the correlation the
stronger it is. The sign of the correlation indicates the
direction, positive or negative (or inverse), of the
correlation.
- Examples of high positive (+1.00), high negative (-1.00), and
little correlation (-0.146)
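As a minimal sketch of the range and sign of a correlation, the Python below computes r for two made-up sets of paired scores. The helper name `pearson_r` and the data are illustrative assumptions, not part of the course materials; the helper uses the standard deviation-score form of the Pearson coefficient.

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Pearson r: mean product of deviations divided by both (population) SDs."""
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (pstdev(xs) * pstdev(ys))

# Hypothetical data, chosen only to illustrate the two extremes.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
rising = [2 * v + 1 for v in x]    # every point lies on a rising line
falling = [10 - 3 * v for v in x]  # every point lies on a falling line

r_pos = pearson_r(x, rising)    # close to +1.00
r_neg = pearson_r(x, falling)   # close to -1.00
print(round(r_pos, 2), round(r_neg, 2))
```

Real data rarely fall exactly on a line, so observed correlations usually land somewhere between these two extremes.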
- Scatterplots
- Traditionally, scatterplots have been used to display two
correlated variables. The graphing utilities in statistics
programs can also display three correlated variables in a 3-D style
plot. However, 3-D plots are not recommended because most
students and others are not sufficiently practiced in
interpreting them. Instead, a series of two-way plots
can be used to display more than two variables.
- Examples of scatterplots: weight vs. mileage of cars, U.S. Patents
1940 vs. 1950
- Regression
- When two or more variables for a set of data are known, one
variable can be used to predict the other. Such prediction is
usually accomplished via a regression line, and the process is
known as linear regression. Other functions (quadratic,
hyperbolic, curvilinear, or exponential) can be used to predict
data, but they are beyond the scope of this course.
- Examples of regression: weight vs. mileage of cars, U.S. Patents
1940 vs. 1950
- History
- Sir Francis Galton
- Genius with an interest in measurement
- Tried to measure everything from the weather to female
beauty
- Pioneered the classification of fingerprints for identification
- Originated the concepts of correlation and regression
- Karl Pearson
- Colleague (younger) of Galton's
- Worked out formulas for correlation (Pearson r or
product-moment correlation coefficient)
- Correlation coefficient or Pearson r
- Definitional formula
$r = \frac{\sum (z_X z_Y)}{N}$
- where:
- $r$ = correlation coefficient
- $z_X$ = z score for variable X
- $z_Y$ = z score for variable Y
- $N$ = number of pairs of scores
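The definitional formula can be followed step by step in a few lines of Python. This is a sketch under stated assumptions: the function names and the five pairs of scores are hypothetical, and the z scores use the population standard deviation (N in the denominator) so that dividing by N gives the usual r.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Convert raw scores to z scores using the population SD (N in the denominator)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def pearson_r_from_z(xs, ys):
    """Definitional formula: r = sum(z_X * z_Y) / N."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)

# Hypothetical paired scores for five subjects.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r_from_z(x, y)
print(round(r, 3))  # 0.775
```

If the sample standard deviation (N - 1) were used for the z scores instead, the sum would have to be divided by N - 1 to give the same r.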
- Computing the correlation coefficient
- see text (pp. 86-89); we will not hand-calculate
correlation coefficients
- Blanched formula
- Raw scores
calculation example
- Using a statistics program to find a correlation
- In-class example of using Statistica to compute a
correlation coefficient
- Scatterplots
- What Do the + and - Signs Mean?
- Remember to interpret the sign of the correlation also. The
sign tells you the direction of the relationship.
- Positive sign
- The variables change in the same direction (e.g.,
height and weight)
- Negative sign
- The variables change in opposite directions (e.g.,
smoking and distance run)
- Effect Size for r
- What does the size of a correlation mean?
- Small relationship: r = .10
- Medium relationship: r = .30
- Large relationship: r = .50
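One way to apply these benchmarks is sketched below. The function name is hypothetical, and treating each listed value as the lower bound of its category is an assumption; the notes give only the three reference points, and no label for correlations below .10.

```python
def describe_effect_size(r):
    """Label |r| using the benchmarks listed in the notes (.10, .30, .50)."""
    size = abs(r)  # the sign gives direction, not strength
    if size >= 0.50:
        return "large"
    if size >= 0.30:
        return "medium"
    if size >= 0.10:
        return "small"
    return "below small"  # assumption: the notes give no label for |r| < .10

labels = [describe_effect_size(r) for r in (0.05, -0.12, 0.35, -0.60)]
print(labels)
```

Because the absolute value is used, a correlation of -.60 counts as large just as +.60 does.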
- Uses of r
- Reliability
- one of the best and most common uses of r is to assess
the reliability of sets of measurements.
- typical examples include correlating the observations of
two or more observers, estimating the variability within a
population, and assessing whether or not developmental
processes are at work
- Sign of possible causation
- Some correlations will turn out to reflect a causal
relationship; you just cannot use r to prove it. So, another
common use of r is as a quick pilot study, to be
followed later by experimentation.
- Coefficient of determination ($r^2$)
- This is a very useful feature of r. The coefficient of
determination, $r^2$, tells you what proportion of
the variance is shared by two variables. This is useful as
an estimate of how much of the variance in a particular
behavior is accounted for by other variables.
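To see what "proportion of the variance" means in practice, the sketch below compares r squared with the share of the variation in Y that a least-squares line accounts for. The data are hypothetical, and the intercept formula a = mean(Y) - b * mean(X) is the standard one, assumed here rather than taken from these notes.

```python
from statistics import mean

# Hypothetical paired scores.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mx, my = mean(x), mean(y)
ss_x = sum((v - mx) ** 2 for v in x)
ss_y = sum((v - my) ** 2 for v in y)  # total variation in Y
sp_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))

r = sp_xy / (ss_x * ss_y) ** 0.5
r_squared = r ** 2

# Least-squares line for predicting Y from X (assumed standard formulas).
b = sp_xy / ss_x
a = my - b * mx
ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Proportion of the variation in Y accounted for by the line.
proportion_explained = 1 - ss_residual / ss_y
print(round(r_squared, 2), round(proportion_explained, 2))  # 0.6 0.6
```

Here r is about .77, yet only 60% of the variance is shared, which is why the squared value is the safer one to interpret.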
- Other Issues Related to r
- Nonlinearity
- r is not the appropriate statistic to use for data that
are not linear. More complex relationships (e.g.,
curvilinear, exponential) may exist, but r will not reveal
them. Other, nonlinear correlation techniques
must be used for such data.
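A small made-up example shows how badly r can miss a nonlinear relationship. In the sketch below (the helper name and data are illustrative assumptions), Y is completely determined by X, yet r comes out as essentially zero because the relationship is a curve, not a line.

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Pearson r: mean product of deviations divided by both (population) SDs."""
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (pstdev(xs) * pstdev(ys))

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]  # Y depends entirely on X, but not linearly

r = pearson_r(x, y)
print(round(r, 2))
```

A scatterplot of these points would show the U shape immediately, which is one reason to plot the data before trusting r.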
- Truncated Range
- A truncated (or shortened) range occurs when the
sample's range is less than the population's range. In such
cases, r tends to underestimate the population correlation
and may fail to indicate a relationship that actually exists.
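The effect of a truncated range can be sketched with constructed data. Everything here is hypothetical: the "population" is 100 cases in which Y rises with X plus a deterministic wobble standing in for scatter, and the truncated sample keeps only the cases with X from 40 to 59.

```python
from math import sin
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Pearson r: mean product of deviations divided by both (population) SDs."""
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (pstdev(xs) * pstdev(ys))

# Hypothetical full range: X from 0 to 99, Y = X plus a wobble.
x_full = list(range(100))
y_full = [x + 10 * sin(x) for x in x_full]

# Truncated sample: only cases with X between 40 and 59.
pairs = [(x, y) for x, y in zip(x_full, y_full) if 40 <= x < 60]
x_trunc = [p[0] for p in pairs]
y_trunc = [p[1] for p in pairs]

r_full = pearson_r(x_full, y_full)
r_trunc = pearson_r(x_trunc, y_trunc)
print(round(r_full, 2), round(r_trunc, 2))
```

The scatter around the line is the same in both cases; only the spread of X shrinks, and that alone is enough to pull r down in the truncated sample.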
- Other Correlation Coefficients
- Biserial or Point Biserial: used for dichotomous
(two-value) data
- Correlation Ratio: eta (η)
is used for curved data relationships
- Multiple Correlation: several variables
are combined and then correlated with another variable,
yielding better predictions
- Partial Correlation: the effects of one
variable are separated (or partialed out) from a correlation
of two variables
- Spearman r: used to correlate data that come
from ranks instead of scores
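As one illustration from this list, the Spearman coefficient can be obtained by applying the Pearson formula to ranks. The sketch below assumes no tied scores (ties need averaged ranks, which this hypothetical `ranks` helper does not handle) and uses made-up data.

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Pearson r: mean product of deviations divided by both (population) SDs."""
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (pstdev(xs) * pstdev(ys))

def ranks(values):
    """Rank scores from 1 (lowest) upward. Assumes no tied scores."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # rises steadily, but not in a straight line

r_scores = pearson_r(x, y)               # Pearson r on the raw scores
r_ranks = pearson_r(ranks(x), ranks(y))  # Spearman: Pearson r on the ranks
print(round(r_scores, 2), round(r_ranks, 2))
```

Because the ranks of X and Y agree perfectly, the rank correlation reaches 1 even though the raw scores do not fall on a straight line.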
- Linear Regression
- Calculating the b coefficient
- Calculate b as follows (see footnote 9 on page 103 for
another formula):
$b = r \frac{S_Y}{S_X}$
- where:
- $r$ = correlation coefficient for X and Y
- $S_Y$ = standard deviation of Y
- $S_X$ = standard deviation of X
- can you see that for positive correlations, b will be
positive too?
- can you see that for negative correlations, b will be
negative too?
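The two questions above can be checked numerically. In this sketch the helper names and the data are hypothetical; reversing the sign of every Y score flips the sign of r, and b follows because both standard deviations are always positive.

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Pearson r: mean product of deviations divided by both (population) SDs."""
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (pstdev(xs) * pstdev(ys))

def slope_b(xs, ys):
    """b = r * (S_Y / S_X), the slope for predicting Y from X."""
    return pearson_r(xs, ys) * pstdev(ys) / pstdev(xs)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_reversed = [-v for v in y]  # same scores with the sign flipped

b_pos = slope_b(x, y)
b_neg = slope_b(x, y_reversed)
print(round(b_pos, 2), round(b_neg, 2))  # 0.6 -0.6
```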
- Calculating the a coefficient
- Finding Y'
- use:
Y' = a + bX
- or use:
$Y' = r \frac{S_Y}{S_X}(X - \bar{X}) + \bar{Y}$
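The two prediction formulas give the same Y'. The short derivation below assumes the usual least-squares intercept, $a = \bar{Y} - b\bar{X}$, which these notes do not spell out, so treat it as a sketch to check against the text.

```latex
\begin{align*}
Y' &= a + bX \\
   &= (\bar{Y} - b\bar{X}) + bX \\
   &= b(X - \bar{X}) + \bar{Y} \\
   &= r\frac{S_Y}{S_X}(X - \bar{X}) + \bar{Y}
\end{align*}
```

The last step substitutes $b = r(S_Y / S_X)$. Setting $X = \bar{X}$ in the final line gives $Y' = \bar{Y}$, so the regression line always passes through the point of the two means.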
- Drawing Regression Lines
- Use the mean of X and the mean of Y for your first
point
- Pick a value of X for the second point (any value will work)
and compute Y' using the regression equation.
- Draw a line between the two points (and extend the line in
either direction)
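The steps above can be sketched in Python as follows. The data and names are hypothetical, the intercept uses the standard formula a = mean(Y) - b * mean(X) (an assumption, since the notes do not state it), and the code only computes the two points, leaving the drawing to paper or a plotting tool.

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Pearson r: mean product of deviations divided by both (population) SDs."""
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (pstdev(xs) * pstdev(ys))

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b = pearson_r(x, y) * pstdev(y) / pstdev(x)
a = mean(y) - b * mean(x)  # assumed standard intercept

def predict(x_value):
    """Y' = a + bX"""
    return a + b * x_value

first_point = (mean(x), mean(y))  # the line passes through the two means
second_point = (0, predict(0))    # any X works; X = 0 gives the intercept
print(first_point, second_point)
```

Choosing a second X far from the mean of X makes the line easier to draw accurately.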
- Two Regression Lines
- Remember, there are two regression lines:
- Y on X (like above and the one most commonly used)
- X on Y
- Assign Y to the variable to be predicted
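That the two regression lines really differ can be checked with a short sketch. The data and names are hypothetical; the slope for X on Y is obtained by swapping the roles of the two standard deviations in b = r(S_Y / S_X).

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Pearson r: mean product of deviations divided by both (population) SDs."""
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (pstdev(xs) * pstdev(ys))

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
b_y_on_x = r * pstdev(y) / pstdev(x)  # slope for predicting Y from X
b_x_on_y = r * pstdev(x) / pstdev(y)  # slope for predicting X from Y
print(round(b_y_on_x, 2), round(b_x_on_y, 2))  # 0.6 1.0
```

The product of the two slopes equals r squared, so the two lines describe the same line on a graph only when r is exactly +1 or -1.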