When is there a predicted relationship between two observed variables?
In this lesson, we will examine the relationship between measurement variables; how to picture them in scatterplots and understand what those pictures are telling us. The overall goal is to examine whether or not there is a relationship (association) between the variables plotted. In Lesson 6, we will discuss the relationship between different categorical variables.
Figure 5.1 Variable Types and Related Graphs [diagram: relationships between variables by variable type (two categorical variables, two measurement variables, one measurement and one categorical variable) and the related graphs (bar graph of percents by groups, bar graph of proportions by groups, side-by-side boxplot, scatterplot, table)]

Objectives
After successfully completing this lesson you should be able to:
In a previous lesson, we learned about possible graphs to display measurement data. These graphs included dotplots, stemplots, histograms, and boxplots to view the distribution of one or more samples of a single measurement variable, and scatterplots to study two measurement variables at a time (see section 4.3).

Example 5.1 Graph of Two Measurement Variables
The following two questions were asked on a survey of 220 STAT 100 students:
Notice we have two different measurement variables. It would be inappropriate to put these two variables on side-by-side boxplots because they do not have the same units of measurement. Comparing height to weight is like comparing apples to oranges. However, we do want to put both of these variables on one graph so that we can determine if there is an association (relationship) between them. The scatterplot of this data is found in Figure 5.2.

Figure 5.2. Scatterplot of Weight versus Height [axes: Height, Weight]
In Figure 5.2, we notice that as height increases, weight also tends to increase. These two variables have a positive association because as the values of one measurement variable tend to increase, the values of the other variable also increase. You should note that this holds true regardless of which variable is placed on the horizontal axis and which variable is placed on the vertical axis.

Example 5.2 Graph of Two Measurement Variables
The following two questions were asked on a survey of ten PSU students who live off-campus in unfurnished one-bedroom apartments.
The scatterplot of this data is found in Figure 5.3.

Figure 5.3. Scatterplot of Monthly Rent versus Distance from campus [axes: Distance, Rent]

In Figure 5.3, we notice that the farther an unfurnished one-bedroom apartment is from campus, the less it costs to rent. We say that two variables have a negative association when the values of one measurement variable tend to decrease as the values of the other variable increase.

Example 5.3 Graph of Two Measurement Variables
The following two questions were asked on a survey of 220 Stat 100 students:
The scatterplot of this data is found in Figure 5.4.

Figure 5.4. Scatterplot of Study Hours versus Exercise Hours [axes: Exercise Hours, Study Hours]

In Figure 5.4, we notice that as the number of hours spent exercising each week increases, there is really no pattern in the hours spent studying: no visible increase or decrease in values. Consequently, we say that there is essentially no association between the two variables.

This lesson expands on the statistical methods for examining the relationship between two different measurement variables. Remember that, overall, statistical methods are one of two types: descriptive methods (that describe attributes of a data set) and inferential methods (that try to draw conclusions about a population based on sample data).

Correlation
Many relationships between two measurement variables tend to fall close to a straight line. In other words, the two variables exhibit a linear relationship. The graphs in Figure 5.2 and Figure 5.3 show approximately linear relationships between the two variables. It is also helpful to have a single number that will measure the strength of the linear relationship between the two variables. This number is the correlation. The correlation is a single number that indicates how close the values fall to a straight line. In other words, the correlation quantifies both the strength and direction of the linear relationship between the two measurement variables. Table 5.1 shows the correlations for the data used in Example 5.1 to Example 5.3. (Note: you would use software to calculate a correlation.)

Table 5.1. Correlations for Examples 5.1-5.3
Example | Variables | Correlation (r)
Example 5.1 | Height and Weight | \(r = .541\)
Example 5.2 | Distance and Monthly Rent | \(r = -.903\)
Example 5.3 | Study Hours and Exercise Hours | \(r = .109\)
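Since the lesson leaves the calculation of a correlation to software, it may help to see roughly what such software does. The Python sketch below computes r from the deviations of each value about its mean. The function name `correlation` and the small distance and rent lists are made up for illustration; they are not the survey data behind Table 5.1.

```python
import math


def correlation(xs, ys):
    """Return the correlation r between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations about the means, and of their cross-products
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values for illustration only (not the data from Example 5.2)
distance = [0.1, 0.2, 0.3, 0.5, 0.8]   # distance from campus
rent = [900, 850, 800, 700, 650]       # monthly rent

r = correlation(distance, rent)
print(round(r, 3))  # negative: rent tends to fall as distance grows
```

Swapping the two arguments gives the same value, which matches the earlier remark that the direction of an association does not depend on which variable goes on which axis.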
Watch the movie below to get a feel for how the correlation relates to the strength of the linear association in a scatterplot.
Features of correlation
Below are some features about the correlation.
As you compare the scatterplots of the data from the three examples with their actual correlations, you should notice that findings are consistent for each example.
Statistical Significance
A statistically significant relationship is one that is large enough to be unlikely to have occurred in the sample if there's no relationship in the population. The issue of whether a result is unlikely to happen by chance is an important one in establishing cause-and-effect relationships from experimental data. If an experiment is well planned, randomization makes the various treatment groups similar to each other at the beginning of the experiment except for the luck of the draw that determines who gets into which group. Then, if subjects are treated the same during the experiment (e.g. via double blinding), there can be two possible explanations for differences seen: 1) the treatment(s) had an effect or 2) differences are due to the luck of the draw. Thus, showing that random chance is a poor explanation for a relationship seen in the sample provides important evidence that the treatment had an effect.

The issue of statistical significance is also applied to observational studies - but in that case, there are many possible explanations for seeing an observed relationship, so a finding of significance cannot help in establishing a cause-and-effect relationship. For example, an explanatory variable may be associated with the response because:
Remember the key lesson: correlation demonstrates association - but the association is not the same as causation, even with a finding of significance. There are three key caveats that must be recognized with regard to correlation.
Correlation and Causation
It is often tempting to suggest that, when the correlation is statistically significant, the change in one variable causes the change in the other variable. However, outside of randomized experiments, there are numerous other possible reasons that might underlie the correlation. Thus, it is crucial to evaluate and eliminate the key alternative (non-causal) relationships outlined in section 6.2 to build evidence toward causation.
Example 5.4: Effect of Outliers on Correlation
Below is a scatterplot of the relationship between the Infant Mortality Rate and the Percent of Juveniles Not Enrolled in School for each of the 50 states plus the District of Columbia. The correlation is 0.73, but looking at the plot one can see that for the 50 states alone the relationship is not nearly as strong as a 0.73 correlation would suggest. Here, the District of Columbia (identified by the X) is a clear outlier in the scatterplot, being several standard deviations higher than the other values for both the explanatory (x) variable and the response (y) variable. Without Washington D.C. in the data, the correlation drops to about 0.5.

Figure 5.5. Scatterplot with outlier [axes: Infant mortality rate per 1000 births, % of juveniles not in school]

Correlation and Outliers
Correlations measure linear association - the degree to which relative standing on the x list of numbers (as measured by standard scores) is associated with the relative standing on the y list. Since means and standard deviations, and hence standard scores, are very sensitive to outliers, the correlation will be as well. In general, the correlation will either increase or decrease, based on where the outlier is relative to the other points remaining in the data set. An outlier in the upper right or lower left of a scatterplot will tend to increase the correlation, while outliers in the upper left or lower right will tend to decrease a correlation.

Watch the two videos below. They are similar to the video in section 5.2 except that a single point (shown in red) in one corner of the plot is staying fixed while the relationship amongst the other points is changing. Compare each with the movie in section 5.2 and see how much that single point changes the overall correlation as the remaining points have different linear relationships.
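To see the effect of a single point in numbers, here is a small Python sketch. It computes the correlation for a weakly related set of points, then again after adding one outlier in the upper right and one in the lower right. All of the data values and the helper name `correlation` are invented for illustration; they are not the state-level data from Example 5.4.

```python
import math


def correlation(xs, ys):
    """Return the correlation r between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical points with only a weak positive association
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 1, 4, 2, 5, 3]

r_without = correlation(xs, ys)

# Add one point far out in the upper right corner of the plot
r_upper_right = correlation(xs + [20], ys + [20])

# Add one point far out in the lower right corner instead
r_lower_right = correlation(xs + [20], ys + [-10])

print(round(r_without, 2), round(r_upper_right, 2), round(r_lower_right, 2))
```

With these made-up values, the upper-right point pulls the correlation up and the lower-right point pulls it down, in line with the rule of thumb stated above.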
Even though outliers may exist, you should not just quickly remove these observations from the data set in order to change the value of the correlation. As with outliers in a histogram, these data points may be telling you something very valuable about the relationship between the two variables. For example, in a scatterplot of in-town gas mileage versus highway gas mileage for all 2015 model year cars, you will find that hybrid cars are all outliers in the plot (unlike gas-only cars, a hybrid will generally get better mileage in-town than on the highway).

Regression is a descriptive method used with two different measurement variables to find the best straight line (equation) to fit the data points on the scatterplot. A key feature of the regression equation is that it can be used to make predictions. In order to carry out a regression analysis, the variables need to be designated as either the:
Explanatory or Predictor Variable = x (on horizontal axis)
Response or Outcome Variable = y (on vertical axis)
The explanatory variable can be used to predict (estimate) a typical value for the response variable. (Note: It is not necessary to indicate which variable is the explanatory variable and which variable is the response with correlation.)

Review: Equation of a Line
Let's review the basics of the equation of a line: \(y = a + bx\) where:
a = y-intercept (the value of y when x = 0)
b = slope of the line. The slope is the change in the variable (y) as the other variable (x) increases by one unit. When b is positive there is a positive association; when b is negative there is a negative association.

[Figure: a line with y-intercept a, showing the slope as the change in y for 1 unit of increase in x. Equation of the line is: y = a + bx]
Example 5.5: Example of Regression Equation
Consider the following two variables for a sample of ten Stat 100 students.
x = quiz score
y = exam score
Figure 5.6 displays the scatterplot of this data whose correlation is 0.883.

Figure 5.6. Scatterplot of Quiz versus exam scores [axes: Quiz, Exam]

We would like to be able to predict the exam score based on the quiz score for students who come from this same population. To make that prediction we notice that the points generally fall in a linear pattern, so we can use the equation of a line that will allow us to put in a specific value for x (quiz) and determine the best estimate of the corresponding y (exam). The line represents our best guess at the average value of y for a given x value, and the best line would be one that has the least variability of the points around it (i.e. we want the points to come as close to the line as possible). Remembering that the standard deviation measures the deviations of the numbers on a list about their average, we find the line that has the smallest standard deviation for the distance from the points to the line. That line is called the regression line or the least squares line. Least squares essentially finds the line that is closer to the data points, taken all together, than any other possible line. Figure 5.7 displays the least squares regression for the data in Example 5.5.

Figure 5.7. Least Squares Regression Equation [axes: Quiz, Exam]

As you look at the plot of the regression line in Figure 5.7, you find that some of the points lie above the line while other points lie below the line. In fact, the total distance for the points above the line is exactly equal to the total distance from the line to the points that fall below it.
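As a rough illustration of how a least squares line could be found, the Python sketch below uses the standard least squares formulas for the slope and intercept (these formulas are not given in this lesson) and then checks the balancing property just described. The quiz and exam scores are hypothetical and are not the ten students' data from Example 5.5; the function name `least_squares_line` is also just an illustrative choice.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are equal, so no line can be fitted")
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # y-intercept
    return a, b


# Hypothetical scores for illustration only (not the data in Figure 5.6)
quiz = [56, 62, 70, 74, 80, 85, 88, 90, 93, 96]
exam = [60, 68, 72, 80, 84, 92, 90, 97, 95, 100]

a, b = least_squares_line(quiz, exam)

# Vertical distance from each point to the line (positive = above the line)
residuals = [y - (a + b * x) for x, y in zip(quiz, exam)]
total_above = sum(res for res in residuals if res > 0)
total_below = -sum(res for res in residuals if res < 0)

print(round(a, 2), round(b, 2))
print(round(total_above, 6), round(total_below, 6))  # the two totals agree
```

The two printed totals match (up to rounding), which is the sense in which the points above and below the least squares line balance out.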
The least squares regression equation used to plot the equation in Figure 5.7 is: \begin{align} &y = 1.15 + 1.05 x \text{ or} \\ &\text{predicted exam score = 1.15 + 1.05 Quiz}\end{align}

Interpretation of Y-Intercept
Y-Intercept = 1.15 points
Y-Intercept Interpretation: If a student has a quiz score of 0 points, one would expect that he or she would score 1.15 points on the exam. However, this y-intercept does not offer any logical interpretation in the context of this problem, because x = 0 is not in the sample. If you look at the graph, you will find the lowest quiz score is 56 points. So, while the y-intercept is a necessary part of the regression equation, by itself it provides no meaningful information about student performance on an exam when the quiz score is 0.

Interpretation of Slope
Slope = 1.05 = 1.05/1 = (change in exam score)/(1 unit change in quiz score)
Slope Interpretation: For every increase in quiz score by 1 point, you can expect that a student will score 1.05 additional points on the exam. In this example, the slope is a positive number, which is not surprising because the correlation is also positive. A positive correlation always leads to a positive slope and a negative correlation always leads to a negative slope.

Remember that we can also use this equation for prediction. So consider the following question: If a student has a quiz score of 85 points, what score would we expect the student to make on the exam? We can use the regression equation to predict the exam score for the student.
Exam = 1.15 + 1.05 Quiz = 1.15 + 1.05(85) = 90.4
Figure 5.8 verifies that when a quiz score is 85 points, the predicted exam score is about 90 points.

Figure 5.8. Prediction of Exam Score at a Quiz Score of 85 Points [axes: Quiz, Exam]

Example 5.6
Let's return now to the experiment to see the relationship between the number of beers you drink and your blood alcohol content (BAC) a half-hour later (scatterplot shown in ).
Figure 5.9 below shows the scatterplot with the regression line included. The line is given by
predicted Blood Alcohol Content = -0.0127 + 0.0180(# of beers)

Figure 5.9. Regression line relating # of beers consumed and blood alcohol content [axes: Number of beers consumed, Blood-alcohol content (BAC)]

Notice that four different students taking part in this experiment drank exactly 5 beers. For that group we would expect their average blood alcohol content to come out around -0.0127 + 0.0180(5) = 0.077. The line works really well for this group as 0.077 falls extremely close to the average for those four participants.
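The predictions in Examples 5.5 and 5.6 both come from plugging a value of x into a regression equation. A minimal Python sketch of that step is shown here; the intercepts and slopes are the ones given in the two examples, while the helper name `predict` is just an illustrative choice.

```python
def predict(intercept, slope, x):
    """Return the predicted y from the regression line y = intercept + slope * x."""
    return intercept + slope * x


# Example 5.5: predicted exam score for a quiz score of 85 points
exam_at_85 = predict(1.15, 1.05, 85)

# Example 5.6: predicted blood alcohol content after 5 beers
bac_at_5 = predict(-0.0127, 0.0180, 5)

print(round(exam_at_85, 2))  # 90.4
print(round(bac_at_5, 4))    # 0.0773
```

As noted for the y-intercept in Example 5.5, such predictions are only meaningful for x values inside the range covered by the sample.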
Think About It!
Question 1 Which of the following is not appropriate for studying the relationship (association) between two measurement variables?
Question 2 Which of the following is the range of possible values that a correlation can assume?
Question 3 The regression line for a set of points is given by \(y = -10 + 6 x\). What is the slope of the line?
Question 4 Describe the association found in the graph above.
Question 5 Suppose a correlation of -.65 is obtained from two measurement variables. Which of the following might represent the slope that would be found if a regression equation were calculated?
Question 6 If you make a scatterplot of the miles of highways versus the number of infant deaths for the fifty states you will find a moderate positive correlation. An obvious problem with viewing this as important evidence that building highways causes infant deaths is that:
Question 7 The correlation between the heights and weights of the people in a small room was 0.6 until former basketball star Shaquille O’Neal (7 ft 1 in and 335 lb) entered the room. At that point the correlation: