Statistics 101: Everyone needs to grapple with stats
By Peter Schwartzman
Statistics are everywhere, and their power to influence is immense. Unfortunately, it is very easy to be confused by them. But given how important and powerful they are, we, as informed citizens, must work extra hard to comprehend them and understand how they can be used. In what follows, I plan to introduce and tackle a few of the more common and vexing statistical concepts, and to illustrate how they might be used in an environmental context. So sit back, put your feet up, and let's get "statted."
Mean versus Median. Almost any scientifically related report uses the words "average" or "mean." These mean the same thing; pardon the pun. The average of a variable (say, temperature) is just the sum of all its values divided by the number of values. So, for instance, let's say the daily temperatures for a period of one week are (in Fahrenheit): 65, 70, 60, 65, 70, 50, and 60. The average (or mean) temperature for the week is 62.9°F. The median temperature for the week, however, is 65°F, which is the "middle" value of the dataset when the values are put in numerical order. Thus, here, the mean and median are not the same. In fact, it is common for the mean and median to differ, and this difference can lead people to focus on one value rather than the other, depending on their underlying goals. In the case presented, someone trying to demonstrate "global warming" might talk about the median, while someone trying to suggest "global cooling" might reference the mean/average (since it is "colder").
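For readers who want to check the arithmetic, here is a short Python sketch (standard library only) that computes the mean and median of the week of temperatures above:

```python
# Daily temperatures (°F) for the week, from the example above.
temps = [65, 70, 60, 65, 70, 50, 60]

# Mean: the sum of the values divided by how many there are.
mean = sum(temps) / len(temps)          # 440 / 7, about 62.9

# Median: the middle value once the data are sorted (odd-length list).
ordered = sorted(temps)                 # [50, 60, 60, 65, 65, 70, 70]
median = ordered[len(ordered) // 2]

print(f"mean = {mean:.1f}, median = {median}")  # mean = 62.9, median = 65
```

(For even-length datasets, the convention is to average the two middle values; Python's built-in `statistics.median` handles both cases.)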
But readers shouldn't think that the choice between them is usually made for manipulative purposes. The median, which is less commonly used, may be a more useful statistic than the mean in some instances. For example, if one wants to look at the number of children that people are having, it might make more sense to know the median rather than the average. This is because the median will be an integer (1, 2, 3, etc.), whereas the average is likely to be a number with a decimal component. And since people aren't having half-children, it may make more sense to look at the median value.
Correlation versus Causation. When we want to know the strength of a relationship between two variables, we can use a simple (relatively speaking, of course) calculation to assess it: we calculate a correlation (technically, a correlation coefficient). A "strong" relationship (represented by a correlation large in magnitude) indicates that the two variables tend to "follow" each other; that is, when one changes, the other changes as well (typically in a linear fashion). If the variables behave similarly (that is, increase together or decrease together), we say they are positively correlated; if they behave "inversely" (that is, one increases while the other decreases), we say they are negatively correlated.
So, for instance, we might think that warm days occur when it is sunny. We would therefore expect the correlation between the daily maximum temperature (written T-max) and the amount of daily sunshine to be high. It turns out, however, that these two variables are positively correlated in summertime, when the sunny days tend to be the warmest, but are often somewhat negatively correlated on winter days, when some of the coldest days are actually quite sunny. Thus, an examination of the relationship between the amount of sunshine and T-max would only show a high level of correlation if we looked at one season at a time, and wouldn't show much correlation at all if we looked at all days of the year.
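To make the calculation concrete, here is a minimal Python sketch of the (Pearson) correlation coefficient; the sunshine and temperature numbers are invented for illustration, not real observations:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.
    Ranges from -1 (perfectly inverse) to +1 (perfectly in step)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented "summer week": sunnier days are warmer -> strong positive correlation.
sunshine_hours = [6, 8, 10, 12, 14]
t_max          = [70, 74, 79, 83, 88]
print(round(pearson(sunshine_hours, t_max), 2))        # close to +1

# Reversing one variable flips the relationship -> strong negative correlation.
print(round(pearson(sunshine_hours, t_max[::-1]), 2))  # close to -1
```

Note that the coefficient is a pure number with no units; it measures how tightly the points cluster around a line, not the steepness of that line.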
In the field of environmental studies, researchers often ask whether two variables are related, and often the first step is to calculate the correlation between them. But just because two variables are correlated doesn't mean they are causally related (meaning that a change in one variable causes the other variable to change). For example, many people think that poor people have more children than wealthier ones. On a global scale, there does appear to be a relationship (i.e., a correlation) between the relative wealth of a country and the average family size found among its people. However, it would be misleading (and perhaps flatly wrong) to say that, based on a confirmed high correlation between these two variables, poverty causes families to be large (or that large families lead to conditions of poverty). While this conclusion may be true, it isn't established by a correlation calculation alone. This is one way correlation can be misused.
Since causation ("what causes something") cannot be easily ascertained in many cases, especially environmental ones (given the extremely large number of contributing factors), it is often best to reserve judgment on "cause" and adopt a different framework for understanding the problem. To determine the "cause" of something, we often have to conduct a much more complex set of analytical and statistical procedures. Even then, "proving" causation "beyond the shadow of a doubt" is very difficult. Let's look at just one example that illustrates why this might be.
Let's say that Chemical J has been shown to cause cancer in mice. In fact, it has been found to cause bladder cancer much more frequently than other forms of cancer. Now, let's say that a person named Joe is diagnosed with bladder cancer. Did Chemical J cause Joe's cancer? Obviously not, right? Well, actually, we don't know until we attempt to find out. First, we may want to know whether Joe had any exposure to Chemical J. Let's say he did, while working in a factory. But just because he was exposed doesn't mean that he got cancer from it. What do we ask next? There is a long list of questions we might ask, including: Have other employees in his workplace gotten bladder cancer? What other chemicals (or lifestyle choices, etc.) are associated with bladder cancer? Was Joe exposed to these? Are there groups of chemicals that work together to enhance each other's carcinogenicity? (Though it is not uncommon to find chemicals working together, synergistically, to cause something to happen, it is very difficult to conduct research that examines these relationships.) Are some people genetically predisposed to bladder cancer? Does Joe have these genetic traits? And yet, even after we get answers to all these questions, we cannot be sure that Joe got bladder cancer from Chemical J. This is because there are so many confounding factors in the development of cancer, and cancer often doesn't express itself until ten to thirty years after the relevant exposure. But since it is so important to know why he got bladder cancer (as part of a legal response, or to prevent others from getting it), it is stifling to realize that it is nearly impossible to know whether Chemical J caused Joe's cancer. So where does this leave us?
Since so many environmental problems (such as climate change, cancer clusters, respiratory disorders, and species extinction) have serious impacts on us, we have a great desire to know why these problems arise. As the previous example shows, our ability to determine the "why" (a question of causality) can be quite complicated, or nearly impossible. This realization can lead us in one of two directions. Either we can, as we largely do now, put decisions through a cost-benefit analysis, which requires assigning a monetary value to a human life and weighing a plethora of inherently uncertain statistical findings. Or it might lead us to live more humbly and avoid complicating the environment excessively and needlessly (through the use of synthetic pesticides, for instance), because we don't really know what the overall impacts will be. This latter position, to which I generally subscribe, doesn't mean that we must live in a "box" and avoid all exposure to risk; rather, it means that we don't allow the use and emission of potentially dangerous chemicals precisely because we are aware of the inherent uncertainty in any attempt to assess their impact and establish causality.
Correlation between variables shouldn't be ignored, however, nor should attempts to establish causality be abandoned. In fact, both correlation and causation have their place. They can be used to identify relationships (even potential or suspected ones) between variables. Ultimately, however, readers of statistical results (as found in any daily newspaper) should reserve judgment and maintain a skeptical position. Whether one is more skeptical about reports of environmental catastrophe or about claims of the amazing resilience of ecosystems is largely a personal decision, but it is one that we should all recognize is predicated on the certainty (or lack thereof) that we attribute to the power of statistical analysis and on the complexity of living systems.
Significance versus Importance. Probably the most commonly used word in mainstream reporting of statistics is "significant." What does it mean to say a result is significant? Does it mean that it is "important," or does it mean something else?
Within the world of statistics, when the word "significant" (or "significance") is used, it means that a result is "not by chance" and therefore represents a meaningful finding. Unfortunately, this term gets thrown around so much that it is very difficult to know what it means. For a clear demonstration of the confusion that can arise, consider the following example. Statement 1 reads: "The increase in global temperatures is significant." Statement 2 reads: "There is a significant increase in global temperatures." Statement 1 speaks to the "not by chance" character of the observation (here, the increase in global temperatures). Statement 2, on the other hand, says something about the importance of the result. And, no, these are not the same thing. Let's understand why. If a finding is "significant" (à la Statement 1), it means that it is a result (here, a relationship between two variables, time and temperature) that would be expected in random data less than a certain fraction of the time (usually 5%). So if the result isn't plausibly due to chance, it means that there is a relationship (here, a trend in temperature) that is real. If a result is said to be significant in extent (à la Statement 2), it means that we should be alert to its potential impact on something of importance (here, rising temperatures of a certain level may melt ice sheets and cause sea level to rise appreciably). Another example helps illustrate the distinction further.
Global population increased quite a lot during the 20th century. Currently it is rising at just above 1% per year. An analysis of the actual data from 1950-2005 reveals some interesting findings. The correlation between population and time (over this period) is 0.998, an extremely high value (the maximum being 1.00). (Interestingly, the correlation between the population growth rate, PGR (in %), and time is -0.68, indicative of the rather steady decline in the PGR since the early 1960s.) One might (falsely) conclude from this result that time is "causing" populations to grow. And while it certainly takes time for populations to grow, it is people procreating that "causes" populations to grow. And, arguably, populations are growing (rather than holding steady) because economic and political forces are motivating families to reproduce above replacement level (which is about 2.2 children per family).
Once we suspect a relationship (based on a high correlation value), we can perform a linear "regression" analysis to determine whether the trend is statistically significant. Simply put, this means we attempt to determine whether there exists a change in human population size that is "not expected given random data." Such an analysis here establishes that a trend does indeed exist (i.e., it is "statistically significant"). But even if a 1.68% annual growth in population (the average growth over the fifty-five-year period) is statistically significant, might it be more relevant to know whether this trend is important? In other words, will this trend create problems for societies in the future? Doesn't our answer to this depend on several other pieces of data, such as: "How big is the population to begin with?", "Are future people going to drive Hummers, or are they going to use public transportation?", and "Will oil or coal be our future energy source, or will the transition to a renewable-energy economy occur quickly?" How we answer these secondary questions will obviously affect our evaluation of the importance of a 1.68% growth rate. In some cases it will be very important; in others, much less so.
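To show what a significance test on a trend involves, here is a Python sketch that fits a least-squares line and computes the t-statistic of its slope (the slope divided by its standard error); a large t-statistic is what "statistically significant" means in practice. The population figures are rounded approximations of mid-century-to-2005 world-population estimates, used only to illustrate the mechanics:

```python
import math

def slope_and_t(xs, ys):
    """Least-squares slope and its t-statistic (slope / standard error).
    A large t means the trend would be very surprising in random,
    trend-free data, i.e., it is statistically significant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_resid = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(ss_resid / (n - 2) / sxx)
    return slope, slope / se

# Rounded world-population estimates (billions); illustrative, not the full series.
years = [1950, 1960, 1970, 1980, 1990, 2000, 2005]
pop   = [2.5,  3.0,  3.7,  4.4,  5.3,  6.1,  6.5]

slope, t = slope_and_t(years, pop)
print(f"slope = {slope * 1000:.0f} million people per year, t = {t:.0f}")
```

Even with only a handful of points, the t-statistic here dwarfs the usual 5% cutoff (roughly 2.6 for five degrees of freedom), so the upward trend is statistically significant. Whether it is *important* is the separate question raised above.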
From now on, I hope you will never be tricked into thinking that a median is a mean, or that significant findings are (necessarily) important. Statistics can be misused intentionally (as well as unintentionally), and therefore it is imperative that we all become more familiar with this "foreign" language. Given their ubiquity in our lives, stats compel us to avoid becoming mere statistics.
Peter Schwartzman (email: drearth1@gmail.com) is associate professor and chair of the Environmental Studies Program at Knox College. Father to two amazing girls, Peter hopes that their lives will be lived on a cleaner, more just planet. A nationally ranked Scrabble® junkie, he is also the founder and maintainer of websites dedicated to peace, empowerment, and environmental well-being: www.onehuman.org; www.blackthornhill.org; and www.chicagocleanpower.org.
Published 29 March 2007