Is Microsoft
Excel an Adequate Statistics Package?
It depends on what you want to do, but for many tasks, the
answer is ‘No’.
Excel is available to many people as part of Microsoft
Office. It contains some statistical functions in its basic
installation. It also comes with statistical routines in
the Data Analysis Toolpak, an add-in found separately on
the Office CD. You must install the Toolpak from the CD in
order to get these routines on the Tools menu. Once
installed, these routines are at the bottom of the Tools
menu, in the "Data Analysis" command. People use Excel as
their everyday statistics software because they have
already purchased it. Excel’s limitations, and occasionally
its errors, make this a problem. Below are some of the
concerns with using Excel for statistics that are recorded
in journals, on the web, and from personal experience.
Limitations
of Excel
1. Many statistical methods
are not available in Excel.
Excel's biggest problem. Commonly-used statistics and
methods NOT available within Excel include:
* Boxplots
* p-values for the correlation coefficient
* Spearman’s and Kendall’s rank correlation coefficients
* 2-way ANOVA with unequal sample sizes (unbalanced data)
* Multiple comparison tests (post-hoc tests following
ANOVA)
* p-values for two-way ANOVA
* Levene’s test for equal variance
* Nonparametric tests, including rank-sum and
Kruskal-Wallis
* Probability plots
* Scatterplot arrays or brushing
* Principal components or other multivariate methods
* GLM (generalized linear models)
* Survival analysis methods
* Regression diagnostics, such as Mallow’s Cp and PRESS (
it does compute adjusted r-squared)
* Durbin-Watson test for serial correlation
* LOESS smooths
Excel's lack of functionality makes it difficult to use for
more than computing summary statistics and simple
univariate regression. Third-party add-ins to Excel attempt
to compensate for these limitations, adding new
functionality to the program (see "A Partial Solution",
below).
2. Several Excel procedures
are misleading.
Probability plots are a standard way of judging the
adequacy of the normality assumption in regression. In
statistics packages, residuals from the regression are
easily, or in some cases automatically, plotted on a normal
probability plot. Excel’s regression routine provides a
Normal Probability Plot option. However, it produces a
probability plot of the Y variable, not of the residuals,
as would be expected.
Excel’s CONFIDENCE function computes z intervals using 1.96
for a 95% interval. This is valid only if the population
variance is known, which is never true for experimental
data. Confidence intervals computed using this function on
sample data will be too small. A t-interval should be used
instead.
Excel is inconsistent in the type of P-values it returns.
For most functions of probabilities, Excel acts like a
lookup table in a textbook, and returns one-sided p-values.
But in the TINV function, Excel returns a 2-sided p-value.
Look carefully at the documentation of any Excel function
you use, to be certain you are getting what you want.
Tables of standard distributions such as the normal and t
distributions return p-values for tests, or are used to
confidence intervals. With Excel, the user must be careful
about what is being returned. To compute a 95% t confidence
interval around the mean, for example, the standard method
is to look up the t-statistic in a textbook by entering the
table at a value of alpha/2, or 0.025. This t-statistic is
multiplied by the standard error to produce the length of
the t-interval on each side of the mean. Half of the error
(alpha/2) falls on each side of the mean. In Excel the TINV
function is entered using the value of alpha, not alpha/2,
to return the same number.
For a one-sided t interval at alpha=0.05, standard practice
would be to look up the t-statistic in a textbook for
alpha=0.05. In Excel, the TINV function must be called
using a value of 2*alpha, or 0.10, to get the value for
alpha=0.05. This nonstandard entry point has led several
reviewers to state that Excel’s distribution functions are
incorrect. If not incorrect, they are certainly
nonstandard. Make sure you read the help menu descriptions
carefully to know what each function produces.
3. Distributions are not
computed with precision.
NEW In reference (1), the authors show that all problems
found in Excel 97 are still there in Excel 2000 and XP.
They say that "Microsoft attempted to fix errors in the
standard normal random number generator and the inverse
normal function, and in the former case actually made the
problem worse." From this, you can assume that the problems
listed below are still there in the current versions of the
software.
Statistical distributions used by Excel do not agree with
better algorithms for those distributions at the third
digit and beyond. So they are approximately correct, but
not as exact as would be desired by an exacting
statistician. This may not be harmful for hypothesis tests
unless the third digit is of concern (a p-value of 0.056
versus 0.057). It is of most concern when constructing
intervals (multiplying a std dev of 35 times 1.96 give
68.6; times 1.97 gives 69.0) As summarized in reference 2:
"…the statistical distributions of Excel already have
been assessed by Knusel (1998), to which we refer the
interested reader. He found numerous defects in the various
algorithms used to compute several distributions, including
the Normal, Chi-square, F and t, and summarized his results
concisely: So one has to warn statisticians against using
Excel functions for scientific purposes. The performance of
Excel in this area can be judged unsatisfactory."
4. Routines for handling
missing data were incorrect.
This was the largest error in Excel, but a 'band-aid' has
been added in Office 2000. In earlier versions of Excel,
computations and tests were flat out wrong when some of the
data cells contained missing values, even for simple
summary statistics. See (3) , (5), and page 4 of (6). Error
messages are now displayed in Excel 2000 when there are
missing values, and no result is given. Although this is
still inferior to computing correct results it is somewhat
of an improvement.
In reference to pre-2000,
"Excel does not calculate the paired t-test correctly
when some observations have one of the measurements but not
the other." E. Goldwater, ref. (5)
5. Regression routines are
incorrect for multicollinear data.
This affects multiple regression. A good statistics package
will report errors due to correlations among the X
variables. The Variance Inflation Factor (VIF) is one
measure of collinearity. Excel does not compute
collinearity measures, does not warn the user when
collinearity is present, and reports parameter estimates
that may be nonsensical. See (6) for an example on data
from an experiment. Are multicollinear data of concern in
‘practical’ problems? I think so -- I find many examples of
collinearity in environmental data sets.
Excel also requires the X variables to be in contiguous
columns in order to input them to the procedure. This can
be done with cut and paste, but is certainly annoying if
many multiple regression models are to be built.
6. Ranks of tied data are
computed incorrectly.
When ranking data, standard practice is to assign tied
ranks to tied observations. The value of these ranks should
equal the median of the ranks that the observations would
have had, if they had not been tied. For example, three
observations tied at a value of 14 would have had the ranks
of 7, 8 and 9 had they not been tied. Each of the three
values should be assigned the rank of 8, the median of 7, 8
and 9.
Excel assigns the lowest of the three ranks to all three
observations, giving each a rank of 7. This would result in
problems if Excel computed rank-based tests. Perhaps it is
fortunate none are available.
7. Many of Excel's charts
violate standards of good graphics.
Use of perspective and glitz (donut charts?) violate basic
principles of graphics. Excel's charts are more suitable to
USA Today than to scientific reports. This bothers some
people more than others.
"Good graphs should….[a list of traits]…However, Excel
meets virtually none of these criteria. The vast majority
of chart types produced by Excel should never be
used!" -- Jon Cryer, ref (3).
"Microsoft Excel is an example of a package that does
not allow enough user control to consistently make readable
and concise graphs from tables."
- A. Gelman et al., 2002, The American Statistician 56,
p.123.
A partial solution:
Some of these difficulties (parts of 1,2, 6 and 7) can be
overcome by using a good set of add-in routines. One of the
best is StatPlus, which comes with an excellent textbook,
"Data Analysis with Microsoft Excel". With StatPlus, Excel
becomes an adequate statistical tool., though still not in
the areas of multiple regression and ANOVA for more than
one factor. Without this add-in Excel is inadequate for
anything beyond basic summary statistics and simple
regression.
Data Analysis with Microsoft Excel by Berk and Carey
published by Duxbury (2000).
Opinion: Get this book
if you're going to use Excel for statistics.
(I have no connection with the authors of StatPlus and get
no benefit from this recommendation. I'm just a satisfied
user.)
Some advice from
others:
"If you need to perform analysis of variance, avoid
using Excel, unless you are dealing with extremely simple
problems."
- Statistical Services Centre, Univ. of Reading, U.K. (at
A, below)
"Enterprises should advise their scientists and
professional statisticians not to use Microsoft Excel for
substantive statistical analysis. Instead, enterprises
should look for professional statistical analysis software
certified to pass the (NIST) Statistical Reference Datasets
tests to their users' required level of accuracy."
- The
Gartner Group
References:
1) On the accuracy of statistical procedures in Microsoft
Excel 2000 and Excel XP
B.D. McCullough and B. Wilson, (2002), Computational Statistics & Data
Analysis, 40, pp 713 - 721
(2) On the accuracy of statistical procedures in Microsoft
Excel ‘97
B.D. McCullough and B. Wilson, (1999), Computational Statistics & Data
Analysis, 31, pp 27-37
(3) Problems with using Microsoft Excel for statistics [pdf
Download]
J.D. Cryer, (2001), presented at the Joint Statistical
Meetings, American Statistical Association, 2001, Atlanta
Georgia
[pdf download]
(4) Use of Excel for statistical
analysis
Neil Cox, (2000), AgResearch Ruakura
(5) Using Excel for statistical data analysis
Eva Goldwater, (1999), Univ. of Massachusetts Office of
Information Technology
[pdf download]
(6) Statistical analysis using Microsoft Excel [pdf
download]
Jeffrey Simonoff, (2002)
[pdf download]
(7) Spreadsheet addiction
Patrick Burns
(8) On the Accuracy of Statistical Distributions in
Microsoft Excel 97
Leo Knuesel
[pdf download]
(9) Statistical flaws in Excel
Hans Pottel
[pdf download]
Guides to Excel on the
web:
(A) A Beginner's Guide to Excel - Univ.
of Reading, UK
(B) An Intermediate Guide to Excel -
Univ. of Reading, UK
Note: All opinions other than those cited as coming from
others are my own.