Is Microsoft
Excel an Adequate Statistics Package?
It depends on what you want
to do, but for a great many tasks, the answer is ‘No’.
Excel is available to many people as part of Microsoft
Office. People often use Excel as their everyday statistics
software because they have already purchased it. Excel’s
limitations, and its errors, make this a very questionable
practice for scientific applications. For business
applications where questions might be simpler and precision
not as necessary, Excel may be just fine. Below are some of
the concerns with using Excel for statistics that are
recorded in journals, on the web, and from personal
experience. References for more detail are listed at the
bottom of this page.
Some advice from
others:
“Meanwhile, researchers should continue to avoid using
the statistical functions in Excel 2007 for any scientific
purpose.”
- Yalta (2008), ref 1 below
“...
it is not safe to assume that Microsoft Excel’s
statistical procedures give the correct answer. Persons who
wish to conduct statistical analyses should use some other
package.”
- McCullough and Heiser (2008), ref 2 below
"If you need to perform analysis of
variance, avoid using Excel, unless you are dealing with
extremely simple problems."
- Statistical Services Centre, Univ. of Reading, U.K. (at
A, below)
"Excel is of very limited use in the formal statistical
analysis of data unless your experimental design is very
simple. . . . the "Data Analysis Toolpack" provided with
Excel is no easier to use than most statistics packages,
has very limited capability, has known bugs and so, on the
whole, is not worth bothering with. "
-Neil Cox, ref 7 below
"Enterprises should advise their scientists and
professional statisticians not to use Microsoft Excel for
substantive statistical analysis. Instead, enterprises
should look for professional statistical analysis software
certified to pass the (NIST) Statistical Reference Datasets
tests to their users' required level of accuracy."
- The
Gartner Group
Limitations
of Excel
1. Many statistical methods
are not available in Excel.
Commonly-used statistics and methods NOT available within
Excel include:
* Boxplots
* p-values for the correlation coefficient
* Spearman’s and Kendall’s rank correlation coefficients
* 2-way ANOVA with unequal sample sizes (unbalanced data)
* Multiple comparison tests (post-hoc tests following
ANOVA)
* p-values for two-way ANOVA
* Levene’s test for equal variance
* Nonparametric tests, including rank-sum and
Kruskal-Wallis
* Scatterplot arrays or brushing
* Principal components or other multivariate methods
* GLM (generalized linear models)
* Survival analysis methods
* Regression diagnostics, such as Mallow’s Cp and PRESS
(it does compute adjusted r-squared)
* Durbin-Watson test for serial correlation
* LOESS smooths
Excel still requires the X variables to be in contiguous
columns in order to input them to a multiple regression.
This can be done with cut and paste, but is certainly
annoying if many multiple regression models are to be
built.
Excel's lack of functionality limits its use to only simple
statistical tasks. For alternatives, see the 3 Solutions
presented below.
2. Several Excel procedures
are misleading/wrong.
Probability plots are a standard way of judging the
adequacy of the normality assumption in regression. In
statistics packages, residuals from the regression are
easily, or in some cases automatically, plotted on a normal
probability plot. Excel’s regression routine provides a
Normal Probability Plot option. However, it produces a
uniform distribution probability plot for Y instead of a
normal probability plot for the residuals, even though it
is labelled as the latter. The plot is therefore useless
and misleading for judging the adequacy of regression
residuals.
Excel’s CONFIDENCE function computes z intervals using 1.96
for a 95% interval. This is valid only if the population
variance is known, which is never true for experimental
data. Confidence intervals computed using this function on
sample data will be too small. A t-interval should be used
instead.
Excel is inconsistent in the type of p-values it returns.
For most functions of probabilities, Excel acts like a
lookup table in a textbook, and returns one-sided p-values.
But in the TINV function, Excel returns a 2-sided p-value
by default. Look carefully at the documentation of any
Excel function you use, to be certain you are getting what
you want. Excel's t-test is not the issue, it is when you
use the underlying function that you may not get what you
are expecting. For example, to compute a 95% t confidence
interval around the mean, the standard method is to look up
the t-statistic in a textbook by entering the table at a
value of alpha/2, or 0.025. This t-statistic is multiplied
by the standard error to produce the length of the
t-interval on each side of the mean. Half of the error
(alpha/2) falls on each side of the mean. In Excel the TINV
function is entered using the value of alpha, not alpha/2,
to return the same number. In Excel 2007 this is at least
more obvious in the metadata provided onscreen.
The SOLVER function is used to solve nonlinear equations,
including approximate maximum-likelihood solutions. SOLVER
unfortunately frequently returns a solution prior to
finding a global minimum or maximum, and so presents
incorrect slopes and other coefficients as if they were
correct (ref 2). In short, it can give back the wrong
answers! In testing by McCullough and Heiser (ref 2), the
wrong answer was returned 41% of the time.
3. Distributions are not
computed with precision.
In reference 1, Yalta presents a table of numerical errors
reported for Excel 97, carrying them through for later
versions. Almost all are listed as “Not fixed” or “Poor
fix” in Excel 2000, 2003 and now in 2007. In short,
essentially all numerical problems found in Excel 97 are
still there in Excel 2007.
Statistical distributions used by Excel do not agree with
better algorithms for those distributions at the third
digit and beyond. So p-values are approximately correct,
but not as exact as would be desired by a statistician.
This may not be harmful for hypothesis tests unless the
third digit is of concern (a p-value of 0.056 versus
0.057). It is of most concern when constructing intervals
(multiplying a std dev of 35 times 1.96 give 68.6; times
1.97 gives 69.0) As summarized in reference 4:
"…the statistical distributions of Excel already have
been assessed by Knusel (1998), to which we refer the
interested reader. He found numerous defects in the various
algorithms used to compute several distributions, including
the Normal, Chi-square, F and t, and summarized his results
concisely: So one has to warn statisticians against using
Excel functions for scientific purposes. The performance of
Excel in this area can be judged unsatisfactory."
4. Regression routines are
incorrect for multicollinear data.
This affects multiple regression. A good statistics package
will report errors due to correlations among the X
variables. The Variance Inflation Factor (VIF) is one
measure of collinearity. Excel does not compute
collinearity measures, does not warn the user when
collinearity is present, and reports parameter estimates
that may be nonsensical. See ref (8) for an example on data
from an experiment. Are multicollinear data of concern in
‘practical’ problems? I think so -- I find many examples of
collinearity in environmental data sets.
5. Ranks of tied data are
nonstandard.
When ranking data, standard practice is to assign tied
ranks to tied observations. The value of these ranks should
equal the median of the ranks that the observations would
have had, if they had not been tied. For example, three
observations tied at a value of 14 would have had the ranks
of 7, 8 and 9 had they not been tied. Each of the three
values should be assigned the rank of 8, the median of 7, 8
and 9.
Excel assigns the lowest of the three ranks to all three
observations, giving each a rank of 7. This would result in
problems if Excel computed rank-based tests. Perhaps it is
fortunate none are available.
6. Many of Excel's charts
violate standards of good graphics.
Use of perspective and glitz (donut charts!) violate basic
principles of graphics. Excel's charts are more suitable to
USA Today than to scientific reports. This bothers some
people more than others.
Excel "charts create chartjunk that hinder peoples’
ability to comprehend the data. ... those who want to use
Excel are advised to get to know the principles of good
graphing well enough so that they know how to choose the
appropriate options to override the defaults."
-- Y-S Su, ref (3).
"Good graphs should….[a list of traits]…However, Excel
meets virtually none of these criteria. The vast majority
of chart types produced by Excel should never be
used!" -- Jon Cryer, ref (6).
Possible
Solutions
Solution #1: Excel
add-ins
Some of these limitations can be overcome by using add-in
routines that avoid using Excel’s computation procedures.
Many of these were evaluated in our newsletters reviewing
statistical software for under $600 (see our Aug and Sept 2008
newsletters). These programs use Excel for data
handling, but their own routines for data analysis. They
vary widely in which routines are included - again, see
our newsletters. A newer option is RExcel, an R module
that will call external R functions from inside of an
Excel worksheet. The accuracy and functionality of the
free, GNU license R software can be available within the
familiar Excel environment. See the RExcel site to view a video demonstrating RExcel, and
download the RExcel module. A textbook manual “R Through
Excel: A Spreadsheet Interface for Statistics, Data
Analysis, and Graphics” by the developers of RExcel,
Heilberger and Neuwirth, is available from your favorite
technical bookseller.
Solution #2: Alternatives to
Excel
Yalta (ref 1) states that p-values [inverse probability
distributions] reported by the free OpenOffice’s Calc spreadsheet
and the open-source
Gnumeric spreadsheet do not have
the same numerical problems as does Excel - their
programmers used accurate algorithms. I have not seen a
detailed analysis of results from these programs,
however, so caution is advised until you check their
results against a true statistics package. These two
programs, both of which read Excel data spreadsheets,
might provide an alternate and more accurate set of the
simpler statistical test procedures within a spreadsheet
environment than does Excel.
Solution #3: Purchase
statistics software, or use R (industry standard and
free)
Our November 2007
newsletter reviewed the capabilities of
commercial statistics software. Our Summer 2006
newsletter introduced the free R statistical
software environment. All of our courses are
taught with either commercial software, or with R. We
couldn't get through the routines presented in the first
day of our introductory Applied Environmental Statistics
course if using Excel.
References
and Links
Articles:
(1) The accuracy of statistical distributions in Microsoft
Excel 2007
A.T. Yalta, (2008). Computational Statistics and Data
Analysis 52, pp. 4579-4586
(2) On the accuracy of statistical procedures in Microsoft
Excel 2007
B.D. McCullough and D.A. Heiser, (2008). Computational Statistics & Data
Analysis 52, pp. 4570-4578
(3) It’s easy to produce chartjunk using Microsoft Excel
2007 but hard to make good graphs
Y-S. Su, (2008). Computational Statistics and Data
Analysis 52, pp. 4594-4601
(4) On the accuracy of statistical procedures in Microsoft
Excel 2003
B.D. McCullough and B. Wilson, (2005), Computational Statistics & Data
Analysis, 49, pp. 1244-1252
(5) On the accuracy of statistical procedures in Microsoft
Excel ‘97
B.D. McCullough and B. Wilson, (1999), Computational Statistics & Data
Analysis, 31, pp. 27-37
(6) Problems with using Microsoft Excel for statistics
J.D. Cryer, (2001), presented at the Joint Statistical
Meetings, American Statistical Association, 2001, Atlanta
Georgia
[pdf download]
(7) Use of Excel for statistical
analysis
Neil Cox, (2000), AgResearch Ruakura
(8) Statistical analysis using Microsoft Excel
Jeffrey Simonoff, (2002)
[pdf download]
(9) On the Accuracy of Statistical Distributions in
Microsoft Excel 97
Leo Knuesel
[pdf download]
(10) Statistical flaws in Excel
Hans Pottel
[pdf download]
Websites with more detail:
(1) Spreadsheet addiction
Patrick Burns
(2) Errors, faults and fixes for Excel
statistical functions and routines
David Heiser
(3) Data Analysis and Visualization with
Excel, R and Google Tools
ProcessTrends.com
(4) Excel’s New Charting Engine: Preview of
an Opportunity Missed [pdf]
Stephen Few
Guides to Excel on the
web:
(A) A Beginner's Guide to Excel -
Univ. of Reading, UK
(B) An Intermediate Guide to Excel -
Univ. of Reading, UK
(C) RExcel: Using R from within Excel
- Univ. of North Texas, USA
Note: All opinions other than those cited as coming from
others are my own.