Is Microsoft Excel an Adequate Statistics Package?
Changes in Excel 2010 have improved its use for statistics considerably. For earlier versions of Excel, however, the answer is generally ‘No’. The following refers to versions prior to 2010 (2011 on the Mac). More on the recent improvements coming here soon.
Excel is available to many people as part of Microsoft Office. People often use Excel as their everyday statistics software because they have already purchased it. Excel’s limitations, and its errors, make this a very questionable practice for scientific applications. For business applications where questions might be simpler and precision not as necessary, Excel may be just fine. Below are some of the concerns with using Excel for statistics that are recorded in journals, on the web, and from personal experience. References for more detail are listed at the bottom of this page.
Some advice from others:
“Meanwhile, researchers should continue to avoid using the statistical functions in Excel 2007 for any scientific purpose.”
- Yalta (2008), ref 1 below
“... it is not safe to assume that Microsoft Excel’s statistical procedures give the correct answer. Persons who wish to conduct statistical analyses should use some other package.”
- McCullough and Heiser (2008), ref 2 below
"If you need to perform analysis of variance, avoid using Excel, unless you are dealing with extremely simple problems."
- Statistical Services Centre, Univ. of Reading, U.K. (at A, below)
"Excel is of very limited use in the formal statistical analysis of data unless your experimental design is very simple. . . . the "Data Analysis Toolpack" provided with Excel is no easier to use than most statistics packages, has very limited capability, has known bugs and so, on the whole, is not worth bothering with. "
-Neil Cox, ref 7 below
"Enterprises should advise their scientists and professional statisticians not to use Microsoft Excel for substantive statistical analysis. Instead, enterprises should look for professional statistical analysis software certified to pass the (NIST) Statistical Reference Datasets tests to their users' required level of accuracy."
- The Gartner Group
Limitations of Excel
1. Many statistical methods are not available in Excel.
Commonly-used statistics and methods NOT available within Excel include:
* p-values for the correlation coefficient
* Spearman’s and Kendall’s rank correlation coefficients
* 2-way ANOVA with unequal sample sizes (unbalanced data)
* Multiple comparison tests (post-hoc tests following ANOVA)
* p-values for two-way ANOVA
* Levene’s test for equal variance
* Nonparametric tests, including rank-sum and Kruskal-Wallis
* Scatterplot arrays or brushing
* Principal components or other multivariate methods
* GLM (generalized linear models)
* Survival analysis methods
* Regression diagnostics, such as Mallow’s Cp and PRESS
(it does compute adjusted r-squared)
* Durbin-Watson test for serial correlation
* LOESS smooths
Excel still requires the X variables to be in contiguous columns in order to input them to a multiple regression. This can be done with cut and paste, but is certainly annoying if many multiple regression models are to be built.
Excel's lack of functionality limits its use to only simple statistical tasks. For alternatives, see the 3 Solutions presented below.
2. Several Excel procedures are misleading/wrong.
Probability plots are a standard way of judging the adequacy of the normality assumption in regression. In statistics packages, residuals from the regression are easily, or in some cases automatically, plotted on a normal probability plot. Excel’s regression routine provides a Normal Probability Plot option. However, it produces a uniform distribution probability plot for Y instead of a normal probability plot for the residuals, even though it is labelled as the latter. The plot is therefore useless and misleading for judging the adequacy of regression residuals.
Excel’s CONFIDENCE function computes z intervals using 1.96 for a 95% interval. This is valid only if the population variance is known, which is never true for experimental data. Confidence intervals computed using this function on sample data will be too small. A t-interval should be used instead.
Excel is inconsistent in the type of p-values it returns. For most functions of probabilities, Excel acts like a lookup table in a textbook, and returns one-sided p-values. But in the TINV function, Excel returns a 2-sided p-value by default. Look carefully at the documentation of any Excel function you use, to be certain you are getting what you want. Excel's t-test is not the issue, it is when you use the underlying function that you may not get what you are expecting. For example, to compute a 95% t confidence interval around the mean, the standard method is to look up the t-statistic in a textbook by entering the table at a value of alpha/2, or 0.025. This t-statistic is multiplied by the standard error to produce the length of the t-interval on each side of the mean. Half of the error (alpha/2) falls on each side of the mean. In Excel the TINV function is entered using the value of alpha, not alpha/2, to return the same number. In Excel 2007 this is at least more obvious in the metadata provided onscreen.
The SOLVER function is used to solve nonlinear equations, including approximate maximum-likelihood solutions. SOLVER unfortunately frequently returns a solution prior to finding a global minimum or maximum, and so presents incorrect slopes and other coefficients as if they were correct (ref 2). In short, it can give back the wrong answers! In testing by McCullough and Heiser (ref 2), the wrong answer was returned 41% of the time.
3. Distributions are not computed with precision.
In reference 1, Yalta presents a table of numerical errors reported for Excel 97, carrying them through for later versions. Almost all are listed as “Not fixed” or “Poor fix” in Excel 2000, 2003 and now in 2007. In short, essentially all numerical problems found in Excel 97 are still there in Excel 2007.
Statistical distributions used by Excel do not agree with better algorithms for those distributions at the third digit and beyond. So p-values are approximately correct, but not as exact as would be desired by a statistician. This may not be harmful for hypothesis tests unless the third digit is of concern (a p-value of 0.056 versus 0.057). It is of most concern when constructing intervals (multiplying a std dev of 35 times 1.96 give 68.6; times 1.97 gives 69.0) As summarized in reference 4:
"…the statistical distributions of Excel already have been assessed by Knusel (1998), to which we refer the interested reader. He found numerous defects in the various algorithms used to compute several distributions, including the Normal, Chi-square, F and t, and summarized his results concisely: So one has to warn statisticians against using Excel functions for scientific purposes. The performance of Excel in this area can be judged unsatisfactory."
Also see this free article on the accuracy of statistics in spreadsheets by the Journal of Statistical Software (American Statistical Association).
4. Regression routines are incorrect for multicollinear data.
This affects multiple regression. A good statistics package will report errors due to correlations among the X variables. The Variance Inflation Factor (VIF) is one measure of collinearity. Excel does not compute collinearity measures, does not warn the user when collinearity is present, and reports parameter estimates that may be nonsensical. See ref (8) for an example on data from an experiment. Are multicollinear data of concern in ‘practical’ problems? I think so -- I find many examples of collinearity in environmental data sets.
5. Ranks of tied data are nonstandard.
When ranking data, standard practice is to assign tied ranks to tied observations. The value of these ranks should equal the median of the ranks that the observations would have had, if they had not been tied. For example, three observations tied at a value of 14 would have had the ranks of 7, 8 and 9 had they not been tied. Each of the three values should be assigned the rank of 8, the median of 7, 8 and 9.
Excel assigns the lowest of the three ranks to all three observations, giving each a rank of 7. This would result in problems if Excel computed rank-based tests. Perhaps it is fortunate none are available.
6. Many of Excel's charts violate standards of good graphics.
Use of perspective and glitz (donut charts!) violate basic principles of graphics. Excel's charts are more suitable to USA Today than to scientific reports. This bothers some people more than others.
Excel "charts create chartjunk that hinder peoples’ ability to comprehend the data. ... those who want to use Excel are advised to get to know the principles of good graphing well enough so that they know how to choose the appropriate options to override the defaults."
-- Y-S Su, ref (3).
"Good graphs should….[a list of traits]…However, Excel meets virtually none of these criteria. The vast majority of chart types produced by Excel should never be used!" -- Jon Cryer, ref (6).
Solution #1: Excel add-ins
Some of these limitations can be overcome by using add-in routines that avoid using Excel’s computation procedures. Many of these were evaluated in our newsletters reviewing statistical software for under $600 (see our Aug and Sept 2008 newsletters). These programs use Excel for data handling, but their own routines for data analysis. They vary widely in which routines are included - again, see our newsletters. A newer option is RExcel, an R module that will call external R functions from inside of an Excel worksheet. The accuracy and functionality of the free, GNU license R software can be available within the familiar Excel environment. See the RExcel site to view a video demonstrating RExcel, and download the RExcel module. A textbook manual “R Through Excel: A Spreadsheet Interface for Statistics, Data Analysis, and Graphics” by the developers of RExcel, Heilberger and Neuwirth, is available from your favorite technical bookseller.
Most of these add-ins alter your copy of Excel. Perform a complete backup prior to installing them, in case you don't find them helpful and want to go back to the pristine version. Uninstalling add-ins is not always simple.
Solution #2: Alternatives to Excel
Yalta (ref 1) states that p-values [inverse probability distributions] reported by the free OpenOffice’s Calc spreadsheet and the open-source Gnumeric spreadsheet do not have the same numerical problems as does Excel - their programmers used accurate algorithms. I have not seen a detailed analysis of results from these programs, however, so caution is advised until you check their results against a true statistics package. These two programs, both of which read Excel data spreadsheets, might provide an alternate and more accurate set of the simpler statistical test procedures within a spreadsheet environment than does Excel.
Solution #3: Purchase statistics software, or use R (industry standard and free)
Our November 2007 newsletter reviewed the capabilities of commercial statistics software. Our Summer 2006 newsletter introduced the free R statistical software environment. All of our courses are taught with either commercial software, or with R. We couldn't get through the routines presented in the first day of our introductory Applied Environmental Statistics course if using Excel.
References and Links
(1) The accuracy of statistical distributions in Microsoft Excel 2007
A.T. Yalta, (2008). Computational Statistics and Data Analysis 52, pp. 4579-4586
(2) On the accuracy of statistical procedures in Microsoft Excel 2007
B.D. McCullough and D.A. Heiser, (2008). Computational Statistics & Data Analysis 52, pp. 4570-4578
(3) It’s easy to produce chartjunk using Microsoft Excel 2007 but hard to make good graphs
Y-S. Su, (2008). Computational Statistics and Data Analysis 52, pp. 4594-4601
(4) On the accuracy of statistical procedures in Microsoft Excel 2003
B.D. McCullough and B. Wilson, (2005), Computational Statistics & Data Analysis, 49, pp. 1244-1252
(5) On the accuracy of statistical procedures in Microsoft Excel ‘97
B.D. McCullough and B. Wilson, (1999), Computational Statistics & Data Analysis, 31, pp. 27-37
(6) Problems with using Microsoft Excel for statistics
J.D. Cryer, (2001), presented at the Joint Statistical Meetings, American Statistical Association, 2001, Atlanta Georgia
(7) Use of Excel for statistical analysis
Neil Cox, (2000), AgResearch Ruakura
(8) Statistical analysis using Microsoft Excel
Jeffrey Simonoff, (2002)
(9) On the Accuracy of Statistical Distributions in Microsoft Excel 97
(10) Statistical flaws in Excel
Websites with more detail:
(1) Spreadsheet addiction
(2) Errors, faults and fixes for Excel statistical functions and routines
(3) Data Analysis and Visualization with Excel, R and Google Tools
(4) Excel’s New Charting Engine: Preview of an Opportunity Missed [pdf]
Guides to Excel on the web:
(A) A Beginner's Guide to Excel - Univ. of Reading, UK
(B) An Intermediate Guide to Excel - Univ. of Reading, UK
(C) RExcel: Using R from within Excel - Univ. of North Texas, USA
Note: All opinions other than those cited as coming from others are my own.