Researchers painstakingly search for that p value under 0.05, which indicates that data sets are significantly different from each other. It’s a quest that, while earning a special star in your bar graph, is often misunderstood. Choosing the appropriate test, conducting a power analysis and determining whether your data are normally distributed are all steps needed to ensure that you can actually reject the null hypothesis if p < 0.05. However, more often than not, students input data into their statistics program and go straight to the p value. If it’s low enough, the experiment is over and they did a good job. This isn’t sufficient to generate sound inferences from data, and more journals should demand details on the statistical methods of prospective studies.
Before the start of an experiment, you must determine how many variables there are and whether each one is independent or dependent. Independent variables are those that are consciously modified to generate an effect, and dependent variables are the ones to be measured (e.g. if we want to see the effects of yearly income on number of TVs owned, yearly income would be the independent variable and number of TVs would be dependent). Put another way, the dependent variables change based on the independent variables.
Once the number and type of variables have been determined, the next step is finding out if the data collected for each group are normally distributed. This is important because parametric statistical tests (t-tests, analysis of variance, regression analysis, etc.) assume that the data fit a predictable normal, or Gaussian, bell curve. If the data are not normal, there are equivalent non-parametric statistical tests (Mann-Whitney U test, Kruskal-Wallis test, etc.) that should be used in place of the parametric ones.
Tests for normality
Figure 1 shows a histogram of normally distributed data. The curve is symmetric around the mean, which also happens to be the mode and median. Normal distributions also adhere to the empirical rule, which states that about 68% of values lie within 1 standard deviation of the mean, about 95% within 2 standard deviations and about 99.7% within 3.
To see if a data set fits a normal distribution, there are many tests that can be done. The more common ones include plotting a graph like Figure 1 and taking note of the skewness and kurtosis. Skewness refers to a slant in the graph, where more of the data sit on one side than the other. Kurtosis is the sharpness of the peak (statistics software usually reports excess kurtosis, which is zero for a normal curve). The closer these measures are to zero, the more normal the data. One way to see if they are too big is to divide the skewness and kurtosis values by their respective standard errors, which gives the skewness and kurtosis z-values. Both should be between -1.96 and 1.96 for a normal distribution.
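The z-value check for skewness and kurtosis can be sketched in a few lines of Python. This is only an illustration: the function name `skew_kurtosis_z` and the sample data are made up for this example, and the standard-error formulas are the commonly used small-sample ones, so your statistics package may report slightly different numbers.

```python
import math
import random


def skew_kurtosis_z(data):
    """Return (skewness z-value, excess-kurtosis z-value); needs at least 4 values."""
    n = len(data)
    mean = sum(data) / n
    # Central moments of the sample (dividing by n).
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n

    g1 = m3 / m2 ** 1.5       # moment skewness
    g2 = m4 / m2 ** 2 - 3.0   # excess kurtosis (0 for a normal curve)

    # Small-sample corrected estimates of skewness and excess kurtosis.
    skew = g1 * math.sqrt(n * (n - 1)) / (n - 2)
    kurt = ((n + 1) * g2 + 6) * (n - 1) / ((n - 2) * (n - 3))

    # Standard errors, which depend only on the sample size.
    se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))

    return skew / se_skew, kurt / se_kurt


# Hypothetical data: one bell-shaped sample and one strongly right-skewed sample.
random.seed(1)
samples = {
    "gaussian": [random.gauss(50, 10) for _ in range(200)],
    "exponential": [random.expovariate(1.0) for _ in range(200)],
}
for name, sample in samples.items():
    z_skew, z_kurt = skew_kurtosis_z(sample)
    looks_normal = abs(z_skew) < 1.96 and abs(z_kurt) < 1.96
    print(f"{name}: z_skew={z_skew:.2f}, z_kurt={z_kurt:.2f}, within +/-1.96: {looks_normal}")
```

With large samples even small departures from normality push the z-values past 1.96, so this check is best read alongside the histogram rather than on its own.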
Other than visual inspection of the data, a common test is the Shapiro-Wilk test, which can be computed by most statistics software. It tests the null hypothesis that a data set came from a population that fits a normal distribution. If the p-value is below the set α (usually 0.05, though stricter values such as 0.01 or 0.001 are also used), then the null hypothesis is rejected and the data do not fit a normal distribution. If the p-value is above α, there is no evidence that the data depart from normality.
Finally, a probability plot that may be used to interpret normality is the Q-Q plot. This plot compares two distributions, and in the case of a single data set, compares it to a theoretical normal distribution. The closer the points lie to a straight reference line (y=x when the data are standardized), the more normal they are. What makes the Q-Q plot unique, however, is that it plots the quantiles of the data against the quantiles predicted by the cumulative distribution function of the theoretical distribution. Doing this by hand would be tedious, but statistics software packages can create the plot for you. An example of a Q-Q plot, graciously donated by Wikipedia, that shows a normal distribution is in Figure 2.
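To show what the software is doing when it builds a Q-Q plot, here is a minimal sketch that computes the plotted points using only the Python standard library. The helper names `qq_points` and `straightness`, the plotting position (i - 0.5)/n and the sample data are assumptions made for this illustration; packages differ in the exact plotting position they use, and drawing the plot itself would need a plotting library, which is not shown.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def qq_points(data):
    """Return (theoretical quantile, standardized observed value) pairs."""
    n = len(data)
    mu, sigma = mean(data), stdev(data)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, x in enumerate(sorted(data), start=1):
        p = (i - 0.5) / n                    # plotting position (one common choice)
        theoretical = std_normal.inv_cdf(p)  # quantile expected under normality
        observed = (x - mu) / sigma          # standardized data value
        points.append((theoretical, observed))
    return points


def straightness(points):
    """Pearson correlation of the Q-Q points; values near 1 mean a straight line."""
    xs = [t for t, _ in points]
    ys = [o for _, o in points]
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


# Hypothetical sample drawn from a normal distribution.
random.seed(0)
sample = [random.gauss(100, 15) for _ in range(50)]
pts = qq_points(sample)
for theoretical, observed in pts[:5]:
    print(f"theoretical={theoretical:+.2f}  observed={observed:+.2f}")
print(f"correlation with a straight line: {straightness(pts):.3f}")
```

Because the observed values are standardized here, points from normal data should fall close to y=x.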
Power analysis
The law of large numbers states that as we keep taking more and more samples, the mean of the sample gets closer and closer to the mean of the population. And really, our goal is to draw inferences about populations from a reasonable number of samples. But how many samples (N value) are enough to make this inference? For many experiments in too many labs, this number is the lowest N at which SPSS will run the statistics and statistical significance can be achieved. In reality, such a low N value leads to very low power: real effects are likely to be missed, and the significant results that do turn up are more likely to be false positives or to overstate the size of the effect.
Power analysis is a required tool that determines the probability that your design will detect an effect, if it exists. In terms of hypothesis testing, power refers to sensitivity, or the probability of rejecting the null hypothesis when it is false. This is related to the two types of error researchers make in statistical analysis, called Type I and Type II error. If the null hypothesis is rejected when it shouldn’t have been, a Type I error was committed; the probability of this is α. A Type II error is committed when the null hypothesis is not rejected in a case where it should have been; the probability of this is β. According to our definition, power is the probability of correctly rejecting the null hypothesis, or 1-β. The lower the power, the greater the chance of committing a Type II error.
Conducting power analysis is extremely useful because it allows calculation of the minimal sample size required to detect an effect. Three factors come into play when determining power: the size of the effect in the population, the number of samples and α. Power values range from 0 to 1, and the most commonly accepted level is 0.8, though this depends on your field (search PubMed for reviews on the topic in your specific area).
The actual calculation of power can be done in many statistics software packages, and changing any one of the factors mentioned above changes the others.
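To get a feel for how effect size, sample size and α interact, here is a rough sketch for the simplest case: a two-sided comparison of two equally sized groups. It uses a normal approximation rather than the exact t-based calculation a statistics package would run, and the function name `approx_power` is made up for this example, so treat the numbers as ballpark figures.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison (normal approximation)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)   # critical value for the chosen alpha
    shift = effect_size * sqrt(n_per_group / 2)  # how far the true effect moves the test statistic
    # Probability that the test statistic lands beyond either critical value.
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


# Power rises with sample size and effect size, and falls when alpha is made stricter.
for n in (10, 20, 40, 64, 100):
    print(f"d=0.5, alpha=0.05, n per group={n:3d}: power={approx_power(0.5, n):.2f}")
for alpha in (0.05, 0.01, 0.001):
    print(f"d=0.5, n per group=64, alpha={alpha}: power={approx_power(0.5, 64, alpha):.2f}")
```

When the effect size is zero the function returns α itself, which is a useful sanity check: with no real effect, the only way to reject the null hypothesis is a Type I error.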
How do I calculate these factors for power analysis?
- α – As mentioned above, α is the probability of committing a Type I error, also known as a false positive. Scientists have generally accepted 0.05 as the conventional value for α, but again, review the literature in your field and see if a lower α like 0.01 or 0.001 is used.
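- Effect size – An entire post should be dedicated to effect size because it allows researchers to tell if any observed differences are meaningful regardless of their significance. Data sets can be significantly different from each other because of a high N, yet have a very low effect size because the real difference is so small. The equation for calculating effect size (ES) is:
$$ES = \frac{\bar{x}_1 - \bar{x}_2}{\sigma}$$
where $\bar{x}_1$ and $\bar{x}_2$ refer to the mean of each group and σ is the standard deviation of both data sets. To get the pooled standard deviation, use the following formula:
$$\sigma = \sqrt{\frac{(n_1 - 1)\sigma_1^2 + (n_2 - 1)\sigma_2^2}{n_1 + n_2 - 2}}$$
where $n_1$, $n_2$ are the group sizes and $\sigma_1$, $\sigma_2$ are the standard deviations of each group. When both groups are the same size, this reduces to $\sqrt{(\sigma_1^2 + \sigma_2^2)/2}$.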
The larger the effect size, the larger the real difference between your groups. An effect size of 0.5 or greater is a common target (by Cohen's benchmarks, 0.2 is small, 0.5 is medium and 0.8 is large), but your field might be different.
- N value – This is likely to be the unknown in our calculation, because we fix power at the accepted 0.8 and solve for N. A catch with power analysis is that it should be conducted before the experiment, which is obviously difficult because effect size requires the means of the data. Therefore, educated guesses (for example, from pilot data or published studies in your field) are the best hope to get a feel for how many samples are required in the study. Since there are generally accepted values for all the criteria, they can be used to determine the minimum number of samples needed to get a significant, meaningful and powerful result.
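Putting the three factors together, the sketch below estimates an effect size from two groups using the pooled standard deviation, then solves for the N per group needed to reach a target power. The function names, the pilot numbers and the use of a normal approximation are all assumptions made for this illustration; dedicated software uses a t-based calculation and will typically ask for a slightly larger N.

```python
from math import ceil, sqrt
from statistics import NormalDist, mean, variance


def cohens_d(group_a, group_b):
    """Effect size: difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def n_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate N per group for a two-sided, two-group comparison."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = std_normal.inv_cdf(power)          # quantile matching the target power
    return ceil(2 * ((z_alpha + z_power) / abs(effect_size)) ** 2)


# Hypothetical pilot measurements for a control and a treated group.
control = [4.1, 5.0, 4.6, 5.3, 4.8, 4.4]
treated = [5.2, 5.9, 4.9, 6.1, 5.6, 5.0]
d = cohens_d(treated, control)
print(f"pilot effect size: {d:.2f}")
print(f"N per group for that effect: {n_per_group(d)}")

# Using the generally accepted values instead of pilot data.
print(f"N per group for d=0.5, alpha=0.05, power=0.8: {n_per_group(0.5)}")
```

With the generally accepted values (effect size 0.5, α of 0.05, power of 0.8), this approximation asks for 63 samples per group. Effect sizes estimated from a handful of pilot samples are noisy, so an N based on them is a starting point rather than a guarantee.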
Which test to use?
The final part of conducting a solid statistical analysis (and power analysis) is choosing the right test for your data. Consult the handy chart made by the smart kids at UCLA to see which test should be run, given the number and types of variables and whether the data are normal or not.
For more reading:
- Ghasemi, A., Zahediasl, S. Normality tests for statistical analysis: a guide for non-statisticians. Int J Endocrinol Metab 10(2). 486-9 (2012)
- Murphy, K.R., Myors, B., Wolach, A.H. Statistical Power Analysis: A Simple and General Model for Traditional and Modern Hypothesis Tests, 3rd Ed. Routledge, 2008