Proper sample size calculation is both a scientific and ethical imperative.

Researchers are routinely asked to justify the number of animals used in their studies, whether by regulatory bodies, funding agencies, or, increasingly, journal editors. In accordance with the 3R's (Replacement, Reduction, Refinement), studies should be designed to reduce the number of animals used to meet scientific objectives. While the ethical reasons for such reductions are obvious, it is also ethically important to rigorously test experimental hypotheses when the results may directly impact human health.

Underpowered studies, those with too few animal subjects to reliably detect the effect of interest, may produce ambiguous or misleading results, impeding scientific progress and the reproducibility of research findings. As such, underpowered studies subject animals to experimentation unnecessarily and violate the 3R's principles.

## Power and Sample Size Calculations

Ensuring that an experiment uses a large enough sample size to be reproducible is a critical aspect of experimental design. Power, the ability to reliably detect differences between experimental groups, depends on several factors:

- **Sample size (n)** - the number of subjects in each experimental group
- **Effect size** - the magnitude of the difference between groups (including the variance of the data, as appropriate)
- **α** - the probability of a false positive finding (Type I error - incorrectly rejecting the null hypothesis), typically set at 0.05
- **β** - the probability of a false negative finding (Type II error - incorrectly failing to reject the null hypothesis), typically set at 0.1-0.2
- **Power (1-β)** - the probability of detecting a true positive (correctly rejecting the null hypothesis), typically set at 0.8-0.9

If an effect size is known, various methods can be used to calculate an appropriate sample size for a desired level of α and power^{1}^{,}^{2}. Multiple online calculators and software packages can be used for such calculations (see below). An extensive review of this subject is beyond the scope of this article, and researchers are encouraged to consult a statistician; however, several important factors should be considered.
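
Under common simplifying assumptions (two-sided test, comparison of two group means, normally distributed outcomes), the standard normal-approximation formula can be sketched in a few lines of Python using only the standard library. The function names below are illustrative, not from any particular package:

```python
from math import ceil, sqrt
from statistics import NormalDist  # standard library, Python 3.8+

def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate n per group for a two-sided, two-sample comparison of
    means: n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2, where d is the
    standardized effect size (difference in means / SD). Exact t-test
    methods give a slightly larger n for small samples."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(2 * (z_sum / effect_size) ** 2)

def min_detectable_effect(n_per_group, alpha=0.05, power=0.8):
    """Invert the same formula: the smallest standardized effect that is
    reliably detectable with a fixed n per group."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return z_sum * sqrt(2 / n_per_group)

print(sample_size_per_group(0.5))           # 0.5 SD effect → 63 per group
print(round(min_detectable_effect(16), 2))  # 16 per group → ~0.99 SD
```

Note that the relationship is quadratic: halving the effect size to be detected quadruples the required n, which is why the choice of minimum scientifically important effect size matters so much.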

## Effect Size

The actual effect size in an experiment is rarely known beforehand, and neither is the variance in the data. These are usually approximations informed by historical or pilot study data, which may or may not reflect the outcomes of a proposed experiment. When establishing the effect size for sample size calculations, it is critical to set this value at the lower end of what would be considered scientifically important, as it determines the minimum difference that can be reliably detected with that sample size. For example, if the sample size is calculated to detect a difference of 2 standard deviations, that n would not be sufficient to detect any smaller effect with confidence.

## α and Power

A second critical factor is determining appropriate levels for α and power (1-β). To a non-statistician, these values often represent opportunities for confusion^{1}.

The **Type I error rate** (α) is easiest to grasp; this is the false positive rate and corresponds to the *p* value threshold for statistical hypothesis testing. The 'standard' α value of 0.05 accepts a 1 in 20 chance of declaring a difference between groups that is not real (i.e. one occurring only by chance). As such, α and *p* values are easily misunderstood; they only support, but cannot prove, that two groups are different, and they are easily subject to bias.

As an example of bias, suppose twenty research groups around the world are testing the same hypothesis: Drug A causes Effect B. At an α level of 0.05, there is a good probability that at least one of these groups will produce data showing that Drug A does cause Effect B, even if this is not in fact true. Given that positive findings are more readily published than negative findings (i.e. publication/reporting bias), the effect may be reported as real in this hypothetical situation, even if the other 19 groups failed to detect an effect.
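
The arithmetic behind this scenario is simple: with 20 independent tests of a true null hypothesis at α = 0.05, the chance that at least one returns a false positive is 1 - 0.95^20, or about 64%. A one-liner makes this concrete:

```python
def prob_at_least_one_false_positive(k, alpha=0.05):
    """Probability that at least one of k independent tests of a true null
    hypothesis reaches 'significance' at level alpha."""
    return 1 - (1 - alpha) ** k

print(round(prob_at_least_one_false_positive(20), 2))  # → 0.64
```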

This statistical reality emphasizes the importance of reproducing studies and reporting negative results. Furthermore, some argue that a *p* value threshold of 0.01 or lower may be more appropriate than the 0.05 standard.

The **power of an experiment** (1-β) is related to, but independent of, α. It corresponds to the probability of detecting a true positive (rejecting the null hypothesis when an alternative hypothesis is true). A higher-powered experiment has a greater chance of detecting an effect if one exists. Generally, power is set to 0.8 or higher, with high-risk experiments often using greater power levels (e.g. toxicology studies, in which it is important to have high confidence of detecting effects).

## Sample Size Calculation Resources

- PS: Power and Sample Size Calculation (Windows, free) - Software package from Vanderbilt University for multiple types of power analysis.
- G*Power (Windows/OSX, free) - Multi-platform software package from Universität Düsseldorf for comprehensive power analysis calculations.
- Sample Size Calculations - Description of sample size calculations from the IACUC at Boston University, including an Excel template for calculation based on means/standard deviations and proportions.
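
As a cross-check on such tools, the approximate power of a simple two-group design can be computed directly. This is a normal-approximation sketch (the function name is illustrative), which slightly overstates power for very small samples compared with exact t-test methods:

```python
from math import sqrt
from statistics import NormalDist  # standard library, Python 3.8+

def achieved_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample comparison of means
    with standardized effect size d and n subjects per group."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    noncentrality = effect_size * sqrt(n_per_group / 2)
    return z.cdf(noncentrality - z_alpha)

# 64 animals per group for a 0.5 SD effect gives roughly 80% power:
print(round(achieved_power(0.5, 64), 2))  # → 0.81
```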

## Consequences of Underpowered Experiments

Underpowered research studies are far too common in the biological sciences^{3}^{,}^{4}. Several reviews of the literature have emphasized this problem, showing that many studies use power levels well below the 'standard' 0.8 level. This is a particularly well-known problem within the neuroscience field, in which published studies showing statistically significant effects often have an apparent power level as low as 0.2^{3}.

Why is this a problem? Low-powered studies have a much greater chance of failing to detect an effect (a higher chance of a false negative), but if they do detect an effect, aren't those conclusions still valid? Since α is independent of power, a low-powered study can still have a reasonably small chance of reporting a false positive.

However, there are several problems with this (technically correct) assertion.

- Studies with small sample sizes tend to produce inflated estimates of the actual effect size, which can lead to spurious conclusions of statistical significance.
- Positive predictive value (PPV) is rarely considered in the experimental biological sciences. Unlike the false positive rate (α), PPV indicates how likely a positive result is to be a true positive, and it depends on both α and power (1-β), as well as the prior probability that the tested hypothesis is true. An experiment with a very stringent α but low power will have a low PPV (its positive results are less likely to reflect true effects). A consequence of performing studies with a low PPV is that the findings may not be reproducible or generalizable to the greater population (of either mice or humans).
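
PPV can be computed directly from α, power, and an assumed prior probability that tested hypotheses are true. In the sketch below, the 10% prior is purely an illustrative assumption:

```python
def ppv(power, alpha=0.05, prior=0.1):
    """Positive predictive value: probability that a 'significant' result
    is a true positive. `prior` is the assumed fraction of tested
    hypotheses that are actually true (0.1 here is illustrative)."""
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

print(round(ppv(0.8), 2))  # adequately powered → 0.64
print(round(ppv(0.2), 2))  # underpowered      → 0.31
```

Even at the same α, dropping power from 0.8 to 0.2 roughly halves the chance that a published positive finding is real.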

## Experimental Power and Reproducibility

The biological sciences have been criticized for a lack of reproducibility and predictability, and the common use of underpowered studies is a major contributor to this problem. By increasing power, the scientific community can have more confidence in published results.

This, of course, is separate from the multiple sources of bias that exist in the performance and reporting of scientific studies. Nevertheless, there is a major push to ensure that animal studies are sufficiently powered to produce reliable and predictive results.
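
The effect-size inflation noted above (sometimes called the 'winner's curse') is easy to demonstrate by simulation. The setup below (a 0.5 SD true effect, n = 10 per group, significance via a simple z-test with known SD) is an illustrative assumption, not drawn from any particular study:

```python
import random
from math import sqrt

# Simulate many underpowered two-group experiments, then look only at the
# effects that the 'significant' ones would report.
random.seed(1)
TRUE_EFFECT, N, TRIALS = 0.5, 10, 2000
se = sqrt(2 / N)  # standard error of the difference in means (SD = 1)

significant_effects = []
for _ in range(TRIALS):
    control = [random.gauss(0, 1) for _ in range(N)]
    treated = [random.gauss(TRUE_EFFECT, 1) for _ in range(N)]
    diff = sum(treated) / N - sum(control) / N
    if abs(diff) / se > 1.96:  # 'significant' at alpha = 0.05, two-sided
        significant_effects.append(abs(diff))

inflated = sum(significant_effects) / len(significant_effects)
print(round(inflated, 2))  # well above the true effect of 0.5
```

Because only large (lucky) differences clear the significance threshold at this sample size, the average reported effect among significant results substantially exceeds the true effect.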

## Increasing the Power of Your Animal Experiments

- Calculate sample size based on the minimum effect sizes of scientific importance, with appropriate levels of α and power (consult a statistician as needed), and faithfully incorporate this sample size into experiments
- Base sample sizes on statistical analysis, not convenience (e.g. caging density, litter sizes) or costs (animal costs, personnel costs)
- Report the rationale for the selection of sample size, including details of power calculations, as per the ARRIVE guidelines
- Account for animal attrition over the study duration when setting sample sizes
- Increase effect size to increase power with fewer subjects:
  - Optimize experimental protocols to maximize the difference between experimental and control groups, if ethically and scientifically valid
  - e.g. choose an appropriate/optimal inbred mouse background that responds best in the intended model
- Decrease experimental variation to increase power with fewer subjects:
  - Ensure that inbred strains and GEMs have high-quality genetic backgrounds
  - Ensure that animals are free of pathogens
  - Control for microbiome-related effects
  - Minimize environmental stressors
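
The payoff from the last set of recommendations can be quantified: required sample size scales with (SD/Δ)², so halving experimental variation cuts the animals needed roughly fourfold. The normal-approximation sketch below (function name illustrative) makes this concrete:

```python
from math import ceil
from statistics import NormalDist  # standard library, Python 3.8+

def n_per_group(delta, sd, alpha=0.05, power=0.8):
    """n per group to detect a raw difference `delta` between two group
    means when the outcome has standard deviation `sd`
    (two-sided test, normal approximation)."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(2 * (z_sum * sd / delta) ** 2)

# Same biological effect (0.5 units), but halving the SD through tighter
# experimental control cuts the animals needed per group by ~4x:
print(n_per_group(0.5, sd=1.0))  # → 63
print(n_per_group(0.5, sd=0.5))  # → 16
```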