From Wikipedia, the free encyclopedia
Sample size determination is the act of choosing the number of observations or replicates to include in a statistical sample. The sample size is an important feature of any empirical study in which the goal is to make inferences about a population from a sample. In practice, the sample size used in a study is determined by the expense of data collection and the need for sufficient statistical power. In complicated studies there may be several different sample sizes involved: for example, in a survey using stratified sampling there would be a different sample size for each stratum (subpopulation). In a census, data are collected on the entire population, hence the sample size is equal to the population size. In experimental design, where a study may be divided into different treatment groups, there may be different sample sizes for each group.
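Sample sizes may be chosen in several different ways, for example on the basis of the cost of data collection, a target precision for the resulting estimate, or a target power for a hypothesis test.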
How samples are collected is discussed in sampling (statistics) and survey data collection.
Larger sample sizes generally lead to increased precision when estimating unknown parameters. For example, if we wish to know the proportion of a certain species of fish that is infected with a pathogen, we would generally have a more accurate estimate of this proportion if we sampled and examined 200 rather than 100 fish. Several fundamental facts of mathematical statistics describe this phenomenon, including the law of large numbers and the central limit theorem.
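The fish example can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the true infection rate of 0.3, the number of replications and the function name are all hypothetical choices made for illustration, not values from the text.

```python
import random
import statistics

def spread_of_estimates(n, true_p=0.3, replications=2000, seed=0):
    """Standard deviation of the sample proportion over many simulated samples of size n.

    true_p is a hypothetical infection rate chosen only for this illustration.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(replications):
        # Count how many of the n sampled fish are infected.
        infected = sum(rng.random() < true_p for _ in range(n))
        estimates.append(infected / n)
    return statistics.stdev(estimates)

spread_100 = spread_of_estimates(100)
spread_200 = spread_of_estimates(200)
print(f"n=100: spread {spread_100:.4f}")
print(f"n=200: spread {spread_200:.4f}")
```

Under these assumptions the estimates from 200 fish cluster more tightly around the true proportion than those from 100 fish, with the spread shrinking roughly in proportion to the square root of the sample size.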
In some situations, the increase in accuracy for larger sample sizes is minimal, or even nonexistent. This can result from the presence of systematic errors or strong dependence in the data, or if the data follow a heavy-tailed distribution.
Sample sizes are judged based on the quality of the resulting estimates. For example, if a proportion is being estimated, one may wish to have the 95% confidence interval be less than 0.06 units wide. Alternatively, sample size may be assessed based on the power of a hypothesis test. For example, if we are comparing the support for a certain political candidate among women with the support for that candidate among men, we may wish to have 80% power to detect a difference in the support levels of 0.04 units.
A relatively simple situation is estimation of a proportion. For example, we may wish to estimate the proportion of residents in a community who are at least 65 years old.
The estimator of a proportion is $\hat{p} = X/n$, where X is the number of 'positive' observations (e.g. the number of people out of the n sampled people who are at least 65 years old). When the observations are independent, this estimator has a (scaled) binomial distribution (and is also the sample mean of data from a Bernoulli distribution). The variance of this distribution is p(1 − p)/n, which attains its maximum of 0.25/n when the true parameter is p = 0.5. In practice, since p is unknown, the maximum variance is often used for sample size assessments.
For sufficiently large n, the distribution of $\hat{p}$ will be closely approximated by a normal distribution.^{[1]} Using this approximation, it can be shown that around 95% of this distribution's probability lies within 2 standard deviations of the mean. Using the Wald method for the binomial distribution, an interval of the form
$$\left(\hat{p} - 2\sqrt{\frac{0.25}{n}},\; \hat{p} + 2\sqrt{\frac{0.25}{n}}\right)$$
will form a 95% confidence interval for the true proportion. If this interval needs to be no more than W units wide, the equation
$$4\sqrt{\frac{0.25}{n}} = W$$
can be solved for n, yielding^{[2]}^{[3]} n = 4/W^{2} = 1/B^{2}, where B = W/2 is the error bound on the estimate, i.e., the estimate is usually given as within ± B. So, for B = 10% one requires n = 100, for B = 5% one needs n = 400, for B = 3% the requirement is roughly n = 1000 (more precisely, about 1111), while for B = 1% a sample size of n = 10000 is required. These numbers are quoted often in news reports of opinion polls and other sample surveys.
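The rule n = 1/B^{2} is easy to evaluate directly. The following is a minimal sketch; the function name is hypothetical, and the calculation assumes the worst case p = 0.5 and the "2 standard deviations" approximation used above.

```python
import math

def sample_size_for_proportion(error_bound):
    """Worst-case (p = 0.5) sample size so that an approximate 95% interval
    for a proportion is no wider than +/- error_bound, using n = 1 / B**2."""
    if error_bound <= 0:
        raise ValueError("error_bound must be positive")
    # Round before taking the ceiling so that floating-point noise
    # (e.g. 99.99999999999999) does not change the answer.
    return math.ceil(round(1.0 / error_bound ** 2, 9))

for bound in (0.10, 0.05, 0.01):
    print(f"B = {bound:.0%}: n = {sample_size_for_proportion(bound)}")
```

Halving the error bound quadruples the required sample size, which is why very tight margins become expensive quickly.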
A proportion is a special case of a mean. When estimating the population mean using an independent and identically distributed (iid) sample of size n, where each data value has variance σ^{2}, the standard error of the sample mean is:
$$\frac{\sigma}{\sqrt{n}}$$
This expression describes quantitatively how the estimate becomes more precise as the sample size increases. Using the central limit theorem to justify approximating the sample mean with a normal distribution yields an approximate 95% confidence interval of the form
$$\left(\bar{x} - \frac{2\sigma}{\sqrt{n}},\; \bar{x} + \frac{2\sigma}{\sqrt{n}}\right)$$
If we wish to have a confidence interval that is W units in width, we would solve
$$\frac{4\sigma}{\sqrt{n}} = W$$
for n, yielding the sample size n = 16σ^{2}/W^{2}.
For example, if we are interested in estimating the amount by which a drug lowers a subject's blood pressure with a confidence interval that is six units wide, and we know that the standard deviation of blood pressure in the population is 15, then the required sample size is 100.
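The blood pressure example can be checked with a short calculation of n = 16σ^{2}/W^{2}. This is a sketch only; the function name is hypothetical and the formula relies on the "2 standard errors" approximation to a 95% interval.

```python
import math

def sample_size_for_mean(sigma, width):
    """Sample size so that an approximate 95% confidence interval for a mean
    is `width` units wide, using n = 16 * sigma**2 / width**2."""
    if sigma <= 0 or width <= 0:
        raise ValueError("sigma and width must be positive")
    # Round before the ceiling to guard against floating-point noise.
    return math.ceil(round(16.0 * sigma ** 2 / width ** 2, 9))

# Standard deviation 15, interval six units wide.
print(sample_size_for_mean(sigma=15, width=6))
```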
A common problem faced by statisticians is calculating the sample size required to yield a certain power for a test, given a predetermined Type I error rate α. This can be estimated from pre-computed tables for certain values, by Mead's resource equation, or, more generally, by using the cumulative distribution function:
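Sample size per group for a two-sample t-test at significance level 0.05, by desired power and Cohen's d^{[4]}

Power   Cohen's d
        0.2    0.5    0.8
0.25     84     14      6
0.50    193     32     13
0.60    246     40     16
0.70    310     50     20
0.80    393     64     26
0.90    526     85     34
0.95    651    105     42
0.99    920    148     58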
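The table can be used in a two-sample t-test to estimate the sample sizes of an experimental group and a control group that are of equal size; that is, the total number of individuals in the trial is twice the number given, and the desired significance level is 0.05.^{[4]} The parameters used are the desired statistical power of the trial (rows) and Cohen's d (columns), the effect size, which is the expected difference between the group means divided by the expected standard deviation.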
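The tabulated values can be roughly cross-checked with a normal approximation, n ≈ 2((z_{1−α/2} + z_{power})/d)^{2} per group. This is a sketch and not the method used to construct the table: it assumes a two-sided test, the function name is hypothetical, and because it ignores the correction for the t-distribution it tends to land at or slightly below the tabulated values.

```python
import math
from statistics import NormalDist

def approx_n_per_group(power, cohens_d, alpha=0.05):
    """Normal-approximation sample size per group for comparing two means.

    Assumes a two-sided test at level alpha and equal group sizes.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = std_normal.inv_cdf(power)          # quantile for the target power
    return math.ceil(2 * ((z_alpha + z_power) / cohens_d) ** 2)

for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {approx_n_per_group(0.80, d)} per group at 80% power")
```

For exact values, a calculation based on the noncentral t-distribution would be needed.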
Mead's resource equation is often used for estimating sample sizes of laboratory animals, as well as in many other laboratory experiments. It may not be as accurate as other methods of estimating sample size, but it gives an indication of an appropriate sample size where parameters such as expected standard deviations or expected differences in values between groups are unknown or very hard to estimate.^{[5]}
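All the parameters in the equation are degrees of freedom, so 1 is subtracted from each underlying count before it is inserted into the equation.
The equation is:^{[5]}
$$E = N - B - T,$$
where:
N is the total number of individuals or units in the study (minus 1);
B is the blocking component, representing environmental effects allowed for in the design (minus 1);
T is the treatment component, corresponding to the number of treatment groups, including the control group (minus 1);
E is the degrees of freedom of the error component, which should be somewhere between 10 and 20.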
For example, if a study using laboratory animals is planned with four treatment groups (T=3), with eight animals per group, making 32 animals total (N=31), without any further stratification (B=0), then E would equal 28, which is above the cutoff of 20, indicating that sample size may be a bit too large, and six animals per group might be more appropriate.^{[6]}
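The arithmetic of the example can be reproduced with a few lines of code. This sketch assumes the equation has the form E = N − B − T, which is consistent with the numbers in the example (31 − 0 − 3 = 28); the function and parameter names are hypothetical.

```python
def mead_error_df(total_units, treatment_groups, blocks=1):
    """Error degrees of freedom E = N - B - T from Mead's resource equation.

    Each argument is a raw count; 1 is subtracted from each to obtain
    the corresponding degrees of freedom. blocks=1 means no stratification.
    """
    n = total_units - 1        # N: total degrees of freedom
    t = treatment_groups - 1   # T: treatment degrees of freedom
    b = blocks - 1             # B: blocking degrees of freedom
    return n - b - t

print(mead_error_df(total_units=32, treatment_groups=4))  # eight animals per group
print(mead_error_df(total_units=24, treatment_groups=4))  # six animals per group
```

With eight animals per group E is 28, and with six per group it drops to 20, the cutoff mentioned in the example.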
Let X_{i}, i = 1, 2, ..., n be independent observations taken from a normal distribution with unknown mean μ and known variance σ^{2}. Let us consider two hypotheses, a null hypothesis:
$$H_0: \mu = 0$$
and an alternative hypothesis:
$$H_a: \mu = \mu^*$$
for some 'smallest significant difference' μ^{*} > 0. This is the smallest value for which we care about observing a difference. Now, if we wish to (1) reject H_{0} with a probability of at least 1 − β when H_{a} is true (i.e. a power of 1 − β), and (2) reject H_{0} with probability α when H_{0} is true, then we need the following:
If z_{α} is the upper α percentage point of the standard normal distribution, then
$$\Pr\left(\bar{x} > \frac{z_\alpha \sigma}{\sqrt{n}} \,\Big|\, H_0 \text{ true}\right) = \alpha$$
and so
$$\text{reject } H_0 \text{ if the sample average } \bar{x} \text{ is more than } \frac{z_\alpha \sigma}{\sqrt{n}}$$
is a decision rule which satisfies (2). (Note that this is a one-tailed test.)
Now we wish for this to happen with a probability at least 1 − β when H_{a} is true. In this case, our sample average will come from a normal distribution with mean μ^{*}. Therefore we require
$$\Pr\left(\bar{x} > \frac{z_\alpha \sigma}{\sqrt{n}} \,\Big|\, H_a \text{ true}\right) \geq 1 - \beta$$
Through careful manipulation, this can be shown to happen when
$$n \geq \left(\frac{z_\alpha + \Phi^{-1}(1-\beta)}{\mu^*/\sigma}\right)^2$$
where Φ is the normal cumulative distribution function.
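The resulting condition, taken here to be n ≥ ((z_{α} + Φ^{−1}(1 − β)) / (μ^{*}/σ))^{2}, can be evaluated numerically. The following is a minimal sketch for the one-tailed test with known variance described above; the function names and the example values (μ^{*} = 0.5, σ = 1) are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def z_test_sample_size(mu_star, sigma, alpha=0.05, power=0.80):
    """Smallest n for a one-tailed z-test of H0: mu = 0 against Ha: mu = mu_star
    with Type I error rate alpha and the requested power (1 - beta)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)  # upper alpha point of N(0, 1)
    z_power = STD_NORMAL.inv_cdf(power)      # Phi^{-1}(1 - beta)
    return math.ceil(((z_alpha + z_power) / (mu_star / sigma)) ** 2)

def achieved_power(n, mu_star, sigma, alpha=0.05):
    """Probability of rejecting H0 when the true mean is mu_star."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    return STD_NORMAL.cdf(mu_star * math.sqrt(n) / sigma - z_alpha)

n = z_test_sample_size(mu_star=0.5, sigma=1.0)
print(n, round(achieved_power(n, 0.5, 1.0), 3))
```

Checking the achieved power at n and at n − 1 is a simple way to confirm that the returned value is the smallest sample size meeting the requirement.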
With more complicated sampling techniques, such as stratified sampling, the sample can often be split up into subsamples. Typically, if there are H such subsamples (from H different strata) then each of them will have a sample size n_{h}, h = 1, 2, ..., H. These n_{h} must conform to the rule that n_{1} + n_{2} + ... + n_{H} = n (i.e. that the total sample size is given by the sum of the subsample sizes). Selecting these n_{h} optimally can be done in various ways, using (for example) Neyman's optimal allocation.
There are many reasons to use stratified sampling:^{[7]} to decrease variances of sample estimates, to use partly nonrandom methods, or to study strata individually. A useful, partly nonrandom method would be to sample individuals where easily accessible, but, where not, sample clusters to save travel costs.^{[8]}
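In general, for H strata, a weighted sample mean is
$$\bar{x}_w = \sum_{h=1}^{H} W_h \bar{x}_h,$$
with
$$\operatorname{Var}(\bar{x}_w) = \sum_{h=1}^{H} W_h^2 \operatorname{Var}(\bar{x}_h).$$
The weights, $W_h$, frequently, but not always, represent the proportions of the population elements in the strata, and $W_h = N_h/N$, where $N_h$ is the number of population elements in stratum h and N is the population size. For a fixed sample size, that is $n = \sum_h n_h$,
$$\operatorname{Var}(\bar{x}_w) = \sum_{h=1}^{H} W_h^2 S_h^2 \left(\frac{1}{n_h} - \frac{1}{N_h}\right),$$
which can be made a minimum if the sampling rate within each stratum is made proportional to the standard deviation within each stratum: $n_h/N_h = k S_h$, where $S_h$ is the standard deviation within stratum h and k is a constant such that $\sum_h n_h = n$.
An "optimum allocation" is reached when the sampling rates within the strata are made directly proportional to the standard deviations within the strata and inversely proportional to the square root of the sampling cost per element within the strata, $C_h$:
$$\frac{n_h}{N_h} = \frac{K S_h}{\sqrt{C_h}},$$
where K is a constant such that $\sum_h n_h = n$, or, more generally, when
$$n_h = \frac{K' W_h S_h}{\sqrt{C_h}}.$$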
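An allocation of this kind can be computed by making each stratum's share of the sample proportional to its weight times its standard deviation, divided by the square root of its per-element cost. The sketch below assumes that form of the rule; the function name and the example strata are hypothetical, and the fractional results are left unrounded so the caller can decide how to round them.

```python
def optimum_allocation(total_n, weights, std_devs, costs=None):
    """Split total_n across strata in proportion to W_h * S_h / sqrt(C_h).

    With costs omitted (equal cost per element) this reduces to
    allocation proportional to W_h * S_h.
    """
    if costs is None:
        costs = [1.0] * len(weights)
    if not (len(weights) == len(std_devs) == len(costs)):
        raise ValueError("weights, std_devs and costs must have equal length")
    scores = [w * s / c ** 0.5 for w, s, c in zip(weights, std_devs, costs)]
    total_score = sum(scores)
    return [total_n * score / total_score for score in scores]

# Three hypothetical strata: population shares and within-stratum standard deviations.
allocation = optimum_allocation(100, weights=[0.5, 0.3, 0.2], std_devs=[10, 20, 30])
print([round(a, 2) for a in allocation])
```

In this illustration the smallest stratum receives as large a share as the middle one because its higher variability offsets its smaller weight.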
