
Introduction

This is the first of three modules that address the second area of statistical inference, hypothesis testing, in which a specific statement or hypothesis is generated about a population parameter, and sample statistics are used to assess the likelihood that the hypothesis is true. The hypothesis is based on available information and the investigator’s belief about the population parameters. The process of hypothesis testing involves setting up two competing hypotheses, the null hypothesis and the alternative hypothesis. One selects a random sample (or multiple samples when there are more comparison groups), computes summary statistics and then assesses the likelihood that the sample data support the research or alternative hypothesis. Similar to estimation, the process of hypothesis testing is based on probability theory and the Central Limit Theorem.
This module will focus on hypothesis testing for means and proportions. The next two modules in this series will address analysis of variance and chi-squared tests.

Learning Objectives

After completing this module, the student will be able to:

  1. Define null and research hypothesis, test statistic, level of significance and decision rule
  2. Distinguish between Type I and Type II errors and discuss the implications of each
  3. Explain the difference between one- and two-sided tests of hypothesis
  4. Estimate and interpret p-values
  5. Explain the relationship between confidence interval estimates and p-values in drawing inferences
  6. Differentiate hypothesis testing procedures based on type of outcome variable and number of samples

Introduction to Hypothesis Testing

Techniques for Hypothesis Testing

The techniques for hypothesis testing depend on
  • the type of outcome variable being analyzed (continuous, dichotomous, discrete)
  • the number of comparison groups in the investigation
  • whether the comparison groups are independent (i.e., physically separate such as men versus women) or dependent (i.e., matched or paired such as pre- and post-assessments on the same participants).
In estimation we focused explicitly on techniques for one and two samples and discussed estimation for a specific parameter (e.g., the mean or proportion of a population), for differences (e.g., difference in means, the risk difference) and ratios (e.g., the relative risk and odds ratio). Here we will focus on procedures for one and two samples when the outcome is either continuous (and we focus on means) or dichotomous (and we focus on proportions).

General Approach: A Simple Example

The Centers for Disease Control (CDC) reported on trends in weight, height and body mass index from the 1960s through 2002.¹ The general trend was that Americans were much heavier and slightly taller in 2002 as compared to 1960; both men and women gained approximately 24 pounds, on average, between 1960 and 2002. In 2002, the mean weight for men was reported at 191 pounds. Suppose that an investigator hypothesizes that weights are even higher in 2006 (i.e., that the trend continued over the subsequent 4 years). The research hypothesis is that the mean weight in men in 2006 is more than 191 pounds. The null hypothesis is that there is no change in weight, and therefore the mean weight is still 191 pounds in 2006.
| Null Hypothesis | $H_{0}: \mu=191$ | (no change) |
| :--- | :--- | :--- |
| Research Hypothesis | $H_{1}: \mu>191$ | (investigator's belief) |
In order to test the hypotheses, we select a random sample of American males in 2006 and measure their weights. Suppose we have resources available to recruit $n=100$ men into our sample. We weigh each participant and compute summary statistics on the sample data. Suppose in the sample we determine the following:
$$n=100, \quad \bar{x}=197.1, \quad s=25.6$$
Do the sample data support the null or research hypothesis? The sample mean of 197.1 is numerically higher than 191. However, is this difference more than would be expected by chance? In hypothesis testing, we assume that the null hypothesis holds until proven otherwise. We therefore need to determine the likelihood of observing a sample mean of 197.1 or higher when the true population mean is 191 (i.e., if the null hypothesis is true or under the null hypothesis). We can compute this probability using the Central Limit Theorem. Specifically,
$$P(\bar{X}>197.1)=P\left(Z>\frac{197.1-191}{25.6 / \sqrt{100}}\right)=P(Z>2.38)=1-0.9913=0.0087$$
(Notice that we use the sample standard deviation in computing the $Z$ score. This is generally an appropriate substitution as long as the sample size is large, $n \geq 30$.) Thus, there is less than a 1% probability of observing a sample mean as large as 197.1 when the true population mean is 191. Do you think that the null hypothesis is likely true? Based on how unlikely it is to observe a sample mean of 197.1 under the null hypothesis (i.e., < 1% probability), we might infer, from our data, that the null hypothesis is probably not true.
Suppose that the sample data had turned out differently. Suppose that we instead observed the following in 2006:
$$n=100, \quad \bar{x}=192.1, \quad s=25.6$$
How likely is it to observe a sample mean of 192.1 or higher when the true population mean is 191 (i.e., if the null hypothesis is true)? We can again compute this probability using the Central Limit Theorem. Specifically,
$$P(\bar{X}>192.1)=P\left(Z>\frac{192.1-191}{25.6 / \sqrt{100}}\right)=P(Z>0.43)=1-0.6664=0.3336$$
There is a 33.4% probability of observing a sample mean as large as 192.1 when the true population mean is 191. Do you think that the null hypothesis is likely true?
Neither of the sample means that we obtained allows us to know with certainty whether the null hypothesis is true or not. However, our computations suggest that, if the null hypothesis were true, the probability of observing a sample mean > 197.1 is less than 1%. In contrast, if the null hypothesis were true, the probability of observing a sample mean > 192.1 is about 33%. We can’t know whether the null hypothesis is true, but the sample that provided a mean value of 197.1 provides much stronger evidence in favor of rejecting the null hypothesis than the sample that provided a mean value of 192.1. Note that this does not mean that a sample mean of 192.1 indicates that the null hypothesis is true; it just doesn’t provide compelling evidence to reject it.
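The two tail probabilities above can be reproduced with a short Python sketch. It uses only the standard library; the function name `upper_tail_probability` is a hypothetical helper introduced here for illustration, not something from the module. Because the code keeps the unrounded Z score, its results may differ slightly in the last digit from the table-based values in the text.

```python
from math import sqrt
from statistics import NormalDist


def upper_tail_probability(sample_mean, null_mean, sample_sd, n):
    """Return (z, P(Z > z)) for a sample mean, assuming the null mean is true.

    Hypothetical helper: it applies the large-sample normal approximation
    (Central Limit Theorem) with the sample standard deviation standing in
    for the population standard deviation.
    """
    standard_error = sample_sd / sqrt(n)
    z = (sample_mean - null_mean) / standard_error
    return z, 1.0 - NormalDist().cdf(z)


# The two samples discussed in the text (null mean 191, s = 25.6, n = 100).
for xbar in (197.1, 192.1):
    z, prob = upper_tail_probability(xbar, null_mean=191, sample_sd=25.6, n=100)
    print(f"sample mean {xbar}: z = {z:.2f}, upper-tail probability = {prob:.4f}")
```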
In essence, hypothesis testing is a procedure to compute a probability that reflects the strength of the evidence (based on a given sample) for rejecting the null hypothesis. In hypothesis testing, we determine a threshold or cut-off point (called the critical value) to decide when to believe the null hypothesis and when to believe the research hypothesis. It is important to note that it is possible to observe any sample mean when the null hypothesis is true (in this example, when the true population mean equals 191), but some sample means are very unlikely. Based on the two samples above it would seem reasonable to believe the research hypothesis when $\bar{X}=197.1$, but to believe the null hypothesis when $\bar{X}=192.1$. What we need is a threshold value such that if $\bar{X}$ is above that threshold then we believe that $H_{1}$ is true and if $\bar{X}$ is below that threshold then we believe that $H_{0}$ is true. The difficulty in determining a threshold for $\bar{X}$ is that it depends on the scale of measurement. In this example, the critical value might be 195 (i.e., if the sample mean is 195 or more then we believe that $H_{1}$ is true and if the sample mean is less than 195 then we believe that $H_{0}$ is true). If we were instead interested in assessing an increase in blood pressure over time, the critical value would be different because blood pressures are measured in millimeters of mercury (mmHg) as opposed to pounds. In the following we will explain how the critical value is determined and how we handle the issue of scale.
First, to address the issue of scale in determining the critical value, we convert our sample data (in particular the sample mean) into a $Z$ score. We know from the module on probability that the center of the $Z$ distribution is zero and extreme values are those that exceed 2 or fall below -2. $Z$ scores above 2 and below -2 represent approximately 5% of all $Z$ values. If the observed sample mean is close to the mean specified in $H_{0}$ (here $\mu=191$), then $Z$ will be close to zero. If the observed sample mean is much larger than the mean specified in $H_{0}$, then $Z$ will be large.
In hypothesis testing, we select a critical value from the $Z$ distribution. This is done by first determining what is called the level of significance, denoted $\alpha$ (“alpha”). What we are doing here is drawing a line at extreme values. The level of significance is the probability that we reject the null hypothesis (in favor of the alternative) when it is actually true and is also called the Type I error rate.
$$\alpha=\text{Level of significance}=P(\text{Type I error})=P\left(\text{Reject } H_{0} \mid H_{0} \text{ is true}\right).$$
Because $\alpha$ is a probability, it ranges between 0 and 1. The most commonly used value in the medical literature for $\alpha$ is 0.05, or 5%. Thus, if an investigator selects $\alpha=0.05$, then they are allowing a 5% probability of incorrectly rejecting the null hypothesis in favor of the alternative when the null is in fact true. Depending on the circumstances, one might choose to use a level of significance of 1% or 10%. For example, if an investigator wanted to reject the null only if there were even stronger evidence than that ensured with $\alpha=0.05$, they could choose $\alpha=0.01$ as their level of significance. The typical values for $\alpha$ are 0.01, 0.05 and 0.10, with $\alpha=0.05$ the most commonly used value.
Suppose in our weight study we select $\alpha=0.05$. We need to determine the value of $Z$ that holds 5% of the values above it (see below).

The critical value of $Z$ for $\alpha=0.05$ is $Z=1.645$ (i.e., 5% of the distribution is above $Z=1.645$). With this value we can set up what is called our decision rule for the test. The rule is to reject $H_{0}$ if the $Z$ score is 1.645 or more.
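As a rough illustration of how the critical value follows from the level of significance, the sketch below looks up the upper-tailed cutoff with the standard library's inverse normal CDF and then converts it back to the original measurement scale. The function names are hypothetical, and the conversion to pounds assumes the weight example's values ($\mu_0 = 191$, $s = 25.6$, $n = 100$). Under those assumptions the cutoff on the pound scale comes out near 195, in line with the illustrative threshold mentioned earlier.

```python
from math import sqrt
from statistics import NormalDist


def upper_critical_z(alpha):
    """Z value that leaves probability alpha in the upper tail."""
    return NormalDist().inv_cdf(1 - alpha)


def threshold_on_original_scale(null_mean, sample_sd, n, alpha):
    """Sample-mean cutoff (in the data's own units) matching the Z cutoff."""
    return null_mean + upper_critical_z(alpha) * sample_sd / sqrt(n)


# Typical significance levels named in the text.
for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha}: critical Z = {upper_critical_z(alpha):.3f}")

# Same cutoff expressed in pounds for the weight example.
cutoff = threshold_on_original_scale(null_mean=191, sample_sd=25.6, n=100, alpha=0.05)
print(f"reject H0 if the sample mean is at least {cutoff:.1f} pounds")
```

Working on the Z scale is what makes the same cutoff (1.645) reusable for outcomes measured in any unit; only the back-conversion depends on the scale.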
With the first sample we have
$$\bar{X}=197.1 \text{ and } Z=\frac{197.1-191}{25.6 / \sqrt{100}}=2.38$$
Because $2.38 \geq 1.645$, we reject the null hypothesis. (The same conclusion can be drawn by comparing the 0.0087 probability of observing a sample mean as extreme as 197.1 to the level of significance of 0.05. If the observed probability is smaller than the level of significance we reject $H_{0}$.) Because the $Z$ score exceeds the critical value, we conclude that the mean weight for men in 2006 is more than 191 pounds, the value reported in 2002. If we observed the second sample (i.e., sample mean = 192.1), we would not be able to reject the null hypothesis because the $Z$ score is 0.43, which is not in the rejection region (i.e., the region in the tail end of the curve above 1.645). With the second sample we do not have sufficient evidence (because we set our level of significance at 5%) to conclude that weights have increased. Again, the same conclusion can be reached by comparing probabilities. The probability of observing a sample mean as extreme as 192.1 is 33.4%, which is not below our 5% level of significance.
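The interpretation of $\alpha$ as the Type I error rate can be checked informally by simulation. The sketch below repeatedly draws samples from a population in which the null hypothesis is true and counts how often the upper-tailed rule "reject if $Z \geq 1.645$" fires. It rests on assumptions not stated in the module: the population is taken to be normal with mean 191 and standard deviation 25.6, and the function name, trial count and seed are arbitrary choices for illustration.

```python
import random
from math import fsum, sqrt
from statistics import NormalDist


def simulate_type_i_error_rate(alpha=0.05, n=100, trials=4000, seed=0):
    """Fraction of simulated samples that wrongly reject a true null.

    Assumption: weights are normal with mean 191 and SD 25.6, so H0 holds
    in every simulated sample and every rejection is a Type I error.
    """
    rng = random.Random(seed)
    critical_value = NormalDist().inv_cdf(1 - alpha)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(191, 25.6) for _ in range(n)]
        sample_mean = fsum(sample) / n
        sample_sd = sqrt(fsum((x - sample_mean) ** 2 for x in sample) / (n - 1))
        z = (sample_mean - 191) / (sample_sd / sqrt(n))
        if z >= critical_value:
            rejections += 1
    return rejections / trials


print(f"observed rejection rate under H0: {simulate_type_i_error_rate():.3f}")
```

The observed rate should land close to the chosen $\alpha$, varying a little from run to run with the seed and the number of trials.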

Hypothesis Testing: Upper-, Lower-, and Two-Tailed Tests

The procedure for hypothesis testing is based on the ideas described above. Specifically, we set up competing hypotheses, select a random sample from the population of interest and compute summary statistics. We then determine whether the sample data support the null or the alternative hypothesis. The procedure can be broken down into the following five steps.
  • Step 1. Set up hypotheses and select the level of significance $\alpha$.

    $H_{0}$: Null hypothesis (no change, no difference); $H_{1}$: Research hypothesis (investigator’s belief); $\alpha=0.05$

    Upper-tailed, Lower-tailed, Two-tailed Tests

    The research or alternative hypothesis can take one of three forms. An investigator might believe that the parameter has increased, decreased or changed. For example, an investigator might hypothesize:
  1. $H_{1}: \mu>\mu_{0}$, where $\mu_{0}$ is the comparator or null value (e.g., $\mu_{0}=191$ in our example about weight in men in 2006) and an increase is hypothesized; this type of test is called an upper-tailed test;
  2. $H_{1}: \mu<\mu_{0}$, where a decrease is hypothesized; this is called a lower-tailed test; or
  3. $H_{1}: \mu \neq \mu_{0}$, where a difference is hypothesized; this is called a two-tailed test.
The exact form of the research hypothesis depends on the investigator’s belief about the parameter of interest and whether it has possibly increased, decreased or is different from the null value. The research hypothesis is set up by the investigator before any data are collected.
  • Step 2. Select the appropriate test statistic.
The test statistic is a single number that summarizes the sample information. An example of a test statistic is the $Z$ statistic computed as follows:
$$Z=\frac{\bar{X}-\mu_{0}}{s / \sqrt{n}}$$
When the sample size is small, we will use $t$ statistics (just as we did when constructing confidence intervals for small samples). As we present each scenario, alternative test statistics are provided along with conditions for their appropriate use.
  • Step 3. Set up decision rule.
The decision rule is a statement that tells under what circumstances to reject the null hypothesis. The decision rule is based on specific values of the test statistic (e.g., reject $H_{0}$ if $Z \geq 1.645$). The decision rule for a specific test depends on 3 factors: the research or alternative hypothesis, the test statistic and the level of significance. Each is discussed below.
  1. The decision rule depends on whether an upper-tailed, lower-tailed, or two-tailed test is proposed. In an upper-tailed test the decision rule has investigators reject $H_{0}$ if the test statistic is larger than the critical value. In a lower-tailed test the decision rule has investigators reject $H_{0}$ if the test statistic is smaller than the critical value. In a two-tailed test the decision rule has investigators reject $H_{0}$ if the test statistic is extreme, either larger than an upper critical value or smaller than a lower critical value.
  2. The exact form of the test statistic is also important in determining the decision rule. If the test statistic follows the standard normal distribution ($Z$), then the decision rule will be based on the standard normal distribution. If the test statistic follows the $t$ distribution, then the decision rule will be based on the $t$ distribution. The appropriate critical value will be selected from the $t$ distribution, again depending on the specific alternative hypothesis and the level of significance.
  3. The third factor is the level of significance. The level of significance which is selected in Step 1 (e.g., $\alpha=0.05$) dictates the critical value. For example, in an upper-tailed $Z$ test, if $\alpha=0.05$ then the critical value is $Z=1.645$.
The following figures illustrate the rejection regions defined by the decision rule for upper-, lower- and two-tailed $Z$ tests with $\alpha=0.05$. Notice that the rejection regions are in the upper, lower and both tails of the curves, respectively. The decision rules are written below each figure.
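For reference, the standard decision rules for a $Z$ test at $\alpha=0.05$ are: upper-tailed, reject $H_{0}$ if $Z \geq 1.645$; lower-tailed, reject $H_{0}$ if $Z \leq -1.645$; two-tailed, reject $H_{0}$ if $Z \leq -1.960$ or $Z \geq 1.960$. The sketch below encodes the three rejection regions in one hypothetical function, `reject_null`, whose name and `tail` argument are illustrative choices rather than anything defined in the module.

```python
from statistics import NormalDist


def reject_null(z, alpha=0.05, tail="upper"):
    """Apply the Z-test decision rule for the given type of test.

    tail: "upper" (H1: mu > mu0), "lower" (H1: mu < mu0) or "two" (H1: mu != mu0).
    """
    std_normal = NormalDist()
    if tail == "upper":
        return z >= std_normal.inv_cdf(1 - alpha)
    if tail == "lower":
        return z <= std_normal.inv_cdf(alpha)
    if tail == "two":
        # alpha is split evenly between the two tails.
        return abs(z) >= std_normal.inv_cdf(1 - alpha / 2)
    raise ValueError(f"unknown tail: {tail!r}")


# The two Z scores from the weight example, tested as upper-tailed.
for z in (2.38, 0.43):
    print(f"Z = {z}: reject H0? {reject_null(z, alpha=0.05, tail='upper')}")
```

Note that a Z score such as 1.8 falls in the rejection region of the upper-tailed test but not of the two-tailed test at the same $\alpha$, which is one reason the form of the research hypothesis has to be fixed before the data are examined.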


The complete table of critical values of $Z$ for upper-, lower- and two-tailed tests can be found in the table of $Z$ values to the right in “Other Resources.”
Critical values of $t$ for upper-, lower- and two-tailed tests can be found in the table of $t$ values in “Other Resources.”
  • Step 4. Compute the test statistic.
Here we compute the test statistic by substituting the observed sample data into the test statistic identified in Step 2.
  • Step 5. Conclusion.
The final conclusion is made by comparing the test statistic (which is a summary of the information observed in the sample) to the decision rule. The final conclusion will be either to reject the null hypothesis (because the sample data are very unlikely if the null hypothesis is true) or not to reject the null hypothesis (because the sample data are not very unlikely).
If the null hypothesis is rejected, then an exact significance level is computed to describe the likelihood of observing the sample data assuming that the null hypothesis is true. The exact level of significance is called the p-value and it will be less than the chosen level of significance if we reject $H_{0}$.
Statistical computing packages provide exact p-values as part of their standard output for hypothesis tests. In fact, when using a statistical computing package, the steps outlined above can be abbreviated. The hypotheses (Step 1) should always be set up in advance of any analysis and the significance criterion should also be determined (e.g., $\alpha=0.05$). Statistical computing packages will produce the test statistic (usually reporting the test statistic as $t$) and a p-value. The investigator can then determine statistical significance using the following: if $p \leq \alpha$ then reject $H_{0}$.
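To make the rule "if $p \leq \alpha$ then reject $H_{0}$" concrete, here is a minimal sketch of how a p-value could be obtained from a $Z$ statistic for each form of the research hypothesis. It is not the output of any particular statistical package; the function names and the `alternative` labels are hypothetical, and the normal approximation is assumed throughout.

```python
from statistics import NormalDist


def z_test_p_value(z, alternative="greater"):
    """p-value for a Z statistic under the standard normal distribution.

    alternative: "greater" (upper-tailed), "less" (lower-tailed)
    or "two-sided" (two-tailed).
    """
    std_normal = NormalDist()
    if alternative == "greater":
        return 1 - std_normal.cdf(z)
    if alternative == "less":
        return std_normal.cdf(z)
    if alternative == "two-sided":
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError(f"unknown alternative: {alternative!r}")


def is_significant(p_value, alpha=0.05):
    """Decision rule expressed in terms of the p-value."""
    return p_value <= alpha


p = z_test_p_value(2.38, alternative="greater")
print(f"p = {p:.4f}, reject H0 at alpha = 0.05? {is_significant(p)}")
```

Comparing the p-value to $\alpha$ leads to the same decision as comparing the test statistic to the critical value; the p-value simply carries more information about how strong the evidence is.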

Things to Remember When Interpreting P Values

  1. P-values summarize statistical significance and do not address clinical significance. There are instances where results are both clinically and statistically significant, and others where they are one or the other but not both. This is because p-values depend upon both the magnitude of association and the precision of the estimate (the sample size). When the sample size is large, results can reach statistical significance (i.e., small p-value) even when the effect is small and clinically unimportant. Conversely, with small sample sizes, results can fail to reach statistical significance yet the effect is large and potentially clinically important. It is extremely important to assess both statistical and clinical significance of results.
  2. Statistical tests allow us to draw conclusions of significance or not based on a comparison of the p-value to our selected level of significance. Remember that this conclusion is based on the selected level of significance ($\alpha$) and could change with a different level of significance. While $\alpha=0.05$ is standard, a p-value of 0.06 should be examined for clinical importance.
  3. When conducting any statistical analysis, there is always a possibility of an incorrect conclusion. With many statistical analyses, this possibility is increased. Investigators should only conduct the statistical analyses (e.g., tests) of interest and not all possible tests.
  4. Many investigators inappropriately believe that the p-value represents the probability that the null hypothesis is true. P-values are computed based on the assumption that the null hypothesis is true. The p-value is the probability that the data could deviate from the null hypothesis as much as they did or more. Consequently, the p-value measures the compatibility of the data with the null hypothesis, not the probability that the null hypothesis is correct.
  5. Statistical significance does not take into account the possibility of bias or confounding; these issues must always be investigated.
  6. Evidence-based decision making is important in public health and in medicine, but decisions are rarely made based on the finding of a single study. Replication is always important to build a body of evidence to support findings.
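The first point in the list above, that p-values depend on sample size as well as on the size of the effect, can be illustrated with a small sketch. It holds the observed difference fixed at 1.1 pounds (sample mean 192.1 versus null mean 191, $s = 25.6$) and varies only $n$. The sample sizes other than 100 are hypothetical values chosen for illustration, and the function name is not from the module.

```python
from math import sqrt
from statistics import NormalDist


def upper_tailed_p_value(sample_mean, null_mean, sample_sd, n):
    """Upper-tailed p-value from the large-sample Z statistic."""
    z = (sample_mean - null_mean) / (sample_sd / sqrt(n))
    return 1 - NormalDist().cdf(z)


# Same 1.1-pound difference, increasingly large (hypothetical) samples.
for n in (100, 1000, 10000):
    p = upper_tailed_p_value(192.1, 191, 25.6, n)
    print(f"n = {n:>5}: p = {p:.6f}")
```

The difference of about one pound is the same in every row; only the precision of the estimate changes. Whether such a difference matters clinically is a separate question that the p-value does not answer.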
We now use the five-step procedure to test the research hypothesis that the mean weight in men in 2006 is more than 191 pounds. We will assume the sample data are as follows: $n=100$, $\bar{x}=197.1$ and $s=25.6$.
  • Step 1. Set up hypotheses and determine level of significance
    步骤 1.设置假设并确定显著性级别
H 0 : μ = 191 H 1 : μ > 191 α = 0.05 H 0 : μ = 191 H 1 : μ > 191 α = 0.05 H_(0):mu=191H_(1):mu > 191quad alpha=0.05H_{0}: \mu=191 H_{1}: \mu>191 \quad \alpha=0.05
The research hypothesis is that weights have increased, and therefore an upper tailed test is used.
研究假设是体重增加,因此使用上尾检验。
  • Step 2. Select the appropriate test statistic.
    步骤 2。选择适当的检验统计量。
Because the sample size is large ( n 30 n 30 n >= 30n \geq 30 ) the appropriate test statistic is
由于样本量很大 ( n 30 n 30 n >= 30n \geq 30 ),因此适当的检验统计量为
Z = X ¯ μ 0 s / n Z = X ¯ μ 0 s / n Z=(( bar(X))-mu_(0))/(s//sqrtn)Z=\frac{\bar{X}-\mu_{0}}{s / \sqrt{n}}
  • Step 3. Set up decision rule.
In this example, we are performing an upper tailed test ($H_{1}: \mu>191$), with a $Z$ test statistic and selected $\alpha=0.05$. Reject $H_{0}$ if $Z \geq 1.645$.
  • Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic identified in Step 2.
$$Z=\frac{\bar{X}-\mu_{0}}{s / \sqrt{n}}=\frac{197.1-191}{25.6 / \sqrt{100}}=2.38$$
  • Step 5. Conclusion.
We reject $H_{0}$ because $2.38 \geq 1.645$. We have statistically significant evidence at $\alpha=0.05$ to show that the mean weight in men in 2006 is more than 191 pounds. Because we rejected the null hypothesis, we now approximate the p-value, which is the likelihood of observing sample data as extreme as, or more extreme than, ours if the null hypothesis is true. An alternative definition of the p-value is the smallest level of significance at which we can still reject $H_{0}$. In this example, we observed $Z=2.38$ and for $\alpha=0.05$ the critical value was 1.645. Because 2.38 exceeded 1.645, we rejected $H_{0}$. In our conclusion we reported a statistically significant increase in mean weight at a 5% level of significance. Using the table of critical values for upper tailed tests, we can approximate the p-value. If we select $\alpha=0.025$, the critical value is 1.960, and we still reject $H_{0}$ because $2.38 \geq 1.960$. If we select $\alpha=0.010$, the critical value is 2.326, and we still reject $H_{0}$ because $2.38 \geq 2.326$. However, if we select $\alpha=0.005$, the critical value is 2.576, and we cannot reject $H_{0}$ because $2.38<2.576$. Therefore, the smallest tabulated $\alpha$ at which we still reject $H_{0}$ is 0.010, and we take this as the approximate p-value. A statistical computing package would produce a more precise p-value, which would be between 0.005 and 0.010. Here we are approximating the p-value and would report $p<0.010$.
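The table-based approximation can be checked against an exact normal tail probability. The following is a minimal sketch, assuming only the summary statistics given for the weight example ($n=100$, $\bar{x}=197.1$, $s=25.6$, $\mu_{0}=191$); the function name `upper_tail_p_value` is chosen here for illustration and does not come from the module.

```python
from math import erfc, sqrt

def upper_tail_p_value(z: float) -> float:
    """P(Z >= z) for a standard normal Z, via the complementary error function."""
    return 0.5 * erfc(z / sqrt(2))

# Summary statistics from the weight example
n, xbar, s, mu0 = 100, 197.1, 25.6, 191

z = (xbar - mu0) / (s / sqrt(n))   # test statistic from Step 2
p = upper_tail_p_value(z)          # exact upper-tailed p-value

print(f"Z = {z:.2f}")
print(f"p = {p:.4f}")
```

The computed p-value should fall between 0.005 and 0.010, which agrees with the bracket obtained from the critical value table.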

Type I and Type II Errors

In all tests of hypothesis, there are two types of errors that can be committed. The first is called a Type I error and refers to the situation where we incorrectly reject $H_{0}$ when in fact it is true. This is also called a false positive result (as we incorrectly conclude that the research hypothesis is true when in fact it is not). When we run a test of hypothesis and decide to reject $H_{0}$ (e.g., because the test statistic exceeds the critical value in an upper tailed test) then either we make a correct decision because the research hypothesis is true or we commit a Type I error. The different conclusions are summarized in the table below. Note that we will never know whether the null hypothesis is really true or false (i.e., we will never know which row of the following table reflects reality).

Conclusion in Test of Hypothesis

| | Do Not Reject $H_{0}$ | Reject $H_{0}$ |
| :--- | :---: | :---: |
| $H_{0}$ is True | Correct Decision | Type I Error |
| $H_{0}$ is False | Type II Error | Correct Decision |

In the first step of the hypothesis test, we select a level of significance, $\alpha$, and $\alpha=P(\text{Type I error})$. Because we purposely select a small value for $\alpha$, we control the probability of committing a Type I error. For example, if we select $\alpha=0.05$, then when the null hypothesis is true there is a 5% probability that our test tells us to reject $H_{0}$, that is, that we commit a Type I error. Most investigators are very comfortable with this and are confident when rejecting $H_{0}$ that the research hypothesis is true (as it is the more likely scenario when we reject $H_{0}$).
When we run a test of hypothesis and decide not to reject $H_{0}$ (e.g., because the test statistic is below the critical value in an upper tailed test) then either we make a correct decision because the null hypothesis is true or we commit a Type II error. Beta ($\beta$) represents the probability of a Type II error and is defined as follows: $\beta=P(\text{Type II error})=P(\text{Do not reject } H_{0} \mid H_{0} \text{ is false})$. Unfortunately, we cannot choose $\beta$ to be small (e.g., 0.05) to control the probability of committing a Type II error because $\beta$ depends on several factors including the sample size, $\alpha$, and the research hypothesis. When we do not reject $H_{0}$, it may be very likely that we are committing a Type II error (i.e., failing to reject $H_{0}$ when in fact it is false). Therefore, when tests are run and the null hypothesis is not rejected we often make a weak concluding statement allowing for the possibility that we might be committing a Type II error. If we do not reject $H_{0}$, we conclude that we do not have significant evidence to show that $H_{1}$ is true. We do not conclude that $H_{0}$ is true.
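To make the dependence of $\beta$ on sample size concrete, here is a rough sketch for an upper-tailed Z test. It assumes a known standard deviation and a specific true mean under the research hypothesis; the true mean of 195 and the function name `type_ii_error` are hypothetical choices for illustration, not values from the module.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(mu0, mu_true, sigma, n, alpha=0.05):
    """Approximate beta for an upper-tailed Z test of H0: mu = mu0.

    Assumes sigma is known and that the true mean mu_true exceeds mu0
    (both are assumptions of this sketch).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)        # about 1.645 when alpha = 0.05
    shift = (mu_true - mu0) / (sigma / sqrt(n))   # true mean minus mu0, in standard errors
    # We fail to reject when Z < z_crit; under the true mean, Z ~ N(shift, 1).
    return std_normal.cdf(z_crit - shift)

# Hypothetical true mean of 195 pounds against H0: mu = 191
for n in (25, 100, 400):
    beta = type_ii_error(mu0=191, mu_true=195, sigma=25.6, n=n)
    print(f"n = {n:3d}  beta = {beta:.2f}  power = {1 - beta:.2f}")
```

Under these assumptions $\beta$ shrinks as $n$ grows, which is why a non-significant result from a small sample supports only a weak conclusion.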

The most common reason for a Type II error is a small sample size.

Tests with One Sample, Continuous Outcome

Hypothesis testing applications with a continuous outcome variable in a single population are performed according to the five-step procedure outlined above. A key component is setting up the null and research hypotheses. The objective is to compare the mean in a single population to a known mean ($\mu_{0}$). The known value is generally derived from another study or report, for example a study in a similar, but not identical, population or a study performed some years ago. The latter is called a historical control. It is important in setting up the hypotheses in a one sample test that the mean specified in the null hypothesis is a fair and reasonable comparator. This will be discussed in the examples that follow.
In one sample tests for a continuous outcome, we set up our hypotheses against an appropriate comparator. We select a sample and compute descriptive statistics on the sample data, including the sample size ($n$), the sample mean ($\bar{X}$) and the sample standard deviation ($s$). We then determine the appropriate test statistic (Step 2) for the hypothesis test. The formulas for test statistics depend on the sample size and are given below.

Test Statistics for Testing $H_{0}: \mu=\mu_{0}$

| if $n \geq 30$ | if $n<30$ |
| :---: | :---: |
| $Z=\frac{\bar{X}-\mu_{0}}{s / \sqrt{n}}$ | $t=\frac{\bar{X}-\mu_{0}}{s / \sqrt{n}}$ |
| | where $df=n-1$ |

Note that statistical computing packages will use the $t$ statistic exclusively and make the necessary adjustments for comparing the test statistic to appropriate values from probability tables to produce a p-value.

Example:

The National Center for Health Statistics (NCHS) published a report in 2005 entitled Health, United States, containing extensive information on major trends in the health of Americans. Data are provided for the US population as a whole and for specific ages, sexes and races. The NCHS report indicated that in 2002 Americans paid an average of \$3,302 per year on health care and prescription drugs. An investigator hypothesizes that in 2005 expenditures have decreased primarily due to the availability of generic drugs. To test the hypothesis, a sample of 100 Americans is selected and their expenditures on health care and prescription drugs in 2005 are measured. The sample data are summarized as follows: $n=100$, $\bar{X}=\$3{,}190$ and $s=\$890$. Is there statistical evidence of a reduction in expenditures on health care and prescription drugs in 2005? Is the sample mean of \$3,190 evidence of a true reduction in the mean or is it within chance fluctuation? We will run the test using the five-step approach.
  • Step 1. Set up hypotheses and determine level of significance
$$H_{0}: \mu=3{,}302 \qquad H_{1}: \mu<3{,}302 \qquad \alpha=0.05$$
The research hypothesis is that expenditures have decreased, and therefore a lower-tailed test is used.
  • Step 2. Select the appropriate test statistic.
Because the sample size is large ($n \geq 30$), the appropriate test statistic is
$$Z=\frac{\bar{X}-\mu_{0}}{s / \sqrt{n}}$$
  • Step 3. Set up decision rule.
This is a lower tailed test, using a $Z$ statistic and a 5% level of significance. Reject $H_{0}$ if $Z \leq -1.645$.
  • Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic identified in Step 2.
$$Z=\frac{\bar{X}-\mu_{0}}{s / \sqrt{n}}=\frac{3{,}190-3{,}302}{890 / \sqrt{100}}=-1.26$$
  • Step 5. Conclusion.
We do not reject $H_{0}$ because $-1.26>-1.645$. We do not have statistically significant evidence at $\alpha=0.05$ to show that the mean expenditures on health care and prescription drugs are lower in 2005 than the mean of \$3,302 reported in 2002.
Recall that when we fail to reject $H_{0}$ in a test of hypothesis, either the null hypothesis is true (here the mean expenditures in 2005 are the same as those in 2002 and equal to \$3,302) or we committed a Type II error (i.e., we failed to reject $H_{0}$ when in fact it is false). In summarizing this test, we conclude that we do not have sufficient evidence to reject $H_{0}$. We do not conclude that $H_{0}$ is true, because there may be a moderate to high probability that we committed a Type II error. It is possible that the sample size is not large enough to detect a difference in mean expenditures.
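The five steps for a large-sample test of a mean can be collected into one small routine. The sketch below is illustrative only: the function name `one_sample_z_test` and the tail labels `"upper"`, `"lower"` and `"two"` are assumptions made here, and the decision is expressed through the p-value, which is equivalent to comparing $Z$ with the critical value.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_test(xbar, s, n, mu0, tail, alpha=0.05):
    """Large-sample Z test of H0: mu = mu0.

    tail is "upper", "lower", or "two" (labels chosen for this sketch).
    Returns (z, p_value, reject).
    """
    std_normal = NormalDist()
    z = (xbar - mu0) / (s / sqrt(n))
    if tail == "upper":
        p = 1 - std_normal.cdf(z)
    elif tail == "lower":
        p = std_normal.cdf(z)
    elif tail == "two":
        p = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("tail must be 'upper', 'lower', or 'two'")
    return z, p, p <= alpha

# Expenditure example: H0: mu = 3302 versus H1: mu < 3302
z, p, reject = one_sample_z_test(xbar=3190, s=890, n=100, mu0=3302, tail="lower")
print(f"Z = {z:.2f}, p = {p:.3f}, reject H0: {reject}")
```

For the expenditure data the p-value exceeds 0.05, so the routine reaches the same decision as the critical value rule: do not reject $H_{0}$.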

Example.

The NCHS reported that the mean total cholesterol level in 2002 for all adults was 203. Total cholesterol levels in participants who attended the seventh examination of the Offspring in the Framingham Heart Study are summarized as follows: $n=3{,}310$, $\bar{X}=200.3$, and $s=36.8$. Is there statistical evidence of a difference in mean cholesterol levels in the Framingham Offspring?
Here we want to assess whether the sample mean of 200.3 in the Framingham sample is statistically significantly different from 203 (i.e., beyond what we would expect by chance). We will run the test using the five-step approach.
  • Step 1. Set up hypotheses and determine level of significance
$$H_{0}: \mu=203 \qquad H_{1}: \mu \neq 203 \qquad \alpha=0.05$$
The research hypothesis is that cholesterol levels are different in the Framingham Offspring, and therefore a two-tailed test is used.
  • Step 2. Select the appropriate test statistic.
Because the sample size is large ($n \geq 30$), the appropriate test statistic is
$$Z=\frac{\bar{X}-\mu_{0}}{s / \sqrt{n}}$$
  • Step 3. Set up decision rule.
This is a two-tailed test, using a $Z$ statistic and a 5% level of significance. Reject $H_{0}$ if $Z \leq -1.960$ or if $Z \geq 1.960$.
  • Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic identified in Step 2.
$$Z=\frac{\bar{X}-\mu_{0}}{s / \sqrt{n}}=\frac{200.3-203}{36.8 / \sqrt{3{,}310}}=-4.22$$
  • Step 5. Conclusion.
We reject $H_{0}$ because $-4.22 \leq -1.960$. We have statistically significant evidence at $\alpha=0.05$ to show that the mean total cholesterol level in the Framingham Offspring is different from the national average of 203 reported in 2002. Because we reject $H_{0}$, we also approximate a p-value. Using the two-sided significance levels, $p<0.0001$.

Statistical Significance versus Clinical (Practical) Significance

This example raises an important concept of statistical versus clinical or practical significance. From a statistical standpoint, the total cholesterol levels in the Framingham sample are highly statistically significantly different from the national average with $p<0.0001$ (i.e., if the null hypothesis were true, there would be less than a 0.01% chance of observing a difference this large). However, the sample mean in the Framingham Offspring study is 200.3, less than 3 units different from the national mean of 203. The data are so highly statistically significant because the sample size is very large. It is always important to assess both statistical and clinical significance of data. This is particularly relevant when the sample size is large. Is a 3 unit difference in total cholesterol a meaningful difference?
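The role of sample size can be illustrated by holding the sample mean and standard deviation fixed and varying only $n$. This is a small sketch: the values 200.3, 203 and 36.8 come from the Framingham example, while the smaller sample sizes of 50 and 500 are hypothetical and chosen only for comparison.

```python
from math import sqrt

def z_statistic(xbar, mu0, s, n):
    """Z statistic for a one-sample test of H0: mu = mu0."""
    return (xbar - mu0) / (s / sqrt(n))

# Same sample mean and standard deviation as the Framingham example;
# only the sample size changes. The two smaller n values are hypothetical.
for n in (50, 500, 3310):
    z = z_statistic(xbar=200.3, mu0=203, s=36.8, n=n)
    significant = abs(z) >= 1.960   # two-tailed test at alpha = 0.05
    print(f"n = {n:4d}  Z = {z:6.2f}  significant: {significant}")
```

The same 2.7-unit difference is not significant at the two smaller sample sizes but is highly significant at $n=3{,}310$, so statistical significance alone says little about whether the difference matters clinically.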

Example  

Consider again the NCHS-reported mean total cholesterol level in 2002 for all adults of 203. Suppose a new drug is proposed to lower total cholesterol. A study is designed to evaluate the efficacy of the drug in lowering cholesterol. Fifteen patients are enrolled in the study and asked to take the new drug for 6 weeks. At the end of 6 weeks, each patient's total cholesterol level is measured and the sample statistics are as follows: $n=15$, $\bar{X}=195.9$ and $s=28.7$. Is there statistical evidence of a reduction in mean total cholesterol in patients after using the new drug for 6 weeks? We will run the test using the five-step approach.
  • Step 1. Set up hypotheses and determine level of significance
$$H_{0}: \mu=203 \qquad H_{1}: \mu<203 \qquad \alpha=0.05$$
  • Step 2. Select the appropriate test statistic.
Because the sample size is small ($n<30$), the appropriate test statistic is
$$t=\frac{\bar{X}-\mu_{0}}{s / \sqrt{n}}$$
  • Step 3. Set up decision rule.
This is a lower tailed test, using a $t$ statistic and a 5% level of significance. In order to determine the critical value of $t$, we need degrees of freedom, $df$, defined as $df=n-1$. In this example $df=15-1=14$. The critical value for a lower tailed test with $df=14$ and $\alpha=0.05$ is -1.761 and the decision rule is as follows: Reject $H_{0}$ if $t \leq -1.761$.
  • Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic identified in Step 2.
$$t=\frac{\bar{X}-\mu_{0}}{s / \sqrt{n}}=\frac{195.9-203}{28.7 / \sqrt{15}}=-0.96$$
  • Step 5. Conclusion.
We do not reject $H_{0}$ because $-0.96>-1.761$. We do not have statistically significant evidence at $\alpha=0.05$ to show that the mean total cholesterol level is lower than the national mean in patients taking the new drug for 6 weeks. Again, because we failed to reject the null hypothesis we make a weaker concluding statement allowing for the possibility that we may have committed a Type II error (i.e., failed to reject $H_{0}$ when in fact the drug is efficacious).
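For a small-sample test, the p-value comes from the $t$ distribution rather than the normal. The sketch below computes a lower-tailed p-value for the drug example by numerically integrating the $t$ density with Simpson's rule, using only the standard library. The helper names `t_pdf` and `t_lower_tail` are made up for this illustration, and in practice a statistical package would normally be used instead of hand-rolled integration.

```python
from math import gamma, pi, sqrt

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_lower_tail(t, df, steps=2000):
    """P(T <= t) for t <= 0, via Simpson's rule on the interval [t, 0]."""
    if t > 0:
        raise ValueError("this sketch only handles t <= 0")
    h = (0 - t) / steps                      # steps must be even for Simpson's rule
    total = t_pdf(t, df) + t_pdf(0, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(t + i * h, df)
    area_t_to_0 = total * h / 3
    return 0.5 - area_t_to_0                 # by symmetry, P(T <= 0) = 0.5

# Drug example: H0: mu = 203 versus H1: mu < 203
n, xbar, s, mu0 = 15, 195.9, 28.7, 203
t_stat = (xbar - mu0) / (s / sqrt(n))
p = t_lower_tail(t_stat, df=n - 1)
print(f"t = {t_stat:.2f}, df = {n - 1}, p = {p:.2f}")
```

The resulting p-value is well above 0.05, consistent with the decision not to reject $H_{0}$.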
This example raises an important issue in terms of study design. In this example we assume in the null hypothesis that the mean cholesterol level is 203. This is taken to be the mean cholesterol level in patients without treatment. Is this an appropriate comparator? Alternative and potentially more efficient study designs to evaluate the effect of the new drug could involve two treatment groups, where one group receives the new drug and the other does not, or we could measure each patient's baseline or pre-treatment cholesterol level and then assess changes from baseline to 6 weeks post-treatment. These designs are also discussed here.

Tests with One Sample, Dichotomous Outcome

Hypothesis testing applications with a dichotomous outcome variable in a single population are also performed according to the five-step procedure. Similar to tests for means, a key component is setting up the null and research hypotheses. The objective is to compare the proportion of successes in a single population to a known proportion ($p_{0}$). That known proportion is generally derived from another study or report and is sometimes called a historical control. It is important in setting up the hypotheses in a one sample test that the proportion specified in the null hypothesis is a fair and reasonable comparator.
In one sample tests for a dichotomous outcome, we set up our hypotheses against an appropriate comparator. We select a sample and compute descriptive statistics on the sample data. Specifically, we compute the sample size ($n$) and the sample proportion, which is computed by taking the ratio of the number of successes to the sample size,
$$\hat{p}=\frac{x}{n}$$
We then determine the appropriate test statistic (Step 2) for the hypothesis test. The formula for the test statistic is given below.

Test Statistic for Testing $H_{0}: p=p_{0}$

| if $\min(np_{0}, n(1-p_{0})) \geq 5$ |
| :---: |
| $Z=\frac{\hat{p}-p_{0}}{\sqrt{\frac{p_{0}(1-p_{0})}{n}}}$ |

The formula above is appropriate for large samples, defined as when the smaller of $np_{0}$ and $n(1-p_{0})$ is at least 5. This is similar, but not identical, to the condition required for appropriate use of the confidence interval formula for a population proportion, i.e., $\min(n\hat{p}, n(1-\hat{p})) \geq 5$.
Here we use the proportion specified in the null hypothesis as the true proportion of successes rather than the sample proportion. If we fail to satisfy the condition, then alternative procedures, called exact methods, must be used to test the hypothesis about the population proportion.

Example  

The NCHS report indicated that in 2002 the prevalence of cigarette smoking among American adults was 21.1%. Data on prevalent smoking in $n = 3{,}536$ participants who attended the seventh examination of the Offspring in the Framingham Heart Study indicated that 482/3,536 = 13.6% of the respondents were currently smoking at the time of the exam. Suppose we want to assess whether the prevalence of smoking is lower in the Framingham Offspring sample given the focus on cardiovascular health in that community. Is there evidence of a statistically significantly lower prevalence of smoking in the Framingham Offspring study as compared to the prevalence among all Americans?
  • Step 1. Set up hypotheses and determine level of significance
$H_0: p = 0.211$, $H_1: p < 0.211$, $\alpha = 0.05$
  • Step 2. Select the appropriate test statistic.
We must first check that the sample size is adequate. Specifically, we need to check $\min(n p_0, n(1-p_0)) = \min(3{,}536(0.211), 3{,}536(1-0.211)) = \min(746, 2790) = 746$. The sample size is more than adequate so the following formula can be used:
$$Z=\frac{\hat{p}-p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}$$
  • Step 3. Set up decision rule.
This is a lower-tailed test, using a $Z$ statistic and a 5% level of significance. Reject $H_0$ if $Z \leq -1.645$.
  • Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic identified in Step 2.
$$Z=\frac{\hat{p}-p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}=\frac{0.136-0.211}{\sqrt{\frac{0.211(1-0.211)}{3{,}536}}}=-10.93$$
  • Step 5. Conclusion.
We reject $H_0$ because $-10.93 \leq -1.645$. We have statistically significant evidence at $\alpha = 0.05$ to show that the prevalence of smoking in the Framingham Offspring is lower than the prevalence nationally (21.1%). Here, $p < 0.0001$.
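The five steps of this example can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original module: the helper name `one_sample_proportion_z` is hypothetical, and the standard library's `statistics.NormalDist` is assumed as the source of normal probabilities and critical values. Because the code carries the unrounded sample proportion 482/3,536 instead of 0.136, its Z is about -10.9 rather than exactly -10.93; the conclusion is the same.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_proportion_z(successes, n, p0):
    """Large-sample Z statistic for H0: p = p0 (hypothetical helper name)."""
    # The normal approximation requires min(n*p0, n*(1 - p0)) >= 5.
    if min(n * p0, n * (1 - p0)) < 5:
        raise ValueError("sample too small for the Z test; use an exact method")
    p_hat = successes / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)


# Framingham Offspring smoking data from the example.
z = one_sample_proportion_z(successes=482, n=3536, p0=0.211)

alpha = 0.05
critical = NormalDist().inv_cdf(alpha)  # lower-tailed critical value, about -1.645
p_value = NormalDist().cdf(z)           # lower-tail area below the observed Z
reject_h0 = z <= critical

print(f"Z = {z:.2f}, critical value = {critical:.3f}, reject H0: {reject_h0}")
```

For an upper-tailed or two-tailed alternative, only the critical value and the tail area would change; the Z statistic itself is computed the same way.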

assess whether use of dental services is similar in children living in the city of Boston. A sample of 125 children aged 2 to 17 living in Boston are surveyed and 64 reported seeing a dentist over the past 12 months. Is there a significant difference in use of dental services between children living in Boston and the national data?
Calculate this on your own before checking the answer.

Answer  

Tests with Two Independent Samples, Continuous Outcome

There are many applications where it is of interest to compare two independent groups with respect to their mean scores on a continuous outcome. Here we compare means between groups, but rather than generating an estimate of the difference, we will test whether the observed difference (increase, decrease or difference) is statistically significant or not. Remember that hypothesis testing gives an assessment of statistical significance, whereas estimation gives an estimate of effect, and both are important.
Here we discuss the comparison of means when the two comparison groups are independent or physically separate. The two groups might be determined by a particular attribute (e.g., sex, diagnosis of cardiovascular disease) or might be set up by the investigator (e.g., participants assigned to receive an experimental treatment or placebo). The first step in the analysis involves computing descriptive statistics on each of the two samples. Specifically, we compute the sample size, mean and standard deviation in each sample and we denote these summary statistics as follows:
$n_1$, $\bar{x}_1$ and $s_1$ for sample 1 and $n_2$, $\bar{x}_2$ and $s_2$ for sample 2.
The designation of sample 1 and sample 2 is arbitrary. In a clinical trial setting the convention is to call the treatment group 1 and the control group 2. However, when comparing men and women, for example, either group can be 1 or 2.
In the two independent samples application with a continuous outcome, the parameter of interest in the test of hypothesis is the difference in population means, $\mu_1 - \mu_2$. The null hypothesis is always that there is no difference between groups with respect to means, i.e.,
$$H_0: \mu_1 - \mu_2 = 0.$$
The null hypothesis can also be written as follows: $H_0: \mu_1 = \mu_2$. In the research hypothesis, an investigator can hypothesize that the first mean is larger than the second ($H_1: \mu_1 > \mu_2$), that the first mean is smaller than the second ($H_1: \mu_1 < \mu_2$), or that the means are different ($H_1: \mu_1 \neq \mu_2$). The three different alternatives represent upper-, lower-, and two-tailed tests, respectively. The following test statistics are used to test these hypotheses.

| Test Statistics for Testing $H_0: \mu_1 = \mu_2$ | |
| :--- | :--- |
| if $n_1 \geq 30$ and $n_2 \geq 30$ | if $n_1 < 30$ or $n_2 < 30$ |
| $Z=\frac{\bar{X}_1-\bar{X}_2}{S_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}$ | $t=\frac{\bar{X}_1-\bar{X}_2}{S_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}$ where $df = n_1 + n_2 - 2$ |

NOTE: The formulas above assume equal variability in the two populations (i.e., the population variances are equal, or $\sigma_1^2 = \sigma_2^2$). This means that the outcome is equally variable in each of the comparison populations. For analysis, we have samples from each of the comparison populations. If the sample variances are similar, then the assumption about variability in the populations is probably reasonable. As a guideline, if the ratio of the sample variances, $s_1^2 / s_2^2$, is between 0.5 and 2 (i.e., if one variance is no more than double the other), then the formulas above are appropriate. If the ratio of the sample variances is greater than 2 or less than 0.5, then alternative formulas must be used to account for the heterogeneity in variances.
The test statistics include $S_p$, which is the pooled estimate of the common standard deviation (again assuming that the variances in the populations are similar), computed from a weighted average of the sample variances as follows:
$$S_p=\sqrt{\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}}$$
Because we are assuming equal variances between groups, we pool the information on variability (sample variances) to generate an estimate of the variability in the population. (Note: Because $S_p^2$ is a weighted average of the sample variances, $S_p$ will always be in between $s_1$ and $s_2$.)
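The variance-ratio guideline and the pooled standard deviation can be expressed as two small Python functions. This is a sketch only: the function names are hypothetical, and the sample sizes and standard deviations used below are made-up illustrative values, not data from this module.

```python
from math import sqrt


def equal_variance_reasonable(s1, s2):
    """Guideline check: the ratio of sample variances lies between 0.5 and 2."""
    ratio = s1**2 / s2**2
    return 0.5 <= ratio <= 2


def pooled_sd(n1, s1, n2, s2):
    """Pooled estimate of the common standard deviation, Sp."""
    # Each sample variance is weighted by its degrees of freedom (n - 1).
    pooled_variance = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return sqrt(pooled_variance)


# Made-up illustrative inputs (not taken from the module).
n1, s1 = 20, 4.0
n2, s2 = 40, 5.0

ratio_ok = equal_variance_reasonable(s1, s2)  # 16 / 25 = 0.64, inside [0.5, 2]
sp = pooled_sd(n1, s1, n2, s2)                # lies between s1 and s2

print(f"ratio acceptable: {ratio_ok}, Sp = {sp:.2f}")
```

With these inputs Sp lands nearer to the standard deviation of the larger sample, which is the behavior the weighting is meant to produce.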

Example  

Data measured on $n = 3{,}539$ participants who attended the seventh examination of the Offspring in the Framingham Heart Study are shown below.

| | Men | | | Women | | |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Characteristic | n | $\bar{X}$ | s | n | $\bar{X}$ | s |
| Systolic Blood Pressure | 1,623 | 128.2 | 17.5 | 1,911 | 126.5 | 20.1 |
| Diastolic Blood Pressure | 1,622 | 75.6 | 9.8 | 1,910 | 72.6 | 9.7 |
| Total Serum Cholesterol | 1,544 | 192.4 | 35.2 | 1,766 | 207.1 | 36.7 |
| Weight | 1,612 | 194.0 | 33.8 | 1,894 | 157.7 | 34.6 |
| Height | 1,545 | 68.9 | 2.7 | 1,781 | 63.4 | 2.5 |
| Body Mass Index | 1,545 | 28.8 | 4.6 | 1,781 | 27.6 | 5.9 |

Suppose we now wish to assess whether there is a statistically significant difference in mean systolic blood pressures between men and women using a 5% level of significance.
  • Step 1. Set up hypotheses and determine level of significance
$H_0: \mu_1 = \mu_2$, $H_1: \mu_1 \neq \mu_2$, $\alpha = 0.05$
  • Step 2. Select the appropriate test statistic.
Because both samples are large ($\geq 30$), we can use the $Z$ test statistic as opposed to $t$. Note that statistical computing packages use $t$ throughout. Before implementing the formula, we first check whether the assumption of equality of population variances is reasonable. The guideline suggests investigating the ratio of the sample variances, $s_1^2 / s_2^2$. Suppose we call the men group 1 and the women group 2. Again, this is arbitrary; it only needs to be noted when interpreting the results. The ratio of the sample variances is $17.5^2 / 20.1^2 = 0.76$, which falls between 0.5 and 2, suggesting that the assumption of equality of population variances is reasonable. The appropriate test statistic is
$$Z=\frac{\bar{X}_1-\bar{X}_2}{S_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}$$
  • Step 3. Set up decision rule.
This is a two-tailed test, using a $Z$ statistic and a 5% level of significance. Reject $H_0$ if $Z \leq -1.960$ or if $Z \geq 1.960$.
  • Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic identified in Step 2. Before substituting, we will first compute $S_p$, the pooled estimate of the common standard deviation.
$$S_p=\sqrt{\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}}=\sqrt{\frac{(1623-1)17.5^2+(1911-1)20.1^2}{1623+1911-2}}=\sqrt{359.12}=19.0$$
Notice that the pooled estimate of the common standard deviation, $S_p$, falls in between the standard deviations in the comparison groups (i.e., 17.5 and 20.1). $S_p$ is slightly closer in value to the standard deviation in the women (20.1) as there were slightly more women in the sample. Recall that $S_p$ is based on a weighted average of the variances in the comparison groups, weighted by the respective degrees of freedom ($n_1 - 1$ and $n_2 - 1$).
Now the test statistic:
$$Z=\frac{128.2-126.5}{19.0\sqrt{\frac{1}{1623}+\frac{1}{1911}}}=\frac{1.7}{0.64}=2.66$$

  • Step 5. Conclusion.
We reject $H_0$ because $2.66 \geq 1.960$. We have statistically significant evidence at $\alpha = 0.05$ to show that there is a difference in mean systolic blood pressures between men and women. The $p$-value is $p < 0.010$.
Here again we find that there is a statistically significant difference in mean systolic blood pressures between men and women at $p < 0.010$. Notice that there is a very small difference in the sample means ($128.2 - 126.5 = 1.7$ units), but this difference is beyond what would be expected by chance. Is this a clinically meaningful difference? The large sample size in this example is driving the statistical significance. A 95% confidence interval for the difference in mean systolic blood pressures is: $1.7 \pm 1.26$ or (0.44, 2.96). The confidence interval provides an assessment of the magnitude of the difference between means whereas the test of hypothesis and $p$-value provide an assessment of the statistical significance of the difference.
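The systolic blood pressure comparison can be reproduced from the summary statistics alone. The sketch below is illustrative and not part of the original module: `pooled_two_sample` is a hypothetical helper name, and `statistics.NormalDist` from the Python standard library is assumed for the normal tail area and the 1.960 critical value. The code keeps unrounded intermediate values, so its results agree with the hand calculation only to about two decimal places.

```python
from math import sqrt
from statistics import NormalDist


def pooled_two_sample(n1, mean1, s1, n2, mean2, s2):
    """Return (Sp, standard error, test statistic) assuming equal variances."""
    sp = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    se = sp * sqrt(1 / n1 + 1 / n2)
    return sp, se, (mean1 - mean2) / se


# Group 1 = men, group 2 = women (systolic blood pressure row of the table).
sp, se, z = pooled_two_sample(1623, 128.2, 17.5, 1911, 126.5, 20.1)

alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)        # about 1.960
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))    # area in both tails
reject_h0 = abs(z) >= z_crit

# 95% confidence interval for the difference in means.
diff = 128.2 - 126.5
ci = (diff - z_crit * se, diff + z_crit * se)

print(f"Sp = {sp:.1f}, Z = {z:.2f}, reject H0: {reject_h0}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

Reporting the interval alongside the test makes the contrast in the text concrete: the test says the difference is unlikely to be zero, while the interval shows how small that difference may be.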
Above we performed a study to evaluate a new drug designed to lower total cholesterol. The study involved one sample of patients; each patient took the new drug for 6 weeks and had their cholesterol measured. As a means of evaluating the efficacy of the new drug, the mean total cholesterol following 6 weeks of treatment was compared to 203, the NCHS-reported mean total cholesterol level for all adults in 2002. At the end of the example, we discussed the appropriateness of the fixed comparator as well as an alternative study design to evaluate the effect of the new drug involving two treatment groups, where one group receives the new drug and the other does not. Here, we revisit the example with a concurrent or parallel control group, which is very typical in randomized controlled trials or clinical trials (refer to the EP713 module on Clinical Trials).

Example  

A new drug is proposed to lower total cholesterol. A randomized controlled trial is designed to evaluate the efficacy of the medication in lowering cholesterol. Thirty participants are enrolled in the trial and are randomly assigned to receive either the new drug or a placebo. The participants do not know which treatment they are assigned. Each participant is asked to take the assigned treatment for 6 weeks. At the end of 6 weeks, each patient’s total cholesterol level is measured and the sample statistics are as follows.

| Treatment | Sample Size | Mean | Standard Deviation |
| :--- | :---: | :---: | :---: |
| New Drug | 15 | 195.9 | 28.7 |
| Placebo | 15 | 217.4 | 30.3 |

Is there statistical evidence of a reduction in mean total cholesterol in patients taking the new drug for 6 weeks as compared to participants taking placebo? We will run the test using the five-step approach.
  • Step 1. Set up hypotheses and determine level of significance
$H_0: \mu_1 = \mu_2$, $H_1: \mu_1 < \mu_2$, $\alpha = 0.05$
  • Step 2. Select the appropriate test statistic.
Because both samples are small ($< 30$), we use the $t$ test statistic. Before implementing the formula, we first check whether the assumption of equality of population variances is reasonable. The ratio of the sample variances is $s_1^2 / s_2^2 = 28.7^2 / 30.3^2 = 0.90$, which falls between 0.5 and 2, suggesting that the assumption of equality of population variances is reasonable. The appropriate test statistic is:
$$t=\frac{\bar{X}_1-\bar{X}_2}{S_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}$$
  • Step 3. Set up decision rule.
This is a lower-tailed test, using a $t$ statistic and a 5% level of significance. The appropriate critical value can be found in the $t$ table. In order to determine the critical value of $t$ we need degrees of freedom, $df$, defined as $df = n_1 + n_2 - 2 = 15 + 15 - 2 = 28$. The critical value for a lower-tailed test with $df = 28$ and $\alpha = 0.05$ is -1.701 and the decision rule is: Reject $H_0$ if $t \leq -1.701$.
  • Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic identified in Step 2. Before substituting, we will first compute $S_p$, the pooled estimate of the common standard deviation.
$$S_p=\sqrt{\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}}=\sqrt{\frac{(15-1)28.7^2+(15-1)30.3^2}{15+15-2}}=\sqrt{870.89}=29.5$$
Now the test statistic,
$$t=\frac{195.9-217.4}{29.5\sqrt{\frac{1}{15}+\frac{1}{15}}}=\frac{-21.5}{10.77}=-2.00$$
  • Step 5. Conclusion.
We reject $H_0$ because $-2.00 \leq -1.701$. We have statistically significant evidence at $\alpha = 0.05$ to show that the mean total cholesterol level is lower in patients taking the new drug for 6 weeks as compared to patients taking placebo, $p < 0.05$.
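The small-sample calculation can be checked from the summary table with the short Python sketch below. It is illustrative only: `pooled_t` is a hypothetical helper name, the inputs are the table values (new drug mean 195.9, placebo mean 217.4), and the one-tailed 5% critical value for 28 degrees of freedom, 1.701, is taken from a standard $t$ table as an assumption because the Python standard library has no $t$ distribution.

```python
from math import sqrt


def pooled_t(n1, mean1, s1, n2, mean2, s2):
    """Return (t statistic, degrees of freedom, Sp) assuming equal variances."""
    df = n1 + n2 - 2
    sp = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
    t = (mean1 - mean2) / (sp * sqrt(1 / n1 + 1 / n2))
    return t, df, sp


# Group 1 = new drug, group 2 = placebo (values from the summary table).
t, df, sp = pooled_t(15, 195.9, 28.7, 15, 217.4, 30.3)

# Assumed tabulated value: lower-tailed test, alpha = 0.05, df = 28.
t_crit = -1.701
reject_h0 = t <= t_crit

print(f"Sp = {sp:.1f}, t = {t:.2f}, df = {df}, reject H0: {reject_h0}")
```

A statistical package would report an exact p-value in place of the table lookup; the decision at the 5% level is the same either way.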
The clinical trial in this example finds a statistically significant reduction in total cholesterol, whereas in the previous example where we had a historical control (as opposed to a parallel control group) we did not demonstrate efficacy of the new drug. Notice that the mean total cholesterol level in patients taking placebo is 217.4 which is very different from the mean cholesterol reported among all Americans in 2002 of 203 and used as the comparator in the prior example. The historical control value may not have been the most appropriate comparator as cholesterol levels have been increasing over time. In the next section, we present another design that can be used to assess the efficacy of the new drug.

Tests with Matched Samples, Continuous Outcome

In the previous section we compared two groups with respect to their mean scores on a continuous outcome. An alternative study design is to compare matched or paired samples. The two comparison groups are said to be dependent, and the data can arise from a single sample of participants where each participant is measured twice (possibly before and after an intervention) or from two samples that are matched on specific characteristics (e.g., siblings). When the samples are dependent, we focus on difference scores in each participant or between members of a pair and the test of hypothesis is based on the mean difference, $\mu_d$. The null hypothesis again reflects "no difference" and is stated as $H_0: \mu_d = 0$. Note that there are some instances where it is of interest to test whether there is a difference of a particular magnitude (e.g., $\mu_d = 5$) but in most instances the null hypothesis reflects no difference (i.e., $\mu_d = 0$).
The appropriate formula for the test of hypothesis depends on the sample size. The formulas are shown below and are identical to those we presented for the mean of a single sample (e.g., when comparing against an external or historical control), except here we focus on difference scores.

| Test Statistics for Testing $H_0: \mu_d = 0$ | |
| :---: | :---: |
| if $n \geq 30$ | if $n < 30$ |
| $Z=\frac{\bar{X}_d-\mu_d}{s_d / \sqrt{n}}$ | $t=\frac{\bar{X}_d-\mu_d}{s_d / \sqrt{n}}$ |
| | where $df = n - 1$ |


Example  

A new drug is proposed to lower total cholesterol, and a study is designed to evaluate the efficacy of the drug in lowering cholesterol. Fifteen patients agree to participate in the study and each is asked to take the new drug for 6 weeks. Before starting the treatment, each patient's total cholesterol level is measured. This initial measurement is a pre-treatment or baseline value. After taking the drug for 6 weeks, each patient's total cholesterol level is measured again, and the data are shown below. The rightmost column contains difference scores for each patient, computed by subtracting the 6 week cholesterol level from the baseline level. The differences represent the reduction in total cholesterol over 6 weeks. (The differences could have been computed by subtracting the baseline total cholesterol level from the level measured at 6 weeks. The way in which the differences are computed does not affect the outcome of the analysis, only the interpretation.)
| Subject Identification Number | Baseline | 6 Weeks | Difference |
| :--- | :--- | :--- | :--- |
| 1 | 215 | 205 | 10 |
| 2 | 190 | 156 | 34 |
| 3 | 230 | 190 | 40 |
| 4 | 220 | 180 | 40 |
| 5 | 214 | 201 | 13 |
| 6 | 240 | 227 | 13 |
| 7 | 210 | 197 | 13 |
| 8 | 193 | 173 | 20 |
| 9 | 210 | 204 | 6 |
| 10 | 230 | 217 | 13 |
| 11 | 180 | 142 | 38 |
| 12 | 260 | 262 | -2 |
| 13 | 210 | 207 | 3 |
| 14 | 190 | 184 | 6 |
| 15 | 200 | 193 | 7 |
Because the differences are computed by subtracting the cholesterol levels measured at 6 weeks from the baseline values, positive differences indicate reductions and negative differences indicate increases (e.g., participant 12 increases by 2 units over 6 weeks). The goal here is to test whether there is a statistically significant reduction in cholesterol. Because of the way in which we computed the differences, we want to look for a mean difference greater than zero (i.e., a positive reduction). In order to conduct the test, we need to summarize the differences. In this sample, we have
$$n = 15, \quad \bar{X}_d = 16.9 \quad \text{and} \quad s_d = 14.2.$$
The calculations are shown below.
| Subject Identification Number | Difference | Difference$^2$ |
| :--- | :--- | :--- |
| 1 | 10 | 100 |
| 2 | 34 | 1156 |
| 3 | 40 | 1600 |
| 4 | 40 | 1600 |
| 5 | 13 | 169 |
| 6 | 13 | 169 |
| 7 | 13 | 169 |
| 8 | 20 | 400 |
| 9 | 6 | 36 |
| 10 | 13 | 169 |
| 11 | 38 | 1444 |
| 12 | -2 | 4 |
| 13 | 3 | 9 |
| 14 | 6 | 36 |
| 15 | 7 | 49 |
| | 254 | 7110 |

$$\bar{X}_d = \frac{\sum \text{Differences}}{n} = \frac{254}{15} = 16.9 \quad \text{and}$$
$$s_d = \sqrt{\frac{\sum \text{Differences}^2 - \left(\sum \text{Differences}\right)^2 / n}{n - 1}} = \sqrt{\frac{7110 - (254)^2 / 15}{14}} = \sqrt{\frac{2808.93}{14}} = \sqrt{200.64} = 14.2$$
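The summary statistics above can be checked with a few lines of Python. This is only a sketch using the standard library; the variable names are illustrative and not part of the original module. It applies the computational formula for $s_d$ shown above and cross-checks it against the library's sample standard deviation.

```python
import math
import statistics

# Difference scores (baseline minus 6-week value) from the table above.
differences = [10, 34, 40, 40, 13, 13, 13, 20, 6, 13, 38, -2, 3, 6, 7]

n = len(differences)
sum_d = sum(differences)                    # sum of the differences
sum_d_sq = sum(d * d for d in differences)  # sum of the squared differences

mean_d = sum_d / n
# Computational formula for the sample standard deviation used in the text.
sd_d = math.sqrt((sum_d_sq - sum_d ** 2 / n) / (n - 1))

# Cross-check against the standard library's sample standard deviation.
assert math.isclose(sd_d, statistics.stdev(differences))

print(f"n = {n}, mean = {mean_d:.1f}, sd = {sd_d:.1f}")
```

Rounded to one decimal place, the script reports the same mean and standard deviation as the hand calculation.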
Is there statistical evidence of a reduction in mean total cholesterol in patients after using the new medication for 6 weeks? We will run the test using the five-step approach.
  • Step 1. Set up hypotheses and determine level of significance
$$H_0: \mu_d = 0 \qquad H_1: \mu_d > 0 \qquad \alpha = 0.05$$
NOTE: If we had computed differences by subtracting the baseline level from the level measured at 6 weeks, then negative differences would have reflected reductions and the research hypothesis would have been $H_1: \mu_d < 0$.
  • Step 2. Select the appropriate test statistic.
Because the sample size is small ($n < 30$), the appropriate test statistic is
$$t = \frac{\bar{X}_d - \mu_d}{s_d / \sqrt{n}}$$
  • Step 3. Set up decision rule.
This is an upper-tailed test, using a t statistic and a 5% level of significance. The appropriate critical value can be found in the t table, with $df = 15 - 1 = 14$. The critical value for an upper-tailed test with $df = 14$ and $\alpha = 0.05$ is 1.761, and the decision rule is: Reject $H_0$ if $t \geq 1.761$.
  • Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic identified in Step 2.
$$t = \frac{\bar{X}_d - \mu_d}{s_d / \sqrt{n}} = \frac{16.9 - 0}{14.2 / \sqrt{15}} = 4.61$$
  • Step 5. Conclusion.
We reject $H_0$ because $4.61 \geq 1.761$. We have statistically significant evidence at $\alpha = 0.05$ to show that there is a reduction in cholesterol levels over 6 weeks.
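The five steps can also be traced in code. The sketch below is illustrative only: the function name `paired_t_statistic` is hypothetical, and the critical value is hard-coded from a standard t table (upper-tail area 0.05, 14 degrees of freedom) rather than computed, since the Python standard library has no t distribution.

```python
import math


def paired_t_statistic(mean_d, sd_d, n, mu_0=0.0):
    """Return (t, df) for a test on the mean difference of paired data.

    t = (mean_d - mu_0) / (sd_d / sqrt(n)), with df = n - 1.
    """
    standard_error = sd_d / math.sqrt(n)
    return (mean_d - mu_0) / standard_error, n - 1


# Summary statistics as rounded in the text.
t_stat, df = paired_t_statistic(mean_d=16.9, sd_d=14.2, n=15)

# Assumption: tabled one-sided critical value for alpha = 0.05, df = 14.
T_CRITICAL = 1.761

# Upper-tailed decision rule: reject H0 if t >= critical value.
reject_h0 = t_stat >= T_CRITICAL
print(f"t = {t_stat:.2f}, df = {df}, reject H0: {reject_h0}")
```

Had the differences been computed as 6-week value minus baseline, `mean_d` would change sign, the test would be lower-tailed, and the rule would compare `t_stat` against the negative of the critical value.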
Here we illustrate the use of a matched design to test the efficacy of a new drug to lower total cholesterol. We also considered a parallel design (randomized clinical trial) and a study using a historical comparator. It is extremely important to design studies that are best suited to detect a meaningful difference when one exists. There are often several alternatives, and investigators work with biostatisticians to determine the best design for each application. It is worth noting that the matched design used here can be problematic in that observed differences may only reflect a "placebo" effect. All participants took the assigned medication, but is the observed reduction attributable to the medication or a result of their participation in a study?

Tests with Two Independent Samples, Dichotomous Outcome

Here we consider the situation where there are two independent comparison groups and the outcome of interest is dichotomous (e.g., success/failure). The goal of the analysis is to compare proportions of successes between the two groups. The relevant sample data are the sample sizes in each comparison group ($n_1$ and $n_2$) and the sample proportions ($\hat{p}_1$ and $\hat{p}_2$), which are computed by taking the ratios of the numbers of successes to the sample sizes in each group, i.e.,
$$\hat{p}_1 = \frac{x_1}{n_1} \qquad \hat{p}_2 = \frac{x_2}{n_2}.$$
There are several approaches that can be used to test hypotheses concerning two independent proportions. Here we present one approach; the chi-square test of independence is an alternative, equivalent, and perhaps more popular approach to the same analysis. Hypothesis testing with the chi-square test is addressed in the third module in this series: BS704_HypothesisTestingChiSquare.
In tests of hypothesis comparing proportions between two independent groups, one test is performed and results can be interpreted to apply to a risk difference, relative risk or odds ratio. As a reminder, the risk difference is computed by taking the difference in proportions between comparison groups, the risk ratio is computed by taking the ratio of proportions, and the odds ratio is computed by taking the ratio of the odds of success in the comparison groups. Because the null values for the risk difference, the risk ratio and the odds ratio are different, the hypotheses in tests of hypothesis look slightly different depending on which measure is used. When performing tests of hypothesis for the risk difference, relative risk or odds ratio, the convention is to label the exposed or treated group 1 and the unexposed or control group 2.
For example, suppose a study is designed to assess whether there is a significant difference in proportions in two independent comparison groups. The test of interest is as follows:

$H_0: p_1 = p_2$ versus $H_1: p_1 \neq p_2$.

The following are the hypotheses for testing for a difference in proportions using the risk difference, the risk ratio and the odds ratio. First, the hypotheses above are equivalent to the following:
  • For the risk difference, $H_0: p_1 - p_2 = 0$ versus $H_1: p_1 - p_2 \neq 0$, which are, by definition, equal to $H_0: RD = 0$ versus $H_1: RD \neq 0$.
  • If an investigator wants to focus on the risk ratio, the equivalent hypotheses are $H_0: RR = 1$ versus $H_1: RR \neq 1$.
  • If the investigator wants to focus on the odds ratio, the equivalent hypotheses are $H_0: OR = 1$ versus $H_1: OR \neq 1$.
Suppose a test is performed to test $H_0: RD = 0$ versus $H_1: RD \neq 0$ and the test rejects $H_0$ at $\alpha = 0.05$. Based on this test we can conclude that there is significant evidence, at $\alpha = 0.05$, of a difference in proportions, significant evidence that the risk difference is not zero, and significant evidence that the risk ratio and odds ratio are not one. The risk difference is analogous to the difference in means when the outcome is continuous. Here the parameter of interest is the difference in proportions in the population, $RD = p_1 - p_2$, and the null value for the risk difference is zero. In a test of hypothesis for the risk difference, the null hypothesis is always $H_0: RD = 0$. This is equivalent to $H_0: RR = 1$ and $H_0: OR = 1$.
In the research hypothesis, an investigator can hypothesize that the first proportion is larger than the second ($H_1: p_1 > p_2$, which is equivalent to $H_1: RD > 0$, $H_1: RR > 1$ and $H_1: OR > 1$), that the first proportion is smaller than the second ($H_1: p_1 < p_2$, which is equivalent to $H_1: RD < 0$, $H_1: RR < 1$ and $H_1: OR < 1$), or that the proportions are different ($H_1: p_1 \neq p_2$, which is equivalent to $H_1: RD \neq 0$, $H_1: RR \neq 1$ and $H_1: OR \neq 1$). The three different alternatives represent upper-, lower- and two-tailed tests, respectively.
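A small numerical sketch may help show why the three sets of hypotheses are equivalent. The function name `effect_measures` and the counts below are hypothetical and chosen only for illustration. When the two proportions are equal, all three measures sit at their null values at once, and when $p_1 > p_2$ all three move in the same direction.

```python
def effect_measures(x1, n1, x2, n2):
    """Return (risk difference, risk ratio, odds ratio), group 1 vs group 2."""
    p1, p2 = x1 / n1, x2 / n2
    risk_difference = p1 - p2
    risk_ratio = p1 / p2
    odds_ratio = (p1 / (1 - p1)) / (p2 / (1 - p2))
    return risk_difference, risk_ratio, odds_ratio


# Hypothetical counts with equal proportions (p1 = p2 = 0.20).
rd_eq, rr_eq, or_eq = effect_measures(20, 100, 40, 200)
print(f"equal:  RD = {rd_eq:.2f}, RR = {rr_eq:.2f}, OR = {or_eq:.2f}")

# Hypothetical counts with p1 = 0.30 > p2 = 0.20.
rd_hi, rr_hi, or_hi = effect_measures(30, 100, 40, 200)
print(f"higher: RD = {rd_hi:.2f}, RR = {rr_hi:.2f}, OR = {or_hi:.2f}")
```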
The formula for the test of hypothesis for the difference in proportions is given below.

Test Statistic for Testing $H_0: p_1 = p_2$
$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
where $\hat{p}_1$ is the proportion of successes in sample 1, $\hat{p}_2$ is the proportion of successes in sample 2, and $\hat{p}$ is the proportion of successes in the pooled sample. $\hat{p}$ is computed by summing all of the successes and dividing by the total sample size,
$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$$
(this is similar to the pooled estimate of the standard deviation, $S_p$, used in two independent samples tests with a continuous outcome; just as $S_p$ is in between $s_1$ and $s_2$, $\hat{p}$ will be in between $\hat{p}_1$ and $\hat{p}_2$).
The formula above is appropriate for large samples, defined as at least 5 successes ($np \geq 5$) and at least 5 failures ($n(1 - p) \geq 5$) in each of the two samples. If there are fewer than 5 successes or failures in either comparison group, then alternative procedures, called exact methods, must be used to estimate the difference in population proportions.
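The adequacy check and the pooled test statistic translate directly into code. The sketch below is one possible rendering; the function names and the counts in the demonstration are hypothetical and not taken from the module.

```python
import math


def sample_size_adequate(x1, n1, x2, n2, minimum=5):
    """True if each group has at least `minimum` successes and failures."""
    return min(x1, n1 - x1, x2, n2 - x2) >= minimum


def two_proportion_z(x1, n1, x2, n2):
    """Pooled Z statistic for H0: p1 = p2."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)  # all successes over total sample size
    standard_error = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    return (p1_hat - p2_hat) / standard_error


# Hypothetical counts: 30 of 100 successes in group 1, 20 of 100 in group 2.
if sample_size_adequate(30, 100, 20, 100):
    z = two_proportion_z(30, 100, 20, 100)
    print(f"Z = {z:.3f}")
else:
    print("Too few successes or failures; an exact method is needed.")
```

Working from counts rather than rounded proportions avoids carrying rounding error into the standard error.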

Example  

The following table summarizes data from $n = 3{,}799$ participants who attended the fifth examination of the Offspring in the Framingham Heart Study. The outcome of interest is prevalent CVD, and we want to test whether the prevalence of CVD is significantly higher in smokers as compared to non-smokers.

| | Free of CVD | History of CVD | Total |
| :--- | :---: | :---: | :---: |
| Non-Smoker | 2,757 | 298 | 3,055 |
| Current Smoker | 663 | 81 | 744 |
| Total | 3,420 | 379 | 3,799 |
The prevalence of CVD (or proportion of participants with prevalent CVD) among non-smokers is $298 / 3{,}055 = 0.0975$ and the prevalence of CVD among current smokers is $81 / 744 = 0.1089$. Here smoking status defines the comparison groups, and we will call the current smokers group 1 (exposed) and the non-smokers group 2 (unexposed). The test of hypothesis is conducted below using the five-step approach.
  • Step 1. Set up hypotheses and determine level of significance
$$H_0: p_1 = p_2 \qquad H_1: p_1 \neq p_2 \qquad \alpha = 0.05$$
  • Step 2. Select the appropriate test statistic.
We must first check that the sample size is adequate. Specifically, we need to ensure that we have at least 5 successes and 5 failures in each comparison group. In this example, we have more than enough successes (cases of prevalent CVD) and failures (persons free of CVD) in each comparison group. The sample size is more than adequate, so the following formula can be used:
$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
  • Step 3. Set up decision rule.
Reject $H_0$ if $Z \leq -1.960$ or if $Z \geq 1.960$.
  • Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic identified in Step 2. We first compute the overall proportion of successes:
$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2} = \frac{81 + 298}{744 + 3055} = \frac{379}{3799} = 0.0998$$
We now substitute to compute the test statistic.
$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{0.1089 - 0.0975}{\sqrt{0.0998(1 - 0.0998)\left(\frac{1}{744} + \frac{1}{3055}\right)}} = \frac{0.0114}{0.0123} = 0.927$$
  • Step 5. Conclusion.
We do not reject $H_0$ because $-1.960 < 0.927 < 1.960$. We do not have statistically significant evidence at $\alpha = 0.05$ to show that there is a difference in prevalent CVD between smokers and non-smokers.
A 95% confidence interval for the difference in prevalent CVD (or risk difference) between smokers and non-smokers is $0.0114 \pm 0.0247$, or between -0.0133 and 0.0361. Because the 95% confidence interval for the risk difference includes zero, we again conclude that there is no statistically significant difference in prevalent CVD between smokers and non-smokers.
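The test and the confidence interval for this example can be reproduced from the counts in the table. The sketch below makes two assumptions that are not spelled out in this section: the interval uses the usual unpooled standard error (which does reproduce the margin of about 0.0247), and the normal critical value comes from Python's `statistics.NormalDist`. Because the code carries unrounded proportions, its Z is close to 0.92 rather than exactly the 0.927 obtained by hand from rounded intermediate values.

```python
import math
from statistics import NormalDist

# Group 1 = current smokers (exposed), group 2 = non-smokers (unexposed).
x1, n1 = 81, 744
x2, n2 = 298, 3055

p1_hat, p2_hat = x1 / n1, x2 / n2
risk_difference = p1_hat - p2_hat

# Test statistic: pooled standard error, valid under H0: p1 = p2.
p_pooled = (x1 + x2) / (n1 + n2)
se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z = risk_difference / se_pooled

# Confidence interval: unpooled standard error (assumed, see lead-in).
se_unpooled = math.sqrt(
    p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2
)
z_critical = NormalDist().inv_cdf(0.975)  # two-sided 5% level, about 1.960
margin = z_critical * se_unpooled
ci_lower, ci_upper = risk_difference - margin, risk_difference + margin

reject_h0 = abs(z) >= z_critical
print(f"Z = {z:.3f}, reject H0: {reject_h0}")
print(f"95% CI for RD: ({ci_lower:.4f}, {ci_upper:.4f})")
```

The interval straddles zero exactly when the two-sided test fails to reject in this example, which is the agreement the text points out.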
Smoking has been shown over and over to be a risk factor for cardiovascular disease. What might explain the fact that we did not observe a statistically significant difference using data from the Framingham Heart Study? HINT: Here we consider prevalent CVD, would the results have been different if we considered incident CVD?

Example  

A randomized trial is designed to evaluate the effectiveness of a newly developed pain reliever designed to reduce pain in patients following joint replacement surgery. The trial compares the new pain reliever to the pain reliever currently in use (called the standard of care). A total of 100 patients undergoing joint replacement surgery agreed to participate in the trial. Patients were randomly assigned to receive either the new pain reliever or the standard pain reliever following surgery and were blind to the treatment assignment. Before receiving the assigned treatment, patients were asked to rate their pain on a scale of 0-10 with higher scores indicative of more pain. Each patient was then given the assigned treatment and after 30 minutes was again asked to rate their pain on the same scale. The primary outcome was a reduction in pain of 3 or more scale points (defined by clinicians as a clinically meaningful reduction). The following data were observed in the trial.
| Treatment Group | $n$ | Number with Reduction of 3+ Points | Proportion with Reduction of 3+ Points |
| :--- | :---: | :---: | :---: |
| New Pain Reliever | 50 | 23 | 0.46 |
| Standard Pain Reliever | 50 | 11 | 0.22 |
We now test whether there is a statistically significant difference in the proportions of patients reporting a meaningful reduction (i.e., a reduction of 3 or more scale points) using the five step approach.
  • Step 1. Set up hypotheses and determine level of significance
$$H_0: p_1 = p_2 \qquad H_1: p_1 \neq p_2 \qquad \alpha = 0.05$$
Here the new or experimental pain reliever is group 1 and the standard pain reliever is group 2.
  • Step 2. Select the appropriate test statistic.
We must first check that the sample size is adequate. Specifically, we need to ensure that we have at least 5 successes and 5 failures in each comparison group, i.e.,
$$\min\left(n_1 \hat{p}_1,\ n_1(1 - \hat{p}_1),\ n_2 \hat{p}_2,\ n_2(1 - \hat{p}_2)\right) \geq 5$$
In this example, we have $\min(50(0.46),\ 50(1 - 0.46),\ 50(0.22),\ 50(1 - 0.22)) = \min(23, 27, 11, 39) = 11$. The sample size is adequate, so the following formula can be used:
$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
  • Step 3. Set up decision rule.
Reject $H_0$ if $Z \leq -1.960$ or if $Z \geq 1.960$.
  • Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic identified in Step 2. We first compute the overall proportion of successes:
$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2} = \frac{23 + 11}{50 + 50} = \frac{34}{100} = 0.34$$
We now substitute to compute the test statistic.
$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{0.46 - 0.22}{\sqrt{0.34(1 - 0.34)\left(\frac{1}{50} + \frac{1}{50}\right)}} = \frac{0.24}{0.095} = 2.526$$
  • Step 5. Conclusion.
We reject $H_{0}$ because $2.526 \geq 1.960$. We have statistically significant evidence at $\alpha=0.05$ to show that the proportion of patients reporting a meaningful reduction (i.e., a reduction of 3 or more scale points) differs between patients on the new pain reliever and patients on the standard pain reliever.
A 95% confidence interval for the difference in proportions of patients on the new pain reliever reporting a meaningful reduction (i.e., a reduction of 3 or more scale points) as compared to patients on the standard pain reliever is $0.24 \pm 0.18$, or between 0.06 and 0.42. Because the 95% confidence interval does not include zero, we conclude that there is a statistically significant difference in proportions, which is consistent with the result of the test of hypothesis.
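The interval quoted above can be reproduced with a short sketch. It assumes the usual large-sample interval for a difference in proportions, which uses the unpooled standard error $\sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}$; that formula is not shown in this section, so treat it as an assumption. The function name is illustrative.

```python
import math
from statistics import NormalDist


def diff_proportions_ci(x1, n1, x2, n2, confidence=0.95):
    """Large-sample confidence interval for p1 - p2 (unpooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    margin = z * se
    diff = p1 - p2
    return diff - margin, diff + margin


low, high = diff_proportions_ci(23, 50, 11, 50)
print(f"95% CI: {low:.2f} to {high:.2f}")
# prints: 95% CI: 0.06 to 0.42
```

Since the lower limit is above zero, the interval leads to the same conclusion as the test.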
Again, the procedures discussed here apply to situations with two independent comparison groups and a dichotomous outcome. There are other applications in which it is of interest to compare a dichotomous outcome in matched or paired samples. For example, in a clinical trial we might wish to test the effectiveness of a new antibiotic eye drop for the treatment of bacterial conjunctivitis. Participants use the new antibiotic eye drop in one eye and a comparator (placebo or active control treatment) in the other. The success of the treatment (yes/no) is recorded for each participant for each eye. Because the two assessments (success or failure) are paired, we cannot use the procedures discussed here. The appropriate test is called McNemar’s test (sometimes called McNemar’s test for dependent proportions).
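For orientation only, the usual form of McNemar's statistic is given below. The notation is not from this module: $b$ and $c$ are assumed here to denote the numbers of discordant pairs (success in one eye only, and success in the other eye only). Pairs with the same result in both eyes do not enter the statistic.

$$\chi^{2}=\frac{(b-c)^{2}}{b+c}$$

Under $H_{0}$, this statistic is compared to a chi-squared distribution with 1 degree of freedom.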

Summary

Here we presented hypothesis testing techniques for means and proportions in one and two sample situations. Tests of hypothesis involve several steps, including specifying the null and alternative or research hypothesis, selecting and computing an appropriate test statistic, setting up a decision rule and drawing a conclusion. There are many details to consider in hypothesis testing. The first is to determine the appropriate test. We discussed $Z$ and $t$ tests here for different applications. The appropriate test depends on the distribution of the outcome variable (continuous or dichotomous), the number of comparison groups (one, two) and whether the comparison groups are independent or dependent. The following table summarizes the different tests of hypothesis discussed here.
| Outcome Variable, Number of Groups: Null Hypothesis | Test Statistic |
| :--- | :--- |
| Continuous Outcome, One Sample: $H_{0}: \mu=\mu_{0}$ | $Z=\frac{\bar{X}-\mu_{0}}{s / \sqrt{n}}$ |
| Continuous Outcome, Two Independent Samples: $H_{0}: \mu_{1}=\mu_{2}$ | $Z=\frac{\bar{X}_{1}-\bar{X}_{2}}{S_{p} \sqrt{\frac{1}{n_{1}}+\frac{1}{n_{2}}}}$ |
| Continuous Outcome, Two Matched Samples: $H_{0}: \mu_{d}=0$ | $Z=\frac{\bar{X}_{d}-\mu_{d}}{s_{d} / \sqrt{n}}$ |
| Dichotomous Outcome, One Sample: $H_{0}: p=p_{0}$ | $Z=\frac{\hat{p}-p_{0}}{\sqrt{\frac{p_{0}(1-p_{0})}{n}}}$ |
| Dichotomous Outcome, Two Independent Samples: $H_{0}: p_{1}=p_{2}$, $RD=0$, $RR=1$, $OR=1$ | $Z=\frac{\hat{p}_{1}-\hat{p}_{2}}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_{1}}+\frac{1}{n_{2}}\right)}}$ |
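As a rough illustration, the five test statistics in the table can be written as small Python functions. The function and argument names are illustrative. The pooled standard deviation $S_p$ is taken as an input rather than computed, because its formula does not appear in this summary, and the numbers in the usage lines are hypothetical apart from the pain reliever data.

```python
import math


def one_sample_mean(xbar, mu0, s, n):
    """Continuous outcome, one sample: H0: mu = mu0."""
    return (xbar - mu0) / (s / math.sqrt(n))


def two_independent_means(xbar1, xbar2, sp, n1, n2):
    """Continuous outcome, two independent samples: H0: mu1 = mu2.

    sp is the pooled standard deviation, supplied by the caller.
    """
    return (xbar1 - xbar2) / (sp * math.sqrt(1 / n1 + 1 / n2))


def matched_means(xbar_d, s_d, n, mu_d=0.0):
    """Continuous outcome, two matched samples: H0: mu_d = 0."""
    return (xbar_d - mu_d) / (s_d / math.sqrt(n))


def one_sample_proportion(p_hat, p0, n):
    """Dichotomous outcome, one sample: H0: p = p0."""
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)


def two_independent_proportions(x1, n1, x2, n2):
    """Dichotomous outcome, two independent samples: H0: p1 = p2."""
    p1, p2 = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    return (p1 - p2) / math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))


# Hypothetical one-sample mean: xbar = 105, mu0 = 100, s = 10, n = 25.
print(one_sample_mean(105, 100, 10, 25))
# Pain reliever example from this module: 23/50 versus 11/50.
print(round(two_independent_proportions(23, 50, 11, 50), 2))
```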
Once the type of test is determined, the details of the test must be specified. Specifically, the null and alternative hypotheses must be clearly stated. The null hypothesis always reflects the “no change” or “no difference” situation. The alternative or research hypothesis reflects the investigator’s belief. The investigator might hypothesize that a parameter (e.g., a mean, proportion, difference in means or proportions) will increase, will decrease or will be different under specific conditions (sometimes the conditions are different experimental conditions and other times the conditions are simply different groups of participants). Once the hypotheses are specified, data are collected and summarized. The appropriate test is then conducted according to the five-step approach. If the test leads to rejection of the null hypothesis, an approximate $p$-value is computed to summarize the significance of the findings. When tests of hypothesis are conducted using statistical computing packages, exact $p$-values are computed. Because the statistical tables in this textbook are limited, we can only approximate $p$-values. If the test fails to reject the null hypothesis, then a weaker concluding statement is made for the following reason.
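To illustrate how software arrives at an exact $p$-value for a $Z$ statistic, here is a minimal sketch using only the Python standard library. It relies on the identity $2\,(1-\Phi(|z|)) = \operatorname{erfc}(|z|/\sqrt{2})$ for the two-sided standard normal tail area; the function name is illustrative.

```python
import math


def two_sided_p_value(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    # erfc(|z| / sqrt(2)) equals the total area in both tails beyond |z|.
    return math.erfc(abs(z) / math.sqrt(2))


p = two_sided_p_value(2.526)  # Z from the pain reliever example
print(f"p = {p:.4f}")
```

For the pain reliever example this gives a $p$-value slightly above 0.01, consistent with rejecting $H_{0}$ at $\alpha = 0.05$.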
In hypothesis testing, there are two types of errors that can be committed. A Type I error occurs when a test incorrectly rejects the null hypothesis. This is referred to as a false positive result, and the probability that this occurs is equal to the level of significance, $\alpha$. The investigator chooses the level of significance in Step 1, and purposely chooses a small value such as $\alpha=0.05$ to control the probability of committing a Type I error. A Type II error occurs when a test fails to reject the null hypothesis when in fact it is false. The probability that this occurs is equal to $\beta$. Unfortunately, the investigator cannot specify $\beta$ at the outset because it depends on several factors including the sample size (smaller samples have higher $\beta$), the level of significance ($\beta$ decreases as $\alpha$ increases), and the difference in the parameter under the null and alternative hypothesis.
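The dependence of $\beta$ on these factors can be made concrete with a small sketch. It assumes a two-sided one-sample $Z$ test with a known standard deviation, which is a simplification chosen for illustration; the function name and all numeric values are hypothetical.

```python
import math
from statistics import NormalDist


def type_ii_error(delta, sigma, n, alpha=0.05):
    """Beta for a two-sided one-sample Z test.

    delta is the true difference between the mean and its null value,
    and sigma is the (assumed known) standard deviation.
    """
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)
    shift = delta * math.sqrt(n) / sigma  # mean of Z when H0 is false
    # Power is the probability that Z falls in either rejection region.
    power = (1 - std.cdf(z_crit - shift)) + std.cdf(-z_crit - shift)
    return 1 - power


# Hypothetical scenario: true difference of 5 units, sigma = 20.
for n in (25, 50, 100):
    print(f"n = {n:3d}, alpha = 0.05: beta = {type_ii_error(5, 20, n):.3f}")
for alpha in (0.01, 0.05, 0.10):
    print(f"n =  50, alpha = {alpha:.2f}: beta = {type_ii_error(5, 20, 50, alpha):.3f}")
```

The printed values fall as $n$ grows and as $\alpha$ grows, matching the pattern described above.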
In several examples in this chapter, we noted the relationship between confidence intervals and tests of hypothesis. The approaches are different, yet related. It is possible to draw a conclusion about statistical significance by examining a confidence interval. For example, if a 95% confidence interval does not contain the null value (e.g., zero when analyzing a mean difference or risk difference, one when analyzing relative risks or odds ratios), then one can conclude that a two-sided test of hypothesis would reject the null at $\alpha=0.05$. It is important to note that the correspondence between a confidence interval and test of hypothesis relates to a two-sided test and that the confidence level corresponds to a specific level of significance (e.g., 95% to $\alpha=0.05$, 90% to $\alpha=0.10$ and so on). The exact significance of the test, the $p$-value, can only be determined using the hypothesis testing approach, and the $p$-value provides an assessment of the strength of the evidence and not an estimate of the effect.
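The correspondence can be checked numerically with a short sketch. It assumes that the test and the interval are built from the same estimate and the same standard error, in which case the two decisions agree; when the standard errors differ (for example, a pooled standard error in the test and an unpooled one in the interval for two proportions), the agreement is only approximate. The function names and inputs are illustrative.

```python
from statistics import NormalDist


def z_test_rejects(estimate, null_value, se, alpha=0.05):
    """Two-sided Z test decision at level alpha."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return abs((estimate - null_value) / se) >= z_crit


def ci_excludes_null(estimate, null_value, se, alpha=0.05):
    """True if the (1 - alpha) confidence interval does not contain the null value."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    low, high = estimate - z_crit * se, estimate + z_crit * se
    return not (low < null_value < high)


# Illustrative inputs: a difference of 0.24 or 0.10 with standard error 0.0917.
for estimate in (0.24, 0.10):
    print(estimate,
          z_test_rejects(estimate, 0.0, 0.0917),
          ci_excludes_null(estimate, 0.0, 0.0917))
```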