6 Hypothesis testing
6.1 Sampling distribution
We have shown in Chapter 3 that \(\mathbb{E}\left[\hat{\tau}^{\text{dm}}\mid \mathbf{n}, \mathbf{Y(w)}\right]=\tau\) and in Chapter 5 that the standard error of \(\hat{\tau}^{\text{dm}}\) can be estimated by \(\widehat{SE}= \sqrt{\frac{s_t^2}{n_t} + \frac{s_c^2}{n_c}}\). Furthermore, Imbens and Rubin (2015) and Ding (2023) explain why a normal approximation to the sampling distribution of \(\hat{\tau}^{\text{dm}}\) is justified. Hence, the sampling distribution of \(\hat{\tau}^{\text{dm}}\) is approximately: \[ \hat{\tau}^{\text{dm}} \mid \mathbf{n}, \mathbf{Y(w)} \sim N\left(\tau, \frac{s_t^2}{n_t} + \frac{s_c^2}{n_c}\right). \tag{6.1}\]
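A small simulation can make Equation 6.1 concrete. The sketch below is a minimal illustration, not the book's code: the potential outcomes, sample sizes, and treatment effect are made-up assumptions. It repeatedly re-randomizes treatment in a fixed population and checks that the difference-in-means estimates center on \(\tau\) with spread close to the analytic standard error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite population of potential outcomes (assumed values).
n, n_t = 1000, 500                      # total units and treated units
tau = 2.0                               # assumed true average treatment effect
y0 = rng.normal(10, 3, size=n)          # control potential outcomes Y(0)
y1 = y0 + tau                           # treated potential outcomes Y(1)

# Re-randomize many times and record the difference-in-means estimate.
est = []
for _ in range(10_000):
    treated = rng.permutation(n) < n_t  # complete randomization of n_t units
    est.append(y1[treated].mean() - y0[~treated].mean())
est = np.array(est)

print(f"mean of estimates: {est.mean():.3f} (tau = {tau})")
print(f"sd of estimates:   {est.std():.3f}")

# Analytic standard error from Equation 6.1, using the potential-outcome variances.
se = np.sqrt(y1.var(ddof=1) / n_t + y0.var(ddof=1) / (n - n_t))
print(f"analytic SE:       {se:.3f}")
```

With a constant additive effect, as assumed here, the simulated spread should match the analytic standard error almost exactly; a histogram of `est` would also show the bell shape that the normal approximation predicts.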
6.2 Basic approach
To test whether the observed treatment effect is significantly different from zero, we test: \[ \begin{align} &H_0: \tau = 0 \\[5pt] &H_A: \tau \neq 0 \end{align} \] and calculate the test statistic \[ t = \frac{\hat{\tau}^{\text{dm}}} {\widehat{SE}}, \] which, for large samples, approximately follows a standard normal distribution.
For a given significance level \(\alpha\), the critical value \(z_\alpha\) is the point on the standard normal distribution that has \(\alpha\) of the probability mass to its right. In our two-sided test we thus reject \(H_0\) if \(|t| \geq z_{\alpha/2}\).
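Putting the pieces together, the following sketch carries out the test end to end. The group summaries are assumed, illustrative numbers (not data from the book), and `scipy` is assumed to be available for the normal quantile and tail functions.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical group summaries (assumed numbers, for illustration only).
mean_t, s_t, n_t = 10.4, 3.1, 500   # treatment: mean, sd, sample size
mean_c, s_c, n_c = 10.0, 3.0, 500   # control:   mean, sd, sample size

tau_hat = mean_t - mean_c                          # difference in means
se_hat = np.sqrt(s_t**2 / n_t + s_c**2 / n_c)      # estimated standard error
t = tau_hat / se_hat                               # test statistic

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)    # critical value z_{alpha/2} (1.96 for alpha = 0.05)
p_value = 2 * norm.sf(abs(t))       # two-sided p-value

print(f"t = {t:.3f}, critical value = {z_crit:.3f}, p = {p_value:.4f}")
print("reject H0" if abs(t) >= z_crit else "fail to reject H0")
```

Comparing \(|t|\) against \(z_{\alpha/2}\) and comparing the p-value against \(\alpha\) are equivalent decision rules; the code reports both.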
6.3 Types of errors
With the above procedure, there are two types of mistakes we can make:
We can reject \(H_0\) when it is true. This is called a Type I error and constitutes a false positive. The probability of this error is denoted by \(\alpha\), the significance level. It represents the false positive rate \(\alpha = P(\text{significant result} | H_0\text{ true})\).
We can fail to reject \(H_0\) when it is false. This is called a Type II error and constitutes a false negative. The probability of this type of error is denoted by \(\beta\), and it represents the false negative rate \(\beta = P(\text{insignificant result} | H_A\text{ true})\).
The complements of these two errors are:
The confidence level, the probability that we do not reject \(H_0\) when it is true (a true negative), is given by \(1 - \alpha = P(\text{insignificant result} | H_0\text{ true})\).
Power, the probability that we do reject \(H_0\) when it is false (a true positive), is given by \(1 - \beta = P(\text{significant result} | H_A\text{ true})\).
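Both rates can be checked by simulation. The sketch below uses assumed effect sizes, sample sizes, and outcome distributions (none of which come from the book) to run many experiments under \(H_0\) and under one specific alternative, reporting the empirical Type I error rate and power.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)
n_t = n_c = 200
n_sims = 10_000

def rejection_rate(tau):
    """Fraction of simulated experiments in which H0 is rejected."""
    rejections = 0
    for _ in range(n_sims):
        y_t = rng.normal(tau, 1, n_t)   # treatment outcomes, true effect = tau
        y_c = rng.normal(0, 1, n_c)     # control outcomes
        se = np.sqrt(y_t.var(ddof=1) / n_t + y_c.var(ddof=1) / n_c)
        t = (y_t.mean() - y_c.mean()) / se
        rejections += abs(t) >= z_crit
    return rejections / n_sims

print(f"Type I error rate (tau = 0): {rejection_rate(0.0):.3f}")  # should be close to alpha
print(f"Power (tau = 0.3):           {rejection_rate(0.3):.3f}")  # estimates 1 - beta
```

Under \(H_0\) the rejection rate hovers around \(\alpha = 0.05\); under the assumed alternative the rejection rate is the power of the test for that effect size and sample size.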
One of Neyman’s key insights was that, while we cannot control error rates for a single experiment, we can control them in the long run over many experiments. In practice, this means that if we follow the procedure consistently across many experiments, then in a fraction \(\alpha\) of the cases where a feature has no true effect we will erroneously find a significant result.
An interesting question then arises: out of all the significant results we find, how many are false positives? The answer is given by the false discovery rate, \(P(H_0\text{ true} | \text{significant result})\). Note that the fewer true effects there are among the experiments we run, the larger the share of significant results that are false positives.
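Bayes' rule makes this concrete. The sketch below uses an assumed share of experiments with a true effect, together with typical values for \(\alpha\) and power; all three inputs are illustrative assumptions, not estimates from any real experimentation program.

```python
# False discovery rate via Bayes' rule: P(H0 true | significant result).
# All inputs are assumed, illustrative numbers.
alpha = 0.05     # Type I error rate, P(significant | H0 true)
power = 0.80     # 1 - beta, P(significant | HA true)
p_true = 0.10    # assumed share of experiments where a true effect exists

# Total probability of a significant result, from either source.
p_sig = alpha * (1 - p_true) + power * p_true

# Share of significant results that are false positives.
fdr = alpha * (1 - p_true) / p_sig

print(f"false discovery rate: {fdr:.2f}")
```

With these assumptions the false discovery rate is 0.36: over a third of significant results are false positives, even though \(\alpha\) is only 5%. Halving `p_true` pushes the rate higher still, which is exactly the point made above.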