---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
kernelspec:
  display_name: kulprit
  language: python
  name: python3
---
(hypothesis-testing)=
# Bayesian Hypothesis Testing - an introduction

:::{post} December 2024
:tags: hypothesis testing, bayesian decision theory
:category: beginner
:author: Benjamin T. Vincent
:::

```{code-cell} ipython3
import arviz.preview as az
import matplotlib.pyplot as plt
import numpy as np
import pymc as pm
```

```{code-cell} ipython3
%config InlineBackend.figure_format = 'retina'
az.style.use("arviz-variat")
# We set the credible interval kind and probability globally
az.rcParams["stats.ci_kind"] = "hdi"
az.rcParams["stats.ci_prob"] = 0.95
SEED = 42
rng = np.random.default_rng(SEED)
```
## Introduction

Bayesian hypothesis testing provides a flexible and intuitive way to assess whether parameters differ from specified values. Unlike classical methods focusing on p-values, Bayesian methods let us directly compute probabilities of hypotheses and quantify the strength of evidence in various ways.

In this tutorial, we'll use PyMC to:

* Fit a simple model to synthetic data and obtain samples from the prior and posterior distributions.
* Demonstrate the following Bayesian hypothesis testing methods:
    * Posterior probability statements
    * Highest Density Intervals (HDIs)
    * Regions of Practical Equivalence (ROPE)
    * Bayes Factors

We'll work through a scenario where we want to know if the mean of some metric (e.g., monthly profit) is different from zero.

+++

:::{note}
Many Bayesian practitioners argue that collapsing a rich posterior distribution into a single binary decision (e.g., "reject" or "fail to reject") undermines the essence of Bayesian inference. The Bayesian perspective values the entire posterior as a nuanced representation of uncertainty about parameters. Reducing it to yes/no decisions discards that nuance and may mislead by oversimplifying the uncertainty involved.

However, in real-world scenarios—such as policy-making, resource allocation, or medical decision-making—practitioners often must choose a single course of action. In such cases, translating the posterior into a decision rule or threshold is necessary. The key is to do so transparently and thoughtfully, ideally incorporating utilities, costs, and the full breadth of uncertainty in the decision process.
:::

+++
63+
64+
### Parameter estimation vs model comparison
65+
66+
The Bayesian evaluation of null values can proceed in two distinct ways (see {cite:t}`kruschke2011bayesian`):
67+
68+
#### Parameter estimation
69+
The parameter estimation approach considers a model where the parameter is allowed to vary (which includes the null value). We then compute the posterior distribution of this value and come up with some kind of decision rule which determines if we accept or reject the null value.
70+
71+
#### Model comparison
72+
Two competing models are considered. The first model assumes that the null value is true, or fixed, and the second model allows a range of values. The models are compared to see which is more supported by the data. An example would be in assessing if a coin is fair (null hypothesis) or biased (alternative hypothesis) - we would set up a model where the coin has a fixed probability of heads (0.5) and another model where the probability of heads is a free parameter. Readers are referred to the PyMC examples focussing on {ref}`pymc:model_comparison` for more details.
73+
74+
+++
75+
76+
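To make the coin example concrete, here is a minimal sketch of the model-comparison Bayes Factor, assuming a uniform prior on the bias under the alternative model (the flip counts are invented for illustration):

```python
from math import comb, exp, lgamma


def betaln(a, b):
    """Log of the Beta function, computed via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


# Illustrative data: 7 heads in 10 flips
heads, n = 7, 10

# M0 (fair coin): likelihood of the data with p fixed at 0.5
m0 = comb(n, heads) * 0.5**n

# M1 (biased coin): marginal likelihood with p ~ Uniform(0, 1);
# integrating p out analytically gives C(n, k) * B(k + 1, n - k + 1)
m1 = comb(n, heads) * exp(betaln(heads + 1, n - heads + 1))

bf01 = m0 / m1  # evidence for the fair coin relative to the biased coin
```

Here $BF_{01} \approx 1.29$, so these (invented) data very mildly favor the fair-coin model; values near 1 indicate weak evidence either way.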
## Setting up the example

Rather than focus on a complex example, we'll pick a trivial one where we are simply estimating the mean of a single variable. This will allow us to focus on the hypothesis testing. The important thing is what we do with our MCMC samples, not the particulars of the model.

```{code-cell} ipython3
true_mu = 2.0
true_sigma = 3.0
n = 12

x = rng.normal(loc=true_mu, scale=true_sigma, size=n)

fig, ax = plt.subplots(figsize=(14, 1))
ax.plot(x, np.zeros_like(x), "ro")
ax.set(yticklabels=[], yticks=[], xlabel="x", title="Observations");
```
## Sampling from the prior and posterior

Now we'll build our simple model. Again, the focus here is not on the model of the data as such, but simply obtaining a meaningful prior and posterior distribution. We'll ask for more MCMC samples than we normally do, so that we can get a more accurate approximation of the prior and posterior distributions.

```{code-cell} ipython3
:tags: [hide-output]

with pm.Model() as model:
    # priors
    mu = pm.Normal("mu", mu=0, sigma=2)
    sigma = pm.Gamma("sigma", alpha=2, beta=1)
    # likelihood
    pm.Normal("y", mu=mu, sigma=sigma, observed=x)
    # sample
    idata = pm.sample_prior_predictive(samples=10_000)
    idata.extend(pm.sample(draws=10_000))
```
We didn't get any warnings from the sampling process.

It is good practice to visualise the posterior distribution, so below we'll look at the joint posterior over the $\mu$ and $\sigma$ parameters. Everything looks fine here.

```{code-cell} ipython3
:tags: [hide-input]

pc = az.plot_pair(idata, var_names=["mu", "sigma"])
az.add_lines(
    pc, {"mu": idata.posterior["mu"].mean().item(), "sigma": idata.posterior["sigma"].mean().item()}
);
```

Finally, seeing as $\mu$ is the core parameter of interest, we'll visualise both the prior and posterior distributions for $\mu$.

```{code-cell} ipython3
az.plot_prior_posterior(idata, var_names="mu");
```
## Bayesian hypothesis testing methods

### Posterior probability statements

The simplest form of hypothesis testing is to ask whether the mean is greater than zero, i.e. whether the probability $P(\mu>0 \mid x)$ exceeds 0.5. We can compute this directly from the samples: count the number of samples where $\mu>0$ and divide by the total number of samples, which is equivalent to taking the mean of the boolean indicator $\mu>0$ over the samples.

```{code-cell} ipython3
mu_samples = idata.posterior["mu"].values
p_mu_greater_0 = np.mean(mu_samples > 0)
p_mu_greater_0
```
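To see the counting logic in isolation, here is a tiny self-contained example with made-up draws (illustrative values only, not our actual posterior):

```python
import numpy as np

# Hypothetical posterior draws, for illustration only
draws = np.array([-0.5, 0.2, 1.1, 1.8, 2.4, -0.1, 0.9, 3.0])

# P(mu > 0 | x) is approximated by the fraction of draws above zero,
# i.e. the mean of the boolean indicator array
p_gt_0 = (draws > 0).mean()
print(p_gt_0)  # 0.75, i.e. 6 of the 8 draws
```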
We can also include such information in a visual plot, here using `az.plot_dist` to show the posterior ECDF with a reference line at zero.

```{code-cell} ipython3
pc = az.plot_dist(idata, var_names=["mu"], kind="ecdf", visuals={"credible_interval": False})
az.add_lines(pc, 0);
```

It could also be that we have some kind of minimum meaningful threshold that we care about. For example, we might care about whether the mean is greater than 0.1. We can compute this in the same way.

```{code-cell} ipython3
p_mu_greater = np.mean(mu_samples > 0.1)
p_mu_greater
```
### Highest Density Intervals (HDIs)

We can define an infinite number of intervals that contain a given probability mass (e.g., 95%). However, some intervals are more informative than others. The Highest Density Interval (HDI) is the narrowest interval that contains a given probability mass. This means that every point inside the HDI has a higher probability density than any point outside the HDI. If zero is outside the HDI, it's unlikely the parameter is near zero.

```{code-cell} ipython3
hdi_mu = az.hdi(idata.posterior["mu"]).values
hdi_mu.round(2)
```
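To build intuition for what `az.hdi` computes, here is a rough sketch of the standard sorted-samples algorithm (ArviZ's internals may differ in detail):

```python
import numpy as np


def hdi_from_samples(samples, prob=0.95):
    """Narrowest interval containing roughly `prob` of the samples."""
    sorted_s = np.sort(np.asarray(samples))
    n = len(sorted_s)
    n_in = int(np.ceil(prob * n))  # number of samples the interval must span
    # widths of every candidate interval spanning n_in consecutive samples
    widths = sorted_s[n_in - 1 :] - sorted_s[: n - n_in + 1]
    i = int(np.argmin(widths))
    return sorted_s[i], sorted_s[i + n_in - 1]


# Toy check on draws from a normal centred at 1
rng = np.random.default_rng(0)
toy = rng.normal(loc=1.0, scale=1.0, size=10_000)
lo, hi = hdi_from_samples(toy, prob=0.95)  # roughly (1 - 1.96, 1 + 1.96)
```

For a symmetric, unimodal posterior the HDI nearly coincides with the equal-tailed interval; the difference matters for skewed posteriors, where the HDI shifts toward the bulk of the density.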
In this case, zero is within the HDI, so based on this measure, we can't express much confidence that the mean is different from zero.

+++

Again, we can use `az.plot_dist` to visualise the posterior along with its HDI.

```{code-cell} ipython3
az.plot_dist(idata, var_names="mu");
```
### Region of Practical Equivalence (ROPE)

If the probability that the parameter is within a certain range is high, we can say that the parameter is practically equivalent to the values in that range. This is a useful way to express that we don't care about small differences.

For example, if we state that values within $-0.1$ to $0.1$ (this region need not be symmetric) are practically equivalent to zero, we can compute the probability that $\mu$ is within this range. If this probability is high enough then we can say that the mean is practically equivalent to zero.

```{code-cell} ipython3
az.ci_in_rope(idata, var_names="mu", rope=[-0.1, 0.1])
```
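`az.ci_in_rope` does the bookkeeping for us, but the underlying idea can be sketched directly from raw draws: count the fraction that fall inside the ROPE (the draws below are made up, purely for illustration):

```python
import numpy as np

# Hypothetical posterior draws, for illustration only
draws = np.array([-0.2, -0.05, 0.0, 0.08, 0.3, 0.5, 1.2, -0.4])

# fraction of posterior mass inside the ROPE [-0.1, 0.1]
rope_lo, rope_hi = -0.1, 0.1
mass_in_rope = np.mean((draws >= rope_lo) & (draws <= rope_hi))
print(mass_in_rope)  # 0.375, i.e. 3 of the 8 draws
```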
So there is only a 2.2% probability that the mean is practically equivalent to zero. This is sufficiently low that we can reject that hypothesis.

+++

Third time in a row, `arviz` has our back and can plot the ROPE and HDIs.

```{code-cell} ipython3
pc = az.plot_dist(idata, var_names=["mu"])
az.add_bands(pc, [(-0.1, 0.1)]);
```

The intuition we can get from this is that if the ROPE is narrow, we would require quite a high level of precision to accept the null hypothesis. The posterior distribution would have to be very tightly centered around the null value to have a large probability of being within the ROPE.

+++
### HDI+ROPE decision criteria

+++

{cite:t}`kruschke2018rejecting` outlines the HDI+ROPE decision rule, which is summarised in the figure taken from that paper. Namely:

* **Accept the null value**: if the HDI falls entirely within the ROPE. The HDI does not need to include the null value.
* **Reject the null value**: if the HDI falls entirely outside the ROPE.
* **Remain undecided**: if the HDI partially overlaps the ROPE, falling neither entirely inside nor entirely outside it.

In our case, looking back at our posterior + ROPE plot above, we would remain undecided because the HDI falls neither entirely outside nor entirely within the ROPE.

![](hdi_plus_rope_decision_rule.png)
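The three-way rule can also be written down directly. A minimal sketch, with intervals as `(low, high)` tuples (`hdi_rope_decision` is our own helper here, not an ArviZ function):

```python
def hdi_rope_decision(hdi, rope):
    """Kruschke's HDI+ROPE rule for a null value; intervals are (low, high)."""
    hdi_lo, hdi_hi = hdi
    rope_lo, rope_hi = rope
    if rope_lo <= hdi_lo and hdi_hi <= rope_hi:
        return "accept null"  # HDI entirely inside the ROPE
    if hdi_hi < rope_lo or rope_hi < hdi_lo:
        return "reject null"  # HDI entirely outside the ROPE
    return "undecided"  # partial overlap


# Illustrative intervals: an HDI straddling the ROPE boundary
print(hdi_rope_decision((-0.2, 3.0), (-0.1, 0.1)))  # undecided
```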
+++

### Bayes Factors

[Bayes Factors](https://en.wikipedia.org/wiki/Bayes_factor) provide a Bayesian alternative to classical hypothesis tests, allowing you to weigh evidence for one hypothesis relative to another. In the simplest case, testing whether $\mu=0$ versus $\mu \neq 0$, the Bayes Factor (BF) tells you how much more (or less) likely your observed data are under the model where $\mu=0$ than under the model where $\mu$ is free to vary.

Intuitively, the Bayes Factor can be understood by comparing the density of $\mu$ at zero before and after observing the data. Before collecting data, you have a prior belief about $\mu$. This prior density at $\mu=0$ represents how plausible zero was considered initially. After seeing the data, you update these beliefs to get the posterior distribution. The posterior density at $\mu=0$ indicates how plausible zero remains given the evidence. The ratio of these densities, known as the Savage-Dickey ratio, is closely related to the Bayes Factor. If the data make $\mu=0$ more plausible relative to your initial belief, the Bayes Factor will favor $\mu=0$. If the data diminish the credibility of $\mu=0$, the Bayes Factor will favor $\mu\neq0$.

Because the Bayes Factor directly compares how the data update the prior odds of each hypothesis, the choice of prior is crucial. A strong prior concentration at $\mu=0$ could make it harder for data to move the posterior density away from zero, influencing the resulting Bayes Factor. On the other hand, a diffuse prior might make it easier for data to shift your beliefs about $\mu$. Thus, specifying a reasonable and justifiable prior distribution is essential when using Bayes Factors for hypothesis testing.

+++

Yet again, `arviz` has a function to help us here. We can use `plot_bf` to compute the Bayes Factor for the hypothesis that $\mu=0$ versus $\mu\neq0$.

```{code-cell} ipython3
az.plot_bf(idata, var_names="mu", ref_val=0);
```
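Under the hood, the Savage-Dickey version of this computation just compares the prior and posterior density at zero. Here is a rough pure-NumPy sketch using synthetic normal draws as stand-ins for the real prior and posterior samples (the posterior parameters below are invented for illustration, so the resulting number will not match the plot above):

```python
import numpy as np


def kde_density_at(samples, x):
    """Gaussian kernel density estimate at point x (Scott's rule bandwidth)."""
    s = np.asarray(samples)
    h = s.std(ddof=1) * len(s) ** (-1 / 5)  # Scott's rule for 1-D data
    return np.mean(np.exp(-0.5 * ((x - s) / h) ** 2)) / (h * np.sqrt(2 * np.pi))


rng = np.random.default_rng(0)
prior_draws = rng.normal(0.0, 2.0, size=50_000)  # matches our prior on mu
post_draws = rng.normal(1.5, 1.0, size=50_000)  # invented stand-in posterior

# Savage-Dickey: BF01 = posterior density at 0 / prior density at 0
bf01 = kde_density_at(post_draws, 0.0) / kde_density_at(prior_draws, 0.0)
```

With these stand-in draws the density at zero drops after "seeing the data", giving $BF_{01} < 1$, i.e. evidence against $\mu=0$.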
We can see that the density at $\mu=0$ has gone _down_ after observing the data. This is reflected in the value of $BF_{01}=0.54$: being less than 1, it indicates that the data favor $\mu\neq0$.

+++

Readers are referred to {ref}`Bayes_factor` for a more detailed look at Bayes Factors.

+++
## Summary

**Posterior Probability Statements**
- *Idea:* Compute $P(\theta > \delta \mid \text{data})$ directly from the posterior.
- *Pros:* Simple, intuitive, no special tools needed.
- *Cons:* Requires choosing a threshold $\delta$.

**Highest Density Intervals (HDIs)**
- *Idea:* Identify the range of values containing a fixed portion (e.g., 95%) of the posterior mass.
- *Pros:* Provides a clear summary of where the parameter lies; easy to interpret.
- *Cons:* By itself, doesn't encode a decision rule; must still choose what HDI exclusion implies.

**ROPE (Region of Practical Equivalence)**
- *Idea:* Define a small interval around the null (e.g., zero) representing negligible effect size and assess posterior mass within it.
- *Pros:* Focuses on practical rather than just statistical significance; flexible.
- *Cons:* Requires subjective definition of what counts as negligible.

**ROPE + HDI Decision Rule**
- *Idea:* Combine ROPE with HDI to classify results as negligible, meaningful, or inconclusive.
- *Pros:* Offers a three-way decision with practical interpretation; balances interval uncertainty and practical thresholds.
- *Cons:* Still needs careful definition of ROPE bounds and HDI level.

**Bayes Factors**
- *Idea:* Compare evidence for one model/hypothesis against another by the ratio of marginal likelihoods.
- *Pros:* Provides a direct measure of relative evidence; can be viewed as updating prior odds.
- *Cons:* Sensitive to priors; can be computationally challenging; interpreting BF scales can be tricky.

+++
## Authors
* Authored by [Benjamin T. Vincent](https://github.com/drbenvincent) in December 2024.
* Updated by [Osvaldo Martin](https://aloctavodia.github.io/) in December 2025.

+++

## References
:::{bibliography}
:filter: docname in docnames
:::

+++

## Watermark

```{code-cell} ipython3
%load_ext watermark
%watermark -n -u -v -iv -w -p pytensor,xarray
```

:::{include} ../page_footer.md
:::
