Pearson correlation with uncertainty

Author

Šimon Kucharský

Published

May 6, 2025

Code
import os
if "KERAS_BACKEND" not in os.environ:
    # set this to "torch", "tensorflow", or "jax"
    os.environ["KERAS_BACKEND"] = "jax"

import numpy as np
import bayesflow as bf
import matplotlib.pyplot as plt
INFO:bayesflow:Using backend 'jax'

In this example we will estimate Pearson’s correlation coefficient, assuming the measurement reliability of the two variables is known. We will assume that the true variable \(y\) and the measurement error \(\epsilon\) are independent, so the measurement variance is simply the sum of the true and error variances, \(\sigma_x^2 = \sigma^2 + \sigma_\epsilon^2\), and reliability is then \(a=\frac{\sigma^2}{\sigma_x^2}\).

\[\begin{equation} \begin{aligned} \rho & \sim \text{Uniform}(-1, 1) \\ y & \sim \text{MvGaussian}\left((0, 0), \begin{bmatrix} \sigma_{1}^2 & \rho\sigma_{1}\sigma_{2} \\ \rho\sigma_{1}\sigma_{2} & \sigma_{2}^2\end{bmatrix} \right) \\ \epsilon & \sim \text{MvGaussian}\left((0, 0), \begin{bmatrix} \sigma_{\epsilon1}^2 & 0 \\ 0 & \sigma_{\epsilon2}^2\end{bmatrix} \right) \\ x & \leftarrow y + \epsilon \\ r & \leftarrow \text{cor}(x). \end{aligned} \end{equation}\]
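
For intuition: if we standardize the total variance to \(\sigma_x^2 = 1\), a reliability of \(a\) implies \(\sigma = \sqrt{a}\) and \(\sigma_\epsilon = \sqrt{1 - a}\). A minimal sketch with a hypothetical reliability value:

Code
a = 0.8                   # hypothetical reliability
sd_true = np.sqrt(a)      # sigma
sd_err = np.sqrt(1 - a)   # sigma_epsilon
sd_true**2 + sd_err**2    # total variance sigma_x^2, recovers 1.0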

Simulator

We will amortize over a range of sample sizes and reliabilities.

Code
def context():
    # Draw a sample size (upper bound exclusive) and a measurement
    # reliability for each of the two variables
    return dict(
        n = np.random.randint(5, 20),
        reliability = np.random.beta(a=5, b=2, size=2)
    )

def prior():
    # Uniform prior over the correlation coefficient
    return dict(rho = np.random.uniform(low=-1, high=1))

def likelihood(n, reliability, rho):
    # Standardizing the total variance to 1, reliability equals the
    # true-score variance: sd_true = sqrt(a), sd_err = sqrt(1 - a)
    sd_true = np.sqrt(reliability)
    sd_err = np.sqrt(1 - reliability)

    # True scores: bivariate normal with correlation rho
    y = np.random.multivariate_normal(
        mean=[0, 0],
        cov=[
            [np.square(sd_true[0]), rho * np.prod(sd_true)],
            [rho * np.prod(sd_true), np.square(sd_true[1])]],
        size=n)
    # Measurement errors: independent of the true scores and of each other
    e = np.random.normal(loc=0, scale=sd_err, size=(n, 2))
    x = y + e
    # Sample correlation of the noisy measurements
    r = np.corrcoef(x[:, 0], x[:, 1])[0, 1]

    return dict(r=r)

simulator = bf.make_simulator([context, prior, likelihood])
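
Before training, it is worth drawing a small batch from the simulator and inspecting the output; a quick sketch (the exact shapes depend on how make_simulator batches the outputs):

Code
sample = simulator.sample(2)
{key: value.shape for key, value in sample.items()}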

Approximator

We will approximate the posterior of \(\rho\) given the sample size \(n\), the measurement reliabilities, and the sample correlation coefficient \(r\).

Code
adapter = (
    bf.Adapter()
    .constrain("rho", lower=-1, upper=1)
    .rename("rho", "inference_variables")
    .concatenate(["n", "reliability", "r"], into="inference_conditions")
)
Code
workflow = bf.BasicWorkflow(
    simulator=simulator,
    adapter=adapter,
    inference_network=bf.networks.CouplingFlow()
)

Training

Code
history = workflow.fit_online(epochs=20, batch_size=256)
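
To check convergence, we can plot the loss trajectory; a minimal sketch, assuming history behaves like a Keras History object with a "loss" entry:

Code
plt.plot(history.history["loss"])
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.tight_layout()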

Validation

Code
test_data = simulator.sample(1000)
figs = workflow.plot_default_diagnostics(test_data=test_data, num_samples=500)
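
If your BayesFlow version provides it, the workflow can also summarize the same diagnostics numerically rather than graphically:

Code
metrics = workflow.compute_default_diagnostics(test_data=test_data)
metrics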

Inference

We will now conduct inference on the correlation coefficient for the data reported in Lee & Wagenmakers (2013, p. 64): response times and IQ scores of 11 participants.

Code
rt = [0.8, 1.0, 0.5, 0.9, 0.7, 0.4, 1.2, 1.4, 0.6, 1.1, 1.3]
iq = [102, 98, 100, 105, 103, 110, 99, 87, 113, 89, 93]

r = np.corrcoef(rt, iq)[0, 1]

The original example by Lee & Wagenmakers sets the measurement error to 0.03 and 1 for RTs and IQ, respectively. Our model is parametrized in terms of measurement reliability instead, so we approximate the reliabilities using the formula \(1 - \sigma_\epsilon^2 / \sigma_x^2\), where \(\sigma_\epsilon\) takes the values \(0.03\) and \(1\) set by Lee & Wagenmakers (2013). In the case of RTs, we approximate \(\sigma_x\) by its sample counterpart. In the case of IQ, we set \(\sigma_x=15\) (since that’s how IQ tests are standardized).

Code
error_rt = 0.03
error_iq = 1
reliability = [
    1 - np.square(error_rt) / np.var(rt),
    1 - np.square(error_iq) / np.square(15)
]
reliability
[0.991, 0.9955555555555555]
Code
inference_data = dict(
    r = np.array([[r]]),                             # shape (1, 1): one dataset
    n = np.array([[len(rt)]]),                       # shape (1, 1)
    reliability = np.array(reliability)[np.newaxis]  # shape (1, 2)
)
Code
samples = workflow.sample(num_samples=2000, conditions=inference_data)
Code
plt.hist(samples["rho"].flatten(), density=True, color="lightgray", edgecolor="black", bins=np.arange(-1, 1.05, 0.05))
plt.xlabel(r"$\rho$")
plt.ylabel("Density")
plt.tight_layout()
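
Besides the histogram, we can summarize the posterior numerically with plain NumPy:

Code
rho_samples = samples["rho"].flatten()
ci = np.quantile(rho_samples, [0.025, 0.975])
print(f"Posterior mean: {rho_samples.mean():.3f}")
print(f"95% credible interval: [{ci[0]:.3f}, {ci[1]:.3f}]")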

References

Lee, M. D., & Wagenmakers, E.-J. (2013). Bayesian Cognitive Modeling: A Practical Course. Cambridge University Press.
