Coordinates in PyMC & InferenceData Objects
PyMC v4 is here and one of the big changes is that the inference routines (e.g., pm.sample()) will return an ArviZ InferenceData object by default (recent releases of PyMC3 have made this optional). This post is intended to explore a bit about how the InferenceData object works, how it intersects with PyMC parameter arrays, and how the labeling of dimensions and coordinates can make life easier. If you haven't already checked out Oriol Abril's fantastic post about PyMC, coords, and dims, I would strongly encourage you to do so.
Oriol's post focuses on the use of labeled dimensions in PyMC models and how ArviZ acknowledges and utilizes the labels when processing the prior/posterior (e.g., plotting). Here, I want to focus on how named dimensions/coordinates can be used to create parameter arrays, contrasting this with the way things were done in PyMC3. I will also illustrate how these coordinates structure the InferenceData objects in which our posterior is stored.
import arviz as az
import numpy as np
import pandas as pd
import pymc as pm
import xarray as xr
rng = np.random.default_rng(101010)
xarray ¶
The new(ish) ArviZ InferenceData object is basically an xarray Dataset object. xarray describes itself as a generalization of both NumPy arrays and pandas DataFrames. Its design also borrows heavily from the netCDF file specification. But let's see what xarray objects actually look like. Here is a standard numpy array.
npdata = rng.standard_normal(size=(2, 3))
npdata
array([[-0.91412107, -0.32402107,  1.29165606],
       [ 0.52327975,  0.95708885,  0.52139111]])
Here we will use this numpy array to create an xarray DataArray.
data = xr.DataArray(npdata,
dims=("user", "day"),
coords={"user": ['Alice', 'Bob'],
"day": ["yesterday", "today", "tomorrow"]
}
)
data
<xarray.DataArray (user: 2, day: 3)>
array([[-0.91412107, -0.32402107,  1.29165606],
       [ 0.52327975,  0.95708885,  0.52139111]])
Coordinates:
  * user     (user) <U5 'Alice' 'Bob'
  * day      (day) <U9 'yesterday' 'today' 'tomorrow'
As you can see, the data hasn't changed at all, but it now comes with all sorts of extra information. Let's get a bit of terminology out of the way. Our xarray object is a DataArray. The array has 2 dimensions: one labeled "user" and the other labeled "day".
data.dims
('user', 'day')
Each dimension has several coordinates. The "user" dimension includes coordinates "Alice" and "Bob". The "day" dimension includes coordinates "yesterday", "today", and "tomorrow".
data.coords
Coordinates:
  * user     (user) <U5 'Alice' 'Bob'
  * day      (day) <U9 'yesterday' 'today' 'tomorrow'
These coordinates can be used instead of standard numerical indices. Here is how we select the value corresponding to "Alice" and "yesterday" from our numpy array.
npdata[0,0]
-0.9141210682943073
And here is how we select that same value from our xarray DataArray.
data.sel(user="Alice", day="yesterday")
<xarray.DataArray ()>
array(-0.91412107)
Coordinates:
    user     <U5 'Alice'
    day      <U9 'yesterday'
So one benefit of xarray objects is that they have nice, semantically labeled coordinates. You no longer have to remember that npdata[:,0] corresponds to "yesterday" and that npdata[1,:] corresponds to "Bob". But it's even better than that. You don't even need to remember which axis is which. The order in which the dimensions are stored is irrelevant because you are just using the dimensions' labels (though you can still use numerical indexing if you wish).
data.sel(day="yesterday", user="Alice")
<xarray.DataArray ()>
array(-0.91412107)
Coordinates:
    user     <U5 'Alice'
    day      <U9 'yesterday'
data.sel(user="Alice", day="yesterday")
<xarray.DataArray ()>
array(-0.91412107)
Coordinates:
    user     <U5 'Alice'
    day      <U9 'yesterday'
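Label-based selection isn't limited to scalars either: selecting along just one dimension returns the whole slice across the remaining dimensions, and positional indexing is still available through .isel(). Here is a small sketch using a stand-in array with made-up values:

```python
import numpy as np
import xarray as xr

# Stand-in for the array above, with made-up values
data = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("user", "day"),
    coords={"user": ["Alice", "Bob"],
            "day": ["yesterday", "today", "tomorrow"]},
)

# Selecting only on "day" keeps the full "user" dimension
yesterday = data.sel(day="yesterday")
print(yesterday.dims)  # ('user',)

# .isel() provides the old positional behavior when you want it
assert float(data.isel(user=0, day=0)) == float(data.sel(user="Alice", day="yesterday"))
```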
Finally, xarray also permits the creation of objects that include multiple DataArray objects. xarray calls such an object a Dataset. This is essentially what we will be dealing with when we get an ArviZ InferenceData object back after sampling.
var2 = xr.DataArray(rng.standard_normal(size=(2, 2)),
dims=("x", "y"),
coords={"x": [0,1],
"y": [11,42]}
)
var3 = xr.DataArray(rng.standard_normal(size=(2, 2)),
dims=("a", "b"),
coords={"a": [4.2, 11.8],
"b": ['Geneva','London']}
)
ds = xr.Dataset(dict(orig=data,
v2=var2,
v3=var3)
)
ds
<xarray.Dataset>
Dimensions:  (user: 2, day: 3, x: 2, y: 2, a: 2, b: 2)
Coordinates:
  * user     (user) <U5 'Alice' 'Bob'
  * day      (day) <U9 'yesterday' 'today' 'tomorrow'
  * x        (x) int64 0 1
  * y        (y) int64 11 42
  * a        (a) float64 4.2 11.8
  * b        (b) <U6 'Geneva' 'London'
Data variables:
    orig     (user, day) float64 -0.9141 -0.324 1.292 0.5233 0.9571 0.5214
    v2       (x, y) float64 -0.2851 -1.626 -0.08508 0.9506
    v3       (a, b) float64 -0.4798 1.463 0.07777 0.09208
Note that we have 6 dimensions (user, day, x, y, a, and b), but that each variable in our Dataset only uses 2 of them.
To pull each individual variable out as a DataArray, we can use dictionary-style indexing:
ds['orig']
<xarray.DataArray 'orig' (user: 2, day: 3)>
array([[-0.91412107, -0.32402107,  1.29165606],
       [ 0.52327975,  0.95708885,  0.52139111]])
Coordinates:
  * user     (user) <U5 'Alice' 'Bob'
  * day      (day) <U9 'yesterday' 'today' 'tomorrow'
xarray provides loads of indexing options, dimension/coordinate-aware computation, split-apply-combine operations, and much more. Go read up on xarray to discover all of its functionality.
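As a taste of that dimension-aware computation, reductions and arithmetic can refer to dimensions by name rather than by numeric axis. A minimal sketch, again with made-up values:

```python
import numpy as np
import xarray as xr

data = xr.DataArray(
    np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]]),
    dims=("user", "day"),
    coords={"user": ["Alice", "Bob"],
            "day": ["yesterday", "today", "tomorrow"]},
)

# Collapse the "day" dimension by name: one mean per user
per_user = data.mean(dim="day")
print(per_user.sel(user="Alice").item())  # 2.0

# Arithmetic broadcasts by dimension name, so subtracting the
# per-user means centers each user's values with no axis juggling
centered = data - per_user
print(centered.sel(user="Alice").values)  # [-1.  0.  1.]
```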
Example Data ¶
In the examples below, we'll be working with some synthetic data involving two groups: one group receiving an experimental medication and one group receiving a placebo. For each patient, we have their age, sex, systolic blood pressure, and a flag indicating which treatment they have received. Each patient also has a unique ID. Let's generate it.
group_size = 100
systolic_f = rng.normal(loc=123, scale=5, size=group_size)
systolic_m = rng.normal(loc=127, scale=5, size=group_size)
systolic = np.hstack((systolic_f, systolic_m))
sex = (['female'] * group_size) + (['male'] * group_size)
age = rng.normal(loc=50, scale=7, size=len(systolic))
treat_group = rng.integers(low=1, high=3, size=len(systolic))
patient_id = rng.choice(1000, size=len(systolic), replace=False)
systolic = systolic + (.05 * age)
df = pd.DataFrame({'bp':systolic,
'sex':sex,
'age':age,
'treat_group':treat_group})
df.index = patient_id
df.index.name = "patientID"
df.head()
| patientID | bp | sex | age | treat_group |
|---|---|---|---|---|
| 979 | 128.750951 | female | 53.901474 | 2 |
| 924 | 123.935881 | female | 50.596643 | 1 |
| 522 | 121.683066 | female | 57.149881 | 1 |
| 912 | 135.144287 | female | 50.263664 | 2 |
| 787 | 121.420322 | female | 47.802665 | 1 |
Old Method ¶
Let's construct a model in which we estimate the mean of each group. But let's do it the way we might have in PyMC3, using an unlabeled parameter array. Because we need to index into the parameter array numerically, the first thing we have to do is use pandas' factorize to extract category codes (e.g., 0, 1, 2, ...) that numerically indicate group membership.
sex_idx,sex_codes = pd.factorize(df["sex"])
n_sex = len(sex_codes)
with pm.Model() as unlabeled_model:
    # hyper prior
    mu_hyper = pm.Normal('mu_hyper', mu=120, sigma=15)

    # per-group prior
    mu = pm.Normal('mu', mu=mu_hyper, sigma=15, shape=n_sex)

    # likelihood
    likelihood = pm.Normal('likelihood',
                           mu=mu[sex_idx],
                           sigma=15,
                           observed=df['bp'])
with unlabeled_model:
    unlabeled_idata = pm.sample()
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [mu_hyper, mu]
Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 5 seconds.
unlabeled_idata.posterior
<xarray.Dataset>
Dimensions:   (chain: 2, draw: 1000, mu_dim_0: 2)
Coordinates:
  * chain     (chain) int64 0 1
  * draw      (draw) int64 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999
  * mu_dim_0  (mu_dim_0) int64 0 1
Data variables:
    mu_hyper  (chain, draw) float64 123.8 128.3 126.2 ... 128.1 132.3 116.9
    mu        (chain, draw, mu_dim_0) float64 124.1 128.6 125.5 ... 123.5 130.5
Attributes:
    created_at:                 2022-06-13T20:09:04.820492
    arviz_version:              0.12.1
    inference_library:          pymc
    inference_library_version:  4.0.0b6
    sampling_time:              4.526270627975464
    tuning_steps:               1000
az.summary(unlabeled_idata, kind="stats", round_to=2)
|  | mean | sd | hdi_3% | hdi_97% |
|---|---|---|---|---|
| mu_hyper | 124.93 | 9.19 | 107.58 | 141.55 |
| mu[0] | 124.95 | 1.54 | 122.00 | 127.76 |
| mu[1] | 129.89 | 1.46 | 127.34 | 132.91 |
We can also investigate the posterior directly. Let's calculate the means of each group.
# average over chains and draws
unlabeled_idata.posterior['mu'].mean(dim=['chain', 'draw'])
<xarray.DataArray 'mu' (mu_dim_0: 2)>
array([124.94933509, 129.89383768])
Coordinates:
  * mu_dim_0  (mu_dim_0) int64 0 1
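Because the posterior is an ordinary xarray Dataset, generic xarray operations apply to it as well. For example, the chain and draw dimensions can be flattened into a single sample dimension with stack(). Here is a sketch using a toy Dataset standing in for unlabeled_idata.posterior:

```python
import numpy as np
import xarray as xr

# Toy stand-in for the posterior: 2 chains, 4 draws, 2 group means
posterior = xr.Dataset(
    {"mu": (("chain", "draw", "mu_dim_0"),
            np.arange(16.0).reshape(2, 4, 2))},
    coords={"chain": [0, 1], "draw": np.arange(4), "mu_dim_0": [0, 1]},
)

# Flatten chain and draw into one "sample" dimension
flat = posterior.stack(sample=("chain", "draw"))
print(dict(flat["mu"].sizes))  # {'mu_dim_0': 2, 'sample': 8}
```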
More practically, we can calculate the probability that the females' mean blood pressure is lower than the males'.
( unlabeled_idata.posterior['mu'].sel(mu_dim_0=0) <
unlabeled_idata.posterior['mu'].sel(mu_dim_0=1) ).mean()
<xarray.DataArray 'mu' ()>
array(0.991)
Ok. We have our posterior and everything seems to be working. But dealing with the posterior is cumbersome. In particular, we have this annoying mu_dim_0. What is that? In our model, mu is a parameter array of shape (2,), so it has one dimension. Because we have not labeled this dimension explicitly, it is automatically labeled for us in the InferenceData object: mu_dim_0. The two coordinates on this dimension are mu_dim_0=0 and mu_dim_0=1. Which of these two refers to the female mean and which refers to the male mean? Good question! It's not immediately clear. We would have to go back and see how we handled our data. Looking back, we can check the categorical codes that pandas generated from our df['sex'] variable.
sex_codes
Index(['female', 'male'], dtype='object')
# first patient's index into the sex_codes map
sex_idx[0]
0
So pandas has assigned the first patient an index of 0, and sex_codes[0]=='female', suggesting that the first patient is female. We can verify this by checking the first patient's entry in the 'sex' column of our original data.
# first patient's value in the 'sex' column
df['sex'].iloc[0]
'female'
So mu_dim_0=0 corresponds to the females. Thus, mu_dim_0=1 must correspond to the males. Got all that? No? It's not particularly easy. Let's see how labeled dimensions and coordinates can help.
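If you do stick with the old method, one way to make that bookkeeping explicit is to build the code-to-label mapping up front, right after calling factorize. A small sketch using a toy column standing in for df['sex']:

```python
import pandas as pd

# Toy stand-in for df['sex']
sex = pd.Series(["female"] * 3 + ["male"] * 2)
sex_idx, sex_codes = pd.factorize(sex)

# Explicit mapping from numeric code to label
code_to_label = dict(enumerate(sex_codes))
print(code_to_label)  # {0: 'female', 1: 'male'}

# Indexing with the codes recovers the original labels
recovered = [code_to_label[i] for i in sex_idx]
assert recovered == list(sex)
```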
New Method ¶
Let's begin by defining sex as a dimension and labeling it as such. We will simultaneously label the coordinates, or values, on this dimension.
coords = {'sex':['female','male']}
Easy enough. What do we do with this? We pass it along to our PyMC model, which will allow us to use it when defining parameter arrays (among other things). Let's see.
with pm.Model(coords=coords) as labeled_model:
    # hyper prior
    mu_hyper = pm.Normal('mu_hyper', mu=120, sigma=15)

    # per-group prior
    mu = pm.Normal('mu', mu=mu_hyper, sigma=15, dims='sex')

    # likelihood
    likelihood = pm.Normal('likelihood',
                           mu=mu[sex_idx],
                           sigma=15,
                           observed=df['bp'])
Notice that we are again asking for an array of parameters, mu. But instead of asking for shape=n_sex, we now ask for dims='sex'. This tells PyMC that we want an array of normally-distributed parameters, one for each coordinate on our 'sex' dimension.
Just like in the previous example, we will index into this array using the sex_idx that pandas generated for us earlier. But how do the values in sex_idx correspond to the coordinates?
# check the sex coordinate for a few patients (by row position)
patient_ids = [0, 75, 150]
for idx in sex_idx[patient_ids]:
    print(coords['sex'][idx])
female
female
male
So we have verified that the indices in sex_idx map onto our coordinates in the desired manner. Let's sample and see how our labeled dimensions and coordinates help us to unpack our posterior.
with labeled_model:
    labeled_idata = pm.sample()
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [mu_hyper, mu]
Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 4 seconds.
az.summary(labeled_idata, kind="stats", round_to=2)
|  | mean | sd | hdi_3% | hdi_97% |
|---|---|---|---|---|
| mu_hyper | 124.82 | 8.72 | 108.21 | 140.74 |
| mu[female] | 125.02 | 1.48 | 122.23 | 127.67 |
| mu[male] | 129.88 | 1.51 | 127.25 | 132.76 |
We can already see that the entries in our parameter array mu are now indexed by the meaningful labels 'female' and 'male', so we don't have to wonder which summary corresponds to which group. These labels are also useful when interrogating the InferenceData object itself.
# average over chains and draws
labeled_idata.posterior['mu'].mean(dim=['chain', 'draw'])
<xarray.DataArray 'mu' (sex: 2)>
array([125.01960648, 129.87574546])
Coordinates:
  * sex      (sex) <U6 'female' 'male'
Alternatively, we can pull out the two parameters manually, referring to sex and each of the two coordinates ('female' and 'male') instead of the unlabeled dimension mu_dim_0:
labeled_idata.posterior['mu'].sel(sex='female').mean()
<xarray.DataArray 'mu' ()>
array(125.01960648)
Coordinates:
    sex      <U6 'female'
labeled_idata.posterior['mu'].sel(sex='male').mean()
<xarray.DataArray 'mu' ()>
array(129.87574546)
Coordinates:
    sex      <U6 'male'
And here we can compare the two posteriors:
( labeled_idata.posterior['mu'].sel(sex='female') <
labeled_idata.posterior['mu'].sel(sex='male') ).mean()
<xarray.DataArray 'mu' ()>
array(0.9885)
So nearly 99% of the samples are consistent with the conclusion that the females' blood pressures were lower than the males'.
pm.Data ¶
Dimensions/coordinates can also be used when defining pm.Data objects, the standard method of making PyMC models aware of the data you are using (e.g., the data passed to any observed variables). First, let's look at how we might have incorporated data into a model in PyMC3.
with pm.Model() as model:
    # define our one parameter
    coef = pm.Normal('coef', mu=0, sigma=15)

    # our data is used here...
    mean = coef * df['age'].to_numpy()

    # ...and also here, as our observed data
    # likelihood
    likelihood = pm.Normal('likelihood',
                           mu=mean,
                           sigma=15,
                           observed=df['bp'].to_numpy())

    idata = pm.sample()
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [coef]
Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 4 seconds.
Let's see what's in the resulting InferenceData object.
idata.groups
<bound method InferenceData.groups of Inference data with groups:
	> posterior
	> log_likelihood
	> sample_stats
	> observed_data>
Now let's see how we can use pm.Data and incorporate a data-relevant dimension.
coords = {'patient_id':df.index}
with pm.Model(coords=coords) as model:
    # here we define pm.Data objects so that PyMC is aware
    # of our model-relevant data
    age = pm.MutableData('age', df['age'], dims='patient_id')
    bloodpressure = pm.MutableData('bloodpressure', df['bp'], dims='patient_id')

    # define our one parameter
    coef = pm.Normal('coef', mu=0, sigma=15)

    # our data is still used here...
    mean = coef * age

    # ...and here, as our observed data
    # likelihood
    likelihood = pm.Normal('likelihood',
                           mu=mean,
                           sigma=15,
                           observed=bloodpressure)

    idata = pm.sample()
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [coef]
Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 3 seconds.
So what's the big deal? Nothing seems to have changed. But when we look inside the InferenceData object, we can now see that the data we used to define the model is carried around with the posterior (and everything else).
idata.groups
<bound method InferenceData.groups of Inference data with groups:
	> posterior
	> log_likelihood
	> sample_stats
	> observed_data
	> constant_data>
We can see that a new group, constant_data, is included in our InferenceData object. This group contains the data we used when defining the model. Let's take a look.
idata.constant_data
<xarray.Dataset>
Dimensions:        (patient_id: 200)
Coordinates:
  * patient_id     (patient_id) int64 979 924 522 912 787 ... 783 523 896 968 98
Data variables:
    age            (patient_id) float64 53.9 50.6 57.15 ... 52.35 35.73 50.27
    bloodpressure  (patient_id) float64 128.8 123.9 121.7 ... 128.4 138.2 129.2
Attributes:
    created_at:                 2022-06-13T20:09:32.929479
    arviz_version:              0.12.1
    inference_library:          pymc
    inference_library_version:  4.0.0b6
The fact that this data was associated with the patient_id dimension when we created it means that we can now use this dimension to index into the data.
idata.constant_data.sel(patient_id=979)
<xarray.Dataset>
Dimensions:        ()
Coordinates:
    patient_id     int64 979
Data variables:
    age            float64 53.9
    bloodpressure  float64 128.8
Attributes:
    created_at:                 2022-06-13T20:09:32.929479
    arviz_version:              0.12.1
    inference_library:          pymc
    inference_library_version:  4.0.0b6
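Selection with a list of coordinate values works as well, returning a smaller Dataset rather than scalars. A sketch using a hand-built stand-in for idata.constant_data (with the first few patient IDs hard-coded):

```python
import numpy as np
import xarray as xr

# Hand-built stand-in for idata.constant_data
constant_data = xr.Dataset(
    {
        "age": ("patient_id", np.array([53.9, 50.6, 57.1])),
        "bloodpressure": ("patient_id", np.array([128.8, 123.9, 121.7])),
    },
    coords={"patient_id": [979, 924, 522]},
)

# A list of labels yields a sub-Dataset along patient_id
subset = constant_data.sel(patient_id=[979, 522])
print(subset.sizes["patient_id"])  # 2
```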
Wrap-Up ¶
Hopefully this has given you a quick look into the advantages of using coordinates and dimensions and illustrated some basic, but illustrative scenarios. Head over to the PyMC Discourse if you have questions or need additional help!
%load_ext watermark
%watermark -p pymc,arviz,numpy,pandas
pymc  : 4.0.0b6
arviz : 0.12.1
numpy : 1.22.4
pandas: 1.4.2