A Causal Study of Simulated Patient Recovery Data

I'm currently reading "Do No Harm", by Henry Marsh, a fascinating autobiography of a neurosurgeon. Since I am also reading about causality in statistics, I naturally noticed points in the book where confounding might confuse beliefs about a doctor's skill. Here I consider two approaches used in causal analysis to overcome confounding.

Imagine you're in charge of a bunch of hospitals and are looking at the survival statistics of all the patients with a certain illness (e.g., brain tumours) to help you decide which doctors to promote. (This is probably not how doctors are promoted, but anyway). You might be tempted to pick the doctors with the best survival rates. This is wrong because it ignores the assignment mechanism of patients to doctors. Let's see how wrong you could be with a specific example.

doctor id | num patients | survival rate
0 | 117 | 0.205
1 | 85 | 0.812
2 | 93 | 0.129
3 | 88 | 0.216
4 | 98 | 0.153
5 | 112 | 0.830
6 | 89 | 0.809
7 | 109 | 0.138
8 | 102 | 0.824
9 | 107 | 0.841

Let's make a set of statistical modelling assumptions about how this data was generated.

Assumptions:
D doctors
N patients
\(Z_{n,d}\) is whether patient n is assigned to doctor d
\(Y_n(d)\) is binary recovery if patient n is treated by doctor d
\(\beta_d\) is skill of doctor d (latent, binary: low or high skill)
\(\theta_n\) is the severity of patient n's illness (observed, binary: low or high severity)
Generative procedure:
for patient \(n = 1,\dots,N\):
     draw an indicator for whether the patient is assigned a highly skilled or less highly skilled doctor, \(s_n \sim \mathrm{Bernoulli}( \theta_n p_\mathrm{high} + (1-\theta_n) p_\mathrm{low} )\)
     if \(s_n = 1\): assign patient n uniformly at random to one of the highly skilled doctors (i.e., set \(Z_{n,d} = 1\) for the chosen doctor d)
     otherwise: assign patient n uniformly at random to one of the less skilled doctors
     calculate probability of recovery as follows: \(p_\mathrm{recover} = \beta_d \theta_n p_\mathrm{low} + (1-\beta_d) \theta_n p_\mathrm{very low} + (1-\theta_n) p_\mathrm{high}\)
     draw recovery \(Y_n(d) \sim \mathrm{Bernoulli}( p_\mathrm{recover} )\)

What this generative process says is that we try to assign highly skilled doctors to severely ill patients and let the rest of the doctors take care of the remaining patients, but the assignment is noisy: there is a non-zero probability that an average doctor will care for a severely ill patient, or that a highly skilled doctor will care for a mildly ill patient.

Why this recovery probability? If the illness is severe, then the best that a skilled doctor can do is ensure a low probability of recovery, while a less skilled doctor will do even worse than that. If the illness is not severe, then it doesn't matter whether the patient gets a very skilled doctor or not, they are likely to recover.
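To make the procedure concrete, here is a minimal sketch of the generative model in Python. The parameter values are illustrative placeholders (not the ones used to produce the table above), and the variable names are my own rather than the notebook's.
```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters; placeholders, not the values used for the table above.
D, N = 10, 1000
p_high, p_low, p_very_low = 0.9, 0.15, 0.05
beta = np.array([1, 0, 1, 1, 1, 0, 0, 1, 0, 0])  # doctor skill (1 = highly skilled), as in the tables below
theta = rng.integers(0, 2, size=N)               # observed patient severity (1 = severe)

skilled, unskilled = np.flatnonzero(beta == 1), np.flatnonzero(beta == 0)

Z = np.zeros(N, dtype=int)  # assigned doctor for each patient
Y = np.zeros(N, dtype=int)  # observed recovery
for n in range(N):
    # Severe patients are preferentially (but noisily) routed to skilled doctors.
    s = rng.random() < (theta[n] * p_high + (1 - theta[n]) * p_low)
    d = rng.choice(skilled if s else unskilled)
    Z[n] = d
    # Severe + skilled -> low recovery probability, severe + unskilled -> very low,
    # non-severe -> high recovery probability regardless of the doctor.
    p_recover = (beta[d] * theta[n] * p_low
                 + (1 - beta[d]) * theta[n] * p_very_low
                 + (1 - theta[n]) * p_high)
    Y[n] = rng.random() < p_recover
```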

In reality, the patient covariates \(\theta_n\) would not be a single binary number; they would be features of the illness (e.g., tumour size and location, age of the patient), and doctors might have skills along multiple dimensions. Binary variables suit our thought experiment for the moment, but we'll revisit the dimensionality of covariates later.

Using this generative model and a given set of parameters \((p_\mathrm{high}, p_\mathrm{low}, p_\mathrm{very\; low}, N, D, \theta, \beta)\), I generated 10 doctors and 1000 patients (the data you see in the table above, actually). Now let's pretend we don't know \(\beta\), the doctor skills, and try to estimate them from the data alone.
Would you want to see doctor 4? Naively, no, but who knows, they might be the top specialist in the country for your illness. I consider how to answer that question next.

Causality Assumptions

We need to make a set of assumptions to allow us to perform a causal analysis. I'm writing them out explicitly for good statistical hygiene but won't discuss them at length here.
First, we need the stable unit treatment value assumption (SUTVA), which says that there are no interactions between units (patients), in the sense that the outcome of treatment for patient n does not depend on the assignment or recovery of any of the other patients. This might conceivably be broken if doctors become tied up with some patients, limiting their ability to see new ones.
Second, we need unconfoundedness, which says that assignment is conditionally independent of the outcome (i.e., recovery) given the patient covariates (i.e., the judged severity of illness). This might be broken if there are unknown patient covariates (e.g., where they live) that affect assignment. The bad news, from Imbens & Rubin (2015), is that we can never tell from the data whether this is the case.
Third, we need individualistic and probabilistic assignments (see Imbens & Rubin 2015).

Recovery Rate Stratification

Here is what the recovery rates look like for each doctor when we stratify by patient severity:
doctor id | num severe patients | recovery rate (non-severe) | recovery rate (severe)
0 | 103 | 0.857 | 0.117
1 | 7 | 0.885 | 0.000
2 | 87 | 0.833 | 0.080
3 | 78 | 0.900 | 0.128
4 | 92 | 1.000 | 0.098
5 | 6 | 0.877 | 0.000
6 | 11 | 0.923 | 0.000
7 | 97 | 0.917 | 0.041
8 | 11 | 0.923 | 0.000
9 | 8 | 0.909 | 0.000
The first thing to notice is that the number of severely ill patient assignments per doctor reveals the assignment mechanism quite clearly. But this only reveals information about doctor skill if we believe in benevolent assignments. One could easily design a perverse universe in which severely ill people get sent to the person least qualified to treat them. It is the stratified survival rate columns that tell you everything you need to know if you ever find yourself a patient in the simulated universe. If you're severely ill, see doctors 0, 2, 3, 4, or 7. Otherwise, see the first available doctor.
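For reference, a stratified table like the one above can be computed directly from the simulated data. This assumes the output of the earlier sketch is collected in a pandas DataFrame; the column names here are mine, not necessarily the notebook's.
```python
import pandas as pd

# One row per patient, using the arrays from the generative sketch above.
df = pd.DataFrame({"doctor": Z, "severe": theta, "recovered": Y})

# Recovery rate per doctor, stratified by patient severity.
stratified = (df.pivot_table(index="doctor", columns="severe",
                             values="recovered", aggfunc="mean")
                .rename(columns={0: "recovery (non-severe)", 1: "recovery (severe)"}))
print(stratified.round(3))
```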
Now let's try matching: split the patients into low- and high-severity groups, then randomly match patients of the same severity (one treated by doctor d, another not treated by doctor d) and calculate the average difference in recovery (a sketch of this matching estimator follows the table below). This gives us the following average treatment effects:
doctor id | avg treatment effect | doctor skill | num treated | avg recovery rate
0 | 0.000 | 1 | 117 | 0.205
1 | -0.024 | 0 | 85 | 0.812
2 | 0.000 | 1 | 93 | 0.129
3 | 0.023 | 1 | 88 | 0.216
4 | -0.020 | 1 | 98 | 0.153
5 | -0.080 | 0 | 112 | 0.830
6 | 0.034 | 0 | 89 | 0.809
7 | -0.064 | 1 | 109 | 0.138
8 | 0.010 | 0 | 102 | 0.824
9 | 0.037 | 0 | 107 | 0.841
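Here is a rough sketch of the matching estimator described above, reusing the DataFrame from the previous snippet (again, names and details are mine, not necessarily the notebook's):
```python
import numpy as np

def matched_ate(df, d, rng):
    """Matching estimate of doctor d's average treatment effect: for each
    patient treated by d, draw a random patient of the same severity treated
    by another doctor, and average the difference in recovery."""
    diffs = []
    for _, row in df[df.doctor == d].iterrows():
        pool = df[(df.doctor != d) & (df.severe == row.severe)]
        control = pool.iloc[rng.integers(len(pool))]
        diffs.append(row.recovered - control.recovered)
    return float(np.mean(diffs))

rng = np.random.default_rng(1)
ate_matched = {d: matched_ate(df, d, rng) for d in range(D)}
```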
Average treatment effect doesn't really help us identify skilled doctors in this example. It successfully identifies doctor 3 as being skilled, even though he has a low survival rate, and identifies doctor 1 as being unskilled, even though he has a high survival rate, but the rest are muddled. The problem arises because severely ill patients are rare and there are not enough patients overall (I reran the simulation with 10k patients and found I could reliably identify skilled and unskilled doctors by looking at the average treatment effect). Skilled doctors raise the recovery rate of the average patient because they greatly help severely ill patients and perform like average doctors with less ill patients. Less skilled doctors lower the average recovery rate because they do a bad job with severely ill patients.
In general, looking at stratified recovery rates was successful because the unit covariate is low dimensional and is itself a good balancing score: it satisfies \(Y_n(d)\) conditionally independent of \(Z_{n,d}\) given \(\theta_n\) for all n and d. This didn't have to be the case. In reality, we would expect many patient covariates, some of them irrelevant, others affecting assignment but not outcomes, and others affecting both. Stratification is hopeless in that case because the number of strata overwhelms the number of units. For the purposes of continuing, let's assume that patient covariates live in this more complex, higher-dimensional setting, motivating propensity score reweighting (or matching, though we have already done matching based on a balancing score, so we will focus on reweighting next).

Propensity Score Reweighting

Another strategy for causal analysis is to account for the assignment mechanism using propensity scores. A propensity score is defined as the coarsest function e that satisfies: \(Y_n(\cdot)\) conditionally independent of \(Z_{n\cdot}\) given \(e(\theta_n)\). Coarseness refers to the size of the image of the function, and coarse scores are useful when the covariates are high dimensional.
In this example, we'd like \(e(\theta_n)\) to be the probability of being assigned a highly skilled doctor. But we don't know who is highly skilled in the first place (that's the unknown \(\beta\)). So we have to make do with \(e(\theta_n)\) being a vector of length D giving the probability of being assigned each doctor. Here is the propensity score as a function of \(\theta_n\) (split into two tables to fit on the page):
unit covariate | doc_0 | doc_1 | doc_2 | doc_3 | doc_4
non-severe | 0.028 | 0.156 | 0.012 | 0.020 | 0.012
severe | 0.206 | 0.014 | 0.174 | 0.156 | 0.184
unit covariate | doc_5 | doc_6 | doc_7 | doc_8 | doc_9
non-severe | 0.212 | 0.156 | 0.024 | 0.182 | 0.198
severe | 0.012 | 0.022 | 0.194 | 0.022 | 0.016
The propensity score reweighting estimate of the treatment effect for doctor d is \( \frac{1}{N} \sum_{n} \left( \frac{Z_{n,d}\, Y_n^{\mathrm{obs}}}{e_d(\theta_n)} - \frac{(1 - Z_{n,d})\, Y_n^{\mathrm{obs}}}{1 - e_d(\theta_n)} \right) \), where N is the number of patients, \(Y_n^{\mathrm{obs}}\) is the observed recovery for patient n, and \(e_d(\theta_n)\) is the d-th component of the propensity score (the probability of being assigned doctor d given severity \(\theta_n\)).
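A minimal sketch of this estimator, using empirical propensity scores computed from the simulated data (same hypothetical DataFrame as in the earlier snippets):
```python
# Empirical propensity scores: P(assigned doctor d | severity), one column per doctor.
e = (df.groupby("severe")["doctor"]
       .value_counts(normalize=True)
       .unstack(fill_value=0.0))

def ipw_ate(df, d):
    """Inverse-propensity-weighted treatment effect estimate for doctor d."""
    e_d = df.severe.map(e[d])                  # propensity for doctor d, per patient
    treated = (df.doctor == d).astype(float)
    y = df.recovered
    return float(np.mean(treated * y / e_d - (1 - treated) * y / (1 - e_d)))

ate_ipw = {d: ipw_ate(df, d) for d in range(D)}
```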
Here is the average treatment effect after propensity score reweighting:
doctor id | doctor skill | treated recovery | untreated recovery | avg treatment effect
0 | 1 | 0.487 | 0.489 | -0.003
1 | 0 | 0.442 | 0.495 | -0.053
2 | 1 | 0.457 | 0.494 | -0.037
3 | 1 | 0.514 | 0.489 | 0.025
4 | 1 | 0.549 | 0.491 | 0.058
5 | 0 | 0.439 | 0.497 | -0.058
6 | 0 | 0.462 | 0.492 | -0.030
7 | 1 | 0.479 | 0.498 | -0.019
8 | 0 | 0.462 | 0.492 | -0.030
9 | 0 | 0.455 | 0.493 | -0.038
The result of this analysis is that we have now identified doctors 3 and 4 as skilled, and we no longer think several unskilled doctors are skilled (as we did with matching). In particular, matching estimated that doctor 9 increased the survival of their patients relative to other doctors, but propensity reweighting reveals this not to be the case. How is this possible? Because propensity reweighting gives us a better estimate of the control (i.e., NOT being treated by any given doctor d): we can use the patients of all the other doctors instead of a single matched patient.

IPython Notebook

Python code for this entire analysis (and the generative model) can be found in this IPython notebook: link to iPython notebook code

Conclusions

Causality, even (perhaps especially) in simulation, is fun. I hope I have highlighted some interesting themes about drawing counter-intuitive conclusions from data that at first look straightforward. One main difference between my simulated scenario and many others that come up in causal analysis is that usually we're in danger of over-estimating the effect of a treatment (e.g., of going to university, of enrolling in a jobs program), whereas in this scenario we are under-estimating the effectiveness of highly skilled doctors.