Comparing ATE estimators in multisite and cluster randomized trials

by Luke Miratrix

It is a sad truth that the same data can be analyzed in different ways to get different results; often, multiple models are all reasonable choices for a given dataset. Still, it is not a great feeling to wonder whether what seems like a subjective design decision is driving the results.

In some ongoing work, we are comparing a number of different estimators, and to do this we have written code implementing a range of different models that can all be applied uniformly to the same data. This has resulted in two R packages, blkvar and clusterRCT, with blkvar for multisite randomized trials and clusterRCT for blocked, cluster-randomized trials. You can install these packages from GitHub as follows:

devtools::install_github("lmiratrix/blkvar")
devtools::install_github("lmiratrix/clusterRCT")

These packages are quite easy to use, and I thought I would write up an overview of them, in case you want to see how different estimators give you different estimates on your own data.

Multi-site (blocked) individually randomized trials

Multisite trials are a common design in education, where a treatment is randomly assigned to units (e.g., students) within sites (e.g., schools). They are, in effect, a collection of mini-experiments that are then averaged over to get an overall estimated impact. A multisite trial is a form of blocked experiment, where the sites are the blocks. Frequently, each mini-experiment is too small to detect effects on its own, but by averaging across sites you can get a sense of the overall trend.

It turns out that the way you average matters (do you weight big sites in proportion to their size, or weight all sites equally?). It also turns out that you can either be estimating for the sites in the sample you have, or for some broader population of sites from which they were, or could have been, drawn. These choices are substantive: depending on your goals, you want to pick an estimator that targets what you want. See our paper, L. W. Miratrix, Weiss, and Henderson (2021), for more discussion of this.
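
To make the weighting point concrete, here is a tiny sketch in base R (with made-up numbers, not using the package) of how weighting sites equally versus by size can move the overall estimate:

# Hypothetical site sizes and site-level impact estimates
n       <- c( 200, 150, 60, 30, 10 )
tau_hat <- c( 0.10, 0.15, 0.40, 0.55, 0.80 )

mean( tau_hat )              # sites weighted equally: 0.40
weighted.mean( tau_hat, n )  # sites weighted by size: about 0.20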

But even within a given, more precisely defined goal, you have options. For example, you can use a design-based estimator, a linear model, a linear model with weights, or a multilevel model. How might these differ? You can easily ask empirically, like so:

library( blkvar )
library( dplyr )   # for the %>% pipe used below

# Simulate a fake multisite dataset with three covariates
datMS = generate_multilevel_data( num.X = 3 )

blkvar::compare_methods( Yobs, Z, sid, data=datMS ) %>% 
    knitr::kable( digits = 2 )
method           ATE_hat    SE
DB-FP-Persons       0.34  0.07
DB-FP-Sites         0.38  0.08
DB-SP-Persons       0.34  0.12
DB-SP-Sites         0.38  0.13
FE                  0.34  0.07
FE-CR               0.34  0.13
FE-Club             0.34  0.12
FE-Het              0.34  0.07
FE-IPTW             0.34  0.07
FE-IPTW-Sites       0.38  0.08
FE-Int-Persons      0.34  0.07
FE-Int-Sites        0.38  0.08
FIRC                0.38  0.13
RICC                0.34  0.07
RIRC                0.37  0.12
hybrid_m            0.34  0.07
hybrid_p            0.34  0.07
plug_in_big         0.34  0.07

Each row in the above is a different method for estimating the average treatment effect. See the package documentation and our technical supplement A for technical notes. The last three follow Pashley and Miratrix (2021), and are designed for blocked randomized experiments when you may not have many individuals within each block.

In scanning our list of average treatment effect (ATE) estimates, we see that the point estimates are all about the same, but the standard error estimates can be quite different. The \(\widehat{SE}\)s differ because some of these estimators are targeting a superpopulation, and thus carry additional uncertainty regarding whether our dataset is representative of that superpopulation. We also often find larger standard error estimates when targeting the effect for the average site rather than the effect for the average person.
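
Since compare_methods() returns its results as a table (a data frame, given that knitr::kable() is formatting it above), you can also manipulate the output directly. For example, a quick way to see the finite- versus super-population gap is to sort by the estimated standard error (this assumes the method, ATE_hat, and SE column names shown in the table):

res <- blkvar::compare_methods( Yobs, Z, sid, data=datMS )
res %>%
    dplyr::arrange( SE ) %>%
    knitr::kable( digits = 2 )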

As a small note, I am specifying blkvar::compare_methods() since both packages I am discussing have the same method name for comparing a bunch of estimators. Sorry for that name collision!

Anyway, you can also use our calc_summary_stats_formula() method to get summary statistics for the dataset by site (i.e., block):

desc <- calc_summary_stats_formula( Yobs ~ Z * sid, data=datMS )
head( desc )
##    B n0       Ybar0      var0 n1       Ybar1      var1  n
## 1  1  9 -0.22770006 0.2500228  9 -0.19868097 0.1030202 18
## 3  2  3 -1.37314556 1.2251958  3 -0.62284282 1.0016651  6
## 5  3  7  0.78382216 0.1178015  7  1.43466442 0.1526219 14
## 7  4  8  0.05446895 0.4244150  8  0.05298241 0.1583759 16
## 9  5  2 -0.03849680 0.0799217  2  1.02788703 1.3993811  4
## 11 6  2  0.51218450 1.0303693  2 -0.23312612 1.0572995  4
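
These per-site summaries are enough to rebuild the simplest estimators by hand. As a sketch (using the Ybar1, Ybar0, and n columns above), the site-level impact estimates and their equal- versus size-weighted averages should roughly track the Sites and Persons rows in the earlier table:

desc %>%
    dplyr::mutate( tau_hat = Ybar1 - Ybar0 ) %>%
    dplyr::summarise( ATE_sites   = mean( tau_hat ),
                      ATE_persons = weighted.mean( tau_hat, n ) )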

You can also adjust for covariates:

blkvar::compare_methods( Yobs, Z, sid, data=datMS, 
                         control_formula = ~ X1 + X2 + X3 ) %>% 
    knitr::kable( digits = 2)
method               ATE_hat    SE
DB-FP-Persons-adj       0.34  0.07
DB-FP-Sites-adj         0.38  0.08
DB-SP-Persons-adj       0.34  0.12
DB-SP-Sites-adj         0.38  0.13
FE-CR-adj               0.34  0.13
FE-Club-adj             0.34  0.12
FE-Het-adj              0.34  0.07
FE-IPTW-Sites-adj       0.39  0.08
FE-IPTW-adj             0.34  0.07
FE-Int-Persons-adj      0.34  0.07
FE-Int-Sites-adj        0.38  0.08
FE-adj                  0.34  0.07
FIRC-adj                0.38  0.13
RICC-adj                0.34  0.07
RIRC-adj                0.37  0.12
hybrid_m                0.34  0.07
hybrid_p                0.34  0.07
plug_in_big             0.34  0.07

This will include covariate adjustment for each method. For details on how these methods work, see the technical appendix to our paper.

Blocked, cluster-randomized trials

We are currently extending the general ideas we explored in multisite experiments to blocked, cluster-randomized trials. These are trials that have sites, as before, but where we randomize clusters within those sites to treatment or control. For example, we might randomize schools (the clusters) within districts (the blocks) to treatment or control, and then look at outcomes for students (the individuals).

Our new package, clusterRCT, does the same thing as blkvar, but for this alternate design (and thus for alternate estimators):

library( clusterRCT )
data( fakeCRT )
clusterRCT::compare_methods( Yobs ~ T.x | S.id | D.id, data=fakeCRT,
                             control_formula = ~ X.jk + C.ijk ) %>%
    dplyr::select( method:df ) %>%
    knitr::kable( digits = 2)
method            weight   ATE_hat  SE_hat  p_value      df
LRa-FE-db         Cluster     0.16    0.08     0.04  187.00
LRa-FE-het        Cluster     0.16    0.08     0.04  187.00
LRa-FIcw-db       Cluster     0.16    0.08     0.04  178.00
LRa-FIcw-het      Cluster     0.16    0.08     0.04  178.00
LRapw-FE-db       Person      0.16    0.08     0.04  187.00
LRapw-FE-het      Person      0.16    0.08     0.04  187.00
LRapw-FIpw-db     Person      0.17    0.08     0.03  178.00
LRapw-FIpw-het    Person      0.17    0.08     0.03  178.00
LRi-FE-crve       Person      0.16    0.08     0.04  152.73
LRi-FE-db         Person      0.16    0.08     0.04  187.00
LRi-FIpw-crve     Person      0.16    0.08     0.03  141.13
LRi-FIpw-db       Person      0.16    0.08     0.03  178.00
LRicw-FE-crve     Cluster     0.15    0.08     0.06  136.05
LRicw-FE-db       Cluster     0.15    0.08     0.06  187.00
LRicw-FIcw-crve   Cluster     0.15    0.08     0.05  135.92
LRicw-FIcw-db     Cluster     0.15    0.08     0.05  178.00
MLM-FE            Cluster     0.15    0.08     0.05  184.08
MLM-FIcw          Cluster     0.16    0.08     0.04  178.00
MLM-RE            Cluster     0.15    0.08     0.06  185.02

Note that we have gotten a bit more clever as we have gotten older: you can use formula notation in the compare_methods() call to specify the outcome, the treatment assignment, the cluster membership, and the block membership. If you do not have blocks, just omit the last part (the “| D.id”).
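
For example, an unblocked call would look like this (a sketch; myCRT is a hypothetical dataset with the same column names as fakeCRT but no blocking variable):

# myCRT is a stand-in for an unblocked cluster-randomized dataset
clusterRCT::compare_methods( Yobs ~ T.x | S.id, data=myCRT,
                             control_formula = ~ X.jk + C.ijk )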

We also have a nicer data description method, describe_clusterRCT(). I will show you how it works, and also show off that it will automatically handle missing data (as clusterRCT’s compare_methods() also does):

data( fakeBrokeCRT )
desc <- clusterRCT::describe_clusterRCT( Yobs ~ T.x | S.id | D.id, data=fakeBrokeCRT,
                                         control_formula = ~ X.jk + C.ijk )
## Warning: Missing data dropped and imputed in patch_data_set. 270 rows of 1021
## dropped.  34 values imputed.
## Warning: Dropping all-tx and/or all-co blocks. 293 of 751 rows of data dropped.

The first warning informs us that describe_clusterRCT() has automatically done mean imputation for the covariates (but has dropped all observations missing treatment assignment, block membership, or outcome). The second warning indicates that the method has detected an odd structure in the data, in this case blocks that are entirely treated or entirely control; those blocks are dropped as well.

Let’s see what summary statistics we get as output:

desc
## Cluster RCT: 458 units in 97 clusters across 5 blocks
## Block Statistics:
## 	Avg units/block: 91.60 (coef var 0.35)
## 	Avg clusters/block: 19.40 (coef var 0.34)
## 	25-75 Quantiles: 17.0-23.0 w/ IQR = 6.0
## 	ICC: 0.26
## 	Avg tx: 0.50 (coef var 0.22)
## 	25-75 Quantile tx: 0.48-0.50 w/ IQR 0.02
## Cluster Statistics:
## 	Avg size: 4.72 (coef var 0.33)
## 	25-75 Quantiles: 4.0-6.0 w/ IQR = 2.0
## 	ICC: 0.26
## 	Prop treated: 0.48
## 	R2.2: 0.44 (3 covariates)
## Unit Statistics:
## 	stddev( Y0 ): 1.07
## 	R2.1: 0.07 (2 covariates)
## missing data counts:
##      Yobs         Z clusterID   blockID      X.jk     C.ijk 
##        95        71       113       101         0        41

The desc variable is also a single-row dataframe, with columns for each summary statistic, making it easy to pull out specific values for other use (and also making it easy to stack descriptions of multiple datasets together).

E.g.,

desc$nbar
## [1] 4.721649
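
And because each description is a single row, descriptions of several datasets stack cleanly. For example (a sketch, reusing the two bundled example datasets from above):

desc_clean <- clusterRCT::describe_clusterRCT( Yobs ~ T.x | S.id | D.id, data=fakeCRT,
                                               control_formula = ~ X.jk + C.ijk )
dplyr::bind_rows( clean = desc_clean, broke = desc, .id = "dataset" )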

See the documentation for more details on what is available in the description.

You can look at some descriptive features by block as well:

make_block_table( Yobs ~ T.x | S.id | D.id, data=fakeCRT)
## # A tibble: 10 × 8
##    blockID     J  nbar   ncv  n.25  n.75 n.IQR  p.tx
##    <fct>   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 1           9  4.78 0.416   3    6     3    0.667
##  2 2          11  5    0.420   4    6     2    0.727
##  3 3          19  5.32 0.243   4.5  6     1.5  0.421
##  4 4          17  5.41 0.236   5    6     1    0.353
##  5 5          23  5.61 0.245   5    7     2    0.435
##  6 6          24  4.83 0.249   4    5.25  1.25 0.458
##  7 7          23  5    0.346   4    6     2    0.478
##  8 8          23  4.74 0.350   3.5  6     2.5  0.478
##  9 9          25  5.28 0.335   4    6     2    0.48 
## 10 10         26  4.96 0.276   4    6     2    0.5

Some reflections

Both blkvar and clusterRCT make it easy to compare different estimators for an average treatment effect on the same data. But one might wonder: should you do that? And our answer is, “yes and no.”

First, it is a good idea to understand how estimators that seem to be estimating the same thing truly are not. It is important to be very specific about the goals of an analysis, beyond vague statements such as “did it work?”

Best practice, these days, is to pre-register your analysis plan. In that case, you are first picking an estimand (and it’s fine to pick more than one). Next, you are picking an estimator from among those targeting your estimand. That selection is the estimator you will use. Without a pre-registered plan, if you look at any of the above tables, you might immediately start to wonder: which one is right? Does it happen to be the one with the largest effect? The one that is statistically significant? (We know this is what we would end up doing! Don’t do this!)

On the flip side, if the estimators targeting your chosen estimand do not all give about the same answer, it does seem like you should worry. This is akin to being asked to do a “sensitivity check” on your results in a paper. Frequently, for example, people ask to see adjusted and unadjusted results for an RCT’s impact estimate. Just as covariate adjustment can change an estimate, so can the choice of estimator.
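
With clusterRCT, this kind of check is easy because the comparison output labels each estimator with its weighting (Person vs. Cluster). For example, here is a quick look at the spread of point estimates within each weighting scheme (a sketch, using the weight and ATE_hat columns shown earlier):

res_crt <- clusterRCT::compare_methods( Yobs ~ T.x | S.id | D.id, data=fakeCRT,
                                        control_formula = ~ X.jk + C.ijk )
res_crt %>%
    dplyr::group_by( weight ) %>%
    dplyr::summarise( min_ATE = min( ATE_hat ),
                      max_ATE = max( ATE_hat ) )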

Overall, these packages make doing these kinds of sensitivity analyses straightforward. But they also make it even more important to lay your cards on the table before you look at your data. I.e., you should pre-register your analysis plan, and then stick to it as best you can.

Comparison more broadly

Overall (with the caveat on pre-registration), I believe packages should promote comparisons, and I try to do that in my work more broadly. These two packages are very explicit about comparison, and allow cross-method comparison fairly directly. (They also provide function calls and source code for the individual methods, in case that is of interest; see package documentation for details.)

Other packages I have worked on keep a similar spirit of promoting comparison in one way or another. For example, textreg, designed to compare text corpora to one another (see the vignette on CRAN, or L. Miratrix and Ackerman (2016) for more), has visualization methods that let you compare a set of models with different tuning parameters and configurations to see where these models summarize the text similarly and where they differ. Similarly, hettx, focused on estimating treatment effect variation, offers several options for regression adjustment within the scope of the package.

For power calculations, PUMP has a method, update_grid(), that will take an initial power result and recalculate power for all combinations of a set of passed parameters, allowing for easy exploration of how different choices can impact your results. (You can read more about this in this blog post about using PUMP, or in Hunter, Miratrix, and Porter (2024), our overview paper.)

In general, comparison allows us to see when what appears to be the same is actually different, and it also allows us to quickly identify what may be the drivers of a given result, or identify when things are much more unstable than we might hope.

Funder Acknowledgement

The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through R305D220046 to MDRC. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education. Thanks to Mike Weiss for initial comments and edits on this draft; it has been a pleasure working on these projects with him.

References

Hunter, Kristen B, Luke Miratrix, and Kristin Porter. 2024. “PUMP: Estimating Power, Minimum Detectable Effect Size, and Sample Size When Adjusting for Multiple Outcomes in Multi-Level Experiments.” Journal of Statistical Software 108: 1–43.
Miratrix, Luke W, Michael J Weiss, and Brit Henderson. 2021. “An Applied Researcher’s Guide to Estimating Effects from Multisite Individually Randomized Trials: Estimands, Estimators, and Estimates.” Journal of Research on Educational Effectiveness 14 (1): 270–308.
Miratrix, Luke, and Robin Ackerman. 2016. “Conducting Sparse Feature Selection on Arbitrarily Long Phrases in Text Corpora with a Focus on Interpretability.” Statistical Analysis and Data Mining: The ASA Data Science Journal 9 (6): 435–60.
Pashley, Nicole E, and Luke W Miratrix. 2021. “Insights on Variance Estimation for Blocked and Matched Pairs Designs.” Journal of Educational and Behavioral Statistics 46 (3): 271–96.
