You Are Taking An Average
This is an excerpt from my forthcoming book on synthetic control methods.
Averaging in Statistics
Averages are the most basic summary statistic we are taught, aside from (I suppose) summation, and they arise quite naturally when we try to find a single value that best represents a collection of numbers. In ye olden days, Egyptians and Babylonians used summary statistics in construction and financial matters. In modern times, we use averages to summarize things like test grades, incomes, and other variables we care about.
Even still, the arithmetic average is, for most people, more of an ontological concept than a problem that must be solved for. For the average person or data scientist, the arithmetic average is mentally more akin to facts like “the sun rises” or “Spain is in Europe” than to something to be derived, solved for, or even justified from first principles. Indeed (at least in the United States), the arithmetic mean is taught with the almost childlike innocence of simple addition and division. People regularly use phrases like “on average” in casual conversation to denote frequency. But what is an average, anyways?
The point of this post is to show that the arithmetic average is the solution to an optimization problem. I will connect it to how we think about counterfactual estimation.
Definitions
In doing so, we will require a few rules. In the book I have a dedicated chapter on the required mathematics, but I repeat the definitions here for convenience.
(optional mathematical background)
Definition 1 (The Derivative) For a scalar function \(f : \mathbb{R} \to \mathbb{R}\), the derivative measures the instantaneous rate of change: \[ \frac{\mathrm{d}f(x)}{\mathrm{d}x} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}. \]
Intuition: How \(f(x)\) changes for an infinitesimal change in \(x\).
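To see the limit in action, here is a minimal numerical sketch; the function \(f(x) = x^2\), the point \(x = 3\), and the step sizes are illustrative choices of mine, not anything required by the definition.

```python
# Numerical illustration of the derivative as a limit of difference quotients.
# The function, the point x = 3, and the step sizes are illustrative choices.
def f(x):
    return x ** 2

x = 3.0
for h in [1.0, 0.1, 0.01, 0.001]:
    quotient = (f(x + h) - f(x)) / h
    print(f"h = {h:<6} -> difference quotient = {quotient:.4f}")
# The quotients approach 6.0, which is the exact derivative 2x at x = 3.
```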
Derivatives have rules attached to them. For example, one is the power rule:
Definition 2 (The Power Rule) If \(f(x) = x^n\) for \(n \in \mathbb{R}\): \[ \frac{\mathrm{d}f(x)}{\mathrm{d}x} = n x^{n-1}. \]
Worked example: \(\frac{\mathrm{d}}{\mathrm{d}x} x^3 = 3x^2\).
Pretty straightforward. The derivative of \(x^2\) is just \(2x\). For \(4q^5\) it’s \(20q^4\). Another important rule is the chain rule.
Definition 3 (The Chain Rule) For \(y = f(g(x))\): \[ \frac{\mathrm{d}y}{\mathrm{d}x} = \frac{\mathrm{d}f}{\mathrm{d}g} \cdot \frac{\mathrm{d}g}{\mathrm{d}x}. \]
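If you would rather let a computer confirm these rules, here is a quick sketch using sympy (assuming it is installed in your environment); the symbols and expressions are my own illustrative choices.

```python
# Symbolic check of the power rule and the chain rule with sympy.
import sympy as sp

x, q, mu = sp.symbols("x q mu")

print(sp.diff(x**3, x))          # power rule: 3*x**2
print(sp.diff(4 * q**5, q))      # power rule with a constant factor: 20*q**4
print(sp.diff((x - mu)**2, mu))  # chain rule, equal to -2*(x - mu); this reappears below
```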
Definition 4 (Objective Function) A scalar-valued function we aim to minimize or maximize:
\[ f : \mathbb{R}^n \to \mathbb{R}, \]
\[ \min_{\mathbf{x}} f(\mathbf{x}) \quad \text{or} \quad \max_{\mathbf{x}} f(\mathbf{x}). \]
Deriving the Arithmetic Mean
To formalize this, suppose we have a sequence of data points \((x_i)_{i=1}^n \subset \mathbb{R}, \quad n \in \mathbb{Z}^+\). We want to choose a single number, \(\mu\) (pronounced “m-you”), that is as close as possible to all of them. Here, \(\mu\) is a placeholder variable representing a candidate for the “best” summary value, which we must solve for. For each observed value \(x_i\), the deviation from \(\mu\) is
\[ x_i - \mu. \]
However, we do not want to minimize the distance to a single number; we want a number that minimizes the discrepancy to all of the numbers. To measure the overall discrepancy across all points, we sum these deviations:
\[ \sum_{i=1}^{n} (x_i - \mu). \]
This quantity alone is not a useful objective: positive and negative deviations cancel each other out, and because the sum is linear in \(\mu\) it has no minimum at all. (Evaluated at the mean, the signed deviations sum to exactly zero, a fact we will recover below, but that gives us nothing to minimize.) To penalize deviations regardless of sign, we square each term, giving the objective function:
\[ \hat{\mu} = \operatorname*{argmin}_{\mu \in \mathbb{R}} \sum_{i=1}^{n} (x_i - \mu)^2. \]
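As a quick sanity check of why the squaring matters, here is a small sketch with made-up data: the signed deviations cancel (and sum exactly to zero at the mean), while the squared deviations genuinely distinguish candidate values of \(\mu\).

```python
import numpy as np

# Made-up sample; any real-valued data behaves the same way.
x = np.array([1.0, 2.0, 3.0, 10.0])  # mean is 4.0

def signed_sum(mu):
    return np.sum(x - mu)

def squared_sum(mu):
    return np.sum((x - mu) ** 2)

for mu in [0.0, 2.0, 4.0]:
    print(f"mu = {mu}: signed sum = {signed_sum(mu)}, squared sum = {squared_sum(mu)}")
# The signed sum is linear in mu and hits zero at the mean,
# while the squared objective clearly penalizes being far from the data.
```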
This function is convex and differentiable, so we can find its minimum by taking the derivative with respect to \(\mu\). The function \((x_i - \mu)^2\) is a composition of two functions, so we apply the chain rule. The outer function is \(f(u) = u^2\), where \(u = x_i - \mu\), and its derivative (by the power rule) is
\[ \frac{\mathrm{d}}{\mathrm{d}u} u^2 = 2u. \]
The inner function is \(u(\mu) = x_i - \mu\). Since \(x_i\) is constant with respect to \(\mu\),
\[ \frac{\mathrm{d}}{\mathrm{d}\mu} (x_i - \mu) = -1. \]
By the chain rule, multiplying the outer and inner derivatives gives
\[ \frac{\mathrm{d}}{\mathrm{d}\mu} (x_i - \mu)^2 = 2(x_i - \mu)\cdot(-1) = -2(x_i - \mu). \]
Finally, summing over all \(i\) gives the derivative of the entire objective function:
\[ \frac{\mathrm{d}}{\mathrm{d}\mu} \sum_{i=1}^{n} (x_i - \mu)^2 = \sum_{i=1}^{n} [-2(x_i - \mu)] = -2 \sum_{i=1}^{n} (x_i - \mu), \]
where we factor out the constant \(-2\). Setting this derivative equal to zero gives the first-order condition for a minimum:
\[ -2 \sum_{i=1}^{n} (x_i - \mu) = 0. \]
Dividing both sides by \(-2\) simplifies the equation to
\[ \sum_{i=1}^{n} (x_i - \mu) = 0. \]
Next, we split the sum into two parts to isolate \(\mu\):
\[ \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \mu = 0. \]
Here, \(\mu\) is the same value for every term in the sum and does not depend on \(i\). This is exactly what we seek: one single value that is as close as possible to all the data points simultaneously. Because \(\mu\) is constant with respect to the index \(i\), the second sum is simply \(\mu\) added to itself \(n\) times:
\[ \sum_{i=1}^{n} \mu = n\mu. \]
Substituting this result gives
\[ \sum_{i=1}^{n} x_i - n\mu = 0. \]
Rearranging the equation yields
\[ \sum_{i=1}^{n} x_i = n\mu. \]
Finally, dividing both sides by \(n\) gives
\[ \mu = \frac{1}{n} \sum_{i=1}^{n} x_i. \]
Since the second derivative of the objective function is
\[ \frac{\mathrm{d}^2}{\mathrm{d}\mu^2} \sum_{i=1}^{n} (x_i - \mu)^2 = \frac{\mathrm{d}}{\mathrm{d}\mu} \left[-2 \sum_{i=1}^{n} (x_i - \mu)\right] = 2n > 0 \quad (n \geq 1), \]
this critical point is a global minimum.
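For those who prefer a numerical confirmation over calculus, here is a minimal sketch (the data are illustrative, and I am assuming scipy is available): a generic scalar minimizer applied to the squared-deviation objective lands on the arithmetic mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative data; the result holds for any real-valued sample.
x = np.array([3.0, 7.0, 1.0, 4.0, 10.0])

def objective(mu):
    return np.sum((x - mu) ** 2)

result = minimize_scalar(objective)
print(result.x)   # numerical minimizer, approximately 5.0
print(x.mean())   # arithmetic mean, exactly 5.0: the two agree up to solver tolerance
```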
Definition 5 (Arithmetic Average) For some \((x_i)_{i=1}^n \subset \mathbb{R}, \quad n \in \mathbb{Z}^+\), the arithmetic average is
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i, \]
where \(n\) is the length of the sequence and each observation receives equal weight \(1/n\).
As we can see, the value of \(\mu\) that minimizes the total squared deviation in this instance is the arithmetic mean of the data. Another way to think about this result is geometrically, using the language of Hilbert spaces. Consider the vector of data points
\[ \mathbf{x} = (x_1, x_2, \dots, x_n) \in \mathbb{R}^n \]
equipped with the standard inner product
\[ \langle \mathbf{a}, \mathbf{b} \rangle = \sum_{i=1}^{n} a_i b_i. \]
Minimizing the sum of squared deviations
\[ \sum_{i=1}^{n} (x_i - \mu)^2 \]
is equivalent to finding the orthogonal projection of \(\mathbf{x}\) onto the one-dimensional subspace spanned by the vector \(\mathbf{1} = (1, 1, \dots, 1)\). The solution \(\mu\) is exactly the scalar that defines this projection:
\[ \boxed{\mu = \frac{1}{n} \sum_{i=1}^{n} x_i}. \]
From this perspective, the projection \(\bar{x}\,\mathbf{1}\) is the vector in the direction of \(\mathbf{1}\) that is closest in Euclidean distance to the data vector \(\mathbf{x}\), and the arithmetic mean is the scalar coefficient that defines it.
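Here is a short NumPy sketch of the projection view, again with illustrative data: projecting \(\mathbf{x}\) onto the span of \(\mathbf{1}\) returns a constant vector whose entries are the arithmetic mean.

```python
import numpy as np

# Illustrative data vector.
x = np.array([3.0, 7.0, 1.0, 4.0, 10.0])
ones = np.ones_like(x)

# Orthogonal projection of x onto span{1}: coefficient = <x, 1> / <1, 1>.
coefficient = np.dot(x, ones) / np.dot(ones, ones)
projection = coefficient * ones

print(coefficient)  # equals x.mean()
print(projection)   # the constant vector closest to x in Euclidean distance
```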
So What?
Now, you may be saying, “Hey Jared, this is overkill. Why bother doing all of this for something as simple as an average?” The derivation shows us that the average is the solution to an optimization problem.
That single insight connects simple statistics to the machinery of modern econometrics. Variance, mean squared error, and regression coefficients are all built on the same principle: finding a number, or set of numbers, that minimizes some measure of error. The arithmetic mean is simply the most familiar and elementary instance of this idea. Every estimator in this book is, at its core, an application of averaging. This fact will become crucial in later chapters, when we move from simple averages to more sophisticated ones, including the weighted averages that power synthetic control methods. Whatever method you choose, you’re never not averaging.
The reason I dedicate a chapter to the arithmetic mean is also to make sure we start on the right footing. For some reason, synthetic control methods are often described in dramatic terms: as a Frankenstein’s monster, a mash-up of controls, or a “fake version” of the treated unit. For example, Haus and Meta scientists have used these kinds of analogies to frame SCM. Other descriptions present variations of the same idea, including the claim that SCM constructs a “fake control group.”
While I understand the intent behind this phrasing, I believe it is misleading, or at the very least has a high potential to be. As we have just seen, the arithmetic mean is itself a special case of a weighted average, one in which all units receive equal weight. Yet we do not describe the national average income as “Franken-Income,” nor the weekly mean of basket size as a “Franken-Basket.” Likewise, the idea of a “fake control group” obscures the fact that every unit in the donor pool is real, observed data. Synthetic control predictions, like any other average, do not fabricate data at all. They simply assign weights, subject to constraints, in exactly the same spirit as more familiar averages.
In fact, these descriptions become even more puzzling once we notice that the arithmetic mean already satisfies the classic SCM constraints. Recall the formula
\[ \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i. \]
We can rewrite this as a weighted sum by setting \(w_i = 1/n\) for all \(i\), yielding \(\sum_{i=1}^n y_i w_i\). These weights are non-negative and sum to one, which are precisely the defining constraints of the synthetic control method, which operates over multiple vectors (units) rather than a single vector. The arithmetic mean of the donor pool, therefore, is a special case of SCM in which all weights are uniform. The only meaningful difference is that classic SCM does not impose uniformity.
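To make this concrete, here is a small sketch with an illustrative donor pool: the uniform weights \(w_i = 1/n\) are non-negative, sum to one, and reproduce the arithmetic mean as a weighted sum.

```python
import numpy as np

# Illustrative outcomes for a donor pool of n control units.
y = np.array([12.0, 9.0, 15.0, 10.0])
n = len(y)

# Uniform weights: the arithmetic mean written as a weighted sum.
w = np.full(n, 1.0 / n)

print(np.all(w >= 0))          # True: the non-negativity constraint holds
print(np.isclose(w.sum(), 1))  # True: the weights sum to one
print(y @ w, y.mean())         # the weighted sum equals the arithmetic mean (11.5)
```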
Yet the moment we allow non-uniform weights with \(\sum_i w_i = 1\), the same operation is suddenly described as sophisticated or different or complex, even though the weights from SCM, like the arithmetic mean, are also the byproduct of an optimization problem. The key difference is that the arithmetic average has a trivial closed-form solution and the SCM weights do not. If a client asked me to compute the average of sales across markets over time and I replied, “You mean you want me to make a mashed-up Frankenstein monster of your sales data?” they would rightly think I had lost the plot.
I think part of the reason for this anxiety around weighted averages in SCM is that the simple arithmetic average of some subset of controls is implicitly framed as Not a Decision. The arithmetic-average weights just are what they are by construction, so much so that many people (I suspect) do not even see them as a decision they are making. Presumably this is also why the uniform weights in, say, Difference-in-Differences are never questioned. Weights that are not uniform, and that openly come from an optimization, are met with more suspicion: once the weighting is framed as a decision, it becomes something we may attack or defend.
The point of this post is simple: every single estimator in this book uses a weighted average of controls that is derived from an optimization problem, whether we are discussing Difference-in-Differences or SCM. There’s nothing inherently special about either the uniform weight scheme or the non-uniform one. Both are byproducts of optimization problems, and it is the task of the econometrician or data scientist interested in causal inference to know which solution is called for, and when.