What Is An Average?

Econometric Theory
Statistics
Author

Jared Greathouse

Published

January 29, 2026

This is an excerpt from my forthcoming book on synthetic control methods.

“I’m too profound to go back and forth, with no average dork”
Solana Imani Rowe


Averages are the most basic summary statistic we are taught, aside from (I suppose) summation, and they arise quite naturally when we try to find a single value that best represents a collection of numbers. In ye olden days, the Egyptians and Babylonians used summary statistics to aid in decision making for astronomy, agriculture, construction and financial matters. In modern times, we use summary statistics like averages to summarize things like test grades, incomes, and other variables we care about.

Even still, the arithmetic average is, for most people, more of an ontological concept than a problem that must be solved for. For the average person/data scientist, the arithmetic average is mentally more akin to facts like “the sun rises” or “Spain is in Europe” than something to be derived, solved for, or even justified from first principles. Indeed (at least in the United States), the arithmetic mean is taught with almost a childlike innocence of simple addition and division. Regularly, people use phrases like “on average” in casual conversation to denote frequency. But what is an average anyways?

The point of this post is to show that the arithmetic average is actually the solution to an optimization problem. I will connect it to how we think about counterfactual estimation in econometrics and marketing science.

Definitions

We will require a few rules. In the book I have a dedicated chapter on the required mathematics, but I repeat the relevant definitions here for convenience.

(optional mathematical background)

Definition 1 (The Derivative) For a scalar function \(f : \mathbb{R} \to \mathbb{R}\), the derivative measures the instantaneous rate of change: \[ \frac{\mathrm{d}f(x)}{\mathrm{d}x} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}. \]

Intuition: How \(f(x)\) changes for an infinitesimal change in \(x\).

Derivatives have rules attached to them. For example, one is the power rule:

Definition 2 (The Power Rule) If \(f(x) = x^n\) for \(n \in \mathbb{R}\): \[ \frac{\mathrm{d}f(x)}{\mathrm{d}x} = n x^{n-1}. \]

Worked example: \(\frac{\mathrm{d}}{\mathrm{d}x} x^3 = 3x^2\).

Pretty straightforward. The derivative of \(x^2\) is just \(2x\). For \(4q^5\) it’s \(20q^4\). Another important rule is the chain rule.

Definition 3 (The Chain Rule) For \(y = f(g(x))\): \[ \frac{\mathrm{d}y}{\mathrm{d}x} = \frac{\mathrm{d}f}{\mathrm{d}g} \cdot \frac{\mathrm{d}g}{\mathrm{d}x}. \]
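Neither rule needs to be taken on faith. As a quick check, the sketch below uses sympy (an assumption on my part; it is not otherwise used in this post) to differentiate the examples above plus one chain-rule composition.

```python
# Spot-check the power rule and chain rule with sympy (assumed to be installed).
import sympy as sp

x, q = sp.symbols("x q")

# Power rule examples from the text
print(sp.diff(x**3, x))        # 3*x**2
print(sp.diff(x**2, x))        # 2*x
print(sp.diff(4 * q**5, q))    # 20*q**4

# Chain rule: outer function u**3, inner function u = x**2 + 1
print(sp.diff((x**2 + 1)**3, x))  # 6*x*(x**2 + 1)**2
```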

Definition 4 (Objective Function) A scalar-valued function we aim to minimize or maximize:

\[ f : \mathbb{R}^n \to \mathbb{R}, \]

\[ \min_{\mathbf{x}} f(\mathbf{x}) \quad \text{or} \quad \max_{\mathbf{x}} f(\mathbf{x}). \]

Deriving the Arithmetic Mean

To begin, suppose we have some sequence of data points \((x_i)_{i=1}^n \subset \mathbb{R}, \quad n \in \mathbb{Z}^+\). We want to choose a single number, \(\mu\) (pronounced “m-you”), that is as close as possible to all of the numbers in the sequence. Here, \(\mu\) is a placeholder variable representing a candidate for the “best” summary value. We do not know its value a priori, so we must solve for it. First, for each observed value of the sequence \(x_i\), define the deviation from \(\mu\) as

\[ x_i - \mu. \]

Simple enough, this is just subtracting one value from the other value. However, we do not want to minimize the distance to a single number in the sequence; we want a single number that minimizes the distance to all of the numbers in the sequence at once. To measure the overall discrepancy across all points, we can now sum these deviations:

\[ \sum_{i=1}^{n} (x_i - \mu), \]

giving us the total deviation across all the data points. However, even this quantity is not sufficient, because positive and negative deviations can cancel each other out: if one deviation is \(-2\) and the next is \(+2\), adding them up gives a total deviation of 0, which makes no sense. Furthermore, a raw sum treats all deviations linearly, when we would rather penalize larger discrepancies more heavily than smaller ones.

To penalize our deviations regardless of sign, we square each deviation, giving the objective function:

\[ \hat{\mu} = \operatorname*{argmin}_{\mu \in \mathbb{R}} \sum_{i=1}^{n} (x_i - \mu)^2, \]

returning the sum of the squared deviations. Now this is something we can begin to work with.
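Before any calculus, it can help to see this objective numerically. The sketch below (illustrative made-up data, assuming numpy is available) evaluates the sum of squared deviations over a grid of candidate values of \(\mu\); the grid minimizer lands on the arithmetic mean, which is exactly what the derivation that follows proves must happen.

```python
# Evaluate the objective sum_i (x_i - mu)^2 over a grid of candidate mu values.
import numpy as np

x = np.array([2.0, 5.0, 7.0, 10.0])            # an illustrative sequence x_1, ..., x_n
mu_grid = np.linspace(x.min(), x.max(), 1001)  # candidate values for mu

# Broadcasting computes the objective for every candidate at once
sse = ((x[:, None] - mu_grid[None, :]) ** 2).sum(axis=0)

print(mu_grid[sse.argmin()])  # 6.0, the grid minimizer
print(x.mean())               # 6.0, the arithmetic mean
```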

This function of the residuals is strictly convex and differentiable. We can find its minimum by taking the derivative of the function with respect to \(\mu\). The function \((x_i - \mu)^2\) is a composition of two functions, so we apply the chain rule. The outer function is \(f(u) = u^2\), where \(u = x_i - \mu\). Its derivative (by the power rule) is

\[ \frac{\mathrm{d}}{\mathrm{d}u} u^2 = 2u. \]

The inner function is \(u(\mu) = x_i - \mu\), or the deviation that we spoke of above. Since \(x_i\) is constant with respect to \(\mu\),

\[ \frac{\mathrm{d}}{\mathrm{d}\mu} (x_i - \mu) = -1. \]

By the chain rule, multiplying the outer and inner derivatives gives us

\[ \frac{\mathrm{d}}{\mathrm{d}\mu} (x_i - \mu)^2 = 2(x_i - \mu)\cdot(-1) = -2(x_i - \mu). \]

Finally, summing over all \(i\) gives the derivative of the entire objective function:

\[ \frac{\mathrm{d}}{\mathrm{d}\mu} \sum_{i=1}^{n} (x_i - \mu)^2 = \sum_{i=1}^{n} [-2(x_i - \mu)] = -2 \sum_{i=1}^{n} (x_i - \mu). \]

Above, we factor out the constant \(-2\). Setting this derivative equal to zero gives us the first-order condition for a minimum:

\[ -2 \sum_{i=1}^{n} (x_i - \mu) = 0. \]

Now we must solve for the unknown. Dividing both sides by \(-2\) simplifies the equation to

\[ \sum_{i=1}^{n} (x_i - \mu) = 0. \]

Next, we split the sum into two parts using the linearity of summation:

\[ \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \mu = 0. \]

Here we are exploiting the fact that summation is a linear operator: it distributes over addition and allows constants to factor out. Since \(\mu\) does not depend on \(i\), it appears identically in every term of the sum; after all, this is exactly what we seek, one single value that is as close as possible to all the data points simultaneously. The second sum therefore collapses to

\[ \sum_{i=1}^{n} \mu = \underbrace{\mu + \mu + \cdots + \mu}_{n\ \text{terms}} = n\mu. \]

Now we substitute this result into the second term on the left-hand side, yielding

\[ \sum_{i=1}^{n} x_i - n\mu = 0. \]

We now seek to isolate the unknown \(\mu\). First, add \(n\mu\) to both sides:

\[ \sum_{i=1}^{n} x_i = n\mu. \]

To solve for \(\mu\), we divide both sides by \(n\). Equivalently, we multiply both sides by the multiplicative inverse of \(n\), namely \(n^{-1} = \tfrac{1}{n}\) (which exists since \(n \ge 1\)):

Definition 5 (The Arithmetic Average)

Let \((x_i)_{i=1}^n \subset \mathbb{R}\), with \(n \in \mathbb{Z}^+\).
The arithmetic average of the sequence is defined as

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i, \]

where \(n\) is the number of observations and each observation receives equal weight \(1/n\).

Written this way, the arithmetic mean is revealed not merely as a quotient (a divided sum), but also as a linear combination of the data: each observation \(x_i\) is multiplied by the same weight \(1/n\), and the results are summed. In other words, the arithmetic mean is a uniformly weighted sum of the observations.
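This weighted-sum reading is easy to verify numerically. The sketch below (same kind of illustrative data, assuming numpy) shows that a dot product with uniform weights \(1/n\) reproduces np.mean exactly.

```python
# The arithmetic mean as a uniformly weighted sum: dot(w, x) with w_i = 1/n.
import numpy as np

x = np.array([2.0, 5.0, 7.0, 10.0])
n = x.size
w = np.full(n, 1.0 / n)   # equal weight 1/n for every observation

print(np.dot(w, x))       # 6.0
print(x.mean())           # 6.0, identical
```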

To verify that this critical point is indeed a minimum, we examine the second derivative of the objective function:

\[ \frac{\mathrm{d}^2}{\mathrm{d}\mu^2} \sum_{i=1}^{n} (x_i - \mu)^2 = -2 \sum_{i=1}^{n} \frac{\mathrm{d}}{\mathrm{d}\mu}(x_i - \mu) = -2 \sum_{i=1}^{n} (-1) = 2n > 0 \quad (n \geq 1). \]

Since the second derivative is strictly positive, the objective function is strictly convex in \(\mu\), and the critical point is a global minimum. Thus, the value of \(\mu\) that minimizes the total squared deviation from the data is the arithmetic mean. Seen through this lens, the mean is not a primitive concept but the solution to a well-defined optimization problem, and its familiar form arises naturally from basic algebra applied to the first-order condition.
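If you would rather let an optimizer do the algebra, the following sketch (assuming scipy is available) minimizes the objective directly and recovers the mean.

```python
# Minimize the sum of squared deviations numerically and compare with np.mean.
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([2.0, 5.0, 7.0, 10.0])

def sse(mu):
    """Total squared deviation of the data from a candidate value mu."""
    return np.sum((x - mu) ** 2)

result = minimize_scalar(sse)
print(result.x)   # approximately 6.0
print(x.mean())   # 6.0
```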

The Time-Period Average in Panel Data

Let’s graduate to panel data. Suppose we observe panel data for some outcome of interest \((x_{it})_{i=1,\dots,N;\; t=1,\dots,T} \subset \mathbb{R}\),
where \(i\) indexes markets and \(t\) indexes time periods. For concreteness, think of \(x_{it}\) as clicks per million impressions from a CTV campaign in market \(i\) during period \(t\).

Now, our goal is (likely) not to choose a single number that summarizes the entire dataset at once. Instead, we want to choose one number per time period that best represents the cross section of markets observed in that period. Denote this number by \(\gamma_t\). As before, \(\gamma_t\) is a placeholder variable representing a candidate for the best summary value at time \(t\). We do not know its value a priori, and we will solve for it.

Fix a particular time period \(t\). For each market \(i\), define the deviation from \(\gamma_t\) as

\[ x_{it} - \gamma_t. \]

This is simply the difference between the observed value in market \(i\) and the candidate summary value for period \(t\). However, we do not want a number that is close to a single market’s value; we want a number that is close to all markets’ values at that time. To measure the overall discrepancy within period \(t\), we sum these deviations across markets:

\[ \sum_{i=1}^N (x_{it} - \gamma_t). \]

As above, to penalize deviations regardless of sign, and to place greater weight on larger discrepancies, we square each deviation. This yields the objective function

\[ \hat{\gamma}_t = \operatorname*{argmin}_{\gamma_t \in \mathbb{R}} \sum_{i=1}^N (x_{it} - \gamma_t)^2. \]

This objective function measures the total squared deviation between \(\gamma_t\) and all markets’ outcomes in period \(t\). Importantly, this problem is solved separately for each time period, which is why in real life we use SQL or numpy to do it for us.
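To make “solved separately for each time period” concrete, here is a minimal numpy sketch with simulated market-by-period data. A brute-force grid search stands in for the calculus that follows; the formula it is secretly computing is derived next.

```python
# Solve the per-period minimization problem numerically for an N-by-T panel.
import numpy as np

rng = np.random.default_rng(0)
N, T = 5, 4                                     # 5 markets, 4 time periods (simulated)
x = rng.normal(loc=100, scale=10, size=(N, T))  # x[i, t]: outcome for market i at time t

grid = np.linspace(x.min(), x.max(), 2001)      # candidate values for gamma_t
gamma_hat = np.empty(T)
for t in range(T):
    sse = ((x[:, t][:, None] - grid[None, :]) ** 2).sum(axis=0)
    gamma_hat[t] = grid[sse.argmin()]           # one solution per period

print(gamma_hat)
```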

The objective function is strictly convex and differentiable in \(\gamma_t\), so we can find its minimum by taking the derivative with respect to \(\gamma_t\). The function \((x_{it} - \gamma_t)^2\) is a composition of two functions. The outer function is \(f(u) = u^2\), whose derivative is

\[ \frac{\mathrm{d}}{\mathrm{d}u} u^2 = 2u. \]

The inner function is \(u(\gamma_t) = x_{it} - \gamma_t\). Since \(x_{it}\) is constant with respect to \(\gamma_t\),

\[ \frac{\mathrm{d}}{\mathrm{d}\gamma_t}(x_{it} - \gamma_t) = -1. \]

Applying the chain rule yields

\[ \frac{\mathrm{d}}{\mathrm{d}\gamma_t}(x_{it} - \gamma_t)^2 = 2(x_{it} - \gamma_t)(-1) = -2(x_{it} - \gamma_t). \]

Summing over all markets \(i\), the derivative of the full objective function is

\[ \frac{\mathrm{d}}{\mathrm{d}\gamma_t} \sum_{i=1}^N (x_{it} - \gamma_t)^2 = -2 \sum_{i=1}^N (x_{it} - \gamma_t). \]

Setting this derivative equal to zero gives the first-order condition for a minimum:

\[ -2 \sum_{i=1}^N (x_{it} - \gamma_t) = 0. \]

Dividing both sides by \(-2\) simplifies the expression to

\[ \sum_{i=1}^N (x_{it} - \gamma_t) = 0. \]

Following the above argument, we split the sum into two parts:

\[ \sum_{i=1}^N x_{it} - \sum_{i=1}^N \gamma_t = 0. \]

Thus, we have

\[ \sum_{i=1}^N \gamma_t = \underbrace{\gamma_t + \gamma_t + \cdots + \gamma_t}_{N\ \text{terms}} = N \gamma_t. \]

Substituting this result yields

\[ \sum_{i=1}^N x_{it} - N \gamma_t = 0. \]

To isolate \(\gamma_t\), add \(N \gamma_t\) to both sides:

\[ \sum_{i=1}^N x_{it} = N \gamma_t. \]

Dividing both sides by \(N\) gives

\[ \gamma_t = \frac{1}{N} \sum_{i=1}^N x_{it}. \]

Thus, for each time period \(t\), the value \(\gamma_t\) that minimizes the total squared deviation from all markets’ outcomes is the arithmetic mean of those outcomes. This is the exact same quantity as the national/state/regional average at time \(t\), but here it arises directly as the solution to a scalar optimization problem applied separately in each period.
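Continuing the simulated panel from the earlier sketch (regenerated here so the snippet stands alone), the derivation says those per-period minimizers should simply be the column means, which numpy computes in a single call.

```python
# The per-period minimizers are just the cross-sectional (column) means.
import numpy as np

rng = np.random.default_rng(0)
N, T = 5, 4
x = rng.normal(loc=100, scale=10, size=(N, T))  # same simulated panel as before

gamma = x.mean(axis=0)   # one gamma_t per period
print(gamma)             # matches the grid minimizers above, up to grid resolution
```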

So What?

Now, you may be saying, “Hey Jared, this is overkill. Why even bother doing all of this for something as simple as an average?” Well, the derivation shows us that the average is the solution to an optimization problem. This fact connects simple statistics to the machinery of modern econometrics. Variance, mean squared error, and regression coefficients are all built on the same principle: finding a number, or set of numbers, that minimizes some measure of error. The arithmetic mean is simply the most familiar and elementary instance of this idea. Every estimator in this book is, at its core, an application of averaging. This fact will become crucial in later chapters, when we move from simple averages to more sophisticated ones, including the weighted averages that power synthetic control methods. These derivations show that while the arithmetic mean may feel objective, it is, at the end of the day, the solution to an optimization problem that most of us simply never treat as one.

The reason I dedicate the third chapter to the arithmetic mean is also to make sure we start on the right footing. For some reason, synthetic control methods are often described in dramatic terms: as a Frankenstein’s monster, a mash-up of controls, or a “fake version” of the treated unit. For example, Haus and Meta scientists have used these kinds of analogies to frame SCM for marketers. Other descriptions present variations of the same idea, including the claim that SCM constructs a “fake control group.”

While I understand the intent behind this phrasing, presumably to avoid all the math we just went through for business people, I believe it is, or at the very least has a high potential to be, misleading to use these kinds of analogies. When we use words like “fake” or “Frankenstein”, these come off to me as loaded terms that suggest deception or haphazardly grafting together data points. However, as we have just seen, the arithmetic mean does exactly this kind of grafting: it combines data points by weighting and summing them. The national mean income over time? Takes all incomes across individuals and time points, assigns each person/market a weight, and computes the weighted sum across time. If we’re in online retail and we care about the mean basket size to show to the client later on versus some other metric, we’re taking the size of each basket we have and computing a sum where each basket is given the same weight. Yet, we do not describe the national average income as “Franken-Income,” nor the weekly mean of basket size as a “Franken-Basket.” If a client asked me to compute the average of impressions to their website across markets over time and I replied, “You mean you want me to make a Frankenstein monster of your website traffic data?” they would rightly think I had lost the plot.

Likewise, the idea of a “fake control group” obscures the fact that every unit in the donor pool is a real control unit. Synthetic control predictions, like any other kind of average, do not fabricate data at all. They merely summarize data by weighting observations according to some loss function. Actually, these descriptions become even more puzzling once we notice that the arithmetic mean already satisfies the classic SCM constraints. Recall the formula

\[ \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i, \]

for some outcome variable \(y\). We can rewrite this as a weighted sum by setting \(w_i\) equal to \(1/n \: \forall \, i\), yielding \(\sum_{i=1}^n y_i w_i\). Notice these weights are non-negative (positive, technically) and sum to one. This is precisely the defining constraint set of the original SCM. The arithmetic mean of the donor pool, therefore, is a very special case of SCM: the case where giving every control unit the same weight happens to match the treated unit optimally.
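To see this concretely, here is a minimal numpy sketch (with illustrative donor outcomes) confirming that the uniform weight vector is non-negative, sums to one, and reproduces the donor-pool mean.

```python
# Uniform weights w_i = 1/n satisfy the classic SCM constraints and give the mean.
import numpy as np

y = np.array([12.0, 15.0, 9.0, 14.0])   # outcomes for n donor (control) units
n = y.size
w = np.full(n, 1.0 / n)

print(np.all(w >= 0))            # True: non-negative weights
print(np.isclose(w.sum(), 1.0))  # True: weights sum to one
print(np.dot(w, y), y.mean())    # 12.5 and 12.5: identical
```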

Of course, this is rarely true, so control units for SCM tend to get different weights, some more than others, which I have noticed skepticism around. I think the fact that the weights are 1) explicit and 2) non-uniform makes some people uncomfortable because it suggests some kind of Decision at work. When the weights are obscured by the arithmetic mean, nobody sees the weights because they are accepted as axiom, almost. Thus, arithmetic mean weights are implicitly framed as Not a Decision, just a pure computational fact. However, with SCM we can see the weight vector. I guess it’s easier to frame these as a Decision, and the moment this happens, it becomes a thing we may attack or defend. Presumably this is also why the uniform weights (as in, say, Difference-in-Differences) are never questioned.

The point of this post (and through the rest of this chapter) is simple: every single estimator in this book uses a weighted average of controls that is derived from an optimization problem. You are never not averaging. It is the task of the econometrician to know which flavor of weights is called for, and why. In the remainder of the chapter, we generalize this idea from scalars to vectors and matrices.
