A Generalized Central Limit Theorem for Two–Sample
U–Statistics Under Dependence
1 Introduction
This document sketches (at a high level) a proof of the asymptotic normality of two–sample
U–statistics, even though the summands exhibit dependence across indices. The standard
i.i.d. Central Limit Theorem (CLT) does not apply directly because each \(X_i\) is re-used against all \(Y_j\) (and
vice versa), causing correlations. Nonetheless, there is a well-known generalized CLT (often attributed
to Hoeffding [Hoe48], detailed in [Ser80], [Vaa98], and others) that shows such U–statistics still
converge to a normal distribution when properly normalized.
2 Two–Sample U–Statistics
Let \(\{X_i\}_{i=1}^m\) be i.i.d. random variables from a distribution \(F\), and let \(\{Y_j\}_{j=1}^n\) be i.i.d. from a distribution \(G\), independent
of \(\{X_i\}\). We define a (measurable) kernel
\[ \phi (x,y) : \mathcal {X} \times \mathcal {Y} \;\to \; \mathbb {R} \]
with \(\E [\phi (X,Y)^2] < \infty \). The associated two–sample U–statistic is
\[ U_{m,n} \;=\; \frac {1}{m\,n} \sum _{i=1}^m \sum _{j=1}^n \phi \bigl (X_i, Y_j\bigr ). \]
We want to prove that for large \(m,n\), this statistic is asymptotically normal after an appropriate centering
and scaling.
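As a concrete illustration, here is a minimal Python sketch that computes \(U_{m,n}\) for the Mann–Whitney kernel \(\phi (x,y)=\mathbf {1}\{x\le y\}\) (the kernel and the Gaussian distributions are our choices, purely for illustration; any kernel with finite second moment is handled the same way):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 300
X = rng.normal(0.0, 1.0, size=m)   # i.i.d. draws from F = N(0, 1)
Y = rng.normal(0.5, 1.0, size=n)   # i.i.d. draws from G = N(0.5, 1), independent of X

# Kernel phi(x, y) = 1{x <= y} (Mann-Whitney), evaluated on all (i, j) pairs.
phi = (X[:, None] <= Y[None, :]).astype(float)   # m x n matrix of phi(X_i, Y_j)

U_mn = phi.mean()   # U_{m,n} = (1/(mn)) sum_{i,j} phi(X_i, Y_j)
print(U_mn)         # here theta = P(X <= Y) = Phi(0.5/sqrt(2)) ~ 0.638
```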
Notation and Limits
Define \(N = m+n\) as the total sample size. Often one assumes \(m,n \to \infty \) with
\[ \frac {m}{m+n} \;\to \; \alpha \in (0,1), \quad \text {so } \frac {n}{m+n} \;\to \; 1-\alpha . \]
We will show that
\[ \sqrt {m+n}\,\Bigl (U_{m,n} - \E [U_{m,n}]\Bigr ) \;\xrightarrow {\;d\;}\; \mathcal {N}\bigl (0,\sigma ^2\bigr ) \]
for some \(\sigma ^2 > 0\) that depends on \(F,G,\phi \) and on the asymptotic ratio \(\alpha \).
3 Hoeffding–Type Decomposition
A classical approach is to write a Hoeffding decomposition of the kernel \(\phi (x,y)\) in terms of its
“one–dimensional projections.” That is, we write
\[ \phi (x,y) \;=\; \theta \;+\; f(x) \;+\; g(y) \;+\; h(x,y), \]
where:
\[ \theta \;=\; \E [\phi (X,Y)], \quad f(x) \;=\; \E [\phi (x,Y)] - \theta , \quad g(y) \;=\; \E [\phi (X,y)] - \theta , \]
and
\[ h(x,y) \;=\; \phi (x,y) - \theta - f(x) - g(y). \]
By construction, \(h(x,y)\) satisfies
\[ \E [h(X,Y) \mid X=x] \;=\; 0, \quad \E [h(X,Y) \mid Y=y] \;=\; 0. \]
Hence \(h\) is degenerate: it has conditional mean \(0\) given either coordinate, and in particular mean \(0\) overall.
Then
\[ U_{m,n} \;=\; \frac {1}{mn} \sum _{i=1}^m \sum _{j=1}^n \phi (X_i,Y_j) \;=\; \theta \;+\; \frac {1}{mn}\sum _{i=1}^m \sum _{j=1}^n \Bigl [f(X_i) + g(Y_j) + h(X_i,Y_j)\Bigr ]. \]
We separate this into three sums:
\[ U_{m,n} - \theta \;=\; \frac {1}{mn}\sum _{i,j} f(X_i) \;+\; \frac {1}{mn}\sum _{i,j} g(Y_j) \;+\; \frac {1}{mn}\sum _{i,j} h(X_i,Y_j). \]
Notice that
\[ \frac {1}{mn}\sum _{i=1}^m \sum _{j=1}^n f(X_i) \;=\; \frac {1}{m}\,\sum _{i=1}^m f(X_i) \quad \text {and} \quad \frac {1}{mn}\sum _{i=1}^m \sum _{j=1}^n g(Y_j) \;=\; \frac {1}{n}\,\sum _{j=1}^n g(Y_j). \]
Hence
\[ U_{m,n} - \theta \;=\; \frac {1}{m}\,\sum _{i=1}^m f(X_i) \;+\; \frac {1}{n}\,\sum _{j=1}^n g(Y_j) \;+\; \frac {1}{mn}\sum _{i=1}^m\sum _{j=1}^n h(X_i,Y_j). \]
Define
\[ A_m \;=\; \sum _{i=1}^m f(X_i), \quad B_n \;=\; \sum _{j=1}^n g(Y_j), \quad C_{m,n} \;=\; \sum _{i=1}^m\sum _{j=1}^n h(X_i,Y_j). \]
So
\[ U_{m,n} - \theta \;=\; \frac {A_m}{m} \;+\; \frac {B_n}{n} \;+\; \frac {C_{m,n}}{mn}. \]
Centering
We also have \(\E [A_m] = m\,\E [f(X)] = 0\) by construction of \(f\). Similarly \(\E [B_n]=0\), and \(\E [C_{m,n}]=0\). Thus
\[ \E [U_{m,n}] \;=\; \theta . \]
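The decomposition is an exact algebraic identity, not an approximation, and this is easy to verify numerically. In the Gaussian/Mann–Whitney toy model above (our choice; there \(\theta \), \(f\), \(g\) have closed forms through the normal CDF):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
m, n = 200, 300
X = rng.normal(0.0, 1.0, size=m)    # F = N(0, 1)
Y = rng.normal(0.5, 1.0, size=n)    # G = N(0.5, 1)

# Closed-form projections for phi(x, y) = 1{x <= y} under this model:
theta = norm.cdf(0.5 / np.sqrt(2))          # theta = P(X <= Y)
f = (1.0 - norm.cdf(X - 0.5)) - theta       # f(X_i) = P(X_i <= Y | X_i) - theta
g = norm.cdf(Y) - theta                     # g(Y_j) = P(X <= Y_j | Y_j) - theta

phi = (X[:, None] <= Y[None, :]).astype(float)
h = phi - theta - f[:, None] - g[None, :]   # degenerate remainder h(X_i, Y_j)

# U_{m,n} - theta = A_m/m + B_n/n + C_{m,n}/(mn), exactly:
lhs = phi.mean() - theta
rhs = f.mean() + g.mean() + h.mean()
assert np.isclose(lhs, rhs)
```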
4 Proof of Asymptotic Normality: Outline
We want to show
\[ \sqrt {m+n} \, \Bigl (U_{m,n} - \theta \Bigr ) \;=\; \sqrt {m+n} \, \Bigl ( \frac {A_m}{m} + \frac {B_n}{n} + \frac {C_{m,n}}{mn} \Bigr ) \;\xrightarrow {\;d\;}\; \mathcal {N}(0,\sigma ^2). \]
Below is a sketch of how each term behaves and how one applies classical CLT arguments:
Term 1: \(\sqrt {m+n}\,\tfrac {A_m}{m}\)
Recall \(A_m = \sum _{i=1}^m f(X_i)\) where \(\{X_i\}\) are i.i.d. from \(F\). Then
\[ \E [f(X_i)] = 0, \quad \Var [f(X_i)] =: \sigma _f^2 < \infty . \]
A classical CLT for i.i.d. sequences shows
\[ \frac {A_m}{\sqrt {m}} \;=\; \frac {1}{\sqrt {m}} \sum _{i=1}^m f(X_i) \;\xrightarrow {\;d\;}\; \mathcal {N}(0,\,\sigma _f^2). \]
But we are multiplying by \(\sqrt {m+n}\) and dividing by \(m\). Observe that
\[ \sqrt {m+n}\,\frac {A_m}{m} \;=\; \frac {\sqrt {m+n}}{m}\,\sum _{i=1}^m f(X_i) \;=\; \sqrt {\frac {m+n}{m}} \,\frac {A_m}{\sqrt {m}}. \]
If \(m/(m+n)\to \alpha \in (0,1)\), then \(\frac {m+n}{m} \to \frac {1}{\alpha }\). Thus
\[ \sqrt {\frac {m+n}{m}} \;\to \; \frac {1}{\sqrt {\alpha }}. \]
Hence
\[ \sqrt {m+n}\,\frac {A_m}{m} \;=\; \left (\frac {1}{\sqrt {\alpha }} + o(1)\right )\,\frac {A_m}{\sqrt {m}}. \]
So in the limit, by Slutsky’s theorem, this term converges in distribution to \(\mathcal {N}\bigl (0,\,\sigma _f^2/\alpha \bigr )\).
Term 2: \(\sqrt {m+n}\,\tfrac {B_n}{n}\)
An exactly analogous argument applies to \(B_n = \sum _{j=1}^n g(Y_j)\) with \(\Var [g(Y_j)] =: \sigma _g^2\), so
\[ \sqrt {m+n}\,\frac {B_n}{n} \;=\; \sqrt {\frac {m+n}{n}} \, \frac {B_n}{\sqrt {n}} \;\xrightarrow {\;d\;}\; \mathcal {N}\Bigl (0,\;\frac {\sigma _g^2}{1-\alpha }\Bigr ). \]
Term 3: \(\sqrt {m+n}\,\tfrac {C_{m,n}}{mn}\)
Recall
\[ C_{m,n} \;=\; \sum _{i=1}^m\sum _{j=1}^n h(X_i,Y_j), \quad \E [C_{m,n}] = 0, \]
where \(h(x,y)\) satisfies \(\E [h(X,Y)\mid X]=0\) and \(\E [h(X,Y)\mid Y]=0\). Intuitively, \(h(X_i,Y_j)\) is a zero-mean “interaction part,” and this degeneracy makes the terms uncorrelated across all distinct pairs \((i,j)\neq (i',j')\), even those sharing an \(X_i\) or a \(Y_j\). The literature on two–sample U–statistics (see [Ser80], [Vaa98, Chapter 12]) shows that
\[ \Var \Bigl (\sum _{i,j} h(X_i,Y_j)\Bigr ) \;=\; O\bigl (m\,n\bigr ): \]
the partial overlaps do not inflate the order beyond \(mn\) (they do not produce an \((mn)^2\) term).
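To see why no \((mn)^2\) term can arise, fix \(i\) and \(j\neq j'\) and condition on \(X_i\): since \(Y_j\) and \(Y_{j'}\) are independent of each other and of \(X_i\),
\begin{align*} \Cov \bigl (h(X_i,Y_j),\,h(X_i,Y_{j'})\bigr ) &= \E \Bigl [\E \bigl [h(X_i,Y_j)\,h(X_i,Y_{j'}) \mid X_i\bigr ]\Bigr ] \\ &= \E \Bigl [\E \bigl [h(X_i,Y_j)\mid X_i\bigr ]\;\E \bigl [h(X_i,Y_{j'})\mid X_i\bigr ]\Bigr ] \;=\; 0, \end{align*}
using \(\E [h\mid X]=0\). The case \(i\neq i'\), \(j=j'\) is symmetric (condition on \(Y_j\)), and fully disjoint pairs are independent. Hence only the \(mn\) diagonal terms contribute, and in fact \(\Var (C_{m,n}) = mn\,\Var \bigl (h(X,Y)\bigr )\) exactly.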
It follows that
\[ \Var \!\Bigl (\frac {C_{m,n}}{mn}\Bigr ) \;=\; \frac {1}{(mn)^2}\,\Var \!\Bigl (\sum _{i,j} h(X_i,Y_j)\Bigr ) \;=\; O\Bigl (\frac {m\,n}{(mn)^2}\Bigr ) \;=\; O\Bigl (\frac {1}{mn}\Bigr ). \]
Hence
\[ \sqrt {m+n}\,\frac {C_{m,n}}{mn} \;=\; O_p\Bigl (\sqrt {m+n}\,\frac {1}{\sqrt {mn}}\Bigr ) \;=\; O_p\Bigl (\sqrt {\frac {m+n}{mn}}\Bigr ). \]
Since \(mn/(m+n) = \alpha (1-\alpha )(m+n)\,(1+o(1)) \to \infty \) under the proportional-sampling assumption, we see that
\[ \sqrt {\frac {m+n}{mn}} \;=\; \sqrt {\frac {1}{m} + \frac {1}{n}} \;\to \; 0. \]
Thus \(\sqrt {m+n}\,\frac {C_{m,n}}{mn} \xrightarrow {p} 0\), meaning it is negligible in the \(\sqrt {m+n}\) scaling.
Combining the Three Terms
Summarizing:
\[ \sqrt {m+n}\,\Bigl (U_{m,n} - \theta \Bigr ) \;=\; \underbrace {\sqrt {m+n}\,\frac {A_m}{m}}_{\text {CLT } \to \, \mathcal {N}(0,\,\sigma _f^2/\alpha )} \;+\; \underbrace {\sqrt {m+n}\,\frac {B_n}{n}}_{\text {CLT } \to \, \mathcal {N}(0,\,\sigma _g^2/(1-\alpha ))} \;+\; \underbrace {\sqrt {m+n}\,\frac {C_{m,n}}{mn}}_{\xrightarrow {p} 0}. \]
By Slutsky’s theorem (and properties of sums of asymptotically normal terms), we get an overall
normal limit whose variance is \(\sigma _f^2/\alpha + \sigma _g^2/(1-\alpha )\). Here,
\[ \sigma _f^2 \;=\; \Var \!\bigl (f(X)\bigr ), \quad \sigma _g^2 \;=\; \Var \!\bigl (g(Y)\bigr ), \]
and \( f(x) = \E [\phi (x,Y)]-\theta , \; g(y) = \E [\phi (X,y)]-\theta . \)
That yields the main theorem:
\[ \sqrt {m+n}\,\bigl (U_{m,n} - \theta \bigr ) \;\xrightarrow {d}\; \mathcal {N}\Bigl (0,\;\sigma ^2\Bigr ) \quad \text {where}\quad \sigma ^2 \;=\; \frac {\Var [f(X)]}{\alpha } \;+\; \frac {\Var [g(Y)]}{1-\alpha }. \]
(There are alternative expressions involving \(\xi _{10}\), \(\xi _{01}\) from the “four-case” breakdown of Section 5.2. They coincide with \(\sigma _f^2,\sigma _g^2\) under the definitions above, since \(\Var [f(X)] = \Var \bigl (\E [\phi \mid X]\bigr )\) and \(\Var [g(Y)] = \Var \bigl (\E [\phi \mid Y]\bigr )\); see [Ser80; Vaa98] for details.)
Remark 1 (Multiple or vector-valued kernels). The above argument generalizes to the case \(\phi : \mathcal {X}\times \mathcal {Y} \to \mathbb {R}^d\), so
that \((U_{m,n}-\theta )\) is in \(\mathbb {R}^d\). Then \(f\) and \(g\) also take values in \(\mathbb {R}^d\). The same decomposition and variance considerations
apply (just use the multivariate CLT).
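As an empirical illustration of the theorem, the following Monte Carlo sketch (again in the Gaussian/Mann–Whitney toy model; the sample sizes and replicate count are arbitrary choices) standardizes \(U_{m,n}\) by the population \(\sigma ^2\) and checks that the result is approximately standard normal:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
m, n = 200, 300
alpha = m / (m + n)
theta = norm.cdf(0.5 / np.sqrt(2))   # P(X <= Y) for X ~ N(0,1), Y ~ N(0.5,1)

# Population projection variances sigma_f^2, sigma_g^2 by 1-d quadrature.
xs = np.linspace(-8.0, 8.0, 20001)
w = norm.pdf(xs) * (xs[1] - xs[0])                                 # N(0,1) weights
sigma_f2 = np.sum(((1.0 - norm.cdf(xs - 0.5)) - theta) ** 2 * w)   # Var f(X)
sigma_g2 = np.sum((norm.cdf(xs + 0.5) - theta) ** 2 * w)           # Var g(Y), via Y = X + 0.5
sigma2 = sigma_f2 / alpha + sigma_g2 / (1 - alpha)

reps = 2000
Z = np.empty(reps)
for r in range(reps):
    X = rng.normal(0.0, 1.0, size=m)
    Y = rng.normal(0.5, 1.0, size=n)
    U = (X[:, None] <= Y[None, :]).mean()
    Z[r] = np.sqrt(m + n) * (U - theta) / np.sqrt(sigma2)

print(Z.mean(), Z.std())   # should be close to 0 and 1, respectively
```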
5 Variance Estimators
5.1 Projection-Based (Hoeffding) Variance Estimator
For a real-valued (one-dimensional) kernel, it is straightforward to estimate the variance \(\sigma ^2\) of the limiting normal distribution by replacing \(\theta \), \(\E [\phi (x,Y)]\), and \(\E [\phi (X,y)]\) with their empirical versions.
- Estimate \(\theta \): A consistent estimate is the U–statistic itself:
\[ \hat {\theta } \;=\; U_{m,n} \;=\; \frac {1}{mn}\,\sum _{i=1}^m \sum _{j=1}^n \phi (X_i,Y_j). \]
- Estimate \(f(x)\): For each observed \(X_i\), the true value is \(f(X_i) = \E [\phi (X_i, Y)] - \theta \). We approximate \(\E [\phi (X_i, Y)]\) by averaging over the \(Y_j\)’s:
\[ \hat {f}_i \;=\; \frac {1}{n}\,\sum _{j=1}^n \phi (X_i, Y_j) \;-\; \hat {\theta }. \]
- Estimate \(g(y)\): By symmetry, for each \(Y_j\),
\[ \hat {g}_j \;=\; \frac {1}{m}\,\sum _{i=1}^m \phi (X_i, Y_j) \;-\; \hat {\theta }. \]
Sample-based estimates of \(\Var [f(X)]\) and \(\Var [g(Y)]\) are given by
\[ \widehat {\Var }\bigl [f(X)\bigr ] \;=\; \frac {1}{m}\,\sum _{i=1}^m \bigl (\hat {f}_i\bigr )^2 \qquad \text {and} \qquad \widehat {\Var }\bigl [g(Y)\bigr ] \;=\; \frac {1}{n}\,\sum _{j=1}^n \bigl (\hat {g}_j\bigr )^2. \]
(One might prefer \(m-1\) or \(n-1\) in the denominators for unbiasedness, but for large \(m,n\) the difference is negligible.)
Putting it all together, if
\[ \alpha _{m,n} \;=\; \frac {m}{m+n} \to \alpha \in (0,1), \]
a consistent estimator for the asymptotic variance \(\sigma ^2\) is
\[ \widehat {\sigma }^2 \;=\; \frac {\widehat {\Var }[f(X)]}{\,\alpha _{m,n}\,} \;+\; \frac {\widehat {\Var }[g(Y)]}{\,1-\alpha _{m,n}\,}. \]
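A direct implementation of this estimator (a sketch; the function name and the convention of passing the precomputed \(m\times n\) kernel matrix are our choices):

```python
import numpy as np

def projection_variance(phi_matrix):
    """Hoeffding projection estimate of the asymptotic variance sigma^2 of
    sqrt(m+n) * (U_mn - theta), given phi_matrix[i, j] = phi(X_i, Y_j)."""
    m, n = phi_matrix.shape
    theta_hat = phi_matrix.mean()                  # hat{theta} = U_{m,n}
    f_hat = phi_matrix.mean(axis=1) - theta_hat    # hat{f}_i, averaging over the Y_j
    g_hat = phi_matrix.mean(axis=0) - theta_hat    # hat{g}_j, averaging over the X_i
    var_f = np.mean(f_hat ** 2)                    # widehat{Var}[f(X)]
    var_g = np.mean(g_hat ** 2)                    # widehat{Var}[g(Y)]
    alpha_mn = m / (m + n)
    return var_f / alpha_mn + var_g / (1.0 - alpha_mn)
```

Dividing the result by \(m+n\) then gives a plug-in estimate of \(\Var (U_{m,n})\) itself.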
5.2 Direct Covariance Decomposition Estimator
An alternative way to arrive at a variance estimator is to “unpack” the variance of the U-statistic by
writing
\[ U_{m,n} = \frac {1}{mn} \sum _{i=1}^m \sum _{j=1}^n \phi (X_i,Y_j) \]
and then writing
\[ \Var (U_{m,n}) = \frac {1}{m^2n^2}\sum _{i,j}\sum _{i',j'} \Cov \Bigl (\phi (X_i,Y_j),\phi (X_{i'},Y_{j'})\Bigr ). \]
Because the two samples are independent, we have four distinct cases for the indices:
- Case 1: \(i=i'\) and \(j=j'\). There are \(mn\) such terms, each contributing
\[ \Var \bigl (\phi (X_i,Y_j)\bigr ). \]
- Case 2: \(i=i'\) and \(j\neq j'\). There are \(m \, n(n-1)\) such terms. For a fixed \(i\), conditioning on \(X_i\) and using the law of total expectation,
\begin{align*} \Cov \Bigl (\phi (X_i,Y_j),\phi (X_i,Y_{j'})\Bigr ) &= \E \Bigl [\phi (X_i,Y_j)\phi (X_i,Y_{j'})\Bigr ] - \theta ^2 \\ &= \E \Bigl [ \bigl (\E [\phi (X_i,Y_j) \mid X_i]\bigr )^2\Bigr ] - \theta ^2 \\ &= \sigma _1^2, \end{align*}
where \(\sigma _1^2 = \Var \bigl (\E [\phi (X,Y) \mid X]\bigr )\).
- Case 3: \(i\neq i'\) and \(j=j'\). Similarly, there are \(n \, m(m-1)\) such terms, each with covariance
\[ \Cov \Bigl (\phi (X_i,Y_j),\phi (X_{i'},Y_j)\Bigr ) = \sigma _2^2, \]
where \(\sigma _2^2 = \Var \bigl (\E [\phi (X,Y) \mid Y]\bigr )\).
- Case 4: \(i\neq i'\) and \(j\neq j'\). In this case the pairs \((X_i,Y_j)\) and \((X_{i'},Y_{j'})\) are independent, so the covariance is zero.
Thus, in population terms we have
\[ \Var (U_{m,n}) = \frac {1}{m^2 n^2}\Bigl \{\, m\,n\Var \bigl (\phi (X,Y)\bigr ) + m\,n(n-1)\,\sigma _1^2 + n\,m(m-1)\,\sigma _2^2 \Bigr \}. \]
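This population formula can be checked by simulation. In the Gaussian/Mann–Whitney toy model, \(\Var \bigl (\phi (X,Y)\bigr ) = \theta (1-\theta )\) (an indicator kernel is Bernoulli) and \(\sigma _1^2, \sigma _2^2\) reduce to one-dimensional integrals. A sketch (the setup is ours):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
m, n = 50, 80
theta = norm.cdf(0.5 / np.sqrt(2))       # P(X <= Y) for X ~ N(0,1), Y ~ N(0.5,1)

# sigma_1^2 = Var(E[phi|X]) and sigma_2^2 = Var(E[phi|Y]) by quadrature.
xs = np.linspace(-8.0, 8.0, 20001)
w = norm.pdf(xs) * (xs[1] - xs[0])
s1 = np.sum(((1.0 - norm.cdf(xs - 0.5)) - theta) ** 2 * w)
s2 = np.sum((norm.cdf(xs + 0.5) - theta) ** 2 * w)
var_phi = theta * (1.0 - theta)          # indicator kernel => Bernoulli variance

var_formula = (m * n * var_phi + m * n * (n - 1) * s1
               + n * m * (m - 1) * s2) / (m ** 2 * n ** 2)

U = [(rng.normal(0.0, 1.0, m)[:, None] <= rng.normal(0.5, 1.0, n)[None, :]).mean()
     for _ in range(20000)]
print(var_formula, np.var(U))            # the two values should nearly agree
```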
A natural plug-in estimator replaces the population quantities by their sample analogues, centering each kernel value at the overall mean \(U_{m,n}\) (defined above) and forming the following estimates:
- For the variance term (Case 1):
\[ \widehat {V} = \frac {1}{mn-1}\sum _{i=1}^m\sum _{j=1}^n \Bigl [\phi (X_i,Y_j)-U_{m,n}\Bigr ]^2. \]
- For the covariance among terms sharing the same \(X_i\) (Case 2):
\[ \widehat {C}_1 = \frac {1}{m\,n(n-1)} \sum _{i=1}^m \sum _{j=1}^n \sum _{\substack {j'=1 \\ j'\neq j}}^n \Bigl [\phi (X_i,Y_j)-U_{m,n}\Bigr ] \Bigl [\phi (X_i,Y_{j'})-U_{m,n}\Bigr ]. \]
- For the covariance among terms sharing the same \(Y_j\) (Case 3):
\[ \widehat {C}_2 = \frac {1}{n\,m(m-1)} \sum _{j=1}^n \sum _{i=1}^m \sum _{\substack {i'=1 \\ i'\neq i}}^m \Bigl [\phi (X_i,Y_j)-U_{m,n}\Bigr ] \Bigl [\phi (X_{i'},Y_j)-U_{m,n}\Bigr ]. \]
Then a direct plug-in estimator for the variance of \(U_{m,n}\) is given by
\[ \widehat {\Var }(U_{m,n}) \;=\; \frac {1}{m^2n^2}\Bigl \{\, mn\,\widehat {V} \;+\; m\,n(n-1)\,\widehat {C}_1 \;+\; n\,m(m-1)\,\widehat {C}_2 \Bigr \}. \]
Equivalently, after canceling a factor of \(mn\) in the numerator,
\[ \widehat {\Var }(U_{m,n}) \;=\; \frac {\widehat {V}}{mn} \;+\; \frac {n-1}{mn}\,\widehat {C}_1 \;+\; \frac {m-1}{mn}\,\widehat {C}_2. \]
This estimator directly reflects the contribution to the variance from the four covariance
configurations in the double sum, and under standard conditions it provides a consistent estimate of
\(\Var (U_{m,n})\).
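Computationally, the triple sums in \(\widehat {C}_1\) and \(\widehat {C}_2\) need not be evaluated term by term: by the row- and column-sum identities derived in Appendix A, everything reduces to \(O(mn)\) work. A sketch (names ours):

```python
import numpy as np

def direct_variance(phi_matrix):
    """Plug-in estimate of Var(U_mn) from the four-case covariance
    decomposition, given phi_matrix[i, j] = phi(X_i, Y_j)."""
    m, n = phi_matrix.shape
    C = phi_matrix - phi_matrix.mean()     # centered kernel c_{ij}
    sq = np.sum(C ** 2)
    S_row = C.sum(axis=1)                  # row sums S_{i.}
    S_col = C.sum(axis=0)                  # column sums S_{.j}
    V_hat = sq / (m * n - 1)                                  # Case 1
    C1_hat = (np.sum(S_row ** 2) - sq) / (m * n * (n - 1))    # Case 2
    C2_hat = (np.sum(S_col ** 2) - sq) / (n * m * (m - 1))    # Case 3 (Case 4 vanishes)
    return (V_hat / (m * n)
            + (n - 1) / (m * n) * C1_hat
            + (m - 1) / (m * n) * C2_hat)
```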
Matrix Representation
For compactness and ease of computation, one can arrange the centered kernels into an \(m\times n\) matrix
\[ C = \Bigl [\phi (X_i,Y_j)-U_{m,n}\Bigr ]_{i=1,\dots ,m; \; j=1,\dots ,n}. \]
Then, introducing the \(m\times m\) and \(n\times n\) matrices of ones, \(J_m\) and \(J_n\), a little matrix algebra (carried out in Appendix A) shows that the variance estimator is equivalent to
\[ \widehat {\Var }(U_{m,n}) = \frac {1}{m^2n^2}\trace \Bigl (C\bigl [J_m C + C J_n - C\bigr ]^\top \Bigr ). \]
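In code, the trace formula is essentially a one-liner (a sketch; as noted in Appendix A, exact agreement with the three-term estimator requires the \(1/(mn)\) normalization of \(\widehat {V}\)):

```python
import numpy as np

def trace_variance(phi_matrix):
    """Matrix-form variance estimator via the trace identity."""
    m, n = phi_matrix.shape
    C = phi_matrix - phi_matrix.mean()          # centered kernel matrix
    Jm, Jn = np.ones((m, m)), np.ones((n, n))
    return np.trace(C @ (Jm @ C + C @ Jn - C).T) / (m ** 2 * n ** 2)
```

For large samples one would avoid forming \(J_m\) and \(J_n\) explicitly and use the row/column-sum form from Appendix A, which needs only \(O(mn)\) time and \(O(m+n)\) extra storage.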
References
[Hoe48] Wassily Hoeffding. “A Class of Statistics with Asymptotically Normal Distribution”. The Annals of Mathematical Statistics 19.3 (1948), pp. 293–325. doi: 10.1214/aoms/1177730196.

[Ser80] Robert J. Serfling. Approximation Theorems of Mathematical Statistics. Wiley Series in Probability and Mathematical Statistics. Hoboken: John Wiley & Sons, 1980. doi: 10.1002/9780470316481.

[Vaa98] A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge: Cambridge University Press, 1998. doi: 10.1017/CBO9780511802256.
A Matrix Representation of the Variance
We start by writing
\[ c_{ij} \;=\; \phi (X_i,Y_j)-U_{m,n}, \]
and let
\[ C = \bigl [c_{ij}\bigr ]_{i=1,\dots ,m; \,j=1,\dots ,n}. \]
Also, define the row- and column–sums as
\[ S_{i\cdot } \;=\; \sum _{j=1}^n c_{ij} \quad \text {and}\quad S_{\cdot j} \;=\; \sum _{i=1}^m c_{ij}. \]
Expressing the Trace
Notice that
\[ \trace \Bigl ( C\bigl [J_m C + C J_n - C\bigr ]^\top \Bigr ) =\sum _{i=1}^m\sum _{j=1}^n c_{ij}\,\Bigl [(J_m C)_{ij}+(C J_n)_{ij}-c_{ij}\Bigr ]. \]
Because \(J_m\) is the \(m\times m\) matrix of ones and \(J_n\) is the \(n\times n\) matrix of ones, we have
\[ (J_m C)_{ij} = \sum _{k=1}^m c_{kj} \;=\; S_{\cdot j} \]
and
\[ (C J_n)_{ij} = \sum _{\ell =1}^n c_{i\ell } \;=\; S_{i\cdot }. \]
Thus, the trace becomes
\begin{align*} \trace \Bigl (C\bigl [J_m C + C J_n - C\bigr ]^\top \Bigr ) &= \sum _{i=1}^m \sum _{j=1}^n c_{ij} \Bigl [S_{\cdot j} + S_{i\cdot } - c_{ij}\Bigr ] \\ &= \sum _{j=1}^n \Bigl (\sum _{i=1}^m c_{ij}\Bigr )^2 +\sum _{i=1}^m \Bigl (\sum _{j=1}^n c_{ij}\Bigr )^2 -\sum _{i=1}^m\sum _{j=1}^n c_{ij}^2 \\ &= \sum _{j=1}^n S_{\cdot j}^2 + \sum _{i=1}^m S_{i\cdot }^2 - \sum _{i=1}^m \sum _{j=1}^n c_{ij}^2. \end{align*}
Rewriting the Variance Estimator
The variance estimator is given by
\[ \widehat {\Var }(U_{m,n}) \;=\; \frac {\widehat {V}}{mn} \;+\; \frac {n-1}{mn}\, \widehat {C}_1 \;+\; \frac {m-1}{mn}\,\widehat {C}_2, \]
with
\[ \widehat {V} \;=\; \frac {1}{mn-1} \sum _{i,j} c_{ij}^2. \]
To see the connection with the trace formula, we re-express the covariance terms. A short calculation, expanded below, shows that
\[ \widehat {C}_1 \;=\; \frac {1}{m\,n(n-1)}\Biggl [\sum _{i=1}^m S_{i\cdot }^2 - \sum _{i,j} c_{ij}^2\Biggr ] \]
and
\[ \widehat {C}_2 \;=\; \frac {1}{n\,m(m-1)}\Biggl [\sum _{j=1}^n S_{\cdot j}^2 - \sum _{i,j} c_{ij}^2\Biggr ]. \]
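Both identities follow by expanding the square of a row (or column) sum. For the first,
\[ \sum _{i=1}^m \sum _{j=1}^n \sum _{\substack {j'=1 \\ j'\neq j}}^n c_{ij}\,c_{ij'} \;=\; \sum _{i=1}^m \Biggl [\Bigl (\sum _{j=1}^n c_{ij}\Bigr )^2 - \sum _{j=1}^n c_{ij}^2\Biggr ] \;=\; \sum _{i=1}^m S_{i\cdot }^2 \;-\; \sum _{i,j} c_{ij}^2 , \]
and the second is obtained symmetrically from the column sums \(S_{\cdot j}\).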
Substituting these into the variance estimator, all three terms combine over the common denominator \(m^2n^2\): the \(\widehat {C}_1\) and \(\widehat {C}_2\) contributions are exact, and the first term equals \(\frac {1}{m^2n^2}\sum _{i,j} c_{ij}^2\) once \(\widehat {V}\) is taken with the \(1/(mn)\) normalization (the \(1/(mn-1)\) version differs only by the asymptotically negligible factor \(\tfrac {mn}{mn-1}\)). One finds
\[ \widehat {\Var }(U_{m,n}) \;=\; \frac {1}{m^2n^2} \Bigl [\sum _{i=1}^m S_{i\cdot }^2 + \sum _{j=1}^n S_{\cdot j}^2 - \sum _{i,j} c_{ij}^2\Bigr ]. \]
But the right-hand side is exactly the trace expression we obtained above. Hence,
\[ \widehat {\Var }(U_{m,n}) \;=\; \frac {1}{m^2n^2} \trace \Bigl (C \bigl [J_m C + C J_n - C\bigr ]^\top \Bigr ). \]