A Generalized Central Limit Theorem for Two–Sample U–Statistics Under Dependence


1 Introduction

This document sketches, at a high level, a proof that two–sample U–statistics are asymptotically normal even though their summands exhibit dependence across indices. The standard i.i.d. Central Limit Theorem (CLT) does not apply directly because each \(X_i\) is re-used against all \(Y_j\) (and vice versa), causing correlations. Nonetheless, there is a well-known generalized CLT (often attributed to Hoeffding [Hoe48], detailed in [Ser80], [Vaa98], and others) showing that such U–statistics still converge to a normal distribution when properly normalized.

2 Two–Sample U–Statistics

Let \(\{X_i\}_{i=1}^m\) be i.i.d. random variables from a distribution \(F\), and let \(\{Y_j\}_{j=1}^n\) be i.i.d. from a distribution \(G\), independent of \(\{X_i\}\). We define a (measurable) kernel

\[ \phi (x,y) : \mathcal {X} \times \mathcal {Y} \;\to \; \mathbb {R} \]

with finite second moments. The associated two–sample U–statistic is

\[ U_{m,n} \;=\; \frac {1}{m\,n} \sum _{i=1}^m \sum _{j=1}^n \phi \bigl (X_i, Y_j\bigr ). \]
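As a concrete illustration, the statistic is just the average of the kernel over all \(mn\) pairs. A minimal sketch (the kernel \(\phi(x,y)=|x-y|\) is an arbitrary choice for demonstration):

```python
import numpy as np

def u_statistic(x, y, kernel):
    """Two-sample U-statistic: average of kernel(X_i, Y_j) over all m*n pairs."""
    # Broadcasting evaluates the kernel on the full m x n grid of (X_i, Y_j) pairs.
    phi = kernel(x[:, None], y[None, :])
    return phi.mean()

x = np.array([1.0, 2.0, 4.0])
y = np.array([0.0, 3.0])
# Arbitrary illustrative kernel: phi(x, y) = |x - y|
u = u_statistic(x, y, lambda a, b: np.abs(a - b))   # (1+2+2+1+4+1)/6 = 11/6
```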

We want to prove that for large \(m,n\), this statistic is asymptotically normal after an appropriate centering and scaling.

Notation and Limits

Define \(N = m+n\) as the total sample size. Often one assumes \(m,n \to \infty \) with

\[ \frac {m}{m+n} \;\to \; \alpha \in (0,1), \quad \text {so } \frac {n}{m+n} \;\to \; 1-\alpha . \]

We will show that

\[ \sqrt {m+n}\,\Bigl (U_{m,n} - \E [U_{m,n}]\Bigr ) \;\xrightarrow {\;d\;}\; \mathcal {N}\bigl (0,\sigma ^2\bigr ) \]

for some \(\sigma ^2 > 0\) that depends on \(F,G,\phi \) and on the asymptotic ratio \(\alpha \).

3 Hoeffding–Type Decomposition

A classical approach is to write a Hoeffding decomposition of the kernel \(\phi (x,y)\) in terms of its “one–dimensional projections.” That is, we write

\[ \phi (x,y) \;=\; \theta \;+\; f(x) \;+\; g(y) \;+\; h(x,y), \]

where:

\[ \theta \;=\; \E [\phi (X,Y)], \quad f(x) \;=\; \E [\phi (x,Y)] - \theta , \quad g(y) \;=\; \E [\phi (X,y)] - \theta , \]

and

\[ h(x,y) \;=\; \phi (x,y) - \theta - f(x) - g(y). \]

By construction, \(h(x,y)\) satisfies

\[ \E [h(X,Y)\,\bigl |\bigr .\,X=x] \;=\; 0, \quad \E [h(X,Y)\,\bigl |\bigr .\,Y=y] \;=\; 0. \]

Hence \(h\) has conditional mean \(0\) given either argument, and in particular \(\E [h(X,Y)] = 0\).
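This defining property carries over exactly to the empirical analogue of the decomposition: replacing the expectations by sample means, the resulting \(\hat h_{ij}\) has every row mean and every column mean equal to zero. A minimal numerical sketch (the kernel \(\phi(x,y)=\min(x,y)\) is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)
y = rng.normal(size=7)

phi = np.minimum(x[:, None], y[None, :])   # arbitrary illustrative kernel
theta_hat = phi.mean()                     # empirical analogue of theta
f_hat = phi.mean(axis=1) - theta_hat       # empirical projection f(X_i)
g_hat = phi.mean(axis=0) - theta_hat       # empirical projection g(Y_j)
h_hat = phi - theta_hat - f_hat[:, None] - g_hat[None, :]

# Every row mean and every column mean of h_hat vanishes (up to rounding).
print(np.allclose(h_hat.mean(axis=1), 0), np.allclose(h_hat.mean(axis=0), 0))
```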

U–Statistic in Decomposed Form

Then

\[ U_{m,n} \;=\; \frac {1}{mn} \sum _{i=1}^m \sum _{j=1}^n \phi (X_i,Y_j) \;=\; \theta \;+\; \frac {1}{mn}\sum _{i=1}^m \sum _{j=1}^n \Bigl [f(X_i) + g(Y_j) + h(X_i,Y_j)\Bigr ]. \]

We separate this into three sums:

\[ U_{m,n} - \theta \;=\; \frac {1}{mn}\sum _{i,j} f(X_i) \;+\; \frac {1}{mn}\sum _{i,j} g(Y_j) \;+\; \frac {1}{mn}\sum _{i,j} h(X_i,Y_j). \]

Notice that

\[ \frac {1}{mn}\sum _{i=1}^m \sum _{j=1}^n f(X_i) \;=\; \frac {1}{m}\,\sum _{i=1}^m f(X_i) \quad \text {and} \quad \frac {1}{mn}\sum _{i=1}^m \sum _{j=1}^n g(Y_j) \;=\; \frac {1}{n}\,\sum _{j=1}^n g(Y_j). \]

Hence

\[ U_{m,n} - \theta \;=\; \frac {1}{m}\,\sum _{i=1}^m f(X_i) \;+\; \frac {1}{n}\,\sum _{j=1}^n g(Y_j) \;+\; \frac {1}{mn}\sum _{i=1}^m\sum _{j=1}^n h(X_i,Y_j). \]

Define

\[ A_m \;=\; \sum _{i=1}^m \bigl [f(X_i)\bigr ], \quad B_n \;=\; \sum _{j=1}^n \bigl [g(Y_j)\bigr ], \quad C_{m,n} \;=\; \sum _{i=1}^m\sum _{j=1}^n h(X_i,Y_j). \]

So

\[ U_{m,n} - \theta \;=\; \frac {A_m}{m} \;+\; \frac {B_n}{n} \;+\; \frac {C_{m,n}}{mn}. \]
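This identity is purely algebraic and holds for every realization, which can be verified numerically with discrete distributions whose population quantities \(\theta, f, g, h\) are computable in closed form. A sketch (the pmfs and the kernel are arbitrary illustrative choices):

```python
import numpy as np

# Discrete supports and pmfs (arbitrary illustrative choices).
xs, px = np.array([0.0, 1.0]), np.array([0.4, 0.6])
ys, py = np.array([0.0, 1.0, 2.0]), np.array([0.2, 0.5, 0.3])
phi = lambda a, b: (a + 1.0) * b**2        # arbitrary kernel

tbl = phi(xs[:, None], ys[None, :])        # kernel on the support grid
theta = px @ tbl @ py                      # theta = E[phi(X, Y)]
f = tbl @ py - theta                       # f(x) = E[phi(x, Y)] - theta
g = px @ tbl - theta                       # g(y) = E[phi(X, y)] - theta
h = tbl - theta - f[:, None] - g[None, :]  # degenerate remainder kernel

# A fixed sample, given as indices into the supports.
xi = np.array([0, 1, 1, 0])                # sample of X's (m = 4)
yj = np.array([2, 0, 1])                   # sample of Y's (n = 3)
m, n = len(xi), len(yj)

U = phi(xs[xi][:, None], ys[yj][None, :]).mean()
A, B = f[xi].sum(), g[yj].sum()
C = h[np.ix_(xi, yj)].sum()
print(np.isclose(U - theta, A / m + B / n + C / (m * n)))  # identity holds exactly
```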

Centering

We also have \(\E [A_m] = m\,\E [f(X)] = 0\) by construction of \(f\). Similarly \(\E [B_n]=0\), and \(\E [C_{m,n}]=0\). Thus

\[ \E [U_{m,n}] \;=\; \theta . \]

4 Proof of Asymptotic Normality: Outline

We want to show

\[ \sqrt {m+n} \, \Bigl (U_{m,n} - \theta \Bigr ) \;=\; \sqrt {m+n} \, \Bigl ( \frac {A_m}{m} + \frac {B_n}{n} + \frac {C_{m,n}}{mn} \Bigr ) \;\xrightarrow {\;d\;}\; \mathcal {N}(0,\sigma ^2). \]

Below is a sketch of how each term behaves and how one applies classical CLT arguments:

Term 1: \(\sqrt {m+n}\,\tfrac {A_m}{m}\)

Recall \(A_m = \sum _{i=1}^m f(X_i)\) where \(\{X_i\}\) are i.i.d. from \(F\). Then

\[ \E [f(X_i)] = 0, \quad \Var [f(X_i)] =: \sigma _f^2 < \infty . \]

A classical CLT for i.i.d. sequences shows

\[ \frac {A_m}{\sqrt {m}} \;=\; \frac {1}{\sqrt {m}} \sum _{i=1}^m f(X_i) \;\xrightarrow {\;d\;}\; \mathcal {N}(0,\,\sigma _f^2). \]

But we are multiplying by \(\sqrt {m+n}\) and dividing by \(m\). Observe that

\[ \sqrt {m+n}\,\frac {A_m}{m} \;=\; \frac {\sqrt {m+n}}{m}\,\sum _{i=1}^m f(X_i) \;=\; \sqrt {\frac {m+n}{m}} \,\frac {A_m}{\sqrt {m}}. \]

If \(m/(m+n)\to \alpha \in (0,1)\), then \(\frac {m+n}{m} \to \frac {1}{\alpha }\). Thus

\[ \sqrt {\frac {m+n}{m}} \;\to \; \frac {1}{\sqrt {\alpha }}. \]

Hence

\[ \sqrt {m+n}\,\frac {A_m}{m} \;=\; \left (\frac {1}{\sqrt {\alpha }} + o(1)\right )\,\frac {A_m}{\sqrt {m}}. \]

So in the limit, that term converges in distribution to \(\mathcal {N}\bigl (0,\,\sigma _f^2/\alpha \bigr )\) (by Slutsky’s theorem).

Term 2: \(\sqrt {m+n}\,\tfrac {B_n}{n}\)

An exactly analogous argument applies to \(B_n = \sum _{j=1}^n g(Y_j)\) with \(\Var [g(Y_j)] =: \sigma _g^2\), so

\[ \sqrt {m+n}\,\frac {B_n}{n} \;=\; \sqrt {\frac {m+n}{n}} \, \frac {B_n}{\sqrt {n}} \;\xrightarrow {\;d\;}\; \mathcal {N}\Bigl (0,\;\frac {\sigma _g^2}{1-\alpha }\Bigr ). \]

Term 3: \(\sqrt {m+n}\,\tfrac {C_{m,n}}{mn}\)

Recall

\[ C_{m,n} \;=\; \sum _{i=1}^m\sum _{j=1}^n h(X_i,Y_j), \quad \E [C_{m,n}] = 0, \]

where \(h(x,y)\) has \(\E [h(X,Y)\mid X]=0\) and \(\E [h(X,Y)\mid Y]=0\). Intuitively, \(h(X_i,Y_j)\) is a zero-mean “interaction part.” In fact, the conditional-mean-zero property implies that \(h(X_i,Y_j)\) and \(h(X_{i'},Y_{j'})\) are uncorrelated whenever \((i,j)\neq (i',j')\), even when the two pairs share an index: conditioning on the shared variable factorizes the cross-moment into a product of conditional means, each of which is zero. The literature on two–sample U–statistics (see [Ser80], [Vaa98, Chapter 12]) shows that

\[ \Var \Bigl (\sum _{i,j} h(X_i,Y_j)\Bigr ) \;=\; m\,n\,\Var \bigl (h(X,Y)\bigr ) \;=\; O\bigl (m\,n\bigr ). \]

In particular, the partial index overlaps do not inflate the order beyond \(mn\); they do not produce an \((mn)^2\) term.

Thus heuristically

\[ \Var \!\Bigl (\frac {C_{m,n}}{mn}\Bigr ) \;=\; \frac {1}{(mn)^2}\,\Var \!\Bigl (\sum _{i,j} h(X_i,Y_j)\Bigr ) \;=\; O\Bigl (\frac {m\,n}{(mn)^2}\Bigr ) \;=\; O\Bigl (\frac {1}{mn}\Bigr ). \]

Hence

\[ \sqrt {m+n}\,\frac {C_{m,n}}{mn} \;=\; O_p\Bigl (\sqrt {m+n}\,\frac {1}{\sqrt {mn}}\Bigr ) \;=\; O_p\Bigl (\sqrt {\frac {m+n}{mn}}\Bigr ). \]

Since \(\frac {m+n}{mn} = \frac {1}{m} + \frac {1}{n}\), we see that

\[ \sqrt {\frac {m+n}{mn}} \;=\; \sqrt {\frac {1}{m} + \frac {1}{n}} \;\to \; 0 \quad \text {as } m,n\to \infty . \]

Thus \(\sqrt {m+n}\,\frac {C_{m,n}}{mn} \xrightarrow {p} 0\), meaning it is negligible in the \(\sqrt {m+n}\) scaling.
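A quick Monte Carlo illustration of this negligibility (a sketch under the arbitrary choice \(\phi(x,y)=xy\) with standard normal samples, for which \(\theta = 0\), \(f = g = 0\), and \(h(x,y)=xy\) exactly, so \(C_{m,n} = (\sum_i X_i)(\sum_j Y_j)\) in closed form):

```python
import numpy as np

def remainder(m, n, reps, rng):
    """Realizations of sqrt(m+n) * C_{m,n} / (mn) for phi(x, y) = x*y
    with standard normal X, Y, where h(x, y) = x*y exactly."""
    out = np.empty(reps)
    for r in range(reps):
        x = rng.normal(size=m)
        y = rng.normal(size=n)
        # C_{m,n} = sum_{i,j} h(X_i, Y_j) = (sum_i X_i) * (sum_j Y_j)
        out[r] = np.sqrt(m + n) * x.sum() * y.sum() / (m * n)
    return out

rng = np.random.default_rng(1)
small = remainder(50, 50, 2000, rng).std()    # theoretical sd: sqrt(100/2500) = 0.2
large = remainder(500, 500, 2000, rng).std()  # theoretical sd: sqrt(1000/250000) ~ 0.063
print(small > 2 * large)   # sd shrinks like sqrt((m+n)/(mn))
```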

Combining the Three Terms

Summarizing:

\[ \sqrt {m+n}\,\Bigl (U_{m,n} - \theta \Bigr ) \;=\; \underbrace {\sqrt {m+n}\,\frac {A_m}{m}}_{\text {CLT } \to \, \mathcal {N}(0,\,\sigma _f^2/\alpha )} \;+\; \underbrace {\sqrt {m+n}\,\frac {B_n}{n}}_{\text {CLT } \to \, \mathcal {N}(0,\,\sigma _g^2/(1-\alpha ))} \;+\; \underbrace {\sqrt {m+n}\,\frac {C_{m,n}}{mn}}_{\xrightarrow {p} 0}. \]

By Slutsky’s theorem, and because \(A_m\) and \(B_n\) are built from the two independent samples (so their normal limits are independent and the variances add), we get an overall normal limit whose variance is \(\sigma _f^2/\alpha + \sigma _g^2/(1-\alpha )\). Here,

\[ \sigma _f^2 \;=\; \Var \!\bigl (f(X)\bigr ), \quad \sigma _g^2 \;=\; \Var \!\bigl (g(Y)\bigr ), \]

and \( f(x) = \E [\phi (x,Y)]-\theta , \; g(y) = \E [\phi (X,y)]-\theta . \)

That yields the main theorem:

\[ \sqrt {m+n}\,\bigl (U_{m,n} - \theta \bigr ) \;\xrightarrow {d}\; \mathcal {N}\Bigl (0,\;\sigma ^2\Bigr ) \quad \text {where}\quad \sigma ^2 \;=\; \frac {\Var [f(X)]}{\alpha } \;+\; \frac {\Var [g(Y)]}{1-\alpha }. \]
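As a sanity check, the theorem can be probed by simulation. A Monte Carlo sketch with the Mann–Whitney kernel \(\phi(x,y)=\mathbf 1\{x<y\}\) and \(F=G=\mathrm{Uniform}(0,1)\): here \(\theta = 1/2\), \(f(x) = \tfrac12 - x\), \(g(y) = y - \tfrac12\), so \(\sigma_f^2 = \sigma_g^2 = 1/12\), and with \(m = n\) (\(\alpha = 1/2\)) the predicted limit variance is \(\sigma^2 = 1/3\):

```python
import numpy as np

rng = np.random.default_rng(2)
m = n = 200
theta = 0.5                      # P(X < Y) for X, Y iid Uniform(0, 1)

vals = np.empty(2000)
for r in range(vals.size):
    x = rng.uniform(size=m)
    y = rng.uniform(size=n)
    u = (x[:, None] < y[None, :]).mean()      # Mann-Whitney U-statistic
    vals[r] = np.sqrt(m + n) * (u - theta)

# The theorem predicts variance sigma^2 = (1/12)/alpha + (1/12)/(1-alpha) = 1/3.
print(abs(vals.var() - 1/3) < 0.05)
```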

(There are alternative expressions involving \(\xi _{10}\), \(\xi _{01}\) from the “four-case” breakdown, etc. They turn out to coincide with these \(\sigma _f^2,\sigma _g^2\) under appropriate definitions; see [Ser80; Vaa98] for details.)

Remark 1 (Multiple or vector-valued kernels). The above argument generalizes to the case \(\phi : \mathcal {X}\times \mathcal {Y} \to \mathbb {R}^d\), so that \((U_{m,n}-\theta )\) is in \(\mathbb {R}^d\). Then \(f\) and \(g\) also take values in \(\mathbb {R}^d\). The same decomposition and variance considerations apply (just use the multivariate CLT).

5 Variance estimators

5.1 Projection-Based (Hoeffding) Variance Estimator

For a one-dimensional kernel, it is straightforward to estimate the variance \(\sigma ^2\) of the limiting normal distribution by replacing \(\theta \), \(\E [\phi (x,Y)]\), and \(\E [\phi (X,y)]\) with their empirical versions. Concretely, set

\[ \hat {f}_i \;=\; \frac {1}{n}\sum _{j=1}^n \phi (X_i,Y_j) \;-\; U_{m,n}, \qquad \hat {g}_j \;=\; \frac {1}{m}\sum _{i=1}^m \phi (X_i,Y_j) \;-\; U_{m,n}. \]

Sample-based estimates of \(\Var [f(X)]\) and \(\Var [g(Y)]\) are given by

\[ \widehat {\Var }\bigl [f(X)\bigr ] \;=\; \frac {1}{m}\,\sum _{i=1}^m \bigl (\hat {f}_i\bigr )^2 \qquad \text {and} \qquad \widehat {\Var }\bigl [g(Y)\bigr ] \;=\; \frac {1}{n}\,\sum _{j=1}^n \bigl (\hat {g}_j\bigr )^2. \]

(One might prefer an \(m-1\) or \(n-1\) in the denominators for unbiasedness, but for large \(m,n\) it does not matter.)

Putting it all together, if

\[ \alpha _{m,n} \;=\; \frac {m}{m+n} \to \alpha \in (0,1), \]

a consistent estimator for the asymptotic variance \(\sigma ^2\) is

\[ \widehat {\sigma }^2 \;=\; \frac {\widehat {\Var }[f(X)]}{\,\alpha _{m,n}\,} \;+\; \frac {\widehat {\Var }[g(Y)]}{\,1-\alpha _{m,n}\,}. \]
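A minimal implementation of this projection-based estimator (a sketch: \(\hat f_i\) and \(\hat g_j\) are taken as the row- and column-wise empirical projections \(\frac1n\sum_j \phi(X_i,Y_j) - U_{m,n}\) and \(\frac1m\sum_i \phi(X_i,Y_j) - U_{m,n}\); the kernel in the worked example is an arbitrary choice):

```python
import numpy as np

def sigma2_hat(x, y, kernel):
    """Projection-based estimate of the asymptotic variance sigma^2 of
    sqrt(m+n) * (U_{m,n} - theta)."""
    m, n = len(x), len(y)
    phi = kernel(x[:, None], y[None, :])
    u = phi.mean()
    f_hat = phi.mean(axis=1) - u          # empirical projection f(X_i)
    g_hat = phi.mean(axis=0) - u          # empirical projection g(Y_j)
    alpha = m / (m + n)
    return (f_hat**2).mean() / alpha + (g_hat**2).mean() / (1 - alpha)

# Tiny worked example with kernel phi(x, y) = x*y (arbitrary choice):
# phi matrix [[0, 0], [2, 6]], U = 2, f_hat = [-2, 2], g_hat = [-1, 1],
# so sigma2_hat = 4/0.5 + 1/0.5 = 10.
x = np.array([0.0, 2.0])
y = np.array([1.0, 3.0])
print(sigma2_hat(x, y, lambda a, b: a * b))   # 10.0
```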

5.2 Direct Covariance Decomposition Estimator

An alternative way to arrive at a variance estimator is to “unpack” the variance of the U-statistic by writing

\[ U_{m,n} = \frac {1}{mn} \sum _{i=1}^m \sum _{j=1}^n \phi (X_i,Y_j) \]

and then writing

\[ \Var (U_{m,n}) = \frac {1}{m^2n^2}\sum _{i,j}\sum _{i',j'} \Cov \Bigl (\phi (X_i,Y_j),\phi (X_{i'},Y_{j'})\Bigr ). \]

Because the two samples are independent, there are four distinct cases for the index pairs \((i,j)\) and \((i',j')\): (i) \(i=i',\,j=j'\): \(mn\) terms, each contributing \(\Var \bigl (\phi (X,Y)\bigr )\); (ii) \(i=i',\,j\neq j'\): \(m\,n(n-1)\) terms, each contributing \(\sigma _1^2 := \Cov \bigl (\phi (X,Y),\,\phi (X,Y')\bigr )\), where \(Y'\) is an independent copy of \(Y\); (iii) \(i\neq i',\,j=j'\): \(n\,m(m-1)\) terms, each contributing \(\sigma _2^2 := \Cov \bigl (\phi (X,Y),\,\phi (X',Y)\bigr )\), where \(X'\) is an independent copy of \(X\); (iv) \(i\neq i',\,j\neq j'\): the two kernel evaluations are independent, so the covariance vanishes.

Thus, in population terms we have

\[ \Var (U_{m,n}) = \frac {1}{m^2 n^2}\Bigl \{\, m\,n\Var \bigl (\phi (X,Y)\bigr ) + m\,n(n-1)\,\sigma _1^2 + n\,m(m-1)\,\sigma _2^2 \Bigr \}. \]

A natural plug-in estimator replaces the population quantities by their sample analogues. Define the overall U–statistic mean

\[ U_{m,n} = \frac {1}{mn}\sum _{i=1}^m\sum _{j=1}^n \phi (X_i,Y_j), \]

and then form the sample analogues of the three population quantities (denominators such as \(mn-1\) would serve equally well for large samples):

\[ \widehat {V} \;=\; \frac {1}{mn}\sum _{i,j} \bigl (\phi (X_i,Y_j)-U_{m,n}\bigr )^2, \]

\[ \widehat {C}_1 \;=\; \frac {1}{m\,n(n-1)}\sum _{i=1}^m\sum _{j\neq j'} \bigl (\phi (X_i,Y_j)-U_{m,n}\bigr )\bigl (\phi (X_i,Y_{j'})-U_{m,n}\bigr ), \]

\[ \widehat {C}_2 \;=\; \frac {1}{n\,m(m-1)}\sum _{j=1}^n\sum _{i\neq i'} \bigl (\phi (X_i,Y_j)-U_{m,n}\bigr )\bigl (\phi (X_{i'},Y_j)-U_{m,n}\bigr ). \]

Then a direct plug-in estimator for the variance of \(U_{m,n}\) is given by

\[ \widehat {\Var }(U_{m,n}) \;=\; \frac {1}{m^2n^2}\Bigl \{\, mn\,\widehat {V} \;+\; m\,n(n-1)\,\widehat {C}_1 \;+\; n\,m(m-1)\,\widehat {C}_2 \Bigr \}. \]

Equivalently, after canceling a factor of \(mn\) in the numerator,

\[ \widehat {\Var }(U_{m,n}) \;=\; \frac {\widehat {V}}{mn} \;+\; \frac {n-1}{mn}\,\widehat {C}_1 \;+\; \frac {m-1}{mn}\,\widehat {C}_2. \]
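A compact implementation of this plug-in estimator (a sketch: \(\widehat V\), \(\widehat C_1\), \(\widehat C_2\) are computed from the centered kernel values via row and column sums, using the \(mn\) denominator for \(\widehat V\); the worked example and its kernel are arbitrary choices):

```python
import numpy as np

def var_u_direct(x, y, kernel):
    """Plug-in estimate of Var(U_{m,n}) from the covariance decomposition,
    with V_hat, C1_hat, C2_hat as sample analogues of the population terms."""
    m, n = len(x), len(y)
    c = kernel(x[:, None], y[None, :])
    c = c - c.mean()                      # centered kernel values c_ij
    v_hat = (c**2).sum() / (m * n)
    # Off-diagonal products within each row (shared X_i) estimate sigma_1^2.
    c1_hat = ((c.sum(axis=1)**2).sum() - (c**2).sum()) / (m * n * (n - 1))
    # Off-diagonal products within each column (shared Y_j) estimate sigma_2^2.
    c2_hat = ((c.sum(axis=0)**2).sum() - (c**2).sum()) / (n * m * (m - 1))
    return v_hat / (m * n) + (n - 1) / (m * n) * c1_hat + (m - 1) / (m * n) * c2_hat

# Tiny worked example with kernel phi(x, y) = x*y (arbitrary choice):
# c = [[-2, -2], [0, 4]], V_hat = 6, C1_hat = 2, C2_hat = -4,
# so Var_hat = 6/4 + (1/4)*2 + (1/4)*(-4) = 1.
x = np.array([0.0, 2.0])
y = np.array([1.0, 3.0])
print(var_u_direct(x, y, lambda a, b: a * b))   # 1.0
```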

This estimator directly reflects the contribution to the variance from the four covariance configurations in the double sum, and under standard conditions it provides a consistent estimate of \(\Var (U_{m,n})\).

Matrix Representation

For compactness and ease of computation, one can arrange the centered kernels into an \(m\times n\) matrix

\[ C = \Bigl [\phi (X_i,Y_j)-U_{m,n}\Bigr ]_{i=1,\dots ,m; \; j=1,\dots ,n}. \]

Then, by introducing the \(m\times m\) and \(n\times n\) matrices of ones, \(J_m\) and \(J_n\), and using some matrix algebra, one can show that the variance estimator is equivalent to

\[ \widehat {\Var }(U_{m,n}) = \frac {1}{m^2n^2}\trace \Bigl (C\bigl [J_m C + C J_n - C\bigr ]^\top \Bigr ), \]

where \(J_m\) and \(J_n\) are the all-ones matrices introduced above. A detailed derivation of this identity is given in Appendix A.
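The trace form can be checked numerically against the equivalent summation form in terms of row sums, column sums, and squared entries (a sketch on random data with an arbitrary kernel):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 6, 9
# Arbitrary illustrative kernel values phi(X_i, Y_j) = |X_i - Y_j|.
phi = np.abs(rng.normal(size=m)[:, None] - rng.normal(size=n)[None, :])
c = phi - phi.mean()                     # centered kernel matrix C

jm = np.ones((m, m))                     # J_m, the m x m matrix of ones
jn = np.ones((n, n))                     # J_n, the n x n matrix of ones
trace_form = np.trace(c @ (jm @ c + c @ jn - c).T) / (m**2 * n**2)

# Summation form: sum of squared row sums + squared column sums - squared entries.
sum_form = ((c.sum(axis=1)**2).sum() + (c.sum(axis=0)**2).sum()
            - (c**2).sum()) / (m**2 * n**2)
print(np.isclose(trace_form, sum_form))
```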

References

[Hoe48]

Wassily Hoeffding. “A Class of Statistics with Asymptotically Normal Distribution”. In: The Annals of Mathematical Statistics 19.3 (Sept. 1948), pp. 293–325. doi: 10.1214/aoms/1177730196.

[Ser80]

Robert J. Serfling. Approximation Theorems of Mathematical Statistics. Wiley Series in Probability and Mathematical Statistics. Hoboken: John Wiley & Sons, 1980. doi: 10.1002/9780470316481.

[Vaa98]

A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge: Cambridge University Press, 1998. doi: 10.1017/CBO9780511802256.

A Matrix Representation of the Variance

We start by writing

\[ c_{ij} \;=\; \phi (X_i,Y_j)-U_{m,n}, \]

and let

\[ C = \bigl [c_{ij}\bigr ]_{i=1,\dots ,m; \,j=1,\dots ,n}. \]

Also, define the row- and column–sums as

\[ S_{i\cdot } \;=\; \sum _{j=1}^n c_{ij} \quad \text {and}\quad S_{\cdot j} \;=\; \sum _{i=1}^m c_{ij}. \]

Expressing the Trace

Notice that

\[ \trace \Bigl ( C\bigl [J_m C + C J_n - C\bigr ]^\top \Bigr ) =\sum _{i=1}^m\sum _{j=1}^n c_{ij}\,\Bigl [(J_m C)_{ij}+(C J_n)_{ij}-c_{ij}\Bigr ]. \]

Because \(J_m\) is the \(m\times m\) matrix of ones and \(J_n\) is the \(n\times n\) matrix of ones, we have

\[ (J_m C)_{ij} = \sum _{k=1}^m c_{kj} \;=\; S_{\cdot j} \]

and

\[ (C J_n)_{ij} = \sum _{\ell =1}^n c_{i\ell } \;=\; S_{i\cdot }. \]

Thus, the trace becomes

\begin{align*} \trace \Bigl (C\bigl [J_m C + C J_n - C\bigr ]^\top \Bigr ) &= \sum _{i=1}^m \sum _{j=1}^n c_{ij} \Bigl [S_{\cdot j} + S_{i\cdot } - c_{ij}\Bigr ] \\ &= \sum _{j=1}^n \Bigl (\sum _{i=1}^m c_{ij}\Bigr )^2 +\sum _{i=1}^m \Bigl (\sum _{j=1}^n c_{ij}\Bigr )^2 -\sum _{i=1}^m\sum _{j=1}^n c_{ij}^2 \\ &= \sum _{j=1}^n S_{\cdot j}^2 + \sum _{i=1}^m S_{i\cdot }^2 - \sum _{i=1}^m \sum _{j=1}^n c_{ij}^2. \end{align*}

Rewriting the Variance Estimator

The variance estimator is given by

\[ \widehat {\Var }(U_{m,n}) \;=\; \frac {\widehat {V}}{mn} \;+\; \frac {n-1}{mn}\, \widehat {C}_1 \;+\; \frac {m-1}{mn}\,\widehat {C}_2, \]

with

\[ \widehat {V} \;=\; \frac {1}{mn} \sum _{i,j} c_{ij}^2 \]

(using the \(mn\) rather than the \(mn-1\) denominator makes the identity below exact; the difference is negligible for large samples).

To see the connection with the trace formula, let us re-express the covariance terms. A short calculation shows that

\[ \widehat {C}_1 \;=\; \frac {1}{m\,n(n-1)}\Biggl [\sum _{i=1}^m S_{i\cdot }^2 - \sum _{i,j} c_{ij}^2\Biggr ] \]

and

\[ \widehat {C}_2 \;=\; \frac {1}{n\,m(m-1)}\Biggl [\sum _{j=1}^n S_{\cdot j}^2 - \sum _{i,j} c_{ij}^2\Biggr ]. \]

Substitute these into the variance estimator. After some algebra (and noting that the factors combine so that all three terms have an overall denominator proportional to \(m^2n^2\)), one finds that

\[ \widehat {\Var }(U_{m,n}) \;=\; \frac {1}{m^2n^2} \Bigl [\sum _{i=1}^m S_{i\cdot }^2 + \sum _{j=1}^n S_{\cdot j}^2 - \sum _{i,j} c_{ij}^2\Bigr ]. \]

But the right-hand side is exactly the trace expression we obtained above. Hence,

\[ \widehat {\Var }(U_{m,n}) \;=\; \frac {1}{m^2n^2} \trace \Bigl (C \bigl [J_m C + C J_n - C\bigr ]^\top \Bigr ). \]