Discussion: Asymptotic Normality and Optimalities in Estimation of Large Gaussian Graphical Models, Part 2
Introduction
As part of project I've been working on, I'm reading Asymptotic Normality and Optimalities in Estimation of Large Gaussian Graphical Models, a paper by Ren, Sun, Zhang, and Zhou.
Motivation, problem setup, and other preliminaries are addressed in a previous blog post. In this post, we will begin discussing the statistical inference results of the paper by walking through the proof for Theorem 2, which places bounds on the distribution of the estimates and makes the results in Theorem 3 possible. The results of Theorem 1 are simply a special case of Theorem 2.
Preliminaries
First, we recall the definitions of the estimators for the sub-blocks of the
precision and conditional covariance matrices. We begin with the scaled
lasso regression parameter estimates:
which yield \(\hat\epsilon_A\) as the residual estimates from regression \(\XX_A\) on \(\XX_{A^c}\). Then, we define:
We now define the parameter space considered. For \(\lambda > 0\), we defined capped-\(\ell_1\) balls as follows:
where
for \(\Omega = (\omega_{ij})_{1\leq i, j\leq p}\). The authors take \(\lambda\) on the order \(\sqrt\frac{\log p}{n}\) in this paper.
Intuitively, this parameter space is a mix of the \(\ell_0\) norm, which measures sparsity, and the \(\ell_1\) norm, which is the sum of absolute values, imposed on each row. For \(\lambda\) very small, we recover the pure \(\ell_0\) norm; this special case is the parameter space used in Theorem 1.
Appealing to the graphical model representation of the multivariate normal distribution, in which a nonzero entry \(\omega_{ij}\) in the precision matrix implies the existence of an edge between nodes \(i\) and \(j\), we may observe that when \(|\omega_{ij}|\) is zero or larger than \(\lambda\), \(s_\lambda\) is equivalent to the maximum node degree of the graph (the degree of each node is equivalent to the number of nonzero entries on each row, or column). Finally, we note that the spectrum (eigenvalues) of the matrix are bounded, a fact upon which the later analysis relies.
The authors then prove a theorem that gives an error bound on the estimates. Their approach is to:
- Compare the estimates to the oracle MLE, giving a concentration bound on the distances between them.
- Show that
$$ \kappa_{ij}^{ora} = \sqrt{n} \frac{\omega_{ij}^{ora} - \omega_{ij}} {\sqrt{\omega_{ii}\omega_{jj} + \omega_{ij}^2}} $$is asymptotically standard normal, which implies the oracle MLE is asymptotically normal with mean \(\omega_{ij}\) and variance \(n^{-1}\sqrt{\omega_{ii}\omega_{jj} + \omega_{ij}^2}\).
By coupling the actual estimator to the oracle MLE and then proving nice properties for the oracle MLE, we can work towards nice properties for the actual estimator.
First, we state a few conditions, which will be useful in our analysis of Theorem 2. When these conditions hold for certain fixed constant \(C_0\), \(\varepsilon_\Omega \rightarrow 0\), and all \(\delta \geq 1\), the asymptotic normality and efficiency properties will hold, as we will see in the analysis of Theorem 2.
The first condition is:
Observing that \(\XX_{A^c}\left(\hat\beta_{A^c, A} - \beta_{A^c, A}\right)\) is equivalent to \(\norm{\epsilon_A - \hat\epsilon_A}^2\), we may interpret this as a concentration bound on the deviation of the residual estimates from the oracle residuals.
The next condition is:
with \(\bar \DD = \diag\left(\frac{\XX^\top\XX}{n}\right)\). These two statements are essentially a risk bounds on the lasso estimator, which will be discussed and proved in a future post.
The final condition is, for \(\theta_{ii}^{ora} = \frac{\norm{\XX_i - \XX{A^c}\beta_{A^c, i}}^2}{n}\),
with a certain complexity measure \(s\) of the precision matrix \(\Omega\), assuming the spectrum of \(\Omega\) is bounded, and \(n \geq \frac{(s\log p)^2}{c_0}\) for a sufficiently small \(c_0 > 0\).
Statement
Theorem 2. Let \(\hat\Theta_{A, A}\) and \(\hat\Omega_{A, A}\) be estimators defined in (2) and (3) respectively. Let \(\delta \geq 1\). Suppose \(s \leq \frac{c_0n}{\log p}\) for a sufficiently small constant \(c_0 > 0\).
- Suppose that conditions (7), (8), (9) hold with \(C_0\) and
\(\varepsilon_\Omega\). Then
\begin{align} \max_{G^*(M, s, \lambda)}\max_{A:A=\{i, j\}} \PP\left\{ \norm{\hat\Theta_{A, A} - \Theta_{A, A}^{ora}}_\infty > C_1 s \frac{\log p}{n} \right\}\leq 6\varepsilon_\Omega p^{-\delta + 1} + \frac{4p^{-\delta + 1}}{\sqrt{2\log p}} \end{align}and\begin{align} \max_{G^*(M, s, \lambda)}\max_{A:A=\{i, j\}} \PP\left\{ \norm{\hat\Omega_{A, A} - \Omega_{A, A}^{ora}}_\infty > C_1' s \frac{\log p}{n} \right\}\leq 6\varepsilon_\Omega p^{-\delta + 1} + \frac{4p^{-\delta + 1}}{\sqrt{2\log p}} \end{align}where \(\Theta^{ora}_{A, A}\) and \(\Omega^{ora}_{A, A}\) are the oracle estimators and \(C_1\) is a positive constant depending only on \(\{C_0, \max_{m\in A = \{i, j\}}\theta_{mm}\}\).
- Let \(\lambda = (1 + \varepsilon)\sqrt{\frac{2\delta\log p}{n}}\) with
\(\varepsilon > 0\) be the \(\lambda\) parameter in the scaled lasso
estimation problem, and let \(\hat\beta_{A^c, A}\) be the scaled lasso
estimator, or the LSE after the scaled lasso selection. Then (4),
(5), and (6), and thus (7) and (8) hold for all \(\Omega \in
\mathcal{G}^*(M, s, \lambda)\) with a certain constant \(C_0\) depending on
\(\{\varepsilon, c_0, M\}\) only and
\begin{align} \max_{\Omega \in \mathcal{G}^*(M, s, \lambda)} \varepsilon_\Omega = o(1) \end{align}
- There exist constants \(D_1\) and \(\vartheta \in (0, \infty)\), and four
marginally standard normal random variables \(Z1, Z_{kl}\), where \(kl = ii,
ij, jj\), such that whenever \(|Z_{kl}| \leq \vartheta\sqrt{n}\) for all \(kl\),
we have
\begin{align} \left|\kappa_{ij}^{ora} - Z'\right| \leq \frac{D_1}{\sqrt{n}}\left(1 + Z_{ii}^2 + Z_{ij}^2 + Z_{jj}^2\right) \end{align}where \(Z'\), which can be defined as a linear combination of \(Z_{kl}\).
Intuitively, statement (1) says that there is a very low probability that the maximum entrywise deviation of the actual estimator from the oracle MLE is larger than a constant that we can control. Statement (2) shows that the conditions are met such that statement (1) holds. Statement (3) says that the rescaled oracle MLE behaves more or less asymptotically normally.
Once we show these statements about the estimator relative to oracle MLEs, we will prove statements in Theorem 3 relating the oracle MLEs to the true parameter values, and by the triangle inequality, we will have bounds on the distances between our estimates on the truth.
Proof for Theorem 2(i)
The values of \(\theta_{ii}, \theta_{jj}\) are uniformly bounded, which implies that the desired concentration bound (7) follows from (4) for \(\theta^{ora}_{ii}\) and \(\theta^{ora}_{jj}\).
Therefore, we only need to be concerned about bounding \(\theta^{ora}_{ij}\). Recall that we define \(\bar \DD = \diag\left(\frac{\XX^\top \XX}{n}\right)\) and that \(\XX_{A^c}\) is independent of \(\epsilon_A\). First, we show the following.
Claim.
for all \(m \in A\).
Proof. First, we observe that \(\XX\bar\DD^{-\frac{1}{2}}\) is essentially \(\XX\) with its columns scaled to unit length in Euclidean norm. The fact that the mean of the distribution is zero follows from the fact that the columns of \(\XX\) are assumed to be centered. To show the variance, we observe that we may express
Dividing \(\XX\bar\DD^{-1/2}\) by \(n\) gives the desired variance.
It then follows from the union bound and Mill's Inequality that:
Now, let's compare our covariance estimates to the oracle MLE:
Recalling that
we have:
with probability at least \(1 - 2p^{-\delta + 1}\epsilon_\Omega - 2p^{-\delta + 1}(2\log p)^{-1/2}\) by the union bound, implying (7).
Given that the spectrum of \(\Theta_{A, A}\) is bounded, the functional \(\zeta_{kl}(\Theta_{A, A}) = \left(\Theta_{A, A}^{-1}\right)_{kl}\) is Lipschitz in a neighborhood of \(\Theta_{A, A}\) for \(k, l \in A\), and thus the bound on distances between the precision matrix estimates and the oracle MLE for the precision matrix in (8) follows from (7).
Proof for Theorem 2(ii)
The proof for part (ii), though fairly straightforward, depends on Theorem 10(i), Theorem 11(ii), and Proposition 1, and so we will return to this later.
This part of the Theorem essentially gives and proves conditions under which conditions (4), (5), and (6) hold, which in turn imply (7) and (8) for all \(\Omega \in \Gg^*(M, s, \lambda)\) up to a constant. This part of the Theorem also establishes that \(\varepsilon_\Omega\) is \(o(1)\) for all \(\Omega\) in the parameter space, implying that it has a negligible impact on the concentration bounds (7) and (8).
Proof for Theorem 2(iii)
To prove the coupling inequality in (10), we first define a random vector \(\eta^{ora} = \left(\eta_{ii}^{ora}, \eta_{ij}^{ora}, \eta_{jj}^{ora}\right)\), where
By the KMT inequality, for which the authors cite Mason and Zhou (2012) for the one-dimensional case and Einmahl (1989) for the multidimensional case, there exist constants \(D_0, \vartheta \in (0, \infty)\) and a random Gaussian vector \(Z = (Z_{ii}, Z_{ij}, Z_{jj}) \sim \Nn(0, \breve\Sigma)\) where \(\breve\Sigma = \cov(\eta^{ora})\), such that \(|Z_{kl}| \leq \vartheta\sqrt{n}\) for all \(kl\) implies
Let us now define \(\Theta = (\theta_{ii}, \theta_{ij}, \theta_{jj})\), consider the function
and take its Taylor expansion, which gives us:
where
We observe that Taylor's Theorem gives us a uniform upper bound on the coefficients of the remainder terms:
where \(B\) is a sufficiently small compact ball centered at \(\Theta\). The upper bound holds when \(\Theta^{ora}\) is in \(B\), an assumption which can be satisfied by picking a small enough value \(\vartheta\) in the assumption \(\norm{\eta^{ora}}_\infty \leq \vartheta \sqrt{n}\). Note that here \(D^\alpha\) is not a constant, but a second order derivative. This bound follows from the fact that equality in the Taylor expansion is satisfied when evaluating the second order term with some \(\xi\) between \(\Theta\) and \(\Theta^{ora}\), so the coefficients are naturally bounded above by the maximum value of the second order coefficients in \(B\), as it encloses both \(\Theta\) and \(\Theta^{ora}\). The bottom line is that when the oracle MLE's value is sufficiently close to the truth, a linear approximation is basically good enough.
With this linear approximation to \(\omega_{ij}\) as a function of \(\theta_{ij}\) (and a quadratic correction term we can control), we can then express this relationship in terms of \(\kappa_{ij}\) and \(\eta_{ij}\), which are simply \(\omega_{ij}, \theta_{ij}\) rescaled by constants, respectively:
where \(h_1, h_2, h_3, D_\beta\) are constants. Subtracting \(Z' = h_1Z_1 + h_2Z_2 + h_3Z_3 \sim \Nn(0, 1)\) from both sides and applying Hölder's Inequality, we obtain:
Applying the KMT inequality and the fact that \(\norm{\eta^{ora}}^2 \leq C_4 (Z_{ii}^2 + Z_{ij}^2 + Z_{jj}^2)\) for some large constant \(C_4\), we complete the proof: