Discussion: Optimal Rates of Convergence for Covariance Matrix Estimation, Part 1
For the past couple of days, I have been reading Optimal Rates of Convergence for Covariance Matrix Estimation. To aid my digestion of this paper, I will be writing about it on this blog. This post covers the authors' justification of Theorem 2. I highly recommend referring to the paper; I will try to fill in gaps of the proof, but must leave out some details for brevity.
Background
The authors in this paper establish optimal rates of convergence for estimating the covariance matrix under the operator norm (which, in Euclidean distance, coincides with the spectral norm) and the Frobenius norm; i.e., they provide lower bounds on the max risk and show that the minimax upper bound of their tapering estimators achieves this rate.
First, they construct the following parameter space of \(p\times{p}\) covariance matrices:
This definition states that for the covariance matrices in this space, the sum of the absolute values of the entries \(k\) away from the diagonal decay with \(k^{-\alpha}\), where \(\alpha\) is a smoothing parameter. The bound will play an important proof in the later proofs.
Theorem. The minimax risk of estimating the covariance matrix \(\Sigma\) over the class \(\mathcal{P}_\alpha\) given above satisfies
where \(\mathcal{P}_\alpha = \mathcal{P}_\alpha(M_0, M, \rho)\) is the set of all distributions of \(X\) that satisfies both Equations \eqref{eq:parameter-space} and \eqref{eq:subgaussian}
\(\Sigma\) is then estimated by tapering the maximum likelihood estimator:
where \(\sigma^*_{ij}\) are the entries in the maximum likelihood estimator \(\Sigma^*\) and the weights are given by:
The bandwidth is \(k\) on either side along the diagonal; shrinkage begins on each side at \(k/2\). As a technical note, such a tapering estimator may be rewritten as a sum of block matrices along the diagonal; this is used in the proofs to attain concentration bounds via random matrix theory.
Minimax Upper Bound under the Operator Norm
The authors assume that the \(X_i\)'s are subgaussian; that is, there exists \(\rho > 0\) such that:
for all \(t > 0\) and \(\Vert{v}\Vert_2=1\).
Theorem. The tapering estimator \(\hat\Sigma_k\) defined in Equation \eqref{eq:tapering-estimator} with \(p \geq n^{\frac{1}{2\alpha+1}}\) satisfies
for \(k = O(n), \log p = O(n)\) and some constant \(C > 0\).
To prove this, the authors assume \(\mu = 0\) and analyze
rather than the maximum likelihood estimator
as \(\bar X \bar X^\top\) is a higher order term and can be neglected in the analysis of the rate. As such, we defined a new tapering estimator:
We observe that we may bound the risk from above by the bias and variance:
This inequality can easily be derived by thinking about how to bound \((a + b)^2\).
Bias
To prove the bound on the bias, the authors note that the operator norm of a symmetric matrix is upper bounded by its \(\ell_1\) norm. Therefore, we have:
This inequality holds by construction (see Equation \eqref{eq:parameter-space}).
Variance
The authors rely on random matrix theory to bound the variance. Of particular note is
Lemma 2. Define the submatrix \(M_l^{(m)}\):
then we have the following bound:
They also provide a concentration bound on the operator norm of this submatrix.
Lemma 3. There is a constant \(\rho_1 > 0\) such that
for all \(0 < x < \rho_1\) and \(1-m \leq l \leq p\).
Therefore, by Lemma 2, we have:
The next few steps are quite tricky to understand.
I believe that we can recall the definition of \(N^{(m)}\):
for some \(l\). We note that we may bound the operator norm of a submatrix by the operator norm of the full matrix by recalling the definition of an operator norm
and seeing that for any vector \(v\), \(\Vert A_\mathrm{sub} v \Vert \leq \Vert Av\Vert\). Therefore, we may establish that:
at this point, we may split the norm by the triangle inequality:
and bound the operator norm by the Frobenius norm:
Finally, we may apply Cauchy-Schwarz:
We now bound the \(\sqrt{\mathbf{P}(N^{(m)} > x)}\) term. By setting \(x=4\sqrt{\frac{\log{p}+m}{n\rho_1}}\), and recalling Lemma 3, we have:
We are able to bound the Frobenius norm:
by observing that the squared Frobenius norm decouples the entries of the matrix, and we are able to bound each of the \(p^2\) entries by a constant, which hides among the other constants.
With all these pieces together, we may conclude:
where \(C\) is an evolving constant.
Putting It All Together
Having walked through the steps of bounding the bias:
and variance:
for arbitrary \(\Sigma\), we can put these terms together to establish the bound on the worse case risk of the tapering estimator (Theorem 2):
for \(k = O(n), \log p = O(n)\) and some constant \(C > 0\).