The Question

I submitted MR !2337 to Eigen, adding a GramSchmidtQR class that implements Modified Gram-Schmidt (MGS) QR decomposition. The implementation stores Q and R as explicit dense matrices, unlike Eigen’s HouseholderQR which stores Q as a compact sequence of Householder reflectors.

Within minutes, Rasmus Munk Larsen — an Eigen maintainer — asked:

What is the motivation for adding this? MGS QR is generally slower and no more accurate than blocked Householder QR.

Fair question. Instead of arguing from theory, I wrote a benchmark.

The Algorithms

Householder QR works by applying a sequence of orthogonal reflections (Householder transformations) to zero out sub-diagonal entries column by column. Eigen’s implementation uses blocked Householder reflectors — it groups reflectors into panels and applies them as BLAS-3 (matrix-matrix) operations. Q is stored implicitly as a product of reflectors.

Modified Gram-Schmidt (MGS) orthogonalizes columns one at a time: for each column, project out components along all previously computed orthonormal vectors, then normalize. Q and R are stored as explicit dense matrices. This is inherently a BLAS-2 (matrix-vector) algorithm — each column requires dot products and axpy operations against all previous columns.

The key difference: blocked Householder exploits cache-friendly BLAS-3 operations that modern CPUs execute extremely efficiently. MGS is limited to BLAS-2 operations that have lower arithmetic intensity.

The Implementation

The GramSchmidtQR class I submitted:

template <typename InputType>
GramSchmidtQR& compute(const EigenBase<InputType>& matrix) {
    const Index rows = matrix.rows();
    const Index cols = matrix.cols();
    const Index size = (std::min)(rows, cols);

    m_q.resize(rows, rows);
    m_r.resize(rows, cols);
    m_r.setZero();

    MatrixType work = matrix.derived();

    for (Index i = 0; i < size; ++i) {
        // Orthogonalize column i against all previous columns
        for (Index j = 0; j < i; ++j) {
            Scalar rji = m_q.col(j).dot(work.col(i));
            m_r(j, i) = rji;
            work.col(i) -= rji * m_q.col(j);
        }
        // Normalize
        RealScalar norm = work.col(i).norm();
        m_r(i, i) = norm;
        if (norm > RealScalar(0))
            m_q.col(i) = work.col(i) / norm;
        else
            m_q.col(i).setZero();
    }

    // Complete Q to full orthonormal basis for tall matrices...
    // (omitted for brevity)
    return *this;
}

Simple, readable, correct. But fast?

The Benchmark

I compared MGS against HouseholderQR across:

  • Square matrices: 4×4 to 1024×1024
  • Tall matrices: 100×10 to 1000×100 (overdetermined systems)
  • Wide matrices: 10×100 to 50×500 (underdetermined)
  • Ill-conditioned matrices: condition numbers from 10⁴ to 10¹²
  • Both double and float precision

Compiled with g++ -O3 -march=native -DNDEBUG.

Try it yourself on Compiler Explorer: godbolt.org/z/YE537ec18

Results

All results below are from a dedicated machine (g++ -O3 -march=native -DNDEBUG, double precision). The Godbolt link above reproduces the same ratios on shared infrastructure.

Performance: Speed

| Size      | MGS (μs) | HH (μs) | Ratio | Winner |
|-----------|----------|---------|-------|--------|
| 4×4       | 0.2      | 0.5     | 2.9x  | MGS    |
| 8×8       | 0.2      | 1.2     | 5.6x  | MGS    |
| 16×16     | 1.1      | 3.1     | 3.0x  | MGS    |
| 32×32     | 5.3      | 8.4     | 1.6x  | MGS    |
| 64×64     | 27.6     | 41.0    | 1.5x  | MGS    |
| 128×128   | 217      | 152     | 0.70x | HH     |
| 256×256   | 1621     | 857     | 0.53x | HH     |
| 512×512   | 13678    | 5500    | 0.40x | HH     |
| 1024×1024 | 113725   | 34911   | 0.31x | HH     |

Ratio = HH_time / MGS_time. Values > 1 mean MGS is faster.

At n = 128 Householder takes the lead and never looks back. By 1024×1024, it’s 3.3x faster. The solve benchmark shows the same crossover — MGS is 4.3x faster at 8×8 but 2.7x slower at 512×512.

Performance: Tall Matrices

This is where MGS falls apart:

| Size     | MGS (μs) | HH (μs) | Ratio   |
|----------|----------|---------|---------|
| 100×10   | 93.8     | 2.5     | 0.03x   |
| 500×50   | 10487    | 178     | 0.02x   |
| 1000×10  | 93924    | 16.4    | 0.0002x |
| 1000×100 | 90504    | 854     | 0.009x  |

For a 1000×10 matrix, Householder is 5700x faster. This isn’t a typo.

The problem: my MGS implementation builds a full m×m Q matrix (1000×1000) and completes the remaining 990 columns by orthogonalizing standard basis vectors. Householder stores Q as 10 compact reflectors and never materializes the full matrix. Even with a thin-Q optimization, the core BLAS-2 vs BLAS-3 gap would remain.

Accuracy: Orthogonality Under Ill-Conditioning

Both algorithms reconstruct A = QR to machine precision (~10⁻¹⁶). The real difference is orthogonality of Q — how close Q^T Q is to the identity:

| Condition (κ)    | MGS ‖Q^TQ − I‖ | HH ‖Q^TQ − I‖ | MGS degradation   |
|------------------|----------------|---------------|-------------------|
| Well-conditioned | 2.6e-14        | 5.8e-15       | ~5x worse         |
| 10⁴              | 5.9e-13        | 5.8e-15       | 100x worse        |
| 10⁸              | 3.7e-09        | 6.1e-15       | 600,000x worse    |
| 10¹²             | 3.2e-05        | 6.2e-15       | 5 billion x worse |

This is the killer. Householder’s orthogonality stays at ~10⁻¹⁵ regardless of condition number. MGS degrades as O(κ · ε) — a well-known theoretical result the data confirms precisely. If you need Q to actually be orthogonal — for projections, basis computations, or numerical stability downstream — MGS becomes unreliable as κ grows. The same pattern holds for single-precision float, with an even earlier onset of degradation.

Why MGS Loses

Three fundamental factors:

1. BLAS-2 vs BLAS-3. MGS processes one column at a time with dot products and rank-1 updates (BLAS-2). Blocked Householder groups reflectors into panels and applies them with matrix-matrix multiplications (BLAS-3). On modern CPUs with deep cache hierarchies, BLAS-3 achieves much higher throughput because it reuses data in cache. The gap grows with matrix size — exactly what we see.

2. Full Q materialization. My implementation stores Q as a dense m×m matrix. For tall matrices, this means computing and storing columns that may never be needed. Householder stores Q as min(m,n) compact reflectors and only materializes columns on demand.

3. Orthogonality loss. MGS’s orthogonality error grows as O(κ · ε), where κ is the condition number and ε is machine precision. Householder achieves O(ε) regardless of κ. This is a fundamental algorithmic property, not an implementation deficiency.

Where MGS Wins

For completeness, there is a niche:

  • Small matrices (n ≤ 64): 1.5–5.6x faster due to lower overhead. No reflector storage machinery, no blocked panel application — just straightforward dot products and column updates.
  • Explicit Q/R output: matrixQ() returns a plain MatrixXd, not a HouseholderSequence. If you need Q as an actual matrix for subsequent Q * B or Q.transpose() * C operations, MGS gives you that directly. With Householder, you’d evaluate the implicit Q into a dense matrix anyway — adding a separate step.
  • Small system solves: 1.6–4.3x faster for n ≤ 64 when you need both decomposition and solve.

But this niche is narrow. Most real-world QR use cases involve matrices larger than 64×64, or ill-conditioned systems, or tall overdetermined systems — all cases where Householder is strictly superior.

The Decision

I initially closed the MR, reasoning that the maintenance burden wasn’t justified for a method that only wins on small matrices. But then Rasmus Munk Larsen — the Eigen maintainer who originally challenged the MR — reviewed the benchmark data and commented:

Thanks for the careful investigation.

And:

I don’t think supporting both APIs is a problem. MGS does have its uses as you point out.

He then reopened the MR for further review. The benchmark data made the case: MGS has a legitimate niche for small matrices and explicit Q access. The original feature request (issue #2495) made compelling arguments about API ergonomics, and the data showed the tradeoffs clearly enough for the maintainer to reconsider.

Takeaways

  1. Benchmark before you argue. The flop counts are comparable: about 2mn² for MGS and 2(mn² − n³/3) for a Householder factorization. The practical difference lives in the constants: cache behavior, BLAS level, and implementation overhead. Theory said “comparable”; data said “not even close at scale.”

  2. BLAS-3 blocking is not optional at scale. Any algorithm that processes one column at a time (BLAS-2) will lose to one that processes panels of columns together (BLAS-3) once matrices exceed cache size. This applies far beyond QR — it’s why blocked LU, blocked Cholesky, and tiled algorithms dominate modern numerical linear algebra.

  3. Orthogonality matters more than reconstruction error. Both algorithms reconstruct A = QR to machine precision. But MGS’s Q drifts from orthogonal as condition number grows. If you’re using Q for anything beyond reconstructing A — projections, basis extraction, iterative refinement — that drift compounds.

  4. Let the data speak. When a maintainer challenges your approach, respond with benchmarks, not arguments. The data confirmed MGS is slower at scale — but also showed where it wins. That honest assessment is what ultimately got the MR reopened.

  5. Know your niche. Small-matrix MGS QR is legitimately faster. If I were building a robotics system that decomposes hundreds of thousands of 6×6 matrices per second, MGS would be the right choice. But that’s not what Eigen’s QR module is for — it serves the general case, and the general case is Householder.