What do you mean by the Lanczos algorithm? Are you talking about the minimization routine,
rpcg_lanzos (RPCG)? The Lanczos vectors are stored in a NetCDF file and use the adjoint model iterations over all outer and inner loops.
I recently profiled WC13 with W4DPSAS, using Nouter=2 and Ninner=50 on 4 CPUs on my Mac:
 
Elapsed CPU time (seconds):
 Node   #  0 CPU:    2734.718
 Node   #  1 CPU:    2735.069
 Node   #  2 CPU:    2735.295
 Node   #  3 CPU:    2735.149
 Total:             22543.769
 Nonlinear model elapsed time profile, Grid: 01
  Allocation and array initialization ..............         0.094  ( 0.0009 %)
  Ocean state initialization .......................         0.015  ( 0.0001 %)
  Reading of input data ............................         0.450  ( 0.0041 %)
  Processing of input data .........................         0.596  ( 0.0055 %)
  Computation of vertical boundary conditions ......         0.086  ( 0.0008 %)
  Computation of global information integrals ......         1.635  ( 0.0149 %)
  Writing of output data ...........................         1.442  ( 0.0132 %)
  Model 2D kernel ..................................        37.881  ( 0.3462 %)
  2D/3D coupling, vertical metrics .................        76.908  ( 0.7030 %)
  Omega vertical velocity ..........................        60.869  ( 0.5564 %)
  Equation of state for seawater ...................       118.920  ( 1.0870 %)
  Atmosphere-Ocean bulk flux parameterization ......         0.971  ( 0.0089 %)
  GLS vertical mixing parameterization .............        28.591  ( 0.2613 %)
  3D equations right-side terms ....................         3.249  ( 0.0297 %)
  3D equations predictor step ......................         7.552  ( 0.0690 %)
  Pressure gradient ................................         2.072  ( 0.0189 %)
  Harmonic mixing of tracers, geopotentials ........         3.532  ( 0.0323 %)
  Harmonic stress tensor, S-surfaces ...............         1.700  ( 0.0155 %)
  Corrector time-step for 3D momentum ..............         3.660  ( 0.0335 %)
  Corrector time-step for tracers ..................         3.338  ( 0.0305 %)
                                              Total:       353.561    3.2317
 Tangent linear model elapsed time profile, Grid: 01
  Ocean state initialization .......................         0.567  ( 0.0052 %)
  Reading of input data ............................        41.545  ( 0.3797 %)
  Processing of input data .........................       183.269  ( 1.6752 %)
  Computation of vertical boundary conditions ......         2.944  ( 0.0269 %)
  Computation of global information integrals ......        32.821  ( 0.3000 %)
  Writing of output data ...........................         2.365  ( 0.0216 %)
  Model 2D kernel ..................................      2042.182  (18.6667 %)
  2D/3D coupling, vertical metrics .................        63.879  ( 0.5839 %)
  Omega vertical velocity ..........................        88.458  ( 0.8086 %)
  Equation of state for seawater ...................       249.450  ( 2.2801 %)
  3D equations right-side terms ....................       177.238  ( 1.6201 %)
  3D equations predictor step ......................       431.872  ( 3.9476 %)
  Pressure gradient ................................       119.952  ( 1.0964 %)
  Harmonic mixing of tracers, geopotentials ........       137.110  ( 1.2533 %)
  Harmonic stress tensor, S-surfaces ...............        77.808  ( 0.7112 %)
  Corrector time-step for 3D momentum ..............       273.983  ( 2.5044 %)
  Corrector time-step for tracers ..................       293.129  ( 2.6794 %)
                                              Total:      4218.573   38.5602
 Adjoint model elapsed time profile, Grid: 01
  Ocean state initialization .......................         0.425  ( 0.0039 %)
  Reading of input data ............................        42.638  ( 0.3897 %)
  Processing of input data .........................       197.069  ( 1.8013 %)
  Computation of vertical boundary conditions ......         3.994  ( 0.0365 %)
  Computation of global information integrals ......        94.125  ( 0.8604 %)
  Writing of output data ...........................         4.823  ( 0.0441 %)
  Model 2D kernel ..................................      3047.378  (27.8548 %)
  2D/3D coupling, vertical metrics .................       218.972  ( 2.0015 %)
  Omega vertical velocity ..........................       127.261  ( 1.1632 %)
  Equation of state for seawater ...................       170.072  ( 1.5546 %)
  3D equations right-side terms ....................       260.863  ( 2.3844 %)
  3D equations predictor step ......................       583.112  ( 5.3300 %)
  Pressure gradient ................................       169.630  ( 1.5505 %)
  Harmonic mixing of tracers, geopotentials ........       532.145  ( 4.8641 %)
  Harmonic stress tensor, S-surfaces ...............        86.664  ( 0.7922 %)
  Corrector time-step for 3D momentum ..............       363.944  ( 3.3267 %)
  Corrector time-step for tracers ..................       305.798  ( 2.7952 %)
                                              Total:      6208.914   56.7531
 Nonlinear model message Passage profile, Grid: 01
  Message Passage: 2D halo exchanges ...............         9.305  ( 0.0851 %)
  Message Passage: 3D halo exchanges ...............        61.940  ( 0.5662 %)
  Message Passage: 4D halo exchanges ...............         1.017  ( 0.0093 %)
  Message Passage: lateral boundary exchanges ......         1.622  ( 0.0148 %)
  Message Passage: data broadcast ..................         3.228  ( 0.0295 %)
  Message Passage: data reduction ..................         0.101  ( 0.0009 %)
  Message Passage: data gathering ..................         0.308  ( 0.0028 %)
  Message Passage: data scattering..................         1.469  ( 0.0134 %)
  Message Passage: point data gathering ............         0.073  ( 0.0007 %)
                                              Total:        79.063    0.7227
 Tangent linear model message Passage profile, Grid: 01
  Message Passage: 2D halo exchanges ...............       285.993  ( 2.6141 %)
  Message Passage: 3D halo exchanges ...............       101.907  ( 0.9315 %)
  Message Passage: 4D halo exchanges ...............        56.638  ( 0.5177 %)
  Message Passage: lateral boundary exchanges ......         6.171  ( 0.0564 %)
  Message Passage: data broadcast ..................        42.248  ( 0.3862 %)
  Message Passage: data reduction ..................         2.566  ( 0.0235 %)
  Message Passage: data gathering ..................         0.822  ( 0.0075 %)
  Message Passage: data scattering..................         9.092  ( 0.0831 %)
  Message Passage: point data gathering ............         0.589  ( 0.0054 %)
                                              Total:       506.028    4.6254
 Adjoint model message Passage profile, Grid: 01
  Message Passage: 2D halo exchanges ...............       392.205  ( 3.5850 %)
  Message Passage: 3D halo exchanges ...............       163.995  ( 1.4990 %)
  Message Passage: 4D halo exchanges ...............        58.984  ( 0.5392 %)
  Message Passage: lateral boundary exchanges ......         1.265  ( 0.0116 %)
  Message Passage: data broadcast ..................        34.569  ( 0.3160 %)
  Message Passage: data reduction ..................         3.558  ( 0.0325 %)
  Message Passage: data gathering ..................         5.227  ( 0.0478 %)
  Message Passage: data scattering..................         9.032  ( 0.0826 %)
  Message Passage: point data gathering ............         1.186  ( 0.0108 %)
                                              Total:       670.022    6.1244
 All percentages are with respect to total time =        10940.230
We need to be careful when changing ROMS profiling because of how the MPI phase is profiled. You cannot just add another number to the wclock routines. I am currently looking at this; let me know what you want to separate and I will take a look at it. The routine rpcg_lanzos is called a few times in 4DVAR, but I think its impact is minimal compared with the tangent linear and adjoint models, which are each iterated 100 times (Nouter x Ninner = 2 x 50) in the above statistics.
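
To put numbers on that comparison, here is a minimal Python sketch that only redoes the arithmetic from the profile above. The values are copied from the log; the variable names are mine and are not part of ROMS. The per-iteration figures are a rough average that assumes each model's whole total is spent in the 100 inner-loop integrations, which is an approximation.

```python
# Values copied from the profile above. Names are illustrative, not ROMS identifiers.
node_cpu = [2734.718, 2735.069, 2735.295, 2735.149]  # per-node CPU seconds
total_time = 10940.230                               # "total time" used for the percentages

# The percentages are relative to CPU time summed over the 4 nodes.
summed_cpu = sum(node_cpu)

# Kernel totals (seconds) for each model.
totals = {
    "nonlinear": 353.561,
    "tangent linear": 4218.573,
    "adjoint": 6208.914,
}

Nouter, Ninner = 2, 50
iterations = Nouter * Ninner  # 100 tangent linear and 100 adjoint integrations

# Share of total time for each model, in percent.
shares = {name: 100.0 * sec / total_time for name, sec in totals.items()}
tl_ad_share = shares["tangent linear"] + shares["adjoint"]

# Rough average cost per inner-loop iteration (CPU seconds summed over nodes),
# assuming the whole total is spent in those iterations.
per_iteration = {
    name: totals[name] / iterations for name in ("tangent linear", "adjoint")
}

for name, pct in shares.items():
    print(f"{name:15s} {pct:8.4f} %")
print(f"TL + AD share : {tl_ad_share:.2f} %")
for name, sec in per_iteration.items():
    print(f"{name:15s} ~{sec:.1f} s per iteration")
```

The tangent linear and adjoint kernels together account for roughly 95 % of the total, so any routine called only a few times would need to be very expensive per call to show up next to them.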