Here are some high-level reactions to the document.
Theoretical consistency of the variance optimization
The derivation of the optimal allocation relies on an asymptotic simplification that might introduce tension with the finite-sample goals of the paper. Section 4.1 decomposes the variance into two terms: $\frac{1}{n-s}\mathrm{Var}(Y-f^{(s)}(X))$ and $\frac{1}{m}\mathrm{Var}(f^{(s)}(X))$. The subsequent optimization problem (3) and Theorem 4.1 proceed by minimizing only the first term, justified by the assumption that "$m$ is large."
However, fine-tuning modifies the predictor $f^{(s)}$, which necessarily changes the second term, $\frac{1}{m}\mathrm{Var}(f^{(s)}(X))$. If the unlabeled set size $m$ is finite (as it is in the Wine Reviews experiment), ignoring this term means that the derived $s^*$ need not minimize the true finite-sample variance. It would be compelling to see either a formal finite-$m$ analog of the optimization problem or empirical evidence that the variation in the second term is indeed negligible compared to the reduction in the first term across the relevant range of $s$.
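To make the concern concrete, here is a minimal numerical sketch. It assumes the paper's scaling law $\mathrm{Var}(Y - f^{(s)}(X)) \approx a s^{-\alpha} + b$ and adds a purely hypothetical model for how $\mathrm{Var}(f^{(s)}(X))$ drifts with $s$; all parameter values are invented for illustration, not taken from the paper.

```python
# Hypothetical check: does ignoring the second variance term move the optimum?
# Assumes the scaling law Var(Y - f^(s)(X)) ~= a * s**(-alpha) + b, plus an
# invented model var_f(s) for Var(f^(s)(X)).  All numbers are illustrative.

def first_term(s, n, a, alpha, b):
    # the term the paper's optimization problem (3) minimizes
    return (a * s**(-alpha) + b) / (n - s)

def full_objective(s, n, m, a, alpha, b, var_f):
    # finite-m objective: both terms of the decomposition in Section 4.1
    return first_term(s, n, a, alpha, b) + var_f(s) / m

n, m = 2000, 10_000                        # finite unlabeled set
a, alpha, b = 5.0, 0.6, 0.4                # hypothetical scaling-law parameters
var_f = lambda s: 2.0 + 0.5 * (1 - s / n)  # hypothetical drift of Var(f^(s)(X))

grid = range(1, n)
s_first = min(grid, key=lambda s: first_term(s, n, a, alpha, b))
s_full = min(grid, key=lambda s: full_objective(s, n, m, a, alpha, b, var_f))
print(s_first, s_full)  # the two optima need not coincide when m is finite
```

Plotting or tabulating the gap between the two optima over a grid of $(m, \alpha)$ values would be one way to substantiate the "negligible second term" claim empirically.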
Accounting for the label cost of scaling law estimation
The central claim of efficiency rests on the ability to pinpoint the optimal split $s^*$ using the scaling law parameters $(a, \alpha, b)$. Currently, the empirical analysis in Sections 6.2 and 6.3 treats these parameters as fixed inputs derived from a separate 10,000-sample dataset. This precludes a deployment-faithful evaluation of the method's efficiency: the label cost of estimating the scaling curve is excluded from the experimental budget, yet obtaining these estimates is a prerequisite for running the method at all.
To robustly support the "cost savings" claims in the Abstract and Table 3, the evaluation would benefit from an end-to-end simulation where the scaling law is learned within the allocated budget $n$ (perhaps using the ramp-up procedure mentioned in Appendix B). Comparing this fully loaded cost against the baselines is necessary to demonstrate that the method generates net savings when starting from scratch (cold start) rather than relying on an oracle.
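As a sketch of what such a cold-start evaluation could look like, the snippet below fits the scaling law $v(s) \approx a s^{-\alpha} + b$ to a handful of pilot measurements by a grid search over $\alpha$ with closed-form least squares for $(a, b)$. The pilot data here are synthetic and noiseless purely to demonstrate the mechanics; in a real run, each $v_k$ would be a measured residual variance at split $s_k$, charged against the budget $n$.

```python
# Cold-start sketch: fit v(s) ~= a * s**(-alpha) + b to pilot measurements,
# then the fitted (a, alpha, b) would feed the allocation rule.  Pilot data
# below are synthetic; function and variable names are ours, not the paper's.

def fit_scaling_law(pilot):  # pilot: list of (s, measured residual variance)
    best = None
    for i in range(1, 200):                      # grid over the exponent alpha
        alpha = i * 0.01
        xs = [s**(-alpha) for s, _ in pilot]
        vs = [v for _, v in pilot]
        k = len(pilot)
        sx, sv = sum(xs), sum(vs)
        sxx = sum(x * x for x in xs)
        sxv = sum(x * v for x, v in zip(xs, vs))
        den = k * sxx - sx * sx
        if abs(den) < 1e-12:
            continue
        a = (k * sxv - sx * sv) / den            # least squares for a, b given alpha
        b = (sv - a * sx) / k
        sse = sum((v - a * x - b) ** 2 for x, v in zip(xs, vs))
        if best is None or sse < best[0]:
            best = (sse, a, alpha, b)
    return best[1], best[2], best[3]

# Synthetic pilot with ground truth a=4, alpha=0.5, b=0.3 (noiseless for clarity).
truth = lambda s: 4.0 * s**-0.5 + 0.3
pilot = [(s, truth(s)) for s in (50, 100, 200, 400, 800)]
a_hat, alpha_hat, b_hat = fit_scaling_law(pilot)
print(a_hat, alpha_hat, b_hat)  # should recover roughly (4.0, 0.5, 0.3)
```

Running the full pipeline this way, with noisy pilot measurements and the pilot labels debited from $n$, would allow an apples-to-apples comparison against the baselines in Table 3.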
Sensitivity of the allocation rule to estimation error
Related to the estimation of the scaling parameters, there is a question regarding the sensitivity of the decision rule $s^*(\hat a, \hat b, \hat\alpha)$. Since the objective function in Figure 2a is U-shaped, the cost of missing the optimal $s^*$ could be significant. In a realistic setting with small validation sets, estimates of the scaling exponent $\hat\alpha$ are likely to be noisy.
The manuscript cites Appendix D.1 regarding concentration, but it would be valuable to see a direct sensitivity analysis of the main methodological prescription. Specifically, how much does $s^*$ shift given standard errors in $\hat\alpha$ or $\hat b$? If the allocation rule relies on precise parameter estimates that are difficult to obtain with small $n$, the "optimal" allocation might be theoretically sound but practically risky. A bootstrap approach to selecting a conservative $s^*$ might strengthen the practical applicability of the framework.
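The kind of sensitivity analysis we have in mind is simple to sketch: perturb $\hat\alpha$ by plus or minus one standard error and report how far the minimizer moves. The objective follows the first-term criterion $(a s^{-\alpha} + b)/(n-s)$; all parameter values and the standard error below are hypothetical.

```python
# Rough sensitivity sketch: how far does s* move when alpha-hat is perturbed
# by +/- one standard error?  All numbers are hypothetical placeholders.

def s_star(n, a, alpha, b):
    obj = lambda s: (a * s**(-alpha) + b) / (n - s)
    return min(range(1, n), key=obj)

n, a, alpha, b = 2000, 5.0, 0.6, 0.4   # hypothetical point estimates
se_alpha = 0.05                         # hypothetical standard error of alpha-hat

center = s_star(n, a, alpha, b)
lo = s_star(n, a, alpha - se_alpha, b)
hi = s_star(n, a, alpha + se_alpha, b)
print(center, lo, hi)  # the spread of the three optima indicates practical risk
```

Reporting this spread (or a bootstrap distribution of $s^*$) alongside Figure 2a would directly quantify the risk of the U-shaped objective.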
Clarifying the performance gap between MSE and residual variance
The paper reports a striking performance gap between the proposed loss function and the standard MSE baseline (18–54% variance reduction in Table 3). However, Remark 1 notes that with additive shifts and ideal training, minimizing MSE and minimizing residual variance should lead to essentially the same solution. This raises a question about the driver of the empirical gap: is the proposed loss function structurally superior, or is the MSE baseline under-performing due to regularization, early stopping, or architecture constraints?
To firmly establish the "new loss" as a general principle for train-for-inference contexts, it is important to document why the MSE baseline fails to drive down residual variance to the same degree. Ensuring the MSE baseline is "best-effort" (e.g., capable of learning the correct mean shift via an unregularized intercept/bias term) would protect the results from being interpreted as a fix for a specific constrained training setup rather than a fundamental theoretical advantage.
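The identity underlying Remark 1 is easy to verify numerically: the residual variance $\mathrm{Var}(Y - f(X) - c)$ is invariant to any additive correction $c$, while the MSE is minimized at $c$ equal to the mean residual. The toy data and predictor below are synthetic, intended only to illustrate why a best-effort MSE baseline with a free bias should be able to match the proposed objective.

```python
# Numeric illustration of Remark 1: an additive shift c leaves the residual
# variance unchanged, while MSE is minimized at c = mean residual.  Data and
# predictor are synthetic.

import random
random.seed(0)

xs = [random.gauss(0, 1) for _ in range(10_000)]
ys = [2.0 * x + 1.5 + random.gauss(0, 0.5) for x in xs]
f = lambda x: 2.0 * x                     # synthetic predictor missing the shift

def var(v):
    m = sum(v) / len(v)
    return sum((t - m) ** 2 for t in v) / len(v)

res = [y - f(x) for x, y in zip(xs, ys)]
for c in (0.0, 1.5, 3.0):                 # candidate additive corrections
    shifted = [r - c for r in res]
    mse = sum(t * t for t in shifted) / len(shifted)
    print(round(var(shifted), 4), round(mse, 4))
# residual variance is identical across c; MSE is smallest near c = 1.5
```

If the MSE baseline in the experiments cannot realize such a free shift (e.g., because of a regularized bias term), that alone could explain much of the gap in Table 3.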
Operationalizing the M-estimation extension
Section 5 outlines the extension to M-estimation effectively at a theoretical level, particularly regarding the D-optimal/trace scalarizations. However, as currently presented, the method may be difficult for a reader to implement. Unlike the mean estimation case, where the objective is clear, minimizing the determinant or trace of a covariance matrix involving $\psi(X, f^{(s)}(X))$ presents non-trivial challenges for stochastic gradient training, particularly regarding numerical stability and the unknown nature of $\theta^*$. Expanding this section to provide an explicit algorithmic recipe—perhaps involving plug-in estimates for $\hat\theta$ or an iterative procedure—would match the usability of the earlier sections.
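To illustrate the level of concreteness we would find helpful, here is one possible (and deliberately simplified) plug-in recipe for the trace scalarization: fit a plug-in $\hat\theta$ on the labeled data, evaluate the score contributions $\psi$ at $\hat\theta$, and scalarize their estimated covariance by its trace. The choice of $\psi$ as the OLS score, and all data, are our own illustrative assumptions, not the paper's prescription.

```python
# Sketch of a plug-in procedure for the trace scalarization of Section 5.
# psi(x, y, theta) = x * (y - x @ theta) is the OLS score, chosen here only
# as a concrete example; all data are synthetic.

import numpy as np
rng = np.random.default_rng(0)

n, d = 500, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
Y = X @ theta_true + rng.normal(scale=0.7, size=n)

theta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]     # plug-in estimate of theta*

psi = X * (Y - X @ theta_hat)[:, None]               # n x d score contributions
cov_psi = np.cov(psi, rowvar=False)                  # estimated Var(psi) at theta-hat
criterion = np.trace(cov_psi)                        # trace scalarization
print(theta_hat.round(2), round(float(criterion), 3))
```

An explicit recipe of this kind in Section 5, together with guidance on differentiating the criterion stably during stochastic gradient training, would bring the M-estimation extension up to the usability of the mean-estimation sections.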
Detailed Comments (12)
Our empirical analysis validates the fine-tuning scaling law and confirms that our proposed optimal allocation rule reliably identifies the optimal sample allocation.
Unlike the conventional objective that minimizes the mean squared prediction errors, we propose to minimize the variance of the prediction errors as the fine-tuning objective, which is optimal for the downstream rectification stage.
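It may help readers to see the quoted objective side by side with MSE in code. The sketch below is framework-agnostic plain Python (in practice this would be a differentiable loss inside the fine-tuning framework); the function names are ours.

```python
# Minimal sketch contrasting the two objectives on residuals r = y - f(x):
# MSE penalizes a shared additive bias, the variance-of-errors loss does not.

def mse_loss(residuals):
    return sum(r * r for r in residuals) / len(residuals)

def residual_variance_loss(residuals):
    m = sum(residuals) / len(residuals)
    return sum((r - m) ** 2 for r in residuals) / len(residuals)

res = [0.9, 1.1, 1.0, 1.2]          # residuals with a common additive bias
print(mse_loss(res), residual_variance_loss(res))
# the variance loss ignores the shared bias that MSE penalizes
```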
where the third equality is because $(Y_i - f^{(s)}(\boldsymbol{X}_i))$ and $f^{(s)}(\tilde{\boldsymbol{X}}_j)$ are independent.
We validate our framework through an empirical study using the Wine Reviews dataset to estimate the average ratings in the population. This task mirrors a day-to-day challenge faced by product teams. When retailers select which stock-keeping units (SKUs) to sell in their stores, they usually need to understand the perceived product quality at scale
$$
\begin{aligned}
& \mathrm{E}\left[(\widehat{\mu}-\mathrm{E}[Y])^{2}\right] \\
= & \mathrm{E}\left[\left(\frac{1}{n-s} \sum_{i=1}^{n-s}\left(Y_{i}-f^{(s)}\left(\boldsymbol{X}_{i}\right)\right)+\frac{1}{m} \sum_{j=1}^{m} f^{(s)}\left(\tilde{\boldsymbol{X}}_{j}\right)-\mathrm{E}[Y]\right)^{2}\right] \\
= & \mathrm{E}\left[\left(\frac{1}{n-s} \sum_{i=1}^{n-s}\left(Y_{i}-f^{(s)}\left(\boldsymbol{X}_{i}\right)+\mathrm{E}\left[f^{(s)}(\boldsymbol{X})\right]-\mathrm{E}[Y]\right)+\frac{1}{m} \sum_{j=1}^{m}\left(f^{(s)}\left(\tilde{\boldsymbol{X}}_{j}\right)-\mathrm{E}\left[f^{(s)}(\boldsymbol{X})\right]\right)\right)^{2}\right] \\
= & \frac{1}{(n-s)^{2}} \sum_{i=1}^{n-s} \mathrm{E}\left[\left(Y_{i}-f^{(s)}\left(\boldsymbol{X}_{i}\right)+\mathrm{E}\left[f^{(s)}(\boldsymbol{X})\right]-\mathrm{E}[Y]\right)^{2}\right]+\frac{1}{m^{2}} \sum_{j=1}^{m} \mathrm{E}\left[\left(f^{(s)}\left(\tilde{\boldsymbol{X}}_{j}\right)-\mathrm{E}\left[f^{(s)}(\boldsymbol{X})\right]\right)^{2}\right] \\
= & \frac{1}{n-s} \operatorname{Var}\left(Y-f^{(s)}(\boldsymbol{X})\right)+\frac{1}{m} \operatorname{Var}\left(f^{(s)}(\boldsymbol{X})\right)
\end{aligned}
$$
where the third equality is because $\left(Y_{i}-f^{(s)}\left(\boldsymbol{X}_{i}\right)\right)$ and $f^{(s)}\left(\tilde{\boldsymbol{X}}_{j}\right)$ are independent.
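A quick Monte Carlo sanity check of this decomposition, for a fixed (already fine-tuned) predictor $f$ and synthetic Gaussian data of our choosing:

```python
# Monte Carlo check of Var(mu-hat) = Var(Y - f(X))/(n-s) + Var(f(X))/m
# for a fixed predictor f.  Data-generating process is synthetic.

import random
random.seed(1)

f = lambda x: 0.8 * x                       # fixed predictor
n_s, m, reps = 100, 400, 3000               # n_s plays the role of n - s

def draw_xy():
    x = random.gauss(0, 1)
    return x, x + random.gauss(0, 0.5)      # Y = X + noise

ests = []
for _ in range(reps):
    lab = [draw_xy() for _ in range(n_s)]
    unl = [random.gauss(0, 1) for _ in range(m)]
    mu = sum(y - f(x) for x, y in lab) / n_s + sum(f(x) for x in unl) / m
    ests.append(mu)

mean_e = sum(ests) / reps
emp = sum((e - mean_e) ** 2 for e in ests) / reps
# Var(Y - f(X)) = Var(0.2*X + eps) = 0.04 + 0.25;  Var(f(X)) = 0.64
theory = (0.04 + 0.25) / n_s + 0.64 / m
print(round(emp, 5), round(theory, 5))      # the two should be close
```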
where the third equality is because $(Y_i - f^{(s)}(\boldsymbol{X}_i))$ and $f^{(s)}(\tilde{\boldsymbol{X}}_j)$ are independent.
Note that in our LLM framework, $X$ typically denotes the raw textual query given to LLMs. Here, with a slight abuse of notation, we let $\boldsymbol{X}$ be the feature vectors extracted from the text, e.g., product profile and price.
implies that $\boldsymbol{\theta}=\mathrm{E}\left[\boldsymbol{X} \boldsymbol{X}^{\top}\right]^{-1} \mathrm{E}[\boldsymbol{X} Y]$, which is the population ordinary least squares estimator.
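The quoted identity has a clean finite-sample analog that is easy to check: the moment-based estimate $\left(\frac{1}{n}\sum \boldsymbol{X}_i \boldsymbol{X}_i^\top\right)^{-1} \frac{1}{n}\sum \boldsymbol{X}_i Y_i$ coincides with ordinary least squares. Synthetic data below.

```python
# Sanity check: the sample analog of theta = E[X X^T]^{-1} E[X Y] matches OLS.

import numpy as np
rng = np.random.default_rng(42)

n, d = 2000, 3
X = rng.normal(size=(n, d))
Y = X @ np.array([1.0, 0.5, -1.5]) + rng.normal(size=n)

theta_moment = np.linalg.solve(X.T @ X / n, X.T @ Y / n)   # moment-based estimate
theta_ols = np.linalg.lstsq(X, Y, rcond=None)[0]           # ordinary least squares
print(np.allclose(theta_moment, theta_ols))
```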
It is easy to see that $\widehat{\boldsymbol{\theta}}$ as defined in (7) is a consistent estimator. To see this, we focus on the right-hand side of (7) when both $(n-s) \rightarrow+\infty$ and $m \rightarrow+\infty$. ... where the equality is because $\boldsymbol{X}_{i}$ from the small labeled dataset and $\tilde{\boldsymbol{X}}_{j}$ from the large unlabeled dataset are sampled from the same distribution. Since $l$ has a unique minimizer, Theorem 2.7 of Newey and McFadden [1994] ensures that $\widehat{\boldsymbol{\theta}}$ is consistent.
In the case when $m$ is much larger than $n-s$, the second term can be ignored, and we focus our attention primarily on the first term.
Taking the expectation over $f^{(s)}$ (which is equivalent to the definition of the variance terms in the problem setup), we obtain the final result.
$$
\begin{aligned}
\operatorname{Var}\left(\widehat{\mu}^{(s)} \mid f^{(s)}\right) & =\operatorname{Var}\left(\frac{1}{n-s} \sum_{i}\left(Y_{i}-f^{(s)}\left(X_{i}\right)\right) \,\middle|\, f^{(s)}\right)+\operatorname{Var}\left(\frac{1}{m} \sum_{j} f^{(s)}\left(\tilde{X}_{j}\right) \,\middle|\, f^{(s)}\right) \\
& =\frac{1}{n-s} \operatorname{Var}\left(Y-f^{(s)}(X) \mid f^{(s)}\right)+\frac{1}{m} \operatorname{Var}\left(f^{(s)}(X) \mid f^{(s)}\right)
\end{aligned}
$$
Taking the expectation over $f^{(s)}$ (which is equivalent to the definition of the variance terms in the problem setup), we obtain the final result.