Here are some high-level reactions to the document.
Theoretical consistency of the variance optimization
The derivation of the optimal allocation relies on an asymptotic simplification that might introduce tension with the finite-sample goals of the paper. Section 4.1 decomposes the variance into two terms: $\frac{1}{n-s}\mathrm{Var}(Y-f^{(s)}(X))$ and $\frac{1}{m}\mathrm{Var}(f^{(s)}(X))$. The subsequent optimization problem (3) and Theorem 4.1 proceed by minimizing only the first term, justified by the assumption that "$m$ is large."
However, fine-tuning modifies the predictor $f^{(s)}$, which necessarily changes the second term, $\frac{1}{m}\mathrm{Var}(f^{(s)}(X))$. If the unlabeled set size $m$ is finite (as it is in the Wine Reviews experiment), ignoring this term means that the derived $s^*$ need not minimize the true finite-sample variance. It would be compelling to see either a formal finite-$m$ analog of the optimization problem or empirical evidence that the variation in the second term is indeed negligible compared to the reduction in the first term across the relevant range of $s$.
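To make the concern concrete, here is a minimal numerical sketch. It assumes the paper's scaling law $\mathrm{Var}(Y - f^{(s)}(X)) \approx a s^{-\alpha} + b$ and adds a purely hypothetical model for how $\mathrm{Var}(f^{(s)}(X))$ drifts with $s$; all parameter values are invented for illustration, not taken from the paper.

```python
# Hypothetical check: does ignoring the second variance term move the optimum?
# Assumes the scaling law Var(Y - f^(s)(X)) ~= a * s**(-alpha) + b, plus an
# invented model var_f(s) for Var(f^(s)(X)).  All numbers are illustrative.

def first_term(s, n, a, alpha, b):
    # the term the paper's optimization problem (3) minimizes
    return (a * s**(-alpha) + b) / (n - s)

def full_objective(s, n, m, a, alpha, b, var_f):
    # finite-m objective: both terms of the decomposition in Section 4.1
    return first_term(s, n, a, alpha, b) + var_f(s) / m

n, m = 2000, 10_000                        # finite unlabeled set
a, alpha, b = 5.0, 0.6, 0.4                # hypothetical scaling-law parameters
var_f = lambda s: 2.0 + 0.5 * (1 - s / n)  # hypothetical drift of Var(f^(s)(X))

grid = range(1, n)
s_first = min(grid, key=lambda s: first_term(s, n, a, alpha, b))
s_full = min(grid, key=lambda s: full_objective(s, n, m, a, alpha, b, var_f))
print(s_first, s_full)  # the two optima need not coincide when m is finite
```

Plotting or tabulating the gap between the two optima over a grid of $(m, \alpha)$ values would be one way to substantiate the "negligible second term" claim empirically.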
Accounting for the label cost of scaling law estimation
The central claim of efficiency rests on the ability to pinpoint the optimal split $s^*$ using the scaling law parameters $(a, \alpha, b)$. Currently, the empirical analysis in Sections 6.2 and 6.3 treats these parameters as fixed inputs derived from a separate 10,000-sample dataset. This precludes a deployment-faithful evaluation of the method's efficiency: the label cost of estimating the scaling curve is excluded from the experimental budget, yet obtaining these estimates is a prerequisite for running the method at all.
To robustly support the "cost savings" claims in the Abstract and Table 3, the evaluation would benefit from an end-to-end simulation where the scaling law is learned within the allocated budget $n$ (perhaps using the ramp-up procedure mentioned in Appendix B). Comparing this fully loaded cost against the baselines is necessary to demonstrate that the method generates net savings when starting from scratch (cold start) rather than relying on an oracle.
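As a sketch of what such a cold-start evaluation could look like, the snippet below fits the scaling law $v(s) \approx a s^{-\alpha} + b$ to a handful of pilot measurements by a grid search over $\alpha$ with closed-form least squares for $(a, b)$. The pilot data here are synthetic and noiseless purely to demonstrate the mechanics; in a real run, each $v_k$ would be a measured residual variance at split $s_k$, charged against the budget $n$.

```python
# Cold-start sketch: fit v(s) ~= a * s**(-alpha) + b to pilot measurements,
# then the fitted (a, alpha, b) would feed the allocation rule.  Pilot data
# below are synthetic; function and variable names are ours, not the paper's.

def fit_scaling_law(pilot):  # pilot: list of (s, measured residual variance)
    best = None
    for i in range(1, 200):                      # grid over the exponent alpha
        alpha = i * 0.01
        xs = [s**(-alpha) for s, _ in pilot]
        vs = [v for _, v in pilot]
        k = len(pilot)
        sx, sv = sum(xs), sum(vs)
        sxx = sum(x * x for x in xs)
        sxv = sum(x * v for x, v in zip(xs, vs))
        den = k * sxx - sx * sx
        if abs(den) < 1e-12:
            continue
        a = (k * sxv - sx * sv) / den            # least squares for a, b given alpha
        b = (sv - a * sx) / k
        sse = sum((v - a * x - b) ** 2 for x, v in zip(xs, vs))
        if best is None or sse < best[0]:
            best = (sse, a, alpha, b)
    return best[1], best[2], best[3]

# Synthetic pilot with ground truth a=4, alpha=0.5, b=0.3 (noiseless for clarity).
truth = lambda s: 4.0 * s**-0.5 + 0.3
pilot = [(s, truth(s)) for s in (50, 100, 200, 400, 800)]
a_hat, alpha_hat, b_hat = fit_scaling_law(pilot)
print(a_hat, alpha_hat, b_hat)  # should recover roughly (4.0, 0.5, 0.3)
```

Running the full pipeline this way, with noisy pilot measurements and the pilot labels debited from $n$, would allow an apples-to-apples comparison against the baselines in Table 3.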
Sensitivity of the allocation rule to estimation error
Related to the estimation of the scaling parameters, there is a question regarding the sensitivity of the decision rule $s^*(\hat a, \hat b, \hat\alpha)$. Since the objective function in Figure 2a is U-shaped, the cost of missing the optimal $s^*$ could be significant. In a realistic setting with small validation sets, estimates of the scaling exponent $\hat\alpha$ are likely to be noisy.
The manuscript cites Appendix D.1 regarding concentration, but it would be valuable to see a direct sensitivity analysis of the main methodological prescription. Specifically, how much does $s^*$ shift given standard errors in $\hat\alpha$ or $\hat b$? If the allocation rule relies on precise parameter estimates that are difficult to obtain with small $n$, the "optimal" allocation might be theoretically sound but practically risky. A bootstrap approach to selecting a conservative $s^*$ might strengthen the practical applicability of the framework.
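The kind of sensitivity analysis we have in mind is simple to sketch: perturb $\hat\alpha$ by plus or minus one standard error and report how far the minimizer moves. The objective follows the first-term criterion $(a s^{-\alpha} + b)/(n-s)$; all parameter values and the standard error below are hypothetical.

```python
# Rough sensitivity sketch: how far does s* move when alpha-hat is perturbed
# by +/- one standard error?  All numbers are hypothetical placeholders.

def s_star(n, a, alpha, b):
    obj = lambda s: (a * s**(-alpha) + b) / (n - s)
    return min(range(1, n), key=obj)

n, a, alpha, b = 2000, 5.0, 0.6, 0.4   # hypothetical point estimates
se_alpha = 0.05                         # hypothetical standard error of alpha-hat

center = s_star(n, a, alpha, b)
lo = s_star(n, a, alpha - se_alpha, b)
hi = s_star(n, a, alpha + se_alpha, b)
print(center, lo, hi)  # the spread of the three optima indicates practical risk
```

Reporting this spread (or a bootstrap distribution of $s^*$) alongside Figure 2a would directly quantify the risk of the U-shaped objective.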
Clarifying the performance gap between MSE and residual variance
The paper reports a striking performance gap between the proposed loss function and the standard MSE baseline (18–54% variance reduction in Table 3). However, Remark 1 notes that with additive shifts and ideal training, minimizing MSE and minimizing residual variance should lead to essentially the same solution. This raises a question about the driver of the empirical gap: is the proposed loss function structurally superior, or is the MSE baseline under-performing due to regularization, early stopping, or architecture constraints?
To firmly establish the "new loss" as a general principle for train-for-inference contexts, it is important to document why the MSE baseline fails to drive down residual variance to the same degree. Ensuring the MSE baseline is "best-effort" (e.g., capable of learning the correct mean shift via an unregularized intercept/bias term) would protect the results from being interpreted as a fix for a specific constrained training setup rather than a fundamental theoretical advantage.
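The identity underlying Remark 1 is easy to verify numerically: the residual variance $\mathrm{Var}(Y - f(X) - c)$ is invariant to any additive correction $c$, while the MSE is minimized at $c$ equal to the mean residual. The toy data and predictor below are synthetic, intended only to illustrate why a best-effort MSE baseline with a free bias should be able to match the proposed objective.

```python
# Numeric illustration of Remark 1: an additive shift c leaves the residual
# variance unchanged, while MSE is minimized at c = mean residual.  Data and
# predictor are synthetic.

import random
random.seed(0)

xs = [random.gauss(0, 1) for _ in range(10_000)]
ys = [2.0 * x + 1.5 + random.gauss(0, 0.5) for x in xs]
f = lambda x: 2.0 * x                     # synthetic predictor missing the shift

def var(v):
    m = sum(v) / len(v)
    return sum((t - m) ** 2 for t in v) / len(v)

res = [y - f(x) for x, y in zip(xs, ys)]
for c in (0.0, 1.5, 3.0):                 # candidate additive corrections
    shifted = [r - c for r in res]
    mse = sum(t * t for t in shifted) / len(shifted)
    print(round(var(shifted), 4), round(mse, 4))
# residual variance is identical across c; MSE is smallest near c = 1.5
```

If the MSE baseline in the experiments cannot realize such a free shift (e.g., because of a regularized bias term), that alone could explain much of the gap in Table 3.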
Operationalizing the M-estimation extension
Section 5 outlines the extension to M-estimation effectively at a theoretical level, particularly regarding the D-optimal/trace scalarizations. However, as currently presented, the method may be difficult for a reader to implement. Unlike the mean estimation case, where the objective is clear, minimizing the determinant or trace of a covariance matrix involving $\psi(X, f^{(s)}(X))$ presents non-trivial challenges for stochastic gradient training, particularly regarding numerical stability and the unknown nature of $\theta^*$. Expanding this section to provide an explicit algorithmic recipe—perhaps involving plug-in estimates for $\hat\theta$ or an iterative procedure—would match the usability of the earlier sections.
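To illustrate the level of concreteness we would find helpful, here is one possible (and deliberately simplified) plug-in recipe for the trace scalarization: fit a plug-in $\hat\theta$ on the labeled data, evaluate the score contributions $\psi$ at $\hat\theta$, and scalarize their estimated covariance by its trace. The choice of $\psi$ as the OLS score, and all data, are our own illustrative assumptions, not the paper's prescription.

```python
# Sketch of a plug-in procedure for the trace scalarization of Section 5.
# psi(x, y, theta) = x * (y - x @ theta) is the OLS score, chosen here only
# as a concrete example; all data are synthetic.

import numpy as np
rng = np.random.default_rng(0)

n, d = 500, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
Y = X @ theta_true + rng.normal(scale=0.7, size=n)

theta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]     # plug-in estimate of theta*

psi = X * (Y - X @ theta_hat)[:, None]               # n x d score contributions
cov_psi = np.cov(psi, rowvar=False)                  # estimated Var(psi) at theta-hat
criterion = np.trace(cov_psi)                        # trace scalarization
print(theta_hat.round(2), round(float(criterion), 3))
```

An explicit recipe of this kind in Section 5, together with guidance on differentiating the criterion stably during stochastic gradient training, would bring the M-estimation extension up to the usability of the mean-estimation sections.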
Detailed Comments (12)
Our empirical analysis validates the fine-tuning scaling law and confirms that our proposed optimal allocation rule reliably identifies the optimal sample allocation.
Unlike the conventional objective that minimizes the mean squared prediction errors, we propose to minimize the variance of the prediction errors as the fine-tuning objective, which is optimal for the downstream rectification stage.
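It may help readers to see the quoted objective side by side with MSE in code. The sketch below is framework-agnostic plain Python (in practice this would be a differentiable loss inside the fine-tuning framework); the function names are ours.

```python
# Minimal sketch contrasting the two objectives on residuals r = y - f(x):
# MSE penalizes a shared additive bias, the variance-of-errors loss does not.

def mse_loss(residuals):
    return sum(r * r for r in residuals) / len(residuals)

def residual_variance_loss(residuals):
    m = sum(residuals) / len(residuals)
    return sum((r - m) ** 2 for r in residuals) / len(residuals)

res = [0.9, 1.1, 1.0, 1.2]          # residuals with a common additive bias
print(mse_loss(res), residual_variance_loss(res))
# the variance loss ignores the shared bias that MSE penalizes
```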
where the third equality is because $(Y_i - f^{(s)}(\boldsymbol{X}_i))$ and $f^{(s)}(\tilde{\boldsymbol{X}}_j)$ are independent.
We validate our framework through an empirical study using the Wine Reviews dataset to estimate the average ratings in the population. This task mirrors a day-to-day challenge faced by product teams. When retailers select which stock-keeping units (SKUs) to sell in their stores, they usually need to understand the perceived product quality at scale
$$
\begin{aligned}
& \mathrm{E}\left[(\widehat{\mu}-\mathrm{E}[Y])^{2}\right] \\
= & \mathrm{E}\left[\left(\frac{1}{n-s} \sum_{i=1}^{n-s}\left(Y_{i}-f^{(s)}\left(\boldsymbol{X}_{i}\right)\right)+\frac{1}{m} \sum_{j=1}^{m} f^{(s)}\left(\tilde{\boldsymbol{X}}_{j}\right)-\mathrm{E}[Y]\right)^{2}\right] \\
= & \mathrm{E}\left[\left(\frac{1}{n-s} \sum_{i=1}^{n-s}\left(Y_{i}-f^{(s)}\left(\boldsymbol{X}_{i}\right)+\mathrm{E}\left[f^{(s)}(\boldsymbol{X})\right]-\mathrm{E}[Y]\right)+\frac{1}{m} \sum_{j=1}^{m}\left(f^{(s)}\left(\tilde{\boldsymbol{X}}_{j}\right)-\mathrm{E}\left[f^{(s)}(\boldsymbol{X})\right]\right)\right)^{2}\right] \\
= & \frac{1}{(n-s)^{2}} \sum_{i=1}^{n-s} \mathrm{E}\left[\left(Y_{i}-f^{(s)}\left(\boldsymbol{X}_{i}\right)+\mathrm{E}\left[f^{(s)}(\boldsymbol{X})\right]-\mathrm{E}[Y]\right)^{2}\right]+\frac{1}{m^{2}} \sum_{j=1}^{m} \mathrm{E}\left[\left(f^{(s)}\left(\tilde{\boldsymbol{X}}_{j}\right)-\mathrm{E}\left[f^{(s)}(\boldsymbol{X})\right]\right)^{2}\right] \\
= & \frac{1}{n-s} \operatorname{Var}\left(Y-f^{(s)}(\boldsymbol{X})\right)+\frac{1}{m} \operatorname{Var}\left(f^{(s)}(\boldsymbol{X})\right)
\end{aligned}
$$
where the third equality is because $\left(Y_{i}-f^{(s)}\left(\boldsymbol{X}_{i}\right)\right)$ and $f^{(s)}\left(\tilde{\boldsymbol{X}}_{j}\right)$ are independent.
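A quick Monte Carlo sanity check of this decomposition, for a fixed (already fine-tuned) predictor $f$ and synthetic Gaussian data of our choosing:

```python
# Monte Carlo check of Var(mu-hat) = Var(Y - f(X))/(n-s) + Var(f(X))/m
# for a fixed predictor f.  Data-generating process is synthetic.

import random
random.seed(1)

f = lambda x: 0.8 * x                       # fixed predictor
n_s, m, reps = 100, 400, 3000               # n_s plays the role of n - s

def draw_xy():
    x = random.gauss(0, 1)
    return x, x + random.gauss(0, 0.5)      # Y = X + noise

ests = []
for _ in range(reps):
    lab = [draw_xy() for _ in range(n_s)]
    unl = [random.gauss(0, 1) for _ in range(m)]
    mu = sum(y - f(x) for x, y in lab) / n_s + sum(f(x) for x in unl) / m
    ests.append(mu)

mean_e = sum(ests) / reps
emp = sum((e - mean_e) ** 2 for e in ests) / reps
# Var(Y - f(X)) = Var(0.2*X + eps) = 0.04 + 0.25;  Var(f(X)) = 0.64
theory = (0.04 + 0.25) / n_s + 0.64 / m
print(round(emp, 5), round(theory, 5))      # the two should be close
```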
where the third equality is because $(Y_i - f^{(s)}(\boldsymbol{X}_i))$ and $f^{(s)}(\tilde{\boldsymbol{X}}_j)$ are independent.
Note that in our LLM framework, $X$ typically denotes the raw textual query given to LLMs. Here, with a slight abuse of notation, we let $\boldsymbol{X}$ be the feature vectors extracted from the text, e.g., product profile and price.
implies that $\boldsymbol{\theta}=\mathrm{E}\left[\boldsymbol{X} \boldsymbol{X}^{\top}\right]^{-1} \mathrm{E}[\boldsymbol{X} Y]$, which is the population ordinary least squares estimator.
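The quoted identity has a clean finite-sample analog that is easy to check: the moment-based estimate $\left(\frac{1}{n}\sum \boldsymbol{X}_i \boldsymbol{X}_i^\top\right)^{-1} \frac{1}{n}\sum \boldsymbol{X}_i Y_i$ coincides with ordinary least squares. Synthetic data below.

```python
# Sanity check: the sample analog of theta = E[X X^T]^{-1} E[X Y] matches OLS.

import numpy as np
rng = np.random.default_rng(42)

n, d = 2000, 3
X = rng.normal(size=(n, d))
Y = X @ np.array([1.0, 0.5, -1.5]) + rng.normal(size=n)

theta_moment = np.linalg.solve(X.T @ X / n, X.T @ Y / n)   # moment-based estimate
theta_ols = np.linalg.lstsq(X, Y, rcond=None)[0]           # ordinary least squares
print(np.allclose(theta_moment, theta_ols))
```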
It is easy to see that $\widehat{\boldsymbol{\theta}}$ as defined in (7) is a consistent estimator. To see this, we focus on the right-hand side of (7) when both $(n-s) \rightarrow+\infty$ and $m \rightarrow+\infty$. ... where the equality is because $\boldsymbol{X}_{i}$ from the small labeled dataset and $\tilde{\boldsymbol{X}}_{j}$ from the large unlabeled dataset are sampled from the same distribution. Since $l$ has a unique minimizer, Theorem 2.7 of Newey and McFadden [1994] ensures that $\widehat{\boldsymbol{\theta}}$ is consistent.
In the case when $m$ is much larger than $n-s$, the second term can be ignored, and we focus our attention primarily on the first term.
Taking the expectation over $f^{(s)}$ (which is equivalent to the definition of the variance terms in the problem setup), we obtain the final result.
$$
\begin{aligned}
\operatorname{Var}\left(\widehat{\mu}^{(s)} \mid f^{(s)}\right) & =\operatorname{Var}\left(\frac{1}{n-s} \sum_{i}\left(Y_{i}-f^{(s)}\left(X_{i}\right)\right) \,\middle|\, f^{(s)}\right)+\operatorname{Var}\left(\frac{1}{m} \sum_{j} f^{(s)}\left(\tilde{X}_{j}\right) \,\middle|\, f^{(s)}\right) \\
& =\frac{1}{n-s} \operatorname{Var}\left(Y-f^{(s)}(X) \mid f^{(s)}\right)+\frac{1}{m} \operatorname{Var}\left(f^{(s)}(X) \mid f^{(s)}\right)
\end{aligned}
$$
Taking the expectation over $f^{(s)}$ (which is equivalent to the definition of the variance terms in the problem setup), we obtain the final result.