Scaling laws are typically fit using a family of models with a narrow range of frozen hyper-parameter choices. In this work we study scaling laws using a wide range of architecture and hyper-parameter choices, and highlight their impact on resulting prescriptions. As a primary artifact of our research, we release the Gemstones: the most comprehensive open-source scaling law dataset to date, consisting of over 4000 checkpoints from transformers with up to 2 billion parameters; these models have been trained with different learning rates, cooldown schedules, and architectural shapes. Our checkpoints enable more complex studies of scaling, such as a law that predicts language modeling performance as a function of model width and depth. By examining the various facets of our model suite, we find that the prescriptions of scaling laws can be highly sensitive to the experimental design process and the specific model checkpoints used during fitting.
Our 22 Gemstone models range from 50M to 2B parameters, spanning 11 widths from 256 to 3072 and 18 depths from 3 to 80. For the main set of training runs, we train each model for 350B tokens of Dolma data with a context length of 2048. We open-source checkpoints for all models at 2-billion-token intervals. We also perform two ablations, one over the cooldown schedule and one over the learning rate, bringing the total to over 4,000 checkpoints.
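As a minimal sketch of how a single checkpoint could be loaded with the Hugging Face transformers library: the repository ID and revision tag below are hypothetical placeholders, not the exact release format, so consult the official release for the real naming scheme.

```python
# Minimal sketch of loading one Gemstone checkpoint with Hugging Face transformers.
# The repository ID and revision string are hypothetical placeholders; consult the
# official release for the exact model names and intermediate-checkpoint tags.
from transformers import AutoModelForCausalLM

repo_id = "tomg-group-umd/Gemstone-768x24"  # assumed "<width>x<depth>" naming
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    revision="step_50000",  # assumed tag for one of the 2B-token-interval checkpoints
)
```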
We fit a lower convex hull to our loss curves. This hull is supported only by a sparse set of optimal models, which naturally excludes sub-optimal models lying above the hull and makes the resulting scaling law far more robust to the choice of model sampling. The tokens-per-parameter prescription from our Approach 1 fit is also close to constant, as in Chinchilla, but slightly higher, suggesting that more tokens should be used per parameter. We also record the average time per step to create a GPU-hours axis, which can be used to fit time-optimal scaling laws.
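As an illustration of this filtering step (not the paper's exact fitting code), the lower convex hull of (log-FLOPs, loss) points can be computed with a standard monotone-chain pass, and only the checkpoints on the hull are kept for fitting:

```python
import numpy as np

def lower_convex_hull(x, y):
    """Return indices of the points on the lower convex hull of (x, y).

    Illustrative helper: x could be log-FLOPs and y the loss of each checkpoint;
    only the hull points would then be kept when fitting the scaling law.
    """
    order = np.argsort(x)
    hull = []  # indices (into the original arrays) forming the lower hull
    for i in order:
        # Drop the previous point while the turn hull[-2] -> hull[-1] -> i is
        # clockwise or collinear, i.e. while the chain is not convex from below.
        while len(hull) >= 2:
            o, a = hull[-2], hull[-1]
            cross = (x[a] - x[o]) * (y[i] - y[o]) - (y[a] - y[o]) * (x[i] - x[o])
            if cross > 0:  # counter-clockwise turn: keep hull[-1]
                break
            hull.pop()
        hull.append(i)
    return hull

# Example usage: keep only hull-optimal checkpoints before fitting.
# keep = lower_convex_hull(np.log(flops_per_checkpoint), loss_per_checkpoint)
```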
We consider a perturbation of the standard scaling law with additional terms to account for the impact of model width and depth:
$L(w, d, p, T) = \frac{A}{w^{\alpha}}+\frac{B}{d^{\beta}}+\frac{C}{p^{\gamma}}+\frac{D}{T^{\zeta}}+\varepsilon,$
where $w$ is the model width, $d$ the depth, $p$ the parameter count, and $T$ the number of training tokens. This form allows us to optimize over the width and depth terms when obtaining prescriptions.
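A minimal fitting sketch under stated assumptions: it takes per-checkpoint arrays of widths, depths, parameter counts, token counts, and losses, and minimizes a Chinchilla-style Huber loss in log space. The paper's exact objective, optimizer, and initialization sweep may differ; in practice one would restart from many initial points rather than the single `x0` used here.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def fit_width_depth_law(w, d, p, T, loss):
    """Fit L = A/w^a + B/d^b + C/p^g + D/T^z + eps by minimizing a Huber loss
    between log-predicted and log-observed losses (Chinchilla-style objective;
    illustrative only, not the paper's exact procedure)."""

    def objective(theta):
        logA, logB, logC, logD, a, b, g, z, logE = theta
        # Log of the sum of the four power-law terms plus the irreducible term,
        # computed stably with logsumexp.
        preds = logsumexp(np.stack([
            logA - a * np.log(w),
            logB - b * np.log(d),
            logC - g * np.log(p),
            logD - z * np.log(T),
            np.full_like(loss, logE),
        ]), axis=0)
        resid = preds - np.log(loss)
        delta = 1e-3  # Huber threshold, as in Hoffmann et al.
        return np.sum(np.where(np.abs(resid) <= delta,
                               0.5 * resid**2,
                               delta * (np.abs(resid) - 0.5 * delta)))

    x0 = np.array([5.0, 5.0, 5.0, 5.0, 0.5, 0.5, 0.5, 0.5, 0.0])
    return minimize(objective, x0, method="L-BFGS-B")
```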
We see that the prescribed width-to-depth ratio increases slowly as the FLOPs budget grows rapidly, consistent with observations in prior work.
In Figure 3 (right), we see that the optimal tokens per parameter more closely follows the prescription of Kaplan et al.: the prescribed tokens per parameter decreases as the number of FLOPs increases.
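As an illustration of how such prescriptions could be read off a fitted law, one can minimize the predicted loss over (width, depth) at a fixed FLOPs budget. The sketch below uses the rough approximations $p \approx 12 w^2 d$ and $\text{FLOPs} \approx 6pT$, and assumes a callable `law(w, d, p, T)` built from the fitted coefficients; the paper's FLOPs accounting and search procedure may be more detailed.

```python
import numpy as np

def prescribe(law, flops_budget,
              widths=np.arange(256, 8193, 64), depths=np.arange(3, 129)):
    """Grid-search the (width, depth) pair minimizing the fitted law's predicted
    loss at a fixed FLOPs budget. Uses the rough approximations p ~= 12*w^2*d and
    FLOPs ~= 6*p*T; `law(w, d, p, T)` is an assumed callable returning predicted loss."""
    best = None
    for w in widths:
        for d in depths:
            p = 12 * w**2 * d            # approximate non-embedding parameter count
            T = flops_budget / (6 * p)   # tokens affordable at this budget
            pred = law(w, d, p, T)
            if best is None or pred < best[0]:
                best = (pred, w, d, p, T)
    _, w, d, p, T = best
    return {"width": w, "depth": d,
            "width_depth_ratio": w / d,
            "tokens_per_parameter": T / p}
```

Sweeping `flops_budget` over several orders of magnitude then traces out how the prescribed width-to-depth ratio and tokens per parameter change with compute.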
@article{mcleish2024gemstones,
title={Gemstones: A Model Suite for Multi-Faceted Scaling Laws},
author={Sean McLeish and John Kirchenbauer and David Yu Miller and Siddharth Singh and Abhinav Bhatele and Micah Goldblum and Ashwinee Panda and Tom Goldstein},
journal={arXiv preprint arXiv:2502.06857},
year={2025},
url={https://arxiv.org/abs/2502.06857},
}