Scaling laws are typically fit using a family of models with a narrow range of frozen hyperparameter choices. In this work we study scaling laws using multiple architectural shapes and hyperparameter choices, highlighting their impact on the resulting prescriptions. As a primary artifact of our research, we release the Gemstones: an open-source scaling law dataset consisting of over 4,000 checkpoints from transformers with up to 2 billion parameters and diverse architectural shapes, including ablations over learning rate and cooldown. Our checkpoints enable more complex studies of scaling, such as analyzing the relationship between width and depth. By examining our model suite, we find that the prescriptions of scaling laws can be highly sensitive to the experimental design process and the specific model checkpoints used during fitting.
Our 22 Gemstone models range from 50M to 2B parameters, spanning 11 widths from 256 to 3072 and 18 depths from 3 to 80. For the main set of training runs, we train each model on 350B tokens of Dolma data with a context length of 2048, and we open-source checkpoints for all models at 2-billion-token intervals. We also perform two ablations, over cooldown and optimal learning rate, bringing the total to over 4,000 checkpoints.
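As a starting point, the sketch below loads one Gemstone checkpoint with Hugging Face `transformers`. The repository id and revision tag are illustrative assumptions (the released collection may use a different naming scheme), so consult the actual model cards for the exact identifiers.

```python
# Minimal sketch: loading one Gemstone checkpoint with Hugging Face transformers.
# The repository id and revision tag below are hypothetical placeholders;
# check the released collection for the actual naming scheme.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "tomg-group-umd/Gemstone-1280x36"  # hypothetical id: width 1280, depth 36
revision = "step_00100000"                   # hypothetical tag for an intermediate checkpoint

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, revision=revision)

inputs = tokenizer("Scaling laws describe", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```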
We fit a lower convex hull to our loss curves. Because this hull is supported only by a sparse set of optimal models, it naturally excludes suboptimal models that lie above the hull and makes the resulting scaling law far more robust to the choice of model sampling. The tokens-per-parameter prescription of our Approach 1 fit is also close to constant, like Chinchilla's, but slightly higher, suggesting more tokens should be used per parameter. We also record the average time per step to create a GPU-hours axis, which can be used to fit time-optimal scaling laws.
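As a rough sketch of this selection step, the snippet below computes the lower convex hull of (log-compute, loss) points and returns the indices of the runs on the frontier. The data here is placeholder toy data and the scan is a standard monotone-chain lower hull, not necessarily the exact procedure used for the released fits.

```python
import numpy as np

def lower_convex_hull(x, y):
    """Return indices of points on the lower convex hull, scanning left to right.

    x, y: 1-D arrays (e.g. log FLOPs and loss). Points above the hull are dropped,
    so only frontier runs support the subsequent scaling-law fit.
    """
    order = np.argsort(x)
    hull = []
    for i in order:
        # Pop the last hull point while it lies on or above the segment joining
        # its predecessor to the new point (i.e. it is not on the lower boundary).
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            cross = (x[b] - x[a]) * (y[i] - y[a]) - (y[b] - y[a]) * (x[i] - x[a])
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    return np.array(hull)

# Toy usage with placeholder data: log-FLOPs vs. final loss for each run.
log_flops = np.log(np.array([1e19, 3e19, 1e20, 3e20, 1e21]))
loss      = np.array([3.10, 2.95, 2.80, 2.83, 2.65])
frontier  = lower_convex_hull(log_flops, loss)
print(frontier)  # -> [0 1 2 4]; the run at 3e20 FLOPs lies above the hull and is excluded
```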
We plot the average length-normalized benchmark accuracy over the FineWeb benchmarks for the Gemstones at 200, 250, 300, and 350 billion tokens. We see that the 1B-scale models ($1280 \times 36$, $2560 \times 8$, and $1792 \times 18$) yield increasing accuracy with depth when constrained to approximately the same FLOP budget (vertically aligned points). Recent work suggests that deeper layers in networks "do less" than shallower ones and can be pruned away, but our downstream evaluations suggest that there are also advantages to additional model depth.
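To see why those three shapes are approximately FLOP-matched at a fixed token count, here is a back-of-the-envelope check using the common approximations of about 12 · depth · width² non-embedding parameters and 6 · N · D training FLOPs; these are rules of thumb for illustration, not the paper's exact parameter and FLOP accounting.

```python
# Rough FLOP-matching check for the three ~1B-scale Gemstone shapes, using the
# common approximations N ≈ 12 * depth * width^2 (non-embedding parameters)
# and C ≈ 6 * N * D training FLOPs. Rules of thumb only, not the paper's
# exact accounting.
shapes = {"1280x36": (1280, 36), "2560x8": (2560, 8), "1792x18": (1792, 18)}
tokens = 350e9  # tokens seen in the main training runs

for name, (width, depth) in shapes.items():
    n_params = 12 * depth * width**2
    flops = 6 * n_params * tokens
    print(f"{name:>8}: ~{n_params / 1e9:.2f}B params, ~{flops:.2e} training FLOPs")
# All three land within roughly 15% of one another, which is why they appear as
# approximately vertically aligned points at a fixed token budget.
```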
@article{mcleish2024gemstones,
  title={Gemstones: A Model Suite for Multi-Faceted Scaling Laws},
  author={Sean McLeish and John Kirchenbauer and David Yu Miller and Siddharth Singh and Abhinav Bhatele and Micah Goldblum and Ashwinee Panda and Tom Goldstein},
  journal={arXiv preprint arXiv:2502.06857},
  year={2025},
  url={https://arxiv.org/abs/2502.06857},
}