Gemstones 💎: A Model Suite for Multi-Faceted Scaling Laws

Sean McLeish♠️    John Kirchenbauer♠️    David Yu Miller♠️    Siddharth Singh♠️    Abhinav Bhatele♠️    Micah Goldblum♣    Ashwinee Panda♠️
Tom Goldstein♠️   
♠️ University of Maryland    ♣ Columbia University

Abstract

Scaling laws are typically fit using a family of models with a narrow range of frozen hyper-parameter choices. In this work we study scaling laws using a wide range of architecture and hyper-parameter choices, and highlight their impact on resulting prescriptions. As a primary artifact of our research, we release the Gemstones: the most comprehensive open-source scaling law dataset to date, consisting of over 4000 checkpoints from transformers with up to 2 billion parameters; these models have been trained with different learning rates, cooldown schedules, and architectural shapes. Our checkpoints enable more complex studies of scaling, such as a law that predicts language modeling performance as a function of model width and depth. By examining the various facets of our model suite, we find that the prescriptions of scaling laws can be highly sensitive to the experimental design process and the specific model checkpoints used during fitting.

The Gemstones

Our 22 Gemstone models range from 50M to 2B parameters, spanning 11 widths from 256 to 3072 and 18 depths from 3 to 80. For the main set of training runs, we train each model for 350B tokens of Dolma data with a context length of 2048. We open-source checkpoints for all models at 2-billion-token intervals. We also perform two ablations, over the cooldown schedule and the optimal learning rate, bringing the total to over 4,000 checkpoints.
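A minimal sketch of loading one checkpoint, assuming the models are hosted on the Hugging Face Hub; the repository ID and revision string below are hypothetical placeholders, not the exact naming scheme of the release.

```python
# Minimal sketch: load one Gemstone checkpoint with Hugging Face transformers.
# The repo ID and revision are hypothetical placeholders; consult the release
# for the actual width-by-depth naming scheme and checkpoint revision tags.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "tomg-group-umd/Gemstone-1280x15"  # hypothetical "width x depth" name
revision = "step_20000"                      # hypothetical 2B-token-interval tag

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, revision=revision)
```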


Figure 1: Distribution of prior scaling law models, industry models, and our models in terms of width and depth. Prior work (purple and green) and industry models (blue and orange) mostly lie on a fixed width-depth line. If we want to prescribe the optimal width-depth ratio, we need to select models with different widths and depths (our models, black).

Approach 1

We fit a lower convex hull to our loss curves. This hull is supported by only a sparse set of optimal models, which naturally excludes the sub-optimal models lying above it and makes the resulting scaling law far more robust to the choice of model sampling. The tokens-per-parameter prescription from our Approach 1 fit is close to constant, like Chinchilla's, but slightly higher, suggesting that more tokens should be used per parameter. We also record the average time per step to create a GPU-hours axis, which can be used to fit time-optimal scaling laws.
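As a concrete illustration of the hull-based selection (the details below are our own sketch, not necessarily the exact implementation used for our fits), the lower convex hull of the (log FLOPs, loss) points can be computed with a monotone-chain pass that keeps only the runs supporting the hull:

```python
import numpy as np

def lower_convex_hull(flops, loss):
    """Return the indices of runs that support the lower convex hull of
    (log FLOPs, loss); points above the hull are sub-optimal models."""
    flops = np.asarray(flops, dtype=float)
    loss = np.asarray(loss, dtype=float)
    order = np.argsort(flops)                 # sort by compute
    x, y = np.log(flops[order]), loss[order]
    hull = []                                 # indices into the sorted arrays
    for i in range(len(x)):
        # Pop the last hull point while it lies on or above the segment from
        # the second-to-last hull point to the new point: such a point cannot
        # support the lower hull.
        while len(hull) >= 2:
            j, k = hull[-2], hull[-1]
            cross = (x[k] - x[j]) * (y[i] - y[j]) - (y[k] - y[j]) * (x[i] - x[j])
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    return order[np.array(hull)]
```

The returned indices are the runs that anchor the Approach 1 fit; everything above the hull is discarded.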


Figure 2: Approach 1 prescriptions. Row one: Validation loss over FLOPs (left) and GPU hours (right). We use Approach 1 to find the optimal points on the convex hull in each setting, marked with black crosses. Row two: We fit a line to the tokens per parameter of the empirically optimal models and find a tokens-per-parameter prescription that is slightly higher than Chinchilla's but still roughly constant. Chinchilla's Approach 1 creates 250 logarithmically spaced FLOPs bins per order of magnitude; in red we plot the minimizers over these bins and the scaling law fitted to those minimizers. Their Approach 1 is clearly not well-suited to our data, and our convex hull approach is better suited when fewer models are available to fit the law.
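For contrast, here is a minimal sketch of the binning baseline described in the caption, under the assumption that each bin's minimizer is simply the lowest-loss run falling in that FLOPs bin (other details of Chinchilla's procedure are not reproduced here):

```python
import numpy as np

def bin_minimizers(flops, loss, bins_per_decade=250):
    """Split the FLOPs axis into logarithmically spaced bins and return the
    index of the lowest-loss run in each non-empty bin."""
    flops = np.asarray(flops, dtype=float)
    loss = np.asarray(loss, dtype=float)
    lo = np.log10(flops.min())
    hi = np.log10(flops.max()) + 1e-9   # nudge so the max falls inside the last bin
    n_bins = max(1, int(np.ceil((hi - lo) * bins_per_decade)))
    edges = np.logspace(lo, hi, n_bins + 1)
    bin_of = np.clip(np.digitize(flops, edges) - 1, 0, n_bins - 1)
    minimizers = []
    for b in range(n_bins):
        members = np.flatnonzero(bin_of == b)
        if members.size:
            minimizers.append(members[np.argmin(loss[members])])
    return np.array(minimizers)
```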

Approach 3

We consider a perturbation of the standard scaling law with additional terms to account for the impact of model width and depth:
$L(w, d, p, T) = \frac{A}{w^{\alpha}} + \frac{B}{d^{\beta}} + \frac{C}{p^{\gamma}} + \frac{D}{T^{\zeta}} + \varepsilon$, where $w$ is the model width, $d$ the depth, $p$ the parameter count, and $T$ the number of training tokens.
This allows us to optimize over the width and depth terms when obtaining prescriptions. We see that the prescribed width-depth ratio increases only slowly even as the FLOPs budget grows rapidly, something observed in prior work. In Figure 3 (right), we see that the optimal tokens per parameter more closely follows the prescription of Kaplan et al.: the prescribed tokens per parameter decreases as the FLOPs budget increases.
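A minimal sketch of how such a parametric law could be fit, assuming a Huber loss on log-space residuals, exp-parameterized coefficients, and L-BFGS-B; these fitting choices are illustrative and not necessarily the exact procedure used for our fits.

```python
import numpy as np
from scipy.optimize import minimize

def huber(r, delta=1e-3):
    """Elementwise Huber penalty used for robust fitting in log space."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r ** 2, delta * (a - 0.5 * delta))

def fit_width_depth_law(w, d, p, T, loss):
    """Fit L = A/w^alpha + B/d^beta + C/p^gamma + D/T^zeta + eps by minimizing
    a Huber loss between log-predictions and observed log-losses (illustrative)."""
    w, d, p, T = (np.asarray(x, dtype=float) for x in (w, d, p, T))
    log_loss = np.log(np.asarray(loss, dtype=float))

    def objective(theta):
        la, lb, lc, ld, le, alpha, beta, gamma, zeta = theta
        # Coefficients are parameterized as exp(.) so they stay positive.
        pred = (np.exp(la) / w ** alpha + np.exp(lb) / d ** beta
                + np.exp(lc) / p ** gamma + np.exp(ld) / T ** zeta + np.exp(le))
        return huber(np.log(pred) - log_loss).sum()

    theta0 = np.array([0.0, 0.0, 0.0, 0.0, -1.0, 0.3, 0.3, 0.3, 0.3])
    return minimize(objective, theta0, method="L-BFGS-B").x
```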


Figure 3: Approach 3 laws with the parametrization shown above, which includes width and depth as terms. We see that the prescribed optimal width-depth ratio increases with the FLOPs budget (left) and that the optimal tokens per parameter decreases as the FLOPs budget increases (right). The slight bumpiness in the lines is due to the integer constraints we enforce on the number of attention heads.
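The integer constraint mentioned in the caption amounts to snapping each continuous width prescription to a whole number of attention heads; a small sketch, assuming a fixed head dimension of 128 (an illustrative value, not necessarily the one used in our architectures):

```python
def snap_width_to_heads(width, head_dim=128):
    """Round a continuous width prescription to the nearest multiple of the
    head dimension, so it corresponds to an integer number of attention heads.
    The default head_dim is an illustrative assumption."""
    n_heads = max(1, round(width / head_dim))
    return n_heads * head_dim
```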

Reference

Please cite our paper if you use our code or results:

    @article{mcleish2024gemstones,
        title={Gemstones: A Model Suite for Multi-Faceted Scaling Laws},
        author={Sean McLeish and John Kirchenbauer and David Yu Miller and Siddharth Singh and Abhinav Bhatele and Micah Goldblum and Ashwinee Panda and Tom Goldstein},
        journal={arXiv preprint arXiv:2502.06857},
        year={2025},
        url={https://arxiv.org/abs/2502.06857},
    }