Gemstones 💎: A Model Suite for Multi-Faceted Scaling Laws

Sean McLeish♠️    John Kirchenbauer♠️    David Yu Miller♠️    Siddharth Singh♠️    Abhinav Bhatele♠️    Micah Goldblum♣    Ashwinee Panda♠️
Tom Goldstein♠️   
♠️ University of Maryland    ♣ Columbia University

Abstract

Scaling laws are typically fit using a family of models with a narrow range of frozen hyper-parameter choices. In this work we study scaling laws using a wide range of architecture and hyper-parameter choices, and highlight their impact on resulting prescriptions. As a primary artifact of our research, we release the Gemstones: the most comprehensive open-source scaling law dataset to date, consisting of over 4000 checkpoints from transformers with up to 2 billion parameters; these models have been trained with different learning rates, cooldown schedules, and architectural shapes. Our checkpoints enable more complex studies of scaling, such as a law that predicts language modeling performance as a function of model width and depth. By examining the various facets of our model suite, we find that the prescriptions of scaling laws can be highly sensitive to the experimental design process and the specific model checkpoints used during fitting.

The Gemstones

Our 22 Gemstone models range from 50M to 2B parameters, spanning 11 widths from 256 to 3072 and 18 depths from 3 to 80. For the main set of training runs, we train each model for 350B tokens of Dolma data with a context length of 2048. We open-source checkpoints for all models at 2-billion-token intervals. We also perform two ablations, over the cooldown schedule and the optimal learning rate, bringing the total to over 4,000 checkpoints.
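A minimal sketch of loading one checkpoint, assuming the models are hosted on the Hugging Face Hub; the repository ID and revision string below are hypothetical placeholders, not the exact naming scheme of the release.

```python
# Minimal sketch: load one Gemstone checkpoint with Hugging Face transformers.
# The repo ID and revision are hypothetical placeholders; consult the release
# for the actual width-by-depth naming scheme and checkpoint revision tags.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "tomg-group-umd/Gemstone-1280x15"  # hypothetical "width x depth" name
revision = "step_20000"                      # hypothetical 2B-token-interval tag

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, revision=revision)
```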


Figure 1: Distribution of prior scaling law models, industry models, and our models in terms of width and depth. Prior work (purple and green) and industry models (blue and orange) mostly lie on a fixed width-depth line. If we want to prescribe the optimal width-depth ratio, we need to select models with different widths and depths (our models, black).

Approach 1

We fit a lower convex hull to our loss curves. This hull is supported by only a sparse set of optimal models, which naturally excludes the sub-optimal models lying above it and makes the resulting scaling law far more robust to the choice of model sampling. The tokens-per-parameter prescription from our Approach 1 fit is close to constant, like Chinchilla's, but slightly higher, suggesting that more tokens should be used per parameter. We also record the average time per step to create a GPU-hours axis, which can be used to fit time-optimal scaling laws.
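As a concrete illustration of the hull-based selection (the details below are our own sketch, not necessarily the exact implementation used for our fits), the lower convex hull of the (log FLOPs, loss) points can be computed with a monotone-chain pass that keeps only the runs supporting the hull:

```python
import numpy as np

def lower_convex_hull(flops, loss):
    """Return the indices of runs that support the lower convex hull of
    (log FLOPs, loss); points above the hull are sub-optimal models."""
    flops = np.asarray(flops, dtype=float)
    loss = np.asarray(loss, dtype=float)
    order = np.argsort(flops)                 # sort by compute
    x, y = np.log(flops[order]), loss[order]
    hull = []                                 # indices into the sorted arrays
    for i in range(len(x)):
        # Pop the last hull point while it lies on or above the segment from
        # the second-to-last hull point to the new point: such a point cannot
        # support the lower hull.
        while len(hull) >= 2:
            j, k = hull[-2], hull[-1]
            cross = (x[k] - x[j]) * (y[i] - y[j]) - (y[k] - y[j]) * (x[i] - x[j])
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    return order[np.array(hull)]
```

The returned indices are the runs that anchor the Approach 1 fit; everything above the hull is discarded.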


Figure 2: Approach 1 prescriptions. Row one: Validation loss over FLOPs (left) and GPU hours (right). We use Approach 1 to find the optimal points on the convex hull in each setting, marked with black crosses. Row two: We fit a line to the tokens per parameter of the empirically optimal models and find a tokens-per-parameter prescription that is slightly higher than Chinchilla's but still roughly constant. Chinchilla's Approach 1 creates 250 logarithmically spaced FLOPs bins per order of magnitude; in red we plot the minimizers over these bins and the scaling law fitted to those minimizers. Their Approach 1 is clearly not well-suited to our data, and our convex hull approach is better suited when fewer models are available to fit the law.
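For contrast, here is a minimal sketch of the binning baseline described in the caption, under the assumption that each bin's minimizer is simply the lowest-loss run falling in that FLOPs bin (other details of Chinchilla's procedure are not reproduced here):

```python
import numpy as np

def bin_minimizers(flops, loss, bins_per_decade=250):
    """Split the FLOPs axis into logarithmically spaced bins and return the
    index of the lowest-loss run in each non-empty bin."""
    flops = np.asarray(flops, dtype=float)
    loss = np.asarray(loss, dtype=float)
    lo = np.log10(flops.min())
    hi = np.log10(flops.max()) + 1e-9   # nudge so the max falls inside the last bin
    n_bins = max(1, int(np.ceil((hi - lo) * bins_per_decade)))
    edges = np.logspace(lo, hi, n_bins + 1)
    bin_of = np.clip(np.digitize(flops, edges) - 1, 0, n_bins - 1)
    minimizers = []
    for b in range(n_bins):
        members = np.flatnonzero(bin_of == b)
        if members.size:
            minimizers.append(members[np.argmin(loss[members])])
    return np.array(minimizers)
```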

Approach 3

We consider a perturbation of the standard scaling law with additional terms to account for the impact of model width and depth:
$L(w, d, p, T) = \frac{A}{w^{\alpha}} + \frac{B}{d^{\beta}} + \frac{C}{p^{\gamma}} + \frac{D}{T^{\zeta}} + \varepsilon$, where $w$ is the model width, $d$ the depth, $p$ the parameter count, and $T$ the number of training tokens.
This allows us to optimize over the width and depth terms when obtaining prescriptions. We see that the prescribed width-depth ratio increases only slowly even as the FLOPs budget grows rapidly, something observed in prior work. In Figure 3 (right), we see that the optimal tokens per parameter more closely follows the prescription of Kaplan et al.: the prescribed tokens per parameter decreases as the FLOPs budget increases.
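A minimal sketch of how such a parametric law could be fit, assuming a Huber loss on log-space residuals, exp-parameterized coefficients, and L-BFGS-B; these fitting choices are illustrative and not necessarily the exact procedure used for our fits.

```python
import numpy as np
from scipy.optimize import minimize

def huber(r, delta=1e-3):
    """Elementwise Huber penalty used for robust fitting in log space."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r ** 2, delta * (a - 0.5 * delta))

def fit_width_depth_law(w, d, p, T, loss):
    """Fit L = A/w^alpha + B/d^beta + C/p^gamma + D/T^zeta + eps by minimizing
    a Huber loss between log-predictions and observed log-losses (illustrative)."""
    w, d, p, T = (np.asarray(x, dtype=float) for x in (w, d, p, T))
    log_loss = np.log(np.asarray(loss, dtype=float))

    def objective(theta):
        la, lb, lc, ld, le, alpha, beta, gamma, zeta = theta
        # Coefficients are parameterized as exp(.) so they stay positive.
        pred = (np.exp(la) / w ** alpha + np.exp(lb) / d ** beta
                + np.exp(lc) / p ** gamma + np.exp(ld) / T ** zeta + np.exp(le))
        return huber(np.log(pred) - log_loss).sum()

    theta0 = np.array([0.0, 0.0, 0.0, 0.0, -1.0, 0.3, 0.3, 0.3, 0.3])
    return minimize(objective, theta0, method="L-BFGS-B").x
```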


Figure 3: Approach 3 laws with the parametrization shown above, which includes width and depth as terms. We see that the prescribed optimal width-depth ratio increases with the FLOPs budget (left) and that the optimal tokens per parameter decreases as the FLOPs budget increases (right). The slight bumpiness in the lines is due to the integer constraints we enforce on the number of attention heads.
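The integer constraint mentioned in the caption amounts to snapping each continuous width prescription to a whole number of attention heads; a small sketch, assuming a fixed head dimension of 128 (an illustrative value, not necessarily the one used in our architectures):

```python
def snap_width_to_heads(width, head_dim=128):
    """Round a continuous width prescription to the nearest multiple of the
    head dimension, so it corresponds to an integer number of attention heads.
    The default head_dim is an illustrative assumption."""
    n_heads = max(1, round(width / head_dim))
    return n_heads * head_dim
```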

Reference

Please cite our paper if you use our code or results:

    @article{mcleish2024gemstones,
        title={Gemstones: A Model Suite for Multi-Faceted Scaling Laws},
        author={Sean McLeish and John Kirchenbauer and David Yu Miller and Siddharth Singh and Abhinav Bhatele and Micah Goldblum and Ashwinee Panda and Tom Goldstein},
        journal={arXiv preprint arXiv:2502.06857},
        year={2025},
        url={https://arxiv.org/abs/2502.06857},
    }