The authors evaluate synthetic math datasets that exhibit inter-model variability to assess how well they align with downstream tasks, such as solving math problems on a real benchmark. Using the GSM8K-Synthetic dataset, they measure, across models, the correlation between performance on the synthetic task and performance on the downstream task, and find a strong logarithmic relationship between the two. The strongest correlation is with downstream GSM8K performance, followed closely by MMLU, suggesting that the synthetic dataset taps into the same math reasoning capabilities required to do well on the real benchmark. This approach can be used to sanity-check whether a synthetic dataset is engaging the same set of skills as the target task, and the authors plan to explore using these signals to improve the quality of the trained model in future work.
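To make the analysis concrete, the sketch below shows one way such a correlation study could be run: fit a logarithmic curve relating per-model accuracy on the synthetic task to accuracy on the downstream benchmark, then compute the correlation on the log-transformed synthetic scores. This is an illustrative sketch, not the authors' code; the accuracy values, model count, and the exact curve-fitting procedure are all assumptions for demonstration.

```python
# Illustrative sketch (hypothetical data): relating per-model synthetic-task
# accuracy to downstream GSM8K accuracy via a logarithmic fit.
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr

# Hypothetical per-model accuracies on the synthetic task and on GSM8K.
synthetic_acc = np.array([0.32, 0.41, 0.55, 0.63, 0.71, 0.80])
gsm8k_acc = np.array([0.18, 0.27, 0.38, 0.44, 0.50, 0.57])

def log_curve(x, a, b):
    # Assumed functional form: y = a * ln(x) + b
    return a * np.log(x) + b

# Fit the logarithmic relationship across models.
(a, b), _ = curve_fit(log_curve, synthetic_acc, gsm8k_acc)

# Correlation between log(synthetic accuracy) and downstream accuracy;
# a high value suggests the synthetic task tracks the real benchmark.
r, p_value = pearsonr(np.log(synthetic_acc), gsm8k_acc)

print(f"fit: gsm8k ~ {a:.2f} * ln(synthetic) + {b:.2f}")
print(f"Pearson r (log-transformed synthetic accuracy): {r:.3f}, p = {p_value:.3g}")
```

The same procedure could be repeated against other downstream benchmarks (e.g., MMLU) to compare which target task the synthetic dataset correlates with most strongly.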