Synthetic data can never contain more information than the statistical model fro...

viraptor · 2024-10-14T05:51:21 1728885081

It's already been proven possible: https://arxiv.org/abs/2203.14465

og_kalu · 2024-10-14T00:30:17 1728865817

>Synthetic data can never contain more information than the statistical model from which it is derived: it is simply the evaluation of a non-deterministic function on the model parameters. And the model parameters are simply a function of the training data.

The information in the data isn't just in the outputs themselves but in their rate of occurrence, i.e. their distribution. If your base model has learnt only enough to produce the occasional flash of brilliance, say 1 in 40 responses, and you can filter for those responses and generate as many as you like, then you can very much 'bootstrap a better model' by training on the filtered results. You only get a plain function of the model's parameters if you train on its unfiltered, unaltered output.
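A toy sketch of the filtering argument, not anyone's actual training pipeline. Everything here is an assumption made for illustration: `sample_base_model` is a hypothetical stand-in that emits a "good" response with probability 1/40, and `keep_good` is a hypothetical filter that is assumed to recognise good responses perfectly. It only shows that the filtered set has a different distribution from the raw output.

```python
import random


def sample_base_model(rng, p_good=1 / 40):
    """Hypothetical base model: True means a 'good' response, False a poor one."""
    return rng.random() < p_good


def keep_good(samples):
    """Hypothetical filter, assumed perfect: keeps only the good responses."""
    return [s for s in samples if s]


def good_fraction(samples):
    """Share of good responses in a list of samples."""
    return sum(samples) / len(samples) if samples else 0.0


rng = random.Random(0)  # fixed seed so the run is repeatable

# Generate as much raw output as we like.
raw = [sample_base_model(rng) for _ in range(40_000)]

# Keep only the responses that pass the filter.
filtered = keep_good(raw)

raw_rate = good_fraction(raw)            # close to 1/40
filtered_rate = good_fraction(filtered)  # 1.0 under the perfect-filter assumption

print(f"raw: {raw_rate:.3f} good, filtered: {filtered_rate:.3f} good, "
      f"{len(filtered)} training examples kept")
```

The raw samples are good about 1 time in 40, while the filtered set is all good, so a model trained on `filtered` sees a distribution the base model rarely produces on its own. A real filter would be imperfect, and how good it is bounds how much this helps.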