The point is that the article states "averaged across 4 diverse customer tasks, fine-tunes based on our new model are slightly stronger than GPT-4, as measured by GPT-4 itself" and then backs it up with nothing tangible, just the 4 selected metrics where it performs best. I mean, obviously a finetuned 7B LLM could do, let's say, text summarization well. The question is what happens if that text contains code, or domain-specific knowledge where some facts are less relevant than others, etc., and that isn't going to be answered by any metric alone. Fundamentally, with enough diverse metrics, each based on a different dataset, the metric whose dataset overlaps most with the finetuning data will look really good, and the rest, well, not so good.
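
To make that concrete, here's a toy sketch of the overlap effect (all names and data below are made up for illustration, not anything from the article): rank each eval set by n-gram overlap with the finetuning corpus, and the one with the highest overlap is exactly where you'd expect the finetune to "beat" GPT-4.

    # Toy illustration: which eval set overlaps most with the
    # (hypothetical) finetuning data? High overlap predicts an
    # inflated score on that benchmark.
    def ngrams(text, n=3):
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def overlap(a_text, b_text, n=3):
        # Jaccard similarity of the two n-gram sets
        a, b = ngrams(a_text, n), ngrams(b_text, n)
        return len(a & b) / max(len(a | b), 1)

    finetune_corpus = "summarize the following news article about markets"
    eval_sets = {
        "summarization": "summarize the following news article about sports",
        "code_qa": "explain what this python function returns",
    }

    for name, text in eval_sets.items():
        print(name, round(overlap(finetune_corpus, text), 2))
    # -> summarization 0.67, code_qa 0.0

Run it and the summarization benchmark comes out far closer to the finetuning data than the code one, which is the whole selection-bias problem in miniature.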

Basically, the statistic means that there's a set of data on which that particular (finetuned) network performs slightly better than GPT-4, and everywhere else it performs pretty badly. It's just not generalizable to everything, while GPT-4 is. It's about as meaningful as saying "calculators outperform GPT-4 at counting". Like, yes, they probably do, but what I'd like to see is whether it's applicable and practical, or whether you just trained an LLM to sort Polish names alphabetically really well. And that's why a qualitative approach to evaluating LLMs is just better.



