You've described the ideal use case: a single, short-lived feature flag that lets select users test one isolated piece of functionality until it's made generally available. Feature flags used in this way are wonderful.
But there are numerous ways to use feature flags incorrectly - typically once you have multiple long-lived flags that interact with each other, you've lost the thread. You no longer have one single application; with n boolean flags you have up to 2^n applications that all behave in subtly different ways depending on the interaction of the flags.
There's no way around it - you have to test all branches of your code somehow. "Just let the users find the bugs" doesn't work in this case since each user can only test their unique combination of flags. I've regularly seen default and QA tester flag configurations work great, only to have a particular combination fail for customers.
The only solution is setting up a full integration test for every combination of flags. If that sounds tedious (and it is), the solution is to avoid feature flags, not to avoid testing them!
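For concreteness, here's roughly what that looks like with pytest (the flag names and the `run_checkout` entry point are made up for the sketch): one integration test parametrized over every combination of boolean flags, so the combinatorial cost is visible right in the test count.

```python
# Minimal sketch: run one integration test per flag combination.
# Flag names and run_checkout are hypothetical stand-ins.
from itertools import product
from types import SimpleNamespace

import pytest

FLAGS = ["new_pricing", "fast_search", "beta_ui"]  # hypothetical flags


def run_checkout(flags):
    # Stand-in for the real system under test.
    return SimpleNamespace(ok=True)


def all_flag_combinations():
    # 2**len(FLAGS) combinations: 3 flags -> 8 cases, 10 flags -> 1024.
    for values in product([False, True], repeat=len(FLAGS)):
        yield dict(zip(FLAGS, values))


@pytest.mark.parametrize(
    "flags",
    list(all_flag_combinations()),
    ids=lambda f: "-".join(f"{k}={int(v)}" for k, v in f.items()),
)
def test_checkout_under_all_flags(flags):
    assert run_checkout(flags).ok
```

The test count doubling with every added flag is exactly the pressure that keeps flags short-lived.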
> The only solution is setting up a full integration test for every combination of flags.
I've long wondered whether there are tools that help with that - something like measuring a test suite's code coverage, but for feature-toggle permutations. Either you test those permutations explicitly or you rule them out explicitly.
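I don't know of an existing tool for this, but the core idea seems simple enough to sketch (all names here are made up): record every flag combination a test actually exercises, then diff against the full space to report what was never tested.

```python
# Sketch of "permutation coverage": record each flag combination
# that a test exercises, then report the ones never seen.
from itertools import product

FLAGS = ["new_pricing", "fast_search", "beta_ui"]  # hypothetical flags
seen: set[tuple[bool, ...]] = set()


def record(flags: dict[str, bool]) -> None:
    # Call wherever flags are resolved, e.g. from a test fixture.
    seen.add(tuple(flags[name] for name in FLAGS))


def coverage_report() -> None:
    universe = set(product([False, True], repeat=len(FLAGS)))
    missed = universe - seen
    print(f"covered {len(seen)}/{len(universe)} permutations")
    for combo in sorted(missed):
        print("untested:", dict(zip(FLAGS, combo)))
```

The "rule them out explicitly" half would be shrinking the universe before the diff - e.g. dropping combinations declared mutually exclusive - rather than testing them.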
Long-lived feature flags are totally fine; they're more like operational flags than anything. The Fowler article is pretty good at classifying them: depending on the type of flag (longevity/dynamism), the design will vary. https://martinfowler.com/articles/feature-toggles.html
An essential property of a feature flag is that it is short-lived, existing only for the duration of the roll-out of the feature. In the language of your linked article, feature flags are 1-to-1 with "release toggles" and not really any other kind of toggle.
The problem is when you use feature flags for customer-bespoke reasons or to enable paid features. Then they’re always there and have to be tested in combination, which sucks.
Yeah, those things are called "user settings". If you need them, you need them, but pretending they're feature flags and porting flag-development practices onto your settings will lead to nothing but tears.
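To make the distinction concrete, a rough sketch (names made up): a release flag carries an owner and a removal deadline, while a user setting is ordinary persistent state that you model, migrate, and test like any other data.

```python
# Rough sketch of the distinction; all names are hypothetical.
from dataclasses import dataclass
from datetime import date


@dataclass
class ReleaseFlag:
    name: str
    owner: str
    remove_by: date  # the flag is a bug if it outlives this date

    def is_overdue(self) -> bool:
        return date.today() > self.remove_by


@dataclass
class UserSettings:
    # Long-lived, per-customer state: part of the data model,
    # not something toggled in and out of the codebase.
    plan: str = "free"
    dark_mode: bool = False
```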
Echoing sibling comments, feature flags are about managing the deployment of new product capabilities, and should always be short-lived. They're not an appropriate choice for any kind of long-lived capability, like anything that's per-customer, paid vs. non-paid, etc. Using feature flags for those kinds of things is a classic design mistake.