
Why do my winning tests stop improving results after rollout?

Context: The lift measurement window captures the peak of user response, which includes a novelty component that inflates the number. After rollout, performance regresses toward a sustained level substantially below the initial lift.
Symptom: An initial experiment lift of 22% has decayed to a 4% sustained post-rollout lift, a Lift Retention Rate of 0.18, indicating novelty-driven results.
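The Lift Retention Rate above is just the sustained lift divided by the initial lift. A minimal sketch, assuming the function name and the flagging threshold are illustrative rather than any standard definition:

```python
# Lift Retention Rate: sustained post-rollout lift divided by initial
# experiment lift. A value far below 1.0 suggests the initial result
# was largely novelty-driven. The 0.5 flag threshold is an assumption
# for illustration, not an industry standard.

def lift_retention_rate(initial_lift: float, sustained_lift: float) -> float:
    if initial_lift == 0:
        raise ValueError("initial lift must be nonzero")
    return sustained_lift / initial_lift

lrr = lift_retention_rate(initial_lift=0.22, sustained_lift=0.04)
print(round(lrr, 2))          # → 0.18
print(lrr < 0.5)              # → True: flag as likely novelty-driven
```

With the numbers from this episode, 0.04 / 0.22 ≈ 0.18, matching the reported retention rate.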
Cause: A/B test measurement windows capture the novelty peak, when users first encounter the changed experience, and close before behavioral habituation completes. As a result, the initial lift substantially overstates the durable performance improvement.
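One way to surface this habituation effect inside the experiment itself is to segment lift by how long users have been exposed to the change. A minimal sketch, where the data, the cohort split, and all values are hypothetical:

```python
# Sketch: compare lift among users in their first week of exposure versus
# users well past it. A large drop from early to late lift is consistent
# with a novelty effect. The sample data and the 0-7 / 14+ day cohorts
# are assumptions for illustration.

def lift(treatment: list[float], control: list[float]) -> float:
    """Relative lift of treatment mean over control mean."""
    t_mean = sum(treatment) / len(treatment)
    c_mean = sum(control) / len(control)
    return t_mean / c_mean - 1

# Per-user conversion rates, bucketed by days since first exposure
early_t, early_c = [0.30, 0.28, 0.31], [0.24, 0.25, 0.23]  # days 0-7
late_t, late_c = [0.26, 0.25, 0.24], [0.24, 0.25, 0.23]    # days 14+

early_lift = lift(early_t, early_c)
late_lift = lift(late_t, late_c)
# early_lift substantially exceeding late_lift signals lift decay,
# i.e. the measurement window was dominated by the novelty peak.
```

Real platforms would use proper statistical tests on much larger cohorts; the point of the sketch is only that lift stratified by exposure age reveals decay that an aggregate number hides.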
Impact: Organizations overestimate long-term experiment impact and misallocate resources toward short-lived gains. Microsoft ExP documents sustained impact 50 to 80 percent lower than the initial reported lift, and roadmap plans built from peak lifts predict roughly five times more impact than experiments actually deliver.