A spaced-repetition engine that surfaces each word just as you start to forget it.
Drops’ review queue leaned on recency, so it kept resurfacing words you’d nailed and let the ones you were forgetting slip away. I rebuilt it around a memory-strength model that tracks how well you actually remember each term and brings it back in the window where review sticks best. Rolled out to all users, with no downside on any key metric.
- engagement: +15.6%
- for new users: +37%
- revenue / user: +13%
- rolled out: All users
The review queue ranked by recency, not by what you were actually forgetting.
Review, the Dojo, is one of the most-used surfaces in Drops: every active learner comes back to it to keep vocabulary alive. But the logic underneath leaned too heavily on recently seen terms and didn’t adapt to how each learner was actually doing. It didn’t prioritise the words you hadn’t seen in weeks, or the ones you kept getting wrong. So words you’d clearly mastered kept reappearing, while terms quietly slipping out of memory fell into the long tail.
The surrounding experience made it worse. “Mastered” implied a word was done, when it could still fade. Raw review counts set targets that were too high and reset too quickly, leaving people feeling like they were losing ground rather than gaining it. Review worked, but it wasn’t smart, and it didn’t earn a daily return.
Model how memory actually fades, then surface each word in its optimal review window.
I rebuilt the review logic around a memory-strength model grounded in the forgetting curve. Every term carries a stability (how durably it’s been learned) and a retention score that decays the longer it’s been since you last saw it. Instead of ranking by recency, the system estimates how well you remember each word right now, and ranks by that.
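A minimal sketch of that idea, assuming the classic exponential form of the forgetting curve, R = exp(-t / S). The exact formula, the `Term` fields and the day-based units are illustrative assumptions, not the production model.

```python
import math
from dataclasses import dataclass


@dataclass
class Term:
    word: str
    stability_days: float     # assumed unit: larger means a more durable memory
    days_since_review: float  # time elapsed since the learner last saw the term


def retention(term: Term) -> float:
    """Estimated recall right now, assuming exponential decay R = exp(-t / S)."""
    return math.exp(-term.days_since_review / term.stability_days)


def rank_by_retention(terms: list[Term]) -> list[Term]:
    """Weakest estimated memory first, regardless of how recently a term was seen."""
    return sorted(terms, key=retention)


terms = [
    Term("apple", stability_days=30.0, days_since_review=2.0),
    Term("bridge", stability_days=3.0, days_since_review=2.0),
    Term("cloud", stability_days=10.0, days_since_review=20.0),
]
ranked = [t.word for t in rank_by_retention(terms)]
```

Here "apple" and "bridge" were seen equally recently, yet "bridge" ranks ahead of "apple" because its lower stability means it has decayed further. A recency ranking cannot make that distinction.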
That gives a principled answer to what to review and when. There’s a window where review sticks best: a term that’s neither too fresh nor already lost. So the engine watches for terms entering it, builds each session from a mix of urgent and optimal-window terms, and brings anything you get wrong straight back within the next few interactions. Get a word right and it moves further out; get it wrong and it returns sooner.
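The right/wrong behaviour can be sketched roughly as follows. The multipliers, the stability floor and the requeue gap are hypothetical values chosen for illustration; the source only says that correct answers push a term further out and wrong answers bring it back within the next few interactions.

```python
from collections import deque

GROWTH = 2.0         # hypothetical: stability multiplier after a correct answer
SHRINK = 0.5         # hypothetical: stability multiplier after a wrong answer
MIN_STABILITY = 0.5  # hypothetical floor, in days
REQUEUE_GAP = 2      # hypothetical: a missed term returns after this many other items


def update_stability(stability_days: float, correct: bool) -> float:
    """Correct answers lengthen the interval; wrong answers shorten it."""
    factor = GROWTH if correct else SHRINK
    return max(MIN_STABILITY, stability_days * factor)


def run_session(queue, outcomes, stability):
    """Play a queue in order. `outcomes` maps a word to its scripted answers;
    words without a script are treated as answered correctly."""
    pending = deque(queue)
    scripted = {word: iter(results) for word, results in outcomes.items()}
    shown = []
    while pending:
        word = pending.popleft()
        shown.append(word)
        correct = next(scripted.get(word, iter(())), True)
        stability[word] = update_stability(stability[word], correct)
        if not correct:
            # Bring the missed term back a few interactions later.
            pending.insert(min(REQUEUE_GAP, len(pending)), word)
    return shown


stability = {w: 4.0 for w in "abcde"}
shown = run_session(list("abcde"), {"b": [False, True]}, stability)
```

In this run "b" is missed once, reappears two items later, and is then answered correctly, so it ends the session back at its starting stability while the other terms have doubled.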
- STEP.01 Model the forgetting curve: Give every term a retention score that decays from its last review and a stability that captures how durably it’s learned. Memory fades on its own; the maths handles it, with no hand-tuned decay rules.
- STEP.02 Find the optimal window: Three retention bands: strong (review can wait), the optimal review window, and needs-review (high risk of forgetting). Surface a term as it crosses into the window where review does the most good.
- STEP.03 Compose the session: Mix urgent terms with optimal-window ones so a session is challenging but achievable. Get one wrong and it returns within the next few interactions, with stability adjusted so late or failed reviews count for less.
- STEP.04 Frame it as a daily habit: Set the engine inside a wider review revamp: a clearer term lifecycle (learning → learned → needs review → learned) and a simple daily pulse in place of discouraging counts, so review feels short, satisfying and worth returning to. Always free.
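Steps 02 and 03 could look something like the sketch below. The band thresholds, the session size and the urgent share are hypothetical numbers picked for illustration; the source names the three bands but not their boundaries or the mixing ratio.

```python
from enum import Enum


class Band(Enum):
    STRONG = "strong"              # review can wait
    OPTIMAL = "optimal"            # the window where review does the most good
    NEEDS_REVIEW = "needs_review"  # high risk of forgetting


OPTIMAL_UPPER = 0.90  # hypothetical: above this a term counts as strong
OPTIMAL_LOWER = 0.70  # hypothetical: below this a term needs review


def band(retention: float) -> Band:
    if retention > OPTIMAL_UPPER:
        return Band.STRONG
    if retention >= OPTIMAL_LOWER:
        return Band.OPTIMAL
    return Band.NEEDS_REVIEW


def compose_session(retention_by_word, size=4, urgent_share=0.5):
    """Mix urgent terms with optimal-window terms, weakest first in each group."""
    by_weakest = sorted(retention_by_word, key=retention_by_word.get)
    urgent = [w for w in by_weakest if band(retention_by_word[w]) is Band.NEEDS_REVIEW]
    optimal = [w for w in by_weakest if band(retention_by_word[w]) is Band.OPTIMAL]

    n_urgent = min(len(urgent), int(size * urgent_share))
    picked = urgent[:n_urgent] + optimal[: size - n_urgent]

    # If the optimal window is short on terms, top up with more urgent ones.
    missing = size - len(picked)
    picked += urgent[n_urgent : n_urgent + missing]
    return picked


retention_by_word = {
    "sun": 0.97, "moon": 0.85, "star": 0.75, "rain": 0.72,
    "wind": 0.55, "snow": 0.30, "fog": 0.10,
}
session = compose_session(retention_by_word)
```

With these numbers the session holds the two weakest urgent terms and the two optimal-window terms closest to slipping, while the strong term "sun" is left alone.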
Rolled out to every user, with no downside on any key metric.
- sessions / user: +6.5%
- sessions · new users: +25%
- purchase conversion: +4%
- held steady: Streaks
The engagement signal was clear. Across all users, Dojo interactions per user rose 15.6% and sessions 6.5%, both statistically significant, while overall engagement and streaks held steady. The gains came without taking anything away elsewhere. On the strength of that, we ended the test and rolled the new engine out to everyone.
New users moved most: +37% Dojo interactions, +25% sessions and +22% review sessions, with more of them activating straight into a paid plan. Monetisation moved with the learning: purchase conversion up 4% and revenue per user up 13%. The pattern is consistent: when review actually adapts to what you’re forgetting, people learn more, come back more, and value the product enough to pay.
I’d tailor the model per cohort from the start.
New users moved more than twice as far as the overall base: +37% Dojo interactions against +15.6% across all users. That makes sense: the tuning that delights someone a week in isn’t what re-engages a veteran with thousands of strong terms already behind them. The model even has to fall back to weakest-retention terms for learners who’ve all but finished a language. Next time I’d treat the two cohorts as separate tuning problems rather than one curve fit to both.
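One way to picture per-cohort tuning alongside the weakest-retention fallback is the sketch below. The cohort names, the `review_below` thresholds and the `pick_terms` helper are all hypothetical; the source states only that such a fallback exists and that the cohorts deserve separate tuning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CohortTuning:
    review_below: float  # hypothetical: retention at or below this is due for review


# Hypothetical per-cohort settings: new learners are reviewed earlier,
# veterans are left alone until retention has dropped further.
TUNING = {
    "new": CohortTuning(review_below=0.92),
    "veteran": CohortTuning(review_below=0.85),
}


def pick_terms(retention_by_word, cohort, size=3):
    """Pick due terms for a cohort, weakest first. If nothing is due,
    fall back to the weakest-retention terms so a session is never empty."""
    threshold = TUNING[cohort].review_below
    due = [w for w, r in retention_by_word.items() if r <= threshold]
    if not due:
        # Fallback for learners whose terms are all still strong.
        due = list(retention_by_word)
    return sorted(due, key=retention_by_word.get)[:size]


nearly_done = {"uno": 0.99, "dos": 0.95, "tres": 0.97, "cuatro": 0.91}
veteran_session = pick_terms(nearly_done, "veteran")
new_session = pick_terms(nearly_done, "new")
```

For the same set of strong terms, the veteran setting finds nothing due and falls back to the three weakest, while the new-user setting treats "cuatro" as already due.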
Model the user’s reality, don’t just tune heuristics.
The old queue was a stack of heuristics (a recency penalty here, a hand-tuned weight there) that mostly worked and quietly failed at the edges. Replacing them with a model of the thing we actually cared about, how well you remember a word, made the system both simpler to reason about and far more effective. When a heuristic is doing important work, it’s usually worth the effort to replace it with a model of the underlying reality.
The other reminder: the numbers came from what review chose to surface, not from how the screen looked. The most leveraged surface in a product people already love is usually the logic deciding what it shows, not the page itself.