I've worked on a big personalisation project that produced a distance score for each of 250,000 products against each of 5 million customers. That came to 1.25 trillion records and 37 TB of scoring data.
The capabilities of the team working on the project and the capabilities of the prototype system were far in advance of the imagination of the key decision makers. The technical hurdles were the easy part.
It took a significant amount of work to sell the idea to stakeholders and, God bless them, they rose to the challenge. Ironically, the project was killed by fear of the unknown in some sections of the IT department, not by lack of faith from the commercial teams.
Preventing people from seeing adverts for products they have already seen can be done by using HyperLogLog to estimate whether a customer has already seen a given product.
HyperLogLog bears some studying.
- Good for estimating membership of a set
- Good for estimating count distinct
- Gives truly insane rates of data compression
- May give false positives but won't give false negatives
- Accurate to within around 2%, and an error of that size is fine for personalisation
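
To make the count-distinct and compression points concrete, here is a minimal sketch of the idea in Python. It is an illustration under my own assumptions, not the implementation used on the project or inside any database: the class name, the choice of SHA-1 as the hash, the 12 index bits and the `product-N` keys are all made up for the example. It covers only count distinct; it does not attempt a per-item membership check, for which a Bloom filter is the more usual structure.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch (hypothetical, not production-grade)."""

    def __init__(self, p=12):
        self.p = p                      # number of hash bits used as register index
        self.m = 1 << p                 # number of registers (4,096 when p = 12)
        self.registers = [0] * self.m
        # Standard bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def _hash64(self, item):
        # Assumption: SHA-1 truncated to 64 bits stands in for a fast hash.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        x = self._hash64(item)
        width = 64 - self.p
        idx = x >> width                         # top p bits pick the register
        rest = x & ((1 << width) - 1)            # remaining bits
        rank = width - rest.bit_length() + 1     # position of the leftmost 1-bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        # Harmonic mean of 2^register across all registers.
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(100_000):
        hll.add(f"product-{i}")
    print(round(hll.count()))
```

With 4,096 registers the expected standard error is roughly 1.04 / sqrt(4096), or about 1.6%, which is in line with the 2% figure above. Each register never holds a value above 53, so it fits in 6 bits, and a packed sketch comes to about 3 KB however many items are added. Adding the same item again never changes a register, which is why duplicates do not inflate the count.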
AWS Redshift already uses HyperLogLog when you prefix a count distinct with the approximate keyword:
select approximate count(distinct pricepaid) from sales;
In short, the technology and capability put forward in your editorial already exists.