Why is A/B testing particularly relevant for e-retail media?

  • Reading time: 5 minutes
  • By Alexandra Caillet, Head of Business Insights and Measurement at Lucky cart

In a booming market, with a growing number of campaigns run on the e-commerce websites of French retailers, measuring performance properly is becoming critical to assessing effectiveness and return on investment.

Various methods exist to measure the performance of a campaign. In this market specifically, period comparison and A/B testing are the two most commonly used.

The period comparison method.

The period comparison method consists of looking at the figures observed over the period of the campaign and comparing them with another period, which serves as a reference. Several reference periods are possible: the period immediately preceding the campaign, of the same duration; the same period in the previous year; or a longer period, averaged to smooth out fluctuations.

The main bias of this method is that we may be comparing things that are not comparable in the first place. The two periods may include effects that cannot be identified and statistically isolated (for example promotional effects, product seasonality, media pressure, or simply product distribution). The method can therefore suggest explanations for the variations observed during the campaign, but it cannot isolate the effects that are specifically and directly attributable to the campaign itself.

To take a concrete example: during the various Covid lockdowns, traffic on retailers' websites boomed, so comparing periods introduces a significant bias in the variations observed. The Covid period is only one example among many; other periods have seen exogenous variations as well. All of these effects are difficult to measure if we confine ourselves to a simple period comparison.
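To make this bias concrete, here is a minimal Python sketch. All figures are hypothetical and chosen only for illustration, and `relative_change` is a helper name introduced here, not something taken from a real tool. It shows how an exogenous surge in traffic ends up counted as campaign uplift when two periods are simply compared.

```python
def relative_change(observed: float, reference: float) -> float:
    """Relative change of `observed` versus `reference` (0.10 means +10%)."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return (observed - reference) / reference


# Hypothetical figures, for illustration only.
reference_sales = 1000.0     # sales during the reference period
exogenous_growth = 0.30      # e.g. a traffic surge unrelated to the campaign
true_campaign_effect = 0.05  # the effect we would like to isolate

# Sales observed during the campaign combine both effects.
campaign_sales = reference_sales * (1 + exogenous_growth) * (1 + true_campaign_effect)

naive_uplift = relative_change(campaign_sales, reference_sales)
print(f"Uplift according to period comparison: {naive_uplift:.1%}")
print(f"Effect actually due to the campaign:   {true_campaign_effect:.1%}")
```

In this toy setting the period comparison reports an uplift far above the effect of the campaign alone, and nothing in the two totals makes it possible to separate the two contributions.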

The A/B testing method.

This leaves us with the question of measuring the performance via the A/B testing method.

A common definition of A/B testing reads: “A/B testing is a marketing technique that consists of proposing several variants of the same object that differ according to a single criterion in order to determine the version that gives the best results with consumers” (source: Wikipedia). Beyond its purely marketing use, the method has been applied in a much broader spectrum for many years, with real scientific backing. It is, among other things, the principle behind placebo-controlled scientific trials, because comparing a test group with a control group makes it possible to reason with all other things being equal.

How does this translate into practice, and what are the implications, especially for e-retail media?

If we take the literal marketing definition of A/B testing and apply it to the e-retail media market, the “variant” corresponds to whether or not a shopper is exposed to a given campaign. The “same object” is the campaign activated during a given period. The main difference with the period comparison method is that we compare elements within a single period, which removes the biases linked to comparing different periods.

That said, when we look more closely at the statistical implications of this exposed/unexposed method, random exposure alone may still introduce bias when analyzing post-campaign results. It is quite possible that the differences observed between the two populations are not directly related to whether or not they were exposed to the campaign: the two populations may have behaved very differently to begin with. The choice of which shoppers are exposed is made like a “draw”, and therefore does not take into account pre-existing differences between them. If, for example, we choose not to expose 5% of shoppers, this sample may be too small to serve as a solid benchmark for comparison, with all the biases that this entails. The issue stems mainly from differences between consumers that are not controlled before the selection.

Thus, for A/B testing to be as reliable and unbiased as possible, the exposed and unexposed populations must be as similar as possible, and remain so over time. Why? To avoid uncontrolled statistical noise, so that the two populations are genuinely comparable and the impact of the campaign can be measured as closely as possible.
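The point about a small unexposed group can be illustrated with the standard error of an observed rate, sqrt(p(1 - p) / n). The sketch below is a simplified illustration only: the audience size, the purchase rate and the function name are hypothetical assumptions, not figures from an actual campaign.

```python
import math


def standard_error_of_rate(rate: float, n: int) -> float:
    """Standard error of a proportion observed on n shoppers."""
    return math.sqrt(rate * (1 - rate) / n)


# Hypothetical audience and purchase rate, for illustration only.
total_shoppers = 20_000
purchase_rate = 0.10

for holdout_share in (0.05, 0.50):
    # Number of shoppers kept unexposed to serve as the benchmark.
    n_control = round(total_shoppers * holdout_share)
    se = standard_error_of_rate(purchase_rate, n_control)
    print(f"{holdout_share:.0%} unexposed ({n_control} shoppers): "
          f"purchase rate known to within about +/- {1.96 * se:.2%}")
```

Under these assumptions, the smaller the unexposed group, the wider the uncertainty around its purchase rate, so a small campaign effect can be drowned in noise. Sample size addresses only part of the problem: it does not correct for groups that behaved differently before the campaign.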

Lucky cart’s use of advanced A/B testing…

In this context, how does Lucky cart use the data sources to which it has access to determine these populations?

The approach is to use historical purchase data, based on purchase receipts provided directly by retailers, for each of the two populations: the one that will be exposed to the campaign and the one that will not.

How does this historical data allow us to eliminate a very significant part of the bias?

The past purchases of each population make it possible to determine in advance whether or not their behaviour is similar. Because the populations are as close as possible in terms of consumption before the campaign, any difference later observed in the exposed population can be attributed to the campaign. The aim is therefore to have twin profiles, whose purchasing behavior has been identified as identical to within a hundredth of a cent. Their purchase history must be identical both on the products highlighted during the campaign and on a broader scope, such as category purchases. Statistical tests and calibration are then carried out to ensure that their behavior has indeed remained identical over time, and is not simply the result of chance at a given moment.

Since the two groups started out with identical purchasing behavior, a difference in purchases after the campaign can, statistically, be attributed to exposure: having been exposed has had a real, solidly measured influence compared with not having been exposed.
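The idea of twin profiles can be sketched in a few lines of Python. This is a deliberately simplified illustration and not Lucky cart's actual method: it matches on a single number (pre-campaign spend), uses a greedy nearest-neighbour pairing, and every shopper, amount, function name and the `tolerance` parameter is a hypothetical assumption.

```python
from statistics import mean

# Hypothetical shoppers: (shopper_id, pre_campaign_spend, post_campaign_spend)
exposed = [("e1", 50.0, 58.0), ("e2", 120.0, 131.0), ("e3", 80.0, 85.0)]
unexposed = [("u1", 119.0, 121.0), ("u2", 52.0, 51.0),
             ("u3", 200.0, 205.0), ("u4", 79.0, 80.0)]


def match_twins(exposed, unexposed, tolerance):
    """Greedily pair each exposed shopper with the closest unused unexposed
    shopper on pre-campaign spend; skip shoppers with no twin within tolerance."""
    available = list(unexposed)
    pairs = []
    for shopper in exposed:
        if not available:
            break
        twin = min(available, key=lambda u: abs(u[1] - shopper[1]))
        if abs(twin[1] - shopper[1]) <= tolerance:
            pairs.append((shopper, twin))
            available.remove(twin)
    return pairs


def average_uplift(pairs):
    """Mean difference in post-campaign spend between each exposed shopper
    and its unexposed twin."""
    return mean(e[2] - u[2] for e, u in pairs)


pairs = match_twins(exposed, unexposed, tolerance=2.0)
print(f"{len(pairs)} twin pairs, "
      f"average uplift per shopper: {average_uplift(pairs):.2f}")
```

Because each exposed shopper is compared with a twin whose past behaviour was almost the same, the difference measured after the campaign is much less likely to come from pre-existing differences. A real implementation would match on a far richer purchase history and check that the twins stay similar over time, as described above.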

… To match the needs of Machine Learning.

This shows how much the quality of the data matters. The use of purchase history is the key to eliminating bias. With access to shoppers’ purchase data, we can work at very different scales, from hyper-specific parameters with personalization at the level of the individual to a much wider mass effect when addressing a massive audience.

These very powerful variations of scale are made possible by the iterative process based on machine learning, created and developed by Lucky cart’s Data team.

Therefore, the higher the quality of the data we have access to, the better the learning and the bias reduction. Together, these elements make machine learning effective and efficient and, ultimately, make the performance measurement as accurate and reliable as possible.
