Dealing with Highly Skewed, Multi-modal Financial Datasets in A/B Tests
Doing statistical tests on skewed data is hard. Doing statistical tests on skewed data with multiple modes is even harder!
Chances are, if you’re working with financial data (say, investment or purchase data), you’ve seen that most of your customers spend small amounts of money while a relatively small number spend a ton, resulting in a heavy positive skew. Your data likely looks something like this:
Image taken from here.
I’ve encountered this problem before, and in some cases it was amplified because the data was not only irreparably skewed (no transformation we tried normalized it) but also multi-modal, meaning it looked like the far-right image here:
Image taken from here.
My team and I decided to develop our own solution to this problem. First, since the data was highly skewed, we could not run any frequentist tests on the means. To deal with the skewness, we used the Wilcoxon-Mann-Whitney (WMW) rank-sum test. WMW is often described as a test of equal medians, but it is also sensitive to differences in spread, not just location, which means it can give a small p-value even when both the means and the medians are equal (reference here). This raises a problem that is well illustrated in this example, which I will copy here for convenience:
…the Mann–Whitney U test does not test for inequality of medians, but rather for difference of distributions. Consider another hare and tortoise race, with 19 participants of each species, in which the outcomes are as follows, from first to last past the finishing post:
H H H H H H H H H T T T T T T T T T T H H H H H H H H H H T T T T T T T T T
If we simply compared medians, we would conclude that the median time for tortoises is less than the median time for hares, because the median tortoise here comes in at position 19, and thus actually beats the median hare, which comes in at position 20. However, the value of U is 100 (by the quick method of calculation: each of the first 10 tortoises beats each of the last 10 hares, so U = 10×10). Consulting tables, or using the normal approximation, we find that this U value gives significant evidence that hares tend to have lower completion times than tortoises (p < 0.05, two-tailed). Obviously these are extreme distributions that would be spotted easily, but in larger samples something similar could happen without it being so apparent. Notice that the problem here is not that the two distributions of ranks have different variances; they are mirror images of each other, so their variances are the same, but they have very different skewness.
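For what it’s worth, the quoted example is easy to reproduce. Here’s a quick scipy sketch, with the finishing order encoded exactly as above (lower position = faster finisher):

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Finishing order from the quoted example, first to last past the post.
order = "HHHHHHHHHTTTTTTTTTTHHHHHHHHHHTTTTTTTTT"
positions = np.arange(1, len(order) + 1)
hares = positions[np.array(list(order)) == "H"]
tortoises = positions[np.array(list(order)) == "T"]

# The median tortoise (position 19) beats the median hare (position 20)...
print(np.median(tortoises), np.median(hares))  # 19.0, 20.0

# ...yet the U test still flags the difference as significant.
u, p = mannwhitneyu(hares, tortoises, alternative="two-sided")
print(u, p)  # U = 100.0, p ≈ 0.02
```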
We decided to solve this problem by pairing the WMW test with a Bayesian hypothesis test that compares the means. We initially considered going solely with the Bayesian test; however, the classic Bayesian approach assumes a normal distribution, and modeling the prior with a heavy-tailed distribution can lead to an excessively long posterior interval (as discussed here). Pairing the two tests let us build a strainer of sorts: only variants that produced “positive” results in both tests earned our full confidence that both the distribution and the “typical” user were shifting in the same direction.
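To make the “strainer” idea concrete: the post doesn’t pin down the exact Bayesian test we used, so the sketch below substitutes a Bayesian bootstrap (Rubin, 1981) of the difference in means, which avoids the normality assumption. The toy log-normal data, the 0.05 cutoff, and the 0.95 posterior threshold are all assumptions for illustration:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

def bayesian_bootstrap_mean_diff(control, treatment, draws=10_000):
    """Posterior draws of mean(treatment) - mean(control) via Dirichlet weights."""
    w_c = rng.dirichlet(np.ones(len(control)), size=draws)
    w_t = rng.dirichlet(np.ones(len(treatment)), size=draws)
    return w_t @ treatment - w_c @ control

# Hypothetical heavy-tailed spend data with a small uplift in treatment.
control = rng.lognormal(mean=3.0, sigma=1.2, size=5_000)
treatment = rng.lognormal(mean=3.1, sigma=1.2, size=5_000)

# Gate 1: WMW on the raw observations (distribution shift).
_, p_wmw = mannwhitneyu(treatment, control, alternative="two-sided")

# Gate 2: posterior probability that the mean uplift is positive.
diff = bayesian_bootstrap_mean_diff(control, treatment)
p_uplift = (diff > 0).mean()

# The strainer: only trust results where both tests agree.
both_positive = (p_wmw < 0.05) and (p_uplift > 0.95)
print(f"WMW p = {p_wmw:.4f}, P(uplift > 0) = {p_uplift:.3f}, pass = {both_positive}")
```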
Then came the problem of multi-modality. Our data had four modes: we saw four separate clusters of investing behavior. We realized that even if the distribution shifted in the right direction in an A/B test, we couldn’t tell which cluster was responsible for the shift (were all users shifting, or was a single cluster driving the change?). There was also the possibility that we were hurting one of these clusters in isolation even while observing an overall positive shift.
To solve this problem, we ran the paired test I just described on each of the four clusters separately whenever the overall test produced a positive or negative result. We designated an area around each peak (based on standard deviations) to “belong” to a cluster, sliced the overall data set into four clusters based on these areas, and then compared the distributions between control and treatment, as in the sketch below. Many of our tests showed the clusters moving together, or produced results too vague to interpret, but we occasionally ran into cases where one cluster was moving independently of the rest of the data. Those cases led to further deep-dives that were quite interesting!
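We never published our exact slicing code, so as a rough stand-in, the sketch below uses a Gaussian mixture on log-spend to define the four clusters (rather than our hand-drawn standard-deviation windows around the peaks) and reruns the WMW gate per cluster. The simulated data and every parameter here are illustrative:

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

def simulate(n, shifts=(0.0, 0.0, 0.0, 0.0)):
    """Hypothetical 4-mode spend data: a mixture of log-normal clusters."""
    comps = rng.integers(0, 4, size=n)
    mus = np.array([2.0, 3.5, 5.0, 6.5]) + np.array(shifts)
    return np.exp(rng.normal(mus[comps], 0.3))

# Only the third cluster shifts in treatment; an overall test can still pass.
control = simulate(8_000)
treatment = simulate(8_000, shifts=(0.0, 0.0, 0.15, 0.0))

# Fit one mixture on pooled log-spend so both arms share the cluster definition.
pooled = np.log(np.concatenate([control, treatment])).reshape(-1, 1)
gmm = GaussianMixture(n_components=4, random_state=0).fit(pooled)

# Slice each arm by predicted cluster and rerun the comparison per cluster.
for k in range(4):
    c_k = control[gmm.predict(np.log(control).reshape(-1, 1)) == k]
    t_k = treatment[gmm.predict(np.log(treatment).reshape(-1, 1)) == k]
    _, p = mannwhitneyu(t_k, c_k, alternative="two-sided")
    print(f"cluster {k}: n = ({len(c_k)}, {len(t_k)}), WMW p = {p:.4f}")
```

Fitting the mixture on the pooled data (rather than per arm) matters: it keeps the cluster boundaries identical for control and treatment, so a shift shows up as movement within a cluster rather than as relabeled cluster membership.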
If you’ve dealt with similar issues and developed other solutions, or if you have thoughts on this approach, feel free to contact me!