*Identifying the Right Statistical Tests for Analyzing Behavioural Market Research Based on A/B Testing Methodology*

**Introduction**

In the rapidly evolving business landscape, innovation transcends the mere creation of new products. True innovation is about designing solutions that not only captivate but also comprehensively meet consumer needs. This is where the art of consumer validation becomes crucial, not just for launching groundbreaking products but also for refining the ones already in the market.

Enter the realm of pretotyping. Originating from the traditional A/B testing framework, which compares two different groups' responses to varied offerings, pretotyping has revolutionized the way businesses validate market demand before full-scale product development. Among its most strategic techniques is the painted door test. With painted door tests, it is possible to measure and compare the market demand for different product characteristics and use this information to make a decision in favour of one over the other. This method effectively gauges the demand for a (new) brand, feature, or value proposition, identifying the most appropriate target audience and optimal pricing strategy.

**With this method, you can collect vast amounts of behavioural consumer data.**

**But, how do you make sense of all the data?**

In this comprehensive guide, we will delve into the statistical tests that underpin pretotyping tests, helping you navigate the intricate world of behavioural-based decision-making. Before we lead you through the analysis of painted door tests step-by-step, we briefly introduce the metrics and statistical tests used for painted door tests.

**Key Metrics For Painted Door Tests**

When implementing painted door tests, choosing the right metrics is crucial for accurate analysis. First, you need to determine whether your primary metric will be quantitative (numerical) or qualitative (categorical). **Quantitative variables** are further classified as either continuous, which can assume any value within a range (e.g. time spent on page (seconds)), or discrete, limited to specific numeric values, typically integers (e.g. page views per visit). On the other hand, **qualitative variables** are categorized as nominal, where the categories have no inherent order (e.g. colour scheme of button), or ordinal, where the categories do have a logical sequence (e.g. user satisfaction rating (1-5)).

Understanding these distinctions is vital because the type of variable you are measuring dictates the statistical test you should employ. For instance, common Key Performance Indicators (KPIs) like click-through rate, CTA clicks, conversion rate, and email confirmations are integral for gauging purchase intent. Each of these metrics can offer a snapshot of user engagement and the effectiveness of the tested elements, guiding strategic decisions in product development and marketing.

**2-Group Comparisons**

When comparing two groups (i.e. testing two variants against each other), selecting the appropriate statistical test is essential. There are three commonly used options: the **Chi-Square Test**, the **Z-test**, and the **T-test**.

The Chi-Square Test handles categorical data, evaluating whether there are significant associations between categories. The Z-test is applicable when comparing proportions, determining whether percentage outcomes differ between the two groups. Lastly, the T-test evaluates differences in means between two sets of data, making it invaluable for assessing variations in continuous variables.
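To make the Z-test case concrete, here is a minimal sketch of a two-proportion Z-test using only the Python standard library. The counts are hypothetical, chosen purely for illustration:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided Z-test for the difference between two proportions."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    # Pooled proportion under H0 (no difference between the groups)
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical example: 120 of 1,000 visitors convert on variant A,
# 90 of 1,000 on variant B
z, p = two_proportion_z_test(120, 1000, 90, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With these illustrative numbers, the p-value falls below 0.05, so you would reject the null hypothesis of equal proportions.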

**A practical approach**

To illustrate the practical application of the statistical tests, let's delve into a fictional example within the home appliance industry, conducting a painted door test for a smart vacuum cleaner.

We will guide you through the entire 7-step process using this example.

Consider a scenario where the current price for the smart vacuum cleaner is 150 euros, and the team is contemplating an increase to 180 euros.

**Step 1: Pose your research question**

Our primary question is whether the contemplated price increase will influence market demand, that is, consumers' purchase intent.

Thus, our research question may sound like this:

*Is there a difference in the purchase intent for the smart vacuum cleaner at a price of 150 euros compared to a price of 180 euros?*

**Step 2: Define hypotheses and set proxies**

To analyze a problem in a structured manner, it's important to first establish hypotheses. Typically, you'll want to create a null hypothesis (H_{0}) and an alternative hypothesis (H_{A}). The null hypothesis is the default assumption that there's no difference or effect, while the alternative hypothesis suggests the opposite: that there is a difference or effect present in the population. To test these hypotheses, we need to determine a Key Performance Indicator (KPI) that represents purchase intent. For our case, the number of clicks on the call-to-action (CTA) button, such as "order now", can be considered a reliable proxy for the level of purchase intent.

Thus, we formulate the following hypotheses:

H_{0}: There is no significant association between the pricing change (150 euros to 180 euros) and the frequency of CTA clicks.

H_{A}: The pricing change to 180 euros results in a significant difference in the frequency of CTA clicks.

In practical application, we not only want to understand whether purchase intent differs, but whether consumers’ purchase intent decreases with the introduction of a higher price. To dig even deeper into the data, we can use the average number of CTA clicks per day on either of the landing pages. This leads us to the following hypotheses:

H_{0}: There is no significant difference in the average number of CTA clicks between the current price (150 euros) and the proposed increased price (180 euros).

H_{A}: The price increase to 180 euros will result in a decrease in the average number of CTA clicks for the smart vacuum cleaner.

Although total buying intents (= CTA clicks) per landing page can be an adequate indicator of a product's success, it may also be beneficial to consider the conversion rate: the share of visitors who clicked the CTA out of all visitors to the landing page. This generally provides more accurate insight into each landing page's effectiveness and, since both pages are identical except for the price, into each price point's persuasiveness in converting visitors into customers.

From this, we derive the following hypotheses:

H_{0}: There is no significant difference in conversion rates on the CTA button clicks between the current price (150 euros) and the proposed increased price (180 euros).

H_{A}: The price increase to 180 euros will result in a decrease in conversion rates on the CTA button clicks for the smart vacuum cleaner.

While these three hypotheses and their respective proxies do not yet provide a holistic overview of your consumers' true purchase intent, they are important. We picked them deliberately to walk you through different possible statistical tests. In practice, different test setups can demand different hypotheses. If you want to learn more about how to create a hypothesis for your pretotyping tests, consider reading the following article.

So, here we are. Ready to set up the test. Almost.

**Step 3: Calculate sample size**

Prior to conducting the test, it's essential to define the minimum effect size you aim to detect between the two landing page variants, be it in terms of the difference in total CTA clicks or conversion rates. Establishing this parameter allows you to determine the required sample size for achieving statistical significance in identifying the specified effect size.

For this specific research, we calculated a required sample size of around 6,800 people in total, divided between the two landing pages, to detect a minimum relative difference of 25% in their conversion rates at an estimated baseline conversion rate of 7%. In other words, you would be able to detect significant differences of +/-1.75 percentage points from the 7% baseline. Feel free to reach out to understand more about calculating sample sizes.
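As a rough sketch, the required sample size per variant can be approximated with the standard normal-approximation formula for comparing two proportions. The power (80%) and two-sided α (0.05) below are our own assumptions, not values stated above, so the result will differ somewhat from the quoted ~6,800 depending on the calculator and settings you use:

```python
from math import ceil
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate sample size per group to distinguish p1 from p2
    with a two-sided test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return ceil(n)

# Baseline conversion of 7%, minimum detectable relative change of 25%
# (i.e. 7% vs 8.75%, a 1.75-percentage-point difference)
n_per_group = sample_size_two_proportions(0.07, 0.0875)
print(n_per_group, "visitors per landing page,", 2 * n_per_group, "in total")
```

Different power levels, one-sided tests, or continuity corrections all shift this number, which is why published calculators rarely agree to the last digit.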

To acquire the sample, Horizon usually uses Meta and Google Ads, leading the target audience to the different landing page variants.

**Step 4: Collect data**

To effectively gather data for your experiment, we begin by setting up two identical landing pages that differ only in their pricing. This allows us to directly compare how the different price points influence visitor behaviour and conversion rates.

Before we tap into the data itself, it's crucial to pay attention to your technical setup and ensure that you're collecting data correctly. There's nothing worse than setting up a test and realizing later on that the data was not collected accurately.

Learn more about how to set up a painted door test correctly in this article.

This will help you avoid common mistakes and optimize your test setup, whether you're using Horizon software or a custom configuration.

In this scenario, let’s assume you collected the following data over two weeks of running an A/B test with two landing pages:

Zoomed in on the data over the 14-day period, the table looks like this:

The data appears accurate and leads us directly into the next step, where the actual magic happens.

**Step 5: Perform the statistical tests**

You probably learned in school or university that before running a statistical test, it is crucial to determine the level of significance.

In statistical hypothesis testing, the significance level (α) is the threshold at which you decide whether to reject the null hypothesis. A common choice is 0.05, meaning that you are willing to accept a 5% chance of making a Type I error — rejecting a true null hypothesis. This level helps set the critical value or critical region for the test.
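The meaning of α becomes tangible in a quick simulation: if we run many A/A tests, where both groups are drawn from the same 7% conversion rate so the null hypothesis is true, and test each at α = 0.05, roughly 5% of them will still reject. A minimal sketch, with all numbers chosen for illustration:

```python
import random
from math import sqrt

random.seed(42)

def simulate_false_positive_rate(n_sims=2000, n=500, p=0.07, z_crit=1.96):
    """Run repeated A/A tests and count how often a two-proportion
    z-test (wrongly) rejects H0 at the given critical value."""
    rejections = 0
    for _ in range(n_sims):
        # Both groups come from the SAME conversion rate, so H0 is true
        a = sum(random.random() < p for _ in range(n))
        b = sum(random.random() < p for _ in range(n))
        p_pool = (a + b) / (2 * n)
        se = sqrt(p_pool * (1 - p_pool) * (2 / n))
        if se == 0:
            continue  # degenerate sample with zero conversions, skip
        z = (a / n - b / n) / se
        if abs(z) > z_crit:
            rejections += 1  # a Type I error
    return rejections / n_sims

rate = simulate_false_positive_rate()
print(f"False positive rate: {rate:.3f}")
```

The printed rate lands near 0.05, illustrating that α is exactly the false-alarm rate you have agreed to tolerate.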

**Hypothesis 1**

Let’s have a look back at our hypotheses. The first hypothesis was as follows:

H_{0}: There is no significant association between the pricing change (150 euros to 180 euros) and the frequency of CTA clicks.

H_{A}: The pricing change to 180 euros results in a significant difference in the frequency of CTA clicks.

We made it fairly simple in this one by including a signal word: frequency.

In other words, or in more generalized scientific language, our hypotheses can be rephrased as follows:

H_{0}: The distribution of the observed frequencies is equal to the distribution of the expected frequencies under independence.

H_{A}: The distribution of the observed frequencies is not equal to the distribution of the expected frequencies under independence.

*Chi-Square Test*

When it comes to frequencies or categorical data, the Chi-Square Test of independence is our go-to test. This statistical method is particularly useful for analyzing categorical data and determining whether the distribution of one categorical variable is independent of another.

By constructing a contingency table, we can compare the observed and expected frequencies associated with the pricing change (Landing Page A (150 €) or Landing Page B (180 €)) and CTA click outcomes (Yes or No). By analyzing this relationship, we can determine whether the observed distribution of CTA clicks is consistent with what we would expect if there was no correlation between the pricing change and CTA click outcomes.

To investigate it further, we will set up a 2x2 contingency table with the categories "Landing Page" (A (150€) or B (180€)) and "CTA Click" (Yes or No):

To fill in the table, you can take the number of CTA clicks per landing page from the given data. To calculate the number of “No CTA clicks”, you simply subtract the number of CTA clicks on a landing page from that page’s total number of visitors.

Next, we'll use the Chi-Square test formula to calculate the test statistic.

No worries about the maths in the following part. You can also just use software like SPSS, R, Stata or any other statistical program you are familiar with. If you are using the Horizon software, you can even see the results of your test directly in the software, without the need for further statistical tools.

But if all this is not accessible to you, or you want to impress your colleagues with your knowledge, feel free to stick around for the following part. Otherwise, just skip to the ‘Results Interpretation’ of this test.

The formula for the Chi-Square statistic is:

χ^{2} = Σ_{i,j} (O_{ij} − E_{ij})^{2} / E_{ij}

Where:

O_{ij} is the observed frequency in each cell of the contingency table

E_{ij} is the expected frequency in each cell

Before you can fill in the values in the formula, you need to calculate the expected frequency for each cell. The formula is as follows:

E_{ij} = (R_{i} × C_{j}) / N

Where:

*R*_{i} is the total for row *i*

*C*_{j} is the total for column *j*

*N* is the grand total of all observations

This would give us the following expected values:

For Landing Page A, CTA Click: *E*_{11}=207.5

For Landing Page A, No CTA Click: *E*_{12}=2792.5

For Landing Page B, CTA Click: *E*_{21}=207.5

For Landing Page B, No CTA Click: *E*_{22}=2792.5

Let’s calculate χ^{2} by filling in the values from the 2x2 contingency table.

Now that we have the Chi-Square statistic, the next step is to compare it with the critical value from the Chi-Square distribution with appropriate degrees of freedom to determine the p-value and assess statistical significance. The degrees of freedom in this case would be (*rows*-1)(*columns*-1)=1. You can use a Chi-Square distribution table or a statistical software package to find the critical value and determine the p-value. If the p-value is below your chosen significance level (commonly 0.05), you would reject the null hypothesis.

Using a Chi-Square distribution table (you can find our very own one here) or statistical software, you find that the critical value at a 0.05 significance level with 1 degree of freedom is approximately 3.84.

Now, we compare the calculated Chi-Square statistic with the critical value. We observe that the calculated Chi-Square statistic (18.7) is greater than the critical value (3.84), which leads us to reject the null hypothesis. This means there is a significant association between the landing page and CTA clicks at the 0.05 significance level.
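The whole calculation can be reproduced in a few lines of pure Python. The observed counts below are illustrative, chosen to be consistent with the expected frequencies (207.5 / 2,792.5) and the χ² of roughly 18.7 reported above, e.g. 250 vs. 165 CTA clicks out of 3,000 visitors per landing page:

```python
from math import sqrt
from statistics import NormalDist

# Illustrative 2x2 contingency table, consistent with the figures above:
# rows = Landing Page A (150 EUR) and B (180 EUR),
# columns = CTA click (yes, no); 3,000 visitors per page
observed = [[250, 2750],
            [165, 2835]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency per cell: E_ij = (R_i * C_j) / N
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (O - E)^2 / E over all cells
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))

# With 1 degree of freedom, a chi-square variable is the square of a
# standard normal one, so the p-value follows from the normal CDF
p_value = 2 * (1 - NormalDist().cdf(sqrt(chi2)))

print(f"chi2 = {chi2:.2f}, p = {p_value:.6f}")
```

With real data you would simply swap in your own observed table; statistical packages such as scipy's `chi2_contingency` perform the same steps (note that many tools apply a continuity correction to 2x2 tables by default, which slightly lowers the statistic).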