A/B Testing at Scale: How We Engineered a Solution for Millions of Users with AWS

Unacademy is a unicorn tech startup and India's largest online learning platform, boasting over 50 million users. It has attracted investors such as Meta, Sequoia, and Tiger Global. As a Software Engineer in the Web Platform Team at Unacademy, I've had the privilege of working on some challenging problems at scale. Let us look at one of them which was particularly interesting: A/B Testing at Scale.

Let's start with the basics. If you're new to this topic, I'm sure you'll learn something new by the end of this article. Feel free to skip ahead if you're already familiar with the concepts, but I highly recommend checking out the interactive demo.

Basics of A/B Testing

What is A/B testing?

A/B testing is like a science experiment for websites or apps. Imagine you have two versions of a webpage, A (known as Control variant) and B (known as Test variant). Version A might have a green button, and version B has a red button. You want to know which color button more people will click on.

So, you show version A to some of your visitors and version B to others. Then you watch and see which version gets more clicks. If more people click the green button on version A, you might decide to use the green button for everyone because it seems to work better.

In simple words, A/B testing is a way to compare two versions of something to see which one does a better job.

Fun Fact: As we speak, companies like Amazon, Netflix, and Instagram are quietly running A/B tests, shaping your experience without you even realizing it.

Can I see it in action?

Yes, get ready to be amazed! 🤩 I've added what I call the 'Playground'—an interactive code editor for React, running directly in your browser in real-time without a backend server.

Playground: A/B Test in React

Try hitting the refresh button multiple times to see if you encounter a different variant each time.

import React, { useState, useEffect } from 'react';

const ABTest = () => {
  // State to hold which variant to show
  const [variant, setVariant] = useState('');

  useEffect(() => {
    // Randomly choose a variant on component mount
    const variants = ['A', 'B'];
    const selectedVariant = variants[Math.floor(Math.random() * variants.length)];
    setVariant(selectedVariant);
  }, []);

  return (
    <div>
      {variant === 'A' && <VariantA />}
      {variant === 'B' && <VariantB />}
    </div>
  );
};

export default ABTest;

const VariantA = () => (
  <>
    <h1>
      Wow! You're in the 50% of the people in the world seeing{' '}
      <span style={{ color: 'royalblue' }}>Variant A.</span>
    </h1>
    <p>Try hitting the refresh button on the bottom, multiple times.</p>
  </>
);

const VariantB = () => (
  <>
    <h1>
      Yay! You're in the 50% of the people in the world seeing{' '}
      <span style={{ color: 'green' }}>Variant B.</span>
    </h1>
    <p>Try hitting the refresh button on the bottom, multiple times.</p>
  </>
);

Wow, why is it useful?

This is great for businesses because:

Improves Customer Experience: By testing different options, businesses can find out what their customers prefer and make their websites or products more enjoyable to use.
Increases Sales: If a business knows which version of a webpage leads to more sales or sign-ups, they can use that version for everyone, potentially making more money.
Reduces Risks: Before making big changes, like redesigning a website, businesses can test small changes to see how people react. This way, they avoid making big investments that might not pay off.
Informs Decisions: Instead of relying on gut feelings, businesses can make informed decisions that are backed up by actual user behavior.

Hence, A/B testing helps businesses understand what works best, leading to happier customers and more sales.

The Button That Made Millions: Amazon reportedly found that every 100 ms of added page latency cost it about 1% in sales. This is a classic example of how seemingly minor changes, measured through experiments such as A/B tests, can lead to significant financial gains.

And, how does it work?

As you can see in the code above, we simply used the Math.random() function (on line 10) to determine which variant to show you. A purely random choice like this keeps the split as fair as possible.

In the industry, companies may use a similar random function to determine user buckets or rely on third-party services that achieve the same result. Once a user has been assigned to a specific bucket (say, Variant A), they must always see the content for Variant A, even on page refresh. They should never see content from any other variant until the experiment is concluded.
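To make the "once bucketed, always bucketed" rule concrete, here is a minimal sketch in plain JavaScript. The helper name getOrAssignVariant, the experiment key, and the store parameter (anything with get and set, such as a Map) are my own placeholders for illustration, not part of any library or of our production code.

```javascript
// Hypothetical helper: returns the variant already stored for this experiment,
// or assigns one at random and stores it so later calls return the same value.
// `store` is any object with get(key) and set(key, value), for example a Map.
function getOrAssignVariant(store, experimentKey, variants = ['A', 'B'], random = Math.random) {
  const existing = store.get(experimentKey);
  if (variants.includes(existing)) {
    return existing; // already bucketed: keep the user in the same variant
  }
  const assigned = variants[Math.floor(random() * variants.length)];
  store.set(experimentKey, assigned);
  return assigned;
}

// Usage: repeated calls against the same store return the same variant.
const store = new Map();
const first = getOrAssignVariant(store, 'home-experiment');
const second = getOrAssignVariant(store, 'home-experiment');
console.log(first === second); // true
```

The random function is passed in as a parameter only so the assignment can be tested deterministically. Where the store actually lives (localStorage, a cookie, a backend) is exactly what the approaches discussed in this article differ on.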

Developing A/B tests

For this article, we will consider that the frontend is built using a SPA framework like React, but the concepts remain the same, regardless of the frontend stack.

Old School Approach (Traditional)

In this approach, when a user visits a website, their bucketing is done on the client-side or via a backend API. A variable is then set in their browser's localStorage to determine which variant they should see. This value is subsequently used to ensure the user sees the same variant on future visits.

Playground: A/B Test causing Flash of Unstyled Content (FOUC)

Try refreshing the playground, and you'll notice you see the same variant every time, but with a "loading..." flash. If you want to view your localStorage, you can do so by inspecting the browser and going to the Application tab.

import React, { useState, useEffect } from 'react';

// Placeholder image URLs for demonstration
const imageUrlA = 'https://fakeimg.pl/150x150/0000FF/808080?text=Variant+A';
const imageUrlB = 'https://fakeimg.pl/150x150/FF0000/FFFFFF?text=Variant+B';

const VariantA = () => <img src={imageUrlA} alt="Variant A" />;
const VariantB = () => <img src={imageUrlB} alt="Variant B" />;

const ABTestComponent = () => {
  const [variant, setVariant] = useState('loading');

  useEffect(() => {
    // Simulate reading from localStorage with a delay
    const timer = setTimeout(() => {
      // Reuse the stored variant if present; otherwise assign one at random
      const storedVariant = localStorage.getItem('userVariant') || (Math.random() < 0.5 ? 'A' : 'B');
      localStorage.setItem('userVariant', storedVariant);
      setVariant(storedVariant);
    }, 500); // 500 ms delay to simulate a flicker effect

    // Clear the pending timer if the component unmounts first
    return () => clearTimeout(timer);
  }, []);

  return (
    <div>
      {variant === 'loading' && <div>Loading variant...</div>}
      {variant === 'A' && <VariantA />}
      {variant === 'B' && <VariantB />}
    </div>
  );
};

export default ABTestComponent;

Pros of this approach

This approach is fine in most cases but has its drawbacks in large-scale applications. Its main advantages are:

Simplicity: localStorage provides a straightforward API that's easy to use for storing and retrieving simple data, making it easy for developers.
Persistence: Data stored in localStorage remains between sessions, allowing web applications to remember information or user preferences even after the browser is closed and reopened.

Drawbacks of this approach

Delay due to Hydration: In SPAs, localStorage can be read only by client-side JavaScript, so the stored variant can affect the page only after hydration is complete. This results in a Flash of Unstyled Content (FOUC), impacting the initial user experience. This effect is especially prominent if you're conducting A/B tests at the top of the landing page (above the fold).

Hydration is like watering the “dry” HTML with the “water” of interactivity and event handlers.
~ Dan Abramov (React core team member)

Dip in Web Performance and SEO: FOUC can also lead to a significant drop in web performance metrics, notably Cumulative Layout Shift (CLS), which can negatively impact SEO and, consequently, the overall business.
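To get a feel for why a late variant swap hurts CLS, recall that each individual layout shift is scored as impact fraction multiplied by distance fraction, and CLS is built from those scores. The sketch below just does that arithmetic. The input numbers are made up for illustration and are not measurements from our pages.

```javascript
// Score of a single layout shift, as defined for the CLS metric:
//   impact fraction   = share of the viewport affected by the shifting elements
//   distance fraction = how far they moved, relative to the viewport's largest dimension
function layoutShiftScore(impactFraction, distanceFraction) {
  return impactFraction * distanceFraction;
}

// Hypothetical example: swapping in the variant pushes content covering half
// the viewport down by 20% of the viewport height.
const score = layoutShiftScore(0.5, 0.2);
console.log(score.toFixed(2)); // "0.10"
```

A CLS of 0.1 or less is generally considered good, so under these assumed numbers a single variant swap above the fold would already use up that entire budget.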

The playground above simulates this behaviour by adding a delay to the localStorage read, which causes the flicker. Think of this delay as the time a SPA needs to hydrate and run its JavaScript before it can set the variant.

New-age Approach ✨

Given that Unacademy had millions of visitors coming to its web application, we needed an approach that addressed the drawbacks related to SEO and web performance metrics.

Things we considered while designing the new approach:

We had to determine which variant each user would see before they even loaded the webpage, which could help us skip the whole hydration issue.
Because we use Amazon CloudFront, a CDN, to cache our pages, they're served straight from the cache without touching our servers. This meant we needed to sort out the user's variant at the edge, before CloudFront looked up the page in its cache.
Our logic needs to run efficiently for millions of users without experiencing any downtime.

Like all tech solutions, this approach involves some trade-offs, which we discuss below. We aimed to deliver the best user experience with zero negative impact on revenue.

Tech Architecture

We make use of a serverless function, which you can think of as code that executes close to the user without us having to provision or manage a server. We will utilize AWS Lambda@Edge, a serverless compute service that lets us achieve this.

Think of AWS Lambda@Edge like a vending machine for your favorite snack that's placed right next to your room, instead of you having to walk to the store. Whenever you want a snack, it's quickly available. Similarly, Lambda@Edge puts the code needed for a website or app to work right near the user, so everything loads faster and smoother, without needing a whole computer server set up by you.

I will guide you through this architecture with a simple example.

Step 1: Setting up the Webpages

Let's say we're conducting A/B Tests on the /home route of our web application.

Create different versions of your webpage. This could be as simple as creating a new subroute, such as /home/variant-b. /home acts as the control group (default or Variant A), and /home/variant-b as the test group (Variant B). This setup is a one-time step.
Make the necessary changes in your components on /home/variant-b to reflect the new variant. For instance, you might change an image or color of a button.
Deploy to staging environment and test the new variant by visiting the URL directly. For instance, to test /home/variant-b, simply visit that URL to view the new variant.

Note: This approach may lead to some code duplication, a trade-off we accept for the benefits it offers. This duplication can be minimized by developing components to be reusable across variants.

We accept this trade-off because A/B tests are typically short-term, and the duplicated page will be removed at the experiment's end.

Step 2: Setting up AWS

If you're new to AWS Lambda@Edge, you can refer to the official AWS documentation mentioned later in this article.

Create a new behavior in Amazon CloudFront for /home and /home/variant-b, applying the same caching policies and settings as other pages, according to team requirements.
Deploy a new AWS Lambda@Edge function to the /home route as a Viewer Request trigger. This allows us to intercept client requests.
Deploy another AWS Lambda@Edge function to the /home route as a Viewer Response trigger. This enables us to set the cookie before the response is sent back to the client.
Finally, deploy a third AWS Lambda@Edge function to the /home/variant-b route as a Viewer Request trigger. This is used to redirect users back to the /home route if they access /home/variant-b directly.

Viewer Request: These functions run when CloudFront receives a request from the client, before it checks its cache or contacts the origin server. They can modify the request before CloudFront processes it.

Viewer Response: These functions run just before CloudFront returns the response to the client, whether it came from the cache or from the origin server. They can modify the response before it is sent back to the client.

Step 3: Coding up the AWS Lambda@Edge Logic

Viewer Request Function on `/home` Route:

Implement logic to determine the user's variant. Check if the variant cookie already exists. If it does, proceed with that variant; otherwise, determine the variant.
This can be done using a random function or a third-party service. Ensure the third-party service is fast and reliable, as it could be a performance bottleneck.
Once the variant is determined, select the appropriate page from the CDN. Rewrite the request URI to either /home or /home/variant-b based on the variant. This ensures users see /home in their browser URL, but the content varies.
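The steps above can be sketched roughly as follows. This is a minimal illustration rather than our production function: the cookie name ab_home_variant, the 50/50 split, and the helper names readCookie and routeRequest are assumptions of mine. The event shape (event.Records[0].cf.request, with headers stored as arrays of key/value objects) follows the Lambda@Edge event structure.

```javascript
const COOKIE_NAME = 'ab_home_variant'; // hypothetical cookie name
const VARIANT_PATHS = { A: '/home', B: '/home/variant-b' };

// Reads one cookie from CloudFront-style headers: { cookie: [{ key, value }] }.
function readCookie(headers, name) {
  for (const { value } of headers.cookie || []) {
    for (const part of value.split(';')) {
      const [key, ...rest] = part.trim().split('=');
      if (key === name) return rest.join('=');
    }
  }
  return undefined;
}

// Keeps an existing valid variant, otherwise picks one at random (assumed 50/50),
// then rewrites the URI so CloudFront serves the matching page.
function routeRequest(request, random = Math.random) {
  const existing = readCookie(request.headers, COOKIE_NAME);
  const variant =
    existing === 'A' || existing === 'B' ? existing : random() < 0.5 ? 'A' : 'B';
  request.uri = VARIANT_PATHS[variant];
  return request;
}

// Viewer-request handler sketch. In a real deployment this would be exported
// as the Lambda handler, e.g. exports.handler = handler;
const handler = async (event) => routeRequest(event.Records[0].cf.request);

// Usage: a returning user with the cookie set to B gets the variant-b page.
const returning = {
  uri: '/home',
  headers: { cookie: [{ key: 'Cookie', value: 'foo=bar; ab_home_variant=B' }] },
};
console.log(routeRequest(returning).uri); // "/home/variant-b"
```

Only the URI that CloudFront looks up changes here. The browser never sees a redirect, which is why the address bar keeps showing /home.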

Viewer Response Function on `/home` Route:

If the variant is already in the cookie, skip this step. Otherwise, set the variant in the cookie before sending it back to the client.
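A rough sketch of that cookie logic is shown below. The cookie name, the 30-day Max-Age, and the other cookie attributes are assumptions for illustration. How the response function learns which variant was assigned depends on your setup, so it is simply passed in as a parameter here.

```javascript
const COOKIE_NAME = 'ab_home_variant'; // hypothetical cookie name
const THIRTY_DAYS_IN_SECONDS = 30 * 24 * 60 * 60; // assumed experiment lifetime

// True if the incoming request already carried the variant cookie.
function hasVariantCookie(requestHeaders) {
  return (requestHeaders.cookie || []).some(({ value }) =>
    value.split(';').some((part) => part.trim().startsWith(`${COOKIE_NAME}=`))
  );
}

// Appends a Set-Cookie header for the variant unless the user already has one.
// Headers use the CloudFront shape: lowercase name -> array of { key, value }.
function withVariantCookie(response, requestHeaders, variant) {
  if (hasVariantCookie(requestHeaders)) {
    return response; // already bucketed: nothing to set
  }
  const existing = response.headers['set-cookie'] || [];
  response.headers['set-cookie'] = [
    ...existing,
    {
      key: 'Set-Cookie',
      value: `${COOKIE_NAME}=${variant}; Path=/; Max-Age=${THIRTY_DAYS_IN_SECONDS}; Secure; SameSite=Lax`,
    },
  ];
  return response;
}

// Usage: a first-time visitor assigned to B gets the cookie added to the response.
const response = withVariantCookie({ status: '200', headers: {} }, {}, 'B');
console.log(response.headers['set-cookie'][0].value.startsWith('ab_home_variant=B')); // true
```

Existing Set-Cookie headers are preserved rather than overwritten, since the page itself may set other cookies.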

Viewer Request Function on `/home/variant-b` Route:

Redirect users to the /home route.
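A minimal sketch of this redirect: a viewer-request function can return a response object instead of the request, in which case CloudFront sends that response straight back to the client. The choice of a temporary 302 (rather than a permanent 301) and the no-store cache header are my assumptions, on the grounds that the variant page only exists for the duration of the experiment.

```javascript
// Builds a Lambda@Edge-style redirect response pointing back to /home.
function redirectToHome() {
  return {
    status: '302', // temporary redirect (assumed): the variant route is short-lived
    statusDescription: 'Found',
    headers: {
      location: [{ key: 'Location', value: '/home' }],
      'cache-control': [{ key: 'Cache-Control', value: 'no-store' }],
    },
  };
}

// Viewer-request handler sketch for /home/variant-b. In a real deployment this
// would be exported as the Lambda handler, e.g. exports.handler = handler;
const handler = async () => redirectToHome();

console.log(redirectToHome().headers.location[0].value); // "/home"
```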

AWS Official Documentation

If you're new to AWS Lambda@Edge, I recommend checking out the links below. They lead to the official AWS documentation for Lambda@Edge, offering a solid starting point.

source: AWS Developer Documentation

Pros of this approach

Runs close to the user at edge locations, reducing latency.
No servers required, helping us scale to millions.
Guaranteed execution on every request, ensuring the user's variant is determined before CloudFront serves the page from its cache.
Eliminates flash of unstyled content (FOUC) since the variant is determined before the page loads.

Drawbacks of this approach

Cost Implications: AWS Lambda@Edge charges for each request and the time it takes to execute the function. However, according to our calculations, the cost is relatively low compared to the benefits it offers. Optimizing the logic and limiting execution to only the routes where an A/B test is active can further reduce costs.
Code Duplication: As discussed above, this approach may lead to some minor code duplication, a trade-off we accept for the benefits it offers. Remember to delete the duplicated code after the A/B test is concluded.

Conclusion

This approach helped our product teams significantly cut down feature launch times, without impacting user experience or revenue. In fact, it helped us arrive at decisions earlier and with more confidence, leading to a better user experience and increased revenue 💪

Fun Fact: AWS Lambda@Edge can be used for more than just A/B testing. Use cases such as internationalization (showing different content based on geolocation), image optimization, and header modification fit perfectly with its capabilities.

That's pretty much it! Kudos if you've made it this far, and I hope this article helps you learn how to implement A/B tests efficiently at scale.

Feel free to comment on this post if you have any questions. I'd be happy to help. In case of any mistakes, please let me know via the Contact page so I can correct them.

If you learned something new from this article, please share this article for some good karma and leave a reaction or a comment below. It would mean a lot to me 😇
