Exploring the Impact of Marketing Strategies: A Deep Dive into EDA and A/B Testing

ABTest

Python

Author

Sujan Bhattarai

In the dynamic world of marketing, understanding customer behavior and optimizing strategies are paramount for success. As data scientists, we play a crucial role in extracting actionable insights from vast datasets. In this blog post, we will explore how exploratory data analysis (EDA) and A/B testing can be employed to enhance marketing campaigns, using a real-world marketing dataset.

Understanding the Dataset

Our dataset, collected from a marketing campaign, comprises various attributes, including user demographics, marketing channels, and conversion outcomes. The initial steps in our analysis involve loading and preprocessing the data to ensure its integrity and suitability for analysis.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats  # for AB testing

# Load the dataset
marketing = pd.read_csv("data/marketing.csv")
print(marketing.head(5))

      user_id date_served  ... subscribing_channel is_retained
0  a100000029      1/1/18  ...           House Ads        True
1  a100000030      1/1/18  ...           House Ads        True
2  a100000031      1/1/18  ...           House Ads        True
3  a100000032      1/1/18  ...           House Ads        True
4  a100000033      1/1/18  ...           House Ads        True

[5 rows x 12 columns]

To facilitate our analysis, we map categorical marketing channels to numeric values. This transformation enables more efficient computations and visualizations.

# Mapping channels to numeric values
channel_dict = {
    "House Ads": 1, "Instagram": 2,
    "Facebook": 3, "Email": 4, "Push": 5
}

# Map the channel to channel code
marketing['channel_code'] = marketing['subscribing_channel'].map(channel_dict)
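
One thing worth checking after a dictionary mapping: `Series.map` returns NaN for any value that is missing from the dictionary, so an unexpected channel name would silently become a missing code. The sketch below uses a toy Series rather than the real data, and the "Twitter" value is hypothetical, included only to trigger that case.

```python
import pandas as pd

channel_dict = {"House Ads": 1, "Instagram": 2, "Facebook": 3, "Email": 4, "Push": 5}

# Toy data; "Twitter" is a hypothetical channel that is not in channel_dict
channels = pd.Series(["Email", "Push", "Twitter", None])
codes = channels.map(channel_dict)

# Values that were present in the input but received no code
unmapped = channels[codes.isna() & channels.notna()].unique().tolist()
print(unmapped)
```

Rows that were already missing in the input are excluded from `unmapped`, so the list only contains labels that the dictionary does not cover.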

Performing Exploratory Data Analysis (EDA)

EDA is a critical step in understanding our data. By calculating key marketing metrics such as total users, retention rates, and conversion rates, we can assess the effectiveness of different marketing channels.

To automate the EDA process, we create a function to calculate conversion rates based on user segments. This function allows us to quickly assess the performance of different marketing strategies.

def conversion_rate(dataframe, column_names):
    #total number of converted users
    column_conv = dataframe[dataframe['converted']== True].groupby(column_names)['user_id'].nunique()
    #total number of users
    column_total= dataframe.groupby(column_names)['user_id'].nunique()
    #conversion rate
    conversion_rate = column_conv/column_total
    #fill missing values with zero
    conversion_rate = conversion_rate.fillna(0)
    return conversion_rate
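
To see what the function does, it may help to run it on a tiny hand-made table. The sketch below repeats the function in compact form so that it runs on its own; the rows in `toy` are invented for illustration and are not taken from the marketing dataset.

```python
import pandas as pd

def conversion_rate(dataframe, column_names):
    # Unique converted users per group divided by unique users per group
    converted = dataframe[dataframe["converted"] == True].groupby(column_names)["user_id"].nunique()
    total = dataframe.groupby(column_names)["user_id"].nunique()
    # Groups with no converted users are missing from `converted`, hence fillna(0)
    return (converted / total).fillna(0)

# Invented rows: user u1 appears twice, and nobody in the 30-36 group converted
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u4"],
    "age_group": ["19-24", "19-24", "19-24", "30-36", "30-36"],
    "converted": [True, True, False, False, False],
})

rates = conversion_rate(toy, ["age_group"])
print(rates)
```

Because the function counts unique `user_id` values, the repeated row for `u1` counts once, and the group without any conversions ends up with a rate of 0 rather than NaN.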

Using this function, we can calculate conversion rates by age group and visualize the results to identify trends in user engagement.

# calculate conversion rate by age group
age_group_conv = conversion_rate(marketing, ['date_served', 'age_group'])

#unstack and create a dataframe
age_group_df = pd.DataFrame(age_group_conv.unstack(level = 1))

#visualize conversion by age group
age_group_df.plot()

# Helper that plots each column of a DataFrame against its index, one chart per column
def plotting_conv(dataframe):
    for column in dataframe:
        # Plot column by dataframe's index
        plt.plot(dataframe.index, dataframe[column])
        plt.title('Daily ' + str(column) + ' Conversion Rate', size=16)
        plt.xlabel('Date', size=14)
        
        # Set x-axis labels to vertical
        plt.xticks(rotation=90)

        # Show plot
        plt.show()
        plt.clf()

# Calculate conversion rate by date served and age group
age_group_conv = conversion_rate(marketing, ['date_served', 'age_group'])

# Unstack age_group_conv and create a DataFrame
age_group_df = pd.DataFrame(age_group_conv.unstack(level=1))

# Plot the results
plotting_conv(age_group_df)

Analyzing the Impact of the Day of the Week

Understanding temporal patterns in marketing effectiveness can provide valuable insights. We examine whether the day of the week affects conversion rates, especially for House Ads, which often see varying performance based on timing.

#calculate conversion rate by date_served and channel
daily_conv_channel = conversion_rate(marketing, ['date_served', 'marketing_channel'])
print(daily_conv_channel)

date_served  marketing_channel
1/1/18       Email                1.000000
             Facebook             0.117647
             House Ads            0.084656
             Instagram            0.106667
             Push                 0.083333
                                    ...   
1/9/18       Email                0.500000
             Facebook             0.120690
             House Ads            0.127389
             Instagram            0.152542
             Push                 0.054054
Name: user_id, Length: 155, dtype: float64

# Parse date_served (month/day/two-digit year) and add a day-of-week column
marketing['date_served'] = pd.to_datetime(marketing['date_served'], format='%m/%d/%y')
marketing['DoW_served'] = marketing['date_served'].dt.dayofweek

#calculate conversion rate by day of week
DoW_conversion = conversion_rate(marketing, ['DoW_served', 'marketing_channel'])

#unstack channels
DoW_df = pd.DataFrame(DoW_conversion.unstack(level = 1))

#plot conversion rate by day of week
DoW_df.plot()
plt.title('Conversion rate by day of week')
plt.ylim(0)

plt.show()
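
When reading the day-of-week chart, note that pandas encodes `dt.dayofweek` as integers with Monday as 0 and Sunday as 6. A small sketch, using three sample date strings in the same month/day/year style as `date_served` (the specific dates are picked only for illustration):

```python
import pandas as pd

# Sample strings in the same style as date_served
dates = pd.Series(["1/1/18", "1/9/18", "1/13/18"])
parsed = pd.to_datetime(dates, format="%m/%d/%y")

summary = pd.DataFrame({
    "date": parsed,
    "dayofweek": parsed.dt.dayofweek,  # 0 = Monday ... 6 = Sunday
    "day_name": parsed.dt.day_name(),  # readable label for the same day
})
print(summary)
```

Using `day_name()` for axis labels can make a day-of-week plot easier to interpret than the raw integer codes.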

Segmenting Users for Deeper Insights

To better understand how language preferences impact conversion, we segment our data by the displayed language and examine conversion rates.

# Isolate the rows where the marketing channel is House Ads
house_ads = marketing[marketing['marketing_channel']== 'House Ads']

#calculate the conversion by date served and language displayed
conv_lang_channel = conversion_rate(house_ads, ['date_served', 'language_displayed'])

#unstack conv_lang_channel
conv_lang_df = pd.DataFrame(conv_lang_channel.unstack(level=1))

# Use the plotting function to display the results
plotting_conv(conv_lang_df)

We can also assess whether users received messages in their preferred language, revealing further insights into potential barriers to conversion.

# Add a column flagging whether the displayed language matched the preferred one.
# Work on an explicit copy so the assignment does not trigger SettingWithCopyWarning.
house_ads = house_ads.copy()
house_ads['is_correct_lang'] = np.where(house_ads['language_preferred'] == house_ads['language_displayed'], 'Yes', 'No')

# Group by date served and language correctness
language_check = house_ads.groupby(['date_served', 'is_correct_lang'])['is_correct_lang'].count()
language_check_df = pd.DataFrame(language_check.unstack(level=1)).fillna(0)

# Calculate the share of users who were shown the correct language
language_check_df['pct'] = language_check_df['Yes'] / language_check_df.sum(axis=1)

# Plot results
plt.plot(language_check_df.index.values, language_check_df['pct'])
plt.title('Correct Language Display Rate Over Time')
plt.xlabel('Date')
# Set x-axis labels to vertical
plt.xticks(rotation=90)

plt.ylabel('Percentage')
plt.show()
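
The group, unstack, and divide pattern used for the correct-language rate can be checked on a toy table. The rows below are invented; the point is to show why `fillna(0)` matters when a day has no "No" rows at all.

```python
import pandas as pd

# Invented rows: every user on the first day saw the correct language
toy = pd.DataFrame({
    "date_served": ["2018-01-10"] * 3 + ["2018-01-11"] * 4,
    "is_correct_lang": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No"],
})

# Count rows per (date, flag), then pivot the flag into columns
counts = toy.groupby(["date_served", "is_correct_lang"])["is_correct_lang"].count()
counts = counts.unstack(level=1).fillna(0)  # a day without any "No" rows gets 0, not NaN

counts["pct"] = counts["Yes"] / (counts["Yes"] + counts["No"])
print(counts)
```

Without the `fillna(0)` step, the first day's "No" count would stay NaN, and a row sum computed with `skipna=False` or an explicit addition like the one here would propagate that NaN into the percentage.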

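The analysis below treats 11 to 31 January 2018 as the period affected by a language-display bug in House Ads, and uses the earlier dates as the pre-error baseline.

# Calculate the pre-error conversion rate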
house_ads_bug = house_ads[house_ads['date_served'] < '2018-01-11']
lang_conv = conversion_rate(house_ads_bug, ['language_displayed'])

#Index other language conversion rate against English
spanish_index = lang_conv['Spanish']/lang_conv['English']
arabic_index = lang_conv['Arabic']/lang_conv['English']
german_index = lang_conv['German']/lang_conv['English']

print("Spanish index:", spanish_index)

Spanish index: 1.681924882629108

print("Arabic index:", arabic_index)

Arabic index: 5.045774647887324

print("German index:", german_index)

German index: 4.485133020344287

# Group house_ads by date and preferred language
converted = house_ads.groupby(['date_served', 'language_preferred']).agg({'user_id': 'nunique', 'converted': 'sum'})

# Unstack converted
converted_df = pd.DataFrame(converted.unstack(level = 1))
# Create English conversion rate column for affected period
converted_df['english_conv_rate'] = converted_df.loc['2018-01-11':'2018-01-31'][('converted','English')]/converted_df.loc['2018-01-11':'2018-01-31'][('user_id','English')]

# Create expected conversion rates for each language
converted_df['expected_spanish_rate'] = converted_df['english_conv_rate']*spanish_index
converted_df['expected_arabic_rate'] = converted_df['english_conv_rate']*arabic_index
converted_df['expected_german_rate'] = converted_df['english_conv_rate']*german_index

# Multiply the number of users by the expected conversion rate.
# The rates are fractions (conversions / users), so rate * users gives expected conversions.
converted_df['expected_spanish_conv'] = converted_df['expected_spanish_rate']*converted_df[('user_id','Spanish')]
converted_df['expected_arabic_conv'] = converted_df['expected_arabic_rate']*converted_df[('user_id','Arabic')]
converted_df['expected_german_conv'] = converted_df['expected_german_rate']*converted_df[('user_id','German')]

# Use .loc to slice only the affected dates
converted = converted_df.loc['2018-01-11':'2018-01-31']

# Sum the expected subscribers for each language
expected_subs = converted['expected_spanish_conv'].sum() + converted['expected_arabic_conv'].sum() + converted['expected_german_conv'].sum()

# Count how many subscribers we actually got
actual_subs = converted[('converted', 'Spanish')].sum()  + converted[('converted', 'Arabic')].sum() + converted[('converted', 'German')].sum()

# Subtract the actual subscribers from the expected ones to estimate the loss
lost_subs = expected_subs - actual_subs
print(lost_subs)
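
As a sanity check on the arithmetic, the expected-subscriber estimate for one language on one day is the English conversion rate times that language's index times the number of users served, with all rates kept as fractions. The sketch below uses made-up inputs (only the Spanish index is rounded from the value printed earlier), so the numbers illustrate the calculation and are not results from the dataset.

```python
# Hypothetical inputs, chosen only to illustrate the arithmetic
english_rate = 0.05       # English conversion rate on a given day, as a fraction
spanish_index = 1.68      # pre-error Spanish rate relative to English (rounded from above)
spanish_users = 200       # hypothetical number of Spanish-preferring users served that day
actual_spanish_subs = 6   # hypothetical number of Spanish subscribers observed that day

# Expected rate if Spanish users had converted at their usual multiple of the English rate
expected_rate = english_rate * spanish_index

# Expected subscribers, and the shortfall against what was observed
expected_subs = expected_rate * spanish_users
lost_subs = expected_subs - actual_spanish_subs

print(round(expected_rate, 4), round(expected_subs, 2), round(lost_subs, 2))
```

A positive `lost_subs` means fewer subscribers than expected; a large negative value would usually suggest a units mismatch, such as mixing fractions and percentages.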

A/B Testing: Evaluating Email Campaigns

To assess the effectiveness of our marketing strategies, we conduct an A/B test on our email campaigns. By segmenting our users into control and personalization groups, we can evaluate which approach yields better conversion rates.

# Subset the DataFrame for email marketing
email = marketing[marketing['marketing_channel'] == 'Email']

# Group the email DataFrame by variant
alloc = email.groupby(['variant'])['user_id'].nunique()

# Plot test allocation
alloc.plot(kind='bar')
plt.title('Email Personalization Test Allocation')
plt.ylabel('# Participants')
plt.show()

# Group email by user_id and variant
subscribers = email.groupby(['user_id', 'variant'])['converted'].max()
subscribers_df = pd.DataFrame(subscribers.unstack(level=1))

# Calculate conversion rates
control = subscribers_df['control'].dropna()
personalization = subscribers_df['personalization'].dropna()

print('Control conversion rate:', np.mean(control))

Control conversion rate: 0.2814814814814815

print('Personalization conversion rate:', np.mean(personalization))

Personalization conversion rate: 0.3908450704225352

To evaluate the effectiveness of personalization, we define a function to calculate the lift in conversion rates between the two groups.

def lift(a, b):
    # Calculate the mean conversion rate of each group
    a_mean = np.mean(a)
    b_mean = np.mean(b)

    # Lift is the relative change of b over a
    lift_value = (b_mean - a_mean) / a_mean

    return str(round(lift_value * 100, 2)) + '%'

# Print lift() with control and personalization as inputs
print(lift(control, personalization))

38.85%
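
A lift figure on its own does not say whether the difference could be due to chance. One simple check is a pooled two-proportion z-test, sketched below with the standard library only. This is a different test from the t-test used in the segmented analysis, and the counts are an assumption: 76 of 270 and 111 of 284 are inferred from the conversion rates printed above, not read from the data.

```python
import math

# Assumed counts, inferred from the reported rates (76/270 ≈ 0.2815, 111/284 ≈ 0.3908)
control_conv, control_n = 76, 270
pers_conv, pers_n = 111, 284

p_control = control_conv / control_n
p_pers = pers_conv / pers_n
lift_value = (p_pers - p_control) / p_control

# Pooled two-proportion z-test under the null hypothesis of equal rates
p_pool = (control_conv + pers_conv) / (control_n + pers_n)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / pers_n))
z = (p_pers - p_control) / se

# Two-sided p-value from the normal approximation
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"lift={lift_value:.2%}, z={z:.2f}, p={p_value:.4f}")
```

With group sizes in the hundreds, the normal approximation is usually reasonable; for very small segments an exact test would be a safer choice.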

Segmenting for Targeted Insights in A/B Testing

Finally, we can perform segmented A/B testing based on various user demographics, such as language displayed and age group, allowing us to uncover nuanced insights into user behavior.

def ab_segmentation(segment):
    for subsegment in np.unique(marketing[segment].values):
        print(f'Segment: {subsegment}')
        
        email = marketing[(marketing['marketing_channel'] == 'Email') & (marketing[segment] == subsegment)]
        email.loc[:, 'converted'] = pd.to_numeric(email['converted'], errors='coerce')
        
        subscribers = email.groupby(['user_id', 'variant'])['converted'].max()
        subscribers = pd.DataFrame(subscribers.unstack(level=1))
        
        control = pd.to_numeric(subscribers['control'], errors='coerce').dropna()
        personalization = pd.to_numeric(subscribers['personalization'], errors='coerce').dropna()

        if len(control) > 0 and len(personalization) > 0:
            print('Lift:', lift(control, personalization))
            print('T-statistic:', stats.ttest_ind(control, personalization), '\n\n')
        else:
            print('Not enough data to perform t-test for', subsegment)

# Segmenting based on language displayed and age group
ab_segmentation('language_displayed')

Segment: Arabic
Lift: 50.0%
T-statistic: TtestResult(statistic=np.float64(-0.5773502691896255), pvalue=np.float64(0.5795840000000001), df=np.float64(8.0)) 


Segment: English
Lift: 39.0%
T-statistic: TtestResult(statistic=np.float64(-2.2183598646203215), pvalue=np.float64(0.026991701290720503), df=np.float64(486.0)) 


Segment: German
Lift: -1.62%
T-statistic: TtestResult(statistic=np.float64(0.19100834180787182), pvalue=np.float64(0.8494394170062678), df=np.float64(42.0)) 


Segment: Spanish
Lift: 166.67%
/Users/sujanbhattarai/.virtualenvs/r-reticulate/lib/python3.9/site-packages/scipy/stats/_axis_nan_policy.py:531: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.
  res = hypotest_fun_out(*samples, **kwds)
T-statistic: TtestResult(statistic=np.float64(-2.3570226039551585), pvalue=np.float64(0.04015671811047753), df=np.float64(10.0))

ab_segmentation('age_group')

Segment: 0-18 years
Lift: 121.4%
T-statistic: TtestResult(statistic=np.float64(-2.966044912142212), pvalue=np.float64(0.003872449439129706), df=np.float64(89.0)) 


Segment: 19-24 years
Lift: 106.24%
T-statistic: TtestResult(statistic=np.float64(-3.0317943847866697), pvalue=np.float64(0.0030623836114689195), df=np.float64(105.0)) 


Segment: 24-30 years
Lift: 161.19%
T-statistic: TtestResult(statistic=np.float64(-3.861539544326876), pvalue=np.float64(0.00018743381094867337), df=np.float64(114.0)) 


Segment: 30-36 years
Lift: -100.0%
T-statistic: TtestResult(statistic=np.float64(3.185906464414798), pvalue=np.float64(0.002323848743176535), df=np.float64(58.0)) 


Segment: 36-45 years
Lift: -85.23%
T-statistic: TtestResult(statistic=np.float64(2.431790127931851), pvalue=np.float64(0.017975686009788244), df=np.float64(61.0)) 


Segment: 45-55 years
Lift: -72.22%
T-statistic: TtestResult(statistic=np.float64(2.0654991273179326), pvalue=np.float64(0.04306233968820124), df=np.float64(62.0)) 


Segment: 55+ years
Lift: -100.0%
T-statistic: TtestResult(statistic=np.float64(3.326565456420339), pvalue=np.float64(0.0016358623456360468), df=np.float64(51.0))
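
Running one test per segment means several comparisons at once, so a few small p-values can appear by chance alone. One common and conservative guard, which is not part of the analysis above, is a Bonferroni correction: divide the significance level by the number of tests. The sketch below applies it to the age-group p-values printed above (rounded to six decimals), assuming a significance level of 0.05.

```python
# p-values copied (rounded) from the age-group output above
p_values = {
    "0-18 years": 0.003872,
    "19-24 years": 0.003062,
    "24-30 years": 0.000187,
    "30-36 years": 0.002324,
    "36-45 years": 0.017976,
    "45-55 years": 0.043062,
    "55+ years": 0.001636,
}

alpha = 0.05                          # assumed overall significance level
threshold = alpha / len(p_values)     # Bonferroni-adjusted per-test threshold

# Segments whose p-value stays below the adjusted threshold
still_significant = [seg for seg, p in p_values.items() if p < threshold]

print(round(threshold, 5))
print(still_significant)
```

Segments that pass the unadjusted 0.05 level but not the adjusted threshold are better treated as suggestive than as confirmed effects, especially when the segment is small.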

Conclusion

Through EDA and A/B testing, data scientists can derive meaningful insights from marketing datasets, informing strategic decisions and optimizing marketing campaigns. By leveraging data analysis, we empower organizations to better understand their audiences and enhance their marketing efforts. This comprehensive approach not only drives engagement but also boosts conversion rates, ultimately leading to a more effective marketing strategy.

In an era where data is king, the role of the data scientist is not just to analyze data but to transform insights into actionable strategies that propel businesses forward.