Statistical Data Analysis in Cross-Cultural Research

Mon, Mar 22, 2021 8 min read

This post uses survey data collected from Amazon Mechanical Turk and Reddit user groups (all personal data have been removed) for a study on how cultural localization affects web-based account creation among American and Korean users. I use the experiment data to demonstrate basic statistical tests in Python.

Research Question:

Is there a difference in how much personal information American and Korean Internet users provide
in two different use scenarios: online banking and shopping?

I use the following tests:

Pearson Correlation Coefficient
T-Test
Mann-Whitney Test
One-Way Analysis of Variance (ANOVA)
Two-Way ANOVA

import os
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats  # import the stats submodule explicitly; `import scipy` alone may not load it
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols
import statsmodels.formula.api as smf
import statsmodels.api as sm
from statsmodels.stats.anova import AnovaRM
import pdb  # for debugging
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# set color
sns.set_color_codes('pastel')

Setup & Querying Data

Before analyzing the data, it is critical to understand the structure of the dataframe. Long-format data is usually desired (or at least I’m used to it) when using Python and Seaborn for data visualization. In long format, each variable is a column and each observation or event is a row. Below, we read in and query the data.

Useful commands:

df.head(): by default, shows first five rows of df
df.columns: lists all the columns in df (an attribute, so no parentheses)
df.describe(): provides summary description of df

pd.read_csv(data, usecols=['col1', 'col2', ...,]): can be used to filter columns
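To make the long-format idea concrete, here is a minimal sketch that reshapes a small wide table into long format and then runs the commands listed above. The table and its values are made up for illustration (they are not the study's data); only the column names are borrowed from the dataframe used in this post.

```python
import pandas as pd

# Hypothetical wide-format table: one row per user, one column per scenario.
wide = pd.DataFrame({
    "UserGuid": [0, 1, 2],
    "culture": ["USA", "USA", "Korea"],
    "Bank": [0.52, 0.04, 0.33],
    "Shop": [0.21, 0.00, 0.25],
})

# Reshape to long format: one row per (user, scenario) observation.
long_df = wide.melt(
    id_vars=["UserGuid", "culture"],
    value_vars=["Bank", "Shop"],
    var_name="scenario",
    value_name="percent",
)

print(long_df.head())         # first five rows
print(list(long_df.columns))  # column names (an attribute, not a method)
print(long_df.describe())     # summary statistics for the numeric columns
```

Each user now appears once per scenario, which is the shape that the queries below rely on.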

# read in data.csv file as df & see data structure
df = pd.read_csv('data.csv')

# query data by scenario and culture
bank = df.query("scenario == 'Bank'").copy()
shop = df.query("scenario == 'Shop'").copy()
kor = df.query("culture == 'Korea'").copy()
usa = df.query("culture == 'USA'").copy()

# an example of the data structure
usa.head()

	UserGuid	culture	scenario	interface	complete	first	last	phone	dob	...	address	password	username	reason	total	total_possible	percent
0	0	USA	Bank	A	-	1	1	1	2	...	3	3	1	-	14	27	0.518519
1	0	USA	Shop	A	-	0	0	0	0	...	0	3	1	-	5	24	0.208333
2	0	USA	Bank	B	-	1	1	1	2	...	3	3	1	-	14	27	0.518519
3	0	USA	Shop	B	-	0	0	0	0	...	0	3	1	-	5	24	0.208333
4	1	USA	Bank	A	-	1	0	0	0	...	0	0	0	-	1	27	0.037037

5 rows × 24 columns

1. Pearson Correlation Coefficient

When we want to ask “how strongly correlated are the two variables?”, we can use Pearson’s Correlation. It measures the strength of the linear relationship between two continuous variables. The coefficient value “r” ranges from -1 (perfectly negative) to 1 (perfectly positive). A value of 0 means that there is no linear relationship at all.

Properties of Pearson Correlation

The units of the values do not affect the Pearson Correlation.
- i.e. Changing the unit of value from cm to inches does not affect the r value
The correlation between the two variables is symmetric:
- i.e. the correlation of A with B equals the correlation of B with A

** Use Spearman’s Correlation when the relationship between the two variables is monotonic but non-linear (e.g. a curve instead of a straight line).
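The properties above can be checked with a short sketch. The data here is synthetic (random heights and an exponential curve chosen only for illustration), and the variable names are my own.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Synthetic heights in cm and a noisy linear response (illustrative only).
height_cm = rng.normal(170, 10, size=200)
weight = 0.5 * height_cm + rng.normal(0, 5, size=200)

r_cm, _ = stats.pearsonr(height_cm, weight)
r_in, _ = stats.pearsonr(height_cm / 2.54, weight)  # same data in inches
r_swapped, _ = stats.pearsonr(weight, height_cm)    # arguments swapped

print(r_cm, r_in, r_swapped)  # the three values agree up to rounding error

# A monotonic but curved relationship: y grows exponentially with x.
x = np.linspace(0, 5, 50)
y = np.exp(x)
r_curve, _ = stats.pearsonr(x, y)
rho_curve, _ = stats.spearmanr(x, y)
print(r_curve, rho_curve)  # Spearman's rho is 1 because the ranks match exactly
```

Pearson's r understates the curved relationship because it only measures how well a straight line fits, while Spearman's rho works on ranks.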

Code Implementation

We use the scipy package to calculate the Pearson Correlation. The function returns two values: r and the p-value.

# let's look at the correlation of information provided by different scenarios: online banking vs. shopping
# bank['percent'] will return an array of percentage values

r, p = scipy.stats.pearsonr(bank['percent'], shop['percent'])  
print('r: ' + str(r.round(4)))
print('p: ' + str(p.round(4)))

r: 0.7592
p: 0.0

From the results above, we can see there is a strong positive relationship between the amount of information provided in banking and in shopping, i.e. users who provide more personal information in banking also tend to provide more in shopping. Note that a correlation alone does not show that one causes the other.
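Pearson's correlation compares values pair by pair, so the two arrays need to line up by user. A minimal sketch of one way to make that alignment explicit is shown here, using a made-up long-format table whose column names mirror the dataframe in this post. The real data also has an interface column, which would need to be part of the index; that is left out here to keep the sketch small.

```python
import pandas as pd
from scipy import stats

# Synthetic long-format rows; column names mirror the table shown earlier.
toy = pd.DataFrame({
    "UserGuid": [0, 0, 1, 1, 2, 2, 3, 3],
    "scenario": ["Bank", "Shop"] * 4,
    "percent":  [0.52, 0.21, 0.04, 0.00, 0.33, 0.25, 0.70, 0.40],
})

# One row per user, one column per scenario, so each pair is aligned by user.
paired = toy.pivot(index="UserGuid", columns="scenario", values="percent")

r, p = stats.pearsonr(paired["Bank"], paired["Shop"])
print(paired)
print(f"r: {r:.4f}, p: {p:.4f}")
```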

2. T-Test

When comparing the means of two groups, we can use a t-test. It takes into account the means and the spread of the data to determine whether the difference between the two could plausibly have occurred by chance (usually judged by whether the p-value is less than 0.05). In a t-test, there should be one independent variable (categorical/nominal) with exactly two groups and one continuous dependent variable.

Properties of t-test

1. The data is assumed to be normally distributed (if the distribution is skewed, use the Mann-Whitney test).
2. The t-test yields a t-value and a p-value:
2a. The higher the absolute value of t, the more difference there is between the two groups. The closer t is to 0, the more similar the two groups are.
2b. A t-value of 2 means the difference between the group means is twice as large as its standard error (a measure of the variability within the groups).
2c. The lower the p-value, the stronger the evidence that the difference did not occur by chance. A p-value of 0.05 means that, if there were no real difference between the groups, a difference at least this large would be observed about 5 percent of the time.
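To see where the t-value comes from, here is a minimal sketch that computes it by hand for two synthetic groups and compares it with scipy's result. The groups are random numbers generated for illustration, and the pooled-variance formula matches scipy's default setting (equal_var=True).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two synthetic groups with different means (illustrative only).
group_a = rng.normal(0.50, 0.2, size=100)
group_b = rng.normal(0.40, 0.2, size=100)

n_a, n_b = len(group_a), len(group_b)
mean_diff = group_a.mean() - group_b.mean()

# Pooled variance: within-group spread, weighted by degrees of freedom.
pooled_var = ((n_a - 1) * group_a.var(ddof=1) +
              (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)
std_err = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

t_manual = mean_diff / std_err
t_scipy, p_scipy = stats.ttest_ind(group_a, group_b)  # equal_var=True by default

print(f"manual t: {t_manual:.4f}")
print(f"scipy  t: {t_scipy:.4f}, p: {p_scipy:.6f}")
```

The sign of t follows the order of the arguments: a positive value means the first group has the larger mean.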

Code Implementation

We use the scipy package again to run a t-test. Before we decide which test to run, we can quickly plot the data and look at the distribution like below.

sns.distplot(df[df['scenario'] == 'Bank'].percent)

[Figure: distribution plot of percent for the banking scenario]

The distribution looks relatively normal. We can run a t-test to see whether there is a difference between the total amount of information provided by the users from each use scenario: i.e. banking vs. shopping

# we run a t-test to see whether there is a difference in the amount of information provided in each scenario
t, p = scipy.stats.ttest_ind(df[df['scenario'] == 'Bank'].percent, df[df['scenario'] == 'Shop'].percent)
print('t: ' + str(t.round(4)))
print('p: ' + str(p.round(6)))

t: 4.8203
p: 2e-06

The result above shows that there is a significant difference in the amount of information provided between the two use scenarios: the t-value is high and the p-value is very small. The positive t indicates that the first group passed in (banking) has the larger mean, but the t-test by itself says little about how the two distributions differ.

To find out, we can create a little fancy distribution plot with some box plots:

banking = df[df['scenario'] == 'Bank'].percent
shopping = df[df['scenario'] == 'Shop'].percent

# let's plot box-dist plot combined
f, (ax_box1, ax_box2, ax_dist) = plt.subplots(3, sharex=True,
                                              gridspec_kw= {"height_ratios": (0.3, 0.3, 1)})

# add boxplots at the top
sns.boxplot(x=banking, ax=ax_box1, color='g')  # pass the data as x= so the box is drawn horizontally
sns.boxplot(x=shopping, ax=ax_box2, color='m')
ax_box1.axvline(np.mean(banking), color='g', linestyle='--')
ax_box2.axvline(np.mean(shopping), color='m', linestyle='--')
plt.subplots_adjust(top=0.87)
plt.suptitle('Amount of information provided by use scenario', fontsize = 17)

# add distplots below
sns.distplot(banking, ax=ax_dist, label='Banking', kde=True, rug=True, color='g', norm_hist=True, bins=2)
sns.distplot(shopping, ax=ax_dist, label='Shopping', kde=True, rug=True, color='m', norm_hist=True, bins=2)

ax_dist.axvline(np.mean(banking), color='g', linestyle='--')
ax_dist.axvline(np.mean(shopping), color='m', linestyle='--')
plt.legend()
plt.xlabel('Percentage of information', fontsize=16)
ax_box1.set(xlabel='')
ax_box2.set(xlabel='')

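[Figure: Amount of information provided by use scenario. Box plots for banking and shopping above overlaid distribution plots; x-axis: Percentage of information]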

From the graph above, we see that the mean for banking is greater than the mean for shopping. This suggests that, across both cultural groups combined, users are more likely to provide personal information in the banking scenario.

3. Mann-Whitney Test

The Mann-Whitney Test allows you to determine whether the observed difference between two groups is statistically significant without assuming that the values are normally distributed. You should have one independent variable with two groups and one continuous dependent variable.

Code Implementation

We can run the test on the same banking vs. shopping scenario.

# mannwhitneyu returns the U statistic (not a t-value) and the p-value
u, p = scipy.stats.mannwhitneyu(df[df['scenario'] == 'Bank'].percent, df[df['scenario'] == 'Shop'].percent)
print('U: ' + str(u.round(4)))
print('p: ' + str(p.round(6)))

U: 14795.5
p: 4.1e-05
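The statistic returned by mannwhitneyu is usually called U. A minimal sketch of what it counts is shown here on synthetic, skewed samples (not the study's data): U for the first sample is the number of pairs in which a value from the first sample exceeds a value from the second, with ties counted as one half. As far as I can tell this matches what scipy returns when alternative='two-sided' is passed explicitly; default behavior has differed between scipy versions, so I set it by hand.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Skewed synthetic samples (exponential), where a t-test is less appropriate.
x = rng.exponential(1.0, size=30)
y = rng.exponential(1.5, size=40)

# U for x: number of (x_i, y_j) pairs with x_i > y_j, counting ties as 0.5.
diff = x[:, None] - y[None, :]
u_manual = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)

u_scipy, p = stats.mannwhitneyu(x, y, alternative="two-sided")
print(f"manual U: {u_manual}, scipy U: {u_scipy}, p: {p:.4f}")
```

Because U is built from pairwise comparisons rather than means, its size depends on the sample sizes and is not comparable to a t-value.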

4. One-Way Analysis of Variance (ANOVA)

ANOVA is similar to a t-test, but it is used when the independent (categorical) variable has three or more groups. It assumes a normal distribution (use the Kruskal-Wallis test if the data are not normally distributed). One-way ANOVA compares the means of the groups to test whether the difference is statistically significant. However, it does not tell you which specific groups were statistically different from one another. Thus, a post-hoc analysis is required.

Code Implementation

The result below suggests that there is a statistically significant difference among the means of the three groups.

# we create a third group, and compare banking, shopping, and var3 with one-way ANOVA
var3 = df[df['culture'] == 'USA'].percent
scipy.stats.f_oneway(banking, shopping, var3)

F_onewayResult(statistic=11.171874914065159, pvalue=1.7072783704546878e-05)
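Since one-way ANOVA does not say which groups differ, here is a minimal post-hoc sketch on three synthetic groups (made up for illustration, with group names of my own). It runs the ANOVA and the Kruskal-Wallis alternative, then follows up with pairwise t-tests and a Bonferroni correction, which is one simple post-hoc approach among several.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Three synthetic groups; only group "c" has a clearly different mean.
groups = {
    "a": rng.normal(0.40, 0.1, size=60),
    "b": rng.normal(0.40, 0.1, size=60),
    "c": rng.normal(0.60, 0.1, size=60),
}

f_stat, p_anova = stats.f_oneway(*groups.values())
h_stat, p_kruskal = stats.kruskal(*groups.values())  # rank-based alternative
print(f"ANOVA p: {p_anova:.2e}, Kruskal-Wallis p: {p_kruskal:.2e}")

# Post-hoc: pairwise t-tests with a Bonferroni correction.
pairs = list(combinations(groups, 2))
adjusted = {}
for name_1, name_2 in pairs:
    _, p_raw = stats.ttest_ind(groups[name_1], groups[name_2])
    # Multiply by the number of comparisons and cap at 1.
    adjusted[(name_1, name_2)] = min(p_raw * len(pairs), 1.0)
    print(name_1, name_2, f"adjusted p: {adjusted[(name_1, name_2)]:.2e}")
```

The Bonferroni correction is conservative; with many groups a dedicated procedure such as Tukey's HSD may be a better fit.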

5. Two-Way ANOVA

A two-way ANOVA can be used when you want to know how two independent variables have an interaction effect on a dependent variable. CAVEAT: a two-way ANOVA does not tell which variable is dominant.

Code Implementation

In the code below, we test whether there is an interaction effect between culture and use scenario on the total amount of information provided. For example, would Americans be more willing to provide personal information than Koreans? If so, does the use case (banking vs. shopping) change that difference?

# we pass a formula string naming each variable and the interaction term 'culture:scenario'

model = ols('percent ~ culture + scenario + culture:scenario', data=df).fit()
sm.stats.anova_lm(model, typ=2)

	sum_sq	df	F	PR(>F)
culture	0.000344	1.0	0.007439	0.931312
scenario	1.070130	1.0	23.159298	0.000002
culture:scenario	0.032834	1.0	0.710576	0.399772
Residual	17.928461	388.0	NaN	NaN
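The ANOVA table reports which effects are significant but not their direction. A minimal sketch of how the cell means could be inspected is shown here with a tiny made-up table (not the study's data) that reuses the column names culture, scenario, and percent.

```python
import pandas as pd

# Synthetic long-format data (not the study's data); column names mirror df.
toy = pd.DataFrame({
    "culture":  ["USA", "USA", "USA", "USA", "Korea", "Korea", "Korea", "Korea"],
    "scenario": ["Bank", "Bank", "Shop", "Shop", "Bank", "Bank", "Shop", "Shop"],
    "percent":  [0.50, 0.54, 0.20, 0.24, 0.48, 0.52, 0.30, 0.34],
})

# Mean percent for every culture x scenario cell.
cell_means = toy.pivot_table(index="culture", columns="scenario",
                             values="percent", aggfunc="mean")
print(cell_means)

# Bank-minus-Shop gap within each culture.
# Similar gaps suggest little interaction; very different gaps suggest one.
gap = cell_means["Bank"] - cell_means["Shop"]
print(gap)
```

Running the same pivot on the real dataframe would show which scenario has the higher mean within each culture.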

Conclusion

From the table above, only scenario has a significant main effect on the total amount of information provided (depicted as percent in the dataframe). Culture and the interaction of culture and scenario do not have a significant effect on the amount of information that users provided.

This finding matches the previous t-test and graph results, where users provided more information in the banking scenario than in the shopping scenario.

Jin Jeon

UX Researcher
