At the time of this experiment, Udacity courses currently have two options on the home page: "start free trial", and "access course materials". If the student clicks "start free trial", they will be asked to enter their credit card information, and then they will be enrolled in a free trial for the paid version of the course. After 14 days, they will automatically be charged unless they cancel first. If the student clicks "access course materials", they will be able to view the videos and take the quizzes for free, but they will not receive coaching support or a verified certificate, and they will not submit their final project for feedback.
In the experiment, Udacity tested a change where if the student clicked "start free trial", they were asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free. At this point, the student would have the option to continue enrolling in the free trial, or access the course materials for free instead. This screenshot shows what the experiment looks like.
The hypothesis was that this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time—without significantly reducing the number of students to continue past the free trial and eventually complete the course. If this hypothesis held true, Udacity could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.
The unit of diversion is a cookie, although if the student enrolls in the free trial, they are tracked by user-id from that point forward. The same user-id cannot enroll in the free trial twice. For users that do not enroll, their user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page.
We have two initial hypotheses.
Unit of diversion: cookie
A sanity check is useful for confirming that the experiment and control groups were populated in the same way. This is done with invariant metrics: metrics that should not differ between the groups, because they are measured before the experiment takes effect. We use three of them.
The number of cookies that view the course overview page should be the same in both groups, because at that point visitors have not yet clicked the "Start free trial" button or seen the "Free Trial Screener". So the number of cookies can be used as an invariant metric. When users click the button, they still have not seen the screener, so the number of clicks should not differ between the experiment and control groups either.
Since the experiment only takes effect after users click the "Start free trial" button, the click-through-probability must also be the same in the experiment and control groups. If the number of cookies and the number of clicks are the same, then the click-through-probability is the same as well.
Besides the cookie, there is also the user-id. But the number of user-ids is not a good invariant metric, because Udacity lets unregistered visitors view the page; a user-id only comes into play after the button is clicked.
For evaluation metrics, I choose Gross Conversion, Retention, and Net Conversion. All of these are good evaluation metrics, since they are expected to change when the experiment changes. Gross Conversion and Net Conversion also use the number of unique cookies as their denominator, which is the unit of diversion, so their variability should be low and the analytical standard error a reliable estimate.
Gross conversion is the number of user-ids to complete checkout and enroll in the free trial divided by the number of unique cookies to click the "Start free trial" button. After visitors click the button, the experiment group sees the screener and its warning. This should make visitors without a serious time commitment back out before enrolling.
Retention is the number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of user-ids to complete checkout. The experiment intends to leave only the visitors who are willing to make a serious commitment, so the retention rate should be higher for the experiment group than for the control group.
Net conversion is the number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button. Net Conversion is also relevant, since it shows whether the rate of students who continue (make at least one payment) in the experiment group holds up against control users, who click the button but never see the warning.
Out of these metrics, Retention turns out to need a much longer duration: 118 days. That is too long, and not something Udacity is willing to spend on the experiment, so Retention is excluded.
The first part of the hypothesis is what Gross Conversion measures: after the experiment, we expect Gross Conversion to show a significant reduction in the number of students who enroll and then leave the trial because they lack the time. The experiment group should be significantly different from the control group. The second part is what Net Conversion measures: we expect the number of students who make at least one payment not to drop significantly. The experiment group should not be significantly different from the control group.
We expect the analytical variance to match the empirical variance because the unit of analysis and the unit of diversion are the same.
To calculate standard deviation, we use this formula
Formula = np.sqrt(p * (1-p) / n)
and using baseline data below.
baselines = """Unique cookies to view page per day: 40000
Unique cookies to click "Start free trial" per day: 3200
Enrollments per day: 660
Click-through-probability on "Start free trial": 0.08
Probability of enrolling, given click: 0.20625
Probability of payment, given enroll: 0.53
Probability of payment, given click: 0.1093125"""
lines = baselines.split('\n')
# Split each line on its last ': ' into a metric name and a numeric value
d_baseline = dict((name, float(value)) for name, value in (e.rsplit(': ', 1) for e in lines))
Since we have a sample of 5000 cookies instead of the original 40000, we scale the counts accordingly using the click-through-probability. Both evaluation metrics need the number of users who click the "Start free trial" button, which is calculated as
n = 5000
n_click = n * d_baseline['Click-through-probability on "Start free trial"']
n_click
Next, standard deviation for Gross conversion is
import numpy as np

p = d_baseline['Probability of enrolling, given click']
round(np.sqrt(p * (1-p) / n_click), 4)
and for Net Conversion,
p = d_baseline['Probability of payment, given click']
round(np.sqrt(p * (1-p) / n_click), 4)
For Gross Conversion and Net Conversion, the empirical variance should approximate the analytical variance, because the unit of analysis and the unit of diversion are the same: the cookie.
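The calculation above can also be collected into one small, self-contained sketch. It only restates the baseline values quoted earlier; the function name `binomial_sd` and the variable names are illustrative choices, not part of the original notebook.

```python
import math

def binomial_sd(p, n):
    """Analytical standard deviation of a proportion p estimated from n trials."""
    return math.sqrt(p * (1 - p) / n)

# Baseline values quoted above
ctp = 0.08             # click-through-probability on "Start free trial"
p_gross = 0.20625      # probability of enrolling, given click
p_net = 0.1093125      # probability of payment, given click

sample_cookies = 5000
sample_clicks = sample_cookies * ctp   # expected clicks in the 5000-cookie sample

sd_gross = round(binomial_sd(p_gross, sample_clicks), 4)
sd_net = round(binomial_sd(p_net, sample_clicks), 4)
print(sd_gross, sd_net)
```

Both metrics share the same denominator (clicks), so the only thing that differs between the two calls is the baseline probability.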
The number of pageviews needed is 685275.
We feed the baseline values into a sample size calculator.
We use the larger of the resulting sample sizes, so that the minimum number of cookies is sufficient for both metrics. The sample size is for one group only, so the output of the calculator must be doubled to cover both groups. Since it counts only users who click, we convert it to pageviews using the click-through-probability. The pageviews needed are then:
(27411 * 2) / d_baseline['Click-through-probability on "Start free trial"']
The fraction of Udacity traffic exposed to the experiment will be 80%. The experiment is not risky enough to hold back more traffic, and the change is unlikely to leak as a blog post or news article: it is not big news, since Udacity only adds a small warning for users. With 40000 pageviews available each day, the duration will be 22 days.
This is where the Retention metric fails as an evaluation metric. It needs a longer duration, 118 days. That is too long, and not something Udacity is willing to spend on the experiment, so Retention is excluded.
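As a rough check of the arithmetic behind the pageview count and the duration, here is a minimal sketch. It assumes the per-group sample size of 27411 clicks used above, the 80% exposure, and the 40000 daily pageviews from the baseline data; the variable names are illustrative.

```python
import math

clicks_per_group = 27411    # per-group sample size used above
ctp = 0.08                  # click-through-probability on "Start free trial"
daily_pageviews = 40000     # unique cookies viewing the page per day
exposure = 0.8              # fraction of traffic diverted to the experiment

# Two groups, and only a fraction `ctp` of pageviews turn into clicks
pageviews_needed = round(clicks_per_group * 2 / ctp)

# Partial days count as full days
days_needed = math.ceil(pageviews_needed / (daily_pageviews * exposure))

print(pageviews_needed, days_needed)
```

Lowering `exposure` stretches the duration proportionally, which is the trade-off being made when choosing the fraction of traffic.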
Number of Cookies:
Number of clicks on “Start free trial”:
Click-through-probability on “Start free trial”:
Since we have passed all of the sanity checks, we can continue to analyze the experiment.
We run sanity checks to ensure that the experiment and control groups are comparable. They use metrics that shouldn't change when the experiment changes, namely the invariant metrics we chose earlier. First, let's look at the data we want to analyze for both the control and experiment groups.
import pandas as pd

control = pd.read_csv('control_data.csv')
experiment = pd.read_csv('experiment.csv')
Control group:
| | Date | Pageviews | Clicks | Enrollments | Payments |
|---|---|---|---|---|---|
|0|Sat, Oct 11|7723|687|134|70|
|1|Sun, Oct 12|9102|779|147|70|
|2|Mon, Oct 13|10511|909|167|95|
|3|Tue, Oct 14|9871|836|156|105|
|4|Wed, Oct 15|10014|837|163|64|

Experiment group:
| | Date | Pageviews | Clicks | Enrollments | Payments |
|---|---|---|---|---|---|
|0|Sat, Oct 11|7716|686|105|34|
|1|Sun, Oct 12|9288|785|116|91|
|2|Mon, Oct 13|10480|884|145|79|
|3|Tue, Oct 14|9867|827|138|92|
|4|Wed, Oct 15|9793|832|140|94|
Next, we count the total views and clicks for both control and experiment groups.
control_views = control.Pageviews.sum()
control_clicks = control.Clicks.sum()
experiment_views = experiment.Pageviews.sum()
experiment_clicks = experiment.Clicks.sum()
For counts such as the number of cookies and the number of clicks on the "Start free trial" button, we build a confidence interval around the fraction we expect in the control group and compare it with the actual fraction as the observed outcome. Since we expect the control and experiment groups to have equal proportions, we set the expected proportion to 0.5. For both count metrics, the sanity-check confidence interval is computed with the function below.
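def sanity_check_CI(control, experiment, expected):
    # Standard error of the expected proportion over the combined count
    SE = np.sqrt((expected*(1-expected))/(control + experiment))
    # Margin of error at 95% confidence (z = 1.96)
    ME = 1.96 * SE
    return (expected-ME, expected+ME)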
Now for the sanity-check confidence interval of the number of cookies that view the page,
The actual proportion is
Since 0.5006 is within the interval, the experiment passes the sanity check for the number of cookies.
Next, we calculate the confidence interval for the number of clicks on the "Start free trial" button.
And the actual proportion,
Again, 0.5006 is within the interval, so our experiment also passes this sanity check.
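The count-based check described above can be sketched as one self-contained function that returns both the interval and the pass/fail verdict. The function name `count_sanity_check` is hypothetical, and the counts in the example are placeholders chosen for illustration, not the totals from this experiment.

```python
import math

def count_sanity_check(control_count, experiment_count, expected=0.5, z=1.96):
    """Return (lower, upper, observed, passed) for a count-based invariant metric."""
    total = control_count + experiment_count
    se = math.sqrt(expected * (1 - expected) / total)
    margin = z * se
    lower, upper = expected - margin, expected + margin
    observed = control_count / total   # share of the total that landed in control
    return lower, upper, observed, lower <= observed <= upper

# Placeholder counts for illustration only (assumed values, not the real totals)
lower, upper, observed, passed = count_sanity_check(10000, 10100)
print(round(lower, 4), round(upper, 4), round(observed, 4), passed)
```

With the real data, the two arguments would be the summed pageviews (or clicks) of the control and experiment groups.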
The sanity check for the click-through-probability (ctp) uses a slightly different calculation. For the simple counts earlier, we knew that if the experiment was set up properly, the true proportion in the control group should be 0.5. Since we don't know the true ctp of the control group, we build a confidence interval around the control group's ctp and treat the experiment group's ctp as the observed outcome. If the experiment group's ctp falls outside the control group's confidence interval, the experiment fails the sanity check and we can't continue the analysis.
ctp_control = float(control_clicks)/control_views
ctp_experiment = float(experiment_clicks)/experiment_views
# %%R
c = 28378
n = 345543
CL = 0.95
pe = c/n
SE = sqrt(pe*(1-pe)/n)
z_star = round(qnorm((1-CL)/2,lower.tail=F),digits=2)
ME = z_star * SE
c(pe-ME, pe+ME)
As you can see, the click-through-probability of the experiment group is within the confidence interval of the control group's click-through-probability. Since we have passed all of the sanity checks, we can continue to analyze the experiment.
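For readers who prefer to stay in Python, the same interval can be sketched without the R cell. This is a port under the same assumptions (control clicks 28378, control pageviews 345543, z = 1.96); the function name `ctp_interval` is illustrative.

```python
import math

def ctp_interval(clicks, views, z=1.96):
    """Confidence interval around an observed click-through-probability."""
    p = clicks / views
    se = math.sqrt(p * (1 - p) / views)
    margin = z * se
    return p - margin, p + margin

# Control-group totals taken from the R cell above
lower, upper = ctp_interval(28378, 345543)
print(round(lower, 4), round(upper, 4))

# The experiment passes if its own ctp lies inside this interval, e.g.
# lower <= ctp_experiment <= upper, with ctp_experiment computed from the data.
```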
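# Restrict numerator and denominator to the same days (rows with enrollment/payment data)
get_gross = lambda group: float(group.dropna().Enrollments.sum()) / group.dropna().Clicks.sum()
get_net = lambda group: float(group.dropna().Payments.sum()) / group.dropna().Clicks.sum()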
Keep in mind that observed_difference can be negative
print('N_cont = %i'%control.dropna().Clicks.sum())
print('X_cont = %i'%control.dropna().Enrollments.sum())
print('N_exp = %i'%experiment.dropna().Clicks.sum())
print('X_exp = %i'%experiment.dropna().Enrollments.sum())
N_cont = 17293
X_cont = 3785
N_exp = 17260
X_exp = 3423
#%%R
N_cont = 17293
X_cont = 3785
N_exp = 17260
X_exp = 3423
observed_diff = X_exp/N_exp - X_cont/N_cont
# print(observed_diff)
p_pool = (X_cont+X_exp)/(N_cont+N_exp)
SE = sqrt( (p_pool*(1-p_pool)) * ((1/N_cont) + (1/N_exp)))
ME = 1.96 * SE
# print(p_pool)
c(observed_diff-ME, observed_diff+ME)
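The confidence interval does not include zero, so the difference is statistically significant. The observed difference is also larger in magnitude than dmin = 0.01, the minimum detectable effect, so it is practically significant as well. Judged on this metric alone, the change is worth launching.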
print('N_cont = %i'%control.dropna().Clicks.sum())
print('X_cont = %i'%control.dropna().Payments.sum())
print('N_exp = %i'%experiment.dropna().Clicks.sum())
print('X_exp = %i'%experiment.dropna().Payments.sum())
N_cont = 17293
X_cont = 2033
N_exp = 17260
X_exp = 1945
#%%R
N_cont = 17293
X_cont = 2033
N_exp = 17260
X_exp = 1945
observed_diff = X_exp/N_exp - X_cont/N_cont
# print(observed_diff)
p_pool = (X_cont+X_exp)/(N_cont+N_exp)
SE = sqrt( (p_pool*(1-p_pool)) * ((1/N_cont) + (1/N_exp)))
ME = 1.96 * SE
# print(p_pool)
c(observed_diff-ME, observed_diff+ME)
The confidence interval includes zero, so the difference is not statistically significant, and therefore not practically significant either. On this metric the result is inconclusive.
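The two effect-size calculations share the same pooled-standard-error formula, so they can be sketched as a single Python function applied to the counts printed above. The function name `pooled_diff_ci` is an illustrative choice; the numbers are the N and X values from the two cells.

```python
import math

def pooled_diff_ci(x_cont, n_cont, x_exp, n_exp, z=1.96):
    """Return (difference, lower, upper) for experiment minus control."""
    diff = x_exp / n_exp - x_cont / n_cont
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    margin = z * se
    return diff, diff - margin, diff + margin

gross = pooled_diff_ci(3785, 17293, 3423, 17260)   # enrollments over clicks
net = pooled_diff_ci(2033, 17293, 1945, 17260)     # payments over clicks

for name, (diff, lower, upper) in (("Gross Conversion", gross), ("Net Conversion", net)):
    significant = not (lower <= 0 <= upper)   # interval excludes zero
    print(f"{name}: diff={diff:.4f}, CI=({lower:.4f}, {upper:.4f}), significant={significant}")
```

A difference is statistically significant when the interval excludes zero; practical significance additionally compares the interval against the chosen dmin.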
The sign test is another test, and its result should agree with the effect size test. I'm using an online calculator to compute the binomial p-value for the number of days on which the metric is higher in the experiment group than in the control group. In other words: if there were no real difference, what are the odds of seeing a split this lopsided? If that probability is small enough that the result is unlikely to be due to chance, the difference is significant at the chosen significance level, which I set to 5%.
I'm using a helper function to compare the probabilities day by day and check whether the metric in question is smaller for the control group than for the experiment group.
compare_prob = lambda col: ((control.dropna()[col] / control.dropna().Clicks) < (experiment.dropna()[col]/experiment.dropna().Clicks))
Counting the days for Gross Conversion, I got
False    19
True      4
dtype: int64

and for Net Conversion,

False    13
True     10
dtype: int64
I got p-value of 0.6776 for Net Conversion.
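The online calculator can be replaced by a few lines of standard-library Python. This is a minimal sketch of a two-sided binomial sign test under the null hypothesis p = 0.5; the function name `sign_test_p_value` is hypothetical, and the day counts (4 of 23 and 10 of 23) are taken from the tallies above.

```python
from math import comb

def sign_test_p_value(successes, trials):
    """Two-sided binomial p-value under the null hypothesis p = 0.5."""
    def cdf(k):
        # P(X <= k) for X ~ Binomial(trials, 0.5); empty sum when k < 0
        return sum(comb(trials, i) for i in range(k + 1)) / 2 ** trials
    lower_tail = cdf(successes)              # P(X <= successes)
    upper_tail = 1 - cdf(successes - 1)      # P(X >= successes)
    return min(1.0, 2 * min(lower_tail, upper_tail))

p_gross = sign_test_p_value(4, 23)    # days with higher Gross Conversion in experiment
p_net = sign_test_p_value(10, 23)     # days with higher Net Conversion in experiment
print(round(p_gross, 4), round(p_net, 4))
```

Doubling the smaller tail is one common convention for a two-sided p-value; because the null distribution is symmetric at p = 0.5, other conventions give the same result here.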
I would not use the Bonferroni correction in this case. The Bonferroni correction is suited to decisions where a significant result on any single metric would be enough to act. That is not what we do in our experiment: we need both metrics to meet their criteria, with Gross Conversion significant and Net Conversion not significant, so the correction would be too conservative.
Gross Conversion is good because its change is beyond Udacity's practical significance boundary. This means the screener reduces the number of enrollments from students who are not committed (in time or cost). However, even though the change in Net Conversion is not statistically significant, its confidence interval reaches the practical significance boundary, which is not what Udacity wants: Udacity could lose money if the experiment is launched. So my recommendation is to run a further experiment, or to cancel the launch if Udacity doesn't have the time.
Net Conversion: marginal pass, but could lose potential revenue
Decision: risky. Delay for a further experiment or cancel the launch.
So what does it take to reduce the number of not-so-serious users without losing potential revenue? On every course overview page, Udacity already gives information about the hours a course requires, so warning students again about the time commitment might be unnecessary. What we could do instead is give them an incentive after they enroll. In this experiment, after students enroll, they are shown an offer on the right side of the video material page: free tuition until they graduate. The deal is that they have to become a Udacity Code Reviewer. Udacity already has this program, which pays graduates a reasonable hourly rate for reviewing students' code. If they agree, they can click the "Start debt program" button below the information panel.
They will be able to continue past the 14-day boundary and finish the course. In return, they have to work as a Code Reviewer and pay off the debt through payroll: they won't receive any salary until the debt is paid off.
Yes, this seems risky for Udacity. But if users break the agreement, for example by not becoming a Code Reviewer within two months, they will be charged automatically through their registered credit card. They will also be charged automatically if they cancel the program. So it's safe to assume that the risk of users running off is handled, but this is not part of the experiment.
The hypothesis is that once students are given an incentive, they become more serious and committed to completing the course. With this incentive, the number of users who cancel early in the course should also drop significantly, lifting them toward the level of students who were already committed.
The unit of diversion is a user-id. As with the free trial, the same user-id can't join the debt program twice. A user-id works across platforms and represents a user better than a cookie does. User-ids that don't enroll in the program are not tracked in the experiment. User-ids that are in the debt program but cancel at the end of the free trial are also not tracked.
We can use the following invariant metrics for the follow-up experiment:
And the evaluation metric:
We use user-ids as the unit of diversion and expect all of the evaluation metrics to be practically significant.