Thursday, September 18, 2014

#IceBucketChallenge - A Viral Campaign

I know this blog is a little late to the game. In fact, I started writing it while it was still relevant... but then I got distracted. Such is life.


Background

You probably already know (or vaguely remember) what the Ice Bucket Challenge is. Basically if you're nominated you have to dump a bucket of ice water over your head, or else you have to donate money to some charity - most commonly ALSA, the Amyotrophic Lateral Sclerosis Association.

You then nominate 3 more people, who have 24 hours to do the same thing. In some variants, you also make a small donation even if you dump the ice water over your head, or else you make a bigger donation if you don't (some specify $100).

Anyway, what I was interested in is how the challenge spread - this was a 'viral' campaign in a very true sense. So why not try to model how the campaign spread as we would model the spread of a virus?


The Viral Model

I've written about this sort of thing several times before. The basic idea is this - the population is divided into three groups:

Susceptible (S) - The population that hasn't been exposed to a disease/virus, but is susceptible to infection
Infectious (I) - The people who have been exposed, and can infect other people
Recovered (R) - The people who have been infected, but have recovered. It's generally assumed that these people can't be re-infected. (But there are variations).

For the ice bucket challenge, we can draw a direct analogy:

S - Those who haven't been nominated
I - Those who have been nominated and are taking the challenge (so can nominate other people)
R - Those who have completed or those who declined the challenge (and can't nominate anyone else)

We can draw the model as a flowchart, showing how people move between the different groups,

Here, we've divided nominees into those who accept the challenge (S->I) and those who don't accept (S->R). So, now we can describe the model as a system of differential equations,
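
Reconstructed from the groups and flows described above (so the exact form is an assumption, in particular that both S->I and S->R scale with S*I, and that only I->R carries the factor γ), the system would read something like:

```latex
\begin{aligned}
\frac{dS}{dt} &= -(\alpha + \beta)\, S I \\
\frac{dI}{dt} &= \alpha\, S I - \gamma\, I \\
\frac{dR}{dt} &= \beta\, S I + \gamma\, I
\end{aligned}
```

The three right-hand sides sum to zero, so the total S + I + R stays constant - nobody enters or leaves the population.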


Where the factors (α, β, γ) describe what proportions of each group move to where.

The most interesting part is the 'S*I' terms in the first two equations - this basically says that the number of newly infected people is proportional to the number of susceptible people AND the number of infectious people.

This makes the system self-limiting, meaning that the number of infected people can't just grow to infinity - since that's not what we observe. Instead, what we see is an increase to some maximum, followed by a steady drop-off. For example,

We don't have hard data on how many people did the challenge over time, but we can get an idea of what happened by looking at the YouTube search numbers for the phrase 'ice bucket challenge' (above, via Google Trends).

I mean, it seems reasonable to assume some loose correlation between the number of people doing the challenge, the number of videos of people doing the challenge, and the number of searches for those videos. Searches peaked on August 21st, in case you were wondering.


A More Discrete Model

The downside to this 'S*I' term is that it makes the equations non-linear, and in this case there's no simple closed-form solution - that is, you can't come up with an 'exact' equation for the number of people doing the challenge on a given day, for example.

But we can do a numerical simulation, instead.

In fact, this is a more reasonable way of looking at the system, since we're interested in a discrete time step of one day - i.e. the 24 hours nominees have to complete the challenge. For the simulation, we're also going to round the numbers of people in each group (after each time step) to whole numbers, since you can't (or shouldn't) divide a person into fractions.

Now, we need to define some parameters. First of all we need to define our start populations. We'll call the initial susceptible population S0 - this could be, for example, the Earth's population (which was around 7.16bn when I ran my simulations). We'll assume that one person is infected as a starting point - the originator of the challenge (I0 = 1). And we'll assume that no-one else has done the challenge at the start, so R0 = 0.

For the constants (α,β,γ), the easiest one to define is γ - the 'recovery' rate. We assume that after the 24 hour challenge period nominees are no longer 'infectious', therefore γ = 1.

For α and β, we have the total rate of infection/nomination defined as (α+β). We're given that each challenge completer gets to nominate three new people, so we can define (α+β) such that in the first step (when there's only one infectious person) we have (α+β)*S0*I0 = (α+β)*S0 = 3. Therefore (α+β) = 3/S0.

Now, α and β are related to the proportions of nominees that accept and decline the challenge, respectively. So we can redefine the constants as α = a/S0 and β = b/S0, such that (a+b) = 3, or alternatively α = a/S0 and β = (3-a)/S0. So now we can look at 'a' as the average number of nominees who accept the challenge.

So to tie it all together, we have the system of (difference) equations,
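
Substituting γ = 1, α = a/S0 and β = (3-a)/S0 into a one-day step gives the following. This is a reconstruction from the definitions above, with the rounding written out explicitly; the exact placement of the rounding is an assumption.

```latex
\begin{aligned}
S_{n+1} &= \operatorname{round}\!\left( S_n - \frac{3}{S_0}\, S_n I_n \right) \\
I_{n+1} &= \operatorname{round}\!\left( \frac{a}{S_0}\, S_n I_n \right) \\
R_{n+1} &= \operatorname{round}\!\left( R_n + \frac{3 - a}{S_0}\, S_n I_n + I_n \right)
\end{aligned}
```

Because γ = 1, yesterday's infectious group moves wholesale into R, which is why I(n) itself doesn't appear on the right-hand side of the second equation.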

And it's pretty straightforward to write some code that'll run through those equations.
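
As a rough idea of what that code could look like, here's a minimal Python sketch. It isn't the original script: the function name `simulate`, the `max_days` safety limit, and the cap that stops more people being nominated than are left in S are all assumptions added for illustration. It also relies on Python's built-in `round`, which rounds exact halves to the nearest even number, so results right at a threshold may differ slightly from other rounding conventions.

```python
def simulate(a, s0=7_160_000_000, nominations=3, max_days=10_000):
    """Step the rounded S/I/R difference equations one day at a time.

    a           -- average number of nominees (out of `nominations`) who accept
    s0          -- initial susceptible population
    max_days    -- safety limit on the number of steps (an added assumption)
    Returns a list of (S, I, R) tuples, one per day, starting with day 0.
    """
    s, i, r = s0, 1, 0          # one originator, nobody recovered yet
    history = [(s, i, r)]
    for _ in range(max_days):
        if i == 0:              # nobody left to nominate: the challenge is over
            break
        # People nominated today; capped at S so S can't go negative
        # (the cap is an assumption, not part of the equations above).
        nominated = min(nominations * s * i / s0, s)
        accepted = nominated * a / nominations
        s, i, r = (
            round(s - nominated),
            round(accepted),
            round(r + (nominated - accepted) + i),  # decliners + yesterday's I
        )
        history.append((s, i, r))
    return history


if __name__ == "__main__":
    run = simulate(2.5)
    print("days until the challenge ends:", len(run) - 1)
    print("peak number doing the challenge:", max(i for _, i, _ in run))
```

Since each group is rounded separately, S + I + R can drift by a person or two from the starting total; that's a side effect of the rounding rather than of the model.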


So we've got our model, what now?

Having a model is all well and good, but why bother? Well, now we can start asking questions. For example, how fast does the campaign spread? How long will it take for the challenge to die out, and how many people will have taken part by that point? And what happens when we change the number of people who accept the challenge?

First of all, if we run the simulation (with a = 2.5) and plot I(n) - the number of people doing the challenge on any given day - we get something like this


[where I(n) has been normalised so that the maximum is 100, as Google Trends does].
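
The normalisation is just a rescaling so that the busiest day reads 100. A small sketch, where `normalise` is a hypothetical helper name and the input is any list of daily I(n) values:

```python
def normalise(series, peak=100):
    """Rescale a list of daily counts so that its maximum equals `peak`."""
    top = max(series)
    if top == 0:
        # Nothing to scale; return zeros rather than dividing by zero.
        return [0 for _ in series]
    return [peak * x / top for x in series]


if __name__ == "__main__":
    print(normalise([1, 3, 8, 20, 12, 4, 1, 0]))
```

This only changes the vertical scale, so the shape of the curve (and the day of the peak) is unaffected.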

This is about what we expected - a rise and a fall. Though it's worth noting the shape is a little different from the YouTube searches above; the simulation drops off quickly, while the search numbers have a longer tail. This could be because, even after the challenges are done, there's a latent interest in (re)watching the videos. Or it could just be that this model is not entirely accurate...


So what happens when we vary 'a'?

We can start by assuming everyone accepts the challenge (a = 3). In this case, eventually everyone in the world does the challenge - specifically within 21 days of the first challenge. But that isn't very realistic.

So let's say that on average 1.5 or 2 of those challenged accept. In these cases, the ice bucket phenomenon ends before everyone can be challenged. For a = 2, the challenge ends after 50 days, with ~13% of the population going unchallenged. On the other hand, for a = 1.5 the challenge takes significantly longer to end (over a year).

In fact it turns out that for any a < 1.6030165.. the challenge will take a significant amount of time to end - the number of 'infectious' people will eventually reach 1, and stay there until S <= S0/(2a) (remember we're rounding each group to the nearest whole number). For a <= 0.5 the challenge ends almost immediately.
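
To see where that S0/(2a) threshold comes from, assume the update for the infectious group is I(n+1) = round(a*S(n)*I(n)/S0), as follows from γ = 1 and α = a/S0. With a single infectious person left:

```latex
I_n = 1
\;\Rightarrow\;
I_{n+1} = \operatorname{round}\!\left( \frac{a\, S_n}{S_0} \right),
\qquad
I_{n+1} = 1 \;\text{ while }\; \tfrac{1}{2} < \frac{a\, S_n}{S_0} < \tfrac{3}{2},
\qquad
I_{n+1} = 0 \;\text{ once }\; S_n < \frac{S_0}{2a}.
```

What happens at exactly S = S0/(2a) depends on how halves are rounded. And since S can never exceed S0, for a < 0.5 the value a*S/S0 is below one half from the very first step - which is why the challenge dies almost immediately for small 'a'.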

Now, we know that for a = 3, the population S goes to zero (everyone is challenged), whereas for a = 2 the challenge stops before S can go to zero. So we have the question - what is the smallest value of 'a' for which S goes to zero? If you do a bit of interpolating, you can figure out that this critical value comes out at around 2.7405349.. At this value of 'a', the challenge ends after 23 days. In other words, if on average 2.74.. (~91%) of the people nominated accept the challenge, then after 23 days everyone in the world will have been challenged.
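
One way to do that interpolating is a bisection search on 'a'. The sketch below is an illustration, not the original method: `final_susceptible` and `critical_a` are hypothetical names, the simulation includes an assumed cap so that no more people are nominated than remain in S, and the search assumes the outcome flips only once between `lo` and `hi` (which the rounding doesn't strictly guarantee). Because of those choices, the value it finds needn't match the figure quoted above to every decimal place.

```python
def final_susceptible(a, s0, nominations=3, max_days=10_000):
    """Run the rounded model and return S when the challenge ends.

    If the challenge is still going after `max_days` steps (an assumed
    cut-off), the current value of S is returned instead.
    """
    s, i = s0, 1
    for _ in range(max_days):
        if i == 0:
            break
        nominated = min(nominations * s * i / s0, s)  # assumed cap at S
        s, i = round(s - nominated), round(nominated * a / nominations)
    return s


def critical_a(s0, lo=0.0, hi=3.0, tol=1e-7):
    """Bisect for the smallest 'a' (to within `tol`) at which S reaches zero.

    Assumes S ends at zero for `hi` but not for `lo`.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if final_susceptible(mid, s0) == 0:
            hi = mid   # everyone gets challenged: critical value is at or below mid
        else:
            lo = mid   # challenge fizzles out first: critical value is above mid
    return hi


if __name__ == "__main__":
    print(critical_a(7_160_000_000))
```

Re-running `critical_a` with different values of `s0` is a quick way to see how the critical value shifts with the initial population.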

From what I've seen myself, it seems like nearer 1 in 3 people accept the challenge on average. So, as viral as the campaign was, it was never going to take over the world.

If you play around a bit, you'll find that the critical values of 'a' are dependent on the initial population (S0). But, as far as I can tell, it's not possible to derive these critical values analytically. (Answers in the comments if you can prove otherwise).


Why wait to be nominated..?

At this point, you're probably thinking this isn't a very realistic model. And you'd be right. So let's make it a bit more complicated.

In particular, let's add spontaneous participation - that is, people who aren't directly nominated, but who see all the other people doing the challenge, and decide they want to take part too.

We'll assume that this participation is proportional to those who have already done the challenge (I and R). So to start with, we need to separate the 'recovered' group into those who actually did the challenge (R), and those who declined (D).

The new model looks something like this,



With difference equations,

In the flowchart, we've introduced this new factor 'δ', the 'inspiration' rate. If we re-define it, like we did for the infection rate, as d/S0, then 'd' can be loosely interpreted as the average number of people inspired to take part by each person who's already done the challenge.

Now we can look at how this 'd' factor affects the things we looked at before - how long it takes for the challenge to end, etc.

Let's start by assuming that the average number of nominees accepting the challenge, 'a', is 1 (out of 3). What value of 'd' do we need for S to go to zero? Do a little interpolating again and you get d = 0.8668653 - that is, if each person who pours a bucket of water over their head inspires (on average) 0.867.. people to do the same, then everyone in the world will participate, within 26 days of the challenge starting. For a = 2, we need d = 0.4046021 for S to go to 0. And so on...

What's a plausible inspiration rate? For a normal person, probably zero, while for a celebrity... I don't know. But on average 'd' is probably very close to zero. I mean, we know for a fact that significantly less than the entire population of Earth has been nominated/taken part in the challenge.

If you plot I(n) for a = 1 and d = 0.1 you get something like this,


In this case, you have that same rise and fall - but this time, you have a longer tail, like we see in the YouTube searches. Is this proof that this iteration of the model is more accurate? Maybe. But as I pointed out before, the YouTube searches don't necessarily accurately represent the number of people taking the challenge over time.


Social Pressure

When a friend does a thing for charity, then publicly calls you out to take part, there's a certain amount of social pressure to comply. I mean, if a friend dumped water over their head (just because), then asked you to do the same thing, probably you'd look at them like they were a crazy person. Anyway, that kind of social pressure is implicitly included in the 'infection' rate - more social pressure, bigger 'a'.

Instead, the sort of social pressure I'm talking about here is the kind that goes: "I should accept the challenge because so many other people have already done it". Or alternatively, "it's okay for me to accept the challenge, since so many other people have already done it".

In other words, the more people accept and complete the challenge, the more likely a nominated person is to accept too. Mathematically, we can introduce this with a term in 'S*I*R'. Or alternatively, we can keep the term as 'a*S*I', but make the factor 'a' a function of R -> a(R).

Anyway, if you're interested, you can try investigating that yourself. Or try adapting the model in some other ways. But beyond a point you can end up complicating a model more than improving it.


A Network Theory Approach to Nominating

So I was eventually nominated for the challenge. But I'm a wimp, so I declined to dump ice water over my head, instead making a donation to the Motor Neurone Disease Association (the UK equivalent of ALSA).

For my nominations, I wanted to try and maximise spread. So I nominated the 3 people in my Twitter network who are the most active and well connected, and who I thought would be up for accepting the challenge. Plus, as a secondary effect, I figured they might nominate other people in my Twitter network - the network theory equivalent of wishing for more wishes.

In the end, one ignored the nomination, one acknowledged but didn't accept, and one accepted (in the form of a donation) but didn't nominate anyone else. So I guess that theory didn't quite pan out. But I did at least encourage more charitable giving.


So Yeah

If we had real world data we could maybe test the accuracy of these models. But even without, we can get a sense of how the challenge behaves - for example, we find that there's a critical 'challenge acceptance ratio' that determines whether the viral campaign will go 'pandemic'.

In theory, you could apply this sort of model to any viral campaign, or just anything that spreads 'virally'. The nice thing about the Ice Bucket Challenge in particular, though, is that it has well defined rules for how the challenge spreads from one person to the next.

So, yeah..


Oatzy.


[I need an editor, my pronouns are all over the place.]