Wednesday, June 29, 2011

Tumbling, Appendix I: Making Connections

Where Things Get Interesting

Recall in Part 3 (under 'Notes and Improvements'), I mentioned the possibility of creating graphs of how a post spreads - ie who reblogged who.

As it turns out it was possible. And in fact, not that difficult.

You can find the modified code here.

In addition to the outputs the previous iteration generated, this updated version also creates a 'graph.csv' file, which contains all the links between reblogging users. This file can then be imported into, for example, Gephi, to create a visualisation of the graph.


Modified Model

The updated model works something like this:
We have the set, F, of people who reblogged in the previous generation. For each of these users we iterate over their followers, with the reaction code as before.

But we now have the addition that, if the follower reblogs, a new property - user.reblogged - is set to the person they reblogged from. That (reblogging) user is then added to the temporary group f_i to be used in the next iteration.

The simulation ends when f_i is empty - ie, no-one reblogs.
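
For a rough idea of what the change looks like, here's a minimal sketch - the class and function names are mine, not those of the actual script. The only real addition over the Part 3 version is remembering who each reblogger got the post from, and writing the edges out to 'graph.csv':

```python
import csv
import random

class User:
    """Placeholder user object (field names are mine)."""
    def __init__(self, name, followers):
        self.name = name
        self.followers = followers   # number of followers
        self.exposed = 0
        self.reblogged = None        # who this user reblogged from

def react(user):
    """Stand-in for the reaction code from Parts 2 and 3."""
    return random.choice(['ignore', 'like', 'reblog'])

def spread_generation(current, pool, edges):
    """For each reblogger in the current generation (F), expose their
    followers and record a (source, follower) edge whenever someone reblogs."""
    next_gen = []
    for source in current:
        for follower in random.sample(pool, min(source.followers, len(pool))):
            if follower.exposed:
                continue
            follower.exposed = 1
            if react(follower) == 'reblog':
                follower.reblogged = source.name           # the new property
                edges.append((source.name, follower.name))
                next_gen.append(follower)                  # f_i for the next generation
    return next_gen

def write_graph(edges, path='graph.csv'):
    with open(path, 'w', newline='') as f:
        csv.writer(f).writerows([('source', 'target')] + edges)
```

The resulting file is just a two-column edge list, which Gephi will happily import.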


Pretty as a Picture

So, keeping the initial conditions the same as for Part 3, here is an example output
And here is the accompanying 'spread over eccentricity' graph.
Compare with the visualisations in Part 1.

I have to say, I am extremely pleased with this. I mean, just look at it! It's amazing.

Here's another example
And finally, here's a graph for a population size, N = 20,000 & f0 = 1,000
More reblogs overall, and lots more clustering in this case. Pretty awesome, right?


Little Niggles

The only things that really bother me now:

1) The Like count seems a little high
This is probably a result of the distribution/variables chosen in creating approval thresholds.

2) There's only a small number of reblogs coming from OP.
This relates to the 'reblogs from tags' thing discussed in Part 3.

3) Small population
At some point, when I've got the time to run the simulation, I'd like to try it with larger populations/larger f0

But none of these things bother me too much, and I'm pleased with what I've got.


So that's basically that.


Oatzy.

Tuesday, June 28, 2011

Tumbling, Part Three: Stochastic Spread

No SIR

So having written the previous posts, I was wondering if this spread could be modeled. We could use the SIR equations, but this runs into the problems I mentioned before - i.e. it assumes regular infection rate, approximately uniform population distribution, etc - and as the graphs showed, this is not the case with Tumblr (or Twitter, for that matter).

There are, in fact, modifications to the SIR model to make it work with irregular population distributions. But I'm not entirely familiar with these, so I'm kinda going my own way with it and hoping for the best.

I should point out that, while SIR models are typically time linked, in this case we're working in 'spread over eccentricity' (as discussed in Part 1). The effect of this is that the model won't reflect real-time spread, but rather how wide and far from the source it goes.


Scattered Clusters

The majority of users will have roughly 100-200 followers. The number of 'leaders' then decreases exponentially as follower counts get bigger - with very few users having follower numbers in the thousands. Or so I'm choosing to assume; I have no data to support this.

So for this model, the users' follower counts are generated randomly using a log-normal distribution. This seems like the most plausible fit for the real world.


For the implementation of the model, all users are generated prior to the actual spread simulation - creating a 'user pool' of the entire population (size N). Users are then randomly selected from this pool as and when needed.

Pre-generating the users creates consistency, and makes it easier to keep track of who has already seen the post - assigning a new property to each user, user.exposed, which equals 1 if they have seen the post, and 0 otherwise.
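
As a sketch of how that pool might be built - the field names are my own, and I'm guessing at how the [4, 11] threshold range below combines with the threshold scheme from Part 2:

```python
import random

def make_user_pool(N, mu=5, sigma=2):
    """Pre-generate the whole population of size N. Follower counts come from
    a log-normal distribution; each user also gets like/reblog thresholds
    and an 'exposed' flag. (A sketch, not the actual script.)"""
    pool = []
    for _ in range(N):
        like_threshold = random.randint(4, 11)
        reblog_threshold = random.randint(like_threshold, 11)
        pool.append({
            'followers': int(random.lognormvariate(mu, sigma)),
            'like': like_threshold,
            'reblog': reblog_threshold,
            'exposed': 0,          # set to 1 once the user has seen the post
        })
    return pool
```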


Model Mechanics

You can download the code for the completed model here (Python)


So the basics of the model work like this:
Starting with the original poster (OP), who has F = f0 followers:

a) Select F users at random from the 'user pool'

Then, for each user:

i) Has that user already seen the post?

If no,

ii) How does the user react to the post?

Ignore, like, reblog? Increment counters accordingly.

This bit of code is essentially the same as that discussed in Part 2 - using the 'per-person' approach. We also introduce a new counter, eReblog, which keeps track of the (cumulative) number of reblogs at each level of eccentricity.

iii) If a user reblogs, add their follower count to the 'follower counter' (fi)

This is something of a corner-cutting method - the follower counts for all reblogging users are mixed into the same counter. Since some of the follower groups might overlap, users are selected from the pool with replacement to compensate.

This method is perfectly fine, so long as we don't want to create graphs of who reblogs who. And for simplicity, we don't.

b) If fi = 0, end simulation. Else, go back to (a) with F = fi

Termination condition - obviously, if no-one reblogs the post, there's no need to go any further.

That's basically it.
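
Pieced together, one run of the model might look something like this - a sketch built on the user-pool dictionaries from above, with my own variable names, rather than the actual downloadable script:

```python
import random

def simulate(pool, f0, quality_mu=3, quality_sd=2):
    """One run of the spread model: steps (a)-(b) above, sketched."""
    likes, reblogs = 0, 0
    e_reblog = []                    # cumulative reblogs at each eccentricity
    F = f0                           # OP's follower count
    while F > 0:
        f_i = 0                      # follower counter for the next generation
        for user in random.choices(pool, k=F):           # selected with replacement
            if user['exposed']:                           # (i) already seen it?
                continue
            user['exposed'] = 1
            rating = min(10, max(0, random.gauss(quality_mu, quality_sd)))  # (ii) reaction
            if rating > user['like']:
                likes += 1
            if rating > user['reblog']:
                reblogs += 1
                f_i += user['followers']                  # (iii) pool their followers
        e_reblog.append(reblogs)
        F = f_i                                           # (b) next generation, or stop
    return likes, reblogs, e_reblog
```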


Initial Conditions

Population N = 3,000; f0 = 150; Threshold Range: [4, 11]; Post Quality: Normal(3, 2); Followers: Log-Normal(5, 2)

Obviously, the population of Tumblr is much larger than 3,000. But this code starts to struggle when you get into the 10,000s, so if we're too impatient to wait much more than a minute for results, we have to make concessions. Not that that should have a significant effect, other than the population being exhausted more quickly.

The post quality has been set low, and the thresholds high, so that the spread doesn't instantly explode and burn out. This makes the results more interesting to observe.


Reassuring Results

Below is a graph of seven example outputs of the model - ALL with the same initial conditions.
Note the wide variation in behaviour. This is really the sort of thing I wanted to see.

In particular, I like 'Row 8' (dark purple) - the way it stays fairly low, then has a sudden burst at generation 4 - similar to what happened with the Dr Who image in Part 1 (albeit with only one burst).

We also have, in 'Row 2' (red), a spread that starts to die out immediately, as with the HIMYM image (also mentioned previously).

One result not graphed, because it can't be, was an example outcome where the post didn't spread at all (ie 0 reblogs) - which is not an uncommon occurrence in the real world.

While we're at it, here is a graph of the average of all the lines in the previous image
I would guess that, if you averaged many more sample outputs, the curve would even out into a single hump.

For comparison, here is what the graph looks like when the post quality is higher and the approval thresholds cover a wider range
Basically, a lot of reblogs over a very short space of time. Which, as mentioned above, was why I set the initial conditions lower.


Shape-wise, these graphs don't seem to differ significantly from those you'd expect to see from a SIR model. Not identical - there's certainly much more variance from this model - but it's interesting to see all the same.

Finally, if OP is 'Tumblr Famous' - that is, has 1,000 or more followers - this model would seem to suggest that their larger follower count has little effect on how far a post (subject to our initial conditions) would spread. This seems strange.

That being said, if we're using a population size of 3,000, then an OP starting with 1,000 followers means that a third of the population is exposed and used up right away. So further investigation is probably needed on that one.


Notes and Improvements

So this model works, and is pretty good. But it's by no means perfect.

1) The Problem With Approval

The behaviour of super-nodes may vary from that of normal users - e.g. they may be more exclusive, more selective, etc. - but for simplicity this is ignored. That said, if true, it wouldn't be too hard to make approval thresholds dependent on, for example, follower count.

2) Reblogs From Tags

Some people discover posts via their tags - or, for that matter, through search engines - rather than from the people they follow. If we were desperate to incorporate this into the model, it could be as straightforward as picking extra (unexposed) users when choosing from the user pool.

3) Didn't See It

Some people just don't see the post, or don't see it until much later and from someone else. But we can assume that the former are indistinguishable from those who saw the post and ignored it, and we can ignore the latter.

4) Who Reblogged Who

We'd have to replace the part of the code where the reblogs are grouped into a single counter with code that has some sort of awareness of who each user reblogged - see Appendix I.

And while we're at it, we could also build up a graph of how all these users are connected - ie, a who-follows-who graph. But that's not necessary.

5) Fine-Tuning

For this model I chose the random distributions and variables arbitrarily, but based on educated guessing. Obviously, this is not ideal. For the best model, we'd need to tweak these to closely match real-world data. Sadly, I don't know where this data would come from.


What Use Is This Anyway?

The thing is, the advantage of the SIR models is that they can be solved exactly when looking for stationary points, or what initial conditions would lead to an epidemic, etc.

This is not the case with our stochastic model. Instead, we're left having to just play with the variables and observe the resulting behaviour. We could also modify the code to allow predefined variables to remove some of the randomness.

One other way the model could be used is, for a fixed set of initial conditions, repeat the simulation a significantly large number of times and look for the most likely outcome, as per the Monte Carlo method.
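
That might be as simple as wrapping the simulation sketch from above in a loop and tallying the outcomes - again using my placeholder functions, not the real script:

```python
from collections import Counter

def monte_carlo(runs, make_pool, f0):
    """Repeat the spread simulation with the same initial conditions and
    tally how often each total reblog count turns up."""
    outcomes = Counter()
    for _ in range(runs):
        pool = make_pool()                     # fresh, unexposed population each run
        _, total_reblogs, _ = simulate(pool, f0)
        outcomes[total_reblogs] += 1
    return outcomes

# e.g. monte_carlo(1000, lambda: make_user_pool(3000), f0=150)
```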

And as mentioned above, a similar model can be used to model the spread of information through any network - including Twitter. It's just a matter of tweaking variables.

Anyway, the model serves its purpose as is, and I'm pleased with it. So yeah.


Oatzy

Sunday, June 26, 2011

Tumbling, Part Two: Over the Threshold

I Didn't Vote For This

A short while back, while people still cared about AV versus FPTP voting, I wanted to come up with a voter simulation, so as to get data to compare the merits of the two (and a few other) systems, since real-world data is limited. Sadly, it was to no avail, since I had trouble coming up with a plausible way to simulate the behaviour of British voters.

The basis of how it worked was this - each voter has ten 'vote points'. These points are divided amongst the parties according to how much the voter likes each party.

So if a voter was a strong Conservative supporter, they might give all their points to Conservative. A voter who's more liberal might give 6 points to the Lib Dems and 4 to Labour. Or whatever.
In the case of the simulation, these points were going to be assigned randomly, but according to some Markov Chain type model - based on the typical voting habits of Brits. This, by the way, is where I was having trouble - for large numbers of voters, the results were always the same.

The other part of the model was to (randomly) assign to each voter an 'approval threshold' - that is, if a voter assigns to a party a number of points higher than their personal approval threshold, then they 'approve' of that party, and will place a vote accordingly.
So if, in the liberal example described above (voter B in the diagram), the voter has a threshold of 5, they would place a vote for only the Lib Dems; if, on the other hand, they had a threshold of 3, then (under an AV voting system) they would place a vote for both - order of preference: 1. Lib Dem, 2. Labour.

(In the diagram, all three 'voters' are given the same approval threshold for simplicity. The thresholds are by no means constant and universal.)
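
The threshold logic itself is simple enough to sketch in a few lines (hypothetical names - this isn't taken from the abandoned sim):

```python
def approved_parties(vote_points, threshold):
    """Return the parties a voter 'approves' of - those given more points
    than the voter's personal approval threshold - in order of preference."""
    ranked = sorted(vote_points.items(), key=lambda kv: kv[1], reverse=True)
    return [party for party, points in ranked if points > threshold]

# Voter B from the example: 6 points to the Lib Dems, 4 to Labour
voter_b = {'Lib Dem': 6, 'Labour': 4}
print(approved_parties(voter_b, threshold=5))   # ['Lib Dem']
print(approved_parties(voter_b, threshold=3))   # ['Lib Dem', 'Labour']
```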


I Like It, But Not that Much

So what does this have to do with Tumblr? Well. On Tumblr, you can either ignore a post, like it, or reblog it.

Now, there are some people who will reblog almost anything that catches their fancy. Then there are people, like me, who are very picky and only reblog things they really like.

This is basically analogous to the approval threshold in the voting sim.

Obviously, the quality of a post is subjective (to some extent). But imagine first that for every post you see, you assign to it some rating from 0 to 10. The rating you assign to a given post will be (relatively) unique to you.

Here is how we get around the problem of subjectivity:

We assume that the ratings a large sample of people will assign to a given post will have some regular distribution - most likely a normal distribution - where the average and spread of ratings reflects (to some extent) the quality of that post. Or, at least according to that particular sample of people.

Each user then has two threshold values - a 'like threshold' and a 'reblog threshold'. And as before, if the rating a person assigns to a given post exceeds one of these thresholds, that person will either like or reblog the post (possibly both) accordingly.


Some Asides

It's worth noting that similar modelling can be applied to any piece of information that can be spread - whether it be on Twitter, by email, whatever - where you have some 'sharing threshold' by which you judge whether or not you think a thing is worth sharing.

This may vary wildly from situation to situation. But in most cases outside of Tumblr, you will tend to find that people typically have a higher share threshold - higher standards, more pickiness, more wariness about what they share, etc.

Another thing to consider: with Twitter in particular, the originator of a post can have a massive effect on how much said post is retweeted.

For example, some people (fangirls) will retweet someone like Justin Bieber simply by virtue of the fact that it's Justin Bieber. Conversely, some people may not retweet someone like Piers Morgan because they don't like him - even if they would otherwise like something he posted enough to have retweeted it.

Finally, people don't always share things simply because they like them - they may want to make some comment on the post, they may want to mock the post, whatever.

But for simplicity, we're ignoring all these eccentricities, on the assumption that including them would have no appreciable effect on the model.


Model Affinity

I wrote a quick model of this behaviour (Python, can be viewed/downloaded here).

For this model, the users' thresholds are generated uniformly - that is, all numbers in the ranges have the same probability of being chosen:

For the 'like threshold' you pick a random number between 1 and 11.
For the 'reblog threshold' you pick a random number between the 'like threshold' and 11.

The user's assessment of a post is randomly generated based on a normal distribution - in this case with mean=5, sd=3, constrained to the interval [0,10]. These values were chosen arbitrarily.
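
As a rough reconstruction of that, the following sketch gives the idea - whether a reblog also counts as a like, and whether the thresholds are whole numbers, are guesses on my part, so the exact numbers below won't be reproduced:

```python
import random

def simulate_reactions(n=100_000, mean=5, sd=3):
    """Sketch of the threshold model described above - my reading of it,
    not the linked script. Returns (likes, reblogs)."""
    likes = reblogs = 0
    for _ in range(n):
        like_t = random.uniform(1, 11)                      # 'like threshold'
        reblog_t = random.uniform(like_t, 11)               # 'reblog threshold'
        rating = min(10, max(0, random.gauss(mean, sd)))    # constrained to [0, 10]
        if rating > like_t:
            likes += 1
        if rating > reblog_t:
            reblogs += 1
    return likes, reblogs

likes, reblogs = simulate_reactions()
print(likes, reblogs, reblogs / likes)                      # ballpark R:L ratio
```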

For a simulation of 100,000 users, this model returns approximately 25,575 likes and 14,487 reblogs. This gives a 'reblog to like' ratio of 0.56.

This is quite different from the 'R:L ratio' of the Dr Who image, discussed in the previous post - which turns out to be 1.42

But if we (effectively) increase the quality of the post - mean=10, sd=2, same interval - we now get approximately 32,393 likes and 41,872 reblogs, giving a R:L ratio of 1.29

Which reflects reality quite well - that is, the better a post is, the more reblogs it will get (and the higher its R:L ratio). You could also get a similar effect by lowering the approval thresholds of the users - effectively lowering the users' standards.

Note: you could probably get a better replication of real-world data by fine-tuning the random variables. The ones used here were chosen fairly arbitrarily.


Personal or Probable

For large numbers of people, there are two ways you could work this into a model of post spreading:

1) On a per person basis

That is, for each user work out whether they will ignore, like, or reblog, as you go. This would lead to richer behaviour on a per-simulation basis, and is better if you want a more detailed, more 'precise' model.

2) Based on the averaged probability

That is, run a simulation for a large number of people (as above), work out the average number of likes/reblogs, and use this as the 'probability of liking/reblogging', constant for all users. So from the first simulation above, the probabilities would be:

p(reblog) ~ 0.14
p(like) ~ 0.26

This would make the behaviour more uniform across multiple simulations, but would be less computationally taxing.

It can be argued that, when you average over a large number of simulations, you will get (almost) identical results for both approaches. So it's ultimately a matter of preference.
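
For what it's worth, option (2) reduces to something like this - treating 'like' and 'reblog' as mutually exclusive, which is just one possible reading of the numbers above:

```python
import random

# Fixed probabilities taken from the large simulation above
P_REBLOG, P_LIKE = 0.14, 0.26

def react_average():
    """Averaged-probability reaction: the same odds for every user."""
    r = random.random()
    if r < P_REBLOG:
        return 'reblog'
    if r < P_REBLOG + P_LIKE:
        return 'like'
    return 'ignore'
```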


In Part 3 of the series, I'll be looking at constructing a model of how posts spread through Tumblr, building on the model described here.


Oatzy.


[addendum] -  Due to a mild error, the section labeled 'Model Affinity' had to be rewritten. If you read this post before this addendum was added, it's probably worth re-reading that section.

Saturday, June 25, 2011

Tumbling, Part One: Some Background

I've talked about Tumblr before, and I've talked about using epidemiology models as an analogue for the spread of 'information' on social networks.

And having spent a fair amount of time on Tumblr recently, I have more insights, and more to say on the subject.


Resummarising

First of all, Tumblr is like Twitter in that a post can be spread from person to person by reblogging (similar to retweeting on Twitter). But more importantly, a normal person like me will tend to get far more reblogs than they could ever get retweets. And this is important because it gives you more data to work from.

The key here is that you're typically sharing pictures. And in particular, often pictures that fall into certain fandoms - Tumblr is very good at fandoms. So if you post a picture relating to Doctor Who, then it will attract the attention of Doctor Who fans, who may like it and want to reblog it.


An Analogy

Imagine there is some airborne contagious disease, BUT it only affects men. Women can't even act as carriers.

Now imagine some guy has the disease. In a normal population, you have an approximately even mix of men and women, so if the disease spreads, it will spread relatively slowly.

But this guy was just on holiday from his all boys boarding school. So when he goes back after the summer - can you see where this is going?

The fact that it's all boys, who are generally in close proximity, means that the disease is likely to spread through the boarding school like wildfire, infecting (almost) every boy.

(There could still be boys that are resistant to the disease, after all.)

Oh, and for completeness, if this infective guy were instead amongst a group of all women, then the disease wouldn't spread at all. But that should go without saying.


Birds of a Feather

If you are a massive Doctor Who fan, and if a huge chunk of what you post on Tumblr is Doctor Who related, then you are going to attract followers that are Doctor Who fans. You're likely to follow other Doctor Who fans yourself.

So it is that subscribers to a given fandom will tend to cluster together, forming (relatively) tightly knit communities. If you then post a particularly popular ('contagious') image, it will spread through the fandom like wildfire, as per the analogy.

Obviously, varying degrees of contagiousness still apply; fandom or not, a crappy picture isn't going to get much attention.

And on the other side of the coin, if you were then to post an image relating to a different fandom, which doesn't strongly overlap with Doctor Who, then the post will spread slowly, if at all.


Big Milk Thing

So I posted this image shortly after A Good Man Goes to War aired. When I last checked, it had achieved a respectable 1,431 notes -> 591 likes, and 840 reblogs.

One of the problems with Tumblr is that it doesn't time stamp reblogs. So, unless you're willing to put in the effort to keep track by hand, you can't get good 'spread over time' data.

So what I did instead was work with 'spread over eccentricity'. Here's a network graph to help illustrate
The nodes are coloured by their distance (eccentricity) from me, going from red (closest), to blue (furthest away) - I'm the reddest dot, in the middle of the circular cluster near the top.

So what we can do is count how many people are at each distance from me - equivalent to assuming that eccentricity is time linked. Which it isn't, but it's the best we have.

So here's the graph of that spread
And reassuringly, it bears some similarity to epidemiology graphs - it hits a tipping point at the third generation, causing a massive burst of reblogs, then slowly dies out; albeit with a secondary hump after generation 5.

You can actually see why the tipping points happen from the clusters in the network graph - the biggest super-node, matt-smith-, exposed the image to a massive audience of Doctor Who fans.

Similarly in generation 5 with fuckyeahdrwho - the second largest circular cluster - though the effect was smaller, arguably because a large number of people had already seen it at that point.


Bar Graph of My Favorite Pies

For contrast, here's what the network graph looks like for a How I Met Your Mother image I posted
In this case most of the spread is directly from me - as the original poster I act as a de facto super-node, as is the case with most posts. The post also doesn't spread very far this time, and in fact starts to die out straight away - as seen in the graph below.
One could argue that, if the image were found by a HIMYM fan blog, it would have seen a resurgence similar to that for the Dr Who image. But on the other hand, I've had Scrubs images reblogged by fan blogs, and still not seen them spread as much as the Dr Who image did.
In this case, the image was reblogged by fyscrubs - the bottom 'super-node' - and it still didn't spread very far.

So I think it's fair to say that HIMYM and Scrubs fans are (in general) less 'intense', and less numerous than Doctor Who fans. Or at least so on Tumblr.


[edit] - Shortly after I posted this blog, an image I posted last week - one that had garnered very little attention until now - hit a super-node. And as with the Dr Who example, it exploded with likes and reblogs. So that arguably counts against a 'spread over time' approach, if only because it's too unpredictable.


So that's that. In Part 2 and Part 3 of this series, I'll be looking at creating a mathematical model of the information spread discussed here.


Oatzy


[For the record, captioned screencaps aren't the only things I post on Tumblr.]

Friday, June 24, 2011

Fibonacci's Rainstorm

A couple years back I got it in my head that I was going to try and write a musical about time travel.

Yeah, I don't know either.

But perhaps the maddest thing about that idea is that I have next to no musical training. Not that I'm the sort to let that stop me. So instead, the plan was to try and create music using maths. Which is a little weird.

But not actually as mad as it sounds. After all, Pythagoras (yes, that Pythagoras) was the first - or so the legend goes - to study the maths of harmonics, and why certain sounds are more pleasing to the ear than others.

It's all to do with fractions. But I won't go into all that because it's not relevant to my thing - there's a good, brief explanation in this video, How to Solve a Song with Math.


Musical Integers

The basic idea was as simple as this - you assign a musical note to each integer - say, starting with 1 = Middle C, 2 = C#, etc., though there are other methods of assignment - then generate integer sequences and see what they sound like.

For a good example of this sort of thing, here's mathemusician Vi Hart on the music of pi.

After playing around with various functions, what I settled on were the Hailstone numbers, which I misremembered as being called rainstorm numbers.


Collatz Conjecture

The Collatz Conjecture (also known as the hailstone numbers) works something like this
Well, all except the last sentence fragment. The rule is: take a number; if it's even, halve it; if it's odd, triple it and add one; then repeat. The conjecture is that, regardless of what starting number you pick, the sequence will always end on the number 1 (full details on the wiki page).

The Collatz Conjecture is currently unproven, though recently someone published a proposed proof - currently under scrutiny. But that's a whole other story.

So you pick some number, generate the sequence and see what it sounds like. Here's the sequence for 12


That is - 12, 6, 3, 10, 5, 16, 8, 4, 2, 1
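
In code, generating a hailstone sequence only takes a few lines:

```python
def hailstone(n):
    """The Collatz ('hailstone') sequence starting from n: halve if even,
    triple-and-add-one if odd, until you reach 1."""
    seq = [n]
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
        seq.append(n)
    return seq

print(hailstone(12))   # [12, 6, 3, 10, 5, 16, 8, 4, 2, 1]
```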

The problem with using hailstone numbers is that they have the potential for massive jumps - for example 7, 22, 11, 34, 17, 52, etc. - and such a sequence would not sound pleasing (this goes back to Pythagoras).


Lateralis

So my second idea was to combine it with the Fibonacci Sequence - that is 1, 1, 2, 3, 5, 8, 13, etc - where each number in the sequence is the sum of the two preceding numbers.

This was actually inspired, slightly, by the Tool song Lateralis, where the syllables of the lyrics form a rising and falling Fibonacci Sequence:
Black / then / white are / all I see / in my infancy / red and yellow then came to be / reaching out to me / lets me see.
So the idea was to perform the Collatz algorithm on the numbers of the Fibonacci sequence - which sounds something like this

 
1 / 1 / 2, 1 / 3, 10, 5, 16, 8, 4, 2, 1 / 5, 16, 8, 4, 2, 1 / 8, 4, 2, 1
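
Which is just the hailstone() function from above applied to each Fibonacci number in turn:

```python
from itertools import islice

def fibonacci():
    """Generate the Fibonacci sequence: 1, 1, 2, 3, 5, 8, ..."""
    a, b = 1, 1
    while True:
        yield a
        a, b = b, a + b

# Hailstone sequence for each of the first six Fibonacci numbers,
# reusing the hailstone() sketch from earlier
for f in islice(fibonacci(), 6):
    print(hailstone(f))
```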


Finishing Touches

Now the next part's a bit fuzzy. While I still have the notebook in which I worked out the tune, my notes aren't exactly clear.
But, basically, what happened was that the resulting tune (see above) was far too erratic in places. So I made a lot of modifications to it, but still based vaguely on patterns in the generated tune.

And the final touch I added to make the tune sound richer was chords. Again, I didn't know anything about chords. But a quick look on wikipedia told me how to make major and minor chords, so I decided to add the minor chords of the first notes of each bar.

Here's what the final tune sounds like

 
Yes, it's a bit short and basic. But all things considered, I'd say I did pretty well.


More Math Music

I've already mentioned above the music based on the digits of pi.

There was also a guy who worked out a way to convert games of chess into music.

And, as mentioned briefly in a previous post, you can go (almost) completely random and generate music using Markov chains - for example.


Oatzy.


[Oh, and I never did write that musical. Not that that should surprise you.]

Thursday, June 23, 2011

Toilet Roll

Do you ever wake up in the morning, go to the bathroom, and wonder "how much smaller would the toilet roll be if it didn't have that hole in the middle?".

Yeah, me too.

As it turns out, it's not that hard to estimate. First of all, we note that the roll is near enough to a regular cylinder that we can simplify the problem to two dimensions. We're then just dealing with areas of circles.

Of course, you could just re-roll an actual toilet roll by hand to find out. But that would be silly.

Here's a diagram
The left one is the roll as it is, the green bit on the right is the roll as it would be if it didn't have the hole in the middle (r2 being the new radius), and dr is the change in radius size - what we want to work out.

The point here is that the areas of matching colour in each diagram are the same size. So we can equate the formulas for the (green) areas and use a bit of algebra to find an equation for dr.

The full derivation is left as an exercise for the reader.
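
(A sketch of that exercise, writing r1 for the outer radius, r0 for the radius of the hole - the labels are mine - and r2 for the re-rolled radius:)

```latex
\pi r_2^2 = \pi r_1^2 - \pi r_0^2
\quad\Rightarrow\quad
r_2 = \sqrt{r_1^2 - r_0^2}
\quad\Rightarrow\quad
dr = r_1 - r_2 = r_1 - \sqrt{r_1^2 - r_0^2}
```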

The resulting equation looks like this
And plugging in values measured from an actual toilet roll we get,
So the answer turns out to be a decrease of about half a centimeter, or ~9.5%. Which isn't that much - about 1/5th the radius of the hole.

So now you know.


Oatzy.


[Before you say anything, read what it says at the top of the blog]