Tuesday, October 6, 2009

stickiness measures

Motivation
- When users start to use a service at different points in time, raw visit counts are hard to compare. User A registered in 2007 and has visited the website 20 times; user B registered in 2009 and has visited it 15 times. Which user is more active? What about visits per month, then? Considering user decay, that measure disadvantages the earliest users.
- Another problem: the K-12 effect. Teachers "hibernate" in summer, so if a teacher registered in June, we probably cannot expect many visits from her in her first 3 months (June, July, August).
- In addition, the website grows every day. More users mean more visits, and more projects.

Thus, I devised a stickiness measure that takes all three problems into consideration:
Problems and Solutions
- K-12 effect ==> divide usage data into chunks instead of months. Each chunk should contain a comparable number of visits/projects/resources.
- user decay ==> treat a user's 1st chunk differently from their 10th chunk.
- site growth ==> estimate the growth rate, and have the chunks take the growth rate into consideration.

I will use project stickiness as an example to illustrate my so-called stickiness algorithm.
Implementation details
step1: estimate growth rate
A great number of projects are created in April and October every year, while summer and the new year are two less active periods. So I use 6 months as a window -- April to September, then October to March of the next year. When the data are clumped this way, they look less "rugged" than monthly data. Although we only have 3-4 data points after clumping, we can still calculate the slope (the growth rate over a 6-month period).
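As a sketch of this step in Python, the slope can be estimated with an ordinary least-squares fit over the window index. The half-year totals below are made-up numbers; the real values would come from the site's logs.

```python
# Hypothetical totals: half_year_totals[i] = number of projects
# created in the i-th 6-month window (Apr-Sep or Oct-Mar).
half_year_totals = [1200, 1800, 2400, 3000]

def growth_rate(totals):
    """Least-squares slope of totals vs. window index: the number
    of extra projects gained per 6-month period."""
    n = len(totals)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(totals) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, totals))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

slope = growth_rate(half_year_totals)   # 600.0 for this toy series
```

With only 3-4 points the fit is crude, but it is enough to get a rough per-window growth figure.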

step2: develop chunks (size and time period).
Let's assume we use the data from Sept 2008 to August 2009. It is easiest to work with 6*N months of data (in this case, N=2).
Assume slope = 600.
Then we divide the data into 12 chunks whose sizes form an arithmetic sequence with common difference d = 600/6 = 100.
Now we know S (the total number of projects), d (the per-chunk growth), and n (the number of chunks), so we can easily calculate the size of each chunk: since S = n*a1 + d*n*(n-1)/2, the first chunk has size a1 = S/n - d*(n-1)/2, and each later chunk is d larger than the previous one.
int[] chunk_size = {...}
Now that we know the size of each chunk, we can divide Sept 2008 - August 2009 into 12 chunks and find the start and end time of each chunk according to the chunk sizes.
Timestamp[] chunk_end_time = {...}
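A minimal Python sketch of this step, using the example numbers from the text (12 chunks, d = 100) and a made-up total of S = 12000 projects:

```python
def chunk_sizes(S, n, d):
    """Sizes of n chunks that form an arithmetic sequence with
    common difference d and sum to S."""
    a1 = S / n - d * (n - 1) / 2   # first term: S = n*a1 + d*n*(n-1)/2
    return [round(a1 + i * d) for i in range(n)]

def chunk_end_times(sorted_dates, sizes):
    """End of each chunk: the creation date of the last project
    falling inside it (sorted_dates must be chronological)."""
    ends, pos = [], 0
    for size in sizes:
        pos = min(pos + size, len(sorted_dates))
        ends.append(sorted_dates[pos - 1])
    return ends

sizes = chunk_sizes(12000, 12, 100)   # [450, 550, 650, ..., 1550]
```

Each chunk is 100 projects bigger than the one before it, which is exactly how the growth rate enters the chunking.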
------------------------------------------------------------------
Step 2 gives us 12 equal chunks (after accounting for the growth rate). If users contributed to the website evenly, we would get exactly these chunks.
--------------------------------------------------------------------

step3: line up every user's usage, and estimate the average number of projects in each period.
let's assume
chunk_end_time={"2008-09-20", "2008-10-23", "2008-12-01" , "2009-01-29", "2009-03-01", "2009-04-01", ....}
And we have a user who registered on 2009-03-26. He started in the 6th chunk, so he has 12 - 6 + 1 = 7 chunks' worth of data.
We need to shift the window a little: since this user started on 2009-03-26, his first period should begin on that date, and the end time of each of his periods is calculated with the same method we used to build chunk_end_time{...}.

Remember, we line up every user this way. So this user has 7 chunks of data, and we ask how many projects he created in the 1st period after registration, in the 2nd period, and so on through the 7th period after registration.
His data therefore contribute to the first seven cells of the array average_project{...}
average_project[i] is the average number of projects a user has created in the ith period after registration.
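A Python sketch of lining up one user, using the chunk boundaries listed above (only the first 6 of the 12 end dates are shown in the text; chunk numbering is assumed to be 1-based):

```python
from datetime import date

# First six chunk end dates from the example above.
chunk_end_time = [date(2008, 9, 20), date(2008, 10, 23),
                  date(2008, 12, 1), date(2009, 1, 29),
                  date(2009, 3, 1), date(2009, 4, 1)]

def find_start_chunk(registered, ends):
    """1-based index of the chunk a registration date falls into."""
    for i, end in enumerate(ends):
        if registered <= end:
            return i + 1
    return len(ends) + 1

start = find_start_chunk(date(2009, 3, 26), chunk_end_time)
periods = 12 - start + 1   # chunks of data this user contributes
```

Here `start` is 6 and `periods` is 7, matching the worked example: the user's per-period project counts would then be added into the first seven cells of average_project.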

Of course, at the same time we need to store each user's total number of projects.

We need another array: aggregate_average_project{...}
aggregate_average_project[j] = aggregate_average_project[j-1] + average_project[j]. This is the average total number of projects for a user who has j periods.
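The cumulative array is just a running sum. A sketch in Python, with made-up per-period averages (the real values come from lining up all users in step 3):

```python
# Hypothetical per-period averages; output typically decays
# because users contribute less over time.
average_project = [3.0, 2.0, 1.5, 1.2, 1.0, 0.9, 0.8]

aggregate_average_project = []
total = 0.0
for avg in average_project:
    total += avg
    aggregate_average_project.append(total)
# aggregate_average_project[j-1] is the expected total number of
# projects for a user who has j periods of data
```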

step4: calculate project stickiness
OK, a user x's stickiness (assuming user x has 7 periods after registration) is:
stickiness[x] = (number of projects by x) / aggregate_average_project[7].
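A sketch of this final step with made-up numbers: a user with 7 periods who created 13 projects, measured against a hypothetical expected total (the aggregate_average_project values below are illustrative, not real data):

```python
def stickiness(projects_by_user, n_periods, aggregate_average_project):
    """A user's actual output divided by the expected output of an
    average user with the same number of periods."""
    return projects_by_user / aggregate_average_project[n_periods - 1]

s = stickiness(13, 7, [3.0, 5.0, 6.5, 7.7, 8.7, 9.6, 10.4])
# 13 / 10.4 = 1.25: this user is 25% more productive than average
```

A value above 1 means the user creates more projects than an average user with the same tenure, regardless of when they registered, which is the point of the whole exercise.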