Tuesday, October 6, 2009

stickiness measures

Motivation
- When users start to use a service at different points, it is hard to compare the raw visits. User A registered in 2007, and has visited the website for 20 times; user B registered in 2009, and has visited the website for 15 times. Which user is more active? Then, what about visit per month? Considering user decay, the earliest users are disadvantaged.
- Another problem: K12 effect. Teachers "hibernate" in summer, and if a teacher registered in June, probably we cannot expect too much visit from her in the first 3 months (June, July, August).
- In addition, the website grows everyday. More users mean more visits, and more projects .

Thus, I devised a stickiness measure that take three problems into consideration:
Problems and Solutions
- k12 effect ==> divide usage data into chunks, instead of months. Each chunk should have comparable number of visits/ projects/ resources.
- user decay ==> We treat users' 1st chunk differently with the 10th chunk.
- site growth ==>Estimate the growth rate. And the chunks should take the growth rate into consideration.

I will use project stickiness as an example to illustration my so-call stickiness algorithms
Implementation details
step1: estimate growth rate
A great number of projects were created in April and October every year. And summer and new year are two less active period. So, I use 6 months as a window -- April to Sept, then October to March of next year. Well, when the data are clumped together, it looks less "rugged" than monthly data. Though we only has 3-4 data points after clumping, we can still calculate the slope (growth rate over a 6 month period).

step2: develop chunks (size and time period).
Let's assume we use Sept 2008 - August 2009's data. Well, it is easier to have 6*N months of data (in this case, N=2).
Assume slope = 600.
Then, we divide the data into 12 chunks, this is a arithmetic sequence, with d=600/6=100.
Now, we know S (the total number of projects), d (the growth rate), and n (the number of chunks). We can easily calculate the size of each chunk.
new an int array chunksize{....}
Now we know the size of each chunk, we can divide Sept2008-August2009 into 12 chunks, and find out the the start and end time of each chunk according to the chunk size.
new a Timestamp array chunk_end_time{...}
------------------------------------------------------------------
Step2 gives us 12 equal chunks (consider growth rate). And if users contribute to the website equally, we should get those perfect chunks.
--------------------------------------------------------------------

step3: line up every user's usage, and estimate the average projects of each block.
let's assume
chunk_end_time={"2008-09-20", "2008-10-23", "2008-12-01" , "2009-01-29", "2009-03-01", "2009-04-01", ....}
And we have a user who registered on 2009-03-26. Then, he started on the 5th chunk, and he had 12-5=7 chunks' data.
We need to shift the window a little bit, since this user started on 2009-03-26, the 5th chunk should start from that date too, and the end time of each chunk is calculated using the method when we calculate chunk_end_time{...}.

Remember, we will line up every user. So, this user has 7 chunks of data, and we will seek, how many project he has created on the 1st period after registration, on the 2nd period after registration, on the 3rd, 4th,....7th period after registration.
So, his data contribute to the first seven cells of the array average_project{...}
average_project[i] means, the average number of project a user has created on the ith period after registration.

Of course at the same time, we need to store each users' total number of projects.

Need another array: aggregate_average_project{...}
aggregate_average_project[j] = aggregate_average_project[j-1] + average_project[j]. This means the average total number of projects if a user has j periods.

step4: calculate project stickiness
Ok, a user x's stickiness (assume user x has 7 periods after registration)
stickiness[x] = number of projects by x / aggregate_average_project[7].

Tuesday, September 29, 2009

distribution ... aggregation

well, I want to compare users' visits with their peers' activities. But do January's visits comparable with July's visits? Do we need to consider the k12 effect?
Let's check its distribution first

1, projects creation


2, resource collection











3, registered teachers' visits

Sunday, June 14, 2009

create unique session IDs

Each transaction (consisting of one or more entries) has a session id. But this id is not unique globally. I noticed that in jsp pages (written via struts, java, jsp), the same session id will be reused once it has been destroyed. In php pages (IA), session id is reused less, but this problem still exists.

To create unique session IDs:

step1: create a composite index first by session_id, then by id

step2: order all entries first by session_id, then by id

then

CREATE OR REPLACE FUNCTION make_session_unique(OUT num INTEGER) AS $$
DECLARE
currow RECORD;
viewed timestamp without time zone;
session text;
newid BIGINT;
BEGIN
viewed = '3000-01-1';
session = '1';
FOR currow IN (SELECT id, php_session_id, time_viewed
FROM tracking_page_hits ORDER BY php_session_id, id)
LOOP
IF (session <> currow.php_session_id) THEN
newid = currow.id;
session = currow.php_session_id;
ELSIF viewed < currow.time_viewed - INTERVAL '30 MINUTES'
THEN
newid = currow.id;
END IF;
UPDATE
tracking_page_hits
SET
new_session_id = php_session_id || '--' || newid
WHERE
id = currow.id;
viewed = currow.time_viewed;
END LOOP;
END;
$$ LANGUAGE plpgsql;

step3: drop the composite index

Friday, May 22, 2009

slash at the end of url

Some URL has a slash at the end, for example, /iamteacher.php/
This is not a good news for pattern matching. So I remove all of them:

update tracking_page_hits set target_id = substring(target_id from '(.*[^/]+)/+$') where target_id ~ '.+/$';

Be careful with regex. Does it remove all slashes at the end, or just the last one?

Monday, May 18, 2009

Recommend two social network analysis software

Free ...
Simple ...

CFinder -- very easy to use. Help to find communities, and cliques. But the new version doesn't show edge weight, though it takes weight into account. Also, the directed graph looks weird. If there are A -> B, and B->A, it only shows one direction, not both.

Visone -- good for visualization. But no help file. Cannot imagine a software without help manual.

Sunday, May 3, 2009

social network analysis - data generation

We are going to use social network analysis to analyze IA teachers' communities. Two types of communities: 1) view projects 2) copy projects.

For community 1) view projects
IA has a tracking projects hits table since 2006-08-22 12:46:49.14858. So, the teacher pool will be:
- registered teacher who has logged in >= 2006-08-22 12:46:49.14858
- registered teacher who has public projects
- have viewed or been viewed by others

1) I am going to create a "view" of this teacher pool. It will be a sql view, but I want to do it in my Java program, so that everything I run it, a new view can be generated, (and dropped after use).

2) I need to remove IA team members from this pool.

3) There is a concern that some projects has "inflated" number of views just due to their positions in IA. IA showcase presents the most frequently visited projects and the projects with the most resources. Those projects remain there "forever", and have been visited more than a thousand times. But when I check the tracking_page_hits table, the "inflated" number are just caused by unregistered users. Since we only analyze a teacher pool (registered users), this should be fine.

For community 2) copy projects. Teacher pool will be:
- have copied or been copied by another teacher after '2008-09-07' (the time when copy project function was launched)

Thursday, April 30, 2009

How to export report from google analytics (PHP)

You need:
GA account (email + password)
GA profile (usually the website you are tracking)
php

Why do we do this?
Access your data without logging into GA
Can download massive number of reports all at once
Twist it into which ever format you want

Template code:
http://manfred.dschini.org/2008/09/26/google-analytics-export/
click "available here " to download

How to use it:
You will see some $_GET[....] .... That is when the user need to write an interface, and set variables through the UI. Actually, you can set those variables inside the php code. I will explain the things you need to set:
$cmp (I don't know. the default is average. So, you can set it as 'average')
$fmt (This correspond to the GA's four export formats. 0-pdf, 1-xml, 2-..... It is highly recommended to use '1' if you want to twist the data further).
$rpt (report typle. export.php page lists all available reports)
$pdr (report duration. for example: $pdr='20080601-20080829')
$email (your GA email)
$password (your GA password)
$id (your GA ID. This is tricky. How do you know your id?
Go to this page:
http://code.google.com/apis/analytics/docs/gdata/gdataLibraries.html
Let's assume this url is not stable. You can get to this page by searching: "Google Analytics data report api Client Libraries & Sample Code". The first result ...
Okay, now you are on that page, click "interactive samples" under JavaScript. These are a few samples. Use the "Retrieve Your Accounts" demo, "Run". Follow the steps. You will get your profile. The profile Id is the value for your $id variable.
By default, the php template will return data on daily basis. If you select a very large time window, you will get too much numbers. There is another variable you can set:
$gdfmt. This allows you to get report by hour/day/week/month
$gdfmt='nth_month' means it is by month. I like by month -:p
This variable is not in the template. But once you know how it works, it is easy to set another variable.
As for the url's format, it is:
https://www.google.com/analytics/reporting/export?
with a number of paramters attached. You can go to your GA account, and click "export", then check out the url of one of those downloadable formats. That gives you a better idea of how to twist your template.

Other API
Bart mentions one:
http://www.cari.net/applescript-and-google-analytics-automatic-exporting.html
It seems it is much easier. But it is for Mac machine. I don't know Mac :(

Google provides its own data report api. The Google Analytics Data Export API is currently in public beta. While in public beta, our API will remain subject to change. I spend a few hours to learn it. It doesn't provide a good tutorial. Plus their Java codes have error! And I don't know where to download the complete js library. I give up...

Friday, April 24, 2009

the precision of timestamp

Each user activity is associated with a timestamp : create a project, add a resource, view a project, register...
Some are in the 'yyyy-mm-dd hh:mm:ss' format, and some are in the 'yyyy-mm-dd hh:mm:ss.xxxx'. There is no problem to make time comparision within the same table (same timestamp precision), but when I need to refer to different tables, it is a trouble ...

Here is an example:
tracking_page_hits (table)
- time_viewed (timestamp, '2009-03-04 13:45:05.08988)
- target_id (text, /addownresource.php)
- uid
resources (table)
- creation_date (timestamp, '2009-03-04 12:45:05)
- uid
- resource_id
There are two buttons in the addownresource.php page: "add", "cancel". Either button brings user back to the same page - /myresources.php
Did the user add a resource? I need to check the resources table to see if a resource was created at that moment. I use these two criteria to check:
- creation_date > time_viewed AND creation_date - time_viewed <= 1 minute (The action should take less than a minute)
- uid is the same

'2009-03-04 12:45:05 < 2009-03-04 13:45:05.08988. Judging from the criteria, this particular resource is not the "outcome" of visiting "addownresource.php". But, but, but, what if the creation_date is actually '2009-03-04 12:45:05.37599'? Then, it fits the criteria.

I suggest to convert all timestamps to the same precision. That's how it works in psql:
date_trunc('second', time_viewed)
Then every timestamp will be in the 'yyyy-mm-dd hh:mm:ss' format.

------------------------------
Be careful when comparing timestamps of different precision / different timezone.

Thursday, April 23, 2009

More than one slash in url

The target_id column in the tracking_page_hits table records the user-requested url (usually a php page). Normally it should start with ONE slash, then the url, for example /index.php. I just realized that some url starts with MORE THAN slashes, for example ///index.php. There must be some error in recording those urls. This affect my data cleaning:
when target_id = '/feedback.php' then blah blah....
when target_id ~ '^/feedback\.php.*$' then blah blah blah ...
In the case where there are >1 slash, the regex cannot detect the patterns. I can think of two solutions:
1, give up the equal sign, and alway use target_id ~ '^/+.....'
2, remove all those unnecessary starting ////// from the table.
I like method 2, the problem is solved once for all, and I don't need to worry about whether I should use = or ~ , '^/ ......, or '^/+ ...'
How to do it then? still thinking ....
-----------------------
URLs are not always in your expected format. If you find some anomaly, regular expression is a good tool.

Wednesday, April 22, 2009

Why am I blogging?

This is a journal on my experience with "educational data mining" a digital library service: http://ia.usu.edu

who might be interested:
me :)
edm researchers
in-service teachers, maybe maybe not. Probably not now, but you will find something interesting 3 months from now on.