Thursday, April 23, 2009

More than one slash in url

The target_id column in the tracking_page_hits table records the user-requested url (usually a php page). Normally it should start with ONE slash, then the url, for example /index.php. I just realized that some url starts with MORE THAN slashes, for example ///index.php. There must be some error in recording those urls. This affect my data cleaning:
when target_id = '/feedback.php' then blah blah....
when target_id ~ '^/feedback\.php.*$' then blah blah blah ...
In the case where there are >1 slash, the regex cannot detect the patterns. I can think of two solutions:
1, give up the equal sign, and alway use target_id ~ '^/+.....'
2, remove all those unnecessary starting ////// from the table.
I like method 2, the problem is solved once for all, and I don't need to worry about whether I should use = or ~ , '^/ ......, or '^/+ ...'
How to do it then? still thinking ....
-----------------------
URLs are not always in your expected format. If you find some anomaly, regular expression is a good tool.

1 comment:

  1. update tracking_page_hits set target_id = substring(target_id from ('^/+(/.*)')) where target_id ~ '^/+/.*';

    that's how I removed the unnecessary slash.

    ReplyDelete