Improve storage of URLs, normalization at DB Level
|Reported by:||EZdesign||Owned by:||EZdesign|
|Priority:||major||Milestone:||1.12.x - Piwik 1.12.x|
Description (last modified by EZdesign)
Pages might have multiple URLs, especially if a site has multiple domains.
The basic cases are
- Alias domains
- With and without www
- http and https
- For the Actions report, normalization is done implicitly in Piwik_Actions::getActionExplodedNames. The domain is removed, which takes care of (2) and (3). It also handles (1) correctly if the domains are real aliases, but it fails if the aliases are domain1.com/ and domain2.com/something/.
- For the segement pageUrl, one can use =@ (i.e. contains) and omit domain and protocol.
Why this is not enough
For upcoming features, we need data based on URLs. One URL might have multiple idactions.
Using WHERE idaction IN ( ... ) works in some cases but not always. E.g. if you want the pages visitors viewed directly after a certain page you can use idaction_url_ref IN ( ... ) but you cannot recognize aliases in the result (column idaction) on the DB level.
- Do some normalization before writing to log_action. We would not record protocol and www for page URLs ie. type == TYPE_ACTION_URL == 1.
- Add the information about protocol and www as a TINYINT to log_action. 1 = http://, 2 = http://www, 3 = https://, 4 = https://www. This information is added when new actions are inserted and never modified. It can be used to reconstruct the full URL. When actions should be tracked with the same URL but different TINYINT, use the existing idaction.
- Analysis can be done without worrying about problem (2) and (3).
- Problem (4) - alias domains - is acceptable. Maybe we can add a switch to the site configuration where users can configure Piwik to treat the domains as aliases. If set, the alias domains will be replaced with the main domain when tracking. By default, this option is off.
- We need to upgrade existing databases. Transform log_action, find duplicates, change foreign keys to duplicates in other tables, remove duplicates. Ideally, this is done without using stored procedures / functions and without data shipping to PHP.
Change History (24)
comment:12 Changed 23 months ago by matt (mattab)
- Milestone changed from 1.9 Piwik 1.9 to 1.8.x - Piwik 1.8.x
- Priority changed from normal to major
comment:21 Changed 20 months ago by matt (mattab)
- Resolution set to fixed
- Status changed from new to closed
comment:23 Changed 19 months ago by matt (mattab)
- Summary changed from URL Normalization on DB Level to Improve storage of URLs, normalization at DB Level