Opened 2 years ago

Closed 20 months ago

Last modified 19 months ago

#2976 closed Task (fixed)

Improve storage of URLs, normalization at DB Level

Reported by: EZdesign Owned by: EZdesign
Priority: major Milestone: 1.12.x - Piwik 1.12.x
Component: Core Keywords:
Cc: Sensitive: no

Description (last modified by EZdesign)

Problem

Pages might have multiple URLs, especially if a site has multiple domains.

The basic cases are

  1. Alias domains
  2. With and without www
  3. http and https

Current solution

  • For the Actions report, normalization is done implicitly in Piwik_Actions::getActionExplodedNames. The domain is removed, which takes care of (2) and (3). It also handles (1) correctly if the domains are real aliases, but it fails if the aliases are domain1.com/ and domain2.com/something/.
  • For the segement pageUrl, one can use =@ (i.e. contains) and omit domain and protocol.

Why this is not enough

For upcoming features, we need data based on URLs. One URL might have multiple idactions.

Using WHERE idaction IN ( ... ) works in some cases but not always. E.g. if you want the pages visitors viewed directly after a certain page you can use idaction_url_ref IN ( ... ) but you cannot recognize aliases in the result (column idaction) on the DB level.

Proposed solution

  1. Do some normalization before writing to log_action. We would not record protocol and www for page URLs ie. type == TYPE_ACTION_URL == 1.
  2. Add the information about protocol and www as a TINYINT to log_action. 1 = http://, 2 = http://www, 3 = https://, 4 = https://www. This information is added when new actions are inserted and never modified. It can be used to reconstruct the full URL. When actions should be tracked with the same URL but different TINYINT, use the existing idaction.
  3. Analysis can be done without worrying about problem (2) and (3).
  4. Problem (4) - alias domains - is acceptable. Maybe we can add a switch to the site configuration where users can configure Piwik to treat the domains as aliases. If set, the alias domains will be replaced with the main domain when tracking. By default, this option is off.
  5. We need to upgrade existing databases. Transform log_action, find duplicates, change foreign keys to duplicates in other tables, remove duplicates. Ideally, this is done without using stored procedures / functions and without data shipping to PHP.

Change History (24)

comment:1 Changed 2 years ago by vipsoft (robocoder)

  • Description modified (diff)

s/remote/remove/

comment:2 Changed 2 years ago by vipsoft (robocoder)

I think this is an acceptable edge case. From prior discussion, SEO would discourage the use of alias domains.

comment:3 Changed 2 years ago by matt (mattab)

We need to focus on performance, because this feature will be potentially very slow. We can't afford adding a new field to log_action to deal with this edge case. To maximise performance, solution 4) is the way to go. It is not a big drawback IMO because most websites do not use alias.

vote for wontfix/worksforme/ acceptable edge case.

comment:4 Changed 2 years ago by matt (mattab)

Note: a relevant new FAQ to put up to help with this issue is to explain how to track the custom URL to be the canonical URL (if websites generate it, most CMS/apps do these days). See #2974 for code example

comment:5 Changed 2 years ago by matt (mattab)

See also #2805 (Purge logs from log_action) - could this be done if we happen to change URL storage algorithm?

comment:6 Changed 2 years ago by matt (mattab)

After discussing further with timo, we came to the conclusion that storing protocol + hostname for Page URLS (type==1) in log_action table is not currently useful very much.

  • Storing hostname is only used when using the segments pageUrl, exitPageUrl, entryPageUrl.
  • However storing hostname for all rows where type==1 in log_action causes extra disk usage and prevents us from writing efficient and user friendly new features

Therefore, it makes sense to consider not storing the protocol+hostname in the future.
Proposed solution

  • Tracking: We would not record protocol+hostname for Page URLs ie. type== TYPE_ACTION_URL ==1
    • log_action rows for type TYPE_OUTLINK = 2; TYPE_DOWNLOAD = 3; TYPE_ACTION_NAME = 4; would be unchanged and still have full URL
  • Upgrade: The upgrade procedure would be quite slow. Timo is working on a test upgrade that we will try on the demo. Definitely high traffic piwik users will have to stop tracking and disable UI during the upgrade.
    • Post upgrade: the log_action table would be smaller (protocol+host stripped out)
    • Future developments will not have to worry about URL normalization. It will also improve user friendliness since URLs will be automatically aggregated.
  • Archiving: We would not put the hostname in the Actions Page URLs datatables.
  • API: The hostname would be prepended in the metadata 'url' column in the Actions.getPageUrls API call
    • This change to Archiving+API would result in smaller future Actions Datatable (not a significant improvement but always nice)


TODO: New FAQ to explain how to do alias domain URL segmentation.

  • Because some users might use pageUrl, exitPageUrl, entryPageUrl segmentation to test various aliases performance, we should document, before releasing this change, how to do Domain segmentation via Custom Vars. Typically, user could record the hostname in a custom variable of scope "page". Then it could segment the reports using &segment=customVariablePageName1==example.org;pageUrl=http://this-domain-is-not-used/page.html


This will first normalize the pageUrl to "page.html" then select the idaction, then segment further only page views that had a custom variable "hostname" set to "example.org".


This would achieve the current behavior of &segment=pageUrl=http://example.org/page.html .

  • document in the segmentation page the fact that segment pageUrl, entryPageUrl, exitPageUrl do not use the hostname.
    • This is the main downside of this change: that segments implementation is not really what their name mean, because only the path+querystring from the specified segment valueswill be used (rather than the whole URL).

The benefits of this change far outweight the downsides. The very slow upgrade will be a challenge, but a necessary one if we want to keep innovating :)

comment:7 Changed 2 years ago by matt (mattab)

  • Milestone set to 1.8 Piwik 1.8

comment:8 Changed 2 years ago by EZdesign (BeezyT)

  • Description modified (diff)

After some discussion via email, here's an updated version of the ticket description.

comment:9 Changed 2 years ago by matt (mattab)

Maybe we can add a switch to the site configuration where users can configure Piwik to treat the domains as aliases. If set, the alias domains will be replaced with the main domain when tracking. By default, this option is off.

I think that's OK for V2 but we don't need the switch for V1, as this is an edge case and I think will not be useful for many people.

Update script

Hopefully you can manage to do the update without stored functions indeed, as I'm pretty sure many users will not be allowed to create the functions on shared hosts.

Because now the normalization is simple, maybe we can do something like:

select @prefix_type := (
case when LEFT(name, 10) = 'http://www' then 'http://www' 
when LEFT(name, 7) = 'http://' then 'http://' 
case when LEFT(name, 11) = 'https://www' then 'https://www'
when LEFT(name, 8) = 'https://' then 'https://' 
else '' 
end), 
name, 
IF(@prefix_type <> '', REPLACE(name,@prefix_type,''), name ) as nameafter
from piwik_log_action
where type=1
#AND idaction > 730641 

comment:10 Changed 2 years ago by EZdesign (BeezyT)

Trac won't let me add files with 1K lines. So here's the patch in gist: https://gist.github.com/2157026 . It consists of two commits I have in a git branch. I hope you can manage to apply it to your SVN working copy.

  • All the integration tests pass and I added a new test suite for the normalization.
  • I called the update 1.7.3-b1, assuming that we release 1.7.2 before we do this. But the version can be changed when I commit to trunk.

Please review the patch carefully.


comment:11 Changed 2 years ago by matt (mattab)

Patch looks good, thanks. +1 for the tests!
I didn't apply it but I generally don't apply patches anyway, just read them.

The only potential problem I can think of is the inserting of NULL values, which I remember has some issues with Mysqli. But, if that's the case, the beauty is that tests will fail when commmitted on jenkins, so we shall see.

great stuff!

comment:12 Changed 23 months ago by matt (mattab)

  • Milestone changed from 1.9 Piwik 1.9 to 1.8.x - Piwik 1.8.x
  • Priority changed from normal to major

comment:13 Changed 20 months ago by EZdesign (BeezyT)

(In [6790]) refs #2976 updates can be marked a major

comment:14 Changed 20 months ago by EZdesign (BeezyT)

(In [6791]) refs #2976 fixing outdated doc of Piwik_Config

comment:15 Changed 20 months ago by EZdesign (BeezyT)

(In [6792]) refs #2976 url normalization: store protocol and www in the url_prefix column of log_action. treat pages with different protocol or with/without www as the same action. includes a major db transformation and tests.

comment:16 Changed 20 months ago by EZdesign (BeezyT)

It would be good if as many eyes as possible could review this update since it's quite critical.

comment:17 Changed 20 months ago by EZdesign (BeezyT)

(In [6794]) refs #2976 svn properties

comment:18 Changed 20 months ago by EZdesign (BeezyT)

Looks like I broke the build...

/home/www/data/root/jenkins.private/jobs/Piwik/workspace/build/plugins/PrivacyManager/tests/PrivacyManager.test.php -&gt; Test_Piwik_PrivacyManager -&gt; test_purgeData_deleteReportsKeepRangeReports -&gt; Unexpected exception of type [Zend_Db_Statement_Exception] with message [SQLSTATE[42000]: Syntax error or access violation: 1064 You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ' ,

[exec] , , 1000)' at line 4]

What is this trying to tell me??

This particular test case doesn't complete on my machine at all.

comment:19 Changed 20 months ago by SteveG (sgiehl)

(In [6802]) refs #2976 fixed privacymanager tests

comment:20 Changed 20 months ago by matt (mattab)

(In [6808]) Adding quick tip about disabling maintenance and re-enable tracker refs #2976

comment:21 Changed 20 months ago by matt (mattab)

  • Resolution set to fixed
  • Status changed from new to closed

Great code Timo!

It will be useful to have this data cleaned up and simplified int he DB...

This wasnt an easy upgrade script to write, you did really well

  • Tests look good, testing both the segmentation, and that the log_action is correctly built, and case sensitivity..

Not much to add, good stuff. Marking as fixed

comment:22 Changed 20 months ago by matt (mattab)

(In [6814]) No duplicate code + Testing for entryPageUrl/exitPageUrl Refs #2976

comment:23 Changed 19 months ago by matt (mattab)

  • Summary changed from URL Normalization on DB Level to Improve storage of URLs, normalization at DB Level

comment:24 Changed 19 months ago by matt (mattab)

in [7017] Refs #2976 Removing comments from the SQL as, when we display SQL, it gets all inlined and #Xxxx comments gets inlined causing SQL to fail. Reported in forums eg. http://forum.piwik.org/read.php?2,93934

Note: See TracTickets for help on using tickets.