Opened 5 years ago

Closed 4 years ago

Last modified 4 years ago

#708 closed New feature (fixed)

Proposal spec for schema and code updates for logging pageviews, downloads, outlinks

Reported by: matt Owned by: matt
Priority: critical Milestone: Piwik 0.5
Component: Core Keywords:
Cc: Sensitive: no

Description

Proposal spec for the new Actions schema

There are various tickets related to improvements in the "Actions" reporting in Piwik (pages, outlinks, downloads). Some of these require database schema updates to accodomate the new funtionnality. Here is a proposal spec. This spec was written with the following tickets in mind: #306, #530, #556, #707

Requirements

We want to report:

  • top pages by URL, classified by category (current behavior)
  • top pages by flattened URL (see #707)
  • report entry / exit / time per page by URL (see #306); we already have this data.
  • top pages by page title. In the top pages by title, titles don't link to the URLs by default, because it is a lot of computation to keep the relationship page title<> (URL1, URL2, ..) as a same title can have many URLs. Also, we don't report best entry page, top exit page, or time per page for each title. These analysis are only done for the URLs, to minimize overhead during archiving. (see #530) However we now need two different information about a pageview: URL, and a Name.

We do not consider an event tracking feature (see #472) as it is out of scope for Piwik 1.0, and would anyway require a different more complicated and modular data structure.

This change is also welcome at this time considering the new Javascript API that has the following related methods:

piwik.setDocumentTitle( customTitle )
piwik.trackPageView()
piwik.trackLink( url, type, customVars)

Users can customize the document title for a given pageview. This would only customize the "Top pages by title report" and would not affect the "top pages URL" and "top pages flattened URLs" reports.

Implications on Tracking (piwik.php, core/Tracker/*)

There is a currently a performance issue when querying piwik_log_action that we hope to resolve with this schema update. Indeed, when querying this table for a given URL, eg. /test/test2/test3/ if most entries in this table have a similar structure, because the index on name is limited to a few characters (15) the index lookup badly fails and we end up doing a full table scan at each page view.
To fix this issue we are going to add a new column in this table, "hash", that will be the hash of the "name" column, using the fast CRC32 algorithm. This will speed up the SELECT idaction in piwik.php.

We also now need to save two information per page view: the name of the page, and the URL.
For clicks and downloads, we will by default only save the URL, but plugins could override this and also save a name, to allow advanced reporting on downloads and outlinks.
We want to keep things simple and will save all names and all URLs in this same table, piwik_log_action.

Schema updates for piwik_log_action

Current schema:

idaction INTEGER(10) UNSIGNED NOT NULL AUTO_INCREMENT,
name VARCHAR(255) NOT NULL,
type TINYINT UNSIGNED NULL,
PRIMARY KEY(idaction),
INDEX index_type_name (type, name(15))

Proposed schema:

  idaction INTEGER(10) UNSIGNED NOT NULL AUTO_INCREMENT,
  hash UNSIGNED INT NOT NULL,
  name VARCHAR(255) NOT NULL,
  type TINYINT UNSIGNED NULL,
  PRIMARY KEY(idaction),
  INDEX index_type_name (type, hash)

Note: this is explained in the book "high performance mysql 2nd edition" page 104;

Basically instead of

SELECT WHERE name = '/path1/path2/path3' 

you do

SELECT WHERE hash = CRC32('/path1/path2/path3') AND name = '/path1/path2/path3'

you need the AND name = '/path1/path2/path3' in case two urls have the same hash (which happens with 1% probability after 100k results, cf birthday paradox).

CRC32 doc is on http://dev.mysql.com/doc/refman/5.1/en/mathematical-functions.html#function_crc32.

The hash field should be set when INSERTing new page names/ urls in this table, using the mysql CRC32 function; we don't want to use the php crc at any time.

Currently Actions->getIdAction returns the idaction for the given name. Instead, the function would now set actionIdName and actionIdUrl in the object. For downloads and outlinks, actionIdName would be empty;

In the case of pages, we would like to select these two IDs in only one SQL select for efficiency. For example, it would look like:

SELECT idaction 
FROM piwik_log_action 
WHERE 

    (hash = CRC32('/path1/path2/path3') AND name = '/path1/path2/path3' AND type = $this->getActionType())
  OR 
    (hash = CRC32('Welcome to URL') AND name = 'Welcome to URL' AND type = Piwik_Tracker_Action_Interface::TYPE_ACTION_NAME)

For downloads/outlinks there wouldn't be the part after the OR, as by default we only track URLs for downloads/outlinks;

Eg.

SELECT idaction 
FROM piwik_log_action 
WHERE (hash = CRC32('http://piwik.org/latest.zip') AND name = 'http://piwik.org/latest.zip' AND type = $this->getActionType()

Schema update for piwik_log_link_visit_action

  `idlink_va` int(11) NOT NULL auto_increment,
  `idvisit` int(10) unsigned NOT NULL,
  `idaction` ---> renamed in idaction_url
  NEW idaction_name 
  `idaction_ref` --> renamed in idaction_url_ref
  `time_spent_ref_action` int(10) unsigned NOT NULL,
  PRIMARY KEY  (`idlink_va`),
  KEY `visit` (`idvisit`)

For pages, we would record both idaction_name and idaction_url fetched from piwik_log_action as specified above.
For download and outlinks, we would only set idaction_url, and idaction_name would be null. Plugins could set this value if necessary

Action->record() will have to be updated accordingly to now record these two idaction_*;

Note: Action object already has a getActionName and getActionUrl method which matches nicely; we would maybe add a getActionNameId and getActionUrlId that would be set by the current getActionId (that could be rename loadActionNameAndUrl)?

Schema update for piwik_log_visit

visit_exit_idaction is renamed in visit_exit_idaction_url
visit_entry_idaction is renamed in visit_entry_idaction_url

Several lines have to be updated in the code to reflect this rename but there are no logic changes.

Schema update for piwik_log_conversion

idaction is renamed in idaction_url

GoalManager->recordGoals() query must be updated accordingly.

As shown, all existing idaction reference the entry in piwik_log_action for the URL of the concerned pageview; the URL is now officially the main information about a pageview, and the "name" (eg. the html document title) is one more piece of information about each page view. However conversions, and visits entry/exit pages, only care about the pageview URL, mostly for simplicity.

Note: I would expect to have more small modifications not explained here, in the php code of Tracker/*, especially Tracker/Action.php

Implications on Archiving (plugins/Actions/Actions.php)

Changes in the schema should only affect archiving for plugins/Actions/Actions.php.

To do #530, the first query at line 124 would have to ran twice, once joining on idaction_url for processing reports by page title (existing query), and one more by joining on idaction_name to process the new report "top pages by title".

Other queries would have to be slightly updated to reflect the field's name change.

User Interface

To do #530 we would add a new submenu: Actions> Page titles to show best pages by title, similar to existing Actions>Pages.
There would be a new API method that would return the top pages by title.

Migration

There are several required schema updates which may take minutes to execute on the piwik_log_* tables; we might want to update the message before the update to make clear that this will take a while and user should be patient.

The schema updates must be done in the CREATE TABLE statements in core/Piwik.php, in all queries querying these tables (mentionned above) and the update script can be written in core/Updates/X.php

Other

  • The data generator in core/Tracker/Generator and misc/generateVisits.php will have to be updated to reflect the new DB URL structure and ensure that the generator generates all the title + URL data as expected.
  • Also this new schema change will invalidate the TrackerSecondaryDb which will have to be slightly updated to reflect the new structure.
  • Some of this code is covered by unit tests which will have to be updated (tests/core/Tracker/Action) with new logic

Conclusion

This is not a small change, but we are blocked on several fronts because of the current DB schema; implementing this change will make a lot of other tickets possible and add new features in Piwik in the near future. Also, it makes possible for plugins to custom name outlinks and downloads.

Any questions, feedback, ideas, please let me know: this is a proposal!

Attachments (2)

708.patch (49.3 KB) - added by mauser 5 years ago.
0.5.php (1.5 KB) - added by vipsoft 5 years ago.
revised update script

Download all attachments as: .zip

Change History (29)

comment:1 Changed 5 years ago by rc_amsterdam

Seems a good proposal to me! I have a note on saving the page Title.

I've been playing around with Piwik to make it log both Title and url. In PIWIK 0.4 it is possible to log the Title using the following function in the trackingcode:

piwikTracker.setDocumentTitle(document.title);

But in the (utf-8) database this results in problems with both utf-8 encoding and html_entity encoding. I had to put in this code to resolve this problem:

if($actionType == 1){
   $actionName = utf8_decode($actionName);   
   $actionName = html_entity_decode($actionName);
}

I think this is something that needs to be addressed when implementing this proposal.

comment:2 follow-up: Changed 5 years ago by matt (mattab)

rc_amsterdam, can you please submit steps to reproduce the error? what exactly happens if you submit utf7 strings?

comment:3 in reply to: ↑ 2 Changed 5 years ago by rc_amsterdam

Replying to matt:

rc_amsterdam, can you please submit steps to reproduce the error? what exactly happens if you submit utf7 strings?

I don't know if the problem relates to utf7 strings.

As far as I remember these steps:

  1. add
    piwikTracker.setDocumentTitle(document.title);
    

in trackingcode.

  1. Look in de database (I use phpMyAdmin) in the table "log_action" and I see a lot of messed up data in the "name" column. But the data only gets messed up for pages with special characters. Characters like "é", "ë" etc. But also special characters in HTML like "'", "&" etc.

does this help?

comment:4 Changed 5 years ago by matt (mattab)

do you then have wrong data in the "Actions" report as well?

comment:5 follow-up: Changed 5 years ago by vipsoft (robocoder)

In this forum post, r.guggeis suggests changing the index on the log_action table:

ALTER TABLE `log_action` DROP INDEX `index_type_name`
ALTER TABLE `log_action` ADD INDEX `name` (`name`);

comment:6 in reply to: ↑ 5 Changed 5 years ago by matt (mattab)

Replying to vipsoft:

In this forum post, r.guggeis suggests changing the index on the log_action table:

ALTER TABLE `log_action` DROP INDEX `index_type_name`
ALTER TABLE `log_action` ADD INDEX `name` (`name`);

yes this would make sense for the current schema, but the proposal here is an even better solution (we index the hash of the name as opposed to indexing the name which has an unknown length - this could be very costly for websites having very long names)

comment:7 Changed 5 years ago by alivenk

comment:8 Changed 5 years ago by mauser (zawadzinski)

  • Sensitive unset

new version of patch (compatible with current release of Piwik 0.4.3) added.

comment:9 Changed 5 years ago by mauser (zawadzinski)

  • Sensitive set

comment:10 Changed 5 years ago by matt (mattab)

  • Milestone changed from 2- DigitalVibes to 2 - Piwik 0.5
  • Sensitive unset

comment:11 Changed 5 years ago by domtop

comment:12 Changed 5 years ago by spomoni

comment:13 Changed 5 years ago by spomoni

comment:14 Changed 5 years ago by spomoni

comment:15 Changed 5 years ago by vipsoft (robocoder)

  • Priority changed from major to critical

comment:16 Changed 5 years ago by vipsoft (robocoder)

  • Milestone changed from 2 - Piwik 0.6 to 1 - Piwik 0.5

Changed 5 years ago by mauser (zawadzinski)

comment:17 Changed 5 years ago by mauser (zawadzinski)

Current patch in the attachment. One important note:

Page Titles are tracked only when user sets in Javascript action_name. If action_name is not given, given page view is not tracked in Page Titles view. As long as user does not set action_name Piwik behaves the same way as before (view Actions->Pages display action URL's).

Backward incompability occurs only when user used action_name's because they will appear in Actions->Pages, and after the update they will go into Actions->Page titles (data in the database won't be updated because we do not have information about which Action was URL and which was not). This is important to note that users have to be aware of before updating to 0.5.

Changed 5 years ago by vipsoft (robocoder)

revised update script

comment:18 Changed 4 years ago by matt (mattab)

see commit in [1530]

TODO before closing the ticket:

  • Anthon: apply small modifs to the JS tracker (setCustomUrl() and action_name defaulting to document.title)
  • Matt: improve/fix unit tests around the expected behavior of Action naming
  • Maciej/Matt: load testing to ensure this is working as fast as we expect under heavy load, with a huge log_action table

comment:19 Changed 4 years ago by vipsoft (robocoder)

In [1531]

  • refs #708 - add setCustomUrl(url) to override document.location.href (default)
  • set action_name to document.title (default)
  • add alias for window.location.hostname (for yuicompressor)
  • rename private method, getWebBug(), to logPageView() for consistency
  • jslint 2009-10-04 ("The Good Parts")

comment:20 Changed 4 years ago by matt (mattab)

  • Resolution set to fixed
  • Status changed from new to closed

marking as fixed for now - thanks all for your work

comment:21 Changed 4 years ago by vipsoft (robocoder)

what about the two unit tests failing in tests/core/Tracker/Action.test.php (empty request) and plugins/Actions/tests/Actions.test.php ?

comment:22 Changed 4 years ago by matt (mattab)

  • Resolution fixed deleted
  • Status changed from closed to reopened

comment:23 Changed 4 years ago by vipsoft (robocoder)

  • Owner changed from mauser to matt
  • Status changed from reopened to new

comment:24 Changed 4 years ago by vipsoft (robocoder)

  • Milestone changed from 1 - Piwik 0.5 to 1 - Piwik 0.5.1

comment:25 Changed 4 years ago by matt (mattab)

  • Resolution set to fixed
  • Status changed from new to closed

comment:26 Changed 4 years ago by vipsoft (robocoder)

  • Milestone changed from 1 - Piwik 0.5.1 to 1 - Piwik 0.5
Note: See TracTickets for help on using tickets.