Ticket #708 (closed New feature: fixed)
Proposal spec for schema and code updates for logging pageviews, downloads, outlinks
| Reported by: | matt | Owned by: | matt |
|---|---|---|---|
| Priority: | critical | Milestone: | Piwik 0.5 |
| Component: | Core | Keywords: | |
| Cc: | Sensitive: | no |
Description
Proposal spec for the new Actions schema
There are various tickets related to improvements in the "Actions" reporting in Piwik (pages, outlinks, downloads). Some of these require database schema updates to accodomate the new funtionnality. Here is a proposal spec. This spec was written with the following tickets in mind: #306, #530, #556, #707
Requirements
We want to report:
- top pages by URL, classified by category (current behavior)
- top pages by flattened URL (see #707)
- report entry / exit / time per page by URL (see #306); we already have this data.
- top pages by page title. In the top pages by title, titles don't link to the URLs by default, because it is a lot of computation to keep the relationship page title<> (URL1, URL2, ..) as a same title can have many URLs. Also, we don't report best entry page, top exit page, or time per page for each title. These analysis are only done for the URLs, to minimize overhead during archiving. (see #530) However we now need two different information about a pageview: URL, and a Name.
We do not consider an event tracking feature (see #472) as it is out of scope for Piwik 1.0, and would anyway require a different more complicated and modular data structure.
This change is also welcome at this time considering the new Javascript API that has the following related methods:
piwik.setDocumentTitle( customTitle ) piwik.trackPageView() piwik.trackLink( url, type, customVars)
Users can customize the document title for a given pageview. This would only customize the "Top pages by title report" and would not affect the "top pages URL" and "top pages flattened URLs" reports.
Implications on Tracking (piwik.php, core/Tracker/*)
There is a currently a performance issue when querying piwik_log_action that we hope to resolve with this schema update. Indeed, when querying this table for a given URL, eg. /test/test2/test3/ if most entries in this table have a similar structure, because the index on name is limited to a few characters (15) the index lookup badly fails and we end up doing a full table scan at each page view. To fix this issue we are going to add a new column in this table, "hash", that will be the hash of the "name" column, using the fast CRC32 algorithm. This will speed up the SELECT idaction in piwik.php.
We also now need to save two information per page view: the name of the page, and the URL. For clicks and downloads, we will by default only save the URL, but plugins could override this and also save a name, to allow advanced reporting on downloads and outlinks. We want to keep things simple and will save all names and all URLs in this same table, piwik_log_action.
Schema updates for piwik_log_action
Current schema:
idaction INTEGER(10) UNSIGNED NOT NULL AUTO_INCREMENT, name VARCHAR(255) NOT NULL, type TINYINT UNSIGNED NULL, PRIMARY KEY(idaction), INDEX index_type_name (type, name(15))
Proposed schema:
idaction INTEGER(10) UNSIGNED NOT NULL AUTO_INCREMENT, hash UNSIGNED INT NOT NULL, name VARCHAR(255) NOT NULL, type TINYINT UNSIGNED NULL, PRIMARY KEY(idaction), INDEX index_type_name (type, hash)
Note: this is explained in the book "high performance mysql 2nd edition" page 104;
Basically instead of
SELECT WHERE name = '/path1/path2/path3'
you do
SELECT WHERE hash = CRC32('/path1/path2/path3') AND name = '/path1/path2/path3'
you need the AND name = '/path1/path2/path3' in case two urls have the same hash (which happens with 1% probability after 100k results, cf birthday paradox).
CRC32 doc is on http://dev.mysql.com/doc/refman/5.1/en/mathematical-functions.html#function_crc32.
The hash field should be set when INSERTing new page names/ urls in this table, using the mysql CRC32 function; we don't want to use the php crc at any time.
Currently Actions->getIdAction returns the idaction for the given name. Instead, the function would now set actionIdName and actionIdUrl in the object. For downloads and outlinks, actionIdName would be empty;
In the case of pages, we would like to select these two IDs in only one SQL select for efficiency. For example, it would look like:
SELECT idaction
FROM piwik_log_action
WHERE
(hash = CRC32('/path1/path2/path3') AND name = '/path1/path2/path3' AND type = $this->getActionType())
OR
(hash = CRC32('Welcome to URL') AND name = 'Welcome to URL' AND type = Piwik_Tracker_Action_Interface::TYPE_ACTION_NAME)
For downloads/outlinks there wouldn't be the part after the OR, as by default we only track URLs for downloads/outlinks;
Eg.
SELECT idaction
FROM piwik_log_action
WHERE (hash = CRC32('http://piwik.org/latest.zip') AND name = 'http://piwik.org/latest.zip' AND type = $this->getActionType()
Schema update for piwik_log_link_visit_action
`idlink_va` int(11) NOT NULL auto_increment, `idvisit` int(10) unsigned NOT NULL, `idaction` ---> renamed in idaction_url NEW idaction_name `idaction_ref` --> renamed in idaction_url_ref `time_spent_ref_action` int(10) unsigned NOT NULL, PRIMARY KEY (`idlink_va`), KEY `visit` (`idvisit`)
For pages, we would record both idaction_name and idaction_url fetched from piwik_log_action as specified above. For download and outlinks, we would only set idaction_url, and idaction_name would be null. Plugins could set this value if necessary
Action->record() will have to be updated accordingly to now record these two idaction_*;
Note: Action object already has a getActionName and getActionUrl method which matches nicely; we would maybe add a getActionNameId and getActionUrlId that would be set by the current getActionId (that could be rename loadActionNameAndUrl)?
Schema update for piwik_log_visit
visit_exit_idaction is renamed in visit_exit_idaction_url visit_entry_idaction is renamed in visit_entry_idaction_url
Several lines have to be updated in the code to reflect this rename but there are no logic changes.
Schema update for piwik_log_conversion
idaction is renamed in idaction_url
GoalManager->recordGoals() query must be updated accordingly.
As shown, all existing idaction reference the entry in piwik_log_action for the URL of the concerned pageview; the URL is now officially the main information about a pageview, and the "name" (eg. the html document title) is one more piece of information about each page view. However conversions, and visits entry/exit pages, only care about the pageview URL, mostly for simplicity.
Note: I would expect to have more small modifications not explained here, in the php code of Tracker/*, especially Tracker/Action.php
Implications on Archiving (plugins/Actions/Actions.php)
Changes in the schema should only affect archiving for plugins/Actions/Actions.php.
To do #530, the first query at line 124 would have to ran twice, once joining on idaction_url for processing reports by page title (existing query), and one more by joining on idaction_name to process the new report "top pages by title".
Other queries would have to be slightly updated to reflect the field's name change.
User Interface
To do #530 we would add a new submenu: Actions> Page titles to show best pages by title, similar to existing Actions>Pages. There would be a new API method that would return the top pages by title.
Migration
There are several required schema updates which may take minutes to execute on the piwik_log_* tables; we might want to update the message before the update to make clear that this will take a while and user should be patient.
The schema updates must be done in the CREATE TABLE statements in core/Piwik.php, in all queries querying these tables (mentionned above) and the update script can be written in core/Updates/X.php
Other
- The data generator in core/Tracker/Generator and misc/generateVisits.php will have to be updated to reflect the new DB URL structure and ensure that the generator generates all the title + URL data as expected.
- Also this new schema change will invalidate the TrackerSecondaryDb which will have to be slightly updated to reflect the new structure.
- Some of this code is covered by unit tests which will have to be updated (tests/core/Tracker/Action) with new logic
Conclusion
This is not a small change, but we are blocked on several fronts because of the current DB schema; implementing this change will make a lot of other tickets possible and add new features in Piwik in the near future. Also, it makes possible for plugins to custom name outlinks and downloads.
Any questions, feedback, ideas, please let me know: this is a proposal!

