Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#2007 closed Task (fixed)

Performance: don't store md5 as strings in the log tables

Reported by: matt Owned by:
Priority: normal Milestone: Piwik 1.2
Component: Core Keywords:
Cc: Sensitive: no

Description

MD5 are 128bit hashes commonly represented as 32 char strings. In the log_visit and log_link_visit_action tables, we store these hashs as string which is unfortunate since it is using a lot of space for nothing and makes inserting / querying less efficient. We are not doing cryptography so don't need something as complex as md5. We are just trying to avoid collisions, most of the time.

we should shorten the visitor ID to a much smaller value, at most unsigned BIGINT, and saves a ton of disk space (index space + data), as well as increase performance.

unsigned BIGINT goes from 0 to 18446744073709551615 - a long shot before we get collisions.

To summarize

  • Using only 8 bytes for these values, We would save 48bytes per visit and 24bytes per page view.
  • I'm not sure what would be the best hash function for 64bits uuid. In any case let's be careful with 64 bits manipulation in php
  • in the cookie, the uuid should be displayed in ascii.
  • I understand we will need the hash function to appear in both the JS (with 1st party cookie improvement) and in php (in case a user doesn't support/delete cookies)

Please comment if you have any idea on how to make this work well, thx

Change History (17)

comment:1 Changed 3 years ago by vipsoft (robocoder)

Developing a 64-bit, collision resistant hash is really out-of-scope.

That said, since the avalanche effect means a small change in the content often results in a significantly different hash, you might be able to simply truncate the md5 hash to 64 bits. (You can probably run a query on the database to see how many collisions result, if any.)

I'm currently using a sha1 hash in piwik.js because the code is shorter than for md5. If you don't want to use the full value on the server-side, I believe you can truncate it.

comment:2 Changed 3 years ago by matt (mattab)

mysql> select count(distinct(SUBSTRING(visitor_idcookie FROM 1 FOR 8))) as truncated, count(distinct visitor_idcookie) as cookie
    -> FRom piwik_log_visit
    -> ;
+-----------+---------+
| truncated | cookie  |
+-----------+---------+
|   7571288 | 7577884 |
+-----------+---------+

So we have a collision rate of .0008% or 0.08% (on a 4GB piwik_log_visit setup)

I'd say it's very much a good enough number

plus at the moment we only do uniques per month, which gives lower collision rates of 0.01%:

mysql> select *, 1-truncated/cookie from (select MONTH(visit_server_date),count(distinct(SUBSTRING(visitor_idcookie FROM 1 FOR 8))) as truncated, count(distinct visitor_idcookie) as cookie FRom piwik_log_visit group by MONTH(visit_server_date)) t1;
+--------------------------+-----------+---------+--------------------+
| MONTH(visit_server_date) | truncated | cookie  | 1-truncated/cookie |
+--------------------------+-----------+---------+--------------------+
|                        1 |    234263 |  234272 |             0.0000 |
|                        2 |    300337 |  300345 |             0.0000 |
|                        3 |    445592 |  445611 |             0.0000 |
|                        4 |    627290 |  627336 |             0.0001 |
|                        5 |    742757 |  742822 |             0.0001 |
|                        6 |    745563 |  745635 |             0.0001 |
|                        7 |    827379 |  827452 |             0.0001 |
|                        8 |    835817 |  835887 |             0.0001 |
|                        9 |    869653 |  869719 |             0.0001 |
|                       10 |    889817 |  889917 |             0.0001 |
|                       11 |   1010989 | 1011109 |             0.0001 |
|                       12 |    891054 |  891149 |             0.0001 |
+--------------------------+-----------+---------+--------------------+
12 rows in set (6 min 31.77 sec)

so yes I'd agree with the suggestion of keeping first 8 chars of a md5 hash. how much bigger is md5 in JS VS sha1?

comment:3 Changed 3 years ago by matt (mattab)

I'm not sure how we can convert a 8 byte string into a bigint to feed mysql... maybe we have to use CONV() ? http://dev.mysql.com/doc/refman/5.0/en/mathematical-functions.html#function_conv

I'm not sure if we can do it in php itself, since it can't handle 64 bits math operation by default, but maybe i'm missing something

comment:4 Changed 3 years ago by vipsoft (robocoder)

Why not use a BINARY (or VARBINARY) column?

Given that the hash value has no sequence and there's no math, then there's no reason to treat it as an int. (plus, no type conversion or MySQL-isms would be required.)

comment:5 Changed 3 years ago by matt (mattab)

I think it might be easier to work with bigint VS binary? funny, i was researching this and stumbled upon an interesting report: http://bugs.mysql.com/bug.php?id=44478 ;)

comment:6 Changed 3 years ago by vipsoft (robocoder)

Not really. As my "invalid" bug report shows, there are implied signed-ness in operations even when dealing with unsigned bigint.

Using a BINARY column and avoiding MySQL-specific functions in the queries will make it easier to support non-MySQL databases in the future.

comment:7 Changed 3 years ago by matt (mattab)

  • Resolution set to fixed
  • Status changed from new to closed

(In [3729]) Fixes #2007

  • stores idvisitor, config_id as BIGINT rather than char(32)
  • updates Tracking and Archiving code

comment:8 Changed 3 years ago by matt (mattab)

  • Milestone changed from 1.x - Piwik 1.x to 1.2 Piwik 1.2

Anthon your concerns are valid, there are 2 functions that are mysql specific in Piwik_Tracker now, we can abstract these when need be. I already wrote the code when you suggested to use binary field and wasn't keen to go through it again, knowing that we don't support other than mysql at this time.

As a result of this change, as well as other performance change in trunk, I note a 33% performance improvement in Archiving time between 1.1 and 1.2 with the same dataset.

comment:9 Changed 3 years ago by vipsoft (robocoder)

This really should be BINARY and use php's built-in pack/unpack to convert the 16 hexits to/from an 8 byte binary string. This wouldn't require any future abstraction.

comment:10 Changed 3 years ago by vipsoft (robocoder)

  • Resolution fixed deleted
  • Status changed from closed to reopened

Re-opening to investigate the spurious (but recurring) build failures.

983, 985, 987

Fail: /var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/Main.test.php -> Test_Piwik_Integration_Main -> test_TwoVisitors_twoWebsites_differentDays -> <br/> Differences with expected in: /var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/processed/test_TwoVisitors_twoWebsites_differentDays__Actions.getPageTitles_day.xml at [/var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/Integration.php line 379]
string(83) "ERROR FOR Actions.getPageTitles_day.xml -- FETCHED RESPONSE, then EXPECTED RESPONSE"
string(4240) " first page view 2 1 2 360 2 2 3 360 1 1 1 180 50% 50% first visitor 1 1 0 1 0 0% 100% second page view 1 1 1 0 1 1 second visitor 2 2 360 1 2 360 0 1 180 0% 50% two days later 2 2 360 1 2 360 0 1 second page view 1 1 1 0 1 1 a new visit 1 1 1 360 1 1 2 360 0 Website 2 page view 1 1 1 0 1 1 1 0 1 1 1 0 100% 100% "
string(4240) " first page view 2 2 2 360 2 2 3 360 1 1 1 180 50% 50% first visitor 1 1 0 1 0 0% 100% second page view 1 1 1 0 1 1 second visitor 2 2 360 1 2 360 0 1 180 0% 50% two days later 2 2 360 1 2 360 0 1 second page view 1 1 1 0 1 1 a new visit 1 1 1 360 1 1 2 360 0 Website 2 page view 1 1 1 0 1 1 1 0 1 1 1 0 100% 100% "
Fail: /var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/Main.test.php -> Test_Piwik_Integration_Main -> test_TwoVisitors_twoWebsites_differentDays -> <br/> Differences with expected in: /var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/processed/test_TwoVisitors_twoWebsites_differentDays__Actions.getPageTitles_week.xml at [/var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/Integration.php line 379]
string(84) "ERROR FOR Actions.getPageTitles_week.xml -- FETCHED RESPONSE, then EXPECTED RESPONSE"
string(4676) " first page view 2 2 360 2 3 360 1 1 1 2 1 180 50% 50% first visitor 1 1 0 1 0 0% 100% second page view 1 1 0 1 1 1 second visitor 2 2 360 1 2 360 0 1 180 0% 50% two days later 2 2 360 1 2 360 0 1 second page view 1 1 0 1 1 1 a new visit 1 1 360 1 2 360 0 1 1 Website 2 page view 1 1 0 1 1 0 1 1 1 1 1 0 100% 100% "
string(4676) " first page view 2 2 360 2 3 360 1 1 2 2 1 180 50% 50% first visitor 1 1 0 1 0 0% 100% second page view 1 1 0 1 1 1 second visitor 2 2 360 1 2 360 0 1 180 0% 50% two days later 2 2 360 1 2 360 0 1 second page view 1 1 0 1 1 1 a new visit 1 1 360 1 2 360 0 1 1 Website 2 page view 1 1 0 1 1 0 1 1 1 1 1 0 100% 100% "
Fail: /var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/Main.test.php -> Test_Piwik_Integration_Main -> test_TwoVisitors_twoWebsites_differentDays -> <br/> Differences with expected in: /var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/processed/test_TwoVisitors_twoWebsites_differentDays__Actions.getPageTitles_month.xml at [/var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/Integration.php line 379]
string(85) "ERROR FOR Actions.getPageTitles_month.xml -- FETCHED RESPONSE, then EXPECTED RESPONSE"
string(4428) " first page view 2 2 360 2 3 360 1 1 1 2 1 180 50% 50% second visitor 2 2 360 1 2 360 0 1 180 0% 50% two days later 2 2 360 1 2 360 0 1 second page view 1 1 0 1 1 1 a new visit 1 1 360 1 2 360 0 1 1 first visitor 1 1 0 1 0 0% 100% second page view 1 1 0 1 1 1 Website 2 page view 1 1 0 1 1 0 1 1 1 1 1 0 100% 100% "
string(4428) " first page view 2 2 360 2 3 360 1 1 2 2 1 180 50% 50% second visitor 2 2 360 1 2 360 0 1 180 0% 50% two days later 2 2 360 1 2 360 0 1 second page view 1 1 0 1 1 1 a new visit 1 1 360 1 2 360 0 1 1 first visitor 1 1 0 1 0 0% 100% second page view 1 1 0 1 1 1 Website 2 page view 1 1 0 1 1 0 1 1 1 1 1 0 100% 100% "
Fail: /var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/Main.test.php -> Test_Piwik_Integration_Main -> test_TwoVisitors_twoWebsites_differentDays -> <br/> Differences with expected in: /var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/processed/test_TwoVisitors_twoWebsites_differentDays__Actions.getPageTitles_year.xml at [/var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/Integration.php line 379]
string(84) "ERROR FOR Actions.getPageTitles_year.xml -- FETCHED RESPONSE, then EXPECTED RESPONSE"
string(4386) " first page view 2 2 360 2 3 360 1 1 1 2 1 180 50% 50% second visitor 2 2 360 1 2 360 0 1 180 0% 50% two days later 2 2 360 1 2 360 0 1 second page view 1 1 0 1 1 1 a new visit 1 1 360 1 2 360 0 1 1 first visitor 1 1 0 1 0 0% 100% second page view 1 1 0 1 1 1 Website 2 page view 1 1 0 1 1 0 1 1 1 1 1 0 100% 100% "
string(4386) " first page view 2 2 360 2 3 360 1 1 2 2 1 180 50% 50% second visitor 2 2 360 1 2 360 0 1 180 0% 50% two days later 2 2 360 1 2 360 0 1 second page view 1 1 0 1 1 1 a new visit 1 1 360 1 2 360 0 1 1 first visitor 1 1 0 1 0 0% 100% second page view 1 1 0 1 1 1 Website 2 page view 1 1 0 1 1 0 1 1 1 1 1 0 100% 100% " 

comment:11 Changed 3 years ago by vipsoft (robocoder)

Sorry... last commit message was incomplete (trying to edit while holding sleeping toddler).

In [3732], refs #2007 - use BINARY columns in database and in-memory to avoid MySQL-isms in the queries (and special cases); convert to/from hex for cookie; fix typo in ALTER statement

comment:12 follow-up: Changed 3 years ago by matt (mattab)

(In [3741]) Refs #2007 Style & debug changes

Maybe the bug is caused by a failed BINARY comparison of config_id (see bottom of http://dev.mysql.com/doc/refman/5.0/en/binary-varbinary.html )
investigating...

comment:13 Changed 3 years ago by matt (mattab)

  • Resolution set to fixed
  • Status changed from reopened to closed

Closing for now... if the build starts to randomly failing again, we will reopen

comment:14 in reply to: ↑ 12 Changed 3 years ago by vipsoft (robocoder)

Replying to matt:

Maybe the bug is caused by a failed BINARY comparison of config_id (see bottom of http://dev.mysql.com/doc/refman/5.0/en/binary-varbinary.html )

The failures were occuring before the BINARY change. I don't think my commit fixed whatever was causing it.

comment:15 Changed 3 years ago by matt (mattab)

Correct, and thanks for putting the binary change in!!

it seems the failed build happened in Build #1007 but I can't see why from the output: http://qa.piwik.org:8080/hudson/job/Piwik/1007/consoleFull

comment:16 follow-up: Changed 3 years ago by vipsoft (robocoder)

I looked in the artifacts .zip; it failed in the auto-update. I'm guessing the problem is because we disable JavaScript before clicking on the 'update now' link, but interrupting HtmlUnit's JavaScript thread while it's manipulating the DOM. (Disabling JavaScript and/or sleep()'ing are unreliable, so we're going to get spurious failures like these. If it becomes a more frequent problem, I'm going to switch to Selenium.)

comment:17 in reply to: ↑ 16 Changed 3 years ago by sple007

Note: See TracTickets for help on using tickets.