Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance: don't store md5 as strings in the log tables #2007

Closed
mattab opened this issue Jan 12, 2011 · 16 comments
Closed

Performance: don't store md5 as strings in the log tables #2007

mattab opened this issue Jan 12, 2011 · 16 comments
Labels
Task Indicates an issue is neither a feature nor a bug and it's purely a "technical" change.
Milestone

Comments

@mattab
Copy link
Member

mattab commented Jan 12, 2011

MD5 are 128bit hashes commonly represented as 32 char strings. In the log_visit and log_link_visit_action tables, we store these hashs as string which is unfortunate since it is using a lot of space for nothing and makes inserting / querying less efficient. We are not doing cryptography so don't need something as complex as md5. We are just trying to avoid collisions, most of the time.

we should shorten the visitor ID to a much smaller value, at most unsigned BIGINT, and saves a ton of disk space (index space + data), as well as increase performance.

unsigned BIGINT goes from 0 to 18446744073709551615 - a long shot before we get collisions.

To summarize

  • Using only 8 bytes for these values, We would save 48bytes per visit and 24bytes per page view.
  • I'm not sure what would be the best hash function for 64bits uuid. In any case let's be careful with 64 bits manipulation in php
  • in the cookie, the uuid should be displayed in ascii.
  • I understand we will need the hash function to appear in both the JS (with 1st party cookie improvement) and in php (in case a user doesn't support/delete cookies)

Please comment if you have any idea on how to make this work well, thx

@robocoder
Copy link
Contributor

Developing a 64-bit, collision resistant hash is really out-of-scope.

That said, since the avalanche effect means a small change in the content often results in a significantly different hash, you might be able to simply truncate the md5 hash to 64 bits. (You can probably run a query on the database to see how many collisions result, if any.)

I'm currently using a sha1 hash in piwik.js because the code is shorter than for md5. If you don't want to use the full value on the server-side, I believe you can truncate it.

@mattab
Copy link
Member Author

mattab commented Jan 12, 2011

mysql> select count(distinct(SUBSTRING(visitor_idcookie FROM 1 FOR 8))) as truncated, count(distinct visitor_idcookie) as cookie
    -> FRom piwik_log_visit
    -> ;
+-----------+---------+
| truncated | cookie  |
+-----------+---------+
|   7571288 | 7577884 |
+-----------+---------+

So we have a collision rate of .0008% or 0.08% (on a 4GB piwik_log_visit setup)

I'd say it's very much a good enough number

plus at the moment we only do uniques per month, which gives lower collision rates of 0.01%:

mysql> select *, 1-truncated/cookie from (select MONTH(visit_server_date),count(distinct(SUBSTRING(visitor_idcookie FROM 1 FOR 8))) as truncated, count(distinct visitor_idcookie) as cookie FRom piwik_log_visit group by MONTH(visit_server_date)) t1;
+--------------------------+-----------+---------+--------------------+
| MONTH(visit_server_date) | truncated | cookie  | 1-truncated/cookie |
+--------------------------+-----------+---------+--------------------+
|                        1 |    234263 |  234272 |             0.0000 |
|                        2 |    300337 |  300345 |             0.0000 |
|                        3 |    445592 |  445611 |             0.0000 |
|                        4 |    627290 |  627336 |             0.0001 |
|                        5 |    742757 |  742822 |             0.0001 |
|                        6 |    745563 |  745635 |             0.0001 |
|                        7 |    827379 |  827452 |             0.0001 |
|                        8 |    835817 |  835887 |             0.0001 |
|                        9 |    869653 |  869719 |             0.0001 |
|                       10 |    889817 |  889917 |             0.0001 |
|                       11 |   1010989 | 1011109 |             0.0001 |
|                       12 |    891054 |  891149 |             0.0001 |
+--------------------------+-----------+---------+--------------------+
12 rows in set (6 min 31.77 sec)

so yes I'd agree with the suggestion of keeping first 8 chars of a md5 hash. how much bigger is md5 in JS VS sha1?

@mattab
Copy link
Member Author

mattab commented Jan 12, 2011

I'm not sure how we can convert a 8 byte string into a bigint to feed mysql... maybe we have to use CONV() ? http://dev.mysql.com/doc/refman/5.0/en/mathematical-functions.html#function_conv

I'm not sure if we can do it in php itself, since it can't handle 64 bits math operation by default, but maybe i'm missing something

@robocoder
Copy link
Contributor

Why not use a BINARY (or VARBINARY) column?

Given that the hash value has no sequence and there's no math, then there's no reason to treat it as an int. (plus, no type conversion or MySQL-isms would be required.)

@mattab
Copy link
Member Author

mattab commented Jan 13, 2011

I think it might be easier to work with bigint VS binary? funny, i was researching this and stumbled upon an interesting report: http://bugs.mysql.com/bug.php?id=44478 ;)

@robocoder
Copy link
Contributor

Not really. As my "invalid" bug report shows, there are implied signed-ness in operations even when dealing with unsigned bigint.

Using a BINARY column and avoiding MySQL-specific functions in the queries will make it easier to support non-MySQL databases in the future.

@mattab
Copy link
Member Author

mattab commented Jan 14, 2011

(In [3729]) Fixes #2007

  • stores idvisitor, config_id as BIGINT rather than char(32)
  • updates Tracking and Archiving code

@mattab
Copy link
Member Author

mattab commented Jan 14, 2011

Anthon your concerns are valid, there are 2 functions that are mysql specific in Piwik_Tracker now, we can abstract these when need be. I already wrote the code when you suggested to use binary field and wasn't keen to go through it again, knowing that we don't support other than mysql at this time.

As a result of this change, as well as other performance change in trunk, I note a 33% performance improvement in Archiving time between 1.1 and 1.2 with the same dataset.

@robocoder
Copy link
Contributor

This really should be BINARY and use php's built-in pack/unpack to convert the 16 hexits to/from an 8 byte binary string. This wouldn't require any future abstraction.

@robocoder
Copy link
Contributor

Re-opening to investigate the spurious (but recurring) build failures.

983, 985, 987

Fail: /var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/Main.test.php -> Test_Piwik_Integration_Main -> test_TwoVisitors_twoWebsites_differentDays -> <br/> Differences with expected in: /var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/processed/test_TwoVisitors_twoWebsites_differentDays__Actions.getPageTitles_day.xml at [/var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/Integration.php line 379]
string(83) "ERROR FOR Actions.getPageTitles_day.xml -- FETCHED RESPONSE, then EXPECTED RESPONSE"
string(4240) " first page view 2 1 2 360 2 2 3 360 1 1 1 180 50% 50% first visitor 1 1 0 1 0 0% 100% second page view 1 1 1 0 1 1 second visitor 2 2 360 1 2 360 0 1 180 0% 50% two days later 2 2 360 1 2 360 0 1 second page view 1 1 1 0 1 1 a new visit 1 1 1 360 1 1 2 360 0 Website 2 page view 1 1 1 0 1 1 1 0 1 1 1 0 100% 100% "
string(4240) " first page view 2 2 2 360 2 2 3 360 1 1 1 180 50% 50% first visitor 1 1 0 1 0 0% 100% second page view 1 1 1 0 1 1 second visitor 2 2 360 1 2 360 0 1 180 0% 50% two days later 2 2 360 1 2 360 0 1 second page view 1 1 1 0 1 1 a new visit 1 1 1 360 1 1 2 360 0 Website 2 page view 1 1 1 0 1 1 1 0 1 1 1 0 100% 100% "
Fail: /var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/Main.test.php -> Test_Piwik_Integration_Main -> test_TwoVisitors_twoWebsites_differentDays -> <br/> Differences with expected in: /var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/processed/test_TwoVisitors_twoWebsites_differentDays__Actions.getPageTitles_week.xml at [/var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/Integration.php line 379]
string(84) "ERROR FOR Actions.getPageTitles_week.xml -- FETCHED RESPONSE, then EXPECTED RESPONSE"
string(4676) " first page view 2 2 360 2 3 360 1 1 1 2 1 180 50% 50% first visitor 1 1 0 1 0 0% 100% second page view 1 1 0 1 1 1 second visitor 2 2 360 1 2 360 0 1 180 0% 50% two days later 2 2 360 1 2 360 0 1 second page view 1 1 0 1 1 1 a new visit 1 1 360 1 2 360 0 1 1 Website 2 page view 1 1 0 1 1 0 1 1 1 1 1 0 100% 100% "
string(4676) " first page view 2 2 360 2 3 360 1 1 2 2 1 180 50% 50% first visitor 1 1 0 1 0 0% 100% second page view 1 1 0 1 1 1 second visitor 2 2 360 1 2 360 0 1 180 0% 50% two days later 2 2 360 1 2 360 0 1 second page view 1 1 0 1 1 1 a new visit 1 1 360 1 2 360 0 1 1 Website 2 page view 1 1 0 1 1 0 1 1 1 1 1 0 100% 100% "
Fail: /var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/Main.test.php -> Test_Piwik_Integration_Main -> test_TwoVisitors_twoWebsites_differentDays -> <br/> Differences with expected in: /var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/processed/test_TwoVisitors_twoWebsites_differentDays__Actions.getPageTitles_month.xml at [/var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/Integration.php line 379]
string(85) "ERROR FOR Actions.getPageTitles_month.xml -- FETCHED RESPONSE, then EXPECTED RESPONSE"
string(4428) " first page view 2 2 360 2 3 360 1 1 1 2 1 180 50% 50% second visitor 2 2 360 1 2 360 0 1 180 0% 50% two days later 2 2 360 1 2 360 0 1 second page view 1 1 0 1 1 1 a new visit 1 1 360 1 2 360 0 1 1 first visitor 1 1 0 1 0 0% 100% second page view 1 1 0 1 1 1 Website 2 page view 1 1 0 1 1 0 1 1 1 1 1 0 100% 100% "
string(4428) " first page view 2 2 360 2 3 360 1 1 2 2 1 180 50% 50% second visitor 2 2 360 1 2 360 0 1 180 0% 50% two days later 2 2 360 1 2 360 0 1 second page view 1 1 0 1 1 1 a new visit 1 1 360 1 2 360 0 1 1 first visitor 1 1 0 1 0 0% 100% second page view 1 1 0 1 1 1 Website 2 page view 1 1 0 1 1 0 1 1 1 1 1 0 100% 100% "
Fail: /var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/Main.test.php -> Test_Piwik_Integration_Main -> test_TwoVisitors_twoWebsites_differentDays -> <br/> Differences with expected in: /var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/processed/test_TwoVisitors_twoWebsites_differentDays__Actions.getPageTitles_year.xml at [/var/lib/jetty/webapps/root/hudson.private/jobs/Piwik/workspace/build/tests/integration/Integration.php line 379]
string(84) "ERROR FOR Actions.getPageTitles_year.xml -- FETCHED RESPONSE, then EXPECTED RESPONSE"
string(4386) " first page view 2 2 360 2 3 360 1 1 1 2 1 180 50% 50% second visitor 2 2 360 1 2 360 0 1 180 0% 50% two days later 2 2 360 1 2 360 0 1 second page view 1 1 0 1 1 1 a new visit 1 1 360 1 2 360 0 1 1 first visitor 1 1 0 1 0 0% 100% second page view 1 1 0 1 1 1 Website 2 page view 1 1 0 1 1 0 1 1 1 1 1 0 100% 100% "
string(4386) " first page view 2 2 360 2 3 360 1 1 2 2 1 180 50% 50% second visitor 2 2 360 1 2 360 0 1 180 0% 50% two days later 2 2 360 1 2 360 0 1 second page view 1 1 0 1 1 1 a new visit 1 1 360 1 2 360 0 1 1 first visitor 1 1 0 1 0 0% 100% second page view 1 1 0 1 1 1 Website 2 page view 1 1 0 1 1 0 1 1 1 1 1 0 100% 100% " 

@robocoder
Copy link
Contributor

Sorry... last commit message was incomplete (trying to edit while holding sleeping toddler).

In [3732], refs #2007 - use BINARY columns in database and in-memory to avoid MySQL-isms in the queries (and special cases); convert to/from hex for cookie; fix typo in ALTER statement

@mattab
Copy link
Member Author

mattab commented Jan 15, 2011

(In [3741]) Refs #2007 Style & debug changes

Maybe the bug is caused by a failed BINARY comparison of config_id (see bottom of http://dev.mysql.com/doc/refman/5.0/en/binary-varbinary.html )
investigating...

@mattab
Copy link
Member Author

mattab commented Jan 15, 2011

Closing for now... if the build starts to randomly failing again, we will reopen

@robocoder
Copy link
Contributor

Replying to matt:

Maybe the bug is caused by a failed BINARY comparison of config_id (see bottom of http://dev.mysql.com/doc/refman/5.0/en/binary-varbinary.html )

The failures were occuring before the BINARY change. I don't think my commit fixed whatever was causing it.

@mattab
Copy link
Member Author

mattab commented Jan 15, 2011

Correct, and thanks for putting the binary change in!!

it seems the failed build happened in Build #1007 but I can't see why from the output: http://qa.piwik.org:8080/hudson/job/Piwik/1007/consoleFull

@robocoder
Copy link
Contributor

I looked in the artifacts .zip; it failed in the auto-update. I'm guessing the problem is because we disable JavaScript before clicking on the 'update now' link, but interrupting HtmlUnit's JavaScript thread while it's manipulating the DOM. (Disabling JavaScript and/or sleep()'ing are unreliable, so we're going to get spurious failures like these. If it becomes a more frequent problem, I'm going to switch to Selenium.)

@mattab mattab added this to the Piwik 1.2 milestone Jul 8, 2014
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Task Indicates an issue is neither a feature nor a bug and it's purely a "technical" change.
Projects
None yet
Development

No branches or pull requests

2 participants