New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance: don't store md5 as strings in the log tables #2007
Comments
Developing a 64-bit, collision resistant hash is really out-of-scope. That said, since the avalanche effect means a small change in the content often results in a significantly different hash, you might be able to simply truncate the md5 hash to 64 bits. (You can probably run a query on the database to see how many collisions result, if any.) I'm currently using a sha1 hash in piwik.js because the code is shorter than for md5. If you don't want to use the full value on the server-side, I believe you can truncate it. |
So we have a collision rate of .0008% or 0.08% (on a 4GB piwik_log_visit setup) I'd say it's very much a good enough number plus at the moment we only do uniques per month, which gives lower collision rates of 0.01%:
so yes I'd agree with the suggestion of keeping first 8 chars of a md5 hash. how much bigger is md5 in JS VS sha1? |
I'm not sure how we can convert a 8 byte string into a bigint to feed mysql... maybe we have to use CONV() ? http://dev.mysql.com/doc/refman/5.0/en/mathematical-functions.html#function_conv I'm not sure if we can do it in php itself, since it can't handle 64 bits math operation by default, but maybe i'm missing something |
Why not use a BINARY (or VARBINARY) column? Given that the hash value has no sequence and there's no math, then there's no reason to treat it as an int. (plus, no type conversion or MySQL-isms would be required.) |
I think it might be easier to work with bigint VS binary? funny, i was researching this and stumbled upon an interesting report: http://bugs.mysql.com/bug.php?id=44478 ;) |
Not really. As my "invalid" bug report shows, there are implied signed-ness in operations even when dealing with unsigned bigint. Using a BINARY column and avoiding MySQL-specific functions in the queries will make it easier to support non-MySQL databases in the future. |
(In [3729]) Fixes #2007
|
Anthon your concerns are valid, there are 2 functions that are mysql specific in Piwik_Tracker now, we can abstract these when need be. I already wrote the code when you suggested to use binary field and wasn't keen to go through it again, knowing that we don't support other than mysql at this time. As a result of this change, as well as other performance change in trunk, I note a 33% performance improvement in Archiving time between 1.1 and 1.2 with the same dataset. |
This really should be BINARY and use php's built-in pack/unpack to convert the 16 hexits to/from an 8 byte binary string. This wouldn't require any future abstraction. |
Re-opening to investigate the spurious (but recurring) build failures.
|
Sorry... last commit message was incomplete (trying to edit while holding sleeping toddler). In [3732], refs #2007 - use BINARY columns in database and in-memory to avoid MySQL-isms in the queries (and special cases); convert to/from hex for cookie; fix typo in ALTER statement |
(In [3741]) Refs #2007 Style & debug changes Maybe the bug is caused by a failed BINARY comparison of config_id (see bottom of http://dev.mysql.com/doc/refman/5.0/en/binary-varbinary.html ) |
Closing for now... if the build starts to randomly failing again, we will reopen |
Replying to matt:
The failures were occuring before the BINARY change. I don't think my commit fixed whatever was causing it. |
Correct, and thanks for putting the binary change in!! it seems the failed build happened in Build #1007 but I can't see why from the output: http://qa.piwik.org:8080/hudson/job/Piwik/1007/consoleFull |
I looked in the artifacts .zip; it failed in the auto-update. I'm guessing the problem is because we disable JavaScript before clicking on the 'update now' link, but interrupting HtmlUnit's JavaScript thread while it's manipulating the DOM. (Disabling JavaScript and/or sleep()'ing are unreliable, so we're going to get spurious failures like these. If it becomes a more frequent problem, I'm going to switch to Selenium.) |
MD5 are 128bit hashes commonly represented as 32 char strings. In the log_visit and log_link_visit_action tables, we store these hashs as string which is unfortunate since it is using a lot of space for nothing and makes inserting / querying less efficient. We are not doing cryptography so don't need something as complex as md5. We are just trying to avoid collisions, most of the time.
we should shorten the visitor ID to a much smaller value, at most unsigned BIGINT, and saves a ton of disk space (index space + data), as well as increase performance.
unsigned BIGINT goes from 0 to 18446744073709551615 - a long shot before we get collisions.
To summarize
Please comment if you have any idea on how to make this work well, thx
The text was updated successfully, but these errors were encountered: