New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Delete old logs should also delete unused urls from piwik_log_action #2805
Comments
It would very nice to get this done to ensure optima use of DB space (many users complaining of Piwik being too hungry for disk space) |
I've looked into fixing this issue, which unfortunately isn't as straightforward to fix as it seems. I've come up w/ a way to solve it, but first here are two solutions that won't work:
This is the solution I've come up with that will work, but may not be ideal. It involves,
I haven't tested the speed of say deleting 100,000 rows in log_link_visit_action w/ the triggers in place, but I'm guessing it won't take too long. The big issue w/ this solution (other than the fact that an extra column has to be added), is that when installing the new Piwik, Piwik will have to do a full table scan of log_link_visit_action and set the ref_count appropriately. I think its possible for this to take less than 30 mins for a table w/ 15 million rows, but again I haven't tested anything yet. What do you think of my idea? |
capedfuzz, this sounds good, because it is mandatory for a fast piwik instance to keep the actions as low as possible. Perhaps I have an idea, that develops your solution further: Instead of adding a ref_count-column with multiple hooks (increase on use, decrease on delete, check if 0, then delete), wouldn't it be better to add a timestamp row "last_seen" or "last_used" where the timestamp of the last use of the given row was logged? Then you have only one hook: increase on use. When you delete, you only have to look for timestamps that are smaller than the "delete logs"-intervall-end-time (or e.g. smaller than the first dataset timestamp in the log_visit table). So you would know immediately, which of the rows are orphans. Does this make sense? |
Yes, that makes perfect sense, and solves every potential problem I can think of. I should be able to create a fix for this issue pretty soon. Thanks for the great idea! |
Attachment: Patch for this issue (uses triggers). |
I uploaded a patch for this issue. This patch uses a trigger, and triggers don't seem to be used elsewhere in Piwik, so I assume it should be looked over. Also, I'm not completely sure how adding new Update types is done. Let me know what you think. |
Sorry to join the party late... I was on a trip. Thanks for your research on this issue, and comparing all possibilities! Triggers are unfortunately not possible to be in core, for performance/ compatibility reasons. Also a "sysadmin" operation ideally should not require a new column in a busy table. According to your tests, the temp table solution 15M pages views DB would take ~ 40 min running to time to delete old pages, this sounds good to me personnally. the task will run at the end of archiving often once a week, so it can take 1 hour if it wishes. |
Attachment: Patch for this issue (no triggers, uses another table). |
Uploaded another patch for this issue. This patch uses the temporary table approach and won't result in any concurrency issues. Some notes on this issue:
Ok, so maybe those are a lot of notes :) |
Code feedback
Looking forward to it... great backend improvement! |
Replying to matt:
Wouldn't it be better to lock on tables? This way, other users can still view reports and it might not block all Tracker actions if the storage engine uses row-level locking.
I did some tests and here are the results:
None of the rows that were deleted were adjacent to one another. I think, due to the fact that it's deleting rows that aren't in the temp table, there's a point in which less rows to delete takes more time. |
Replying to capedfuzz:
I just meant the code ould be refactored in the Db object where you refactored optimize/repair queries already.
These look good enough. I would be OK also even remove the Event mechanism and simply issue the delete low priority query. I think it will be fine for users! but you can also leaeve the event code inside, if it works without it :) |
Attached another patch. Notes:
This should be the last one :) |
Oh, forgot, one other thing: I added a utility var arg function prefixTables to Piwik_Common. I can always make it a private helper function if Piwik_Common shouldn't be modified. Let me know what you think. |
Attachment: Another patch (uses temp table, no event). |
(In [6221]) Fixes #2805, made LogDataPurger delete unused actions from log_actions. Notes:
|
(In [6222]) Refs #2805, fixes build? |
Good commit & coding! Thinking aloud: deleting 1,000,000 takes 8 min 16s. It means that during these 8 minutes new visits/actions cannot be recorded. I think it's OK since it will be extremely rare to delete so many idactions, and will most likely only happen the first time the task runs for large Piwik DBs. QA:
|
Replying to matt:
Temporary tables are deleted when the connection that created them is closed. I did a test and checked that there were no extra table files after the apache process was killed, and there weren't any.
Table locks are released as well (see http://dev.mysql.com/doc/refman/5.0/en/lock-tables.html, after the bullet points in 'Rules for lock release'). That said, I can't attest as to what will happen if the MySQL instance is stopped during a DELETE, but I'm guessing since MySQL is a mature piece of software, it will be able to fail gracefully. |
Good to hear you tested to kill the process and it worked! During the delete at worst I gues sit could need a REPAIR. In the future we should maybe consider doing a daily REPAIR just in case, if users start reporting issues but it hasn't been a big problem so far. Good work, ticket closed ;) |
Currently, "Delete logs older than X Months" does not delete unused rows in the lookup table piwik_log_action
We could however remove all entries from this table, that is not currently referenced in piwik_log_link_visit_action.
we could run a JOIN query to delete unused rows from this table. This query might run extremely slowly on large tables, so it would be nice to test it on a good example.
suggestion directly from the high traffic piwik thread
The text was updated successfully, but these errors were encountered: