Opened 2 years ago

Closed 2 years ago

Last modified 2 years ago

#2805 closed Bug (fixed)

Delete old logs should also delete unused urls from piwik_log_action

Reported by: matt Owned by:
Priority: critical Milestone: 1.12.x - Piwik 1.12.x
Component: Core Keywords:
Cc: Sensitive: no

Description

Currently, "Delete logs older than X Months" does not delete unused rows in the lookup table piwik_log_action

We could however remove all entries from this table, that is not currently referenced in piwik_log_link_visit_action.

we could run a JOIN query to delete unused rows from this table. This query might run extremely slowly on large tables, so it would be nice to test it on a good example.

suggestion directly from the high traffic piwik thread

Attachments (3)

2805.diff.tar.gz (3.8 KB) - added by capedfuzz 2 years ago.
Patch for this issue (uses triggers).
2805.diff.tar.2.gz (6.9 KB) - added by capedfuzz 2 years ago.
Patch for this issue (no triggers, uses another table).
2805.diff.tar.3.gz (6.2 KB) - added by capedfuzz 2 years ago.
Another patch (uses temp table, no event).

Download all attachments as: .zip

Change History (21)

comment:1 Changed 2 years ago by matt (mattab)

  • Milestone changed from 1.8 Piwik 1.8 to 1.7 Piwik 1.7

comment:2 Changed 2 years ago by matt (mattab)

  • Priority changed from normal to critical

It would very nice to get this done to ensure optima use of DB space (many users complaining of Piwik being too hungry for disk space)

comment:3 Changed 2 years ago by capedfuzz (diosmosis)

I've looked into fixing this issue, which unfortunately isn't as straightforward to fix as it seems. I've come up w/ a way to solve it, but first here are two solutions that won't work:

0) Selecting idactions from log_link_visit_action that aren't in log_action into a temporary table and deleting them. This won't work since the SELECT will take forever. Essentially it will require going through every row of log_action, then for each of those, doing a full table scan of log_link_visit_action. Which would take forever.

1) Selecting idactions that do exist into a temporary table and then deleting log_action entries that aren't in the temporary table. This works, but is still too slow. I did a test w/ 15 million log_link_visit_action rows that reference 5 million distinct idactions. It took ~28 minutes to insert into the temporary table and ~12 minutes to do the DELETE.

This is the solution I've come up with that will work, but may not be ideal. It involves,

0) Adding a 'ref_count' column to log_action.
1) Creating two triggers, one that increments this count in the appropriate rows when a log_link_visit_action row is inserted, and one that decrements it when one log_link_visit_action row is deleted.
2) Then deleting for ref_count = 0. Which can be done w/ a limit.

I haven't tested the speed of say deleting 100,000 rows in log_link_visit_action w/ the triggers in place, but I'm guessing it won't take too long.

The big issue w/ this solution (other than the fact that an extra column has to be added), is that when installing the new Piwik, Piwik will have to do a full table scan of log_link_visit_action and set the ref_count appropriately. I think its possible for this to take less than 30 mins for a table w/ 15 million rows, but again I haven't tested anything yet.

What do you think of my idea?

comment:4 follow-up: Changed 2 years ago by peterb (peterbo)

capedfuzz, this sounds good, because it is mandatory for a fast piwik instance to keep the actions as low as possible.

Perhaps I have an idea, that develops your solution further: Instead of adding a ref_count-column with multiple hooks (increase on use, decrease on delete, check if 0, then delete), wouldn't it be better to add a timestamp row "last_seen" or "last_used" where the timestamp of the last use of the given row was logged? Then you have only one hook: increase on use.

When you delete, you only have to look for timestamps that are smaller than the "delete logs"-intervall-end-time (or e.g. smaller than the first dataset timestamp in the log_visit table). So you would know immediately, which of the rows are orphans.

Does this make sense?

comment:5 in reply to: ↑ 4 Changed 2 years ago by capedfuzz (diosmosis)

Yes, that makes perfect sense, and solves every potential problem I can think of. I should be able to create a fix for this issue pretty soon. Thanks for the great idea!

Changed 2 years ago by capedfuzz (diosmosis)

Patch for this issue (uses triggers).

comment:6 Changed 2 years ago by capedfuzz (diosmosis)

I uploaded a patch for this issue. This patch uses a trigger, and triggers don't seem to be used elsewhere in Piwik, so I assume it should be looked over. Also, I'm not completely sure how adding new Update types is done. Let me know what you think.

comment:7 Changed 2 years ago by matt (mattab)

Sorry to join the party late... I was on a trip.

Thanks for your research on this issue, and comparing all possibilities!

Triggers are unfortunately not possible to be in core, for performance/ compatibility reasons. Also a "sysadmin" operation ideally should not require a new column in a busy table.

According to your tests, the temp table solution 15M pages views DB would take ~ 40 min running to time to delete old pages, this sounds good to me personnally. the task will run at the end of archiving often once a week, so it can take 1 hour if it wishes.

however priority should be set the lowest so it runs last.

Changed 2 years ago by capedfuzz (diosmosis)

Patch for this issue (no triggers, uses another table).

comment:8 Changed 2 years ago by capedfuzz (diosmosis)

Uploaded another patch for this issue. This patch uses the temporary table approach and won't result in any concurrency issues.

Some notes on this issue:

  • Since deleting log_actions can take a very long time, I made it an option for the log deletion feature. It can be disabled if its not that important to the user.
  • Inserting into another table and then deleting takes a long time and is also subject to some concurrency issues. I believe I've solved them, but the code might look a bit funny. I use an event to execute a delete asynchronously and lock on log tables to make sure there are no references to non-existant log_action rows.
  • The table locking locks on all the log tables (except log_action), but I've minimized the amount of time it locks on them. The length of the DELETE on log_action will be more of an issue.
  • Because an event is used, this feature can only be used when the MySQL event_scheduler is turned on. The option in the UI will be disabled if this is not the case. Also because an event is used, the "temporary" table that is created can't actually be TEMPORARY.
  • To make sure the code works, I've added a test case that simulates the dangling reference issue. It required adding a hook to LogDataPurger. The hook should only be used when testing.
  • The tests require the event_scheduler to be on, so we'll have to make sure its on in the jenkins server.
  • Finally, I underestimated how long it would take to do the insert. The 27 minute estimate above for inserting into the new table doesn't take into account the idaction references in other tables. It will likely take a lot longer on large Piwik instances.

Ok, so maybe those are a lot of notes :)

comment:9 follow-up: Changed 2 years ago by matt (mattab)

Code feedback

  • good work & research!
  • getMaxRowIdsInLogTables() could be more simple without union: simply run each query separately and store result in array directly (4queries not slower than 1 union)
  • lock/unlock should be on the DB object (similarly to optimize etc)
  • I'm afraid EVENTs are not available on many users...
    • if mysql event feature is disabled, instead of displaying error message, catch the exception and fallback to the slower version.
      • please remove the new message in en.php as we should hide the implementation in this case
  • suggest to remove all strings in en.php and do smart defaults instead:
    • The task can be executed weekly by default
    • I don't think it should be an option at all but enabled for all users, even if it's slow it should work OK once a week ?
    • the fact it's not included in the estimate is OK and not worth mentioning.
      • we are fixing a "bug" that most people think is working already. so, it is OK to not announce it in this case :)
    • the code should work when event scheduled is disabled by catching the exception if possible.
      • But I would be keen to remove this code all together and simply do DELETE FROM linking to the TEMP table. There would be very few data loss and should not cause large locking / slowing down visit tracking?
        • Could you please test to see how long it takes, without the event solution (just the DELETE) on a 10M rows log_action table for example ?

Looking forward to it... great backend improvement!

comment:10 in reply to: ↑ 9 ; follow-up: Changed 2 years ago by capedfuzz (diosmosis)

Replying to matt:

Code feedback

  • lock/unlock should be on the DB object (similarly to optimize etc)

Wouldn't it be better to lock on tables? This way, other users can still view reports and it might not block all Tracker actions if the storage engine uses row-level locking.

[snip]

  • But I would be keen to remove this code all together and simply do DELETE FROM linking to the TEMP table. There would be very few data loss and should not cause large locking / slowing down visit tracking?
    • Could you please test to see how long it takes, without the event solution (just the DELETE) on a 10M rows log_action table for example ?

I did some tests and here are the results:

Rows Deleted        Elapsed Time
10,000              3 min 40s
100,000             2 min 30s
1,000,000           8 min 16s

None of the rows that were deleted were adjacent to one another.

I think, due to the fact that it's deleting rows that aren't in the temp table, there's a point in which less rows to delete takes more time.

comment:11 in reply to: ↑ 10 Changed 2 years ago by matt (mattab)

Replying to capedfuzz:

Replying to matt:

Code feedback

  • lock/unlock should be on the DB object (similarly to optimize etc)

Wouldn't it be better to lock on tables? This way, other users can still view reports and it might not block all Tracker actions if the storage engine uses row-level locking.

I just meant the code ould be refactored in the Db object where you refactored optimize/repair queries already.

I did some tests and here are the results:

Rows Deleted        Elapsed Time
10,000              3 min 40s
100,000             2 min 30s
1,000,000           8 min 16s

These look good enough.

I would be OK also even remove the Event mechanism and simply issue the delete low priority query. I think it will be fine for users! but you can also leaeve the event code inside, if it works without it :)

comment:12 Changed 2 years ago by capedfuzz (diosmosis)

Attached another patch. Notes:

  • I removed the event related code & the UI modifications.
  • I added the task priority related changes also (and fixed a bug I introduced earlier).

This should be the last one :)

comment:13 Changed 2 years ago by capedfuzz (diosmosis)

Oh, forgot, one other thing: I added a utility var arg function prefixTables to Piwik_Common. I can always make it a private helper function if Piwik_Common shouldn't be modified. Let me know what you think.

Changed 2 years ago by capedfuzz (diosmosis)

Another patch (uses temp table, no event).

comment:14 Changed 2 years ago by capedfuzz (diosmosis)

  • Resolution set to fixed
  • Status changed from new to closed

(In [6221]) Fixes #2805, made LogDataPurger delete unused actions from log_actions.

Notes:

  • Added more priority levels for scheduled tasks. Priorities can be between 0 and 12 now.
  • Modified priority of purge data tasks to have LOW priority and optimize tables task to have LOWEST priority.
  • Fixed bug introduced in #53 w/ regard to task priorities. lowest priority tasks were not being run.
  • Added Piwik_LockTables & Piwik_UnlockTables functions to PluginsFunctions/Sql.php.
  • Added prefixTables(...) utility function to Piwik_Common.

comment:15 Changed 2 years ago by capedfuzz (diosmosis)

(In [6222]) Refs #2805, fixes build?

comment:16 follow-up: Changed 2 years ago by matt (mattab)

Good commit & coding!

Thinking aloud: deleting 1,000,000 takes 8 min 16s. It means that during these 8 minutes new visits/actions cannot be recorded. I think it's OK since it will be extremely rare to delete so many idactions, and will most likely only happen the first time the task runs for large Piwik DBs.

QA:

  • Did you test the following use case: After locks were granted, while the delete is proceeding, kill the apache process for this visit.
    • we have to make sure we recover from this use case since it will likely happen.
    • there will be a TMP table left over.
      • You could run a DROP TABLE IF EXISTS x prior to create table.
    • I think the locks will still be in place
      • Can you please test this use case? We have to make sure locks are released when the process crashes or this could result in lost statistics! and would be hard for users to recover. Can you think of a good solution to this (if tests show locks are not released) ?

comment:17 in reply to: ↑ 16 Changed 2 years ago by capedfuzz (diosmosis)

Replying to matt:

QA:

  • Did you test the following use case: After locks were granted, while the delete is proceeding, kill the apache process for this visit.
    • we have to make sure we recover from this use case since it will likely happen.
    • there will be a TMP table left over.
      • You could run a DROP TABLE IF EXISTS x prior to create table.

Temporary tables are deleted when the connection that created them is closed. I did a test and checked that there were no extra table files after the apache process was killed, and there weren't any.

  • I think the locks will still be in place
    • Can you please test this use case? We have to make sure locks are released when the process crashes or this could result in lost statistics! and would be hard for users to recover. Can you think of a good solution to this (if tests show locks are not released) ?

Table locks are released as well (see http://dev.mysql.com/doc/refman/5.0/en/lock-tables.html, after the bullet points in 'Rules for lock release').

That said, I can't attest as to what will happen if the MySQL instance is stopped during a DELETE, but I'm guessing since MySQL is a mature piece of software, it will be able to fail gracefully.

comment:18 Changed 2 years ago by matt (mattab)

Good to hear you tested to kill the process and it worked!

During the delete at worst I gues sit could need a REPAIR. In the future we should maybe consider doing a daily REPAIR just in case, if users start reporting issues but it hasn't been a big problem so far.

Good work, ticket closed ;)

Note: See TracTickets for help on using tickets.