Opened 6 years ago

Closed 2 years ago

Last modified 21 months ago

#53 closed New feature (fixed)

Lightweight Piwik mode: Setting to purge aggregated reports older than N days from the database

Reported by: matt Owned by:
Priority: critical Milestone: 1.12.x - Piwik 1.12.x
Component: Core Keywords:
Cc: Sensitive: no

Description (last modified by matt)

The main reason that people stop using Piwik on small websites is because the DB grows out of control. Often users have 25Mb or 50m for the whole Mysql database. After installing the CMS, it is common for Piwik to have only 20M of space. We should try to make it easy to run Piwik with basic history and keep DB space below 20M.

Proposal "Lightweight Piwik" feature

  • Display the current DB size. Reuse API code from: #3004
  • Have a new setting "General settings" to select number of months to keep data for
    • By default, keep all historical data
    • In the dropdown selecting the number of months, put estimate of DB size that will be freed
  • When setting enabled,
    • Delete old archive_blob tables that are older than N months
    • delete from all archive_numeric all rows except a critical metrics (visits, pageviews, ecommerce/goal metrics).
    • also delete all segment data, numeric and blob
    • Propose a dropdown to select number of months
      • For each month choice "delete older than [3,6,9,12,18,2 years,4 years,custom]". Once entry is selected, refresh the space free "delete older than [18 months] (N Gb will be deleted, Total Piwik DB size will become MGb)"
    • When the feature is enabled, execute it as a scheduled task (safer for huge DBs). But on page reload, we could also display a link to let the user execute the purge code NOW (when in a hurry to get the DB size under control).

Note: it would be very nice to also get rid of #2805 - just one SQL query away after all...
PS: and this one too #3003

V2 of this feature (to be put as "feature request" ticket)

  • User could choose exactly which reports to purge.
  • Have a "less agressive" purge mode that would keep more data, but still delete a lot of useless old reports nobody would care about
    • The list would be pre-selected with our estimate of the most useful reports (eg. Keywords, Page URLs/Page Titles, Websites, Campaigns, All goal reports...) and delete all others. Call our selection "Recommended historical reports to keep"
    • The list would handle numeric and blob.
    • "Delete data for segmented reports" or "Keep reports for segmented data"

Attachments (3)

proposed_mods.png (89.2 KB) - added by capedfuzz 2 years ago.
Proposed modifications
53.diff.tar.gz (12.3 KB) - added by capedfuzz 2 years ago.
Patch for this issue. (1st)
53.diff.tar.2.gz (27.2 KB) - added by capedfuzz 2 years ago.
Patch for this issue. (2nd)

Download all attachments as: .zip

Change History (45)

comment:1 Changed 6 years ago by matt (mattab)

  • Milestone set to Future features

comment:2 Changed 5 years ago by matt (mattab)

  • Priority changed from major to critical

comment:3 Changed 4 years ago by matt (mattab)

  • Description modified (diff)
  • Sensitive unset
  • Summary changed from Plugin to purge the database for old statistics to Plugin to purge old statistics from the database

comment:4 Changed 3 years ago by matt (mattab)

  • Description modified (diff)

comment:5 Changed 3 years ago by peterb (peterbo)

(In [4856]) PrivacyManager / Delete old statistics from database; Refs #2233, #53, #5

comment:6 Changed 3 years ago by peterb (peterbo)

(In [4868]) Refs #2233, #53, #5

  • tweaking / optimizing / commenting

comment:7 Changed 2 years ago by matt (mattab)

  • Description modified (diff)
  • Milestone changed from Feature requests to 1.7.x - Piwik 1.7.2
  • Summary changed from Plugin to purge old statistics from the database to Light Piwik mode: Setting to purge aggregated reports older than N days from the database

Updating description for proposal first version. I think this would be extremely useful to a LOT of Piwik users! This would fix a major complaint with Piwik: DB size.

comment:8 Changed 2 years ago by matt (mattab)

  • Component changed from New Plugin to Core

comment:9 Changed 2 years ago by matt (mattab)

  • Summary changed from Light Piwik mode: Setting to purge aggregated reports older than N days from the database to Lightweight Piwik mode: Setting to purge aggregated reports older than N days from the database

comment:10 Changed 2 years ago by matt (mattab)

  • Description modified (diff)

Improving feature description (usability)

Changed 2 years ago by capedfuzz (diosmosis)

Proposed modifications

comment:11 Changed 2 years ago by capedfuzz (diosmosis)

I'd like to implement this by modifying the 'Delete old visitor logs from database' section. I've attached an image laying out what the changes would be. Some notes:

  • The estimated DB size would be updated via AJAX every time a field is modified.
  • I think this entire section should be moved to the General Settings tab. It seems to have little to do w/ privacy.

What do you think?

comment:12 Changed 2 years ago by matt (mattab)

Along with a few other key feature we're currently working on, I am confident this one in particular will have a HUGE impact. Because, it is in the top 3 critical feedback, that Piwik uses too much DB size. Many shared host will be relieved with this. Great that you're working on this amazing feature!!

Review:

  • OK to merge both settings as they make sense indeed.
  • Please link the "Delete log feature" from Privacy page to the General Settings page (similarly to how it is linked now from General -> Privacy)
  • "Keep basic metrics" should be checked by default
    • Also it should be hidden until "Regularly delete old report" is set to Yes
    • Maybe renamed to "(recommended) Keep basic numeric metrics (Visits, Page views, Bounce rate, Goal conversions, Ecommerce conversions, Number of products bought, etc.)."
  • "Estimated DB size...' could be 'Database size'
    • Then the message onthe right could say "Your Piwik DB currently uses X Gb of space. After old data will be deleted, the database should be approximately Y Gb, freeing up approximately Z Mb of data. "
    • How will you process this estimate?
      • piwik_log_ tables: the DB size will be affected only when the setting "starts" to be enabled (when it is just now been checked). Because, after a few days that logs are regularly deleted, the log_* table size will be approximately constant (assuming website traffic is constant).


When the setting is enabled, we don't know how many rows are "older" than N days. I'm not sure how we can estimate how much data will be saved. I guess we could run a SQL query to get an estimate, but not sure if it's useful just for this help message.


Should we include piwik_log_* approx DB space in the estimate?

  • piwik_archive_*: it's easier to guess how much size will be saved, since the DBStats plugin API returns the DB size by table , so we can roughly sum the size of all _blob_ tables that will be deleted to get an estimate? If "Keep basic metrics" is NOT checked, we can also add the archive_numeric tables.
  • Note: The number of months should be enforced to the minimum 3 months I think (for safety)
  • One feature that I forgot in the ticket that is quite important:
    • "Keep all data for" (x) daily reports, (x) weekly reports, (x) monthly reports, (x) yearly reports (x) Custom date ranges.
      • By default, only "Month" and "Year" would be checked. I'm assuming these are the most useful reports for "long term" vision and analytics.
      • This would allow a user to only delete daily archives for example, but still keep the week/month level visibility which can be very important to keep. It's quite easy to enforce on the DB since there is a column 'period'
      • Why this feature of keeping weeks/months/etc is important: even if users wish to delete data, we should help them make the right decision. When using Piwik, historical data has incredible value (especially with upcoming #534). We should therefore make it easy to free up significant DB space, while not making Piwik a useless tool for long term analytics.

comment:13 Changed 2 years ago by matt (mattab)

UX Request:

  • it would also be great if, when "delete old logs" or "delete old reports" is enabled, the UI asks for confirmation in a Popup "Are you sure you want to enable [Delete old logs/Delete old reports]?" (the one being enabled right now, or both if they are both being enabled).
    • there is a style for standard popover in Piwik (try for example changing password it will trigger popover)

Changed 2 years ago by capedfuzz (diosmosis)

Patch for this issue. (1st)

comment:14 Changed 2 years ago by capedfuzz (diosmosis)

Just attached a patch for this issue.

Some notes:

  • I didn't move the settings over to General Settings. That proved to be more trouble than it was worth (though its possible, so let me know if you think it'd be better).
  • Haven't put the 'Purge DB now' link in. I wanted to ask, should this be disabled if the estimated data savings is beyond a certain amount? Or, perhaps make sure the request stops after a certain amount of time?
  • When rows are deleted from a table, the estimated DB size is calculated using the average row size.
  • Unit tests I wrote take about ~100s to run. At first, though, they took ~450s. When looking at how to lower the amount of time it took to run them, I noticed two things were causing them to run slowly: using the tracker and that when archiving, EVERY plugin's archiveDay/Period are called. Might be helpful in speeding up the tests...

Let me know what you think.

comment:15 Changed 2 years ago by capedfuzz (diosmosis)

Forgot to mention, the patch is for rev 6123.

comment:16 follow-up: Changed 2 years ago by matt (mattab)

Great work cappedfuzz! Nice to read...
Here is the Code & UI Review:

  • en.php: when changing the text, please also update the translation key: this will force translators to translate again (otherwise they will not see that the string was updated)
  • 'CoreHome_ThereIsNoDataForThisReport' should not changed
  • 'CoreHome_ThereIsNoDataForThisReport' "data may have been purged" message, should only be displayed when the feature is enabled. By default, reports should still display only the simple current message.
    • Also it should only display when the selected date is affected by the Report deletion. For example, only when looking at reports older than 6 months will the message say "No data may have been purged"
  • the Click here with html link is replaced by Click %s here %s which are then replaced by HTML in the templates
  • for code style, in getReportPeriodsToKeep, it could be an array then a loop testing all indexes and assigning
  • In the JS modifications, Please add one comment above the main new pieces of code, it helps separate the concerns a bit
  • Defaut delete_reports_older_than should be at least 12 months
  • would it be possible to have Delete logs and Delete Reports as 2 separate scheduled tasks? this would help lower the time spent in each task
  • Otherwise code looks very good!!

'Purge DB now' link

I would say the link should always be visible when the feature is enabled. Maybe open in a new window target=_blank as to make sure user can still use Piwik?

comment:17 in reply to: ↑ 16 ; follow-up: Changed 2 years ago by capedfuzz (diosmosis)

Some questions:

Replying to matt:

Great work cappedfuzz! Nice to read...
Here is the Code & UI Review:

  • en.php: when changing the text, please also update the translation key: this will force translators to translate again (otherwise they will not see that the string was updated)
  • 'CoreHome_ThereIsNoDataForThisReport' should not changed
  • 'CoreHome_ThereIsNoDataForThisReport' "data may have been purged" message, should only be displayed when the feature is enabled. By default, reports should still display only the simple current message.
    • Also it should only display when the selected date is affected by the Report deletion. For example, only when looking at reports older than 6 months will the message say "No data may have been purged"

There's an issue with this since the purge settings can be changed. So it could be set to reports older than 6 months, a purge could run, and it could be reset to reports older than, say, 8 months. In which case, some reports won't have the message when they should. I think, though, that this can be solved w/ the following:

If the table for the report does not exist, then it has been purged.
If the table does exist, but has a special entry (where idsite && idarchive == -1 or some other special number), then rows have been deleted and data has been purged.

This should be more accurate and keep purging effective (as opposed to not deleting rows).

  • the Click here with html link is replaced by Click %s here %s which are then replaced by HTML in the templates
  • for code style, in getReportPeriodsToKeep, it could be an array then a loop testing all indexes and assigning
  • In the JS modifications, Please add one comment above the main new pieces of code, it helps separate the concerns a bit
  • Defaut delete_reports_older_than should be at least 12 months
  • would it be possible to have Delete logs and Delete Reports as 2 separate scheduled tasks? this would help lower the time spent in each task
  • Otherwise code looks very good!!

'Purge DB now' link

I would say the link should always be visible when the feature is enabled. Maybe open in a new window target=_blank as to make sure user can still use Piwik?

Actually, I think over-thought. An AJAX request would work.

comment:18 in reply to: ↑ 17 Changed 2 years ago by matt (mattab)

There's an issue with this since the purge settings can be changed. So it could be set to reports older than 6 months, a purge could run, and it could be reset to reports older than, say, 8 months. In which case, some reports won't have the message when they should.

I think it's an acceptable solution. I propose that the message is displayed, only for dates older than the current deletion threshold. It is just an estimate anyway. The good thing is not to show the message to normal users, or for recent dates :)

Actually, I think over-thought. An AJAX request would work.

Yes it would, if you can please use the standard "Loading data..." helper during the request (or maybe even a better message "Deleting old reports, please wait..."

Changed 2 years ago by capedfuzz (diosmosis)

Patch for this issue. (2nd)

comment:19 Changed 2 years ago by capedfuzz (diosmosis)

Attached another patch for this issue. Let me know what you think of it.

comment:20 Changed 2 years ago by matt (mattab)

Good work Benaka!! :)

Code review:

  • DEBUG TIP: Not sure if you knew, but you can force the scheduled tasks to trigger by setting PIWIK_TRACKER_DEBUG_FORCE_SCHEDULED_TASKS to true in /piwik.php
  • The tests are hard to understand because numbers not detailed, do you mind commenting the meaning of the numbers in tests:
    +		// perform checks on prediction
    +		$janPeriodCount = 5 + 4 + 1 + 1;
    +		$febPeriodCount = 6 + 4 + 1;
    
    ----
    
    +	const JAN_PERIOD_COUNT = 37; // 31 + 4 + 1 + 1;
    +	const FEB_PERIOD_COUNT = 34; // 29 + 4 + 1;
    
    ---
    
    +		$expectedPrediction = array(
    +			Piwik_Common::prefixTable('log_conversion') => 6,
    +			Piwik_Common::prefixTable('log_link_visit_action') => 6,
    +			Piwik_Common::prefixTable('log_visit') => 3,
    +			Piwik_Common::prefixTable('log_conversion_item') => 3,
    +			Piwik_Common::prefixTable('archive_numeric_2012_01') => -1,
    +			Piwik_Common::prefixTable('archive_blob_2012_01') => -1
    +		);
    


Idem in test_purgeData_deleteReportsKeepBasicMetrics and other tests

  • By default, which metrics are purged and which aren't? let's only keep visit metrics + goal. functions such as getGoalMetricsToPurge return empty array ?
  • The new code introduces quite a bit of new SQL queries. Can you please refactor each SQL query in its own private method? This would help separate concerns in the code.
    • The SQL used for data prediction and the one used for data deletion are very similar. Could the SQL be refactored? Not critical to do, might not be worth until later (but at least facotring out all sql queries in own method will help - no need to comment these private methods btw)
  • + $sql = "SELECT COUNT(idarchive) FROM $table WHERE period NOT IN (" Here this should be count(*) because we should count all rows, not the number of unique archive IDs
  • Currently, all settings are stored in the config file. we would like to save them in the DB.
    • I think these settings should be stored in the Database using the Piwik_Option mechanism. Can you please do the modification? This will code simpler. You could have then a simple Piwik_GetOption for each value, store it in an array in a class attribute... ?
    • Storing dynamic settings in the config file is bad practise for Piwik and a mistake we've been doing earlier, now trying to stop :)
    • If you could also move the current Log purge settings from the config file to the Option table it would be great.
      • To keep backward compatibility, we will still "read" the log purge config settings in case user defined it there, but will always fallback on the DB value.
    • Storing settings in DB ensures Piwik works well in a distributed balanced piwik cluster (this is an actual use case by many piwik users)
  • $result[$table] = -1; Here the -1 should be a CONST attribute on the class (to avoid magic numbers)
  • Translation files: you don't need to edit other files, it will be done automatically before the release (only the english file needs to be kept clean). Actually it's true that the French file has to be updated so that tests can pass.

If you can apply these last feedback then commit it would be great! Cool feature to have.. We will have to do a bit of marketing around it.

comment:21 Changed 2 years ago by capedfuzz (diosmosis)

  • Resolution set to fixed
  • Status changed from new to closed

(In [6174]) Fixes #53. Augmented the log data deletion feature. Added the ability to purge old reports and metrics.

Other changes:

  • Added the following plugin functions: Piwik_DropTables, Piwik_OptimizeTables, Piwik_DeleteAllRows. Also refactored existing code to use them.
  • Modified graph, tag cloud & datatable templates/views to show a different message if there's no data for a report and if its possible that report was purged.
  • Refactored DbStats API, added getAllTablesStatus method that doesn't modify the SHOW TABLE STATUS result.
  • Deletelogs config options are now stored in the DB.
  • Added task priority support to the TaskScheduler.

comment:22 Changed 2 years ago by capedfuzz (diosmosis)

(In [6175]) Refs #53, UI & security tweaks along w/ one bug fix ('delete_logs_max_rows_per_query' was not set in getPurgeSettingsFromRequest).

comment:23 Changed 2 years ago by capedfuzz (diosmosis)

(In [6176]) Refs #53, show 'data was purged' message for every user type, not just super user.

comment:24 Changed 2 years ago by capedfuzz (diosmosis)

Commit [6177] deals with this ticket. (forgot to add the Refs)

comment:25 Changed 2 years ago by matt (mattab)

Very nice! it looks perfect now :)

comment:26 Changed 2 years ago by matt (mattab)

(In [6178]) Refs #3011 #53 Small updates to english strings & ui

comment:27 Changed 2 years ago by matt (mattab)

(In [6179]) Refs #53 labels html

comment:28 Changed 2 years ago by EZdesign (BeezyT)

I just noticed the metrcs picker is not working on the visitors overview anymore. I guess this problem has been introduced in [6174].

The problem is that Piwik_Date::factory doesn't recognize a date.

This is the backtrace (from the XHR response):

#0 /Users/timo/Sites/piwik/svn-trunk-git/core/ViewDataTable.php(1205): Piwik_Date::factory('2009-02-01,2011...')
#1 /Users/timo/Sites/piwik/svn-trunk-git/core/ViewDataTable/GenerateGraphHTML.php(133): Piwik_ViewDataTable->hasReportBeenPurged()
#2 /Users/timo/Sites/piwik/svn-trunk-git/core/ViewDataTable/GenerateGraphHTML.php(97): Piwik_ViewDataTable_GenerateGraphHTML->buildView()
#3 /Users/timo/Sites/piwik/svn-trunk-git/core/Controller.php(241): Piwik_ViewDataTable_GenerateGraphHTML->main()
#4 /Users/timo/Sites/piwik/svn-trunk-git/plugins/VisitsSummary/Controller.php(78): Piwik_Controller->getLastUnitGraphAcrossPlugins('VisitsSummary', 'getEvolutionGra...', Array, Array, 'Dies ist eine ?...')
#5 [internal function]: Piwik_VisitsSummary_Controller->getEvolutionGraph()
#6 /Users/timo/Sites/piwik/svn-trunk-git/core/FrontController.php(138): call_user_func_array(Array, Array)
#7 /Users/timo/Sites/piwik/svn-trunk-git/index.php(53): Piwik_FrontController->dispatch()

comment:29 Changed 2 years ago by capedfuzz (diosmosis)

(In [6195]) Refs #53, fix issue of failure in purged report message logic when a set of dates is requested.

comment:30 Changed 2 years ago by capedfuzz (diosmosis)

@EZDesign Just committed a fix for the problem, the metrics picker works for me now. Let me know if you still encounter a problem.

comment:31 Changed 2 years ago by EZdesign (BeezyT)

Thanks for the quick fix! It works now.

comment:32 Changed 2 years ago by matt (mattab)

(In [6197]) Refs #53 reusing a function

comment:33 Changed 2 years ago by matt (mattab)

There is a similar bug with date=last7 reported in #3107

comment:34 Changed 2 years ago by capedfuzz (diosmosis)

(In [6217]) Refs #53, #3107, fix issue of failure in purged report message logic when special date range (lastN, etc.) is used.

comment:35 Changed 2 years ago by EZdesign (BeezyT)

Thanks for the good work, capedfuzz!

I have a request for this feature. I hope you can do it...

Piwik creates archives for every custom date range you pick. It would be cool if there would be a setting to only delete those archives - keeping days,weeks,months,years and also segments intact. There could be a similar setting do delete segments and keep days,weeks,months,years but I would only need the option to delete period=range archives.

Is that possible?

comment:36 Changed 2 years ago by capedfuzz (diosmosis)

@EZDesign Yes, it's possible, and likely easy to do. I should be able to get to it soon.

comment:37 Changed 2 years ago by capedfuzz (diosmosis)

(In [6218]) Refs #53, Added ability to purge range reports and ability to keep segment data for saved reports.

Other changes:

  • Added Piwik_FetchCol function to PluginFunctions/Sql.php.
  • Modified purged report message to display message if report is over custom date range.

comment:38 Changed 2 years ago by capedfuzz (diosmosis)

(In [6219]) Refs #53, fixes build?

comment:39 Changed 2 years ago by capedfuzz (diosmosis)

(In [6220]) Refs #53, removed Piwik_FetchCol function as mysqli seems to have a problem w/ fetchCol.

comment:40 Changed 2 years ago by matt (mattab)

kaboom, so good!

comment:41 Changed 21 months ago by capedfuzz (diosmosis)

(In [6557]) Refs #53, show DB purged message after successful purge.

comment:42 Changed 21 months ago by capedfuzz (diosmosis)

(In [6559]) Refs #53, internationalize db purged message.

Note: See TracTickets for help on using tickets.