Ticket #766 (new Bug)
Archiving Memory exhausted error - Archiving memory leaks
| Reported by: | konus | Owned by: | |
|---|---|---|---|
| Priority: | critical | Milestone: | 1.8 Piwik 1.8 |
| Component: | Core | Keywords: | |
| Cc: | Sensitive: | no |
Description (last modified by matt) (diff)
Users have memory exhausted errors when running Piwik archiving. There are two distinct use cases:
- Users with a low to medium traffic websites exhausting the fixed-sized memory of their PHP web host (eg. 64MB)
- Users with very high traffic websites (100k or +1M pages per day) exhausting even a very high PHP memory limit (eg. 1G is not enough).
The error looks like:
Fatal error: Allowed memory size of 67108864 bytes exhausted (tried to allocate 73 bytes) in /var/www/web5/html/piwik/core/DataTable.php on line 939
-> We highly suggest users experiencing issues to upgrade to PHP 5.3 which has some improvements around memory management.
Ideas to solve this issue (less time consuming first)?
- 1) Monthly archiving should use the smaller amount of weeks and days inside the month Currently, archiving a month in Piwik will sum the reports for all 30 days and sum all reports together. This is causing most of the php memory issues which start failing when summing all datatables for Pages URLs for a month for example. Instead, we should reuse the algorithm used to process date ranges, which select the minimum number of periods to use to process the enclosing period. So, a month would typically use 3 or 4 full weeks + the remaining ~5-10 days. This would result in up to 70% less datatables to select and sum!
ie a month archiving would select 4 weeks archiving + 3 day archiving, or 3 weeks archiving + 10 days archiving, so either 7 or 13 archives, instead of 30...
- 2) Fix remaining Memory leaks
- This error is most likely caused by archiving tasks not freeing memory after the task finished runnning. We did some improvements in the past around memory management in Piwik (PHP doesn't free memory automatically in all cases) - see ticket #374 for more info. There is more investigation work to do obviously.
- 3) Process aggregation in MySQL Currently, to process "Top Page URLs" report, we select ALL URLs, then truncate to 500 URLs (see datatable_archiving_maximum_rows_actions in global.ini.php)
Loading for example 50,000 URLs in memory in PHP results in obvious memory crashes. The solution here would be to process the aggregate of the top 500 URLs (including the last "Others" which aggregated all other "truncated" rows).
- Proposal: run the Aggregation in Mysql!
Example query that will select the sum of visits per hour, then truncate in Mysql to the first 15 rows, aggregating all rows from 16-23 in the label "Others". Imagine doing the same thing for Page URLs and only returning the top 1000 pages instead of all pages!
#This query uses the "counter" to truncate the rows after X rows. In real life, we would truncate after X=500 or 5000 SELECT SUM(count), case when counter = 15 then "Others" else hour end AS label FROM ( #This query adds the "counter" to each row after it was sorted by visits desc SELECT count, hour, case when @counter = 15 then 15 else @counter:=@counter+1 end AS counter FROM ( #This query selects count for each hour of the day SELECT count(*) as count, HOUR(visit_last_action_time) as hour FROM `piwik_log_visit` WHERE 1 GROUP BY hour ORDER BY count DESC ) T1, (SELECT @counter:=0) T2 ) T3 GROUP BY counter ORDER BY counter ASC
Here I only sum the number of visits, but we could automatically rewrite all aggregate queries in ArchiveProcessing/Day.php using this mechanism?
Output:
SUM(count) label 113 5 108 19 94 2 93 20 90 21 87 22 80 4 79 9 77 8 67 6 65 1 65 3 61 11 60 10 467 Others
Obviously truncating the "Hourly" report is a bad example, but replace it with Page URLs to get the picture :) This will save a lot of bandwith and CPU processing, especially for websites that have more than 500 unique Page URLs per day or external keywords, etc.
