
Archiving is not bounded for the reports "Pages by title" and "Pages by URL", causing inefficient / broken archiving #1075

Closed
mattab opened this issue Dec 13, 2009 · 2 comments
Labels
Bug: For errors / faults / flaws / inconsistencies etc.
duplicate: For issues that already existed in our issue tracker and were reported previously.
Major: Indicates the severity or impact or benefit of an issue is much higher than normal but not critical.

Comments


mattab commented Dec 13, 2009

Reported by DriverDan

Summary

The SQL queries used during archiving (https://github.com/piwik/piwik/blob/master/plugins/Actions/Actions.php#L135) are not bounded, i.e. they select all the rows, and the archiving then loops over the whole result set and processes a PHP array containing all the data. If the default limit values haven't been changed in the config file (see the explanation in the FAQ), the archiving truncates the number of rows in the PHP array. The problem is that this truncation happens too late: millions of rows could have been processed, of which we would only keep the top 5000 pages.

Unfortunately, this is by design: Piwik keeps 500 rows for the first-level table, and 100 rows for each of the child tables (i.e. the tables shown when you click on a row to expand it). Reference: https://github.com/piwik/piwik/blob/master/config/global.ini.php#L138

For a given day and a given website, up to 500 * (100 * 100 * ...) rows could therefore be recorded in the database, i.e. millions of rows (for example, a three-level report with full tables is already 500 * 100 * 100 = 5,000,000 rows). This could be fine, but on websites where this actually happens, it is far too much data to be handled by a single server. The initial SQL query that tries to return these millions of rows will have to swap and block on IO for minutes or hours to deliver that much data, temporarily LOCKing the tables involved.

The solution would be to introduce a new "hard limit" on the number of rows returned by the SQL, hardcoded in the archiving queries. Because the queries sort by count descending, we could simply add a LIMIT 0, 50000, for example. This would ensure a bounded data set and predictable behavior. This sounds like a rather simple fix which should just work...
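For illustration, a minimal sketch of what such a bounded query could look like, using the first slow query reported below as a base (the 50000 value is only an example, and this is not the exact query Piwik builds):

SELECT name,
       type,
       count(distinct t1.idvisit) as nb_visits,
       count(distinct visitor_idcookie) as nb_uniq_visitors,
       count(*) as nb_hits
FROM (piwik_log_visit as t1
        LEFT JOIN piwik_log_link_visit_action as t2 USING (idvisit))
        LEFT JOIN piwik_log_action as t3 ON (t2.idaction_url = t3.idaction)
WHERE visit_server_date = '2009-12-12'
  AND idsite = '7'
GROUP BY t3.idaction
ORDER BY nb_hits DESC
LIMIT 0, 50000;   -- hard limit on the rows returned (example value)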

Initial Report

Everything was working great on 0.4.5. After updating to 0.5.1, the archive script runs for hours and hours. I managed to get it to archive all daily stats yesterday, but that took at least an hour to run. Last night's cron was still running at 11 AM when I checked it (9 hours). This is on a quad-core Xeon server. Here are the row counts in the main tables:

piwik_log_action: 6267
piwik_log_link_visit_action: 111901
piwik_log_visit: 45021

This wouldn't be a big deal if it didn't LOCK TABLES and require extensive use of temporary tables. It completely locks out any use of Piwik, causing issues both for visitors of any site that uses Piwik and for anyone trying to view stats. It also puts a high load on the server, slowing general operation.

Here's a sample of some of the slow queries from my logs:

# Time: 091212 11:03:02
# User@Host: piwik[piwik] @  [10.2.1.239]
# Query_time: 60  Lock_time: 0  Rows_sent: 6  Rows_examined: 44481023
SELECT name,
       type,
       count(distinct t1.idvisit) as nb_visits,
       count(distinct visitor_idcookie) as nb_uniq_visitors,
       count(*) as nb_hits
FROM (piwik_log_visit as t1
        LEFT JOIN piwik_log_link_visit_action as t2 USING (idvisit))
        LEFT JOIN piwik_log_action as t3 ON (t2.idaction_url = t3.idaction)
WHERE visit_server_date = '2009-12-12'
  AND idsite = '7'
GROUP BY t3.idaction
ORDER BY nb_hits DESC;

# Time: 091212 11:08:39
# User@Host: piwik[piwik] @  [10.2.1.239]
# Query_time: 36  Lock_time: 0  Rows_sent: 3  Rows_examined: 26729656
SELECT name,
       type,
       count(distinct t1.idvisit) as nb_visits,
       count(distinct visitor_idcookie) as nb_uniq_visitors,
       count(*) as nb_hits
FROM (piwik_log_visit as t1
        LEFT JOIN piwik_log_link_visit_action as t2 USING (idvisit))
        LEFT JOIN piwik_log_action as t3 ON (t2.idaction_name = t3.idaction)
WHERE visit_server_date = '2009-12-12'
  AND idsite = '39'
GROUP BY t3.idaction
ORDER BY nb_hits DESC;

mattab commented Dec 15, 2009

I don't think the LIMIT 0, 50000 (for example) will actually solve the issue. MySQL will still have to select the full result set, sort it, and only then limit it. Maybe it will be necessary to (see the sketch after this list):

  • denormalize the log_link_visit_action table by adding idsite and visit_server_date fields
  • rewrite the archiving query to select the top 50,000 rows from this table alone
  • then select the piwik_log_action.name for each of these 50,000 piwik_log_link_visit_action.idaction_name (or idaction_url) values
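
A rough sketch of what these two queries might look like, assuming the denormalized idsite and visit_server_date columns have been added to piwik_log_link_visit_action (an illustration of the idea, not an actual patch):

-- step 1: top 50,000 actions, computed from the denormalized link table alone
SELECT idaction_url, count(*) as nb_hits
FROM piwik_log_link_visit_action
WHERE visit_server_date = '2009-12-12'
  AND idsite = 7
GROUP BY idaction_url
ORDER BY nb_hits DESC
LIMIT 0, 50000;

-- step 2: fetch name and type only for the idaction values returned by step 1
SELECT idaction, name, type
FROM piwik_log_action
WHERE idaction IN ( /* idaction_url values from step 1 */ );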


mattab commented Jan 11, 2011

dup of #2002

This issue was closed.