
Archiving is not bounded for the reports "Pages by title" and "Pages by URL", causing inefficient / broken archiving #1075

Closed
mattab opened this issue Dec 13, 2009 · 2 comments
Labels
Bug: For errors / faults / flaws / inconsistencies etc.
duplicate: For issues that already existed in our issue tracker and were reported previously.
Major: Indicates the severity or impact or benefit of an issue is much higher than normal but not critical.

Comments


mattab commented Dec 13, 2009

Reported by DriverDan

Summary

The SQL queries used during archiving (https://github.com/piwik/piwik/blob/master/plugins/Actions/Actions.php#L135) are not bounded, i.e. they select all the rows, and the archiving then loops over the whole result set and processes a PHP array containing all the data. If the default limit values haven't been changed in the config file (see the explanation in the FAQ), the archiving truncates the number of rows in the PHP array. The problem is that this truncation happens too late: millions of rows could have been processed, of which we would only keep the top 5000 pages.

Unfortunately, this is by design: Piwik keeps 500 rows for the first-level table, and 100 rows for each of the child tables (i.e. the tables shown when you click on a row to expand it). Reference: https://github.com/piwik/piwik/blob/master/config/global.ini.php#L138

For a given day and a given website, up to 500 * (100 * 100 * ...) rows could therefore be recorded in the database, i.e. millions of rows (for example, a three-level report with full tables is already 500 * 100 * 100 = 5,000,000 rows). This could be fine, but on websites where this actually happens, it is far too much data to be handled by a single server. The initial SQL query that tries to return these millions of rows will have to swap and block on IO for minutes or hours to deliver that much data, temporarily LOCKing the tables involved.

The solution would be to introduce a new "hard limit" on the number of rows returned by the SQL, hardcoded in the archiving queries. Because the queries sort by count descending, we could simply add a LIMIT 0, 50000, for example. This would ensure a bounded data set and predictable behavior. This sounds like a rather simple fix which should just work...
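For illustration, a minimal sketch of what such a bounded query could look like, using the first slow query reported below as a base (the 50000 value is only an example, and this is not the exact query Piwik builds):

SELECT name,
       type,
       count(distinct t1.idvisit) as nb_visits,
       count(distinct visitor_idcookie) as nb_uniq_visitors,
       count(*) as nb_hits
FROM (piwik_log_visit as t1
        LEFT JOIN piwik_log_link_visit_action as t2 USING (idvisit))
        LEFT JOIN piwik_log_action as t3 ON (t2.idaction_url = t3.idaction)
WHERE visit_server_date = '2009-12-12'
  AND idsite = '7'
GROUP BY t3.idaction
ORDER BY nb_hits DESC
LIMIT 0, 50000;   -- hard limit on the rows returned (example value)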

Initial Report

Everything was working great on 0.4.5. After updating to 0.5.1, the archive script runs for hours and hours. I managed to get it to archive all daily stats yesterday, but that took at least an hour to run. Last night's cron was still running at 11 AM when I checked it (9 hours). This is on a quad-core Xeon server. Here are the row counts in the main tables:

piwik_log_action: 6267
piwik_log_link_visit_action: 111901
piwik_log_visit: 45021

This wouldn't be a big deal if it didn't LOCK TABLES and require extensive use of temporary tables. It completely locks out any use of Piwik, causing issues both for visitors of any site that uses Piwik and for anyone trying to view stats. It also puts a high load on the server, slowing general operation.

Here's a sample of some of the slow queries from my logs:

# Time: 091212 11:03:02
# User@Host: piwik[piwik] @  [10.2.1.239]
# Query_time: 60  Lock_time: 0  Rows_sent: 6  Rows_examined: 44481023
SELECT name,
       type,
       count(distinct t1.idvisit) as nb_visits,
       count(distinct visitor_idcookie) as nb_uniq_visitors,
       count(*) as nb_hits
FROM (piwik_log_visit as t1
        LEFT JOIN piwik_log_link_visit_action as t2 USING (idvisit))
        LEFT JOIN piwik_log_action as t3 ON (t2.idaction_url = t3.idaction)
WHERE visit_server_date = '2009-12-12'
  AND idsite = '7'
GROUP BY t3.idaction
ORDER BY nb_hits DESC;

# Time: 091212 11:08:39
# User@Host: piwik[piwik] @  [10.2.1.239]
# Query_time: 36  Lock_time: 0  Rows_sent: 3  Rows_examined: 26729656
SELECT name,
       type,
       count(distinct t1.idvisit) as nb_visits,
       count(distinct visitor_idcookie) as nb_uniq_visitors,
       count(*) as nb_hits
FROM (piwik_log_visit as t1
        LEFT JOIN piwik_log_link_visit_action as t2 USING (idvisit))
        LEFT JOIN piwik_log_action as t3 ON (t2.idaction_name = t3.idaction)
WHERE visit_server_date = '2009-12-12'
  AND idsite = '39'
GROUP BY t3.idaction
ORDER BY nb_hits DESC;

mattab commented Dec 15, 2009

I don't think the LIMIT 0, 50000 (for example) will actually solve the issue. MySQL will still have to select the full result set, sort it, and only then limit it. Maybe it will be necessary to (see the sketch after this list):

  • denormalize the log_link_visit_action table by adding idsite and visit_server_date fields
  • rewrite the archiving query to select the top 50,000 rows from this table alone
  • then select the piwik_log_action.name for each of these 50,000 piwik_log_link_visit_action.idaction_name (or idaction_url) values
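
A rough sketch of what these two queries might look like, assuming the denormalized idsite and visit_server_date columns have been added to piwik_log_link_visit_action (an illustration of the idea, not an actual patch):

-- step 1: top 50,000 actions, computed from the denormalized link table alone
SELECT idaction_url, count(*) as nb_hits
FROM piwik_log_link_visit_action
WHERE visit_server_date = '2009-12-12'
  AND idsite = 7
GROUP BY idaction_url
ORDER BY nb_hits DESC
LIMIT 0, 50000;

-- step 2: fetch name and type only for the idaction values returned by step 1
SELECT idaction, name, type
FROM piwik_log_action
WHERE idaction IN ( /* idaction_url values from step 1 */ );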


mattab commented Jan 11, 2011

dup of #2002

This issue was closed.