Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"All Websites dashboard" should load fast even when tracking hundreds of websites! MultiSites #1077

Closed
mattab opened this issue Dec 14, 2009 · 37 comments
Labels
Bug For errors / faults / flaws / inconsistencies etc. Critical Indicates the severity of an issue is very critical and the issue has a very high priority.
Milestone

Comments

@mattab
Copy link
Member

mattab commented Dec 14, 2009

Currently, the multi sites dashboard page render takes 6 API requests (ie. 6 mysql SELECTs) for each website displayed. If a Piwik install has 1000 registered websites, the page load will simply fail, because the php page would try and execute 6,000 queries which is not a good idea.

Instead, only one request should be made to select the 6 values necessary for each website. This query would therefore select 6,000 values at once which would be as fast as we can get with that many values.

This query would join the site, access and archive_* tables. See Piwik_Access::reloadAccess() and core/Archive/Array/IndexedBySite.php for the current behavior. I think the solution might be to have a new class in core/Archive/Array indexed by user?

@anonymous-matomo-user
Copy link

Hi, is there any update on this?
One of the main reasons I installed Piwik was to use the multisite feature (it worked great when I had 10 sites). However I now have 500+ sites and this feature is unusable. If I try to run it, it times out of after a few minutes. Cheers.

@anonymous-matomo-user
Copy link

+1 vote for this issue

1 similar comment
@anonymous-matomo-user
Copy link

+1 vote for this issue

@mattab
Copy link
Member Author

mattab commented Feb 19, 2010

Other bug report:

I am tracking 22 sites with Piwik. Two have titles that start with A (we'll call them A1 and A2), and the other 20 have titles starting with C-Z.

The main MultiSites screen is a bit funky. The initial view shows sites C-Z, in descending alphabetical order. Clicking on Next shows A1 and A2, in alpha order. These should be shown at the top of the other screen, while the last 2 should be on this second screen.

Also, the text at the bottom reads "1-21 of 22", when in fact it should read "1-20 of 22"

@zawadzinski
Copy link
Contributor

[1872] fixes sorting and other small UI issues

@robocoder
Copy link
Contributor

(In [refs #1077 - re-apply fix from 1810) where site name contains quotes

@mattab
Copy link
Member Author

mattab commented Jul 20, 2010

Doable in one query, we can go from 1,000 websites= 6,000 SELECT to one SELECT which would greatly improve performance. This is one of the last performance blockers for 1.0 with #409

@robocoder
Copy link
Contributor

See also #1315

@mattab
Copy link
Member Author

mattab commented Nov 23, 2010

Also it would be useful to have a global.ini.php setting controlling how many websites to show in the UI by default (so Super Users can decide to show 1000 if they want to)

@mattab
Copy link
Member Author

mattab commented Dec 12, 2010

  • for consistency, we could add a link to download the 'All websites' data in xml/excel/etc. at the bottom of the dashboard

@mattab
Copy link
Member Author

mattab commented Dec 28, 2010

Feature requests from forum post

  • Show more than 20 websites in the page (could be a config file setting)
  • Show Goal conversions in a new column

@diosmosis
Copy link
Member

Hello all, I'm interested in fixing this bug. I think I've worked out a way to fix it, but I'm not completely familiar with the code base, so I wanted to run my idea by you guys before I created a patch.

I looked at the queries being made when the 'All Websites' page is loaded and found that the large number of queries was due to the Array Archive types looping over an array of other Archives. I thought a possible solution would be a new archive type (called Piwik_Archive_Cached) that cached the archive IDs that result when prepareArchive() is called in a BLOB archive record. This would reduce the number of queries made from two per site per API call to two queries per api call, but it wouldn't make the page any faster when archiving is triggered through the browser.

Does this sound like an acceptable approach?

@mattab
Copy link
Member Author

mattab commented Aug 22, 2011

capedfuzz, in your proposal, will you run a query such as WHERE idarchive IN (.........) ?

This could be a problem, when there are for example, 5,000 websites, the WHERE idarchive IN ... will be rather slow. But, it is worth testing anyway such patch, as it might still improve performance?

In any case, it is normal that the page will work nicely only when automatic archiving is setup (ie. browser archiving is disabled), so this limitation is definitely OK.

looking forward to seeing your patch or ideas about this!

@diosmosis
Copy link
Member

I planned on using IN (...), but I might be able to optimize the query... I'll create a prototype and see what happens.

@diosmosis
Copy link
Member

A quick update, wanted to let you know I'm still working on this. I had to go a different route than what I proposed: instead of storing archive IDs I select all the IDs in one go. From my simple tests (which didn't include loading sparklines), this sped up the page from ~46 secs to ~17 secs when loading data for 800 websites.

I still have a fair amount of work to do, but hopefully I'll have a patch posted soon.

@diosmosis
Copy link
Member

Actually, sorry, that's ~24 to 17secs. Forgot to disable archiving when testing the normal version. So not quite as monumental a speed up :).

@mattab
Copy link
Member Author

mattab commented Sep 5, 2011

capedfuzz, any update on your patch? Any improvement is definitely welcome for the next release :) thx

@diosmosis
Copy link
Member

Yes, I've got an update. I ended up adding some more improvements and got the page to load in < 4 secs for a sufficiently small amount of sites per page. However, to do that I had to make it possible to do the sorting/pagination logic server-side. So the patch is a bit big. I should be able to submit it some time tomorrow (EST), but in the mean time, here's a list of the big changes I made:

  1. I've added a new class, Piwik_QueryOptions, which holds extra query info (like what to use in the LIMIT/OFFSET/ORDER BY parts of a query). I added a parameter to the VisitsSummary API's get & the Goals API's get methods, as well as each Archive type's getDataTableFromNumeric method. Currently, its only used by the IndexedBySite type.
  2. IndexedBySite won't call prepareArchive unless archiving is enabled. It'll also select archive ids in one query.
  3. I added a getTotal method to the VisitSummary API & the Goals API to compute the sum of a set of archived values, and I added a getSum method to every Archive type. Currently, its only implemented by the IndexedBySite archive type (the others throw an exception).

The page will still load slowly when looking for a custom range that initiates archiving.

@diosmosis
Copy link
Member

Attachment: Patch for this issue.
issue1077.diff.tar.gz

@diosmosis
Copy link
Member

My patch is attached. It will speed things up greatly when there aren't that many sites shown, and speed things up a bit when a lot of sites are shown.

Some notes:

  • The amount of time it takes to load the page is still proportional to the amount of websites Piwik tracks. When there are ~1000 sites it'll display ~50 entries very quickly, but when I increased the number of sites to ~5000, it took a couple seconds longer.
  • The page will still load slowly (if at all) when looking for a custom range that initiates archiving.
  • When sorting, results are sorted by the desired field (ie, nb_visits, nb_actions, etc.), and then by site ID (as opposed to name). This might make the results look a bit weird.

@diosmosis
Copy link
Member

Attachment: Patch for issue #1077. Optimizes the IndexedBySite archive type. Speeds up All Websites page by 12 seconds when Piwik tracks 800 websites.
issue1077.diff

@diosmosis
Copy link
Member

I've created some optimizations that speed up the All Websites page by ~12 seconds when tracking 800 websites. I had to make changes to the Piwik_ArchiveProcessing class, but I don't have detailed knowledge how this class & its descendants work. Would anyone mind taking a quick look at my changes to make sure there are no potential issues w/ it (I've uploaded a patch)? Thanks in advance.

@mattab
Copy link
Member Author

mattab commented Nov 22, 2011

capedfuzz, looks like an excellent patch! I haven't tried applying it, but will definitely test performance before and after you have committed.

Feedback

  • Code looks v. good!
  • testability: it is important to test code coverage of most archiving code since it is core to Piwik. Would you mind adding an integration test testing with idSites=all or a comma separated list?
    The test would create fake visits for 2 or 3 different websites. then would call a few APIs to trigger archiving. Then, disable browser archiving. Then call idSites=1,2,3 and check the new code path is executed, and code is tested and works as expected. Then, enable browser archiving.
  • To "disable/enable browser archiving" temporarily in your test, you could add a new helper such as:
Index: core/ArchiveProcessing.php
===================================================================
--- core/ArchiveProcessing.php  (revision 5455)
+++ core/ArchiveProcessing.php  (working copy)
@@ -900,15 +900,22 @@
        return $isDisabled;
    }

+   protected $archivingDisabled = false;
    protected function isRequestAuthorizedToArchive()
    {
-       return self::isBrowserTriggerArchivingEnabled()
-               || Piwik_Common::isPhpCliMode()
-               || (Piwik::isUserIsSuperUser() 
-                   && Piwik_Common::isArchivePhpTriggered())
-                   ;
+       return !$this->archivingDisabled
+               && (self::isBrowserTriggerArchivingEnabled()
+                   || Piwik_Common::isPhpCliMode()
+                   || (Piwik::isUserIsSuperUser() 
+                       && Piwik_Common::isArchivePhpTriggered())
+                   );
    }

+   static public function disableArchiving($disable = true)
+   {
+       $this->archivingDisabled = $disable;
+   }
+   
    /**
     * Returns true when 
     * - there is no segment and period is not range 

  • excellent idea to submit code bit by bit otherwise it is much harder to review

I don't see any potential issues otherwise, please commit when you manage to add the test. if you need any help let me know :)

How long does the page take before and after? what % improvement?

@diosmosis
Copy link
Member

(In [5475]) Refs #1077. Improved speed of the IndexedBySite archive type by optimizing the case when launching the archiving process is disabled. Instead of selecting archive IDs one at a time, IndexedBySite will now select them all at once.

Notes:

  • Also added an integration test that will test whether Piwik works when archiving is disabled.
  • Modified getKnownSegmentsToArchive() to cache its result as static function data instead of having ArchiveProcessing cache it as instance data.
  • Changed several instance methods in ArchiveProcessing to static methods. These methods are responsible for determining if archiving is disabled.
  • Removed meaningless idSite related code from TablePartitioning.

@diosmosis
Copy link
Member

Attachment: A quick benchmark used to test how long it takes to retrieve the All Websites HTML. Written in CoffeeScript & uses node.js.
test_all_sites_time.coffee

@diosmosis
Copy link
Member

I committed the change & there's a new integration test.

The times I record when I test aren't constant, but the most conservative view of the speed improvement is from 14s to 6s (~57%).

I've also uploaded the script I've used to test the times.

@mattab
Copy link
Member Author

mattab commented Nov 26, 2011

Very nice commit and ideas. New integration test good.

I see from 17s to 7s improvements on my 600 websites test setup... Very nice!!! Great work.

When I enable profiler and debug logging, I see that now there is this executed for each site:

Executed 565 times in 564.3ms (average = 1ms)
    SELECT * FROM piwik_site WHERE idsite = ?

I guess this could be detected upstream and select * instead, which would probably bring another 10% improvement.

Executed 4 times in 40.2ms (average = 10.1ms)

    SELECT idsite, MAX(idarchive) AS idarchive FROM piwik_archive_numeric_2011_11 WHERE date1 = ? AND date2 = ? AND period = ? AND (name = 'done' ) AND (value = '1' OR value = '3') AND idsite IN (1,3,4,5,6,30,2944,2945,[... hundreds of ids ...]05,3506) GROUP BY idsite

This one is also interesting. Executed 4 times, because of: Piwik_MultiSites_Controller->getSitesInfo() called at [D:\piwik\svn\trunk\plugins\MultiSites\Controller.php:37]

  • we could group these 4 API calls into one (maybe via the new API.get helper?
    • For example, maybe we create a custom API call for the ALl Websites dashboard that selects the few metrics needed. This is more a hack, but that would work well and would be quite easy to implement.
      Downside: tied to naming of metrics in VisitsSummary,Goals plugin, but this is fine since these metrics won't change their name anytime soon.

@mattab
Copy link
Member Author

mattab commented Nov 28, 2011

(In [5489]) Refs #1077 Pre-load all websites when user is super user only. Saves hundreds of select * from piwik_site where idsite = ?

@mattab
Copy link
Member Author

mattab commented Nov 28, 2011

(In [5490]) Refs #1077 Missing code preventing multiple executions

@diosmosis
Copy link
Member

Replying to matt:

When I enable profiler and debug logging, I see that now there is this executed for each site:

Didn't know there was a profiler built into Piwik, thanks for pointing it out!

Executed 4 times in 40.2ms (average = 10.1ms)

  SELECT idsite, MAX(idarchive) AS idarchive FROM piwik_archive_numeric_2011_11 WHERE date1 = ? AND date2 = ? AND period = ? AND (name = 'done' ) AND (value = '1' OR value = '3') AND idsite IN (1,3,4,5,6,30,2944,2945,[... hundreds of ids ...]05,3506) GROUP BY idsite

This one is also interesting. Executed 4 times, because of: Piwik_MultiSites_Controller->getSitesInfo() called at [D:\piwik\svn\trunk\plugins\MultiSites\Controller.php:37]

  • we could group these 4 API calls into one (maybe via the new API.get helper?
    • For example, maybe we create a custom API call for the ALl Websites dashboard that selects the few metrics needed. This is more a hack, but that would work well and would be quite easy to implement.
      Downside: tied to naming of metrics in VisitsSummary,Goals plugin, but this is fine since these metrics won't change their name anytime soon.

I think you could get rid of these by allowing Piwik_Archive creation to be separate from the querying. I'll upload a patch to show you what I mean. (This might result in a speed up when archiving is enabled, too...)

Though from what I can tell the remaining SQL isn't taking too much time to execute. I'm going to try and whittle the time down a couple more seconds. If I can't I'll post a comment.

@diosmosis
Copy link
Member

Attachment: Patch for issue #1077. Several smaller optimizations included. Made for rev 5492.
issue1077_assorted_speedups.diff

@diosmosis
Copy link
Member

I uploaded another patch with a bunch of smaller fixes. This patch speeds All Sites up for me from 5.2s to 2.2s (w/ 800 websites stored). Here's a list of the changes:

  • Modified Piwik_Archive::build so it can accept Piwik_Segment instances. This way there won't be hundreds of segments created when they get overwritten by IndexedBySite anyway. Saves ~.2s
  • Modified the previously committed site caching code. All Sites already selects those sites, so it was a bit redundant. I removed part of it, but I could apply a similar optimization to other parts of Piwik if you let me know where they are. Saves ~2.25s (no idea why so much...)
  • Modified the setMinMaxDateAcrossWebsites function and the Piwik_Date class. The aforementioned function looped through every site and instantiated several Date instances. I modified it to work w/o having to create Date instances. Saves ~.5s

I can do one more optimization that should reduce the time to ~1.5s. When the website count goes into the thousands, the speed up will be more apparent. But, it will require an added method to the VisitsSummary & Goals APIs.

I noticed that the only remaining thing that takes too much time is the creation of the IndexedBySite archive. The Archive::build method instantiates an Archive, Site, Period and Date instance and IndexedBySite will invoke it for every site. The first time its run it takes ~1s on my setup. I want to add getForArchive($archive) methods to VisitsSummary & Goals which take an already built archive. Do you think there would be a problem w/ this?

Also, I can rewrite part of the MultiSites controller and add an API class w/ a getSitesInfo method. This way it can have some integration tests.

@mattab
Copy link
Member Author

mattab commented Dec 1, 2011

Great!!

core/Archive/Array/IndexedBySite.php 
36                              $archive = Piwik_Archive::build($idSite, $strPeriod, $strDate, $segment->getString() ); 
37                              $archive->setSite(new Piwik_Site($idSite)); 
38                              $archive->setSegment($segment); 
    36                          $archive = Piwik_Archive::build($idSite, $strPeriod, $strDate, $segment ); 

Here the call to setSite() is not done anymore - is this call to setSite() not used/useless?

  • I support you've used the profiler to test the queries are called once as expected etc.

  • Excellent changes in the Date class (makes the code much clearer!!) and good speed improvements :)

    I removed part of it, but I could apply a similar optimization to other parts of Piwik if you let me know where they are. Saves ~2.25s (no idea why so much...)
    I think the sql query is quite useful for the "super user", since getSitesWithViewAccess will issue a sql query ID IN(...) with thousands of IDs. For the Super User, we know that all websites are needed so this query would help, but leave your new code w/ this improvement as well for admin/view/anonymous users.
    add an API class w/ a getSitesInfo method
    WHat's your idea? I thought that this function could return the metrics needed, bypassing the VisitsSummary and Goals APIs. This would change from 4 sql queries to sql 2 queries.

    I want to add getForArchive($archive) methods to VisitsSummary & Goals which take an already built archive. Do you think there would be a problem w/ this?

After all other changes (and calling 1 API for today, 1 for yesterday), I'm not sure if this one would be required still.

Please feel free to commit! great stuff! in fact this ticket can be closed after this, since the page will render for most users now.

@diosmosis
Copy link
Member

Replying to matt:

Here the call to setSite() is not done anymore - is this call to setSite() not used/useless?

Archive::build will set the site on a Single archive, so the call in IndexedBySite is unnecessary. It did have an effect on the IndexedByDate class, but I added a call to setSite to IndexedByDate's constructor.

  • I support you've used the profiler to test the queries are called once as expected etc.

  • Excellent changes in the Date class (makes the code much clearer!!) and good speed improvements :)

    I removed part of it, but I could apply a similar optimization to other parts of Piwik if you let me know where they are. Saves ~2.25s (no idea why so much...)
    I think the sql query is quite useful for the "super user", since getSitesWithViewAccess will issue a sql query ID IN(...) with thousands of IDs. For the Super User, we know that all websites are needed so this query would help, but leave your new code w/ this improvement as well for admin/view/anonymous users.

The issue I see w/ the super user improvement is that its in the Archive::build method. So any API request that checks for 'all' sites may result in two queries (if the original API method optimizes for non-super user cases). If I move the optimization to the All Sites plugin, this wouldn't occur, but the optimization would have to be explicitly added to other parts of Piwik that get data for 'all' sites. I'm going to leave it in when I commit, though.

add an API class w/ a getSitesInfo method
WHat's your idea? I thought that this function could return the metrics needed, bypassing the VisitsSummary and Goals APIs. This would change from 4 sql queries to sql 2 queries.

I want to add getForArchive($archive) methods to VisitsSummary & Goals which take an already built archive. Do you think there would be a problem w/ this?

After all other changes (and calling 1 API for today, 1 for yesterday), I'm not sure if this one would be required still.

No, it wouldn't. After I thought about it some more I realized you could just go through the data table and forget about the other APIs :).

Please feel free to commit! great stuff! in fact this ticket can be closed after this, since the page will render for most users now.

I'd still like to refactor the Multi Sites plugin, though. Do you mind if I create a new ticket: 'Refactor Multi Sites plugin to follow Piwik plugin conventions'?

@mattab
Copy link
Member Author

mattab commented Dec 1, 2011

If I move the optimization to the All Sites plugin, this
wouldn't occur, but the optimization would have to be explicitly added to
other parts of Piwik that get data for 'all' sites. I'm going to leave it
in when I commit, though.

I think the only other parts is the "Manage Websites" page in the Settings, for the Super User. Otherwise no other page should select all websites information I think.

I'd still like to refactor the Multi Sites plugin, though. Do you mind if I create a new ticket: 'Refactor Multi Sites plugin to follow Piwik plugin conventions'?

Creating a new ticket sounds good, since this one focuses on performance which is now mostly fixed :)

@diosmosis
Copy link
Member

(In [5505]) Fixes #1077, Assorted optimizations for the Multi Sites plugin:

  • Modified Piwik_Archive::build so it can accept Piwik_Segment instances. This way there won't be hundreds of segments created when they get overwritten by IndexedBySite anyway.
  • Removed previously committed optimization that selected all sites in Piwik_Archive and replaced w/ a similar optimization in the code that selects data for every website (the Multi Sites controller & Sites Manager controller).
  • Modified the setMinMaxDateAcrossWebsites function of the Multi Sites controller so Piwik_Date instances wouldn't be created within its loop and modified the Piwik_Date class to make the former change possible.

@mattab
Copy link
Member Author

mattab commented Dec 3, 2011

(In [5523]) Refs #1077 fix typo
note to self: we need UI / Widgets rendering tests :)

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug For errors / faults / flaws / inconsistencies etc. Critical Indicates the severity of an issue is very critical and the issue has a very high priority.
Projects
None yet
Development

No branches or pull requests

5 participants