Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Tracking API performance: track bulk page views as POST request #3134

Closed
mattab opened this issue May 3, 2012 · 21 comments
Closed
Labels
Critical Indicates the severity of an issue is very critical and the issue has a very high priority. Enhancement For new feature suggestions that enhance Matomo's capabilities or add a new report, new API etc.
Milestone

Comments

@mattab
Copy link
Member

mattab commented May 3, 2012

Currently the tracking API accepts a set of GET parameters. each request is one page view. For performance reasons, it sometimes is desired to send multiple page views at once:

  • We could batch page views for example 20 pages in one request in our server Log Analytics tool. This would greatly improve performance since Python wouldnt have to create HTTP request for each log line which has huge overhead
  • We could use this feature in a Piwik Mobile SDK for example, after connectivity is restored all pending requests are sent at once.

The changes to make are:

@mattab
Copy link
Member Author

mattab commented May 26, 2012

This is needed quickly for faster log imports

@mattab
Copy link
Member Author

mattab commented May 30, 2012

  • I think we should generalize this concept and also apply to API requests, as it could be useful to request 5 reports from the analytics api at once rather than doing 5 http requests.
  • The PHP Tracking Client could have new function enableBulkRequests() that would automatically send requests by batch?

@diosmosis
Copy link
Member

Attachment: Patch for this issue.
3134.diff.tar.gz

@diosmosis
Copy link
Member

I uploaded a patch for this issue. It provides bulk tracking & bulk API requests.

Some notes on the patch:

  • Bulk tracking is done by POST-ing JSON to the tracker. I use JSON instead of the normal form encoded data since using JSON will result in less data being sent (we avoid re-encoding a list of URLs).
  • The JSON data will look like: array('requests' => array(...), 'token_auth' => ...).
  • I modified core/API/DocumentationGenerator to use Piwik_Url when generating an example URL.
  • Fixed a bug in the XML data table renderer that didn't format some values.

What do you think of my patch?

@mattab
Copy link
Member Author

mattab commented Jun 18, 2012

Sorry for the long delay in replying.

  • It looks excellent! Very exciting & well done (quite funny to see multiple API output in one XML!). This change will improve so much performance of log import :-)
  • +if (empty($_GET) && empty($_POST) && $_SERVER['REQUEST_METHOD'] == 'POST')
    Piwik_Config::getInstance()->Tracker['tracking_requests_require_authentication'] = 0;
    which use case has empty _POST but still requiring no authentication?
  • connectDatabase() is called within the loop, but we should connect only once to mysql for performance reasons
  • in sendRequest() I don't thinkw e need to supprot PUT or DELETE
  • in the tests for all chagnes such as:
  •   $this->checkResponse($visitorB->doTrackAction('mailto:test@example.org/strangelink', 'link'));
    
  •   $visitorB->doTrackAction('mailto:test@example.org/strangelink', 'link');
    

When bulk import is enabled, doTrack* should return "true" I think. Then we should test that the functions return true instead of not testing the output.

  • Would you mind also preparing a short new paragraph of documentation for: http://piwik.org/docs/analytics-api/reference that would explain the feature and show 1 example of multiple data requests in one http request? this feature is great and many users will enjoy performance benefits
  • why reading the _POSt values from php://input rather than $_POST?
  • I am wondering about the sanitizing then unsanitizing of values in core/Tracker.php __construct () -- is it necessary? I thought we should just forward the raw values and the getRequestVar() calls would take care of the sanitizing? Maybe i'm missing something :)
  • can you please confirm my note in: Log analytics list of improvements #3163 is correct as per your API?

@diosmosis
Copy link
Member

Attachment: New patch for this issue.
3134.diff.tar.2.gz

@diosmosis
Copy link
Member

I uploaded a new patch, let me know what you think.

Replying to matt:

  • +if (empty($_GET) && empty($_POST) && $_SERVER['REQUEST_METHOD'] == 'POST')
    Piwik_Config::getInstance()->Tracker['tracking_requests_require_authentication'] = 0;
    which use case has empty _POST but still requiring no authentication?
  • why reading the _POSt values from php://input rather than $_POST?

When posting JSON (and not a form-encoded string), $_POST won't be populated. My code will POST JSON so urls don't get encoded (which saves space).

  • connectDatabase() is called within the loop, but we should connect only once to mysql for performance reasons

connectDatabase will check if the DB connection is already created, and the DB is disconnected only after the loop. I've renamed the function to make this clearer.

  • I am wondering about the sanitizing then unsanitizing of values in core/Tracker.php __construct () -- is it necessary? I thought we should just forward the raw values and the getRequestVar() calls would take care of the sanitizing? Maybe i'm missing something :)

Oops :)

Yes, that's right. And I'll do some benchmarks to see what the best # of requests to send are.

@diosmosis
Copy link
Member

I did some performance testing of my patch:

REQUESTS SENT           TIME            TIME PER REQUEST
10(>15)             .4s         .04s
20(>15)             .5s         .025s
30(>15)             .6s         .02s
50(>15)             .85s            .017s
100(>15)            1.5s            .015s
200(>15)            2.55s           .01275s
300(>15)            3.55s           .0118s
400(10-15)          4.6s            .0115s
500(10-15)          5.9s            .0118
1000(5)             11.3s           .0113
2000(5)             23s         .0115
5000(5)             56.25s          .01125

The number in the parentheses is the number of times I ran the test. I was pessimistic and used the longest elapsed time I encountered. The sweet spot seems to be around 400 requests. After that it stays pretty constant (I tried 10000 once and it was one request every .0113s.

This was done going through localhost. I assume when tracking from a different machine, it would be faster just to send everything in one POST.

@cbay
Copy link
Contributor

cbay commented Jul 7, 2012

capedfuzz: awesome, I'm eager to see how much faster the log import script will be. I'm also glad you chose to use JSON, it will make things easier.

@mattab
Copy link
Member Author

mattab commented Jul 20, 2012

Benaka, beautiful patch, please commit, this is a significant improvement to the tracking API that Log Analytics users will enjoy a lot (since performance was the only negative feedback from the log import feature which MANY users like a lot!)

@diosmosis
Copy link
Member

(In [6524]) Fixes #3134, added bulk tracking feature and bulk request API method.

Notes:

  • Modified php tracker lib so bulk tracking can be used (via enableBulkTracking & doBulkTrack methods).
  • Modified DocumentationGenerator.php to generate example URLs using Piwik_Url instead of creating it by hand.
  • Fixed bug in XML DataTable renderer where row values weren't escaped when the row has one value and no columns.

@diosmosis
Copy link
Member

(In [6525]) Refs #3134, forgot to add expected output for getBulkRequest test.

@mattab
Copy link
Member Author

mattab commented Jul 23, 2012

Thanks - could you please also send the documentation to add to: http://piwik.org/docs/tracking-api/reference/ to explain how to do bulk requests?

There will be several users: mobile app piwik anynomous tracking, iOs SDK, Cyril with the Log Import script. A doc will be appreciated for sure :)

@mattab
Copy link
Member Author

mattab commented Jul 25, 2012

(In [6551]) Refs #3134 Allowing POST to trigger bulk request + requiring authentication for bulk requests

@mattab
Copy link
Member Author

mattab commented Jul 25, 2012

Thanks, doc is online at: bulk tracking API requests

@mattab
Copy link
Member Author

mattab commented Jul 25, 2012

(In [6552]) Refs #3134 Fixing bug thanks Jenkins

@diosmosis
Copy link
Member

(In [6581]) Refs #3134, fixing build & make sure PiwikTracker throws when token auth not set and bulk track attempted.

@mattab
Copy link
Member Author

mattab commented Jul 29, 2012

(In [6586]) Doing general reset checks Refs #3134
I think that otherwise, if some requests were forcing the IP, the forced IP or datetime would have be enforeced for the next request since not reset.

@anonymous-matomo-user
Copy link

Hi capedfuzz,
Is this enhancement integrated in the log importer python script in piwik 1.8.3 ?

@mattab
Copy link
Member Author

mattab commented Aug 2, 2012

Is this enhancement integrated in the log importer python script in piwik 1.8.3 ?
not yet

@mattab
Copy link
Member Author

mattab commented Sep 3, 2012

(In [Fix for "PHP Fatal error: ) operator not supported for strings in /var/www/piwik/libs/PiwikTracker/PiwikTracker.php on line 849"

when php5-curl is not installed Refs #3134

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Critical Indicates the severity of an issue is very critical and the issue has a very high priority. Enhancement For new feature suggestions that enhance Matomo's capabilities or add a new report, new API etc.
Projects
None yet
Development

No branches or pull requests

4 participants