Opened 2 years ago

Closed 21 months ago

Last modified 20 months ago

#3134 closed New feature (fixed)

Improve Tracking API performance: track bulk page views as POST request

Reported by: matt Owned by:
Priority: critical Milestone: 1.12.x - Piwik 1.12.x
Component: Core Keywords:
Cc: Sensitive: no

Description

Currently the tracking API accepts a set of GET parameters. each request is one page view. For performance reasons, it sometimes is desired to send multiple page views at once:

  • We could batch page views for example 20 pages in one request in our server Log Analytics tool. This would greatly improve performance since Python wouldnt have to create HTTP request for each log line which has huge overhead
  • We could use this feature in a Piwik Mobile SDK for example, after connectivity is restored all pending requests are sent at once.

The changes to make are:

  • Allow piwik.php to loop itself to record all page views passed as POST
  • refs #703: we should modify the python script to batch requests in POST.
    • By trial we should determine what number of batched visit brings the best performance.

Attachments (2)

3134.diff.tar.gz (8.6 KB) - added by capedfuzz 23 months ago.
Patch for this issue.
3134.diff.tar.2.gz (8.6 KB) - added by capedfuzz 22 months ago.
New patch for this issue.

Download all attachments as: .zip

Change History (22)

comment:1 Changed 2 years ago by matt (mattab)

  • Milestone changed from 1.9 Piwik 1.9 to 1.8.x - Piwik 1.8.x

This is needed quickly for faster log imports

comment:2 Changed 2 years ago by matt (mattab)

  • Priority changed from major to critical

comment:3 Changed 23 months ago by matt (mattab)

  • I think we should generalize this concept and also apply to API requests, as it could be useful to request 5 reports from the analytics api at once rather than doing 5 http requests.
  • The PHP Tracking Client could have new function enableBulkRequests() that would automatically send requests by batch?

Changed 23 months ago by capedfuzz (diosmosis)

Patch for this issue.

comment:4 Changed 23 months ago by capedfuzz (diosmosis)

I uploaded a patch for this issue. It provides bulk tracking & bulk API requests.

Some notes on the patch:

  • Bulk tracking is done by POST-ing JSON to the tracker. I use JSON instead of the normal form encoded data since using JSON will result in less data being sent (we avoid re-encoding a list of URLs).
  • The JSON data will look like: array('requests' => array(...), 'token_auth' => ...).
  • I modified core/API/DocumentationGenerator to use Piwik_Url when generating an example URL.
  • Fixed a bug in the XML data table renderer that didn't format some values.

What do you think of my patch?

comment:5 Changed 22 months ago by matt (mattab)

Sorry for the long delay in replying.

  • It looks excellent! Very exciting & well done (quite funny to see multiple API output in one XML!). This change will improve so much performance of log import :-)
  • +if (empty($_GET) && empty($_POST) && $_SERVERREQUEST_METHOD? == 'POST')

Piwik_Config::getInstance()->Trackertracking_requests_require_authentication? = 0;

which use case has empty _POST but still requiring no authentication?

  • connectDatabase() is called within the loop, but we should connect only once to mysql for performance reasons
  • in sendRequest() I don't thinkw e need to supprot PUT or DELETE
  • in the tests for all chagnes such as:
  • $this->checkResponse($visitorB->doTrackAction('mailto:test@…', 'link'));

+ $visitorB->doTrackAction('mailto:test@…', 'link');

When bulk import is enabled, doTrack* should return "true" I think. Then we should test that the functions return true instead of not testing the output.

  • Would you mind also preparing a short new paragraph of documentation for: http://piwik.org/docs/analytics-api/reference that would explain the feature and show 1 example of multiple data requests in one http request? this feature is great and many users will enjoy performance benefits
  • why reading the _POSt values from php://input rather than $_POST?
  • I am wondering about the sanitizing then unsanitizing of values in core/Tracker.php construct () -- is it necessary? I thought we should just forward the raw values and the getRequestVar() calls would take care of the sanitizing? Maybe i'm missing something :)
  • can you please confirm my note in: http://dev.piwik.org/trac/ticket/3163#comment:22 is correct as per your API?

Changed 22 months ago by capedfuzz (diosmosis)

New patch for this issue.

comment:6 Changed 22 months ago by capedfuzz (diosmosis)

I uploaded a new patch, let me know what you think.

Replying to matt:

Piwik_Config::getInstance()->Trackertracking_requests_require_authentication? = 0;

which use case has empty _POST but still requiring no authentication?

  • why reading the _POSt values from php://input rather than $_POST?

When posting JSON (and not a form-encoded string), $_POST won't be populated. My code will POST JSON so urls don't get encoded (which saves space).

  • connectDatabase() is called within the loop, but we should connect only once to mysql for performance reasons

connectDatabase will check if the DB connection is already created, and the DB is disconnected only after the loop. I've renamed the function to make this clearer.

  • I am wondering about the sanitizing then unsanitizing of values in core/Tracker.php construct () -- is it necessary? I thought we should just forward the raw values and the getRequestVar() calls would take care of the sanitizing? Maybe i'm missing something :)

Oops :)

Yes, that's right. And I'll do some benchmarks to see what the best # of requests to send are.

comment:7 Changed 22 months ago by capedfuzz (diosmosis)

I did some performance testing of my patch:

REQUESTS SENT			TIME			TIME PER REQUEST
10(>15)				.4s			.04s
20(>15)				.5s			.025s
30(>15)				.6s			.02s
50(>15)				.85s			.017s
100(>15)			1.5s			.015s
200(>15)			2.55s			.01275s
300(>15)			3.55s			.0118s
400(10-15)			4.6s			.0115s
500(10-15)			5.9s			.0118
1000(5)				11.3s			.0113
2000(5)				23s			.0115
5000(5)				56.25s			.01125

The number in the parentheses is the number of times I ran the test. I was pessimistic and used the longest elapsed time I encountered. The sweet spot seems to be around 400 requests. After that it stays pretty constant (I tried 10000 once and it was one request every .0113s.

This was done going through localhost. I assume when tracking from a different machine, it would be faster just to send everything in one POST.

comment:8 Changed 22 months ago by Cyril (cbay)

capedfuzz: awesome, I'm eager to see how much faster the log import script will be. I'm also glad you chose to use JSON, it will make things easier.

comment:9 Changed 21 months ago by matt (mattab)

Benaka, beautiful patch, please commit, this is a significant improvement to the tracking API that Log Analytics users will enjoy a lot (since performance was the only negative feedback from the log import feature which MANY users like a lot!)

comment:10 Changed 21 months ago by capedfuzz (diosmosis)

  • Resolution set to fixed
  • Status changed from new to closed

(In [6524]) Fixes #3134, added bulk tracking feature and bulk request API method.

Notes:

  • Modified php tracker lib so bulk tracking can be used (via enableBulkTracking & doBulkTrack methods).
  • Modified DocumentationGenerator.php to generate example URLs using Piwik_Url instead of creating it by hand.
  • Fixed bug in XML DataTable renderer where row values weren't escaped when the row has one value and no columns.

comment:11 Changed 21 months ago by capedfuzz (diosmosis)

(In [6525]) Refs #3134, forgot to add expected output for getBulkRequest test.

comment:12 Changed 21 months ago by matt (mattab)

  • Resolution fixed deleted
  • Status changed from closed to reopened

Thanks - could you please also send the documentation to add to: http://piwik.org/docs/tracking-api/reference/ to explain how to do bulk requests?

There will be several users: mobile app piwik anynomous tracking, iOs SDK, Cyril with the Log Import script. A doc will be appreciated for sure :)

comment:13 Changed 21 months ago by matt (mattab)

(In [6551]) Refs #3134 Allowing POST to trigger bulk request + requiring authentication for bulk requests

comment:14 Changed 21 months ago by matt (mattab)

  • Resolution set to fixed
  • Status changed from reopened to closed

Thanks, doc is online at: bulk tracking API requests

comment:15 Changed 21 months ago by matt (mattab)

(In [6552]) Refs #3134 Fixing bug thanks Jenkins

comment:16 Changed 21 months ago by capedfuzz (diosmosis)

(In [6581]) Refs #3134, fixing build & make sure PiwikTracker throws when token auth not set and bulk track attempted.

comment:17 Changed 21 months ago by matt (mattab)

(In [6586]) Doing general reset checks Refs #3134

I think that otherwise, if some requests were forcing the IP, the forced IP or datetime would have be enforeced for the next request since not reset.

comment:18 Changed 21 months ago by pmontana

Hi capedfuzz,
Is this enhancement integrated in the log importer python script in piwik 1.8.3 ?

comment:19 Changed 21 months ago by matt (mattab)

Is this enhancement integrated in the log importer python script in piwik 1.8.3 ?

not yet

comment:20 Changed 20 months ago by matt (mattab)

(In [6908]) Fix for "PHP Fatal error: [] operator not supported for strings in /var/www/piwik/libs/PiwikTracker/PiwikTracker.php on line 849"

when php5-curl is not installed Refs #3134

Note: See TracTickets for help on using tickets.