Ticket #134 (new New feature)

Opened 2 years ago

Last modified 6 weeks ago

Bulk load Piwik logs with documented API: improved tracking performance, allow performance testing

Reported by: matt Owned by:
Priority: critical Milestone: Features requests - after Piwik 1.0
Component: Core Keywords:
Cc: Sensitive: no

Description (last modified by matt) (diff)

Currently data can be pushed into the database using the piwik.php script, called from the piwik.js tag.

For example an HTTP request that stores a page view in Piwik looks like:

http://piwik.org/demo/piwik.php?url=http%3A%2F%2Fpiwik.org%2F&action_name=&idsite=1&res=1440x900&h=16&m=22&s=20&fla=1&dir=0&qt=0&realp=0&pdf=1&wma=1&java=1&cookie=1&title=Piwik%20-%20Web%20analytics%20-%20Open%20source&urlref=
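For illustration, a request like the one above could be assembled from any client, not only from piwik.js. The sketch below is a minimal example in Python; the helper name `build_tracking_url` is hypothetical, and only a subset of the parameters from the example URL is included. It builds the URL and does not send it.

```python
from urllib.parse import urlencode, quote

def build_tracking_url(base, page_url, idsite, title="", referrer=""):
    """Build a piwik.php request URL (hypothetical helper, not part of Piwik)."""
    # Parameter names are copied from the example request in this ticket.
    params = {
        "url": page_url,
        "action_name": "",
        "idsite": idsite,
        "title": title,
        "urlref": referrer,
    }
    # quote_via=quote encodes spaces as %20, as in the example URL.
    return base + "?" + urlencode(params, quote_via=quote)

tracking_url = build_tracking_url(
    "http://piwik.org/demo/piwik.php",
    "http://piwik.org/",
    1,
    title="Piwik - Web analytics - Open source",
)
print(tracking_url)
```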

We want to improve the way Piwik logs data at tracking time.

  • Piwik will now create 'access logs', similar to apache access logs, containing all the REQUEST details (url, the 'piwik' cookie which depends on #409, user agent, referer url, IP, language, POST data, other $_SERVER, etc.).
  • Every 10s (or 30s or 1min) the 'Tracking Bulkloader' will be triggered by the Maintenance task (see #1184).
    • it will connect to the DB once, then read all log lines and process visits in memory,
    • creating flat files for the DB updates,
    • eventually using memcache for visits/pages/options/cookies data store,
    • then bulk inserts/updates visits/pages/conversions and cookies
  • The process would log a report such as: 2010-09-08 04:03:02 - Loaded 1405 visits, 45000 page views, 345 goals - Duration 32s - Logs since 2010-09-08 04:02:52. Open question: how do we handle timeout errors and memory errors?
  • Persists status of 'bulk logs loaded' in piwik_option
  • disconnect DB
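The loader steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: sqlite3 stands in for Piwik's database, and the table names, columns and log record layout are made up for the example rather than taken from the Piwik schema. Memcache, flat files and error handling are left out.

```python
import sqlite3
import time
from collections import defaultdict

# Hypothetical log records: (visitor_id, timestamp, url).
# The real access-log format is not specified in this ticket.
LOG_RECORDS = [
    ("v1", "2010-09-08 04:02:53", "http://piwik.org/"),
    ("v1", "2010-09-08 04:02:58", "http://piwik.org/docs/"),
    ("v2", "2010-09-08 04:03:00", "http://piwik.org/"),
]

def bulk_load(conn, records):
    started = time.time()
    # Process visits in memory: group page views by visitor.
    visits = defaultdict(list)
    for visitor_id, timestamp, url in records:
        visits[visitor_id].append((timestamp, url))
    visit_rows = [
        (visitor_id, min(ts for ts, _ in pages), len(pages))
        for visitor_id, pages in visits.items()
    ]
    page_rows = [
        (visitor_id, ts, url)
        for visitor_id, pages in visits.items()
        for ts, url in pages
    ]
    # Bulk insert everything in a single transaction.
    with conn:
        conn.executemany("INSERT INTO visit VALUES (?, ?, ?)", visit_rows)
        conn.executemany("INSERT INTO page_view VALUES (?, ?, ?)", page_rows)
    return len(visit_rows), len(page_rows), time.time() - started

# Connect once, load, report, disconnect.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visit (visitor_id TEXT, first_action TEXT, actions INTEGER)")
conn.execute("CREATE TABLE page_view (visitor_id TEXT, server_time TEXT, url TEXT)")
n_visits, n_pages, duration = bulk_load(conn, LOG_RECORDS)
stored_pages = conn.execute("SELECT COUNT(*) FROM page_view").fetchone()[0]
print(f"Loaded {n_visits} visits, {n_pages} page views - Duration {duration:.0f}s")
conn.close()
```

Grouping in memory and inserting in one transaction is what avoids a DB round trip per page view; whether that holds up for large log files is exactly the memory/timeout question raised above.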

The speed of tracking data load would be greatly improved and would compensate for the performance loss resulting from the cookie loss in #409.

The log replay script would work in several modes

  • replay a single visit using a HTTP request containing the input data (the content of one log line from the log file).
    • The script API to record stats would become public and be documented. This will allow any user to record stats in Piwik from any source (mobile phones, php apps, desktop apps, etc.).
  • replay a set of visits given a log filename, will read the log file and load it in the DB

Use cases for tracking logs loader

This way, we can use the log replay script in the following use cases:

  • Tracker performance improvement described above: Every 10s (or 30s or 1min) the 'Tracking Bulkloader' loads all logs at once, which is much faster than connecting to the DB/updates/inserts at every page view.
    • There will also need to be a new 'Super user' setting
      • Enable tracking logs bulk loader
      • Record visits in DB every 10s
      • Enable Memcached wrapper
  • Performance testing: replaying existing real logs makes it easier to test performance changes between releases and when doing code updates.

Notes

  • Maybe the tracking code has to be modified to remove the logic that 'selects' or tries to 'match a visitor md5config to a previous visit' as it can be done using the cookie store (#409) I believe.
  • Can we use existing code to parse Apache logs efficiently, and which would work with several log formats automatically (including Windows IIS formats)?
  • See also the DB schema; even if it's outdated, most fields and tables are unchanged.
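On the log parsing question, here is a minimal sketch of parsing one line in the Apache "combined" log format with a regular expression. It handles only that one format (no automatic detection, no IIS), and the sample line and field names are invented for the example. Note that it keeps the original request timestamp, which a replay would need in order to record visits at the right time.

```python
import re
from datetime import datetime

# Apache "combined" format: host ident user [time] "request" status size "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    hit = match.groupdict()
    # Month names are parsed according to the current locale (English expected).
    hit["time"] = datetime.strptime(hit["time"], "%d/%b/%Y:%H:%M:%S %z")
    return hit

sample = (
    '127.0.0.1 - - [08/Sep/2010:04:02:52 +0000] '
    '"GET /index.html HTTP/1.1" 200 2326 '
    '"http://piwik.org/" "Mozilla/5.0"'
)
hit = parse_line(sample)
print(hit["ip"], hit["path"], hit["time"].isoformat())
```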

Attachments

ActiveTrack.zip (8.1 KB) - added by vipsoft 12 months ago.
Proposed plugin

Change History

Changed 2 years ago by matt

  • description modified (diff)

Changed 22 months ago by matt

  • milestone changed from Stable release to Future features

Changed 21 months ago by vipsoft

  • type changed from Bug to New feature

Changed 20 months ago by matt

  • description modified (diff)

Changed 20 months ago by matt

  • description modified (diff)

Changed 19 months ago by matt

  • description modified (diff)

Changed 18 months ago by matt

  • priority changed from major to critical

Changed 18 months ago by matt

  • summary changed from API to push data into Piwik without using the javascript tag to API to push data into Piwik without using the javascript tag (to import data from logs, or any other data source)

Changed 18 months ago by matt

  • description modified (diff)

Changed 16 months ago by velmont

This would make Piwik easy to prefer over Analytics, as admins could then import years' worth of logs and data into Piwik and see the trends more easily.

I would be able to get the people at work to use Piwik instead of Google Analytics if I could import the common Apache logs into Piwik.

Changed 14 months ago by vipsoft

(In [1330]) Fixes #876 - Piwik_Tracker_Visit should not validate uninitialized $this->request in constructor; make it possible to push data (in the absence of a REST API; see #134); fix duplicate function error messages when calling Piwik::printSqlProfilingReportTracker() more than once

Changed 13 months ago by vipsoft

  • milestone changed from Features requests - after Piwik 1.0 to 2- DigitalVibes

Will tackle in conjunction with #762.

Changed 13 months ago by alivenk

Changed 13 months ago by vipsoft

  • owner set to vipsoft
  • sensitive unset

Changed 13 months ago by matt

  • milestone changed from 2- DigitalVibes to 2 - Piwik 0.8 - A Web Analytics platform

Changed 13 months ago by domtop

Changed 12 months ago by vipsoft

Proposed plugin

Changed 12 months ago by vipsoft

Proposed plugin attached.

  • API.php uses the Tracker.getNewVisitObject hook
  • ignore Tracker.php and API-alternate.php; these follow the implementation model of the (fake) visit generator and are included for comparison only
  • @todo move translation(s) in lang/en.php into core
  • @todo add test of setVisitInformation()
  • @todo add return value to API methods and Piwik_API_Request() variant of existing tests

Changed 12 months ago by vipsoft

  • @todo investigate extending to handle the requirements of TrackerSecondaryDb's replay script.

Changed 12 months ago by matt

really interesting!

  • In the API, I think there is an issue with setVisitInformation: a REST API has no state, so all parameters need to be passed with each request, and we might need to add cookies/ip/timestamp to each function. This is quite ugly, but I'm not sure how else it could work without introducing token responses, caching, etc.
  • Move getVisitObject to the Controller instead of the API?

Before putting the plugin in core, I think we should do the cleanup of hook names and come up with a consistent naming.

Please post your updated versions here - really looking forward to having a better look.

Changed 11 months ago by vipsoft

  • owner vipsoft deleted

Changed 9 months ago by piwikxx

This would be extraordinarily helpful when tracking webservices. I'd even suggest adding the remote IP as a parameter.

It would make tracking very easy in cases like this:
Client-Application <-----interacts with ------> xml webservice ------> sends data to piwik in order to track webservice usage

Changed 8 months ago by matt

  • description modified (diff)

Changed 8 months ago by matt

  • description modified (diff)

Changed 8 months ago by matt

vipsoft, did you try this plugin with real traffic and see if it was working? How does it deal with cookies? Would it work in the real case of a traffic replay from apache logs?

Changed 7 months ago by set

I'm using this plugin with a python script that parses through apache logs with real traffic. It seems to ignore the timestamp and all visits are recorded on the current server time. Has anyone encountered this?

Changed 7 months ago by matt

set, I think this is a bug in the code. Also, I expect it doesn't deal well with cookies and therefore would lead to invalid statistics. This is an alpha release, not ready for production.

Changed 3 months ago by matt

See an example of a PHP function that creates a request to piwik.php

 http://www.burtonkent.com/piwik-tags/

 http://www.burtonkent.com/wp-content/uploads/piwik-tag.php

Changed 3 months ago by matt

  • description modified (diff)

Changed 3 months ago by matt

  • description modified (diff)

Changed 3 months ago by matt

  • summary changed from API to push data into Piwik without using the javascript tag (to import data from logs, or any other data source) to Bulk load Piwik logs with documented API: improved tracking performance, allow performance testing

Changed 2 months ago by matt

  • milestone changed from 2 - Piwik 0.8 - A Web Analytics platform to 1 - Piwik 0.7 - DigitalVibes

Changed 7 weeks ago by matt

  • description modified (diff)

Changed 7 weeks ago by halfdan

Will this module not create race conditions due to parallel writing to the logfile? Unlike Apache, we don't keep the file open the whole time - the file gets opened and closed every time piwik.php is requested and writes data to it. Normal file system implementations don't coordinate simultaneous writes, so we're going to lose some log lines on high-load Piwik sites.

Even if there were a locking mechanism in the file system, it would probably hurt server load with heavy stat(), wait() and I/O.
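One common way to reduce this risk without explicit locking is to open the log in append mode and emit each log line with a single write call. The sketch below illustrates that idea in Python; it is not what the proposed plugin does, and the file name and helper are hypothetical. It assumes a local POSIX file system and short lines; network file systems and very large writes may behave differently.

```python
import os
import tempfile
import threading

def append_line(path, line):
    """Append one log line using O_APPEND and a single write call."""
    data = (line.rstrip("\n") + "\n").encode("utf-8")
    # Open and close per call, mimicking one piwik.php request per hit.
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)

log_path = os.path.join(tempfile.mkdtemp(), "tracker.log")

def worker(n):
    for i in range(50):
        append_line(log_path, f"worker={n} hit={i}")

# Simulate concurrent tracker requests with threads.
threads = [threading.Thread(target=worker, args=(n,)) for n in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

with open(log_path, encoding="utf-8") as fh:
    lines = fh.read().splitlines()
print(len(lines), "lines written")
```

This does not address the I/O cost of opening and closing the file on every request, which remains a valid concern for high-traffic sites.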

Changed 6 weeks ago by matt

  • milestone changed from 1 - Piwik 0.7 - DigitalVibes to Features requests - after Piwik 1.0

moving to post 1.0 as it requires significant time / testing.
