Opened 6 years ago

Closed 17 months ago

#134 closed New feature (fixed)

Bulk load Piwik logs with documented API: improved tracking performance, allow performance testing

Reported by: matt Owned by:
Priority: critical Milestone: 1.x - Piwik 1.x
Component: Performance Keywords:
Cc: Sensitive: no

Description (last modified by matt)

Currently data can be pushed in the database using the piwik.php script, called from the piwik.js tag.

For example an HTTP request that stores a page view in Piwik looks like:

http://piwik.org/demo/piwik.php?url=http%3A%2F%2Fpiwik.org%2F&action_name=&idsite=1&res=1440x900&h=16&m=22&s=20&fla=1&dir=0&qt=0&realp=0&pdf=1&wma=1&java=1&cookie=1&title=Piwik%20-%20Web%20analytics%20-%20Open%20source&urlref=
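
To make the parameters in the request above easier to read, here is a minimal Python sketch that decodes such a URL into a dictionary and rebuilds an equivalent request. The helper names `decode_tracking_request` and `build_tracking_request` are illustrative only and are not part of Piwik.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

# The example request from the ticket description.
EXAMPLE = ("http://piwik.org/demo/piwik.php?url=http%3A%2F%2Fpiwik.org%2F"
           "&action_name=&idsite=1&res=1440x900&h=16&m=22&s=20&fla=1&dir=0"
           "&qt=0&realp=0&pdf=1&wma=1&java=1&cookie=1"
           "&title=Piwik%20-%20Web%20analytics%20-%20Open%20source&urlref=")


def decode_tracking_request(request_url):
    """Return the tracking parameters of a piwik.php request as a dict."""
    query = urlsplit(request_url).query
    # keep_blank_values preserves empty parameters such as action_name and urlref
    parsed = parse_qs(query, keep_blank_values=True)
    return {name: values[0] for name, values in parsed.items()}


def build_tracking_request(endpoint, params):
    """Build a request URL from an endpoint and a dict of parameters."""
    return endpoint + "?" + urlencode(params)


params = decode_tracking_request(EXAMPLE)
```

Any client that can issue an HTTP GET (a mobile app, a PHP script, a desktop application) could build requests the same way.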

We want to improve the way Piwik logs data at tracking time.

  • Piwik will now create 'access logs', similar to apache access logs, containing all the REQUEST details (url, the 'piwik' cookie which depends on #409, user agent, referer url, IP, language, POST data, other $_SERVER, etc.).
  • Every 10s (or 30s or 1min) the 'Tracking Bulkloader' will be triggered by the Maintenance task (see #1184).
    • it will connect to the DB once, then read all log lines and process visits in memory,
    • creating flat files for the DB updates,
    • eventually using memcache for visits/pages/options/cookies data store,
    • then bulk inserts/updates visits/pages/conversions and cookies
  • The process would log reports such as: 2010-09-08 04:03:02 - Loaded 1405 visits, 45000 page views, 345 goals - Duration 32s - Logs since 2010-09-08 04:02:52. How do we handle timeout errors? Memory errors?
  • Persists status of 'bulk logs loaded' in piwik_option
  • disconnect DB
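
The bulk-loading steps listed above can be sketched roughly as follows. This is a simplified illustration, not the Piwik implementation. It assumes a hypothetical tab-separated log format (`visitor_id<TAB>url`), treats all hits from one visitor id as a single visit, and uses SQLite with made-up `visit` and `page_view` tables as a stand-in for the real schema.

```python
import sqlite3
import time
from collections import defaultdict


def make_demo_db():
    """Create an in-memory database with two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visit (visitor_id TEXT, pages INTEGER)")
    conn.execute("CREATE TABLE page_view (visitor_id TEXT, url TEXT)")
    return conn


def bulk_load(log_lines, conn):
    """Group log lines into visits in memory, then insert them in one transaction."""
    started = time.time()
    visits = defaultdict(list)  # visitor id -> list of page URLs
    for line in log_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        visitor_id, url = line.split("\t", 1)
        visits[visitor_id].append(url)

    visit_rows = [(vid, len(urls)) for vid, urls in visits.items()]
    page_rows = [(vid, url) for vid, urls in visits.items() for url in urls]
    with conn:  # a single transaction covers all bulk inserts
        conn.executemany(
            "INSERT INTO visit (visitor_id, pages) VALUES (?, ?)", visit_rows)
        conn.executemany(
            "INSERT INTO page_view (visitor_id, url) VALUES (?, ?)", page_rows)

    duration = time.time() - started
    return "Loaded %d visits, %d page views - Duration %.0fs" % (
        len(visit_rows), len(page_rows), duration)
```

The point of the design is that the database is touched once per batch rather than once per page view; timeout and memory handling, raised as open questions above, are not addressed in this sketch.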

The speed of tracking data load would be greatly improved and would compensate for the performance loss resulting from the cookie loss in #409.

The log replay script would work in several modes

  • replay a single visit using a HTTP request containing the input data (the content of one log line from the log file).
    • The script API to record stats would become public and be documented. This will allow any user to record stats in Piwik from any source (mobile phones, php apps, desktop apps, etc.).
  • replay a set of visits given a log filename, will read the log file and load it in the DB

Use cases for tracking logs loader

This way, we can use the log replay script in the following use cases:

  • Tracker performance improvement described above: Every 10s (or 30s or 1min) the 'Tracking Bulkloader' loads all logs at once, which is much faster than connecting to the DB/updates/inserts at every page view.
    • There will also need to be a new 'Super user' setting
      • Enable tracking logs bulk loader
      • Record visits in DB every 10s
      • Enable Memcached wrapper
  • Performance testing: replaying existing real logs makes it easier to test performance changes across releases and when doing code updates. See #2000

Notes

  • Maybe the tracking code has to be modified to remove the logic that 'selects' or tries to 'match a visitor md5config to a previous visit' as it can be done using the cookie store (#409) I believe.
  • Can we use existing code to parse Apache logs efficiently, and which would work with several log formats automatically (including Windows IIS formats)?
  • See also the DB schema; even if it's outdated, most fields and tables are unchanged.

Attachments (3)

ActiveTrack.zip (8.1 KB) - added by vipsoft 5 years ago.
Proposed plugin
contrib syslog - parser.pl (27.1 KB) - added by matt 2 years ago.
contrib syslog - Tracker.php (16.5 KB) - added by matt 2 years ago.


Change History (68)

comment:1 Changed 6 years ago by matt (mattab)

  • Description modified (diff)

comment:2 Changed 5 years ago by matt (mattab)

  • Milestone changed from Stable release to Future features

comment:3 Changed 5 years ago by vipsoft (robocoder)

  • Type changed from Bug to New feature

comment:4 Changed 5 years ago by matt (mattab)

  • Description modified (diff)

comment:5 Changed 5 years ago by matt (mattab)

  • Description modified (diff)

comment:6 Changed 5 years ago by matt (mattab)

  • Description modified (diff)

comment:7 Changed 5 years ago by matt (mattab)

  • Priority changed from major to critical

comment:8 Changed 5 years ago by matt (mattab)

  • Summary changed from API to push data into Piwik without using the javascript tag to API to push data into Piwik without using the javascript tag (to import data from logs, or any other data source)

comment:9 Changed 5 years ago by matt (mattab)

  • Description modified (diff)

comment:10 Changed 5 years ago by velmont

This would make Piwik easy to prefer over Analytics, as admins could then import years' worth of logs and data into Piwik and see the trends more easily.

I would be able to get the people at work to use Piwik instead of Google Analytics if I could import the common Apache logs into Piwik.

comment:11 Changed 5 years ago by vipsoft (robocoder)

(In [1330]) Fixes #876 - Piwik_Tracker_Visit should not validate uninitialized $this->request in constructor; make it possible to push data (in the absence of a REST API; see #134); fix duplicate function error messages when calling Piwik::printSqlProfilingReportTracker() more than once

comment:12 Changed 5 years ago by vipsoft (robocoder)

  • Milestone changed from Features requests - after Piwik 1.0 to 2- DigitalVibes

Will tackle in conjunction with #762.

comment:13 Changed 5 years ago by alivenk

comment:14 Changed 5 years ago by vipsoft (robocoder)

  • Owner set to vipsoft
  • Sensitive unset

comment:15 Changed 5 years ago by matt (mattab)

  • Milestone changed from 2- DigitalVibes to 2 - Piwik 0.8 - A Web Analytics platform

comment:16 Changed 5 years ago by domtop

Changed 5 years ago by vipsoft (robocoder)

Proposed plugin

comment:17 Changed 5 years ago by vipsoft (robocoder)

Proposed plugin attached.

  • API.php uses the Tracker.getNewVisitObject hook
  • ignore Tracker.php and API-alternate.php; these follow the implementation model of the (fake) visit generator and are included for comparison only
  • @todo move translation(s) in lang/en.php into core
  • @todo add test of setVisitInformation()
  • @todo add return value to API methods and Piwik_API_Request() variant of existing tests

comment:18 Changed 5 years ago by vipsoft (robocoder)

  • @todo investigate extending to handle reqts of TrackerSecondaryDb's replay script.

comment:19 Changed 5 years ago by matt (mattab)

really interesting!

  • in the API, I think there is an issue with the setVisitInformation - in a REST api because there is no state, all parameters need to be passed at each request, so we might need to add cookies/ip/timestamp in each function.. this is quite ugly but I'm not sure how else it could work, without introducing token responses, caching, etc...
  • getVisitObject move to the Controller instead of API?

Before putting the plugin in core, I think we should do the cleanup of hook names and come up with a consistent naming.

Please post here your updated versions - really looking forward to have a better look.

comment:20 Changed 5 years ago by vipsoft (robocoder)

  • Owner vipsoft deleted

comment:21 Changed 4 years ago by piwikxx

This would be extraordinarily helpful when tracking webservices.
I'd even suggest adding the remote IP as a parameter.

It would make tracking very easy in cases like this:
Client-Application <-----interacts with ------> xml webservice ------> sends data to piwik in order to track webservice usage

comment:22 Changed 4 years ago by matt (mattab)

  • Description modified (diff)

comment:23 Changed 4 years ago by matt (mattab)

  • Description modified (diff)

comment:24 Changed 4 years ago by matt (mattab)

vipsoft, did you try this plugin with real traffic and see if it was working? How does it deal with cookies? Would it work in the real case of a traffic replay from apache logs?

comment:25 Changed 4 years ago by set

I'm using this plugin with a python script that parses through apache logs with real traffic. It seems to ignore the timestamp and all visits are recorded on the current server time. Has anyone encountered this?

comment:26 Changed 4 years ago by matt (mattab)

set, I think this is a bug in the code. Also, I expect it doesn't deal well with cookies and therefore would lead to invalid statistics. This is an alpha release, not ready for production.

comment:28 Changed 4 years ago by matt (mattab)

  • Description modified (diff)

comment:29 Changed 4 years ago by matt (mattab)

  • Description modified (diff)

comment:30 Changed 4 years ago by matt (mattab)

  • Summary changed from API to push data into Piwik without using the javascript tag (to import data from logs, or any other data source) to Bulk load Piwik logs with documented API: improved tracking performance, allow performance testing

comment:31 Changed 4 years ago by matt (mattab)

  • Milestone changed from 2 - Piwik 0.8 - A Web Analytics platform to 1 - Piwik 0.7 - DigitalVibes

comment:32 Changed 4 years ago by matt (mattab)

  • Description modified (diff)

comment:33 Changed 4 years ago by halfdan

Will this module not create race conditions due to parallel writing to the logfile? Unlike Apache, we don't keep the file open the whole time: the file gets opened and closed every time piwik.php is requested and writes data to it. Normal file system implementations don't care about simultaneous writes, so we're going to lose some log lines on high-load Piwik sites.

Even if there were a locking mechanism in the file system, it would probably hurt the server load with high stat(), wait() and I/O.

comment:34 Changed 4 years ago by matt (mattab)

  • Milestone changed from 1 - Piwik 0.7 - DigitalVibes to Features requests - after Piwik 1.0

moving to post 1.0 as it requires significant time / testing.

comment:35 Changed 3 years ago by matt (mattab)

halfdan, I think we will have to lock the file before writing a new line into it, then release lock. Hopefully there is a large margin before this ends up causing performance issues.

comment:36 Changed 3 years ago by matt (mattab)

I just did a quick test: with 200 concurrent requests, 1000 total requests, calling a PHP file that acquires a lock and appends 8k to a log file. If the lock is non-blocking, it fails to acquire it in approx 50% of requests, and 1000 requests take ~0.15s.
If the lock is blocking, 1000 requests take 0.15 sec.

So the conclusion is that we are safe to write to the log file as long as we lock; this is not going to be the bottleneck.

I used this script with ab

    $fp = fopen("foo.txt", "a");
    // Blocking exclusive lock; use LOCK_EX | LOCK_NB for the non-blocking variant
    if (flock($fp, LOCK_EX)) {
        fwrite($fp, str_repeat(rand(1, 9), 4 * 1024) . "\n");
        flock($fp, LOCK_UN);
    } else {
        print "Could not get lock!\n";
    }
    fclose($fp);
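
For readers who want to repeat this kind of test outside PHP, the same lock-then-append pattern can be sketched in Python. This is an illustrative equivalent, not the script used for the measurements above. It relies on `fcntl.flock`, which is only available on Unix-like systems, and the function name `append_locked` is made up for the example.

```python
import fcntl


def append_locked(path, payload):
    """Append one line to path while holding an exclusive advisory lock."""
    with open(path, "a") as fp:
        # Blocks until the lock is available; LOCK_EX | LOCK_NB would raise
        # an error instead of waiting when another writer holds the lock.
        fcntl.flock(fp, fcntl.LOCK_EX)
        try:
            fp.write(payload + "\n")
            fp.flush()  # make sure the data is written before unlocking
        finally:
            fcntl.flock(fp, fcntl.LOCK_UN)
```

The lock is advisory, so it only protects against writers that also call `flock` on the same file.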

comment:37 Changed 3 years ago by matt (mattab)

Random thought: to be safe implementing this change, it would be nice to research code coverage status of our tracker unit tests. I believe it could be pretty easy to cover close to 100% but we just don't know what is not tested with our current integration tests.

comment:38 Changed 3 years ago by matt (mattab)

  • Description modified (diff)

comment:39 Changed 3 years ago by matt (mattab)

Once we implement bulk loading, hopefully we can get rid of the Tracker/DB custom classes and use Zend all the time?

comment:40 Changed 3 years ago by matt (mattab)

  • Milestone changed from Feature requests to 1.x - Piwik 1.x

comment:41 Changed 3 years ago by fil

I need this (just mentioning it to follow the ticket changes :-) )

comment:42 Changed 3 years ago by kaystrobach

hello guys,

i'm currently working on an import script.
http://forge.typo3.org/issues/11791
http://forge.typo3.org/projects/extension-piwikintegration/repository/show/trunk/piwik_patches/plugins/KSVisitorImport

It takes a very long time to import just the logfiles from a single day.
I used the VisitorGenerator as a base.

Do you know any way to speed that up?
I would like to have something like:

$entry = array(
    'fullString'    => $matches[0],
    'remoteHost'    => '',
    'identUser'     => '',
    'authUser'      => '',
    'unixTimestamp' => strtotime(''),
    'fullDateTime'  => '',
    'date'          => '',
    'time'          => '',
    'h'             => '',
    'm'             => '',
    's'             => '',
    'timezone'      => '',
    'method'        => '',
    'url'           => '',
    'protocol'      => '',
    'status'        => '',
    'bytes'         => '',
    'referrer'      => '',
    'userAgent'     => '',
    'siteName'      => preg_replace('/[\/\.]+/', ' ', ''),
    'idsite'        => $this->idSite,
);
Piwiktracker::track($entry);

It would be even better to have the chance to disable some plugins using that function.
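
As an illustration of how an entry like the one above could be filled from a log line, here is a minimal Python sketch that parses the Apache "combined" log format with a regular expression. The field names mirror the array keys in this comment; the regular expression and the `parse_line` helper are assumptions made for the example and are not taken from KSVisitorImport.

```python
import re
from datetime import datetime

# Apache "combined" format: host ident user [time] "request" status bytes "referer" "agent"
COMBINED = re.compile(
    r'(?P<remoteHost>\S+) (?P<identUser>\S+) (?P<authUser>\S+) '
    r'\[(?P<fullDateTime>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<userAgent>[^"]*)"'
)


def parse_line(line, idsite):
    """Return a dict of log fields, or None if the line does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Timestamps look like 10/Oct/2000:13:55:36 -0700
    when = datetime.strptime(entry["fullDateTime"], "%d/%b/%Y:%H:%M:%S %z")
    entry["unixTimestamp"] = int(when.timestamp())
    entry["idsite"] = idsite
    return entry
```

Custom log formats would need their own pattern; this sketch handles only the one format and returns `None` for anything else.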

Thanks
Kay

comment:43 Changed 3 years ago by vipsoft (robocoder)

I assume during import, there is no browser-triggered archiving or scheduled tasks.

It's been a while since I've looked at the tracker performance (the biggest issue for me was a memleak), so it would be timely to review, since Piwik 1.2 will have first party cookies.

It would be nice if the importer could also parse a custom log format (if defined), or handle the tables created by mod_log_dbd or mod_log_sql.

comment:44 Changed 3 years ago by kaystrobach

The KSVisitorImport Plugin can be easily extended with various importer plugins.

Currently it can import 2 types of Apache logs and has a dummy class for importing Google data (although this data is not actually imported).

Additionally, I started to integrate an ajax-based progress bar to show that the process is still alive, since it runs quite long.

http://forge.typo3.org/projects/extension-piwikintegration/repository/show/trunk/piwik_patches/plugins/KSVisitorImport

regards
Kay

comment:45 Changed 3 years ago by jdhildeb

I have been experimenting with the KSVisitorImport. In the imports I have run, my system shows low CPU usage and low IO, so I suspect the performance bottleneck is network IO, probably DNS lookups.

If this theory is correct, then DNS caching may yield a performance boost.

comment:46 follow-up: Changed 3 years ago by vipsoft (robocoder)

The gethostbyaddr() call in the Provider plugin can be a bottleneck. There are some interesting ideas in the user contributed notes for http://php.net/gethostbyaddr that we could (1) test for during installation to see which methods work and how well for a given environment, and (2) use the best method at runtime.

comment:47 Changed 3 years ago by jdhildeb

Thanks for the tip. Commenting out the gethostbyaddr() call resulted in a 60x performance increase.

It should be possible to substantially speed up the DNS lookups by using multiplexing (see http://wezfurlong.org/blog/2005/may/guru-multiplexing), i.e. doing a bunch of DNS lookups in parallel. So instead of waiting 1-3 seconds for each lookup, you run 10 at a time in parallel, and then it only costs you 0.1-0.3 seconds per lookup on average.

I'm going to experiment with this.
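
A rough Python sketch of the idea of resolving many addresses concurrently is shown below. Note that it uses a thread pool rather than the socket multiplexing technique from the linked article, and the names `reverse_lookup` and `resolve_all` are made up for this illustration.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def reverse_lookup(ip):
    """Return the host name for ip, or the ip itself when it does not resolve."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # socket.herror and socket.gaierror are subclasses of OSError
        return ip


def resolve_all(ips, resolver=reverse_lookup, workers=10):
    """Resolve each distinct address once, running up to `workers` lookups at a time."""
    unique = list(dict.fromkeys(ips))  # drop duplicates, keep first-seen order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(unique, pool.map(resolver, unique)))
```

With 10 workers, ten slow lookups overlap instead of running back to back, which is the effect described in the comment above. The `resolver` parameter exists so the lookup function can be swapped out, for instance in tests.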

comment:48 follow-up: Changed 3 years ago by matt (mattab)

jdhildeb, this is very interesting. We will very soon (few months) load logs in parallel, and I didn't take into consideration that this might be a bottleneck. Alternatively, we can do the DNS lookup on the piwik.php hit and log it. This might be easier and will not cause issues with DNS slowness until we reach many requests per second. Also, DNS results should be cached locally on the server itself (not sure what this is called again...)

comment:49 Changed 3 years ago by kaystrobach

hi guys,

to avoid problems with the anon plugin we should do the following:

first hit:

-> gethostbyip
-> geoip
-> provider
-> ...
-> md5 or something similar the ip and the hostname
-> store complete record set in a cache e.g in database with a valid_until date (e.g. 2-12 hours)

next hit with same ip

-> find record in db and use it if valid until is not expired.
-> skip geoip and provider, and gethostbyip ;)

What do you think?
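
The caching scheme proposed in this comment could look roughly like the following Python sketch. It keeps records in memory rather than in the database, and the class name `VisitorInfoCache`, its methods, and the `compute` callback (standing in for the gethostbyip/geoip/provider steps) are all hypothetical.

```python
import hashlib
import time


class VisitorInfoCache:
    """Cache per-IP lookup results until a valid_until timestamp passes."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock
        self.records = {}  # md5(ip) -> (valid_until, record)

    @staticmethod
    def _key(ip):
        # Store a hash instead of the raw address, as suggested above
        return hashlib.md5(ip.encode("ascii")).hexdigest()

    def get_or_compute(self, ip, compute):
        """Return the cached record for ip, calling compute(ip) on a miss or expiry."""
        key = self._key(ip)
        now = self.clock()
        cached = self.records.get(key)
        if cached is not None and cached[0] > now:
            return cached[1]  # still valid: skip the expensive lookups
        record = compute(ip)
        self.records[key] = (now + self.ttl, record)
        return record
```

A time-to-live of 2 to 12 hours, as mentioned in the comment, would be passed as `ttl_seconds=2 * 3600` up to `12 * 3600`.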

regards
Kay

comment:50 Changed 3 years ago by matt (mattab)

Kay, no this is what DNS cache server is for..

comment:51 Changed 3 years ago by kaystrobach

We will very soon (few months) load logs in parallel

Is somebody working on an importer? It sounds like it.
If so, I think it would be even better to centralize the development ;) and make one really great importer instead of two great ones ;)

Kay, no this is what DNS cache server is for..

You're right, but the import shows that this would improve performance drastically, especially when additional information like geoip and provider, and perhaps other calculated data, is cached at this point; it doesn't need to be recalculated within a few minutes ;) as it is relatively static.

comment:52 Changed 3 years ago by vipsoft (robocoder)

re: comment:48

Right now, gethostbyaddr() is called for a new visit. I'm opposed to moving it to piwik.php because it would mean calling gethostbyaddr() on every request.

  • Where PHP's gethostbyaddr() is able to resolve the IP address, it typically takes about 0.04 seconds on my box/network.
  • However, if PHP's gethostbyaddr() is unable to resolve the IP address, it can take a long time (e.g., 5 seconds), e.g.,
    time php -r 'var_dump(gethostbyaddr("58.218.199.147"));'
    
  • Moreover, it doesn't appear that the failure is cached, so it's always a "miss" -- i.e., a subsequent call to gethostbyaddr() for the same IP address will take just as long.

---

I've created ticket #2152 to investigate performance enhancements for gethostbyaddr().

comment:53 Changed 3 years ago by vipsoft (robocoder)

re: comment:51

Before we add more caching to the tracker, we need to follow-up on #735 (tracker mem leak).

Caching geoip and provider information may have limited benefit because these plugins are called for a new visit (session).

comment:54 in reply to: ↑ 48 Changed 3 years ago by jdhildeb

Replying to matt:

jdhildeb, this is very interesting. We will very soon (few months) load logs in parallel

Matt, what do you mean by "in parallel"? What are your plans and approach?

If piwik is logging visits from multiple websites, this will not likely be a bottleneck, as each webserver request for piwik.php (ultimately calling visit->handle(), and then invoking plugins ) is handled by a different process or thread (depending on webserver and configuration), and so your DNS lookups are inherently parallelized.

However in the case of KSVisitImport (or presumably any other importer) the hits are read from an apache log file and are all processed sequentially in a single PHP request. In this model we must wait for one DNS lookup to complete before the next record can be imported. This is the bottleneck I'm bumping into.

The optimization I have in mind is along the lines of:

  • in core, create a gethostbyaddr cache. This cache would only live for the duration of the current PHP script and would be stored in memory only (it would not be stored in the database).
  • importer can scan the import file and perform the DNS lookups in advance (using the multiplexing technique I described earlier), and pre-seed the cache.
  • then importer processes the records (ultimately calling visit->handle() for each record)
  • Provider plugin checks the cache before calling gethostbyaddr, saving lookup time.
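
The optimization outlined in the list above might be sketched like this in Python. It is only an illustration of the data flow: `HostnameCache` and `import_records` are invented names, and the `resolver` argument stands in for gethostbyaddr.

```python
class HostnameCache:
    """In-memory host name cache that lives only as long as the import script."""

    def __init__(self, resolver):
        self.resolver = resolver
        self.names = {}

    def seed(self, resolved):
        """Pre-seed the cache with lookups done in advance (ip -> host name)."""
        self.names.update(resolved)

    def lookup(self, ip):
        """Return the cached name, resolving and remembering it on a miss."""
        if ip not in self.names:
            # Failed lookups are stored too, so a slow miss is paid only once
            self.names[ip] = self.resolver(ip)
        return self.names[ip]


def import_records(records, cache):
    """Attach a host name to every record, consulting the cache first."""
    return [dict(record, hostname=cache.lookup(record["ip"])) for record in records]
```

Because the cache is a plain dictionary, nothing is persisted between runs, which matches the "memory only" constraint in the first bullet.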

comment:55 Changed 3 years ago by matt (mattab)

jdhildeb, the idea of the bulk import will be to keep using piwik.php as it is, but instead of logging in the DB, it will log a "pretty log" in a file, a bit like an apache log but enriched with all the data piwik.php receives.

  • Regarding the DNS lookups: my idea was to have the server do the DNS lookup before recording in the file, so that when we load data in bulk, we don't have to do any DNS requests.

Alternatively, we could disable the DNS lookup since it is not so useful and costly..

  • Also, an idea is to make the queue logging system very trivial so that other scripts could directly log to the queue from apache/piwik logs rather than go through piwik.php again
  • Queue system should be pluggable: file, Redis, or even MySQL driver available
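
One way to picture a pluggable queue is a small interface with interchangeable backends, as in the Python sketch below. The class names `TrackingQueue`, `MemoryQueue` and `FileQueue` are hypothetical, and only an in-memory and a file backend are shown. The file backend does no locking, so a real implementation would need to add it.

```python
import json
from abc import ABC, abstractmethod


class TrackingQueue(ABC):
    """Minimal queue interface that a file, Redis or MySQL driver could implement."""

    @abstractmethod
    def push(self, request):
        """Store one tracking request (a dict of parameters)."""

    @abstractmethod
    def drain(self):
        """Return all queued requests in order and empty the queue."""


class MemoryQueue(TrackingQueue):
    def __init__(self):
        self.items = []

    def push(self, request):
        self.items.append(request)

    def drain(self):
        items, self.items = self.items, []
        return items


class FileQueue(TrackingQueue):
    """Stores one JSON document per line. No locking is done in this sketch."""

    def __init__(self, path):
        self.path = path

    def push(self, request):
        with open(self.path, "a") as fp:
            fp.write(json.dumps(request) + "\n")

    def drain(self):
        try:
            with open(self.path, "r+") as fp:
                requests = [json.loads(line) for line in fp if line.strip()]
                fp.seek(0)
                fp.truncate()  # empty the file once everything has been read
        except FileNotFoundError:
            return []
        return requests
```

Other scripts could write to the queue directly through `push`, without going through piwik.php, which is the second idea in the list above.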

comment:56 Changed 2 years ago by lorieri

Hi, I'm new to Piwik, just installed it today. We have a big concern about logging the requests: we can't rely on MySQL, not because of the daily amount of requests, but because of the huge amount we get during odd peaks, like TV advertising. All we need are 2 scripts, one to print/export requests very fast, and another to load those requests. That way we can choose the method we will use to import data: syslogd, queues, etc. It is important to have the timestamp and the requester IP.

We think we will need to set the archive to always run with a 10-minute delay to make sure the queue has no past queries left unprocessed. Is that right/possible?

We saw some discussion on MySQL in the forums, and we do like MySQL and its performance, but if you have thousands of connections at the same time, you have to multiply the amount of memory you configured per connection in MySQL by your max_connections; you can easily run out of slots or run out of memory.
If you run any report during that time and lock a table, even for a very short time, those thousands of connections accumulate.
We saw people talking about "insert delayed", but unfortunately it works only for MyISAM tables.
We also want to try setting a minimum amount of memory and very low wait and connection timeouts for MySQL connections during the insert queries, so that we do not affect the archive and report scripts. Especially sort_buffer_size, read_buffer_size, tmp_table_size. Any others?

Of course we need more knowledge of Piwik; maybe it would be easy with the API. We would appreciate any good ideas for those 2 scripts, and we will keep you guys informed if we make any progress.

So far, our best guess is to hack piwik.php to send a 200 header followed by an ob_end_flush and leave the rest running in the background, and somehow log the MySQL query or the parameters sent to the PHP script in a queue. Later, run the MySQL query or reproduce the PHP request against the original piwik.php, using some kind of trick in PHP FastCGI to pass the original parameters.

We think we will also have to install lots of Piwik virtualhosts to be able to spread each website across different databases and tables. Maybe we can use MySQL partitioning for that, let's try, but it would be great if each website had its own tables.

Our main goal is to not change, or change minimal lines of piwik code.

Thanks a lot for the initiative :)

comment:57 Changed 2 years ago by matt (mattab)

lorieri, please contact me at matt att piwik.org and we can discuss these things. We are hoping to work on Scalable Piwik in the next few months and will create a brainstorm group. Email me with your traffic details and use case. Cheers

comment:58 Changed 2 years ago by lorieri

Hi guys,

I'm not sure if I can give the exact number, but you can see the pattern in slide #19 of this presentation: http://www.slideshare.net/rgaiser/r7-no-aws-qcon-sp-2011

comment:60 Changed 2 years ago by lorieri

Hi,
still studying piwik, now we are trying to understand the archive process.

thanks


comment:61 in reply to: ↑ 46 Changed 2 years ago by lorieri

Replying to vipsoft:

I think Piwik should resolve names only after the archive run, and aggregate similar IPs.

comment:62 Changed 2 years ago by matt (mattab)

From email

There was a problem registering visitors at a higher load.
Such problem has already been described in http://dev.piwik.org/trac/ticket/134.
We solved this problem by recording visitors in syslog and
processing of the log. Perhaps that can help you. 
Don't hesitate to contact me if you have any questions!

See attached file.

comment:63 Changed 21 months ago by matt (mattab)

The API to import Bulk Visits and Pages in Piwik is now implemented! See: #3134

It's now possible to import e.g. 100 visits or 1000 pageviews in one POST HTTP request.
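
As a hedged illustration of what such a request body might look like, the Python sketch below builds a JSON payload with a `requests` array of query strings and an optional `token_auth`. This layout reflects my understanding of the bulk tracking format and should be checked against the Tracking API reference before use. The function name `build_bulk_payload` is made up, and the sketch only builds the body without sending anything.

```python
import json
from urllib.parse import urlencode


def build_bulk_payload(hits, token_auth=None):
    """Build a JSON body for a bulk tracking POST from a list of parameter dicts."""
    # Assumption: each hit becomes a query string starting with "?" and
    # carries rec=1, as for single tracking requests.
    requests = ["?" + urlencode(dict(hit, rec=1)) for hit in hits]
    payload = {"requests": requests}
    if token_auth is not None:
        payload["token_auth"] = token_auth
    return json.dumps(payload)
```

The resulting string would be sent as the body of one POST to piwik.php, so a batch of page views costs a single HTTP round trip.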

comment:64 Changed 18 months ago by matt (mattab)

  • Component changed from Core to Performance

comment:65 Changed 17 months ago by matt (mattab)

  • Resolution set to fixed
  • Status changed from new to closed

The feature is implemented.

We have integration tests as well (ImportLogsTest)

See the Tracking API reference in: http://piwik.org/docs/tracking-api/reference/

in particular the section "Advanced: Bulk Tracking Requests"
