Faster & Reliable Tracking: piwik.php asynchronous tracking import, by replaying piwik.php access logs every N minutes #3632

Closed
mattab opened this issue Dec 21, 2012 · 24 comments
Labels: c: Performance, Enhancement

@mattab
Member

mattab commented Dec 21, 2012

This is a performance improvement ticket.

  • piwik.php can be prevented from touching the database (i.e. no SELECT/UPDATE/INSERT). See the FAQ on enabling maintenance mode: http://piwik.org/faq/how-to/#faq_111
  • When this mode is enabled, the web server returns the transparent GIF very quickly. It also logs the piwik.php request in the Apache server logs. By design, this request contains the full set of information required for tracking.
  • We should provide a new cron script that replays the piwik.php server logs at a regular interval, for example every minute. It could use Piwik's Bulk Tracking feature to import, for example, 500 log lines at once.
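
The batching step in the list above can be sketched as follows. This is a minimal Python 3 sketch, not the actual import script: `chunked` and `build_bulk_payloads` are hypothetical names, and the JSON body (a `requests` list of query strings plus a `token_auth` field) reflects my understanding of the Bulk Tracking request format, so treat it as an assumption to verify against the tracking API documentation.

```python
import json
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def build_bulk_payloads(query_strings, token_auth, batch_size=500):
    """Group logged piwik.php query strings into JSON bodies, one per batch.

    The body layout is an assumption about the Bulk Tracking format.
    """
    for batch in chunked(query_strings, batch_size):
        yield json.dumps({
            # Each entry is the query string of one logged request.
            "requests": ["?" + qs.lstrip("?") for qs in batch],
            "token_auth": token_auth,
        })
```

Each yielded string would then be sent to piwik.php in a single POST, so 1,200 logged hits become three requests instead of 1,200.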

Summary:
Power users will be able to set up Piwik so that MySQL is not required for tracking.
A script will import the web server access logs into Piwik every N minutes (for example N=1 or N=60).
This script will use our log analytics tool to import the piwik.php requests.
Because the logs are loaded asynchronously, tracking won't be as "real time" as before, but the script could run every minute for near-real-time data.

This will decouple Piwik tracking from MySQL and make it more resilient, faster, and easier to scale.
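
To illustrate why the access log is enough, here is a hedged Python 3 sketch that pulls the tracking data out of one log line. It assumes the common "combined" log format; the regular expression and the `parse_tracker_line` helper are illustrative assumptions, not code from the log analytics tool.

```python
import re
from urllib.parse import urlsplit

# Assumed layout: Apache/nginx "combined" log format.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_tracker_line(line):
    """Return the fields needed for replay, or None if the line does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    target = urlsplit(match.group("target"))
    return {
        "ip": match.group("ip"),
        "date": match.group("date"),
        "path": target.path,
        "query_string": target.query,  # the original tracking parameters
        "user_agent": match.group("user_agent"),
    }
```

The query string carries the tracking parameters, while the IP, date, and user agent come from the log fields around it.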

@mattab
Member Author

mattab commented Dec 21, 2012

Thanks to Thomas Seifert, who provides hosted Piwik, for this patch!

@mattab
Member Author

mattab commented Dec 21, 2012

Attachment:
piwikphp_import.patch

@mattab
Member Author

mattab commented Dec 21, 2012

Thanks for the patch!

Code review / to-dos

  • If a 'cdt' parameter is found in the piwik.php request, do not overwrite it with the access log date; forward the original 'cdt'.
  • Same for 'cip': if it is found in the piwik.php request, forward it and don't use the log IP.
  • Likewise for 'ua', which should override the log user agent.
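
The precedence rule in these to-dos can be expressed in a few lines. This is a sketch under assumptions: `resolve_overrides` is a hypothetical helper, and the log date is assumed to be already converted to whatever format the tracker expects for 'cdt'.

```python
def resolve_overrides(query_args, log_date, log_ip, log_user_agent):
    """Prefer 'cdt', 'cip' and 'ua' from the original request over log values."""
    resolved = dict(query_args)  # copy, so the caller's dict is left untouched
    # setdefault only fills a key when the request did not already carry it.
    resolved.setdefault("cdt", log_date)
    resolved.setdefault("cip", log_ip)
    resolved.setdefault("ua", log_user_agent)
    return resolved
```

A request that already carries 'cdt' keeps it, while a request without 'cip' or 'ua' falls back to the values from the log line.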

@cbay
Contributor

cbay commented Dec 21, 2012

I'd rather do this in the Parser class than in the Recorder class, as it's really a parsing concern. Basically, the parser creates a Hit object that has all the properties the Recorder needs. I'll make changes so that the Hit object has an 'args' property where you can override anything you want.

query_arguments = urlparse.parse_qs(hit.query_string)
for k in query_arguments:
    query_arguments[k]=query_arguments[k].pop()
    query_arguments[k]=query_arguments[k].encode('raw_unicode_escape').decode('utf-8')

This is not very Pythonesque. Don't modify the dictionary as you iterate over it; instead, create a new one:

# Build a new dict instead of mutating query_arguments during iteration.
query_args = dict(
    (key, value.pop().encode('raw_unicode_escape').decode('utf-8'))
    for key, value in query_arguments.iteritems()
)
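
For comparison, a Python 3 version of the same idea might look like the sketch below. `flatten_query` is a hypothetical name; note that Python 3's `parse_qs` decodes percent-encoded bytes as UTF-8 by default, so the explicit encode/decode step is not part of this sketch.

```python
from urllib.parse import parse_qs


def flatten_query(query_string):
    """Map each parameter to its last value, without mutating the parsed dict."""
    parsed = parse_qs(query_string, keep_blank_values=True)
    return {key: values[-1] for key, values in parsed.items()}
```

Using `values[-1]` keeps the "last value wins" behavior of `pop()` while leaving the parsed lists intact.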

Why is it required to do the encode/decode? What format has the string initially?

@anonymous-matomo-user

> This is not very Pythonesque. Don't modify the dictionary as you iterate over it; instead, create a new one

I know. As I already told Matt, my language of choice is PHP and I'm just a beginner in Python, so you should really take a close look at it :). I'd love to see what you make of it.

> Why is it required to do the encode/decode? What format has the string initially?

Well, the string comes directly from the log file, and without that encode/decode I got garbled characters for page titles with German umlauts.
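
One plausible explanation, offered as an assumption rather than a diagnosis of the attached log: the UTF-8 bytes of the title ended up as one character per byte, and the encode step turns those characters back into bytes so they can be decoded as UTF-8. The Python 3 sketch below reproduces that round trip; `repair_mojibake` is a hypothetical helper.

```python
def repair_mojibake(text):
    """Undo 'UTF-8 bytes read as one character per byte', if that is what happened."""
    try:
        # latin-1 maps code points 0-255 straight back to single bytes;
        # 'raw_unicode_escape' does the same for that range.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Text that was not garbled this way is returned unchanged.
        return text
```

Strings that are already correct either pass through unchanged (plain ASCII) or fail the round trip and are returned as they are.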

Also, you really should remove the return True at the beginning of check_http_error. I added it so that some lines that were missing from the import would get through, but judging from the code around it, it can only be wrong.

One additional enhancement I have in mind would be to skip requests that are not piwik.php requests when replay is enabled.
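
Such a filter could be as small as the following sketch. The `should_import` function and its `replay_tracking` flag are hypothetical names chosen for illustration, not options of the import script.

```python
from urllib.parse import urlsplit


def should_import(request_target, replay_tracking):
    """In replay mode keep only piwik.php hits; otherwise keep everything."""
    if not replay_tracking:
        return True
    path = urlsplit(request_target).path
    # Compare the last path segment so /analytics/piwik.php also matches.
    return path.rsplit("/", 1)[-1] == "piwik.php"
```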

@cbay
Contributor

cbay commented Dec 21, 2012

Could you attach a log file with those umlauts?

@anonymous-matomo-user

Attachment: Anonymized logfile with german data + umlauts
mysnip-stats.log-22-00.anon.gz

@cbay
Contributor

cbay commented Dec 23, 2012

I've just committed a small change that will let you put your code in the Parser, as I suggested: you now have an 'args' attribute in the Hit object that you can use to override your args.
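
The split of responsibilities described here might look roughly like the sketch below. It is not the committed code: the `Hit` fields other than `args` and the `build_request_args` function are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    """What the parser hands to the recorder (field names are illustrative)."""
    ip: str
    date: str
    user_agent: str
    args: dict = field(default_factory=dict)  # parser-supplied overrides


def build_request_args(hit, site_id):
    """Recorder side: start from log-derived values, then apply hit.args."""
    request = {
        "idsite": site_id,
        "rec": "1",
        "cip": hit.ip,
        "cdt": hit.date,
        "ua": hit.user_agent,
    }
    request.update(hit.args)  # anything the parser set takes precedence
    return request
```

With this shape the recorder never needs to know where an override came from; the parser decides, and the recorder only merges.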

Regarding the Unicode thing, I'm not 100% sure this is the best way to handle it, but this is really not an easy one, as there are so many parameters to think about: the log file encoding, the HTTP encoding, etc. Are we sure we always have UTF-8 at this point? Does it depend on Piwik? Anyway, if it works that way, go for it ;)

Send me your next diff when it's ready so I can make a proper review before it gets committed.

@anonymous-matomo-user

I wasn't aware of this request/proposal, and I've already opened a topic on the forum.
http://forum.piwik.org/read.php?2,100735

@anonymous-matomo-user

Any update in the last two months? I am also looking into this, and it would be great to know the latest status of this ticket. Thanks!

@mattab
Member Author

mattab commented Feb 15, 2013

see pull request #28

@mattab
Member Author

mattab commented Feb 26, 2013

Reopening, pending:

@diosmosis
Member

Changeset [changeset:7e93e75012153e5f79d3bac98d9662e43b9df21d] refs this ticket.

@diosmosis
Member

In 5dca1a2: Refs #3632, add integration test for import script replay tracking.

@diosmosis
Member

In 4b54ac1: Refs #3632, Remove stray debugging code.

@diosmosis
Member

In a06d961: Fixes #3632, refactor/tweak pull request #34 merge.

@mattab
Member Author

mattab commented Mar 1, 2013

In 4d7f93b: Refs #3632 More Code coverage for log importer using multiple site IDs

@mattab
Member Author

mattab commented Feb 4, 2014

See the new FAQ: Scaling Piwik Tracking

@mattab
Member Author

mattab commented Mar 13, 2014

In 73f3346: Refs #3632 Adding more test cases for log replay functionality
Forcing the number of recorders and the recorders' max payload to 1, to prevent nondeterministic behavior (e.g. in Live.getLastVisitsDetails, the pageIdAction values may come back in random order if recorders import data in random thread order)

@mattab
Member Author

mattab commented Mar 13, 2014

In b6dfdc0: Refs #3632 test case that _idvc > 1 will set visitor returning status

@mattab
Member Author

mattab commented Mar 13, 2014

In fd13666: Refs #3632 Creating a test case: the same visitor (same IP + idvisitor) visits the website on two different days.
The visitor on the second day is marked as "new" because window_look_back_for_visitor is not set.

@mattab
Member Author

mattab commented Mar 13, 2014

In 029bb87: Refs #3632 New test case: replaying logs and forcing a window look back (forceLargeWindowLookBackForVisitor=1)
=> The visitor is now marked as "returning" as expected
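
The difference between these two test cases can be pictured with a deliberately simplified model. This is not Piwik's visitor recognition logic: `visitor_status` and its `look_back` parameter are hypothetical, and the only assumption is that a previous visit counts only if it falls inside the look-back window.

```python
from datetime import datetime, timedelta


def visitor_status(previous_visit, current_visit, look_back):
    """Classify a hit as 'new' or 'returning' in a simplified model."""
    if previous_visit is not None and current_visit - previous_visit <= look_back:
        return "returning"
    return "new"


day_one = datetime(2014, 3, 12, 10, 0)
day_two = datetime(2014, 3, 13, 10, 0)

# A short window does not reach back to the previous day.
short_window = visitor_status(day_one, day_two, timedelta(minutes=30))
# A large window does, so the same visitor is recognized.
large_window = visitor_status(day_one, day_two, timedelta(days=365))
```

In this model `short_window` is "new" and `large_window` is "returning", which mirrors the behavior the two test cases describe.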

@mattab
Member Author

mattab commented Mar 13, 2014

In 4328fd8: Refs #3632 Small refactor and support parameter forceLargeWindowLookBackForVisitor=1 in tests when replaying tracking logs

@mattab
Member Author

mattab commented Mar 14, 2014

In aae4433: Refs #3632 Check Engagement reports are properly set when replaying logs

@mattab mattab added this to the 1.11 - Piwik 1.11 milestone Jul 8, 2014
This issue was closed.