Log analytics list of improvements
|Reported by:||matt||Owned by:|
|Priority:||critical||Milestone:||2.x - The Great Piwik 2.x Backlog|
Description (last modified by matt)
In Piwik 1.8 we released the great new feature to import access logs and generate statistics.
The V1 release works very well (it was tracked in #703), but there are ideas to improve it. This ticket is a placeholder for all ideas and discussions related to the Log Analytics feature!
- Track non-bot activity only. When --enable-bots is not specified, it would be a nice improvement if we:
- exclude visits with more than 150 actions per visitorID to block crawlers (detected at the python level by counting requests for that IP in the queue)
- exclude visits that have no User Agent, or whose User Agent is nothing beyond the very basic ones used by all bots
- exclude all requests when one of the first ones is for /robots.txt -- if we see a robots.txt in the middle we could stop tracking subsequent requests
- check that /index.php?minimize_js=file.js is counted as a static file since it ends in .js
After that, bot and crawler detection would be much better.
- Support the Accept-Language header and forward it to Piwik via the &lang= parameter. That might also be useful to some users who need to use this data in a custom plugin.
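The heuristics listed above could be sketched roughly as follows. This is only an illustration, not code from the import script: the hit dictionaries, their keys (`ip`, `user_agent`, `url`), the function names and the list of static extensions are all assumptions. The check for "very basic" bot User Agents is left out because the ticket does not say which ones are meant.

```python
from collections import Counter
from urllib.parse import urlsplit

MAX_ACTIONS_PER_VISITOR = 150  # threshold proposed in the ticket
STATIC_EXTENSIONS = (".js", ".css", ".png", ".gif", ".jpg", ".ico")  # assumed list


def is_static(url):
    """Treat a URL as static if its path, or the full URL, ends in a static extension.

    Checking the full URL is what makes /index.php?minimize_js=file.js count
    as static, since only its query string ends in .js.
    """
    path = urlsplit(url).path.lower()
    return path.endswith(STATIC_EXTENSIONS) or url.lower().endswith(STATIC_EXTENSIONS)


def filter_bots(hits):
    """Return the hits that survive the bot heuristics.

    Each hit is a dict with the hypothetical keys 'ip', 'user_agent' and 'url'.
    """
    # Count requests per IP across the whole queue.
    counts = Counter(hit["ip"] for hit in hits)
    robots_seen = set()
    kept = []
    for hit in hits:
        ip = hit["ip"]
        # Once an IP requests /robots.txt, drop that request and all later ones.
        # If it is the first request, this drops every request from that IP.
        if urlsplit(hit["url"]).path == "/robots.txt":
            robots_seen.add(ip)
        if ip in robots_seen:
            continue
        # Drop hits without a User Agent.
        if not hit["user_agent"].strip():
            continue
        # Drop visitors with more actions than the threshold.
        if counts[ip] > MAX_ACTIONS_PER_VISITOR:
            continue
        kept.append(hit)
    return kept
```

Counting per IP over the whole queue means a crawler is dropped entirely, including its first 150 requests, rather than only from request 151 onwards.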
- We could make it easy to delete logs for one day in order to re-import one log file.
- This would be a new option in the Python script. It would reuse the code from the Log Delete feature, but would delete only one day. The Python script would call, for example, the CoreAdmin API to delete this single day for a given website. This would make it easy to re-import data that failed to import the first time or was bogus.
- Detect when log-lines are re-imported and only import them once.
- Implementation: add a new table piwik_log_lines (hash_tracking_request, day)
- In the Piwik Tracker, before looping over the bulk requests, SELECT all the log lines that have already been processed on this day (WHERE hash_tracking_request IN (a,b,c,d) AND day=?) and skip these requests during import
- After the bulk requests are processed in the piwik.php process, INSERT the (hash, day) pairs in bulk
- By default this feature would be enabled only for the "Log import" script,
- via a parameter that identifies the request as a log import (&li=1 / import_logs=1)
- but it may later be useful to all users of the Tracking API as a general deduping service.
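The SELECT / skip / bulk INSERT flow described above could look roughly like this. It is a minimal sketch under stated assumptions, not the tracker implementation: it uses Python with an in-memory SQLite database instead of the PHP tracker and its database, SHA-1 is an assumed choice of hash, and `request_hash` and `import_once` are hypothetical names. Only the table and column names come from the ticket.

```python
import hashlib
import sqlite3


def request_hash(request):
    """Hash one tracking request string (SHA-1 is an assumption)."""
    return hashlib.sha1(request.encode("utf-8")).hexdigest()


def import_once(conn, day, requests):
    """Import only the requests not yet seen for `day`; return those imported."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS piwik_log_lines ("
        "hash_tracking_request TEXT NOT NULL, day TEXT NOT NULL, "
        "PRIMARY KEY (hash_tracking_request, day))"
    )
    # Map hash -> request; this also drops duplicates inside the batch itself.
    hashes = {}
    for request in requests:
        hashes.setdefault(request_hash(request), request)
    if not hashes:
        return []

    # SELECT the hashes already processed on this day.
    # Note: very large batches would need chunking, since databases limit
    # the number of bound parameters per statement.
    placeholders = ",".join("?" * len(hashes))
    seen = {
        row[0]
        for row in conn.execute(
            "SELECT hash_tracking_request FROM piwik_log_lines "
            f"WHERE hash_tracking_request IN ({placeholders}) AND day = ?",
            [*hashes, day],
        )
    }
    fresh = [(h, r) for h, r in hashes.items() if h not in seen]

    # A real tracker would process the fresh requests here, then record them.
    conn.executemany(
        "INSERT INTO piwik_log_lines (hash_tracking_request, day) VALUES (?, ?)",
        [(h, day) for h, _ in fresh],
    )
    conn.commit()
    return [r for _, r in fresh]
```

For example, importing the same request twice for the same day returns it only the first time, while the same request on a different day is imported again because the lookup is scoped by day.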
How to debug performance? First of all, you can run the script with --dry-run to see how many log lines per second are parsed. It should typically be between 2,000 and 5,000. When you don't do a dry run, the script inserts new pageviews and visits by calling the Piwik API.
Change History (154)
comment:1 Changed 22 months ago by matt (mattab)
- Description modified (diff)
- Priority changed from normal to major