Piwik an alternative to AWStats and Urchin, build server log import script #703

Closed
anonymous-matomo-user opened this issue May 12, 2009 · 144 comments
Labels
Critical Indicates the severity of an issue is very critical and the issue has a very high priority. Enhancement For new feature suggestions that enhance Matomo's capabilities or add a new report, new API etc.
Comments

@anonymous-matomo-user

Urchin Alternative: Import your server logs in Piwik, the Free web analytics platform!

See blog post Piwik alternative to Urchin for more information.

Piwik is an alternative to Urchin, and also to Webalizer and AWStats: with a Python script, you can now import web server logs (Apache, IIS, and more) into Piwik, instead of using JavaScript tracking.

Description
A Python script available in piwik/misc/log-analytics/ will parse server logs efficiently and automatically call the Piwik Tracking API to inject the visits/pageviews/downloads in Piwik.

How to install / how to use

  • Requires Piwik >= 1.7.2-rc2. Download the latest version from http://builds.piwik.org/?C=N;O=D
  • Requires at least Python 2.6
  • Requires one or many server log files, typically called access.log in Apache for example. These log files will be imported into Piwik.
  • You can also create a "test website" in Piwik to import all data into, rather than importing into your existing websites. Then use the option --idsite=X to force all data from the log files to be imported into this idsite
  • You can use the --dry-run option for a test run that neither tracks data nor creates new websites
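
A first trial run along these lines could be assembled as in the following Python sketch. The log file name, the Piwik URL and the site ID are placeholders, and the helper name build_import_command is made up for illustration; the flags themselves (--url, --idsite, --dry-run) are the ones mentioned in this ticket.

```python
def build_import_command(script, log_file, piwik_url, idsite=None, dry_run=True):
    """Build the argument list for one run of the import script."""
    cmd = [script, "--url=" + piwik_url]
    if idsite is not None:
        # Force every log line into one (test) website.
        cmd.append("--idsite=%d" % idsite)
    if dry_run:
        # Test run: nothing is tracked and no website is created.
        cmd.append("--dry-run")
    cmd.append(log_file)
    return cmd


if __name__ == "__main__":
    # Placeholder values; adjust them to your own installation.
    command = build_import_command(
        "misc/log-analytics/import_logs.py",
        "access.log",
        "http://piwik.local/",
        idsite=3,
    )
    print(" ".join(command))
    # To actually run it: import subprocess; subprocess.call(command)
```

Once the dry run looks right, calling the helper with dry_run=False gives the command for the real import.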

SEE FOLLOW UP TICKET #3163

How can you help?

  • Please use the script and report your feedback and bugs here
  • If you are a hacker yourself, please review the code and consider submitting performance optimizations or other improvements
  • If you are a webhost or web agency and wish to offer Piwik to hundreds of your customers, please contact us
  • Review the doc at Server log analytics

Tasks to do before final release

  • Test, test and test
  • Set up on demo.piwik.org in a new website
  • Check that all code review feedback has been addressed
  • Review Import Logs in Piwik doc page.
  • Decommission apache2piwik (update blog post)

Feature requests for V2 or later

SEE FOLLOW UP TICKET #3163

@mattab
Member

mattab commented Mar 14, 2012

First release of the script committed in [6046]

@mattab
Member

mattab commented Mar 15, 2012

(In [6051]) Refs #703

  • Adding README contributed by Cyril

@robocoder
Contributor

(In [6053]) refs #703 - propset eol-style

@oliverhumpage

Performance-wise: I've set up piwik in its own jail now, turned off unnecessary PHP extensions, tweaked apache, and enabled APC. If I use --recorders=48 I get good import speeds (at least at first) without the load average going too high. However, something odd happens, and some way through importing a log file the recorders drop off (I can see fewer and fewer apache processes too, so clearly it's just not being hit as much):

2846 lines parsed, 233 lines recorded, 233 records/sec
4372 lines parsed, 506 lines recorded, 273 records/sec
[...]
8300 lines parsed, 7570 lines recorded, 9 records/sec
8300 lines parsed, 7579 lines recorded, 9 records/sec
8300 lines parsed, 7588 lines recorded, 9 records/sec
8300 lines parsed, 7598 lines recorded, 10 records/sec

I don't think I have any weird throttling going on - any ideas what might be up? There's nothing else being output during the processing even with debugging on. The drop-off seems to start roughly half way through any given logfile.

@cbay
Contributor

cbay commented Mar 20, 2012

oliverhumpage: 48 is almost certainly too high, unless you have a 48-core machine. You shouldn't need to exceed the number of cores in your system; you may even want to stay a bit lower (as the import script and MySQL will run at the same time).

As for why your performance decreases over time, I don't know. What does a 'top' say? You'd have to find the bottleneck. It may be Apache, PHP, MySQL. On my system, I have a sustained 300 req/s for more than 3 hours.
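
As a rough starting point, the rule of thumb above (no more recorders than cores, minus some headroom) can be written as a small Python sketch. The function name and the reserved parameter are hypothetical; the right value for --recorders still has to be found by measuring on the actual machine.

```python
import multiprocessing


def suggested_recorders(cores=None, reserved=1):
    """Suggest a --recorders value: the core count minus some headroom.

    `reserved` is an assumed number of cores left free for the import
    script itself and for MySQL; it is not a value taken from the script.
    """
    if cores is None:
        cores = multiprocessing.cpu_count()
    return max(1, cores - reserved)


if __name__ == "__main__":
    print("--recorders=%d" % suggested_recorders())
```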

Regarding the static files excluded, we'll add an option to include those (disabled by default). I'm sure the whole importing process will get better over time, it's only the beginning :)

@mattab
Member

mattab commented Mar 20, 2012

(In [6070]) Refs #703 Removing images from "downloads", and improving TIP message in output debug

@mattab
Member

mattab commented Mar 20, 2012

(In [6071]) Refs #703 Improving help message as per Cyril feedback

@mattab
Member

mattab commented Mar 20, 2012

(In [6074]) Refs #703 Display response output when tracking request failed (this happens for example when debug is enabled in piwik.php)

@oliverhumpage

Replying to Cyril:

oliverhumpage: 48 is almost certainly too high, unless you have a 48-core machine. You shouldn't need to exceed the number of cores in your system; you may even want to stay a bit lower (as the import script and MySQL will run at the same time).

I did quite a few experiments, and eventually found that 40 is about right. This is a VM running on a high powered Dell R710, so although the OS only thinks it has 4 CPUs I don't know how things actually pan out. All I know is that the number of records/sec increases pretty much linearly with --recorders up until 40. E.g. if I run at 32, I get more like 200r/sec rather than 250+r/sec. A single recorder manages around 6-7r/sec. After 40 the benefits tail off.
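
The near-linear scaling can be checked with simple arithmetic. This sketch assumes perfectly linear scaling and takes 6.5 records/sec per recorder, the midpoint of the 6-7 range quoted above, as an assumption:

```python
def projected_rate(recorders, per_recorder=6.5):
    """Projected records/sec if throughput scaled perfectly linearly."""
    return recorders * per_recorder


if __name__ == "__main__":
    for count in (1, 32, 40):
        print("%2d recorders: about %.0f records/sec" % (count, projected_rate(count)))
```

Under that assumption, 32 recorders give 208 records/sec and 40 recorders give 260 records/sec, which is in line with the measured figures of roughly 200 and 250+.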

I also tried a few experiments to see where the bottleneck might lie, for instance I stuck in a mod_rewrite to send the importer to a basic PHP file that just returned the .gif without doing any processing, but weirdly the performance was about the same. However, running with --dry-run (or just removing the line which actually calls the script) means the python script runs at around 4000r/sec, so I can only conclude the limit is in apache/php (putting in APC definitely helped). I also tried hacking the script to run a PHP wrapper script that called piwik.php directly on the command line, but it went horribly slowly, presumably because of the lag in loading up PHP.

Anyway, I'm happy with 250-300r/sec. I may set up a separate VM with a tweaked kernel and optimised apache to deal with log imports anyway, so I'm sure I can improve on that figure.

Regarding the steady tailing-off, what I'm wondering is: when you specify lots of recorders, do they each grab an equal number of log lines at the start and then work through them? That would explain why some finish earlier than others (e.g. a recorder that gets a lot of non-loggable lines would finish sooner). I notice the number of apache processes starts tailing off around half to 2/3 of the way through the log, and then just steadily declines until only 1 recorder is left.

Regarding the static files excluded, we'll add an option to include those (disabled by default). I'm sure the whole importing process will get better over time, it's only the beginning :)

That'd be brilliant, thank you. Thanks to you all for being so responsive in general too.

Oliver.

@mattab
Member

mattab commented Mar 21, 2012

FYI, the new 1.7.2-rc4 has been released and includes the most up-to-date code. Download it from: http://builds.piwik.org/?C=N;O=D

@mattab
Member

mattab commented Mar 21, 2012

oliverhumpage, thanks for your comments, they are very interesting!
Since you seem keen, maybe you could run XHProf, the Facebook PHP profiler: http://pecl.php.net/package/xhprof

I haven't run it for a long time, and never under a load as high as 300 req/s, so it would be very interesting. If you install it, I would love to see the reports it generates! The last time we ran XHProf on Piwik we found 2-3 quick fixes that made things a lot better. I'm sure we can make the tracker faster in many ways.

It would also be good to know the share of resource consumption of Apache/PHP vs. MySQL (I'm not sure of the best way to measure this, however).

@cbay
Contributor

cbay commented Mar 21, 2012

oliverhumpage: regarding the recorders, each request is dispatched to a specific recorder based on its IP address. This means that if the IP address distribution in your log files isn't even, some recorders will have more work to do than others, which could explain the performance issues you're having, especially near the end of the import process.

This dispatching was required to make sure requests are imported in the correct order.
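
The effect of this dispatching can be illustrated with a short Python sketch. The exact hashing used by the script is not shown in this ticket, so CRC32 modulo the recorder count is used here purely as a stand-in, and both function names are made up.

```python
import zlib
from collections import Counter


def recorder_for(ip, recorders):
    """Map an IP address to a recorder index.

    The same IP always lands on the same recorder, so the requests of
    one visitor are processed in log order by a single worker.
    """
    return zlib.crc32(ip.encode("ascii")) % recorders


def queue_sizes(ips, recorders):
    """Count how many requests each recorder would receive."""
    counts = Counter(recorder_for(ip, recorders) for ip in ips)
    return [counts.get(index, 0) for index in range(recorders)]


if __name__ == "__main__":
    # One very busy IP (e.g. a proxy) plus a few occasional visitors.
    ips = ["9.8.7.6"] * 90 + ["1.2.3.%d" % i for i in range(10)]
    print(queue_sizes(ips, 4))
```

With such a skewed log, one recorder receives at least 90 of the 100 requests while the others finish early, which matches the pattern of recorders dropping off one by one.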

@oliverhumpage

Actually, I do have one small request for piwik itself.

Would it be possible to choose on the fly between multiple database options: you see, I'm using one physical install of piwik at 2 different URLs - one for JS-based sites, and one for log-based, and therefore also 2 different sets of db tables so that --add-sites-new-hosts on the log-based system doesn't interfere with the JS websites (they'd have the same URLs). What I've done atm is set an environment var in apache and patch core/Config.php to set $config->database to either $config->database_weblog or $config->database_js depending on that env var.

However, being able to define a constant like DATABASE_CONFIG_SECTION_NAME in bootstrap.php, which Config.php then used to work out which section of the config file to use, would be much easier and more robust. I could of course just have 2 different installs of piwik, but then I have to update it twice with each release. Probably not worth enlarging the codebase just for my weird setup, but thought I'd ask - I can easily submit a patch if you're interested.

@cbay
Contributor

cbay commented Mar 22, 2012

(In [6092]) Refs #703 import-logs.py renamed to import_logs.py and added a mini test suite which tests the format autodetection.

@cbay
Contributor

cbay commented Mar 22, 2012

(In [6093]) Refs #703 Many improvements:

  • '-' can be specified as filename to read from stdin
  • --format is renamed to --log-format-name
  • --log-format-regex was added
  • user agent matching is now case insensitive
  • --enable-static added to track static files
  • --enable-bots added to track robots
  • --strip-query-string added to strip the query string (it was always stripped before; now it is only stripped when this option is specified)
  • show help when the script is called with no filename

@cbay
Contributor

cbay commented Mar 22, 2012

(In [6094]) Refs #703 Added option --output to redirect output to a file.

@mattab
Member

mattab commented Mar 22, 2012

(In [6100]) Refs #703

  • Fixing encoding when tracking 404 and later other errors: by default urllib.quote does not encode the / but for our purposes we want to encode it so that the URLs show up nicely in the reports
  • only tracking /From= if the referrer was actually set
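
The quoting behaviour mentioned in the first point can be seen in a few lines of Python. This is only an illustration of the standard library function, not the code from the commit; the sample path is made up.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2, as used by the script at the time

path = "/missing/page.html"  # made-up example path

# By default "/" is treated as safe and left untouched.
print(quote(path))

# Passing an empty `safe` string makes quote() encode "/" as %2F too.
print(quote(path, safe=""))
```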

@mattab
Member

mattab commented Mar 23, 2012

(In [6102]) Refs #703 Add license notice, Shuffle help messages order, remove short notation for clarity, improve help messages, adding Java/ + bot- + bot/ + robot as a bot

@mattab
Member

mattab commented Mar 24, 2012

(In [6108]) Refs #703 I'm learning Python (NOT!)

@cbay
Contributor

cbay commented Mar 30, 2012

(In [6128]) Refs #703 Now works with Python 2.5.

@cbay
Contributor

cbay commented Mar 30, 2012

(In [6129]) Refs #703 Show the summary when CTRL+C is pressed.

@cbay
Contributor

cbay commented Mar 30, 2012

(In [6130]) Refs #703 Fixed bug with --log-format-regex (thanks oliverhumpage).

@cbay
Contributor

cbay commented Mar 30, 2012

(In [6131]) Refs #703 Disable buffering when using --output.

@cbay
Contributor

cbay commented Mar 30, 2012

(In [6132]) Refs #703 Added --query-string-delimiter

@cbay
Contributor

cbay commented Mar 30, 2012

(In [6133]) Refs #703 Added --enable-http-errors and --enable-http-redirects

@cbay
Contributor

cbay commented Mar 30, 2012

(In [6134]) Refs #703 Pretty print archives dates.

@cbay
Contributor

cbay commented Mar 30, 2012

oliverhumpage: thanks for the bug report and the suggestions, I think I've committed everything you asked for :)

Regarding the persistent connections, I haven't patched anything. It's a builtin feature of PHP/mysqli, see:

http://www.php.net/manual/en/mysqli.construct.php

"Prepending host by p: opens a persistent connection."

@mattab
Member

mattab commented Mar 31, 2012

(In [6135]) Refs #703

  • Setting custom var for all errors or redirects
  • fixing typo in output

@mattab
Member

mattab commented Mar 31, 2012

(In [6137]) Refs #703 README update + fixed --enable-reverse-dns (it now works) + adding common bot names

@cbay
Contributor

cbay commented Mar 31, 2012

(In [6140]) Refs #703 Catch URL exceptions during configuration

@cbay
Contributor

cbay commented May 21, 2012

Good catch, the query string delimiter was always appended since the latest refactoring. That should be fixed now.

It's indeed possible to exclude some paths: check out --exclude-path and --exclude-path-from.

@anonymous-matomo-user
Author

After 17.3M good lines, got another encode error, script version 6283

Fatal error: 'ascii' codec can't encode character u'\u2013' in position 20: ordinal not in range(128)

One line log: http://mike.org.uk/test5_log.txt

@cbay
Contributor

cbay commented May 23, 2012

(In [6295]) Refs #703 Fixed encoding bug with a non-ASCII user-agent.
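
For readers hitting the same class of error, here is a minimal Python reproduction. The user-agent string is made up, and encoding to UTF-8 is shown only as one common way around the problem; the actual change made in the commit is not shown in this ticket.

```python
# Made-up user agent containing an en dash (U+2013), a non-ASCII character.
user_agent = u"Mozilla/5.0 (example \u2013 agent)"

try:
    # Encoding with the ASCII codec fails on any character above 127.
    user_agent.encode("ascii")
except UnicodeEncodeError as exc:
    print("ASCII encoding failed: %s" % exc)

# Encoding explicitly as UTF-8 handles any Unicode character.
encoded = user_agent.encode("utf-8")
print(repr(encoded))
```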

@cbay
Contributor

cbay commented May 23, 2012

tiouk: thanks again for the bug report, it's fixed.

@thegcat

thegcat commented May 27, 2012

Doesn't work on python 3. 2to3 fixes most of the stuff, there's still some little snags like the base64 decoder expecting bytes and getting a string.
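
The kind of snag meant here can be sketched as follows. Which call in the script was affected is not stated, so this only shows the general Python 3 rule that the base64 functions work on bytes rather than text; the sample value is made up.

```python
import base64

text = "login:password"  # made-up sample value

# In Python 3, b64encode() needs bytes, so text must be encoded first.
encoded = base64.b64encode(text.encode("utf-8"))

# b64decode() returns bytes, which must be decoded to get text back.
decoded = base64.b64decode(encoded).decode("utf-8")

print(encoded)
print(decoded)
```

Written this way, the same lines also run unchanged on Python 2.6 and 2.7.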

@cbay
Contributor

cbay commented May 27, 2012

It's not expected to work with Python 3. It may be supported later, but that's definitely not a requirement for a very first version.

Still, you're welcome to send patches to fix issues with 2to3, as long as they're rather simple (we wouldn't want to add much complexity just to support Python 3, at least not for now).

@thegcat

thegcat commented May 27, 2012

AFAIK everything that runs on 3 should run on 2.7+, so I'd think also developing against 3 would avoid a later bigger update.

Regarding patches: I'm no python dev and I don't have the resources to take care of possible problems with an interpreter not supported by upstream. I'll be happy to help when python 3 is supported, until then I'll use a python 2 slot for this.

Thanks for the effort though, not being able to import log files has been a major blocker for piwik here :-)

@mattab
Member

mattab commented May 28, 2012

There are only 2-3 days left before release / freeze - is everyone happy with the script for a V1?

@oliverhumpage

I've just updated to the latest version and realised that the addition of the file.seek(0) calls stops you from being able to pipe in logs through stdin. Is it possible to disable that with a flag? It's really, really useful being able to pipe logs straight from Apache.

@cbay
Contributor

cbay commented May 29, 2012

oliverhumpage: seek is actually only used when the log format is autodetected, and I really can't see a way to avoid this (due to IIS). So you have to explicitly specify the format (with --log-format-name) when reading from stdin.

Let me know if that doesn't work (it should).
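
Why autodetection and stdin do not mix can be shown with a small Python sketch: a pipe cannot be rewound, while a regular file can. The helper name can_rewind is made up and this is not code from the script.

```python
import os
import tempfile


def can_rewind(stream):
    """Return True if seek(0) works on the stream, False otherwise."""
    try:
        stream.seek(0)
        return True
    except (IOError, OSError, ValueError):
        # Pipes (such as a piped stdin) do not support seeking.
        return False


if __name__ == "__main__":
    read_fd, write_fd = os.pipe()
    with os.fdopen(read_fd) as pipe_reader:
        print("pipe: %s" % can_rewind(pipe_reader))
    os.close(write_fd)

    with tempfile.TemporaryFile() as regular_file:
        print("file: %s" % can_rewind(regular_file))
```

Since the first lines of a pipe cannot be read a second time after the format has been guessed, naming the format up front avoids the need to rewind at all.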

@oliverhumpage

@cyril

Ah, you're right - if I specify a regex or name then it stops complaining, which is fine. However, there is still a problem that no lines are being read from stdin. If I import from a file with a couple of lines in, I get results (it says lines have been parsed). However, if I specify the file as "-" and copy/paste those same lines, nothing gets logged, not even an error. I switched on --debug and did this for you:

#   /path/to/piwik/misc/log-analytics/import_logs.py --add-sites-new-hosts --config=/path/to/piwik/config/config.ini.php --url='http://piwik.local/' --recorders=1 --enable-static --log-format-name=common_vhost --debug -
2012-05-29 20:56:53,619: [DEBUG] Accepted hostnames: all
2012-05-29 20:56:53,778: [DEBUG] Piwik URL is: http://piwik.local/
2012-05-29 20:56:53,778: [DEBUG] No token-auth specified
2012-05-29 20:56:53,779: [DEBUG] No credentials specified, reading them from "/path/to/piwik/config/config.ini.php"
2012-05-29 20:56:53,780: [DEBUG] Using credentials: (login = admin, password = xxxxxxxxxxxxxxxxxxxxxxxxxx)
2012-05-29 20:56:54,067: [DEBUG] Authentication token token_auth is: xxxxxxxxxxxxxxxxxxxxxxxxxx
2012-05-29 20:56:54,067: [DEBUG] Resolver: dynamic
0 lines parsed, 0 lines recorded, 0 records/sec
2012-05-29 20:56:54,068: [DEBUG] Launched recorder
Parsing log /dev/stdin...
0 lines parsed, 0 lines recorded, 0 records/sec
0 lines parsed, 0 lines recorded, 0 records/sec
www.domain.co.uk 9.8.7.6 - - [29/May/2012:12:41:19 +0100] "GET /robots.txt HTTP/1.1" 403 212 "-" "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT)"
0 lines parsed, 0 lines recorded, 0 records/sec
0 lines parsed, 0 lines recorded, 0 records/sec
www.domain.co.uk 9.8.7.6 - - [29/May/2012:12:41:21 +0100] "GET /robots.txt HTTP/1.1" 403 212 "-" "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT)"
0 lines parsed, 0 lines recorded, 0 records/sec
0 lines parsed, 0 lines recorded, 0 records/sec
^C
Logs import summary
-------------------

    0 requests imported successfully
    0 requests were downloads
    0 requests ignored:
        0 invalid log lines
        0 requests done by bots, search engines, ...
        0 HTTP errors
        0 HTTP redirects
        0 requests to static resources (css, js, ...)
        0 requests did not match any known site
        0 requests did not match any requested hostname

Website import summary
----------------------

    0 requests imported to 0 sites
        0 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:



Performance summary
-------------------

    Total time: 6 seconds
    Requests imported per second: 0.0 requests per second


As you can see, pasting the lines in did nothing. Importing the exact same lines but from a file gave:

Logs import summary
-------------------

    0 requests imported successfully
    2 requests were downloads
    2 requests ignored:
        0 invalid log lines
        0 requests done by bots, search engines, ...
        2 HTTP errors
        0 HTTP redirects
        0 requests to static resources (css, js, ...)
        0 requests did not match any known site
        0 requests did not match any requested hostname

Also, --show-progress appears to have got itself switched on even though it's not specified in the command-line (this happens with or without --debug).

Have I got something weird in my installation, or can you reproduce this?
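
For reference, the sample line pasted above can be split into fields with a regular expression such as the one in this Python sketch. The pattern and its group names are written for illustration only; they are not necessarily what the script uses for common_vhost or expects from --log-format-regex.

```python
import re

# Illustrative pattern for "vhost + common" lines with optional
# referrer and user-agent fields; group names are made up.
LINE_RE = re.compile(
    r'(?P<host>\S+) (?P<ip>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<length>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)")?'
)

line = (
    'www.domain.co.uk 9.8.7.6 - - [29/May/2012:12:41:19 +0100] '
    '"GET /robots.txt HTTP/1.1" 403 212 "-" '
    '"Mozilla/4.0 (compatible; MSIE 5.01; Windows NT)"'
)

match = LINE_RE.match(line)
fields = match.groupdict() if match else {}
print(fields)
```

The status field comes out as 403, which is consistent with both lines being counted under "HTTP errors" when imported from a file.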

@mattab
Member

mattab commented May 30, 2012

(In [6403]) Fixes #3139 Adding new 'bots' parameter to the Tracking API. When set to 1 Piwik will record the request even if it is made by a bot (currently detected are only Googlebot and some Bing bots)

Refs #703 - Cyril, when --enable-bots is set, can you please make sure the parameter &bots=1 is also added to the piwik.php request? Only in this case, though. Thanks!
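
The requested behaviour could look roughly like this Python sketch. Only the bots parameter comes from this ticket; the function name is made up, and idsite, rec and url are assumed here to be the minimal set of Tracking API parameters for a page view.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def tracking_query(idsite, page_url, enable_bots=False):
    """Build the query string for one piwik.php tracking request."""
    params = [("idsite", idsite), ("rec", 1), ("url", page_url)]
    if enable_bots:
        # Only sent when bot tracking was explicitly requested.
        params.append(("bots", 1))
    return urlencode(params)


if __name__ == "__main__":
    print(tracking_query(1, "http://example.org/robots.txt", enable_bots=True))
```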

@mattab
Member

mattab commented May 30, 2012

EspadaV8, the bug is my fault: I packaged RC2 with a debug statement. Please try with RC3, it should work OK!

@anonymous-matomo-user
Author

Replying to matt:

EspadaV8, the bug is my fault: I packaged RC2 with a debug statement. Please try with RC3, it should work OK!

Awesome, RC3 seems to be importing everything nicely :) Thanks

@mattab
Member

mattab commented May 31, 2012

We have now set up a demo of Piwik log analytics.

The demo at http://demo-log-analytics.piwik.org/ has only 1 day of data for now.

It has 2 websites to show default import mode and full mode (with bots, files, errors, etc.)

I will now close this ticket as it is getting quite long, and have opened another one to keep track of the next features: #3163

Please post all new bug reports and feature suggestions in this ticket: #3163

@cbay
Contributor

cbay commented Jun 1, 2012

(In [6433]) Refs #703 Set bots=1 accordingly.

This issue was closed.