Opened 5 years ago

Closed 23 months ago

Last modified 23 months ago

#703 closed New feature (fixed)

Piwik an alternative to AWStats and Urchin, build server log import script

Reported by: bharatkalyan Owned by:
Priority: critical Milestone: Piwik 1.8
Component: Core Keywords:
Cc: Cyril Sensitive: no

Description (last modified by matt)

Urchin Alternative: Import your server logs in Piwik, the Free web analytics platform!

See blog post Piwik alternative to Urchin for more information.

Piwik is an alternative to Urchin, and also to Webalizer and AWStats: with a Python script, you can now import web server logs (Apache, IIS, and more) into Piwik instead of using JavaScript tracking.

Description
A Python script available in piwik/misc/log-analytics/ will parse server logs efficiently and automatically call the Piwik Tracking API to inject the visits/pageviews/downloads in Piwik.

How to install / how to use

  • Requires Piwik >= 1.7.2-rc2. Download the latest version from http://builds.piwik.org/?C=N;O=D
  • Requires at least Python 2.6
  • Requires one or many server log files, typically called access.log in Apache for example. These log files will be imported into Piwik.
  • You can also create a "test website" in Piwik to import all data into, rather than importing into your existing websites. Then use the option --idsite=X to force all data from the log files to be imported into this idsite.
  • You can use the --dry-run option for a test run that does not track any data or create new websites.

SEE FOLLOW UP TICKET #3163

How you can help?

  • please use the script and report your feedback and bugs here
  • if you are a hacker yourself, please review the code and consider submitting performance optimization, or improvements.
  • If you are a webhost or web agency and wish to offer Piwik to hundreds of your customers, please contact us
  • review the doc at Server log analytics


Tasks to do before final release

  • Test, test and test
  • Setup on demo.piwik.org in a new website
  • Check all code review feedback managed
  • Review Import Logs in Piwik doc page.
  • Decommission apache2piwik (update blog post)

Feature requests for V2 or later

SEE FOLLOW UP TICKET #3163

Change History (201)

comment:1 Changed 5 years ago by vipsoft (robocoder)

  • Resolution set to invalid
  • Status changed from new to closed

comment:2 Changed 5 years ago by matt (mattab)

  • Milestone changed from Third Party Piwik Plugins to 1- RobotRock

comment:3 Changed 3 years ago by matt (mattab)

  • Component changed from UI (templates, javascript) to Core
  • Description modified (diff)
  • Milestone changed from RobotRock to 1.x - Piwik 1.x
  • Priority changed from critical to major
  • Resolution invalid deleted
  • Sensitive unset
  • Status changed from closed to reopened
  • Summary changed from HOW TO USE PIWIK LIKE AWSTATS WITH WEBSERVER LOGFILES to Piwik as an alternative to AWStats

comment:6 Changed 2 years ago by matt (mattab)

  • Description modified (diff)
  • Summary changed from Piwik as an alternative to AWStats to Piwik an alternative to AWStats and Urchin, build server log import script

comment:8 Changed 2 years ago by matt (mattab)

  • Milestone changed from 1.x - Piwik 1.x to 1.7.x - Piwik 1.7.2
  • Priority changed from major to critical

comment:9 Changed 2 years ago by matt (mattab)

First release of the script committed in [6046]

comment:10 follow-up: Changed 2 years ago by matt (mattab)

  • Cc Cyril added

comment:11 in reply to: ↑ 10 Changed 2 years ago by Cyril (cbay)

comment:12 Changed 2 years ago by matt (mattab)

(In [6051]) Refs #703

  • Adding README contributed by Cyril

comment:13 Changed 2 years ago by vipsoft (robocoder)

(In [6053]) refs #703 - propset eol-style

comment:14 Changed 2 years ago by oliverhumpage

comment:16 Changed 2 years ago by MichaelP08

comment:17 follow-up: Changed 2 years ago by Cyril (cbay)

comment:18 in reply to: ↑ 17 Changed 2 years ago by MichaelP08

comment:19 follow-up: Changed 2 years ago by Cyril (cbay)

comment:20 in reply to: ↑ 19 Changed 2 years ago by MichaelP08

comment:21 Changed 2 years ago by oliverhumpage

comment:22 Changed 2 years ago by matt (mattab)

  • Description modified (diff)

comment:24 Changed 2 years ago by hhamano

comment:26 Changed 2 years ago by oliverhumpage

comment:27 Changed 2 years ago by oliverhumpage

Performance-wise: I've set up piwik in its own jail now, turned off unnecessary PHP extensions, tweaked apache, and enabled APC. If I use --recorders=48 I get good import speeds (at least at first) without the load average going too high. However, something odd happens, and some way through importing a log file the recorders drop off (I can see fewer and fewer apache processes too, so clearly it's just not being hit as much):

2846 lines parsed, 233 lines recorded, 233 records/sec
4372 lines parsed, 506 lines recorded, 273 records/sec
[...]
8300 lines parsed, 7570 lines recorded, 9 records/sec
8300 lines parsed, 7579 lines recorded, 9 records/sec
8300 lines parsed, 7588 lines recorded, 9 records/sec
8300 lines parsed, 7598 lines recorded, 10 records/sec

I don't think I have any weird throttling going on - any ideas what might be up? There's nothing else being output during the processing even with debugging on. The drop-off seems to start roughly half way through any given logfile.

comment:28 Changed 2 years ago by oliverhumpage

comment:29 in reply to: ↑ 25 ; follow-up: Changed 2 years ago by hhamano

comment:30 follow-up: Changed 2 years ago by Cyril (cbay)

oliverhumpage: 48 is almost certainly too high, unless you have a 48-core machine. You shouldn't have to exceed the number of cores in your system; go even a bit lower (as the import script and MySQL will run at the same time).

As for why your performance decreases over time, I don't know. What does a 'top' say? You'd have to find the bottleneck. It may be Apache, PHP, MySQL. On my system, I have a sustained 300 req/s for more than 3 hours.

Regarding the static files excluded, we'll add an option to include those (disabled by default). I'm sure the whole importing process will get better over time, it's only the beginning :)

comment:31 Changed 2 years ago by matt (mattab)

(In [6070]) Refs #703 Removing images from "downloads", and improving TIP message in output debug

comment:32 in reply to: ↑ 29 ; follow-up: Changed 2 years ago by matt (mattab)

comment:33 Changed 2 years ago by matt (mattab)

(In [6071]) Refs #703 Improving help message as per Cyril feedback

comment:34 in reply to: ↑ 32 Changed 2 years ago by hhamano

comment:35 Changed 2 years ago by matt (mattab)

(In [6074]) Refs #703 Display response output when tracking request failed (this happens for example when debug is enabled in piwik.php)

comment:36 in reply to: ↑ 30 Changed 2 years ago by oliverhumpage

Replying to Cyril:

oliverhumpage: 48 is almost certainly too high, unless you have a 48-core machine. You shouldn't have to exceed the number of cores in your system; go even a bit lower (as the import script and MySQL will run at the same time).

I did quite a few experiments, and eventually found that 40 is about right. This is a VM running on a high powered Dell R710, so although the OS only thinks it has 4 CPUs I don't know how things actually pan out. All I know is that the number of records/sec increases pretty much linearly with --recorders up until 40. E.g. if I run at 32, I get more like 200r/sec rather than 250+r/sec. A single recorder manages around 6-7r/sec. After 40 the benefits tail off.

I also tried a few experiments to see where the bottleneck might lie, for instance I stuck in a mod_rewrite to send the importer to a basic PHP file that just returned the .gif without doing any processing, but weirdly the performance was about the same. However, running with --dry-run (or just removing the line which actually calls the script) means the python script runs at around 4000r/sec, so I can only conclude the limit is in apache/php (putting in APC definitely helped). I also tried hacking the script to run a PHP wrapper script that called piwik.php directly on the command line, but it went horribly slowly, presumably because of the lag in loading up PHP.

Anyway, I'm happy with 250-300r/sec. I may set up a separate VM with a tweaked kernel and optimised apache to deal with log imports anyway, so I'm sure I can improve on that figure.

Regarding the steady tailing-off, what I'm wondering is: when you specify lots of recorders, do they each grab an equal number of log lines at the start then work through them? That would explain why some finish earlier than others (if e.g. one gets a lot with non-loggable lines it'd finish sooner). I notice the number of apache processes starts tailing off around half to 2/3 of the way through the log, and then just steadily decline until only 1 recorder is left.

Regarding the static files excluded, we'll add an option to include those (disabled by default). I'm sure the whole importing process will get better over time, it's only the beginning :)

That'd be brilliant, thank you. Thanks to you all for being so responsive in general too.

Oliver.

comment:37 Changed 2 years ago by matt (mattab)

FYI the new 1.7.2-rc4 was released which includes the most up to date code: Download from: http://builds.piwik.org/?C=N;O=D

comment:38 Changed 2 years ago by matt (mattab)

oliverhumpage, thanks for your comments, they're very interesting!
Since you seem keen, maybe you could consider running XHProf, the Facebook PHP profiler: http://pecl.php.net/package/xhprof

I haven't run that for a long time, and never under high load such as 300 req/s, so it would be very interesting. If you install it, I would love to see the reports generated! The last time we ran XHProf on Piwik we found 2-3 quick fixes that made things a lot better. I'm sure we can make the tracker faster in many ways.

It would also be good to know the % of consumption of Apache/php VS mysql (not sure the best way to do this however?).

comment:39 follow-up: Changed 2 years ago by Cyril (cbay)

oliverhumpage: regarding the recorders, each request will be dispatched to a specific recorder based on its IP address. It means that if the IP address distribution of your log files isn't "even", some recorders will have more work to do than others. Which could explain the performance issues you're having, especially near the end of the import process.

This dispatching was required to make sure requests are imported in the correct order.
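
A rough illustration of why per-IP dispatching can leave recorders unevenly loaded. The hash used here (`zlib.crc32`), the function names, and the sample IPs are assumptions made for this sketch; the ticket only says that hits are dispatched by IP address, not how.

```python
import zlib
from collections import Counter


def recorder_for(ip, recorders):
    """Map an IP to a recorder index; the same IP always lands on the same recorder."""
    return zlib.crc32(ip.encode("ascii")) % recorders


def load_per_recorder(ips, recorders):
    """Count how many hits each recorder would receive for a sequence of hit IPs."""
    return Counter(recorder_for(ip, recorders) for ip in ips)


# Hypothetical traffic: one heavy client plus ten light ones.
hits = ["10.0.0.1"] * 90 + ["10.0.0.%d" % n for n in range(2, 12)]
load = load_per_recorder(hits, recorders=4)
busiest = max(load.values())
```

Since all hits from one IP stay on one recorder to preserve their order, the recorder that owns the heavy IP keeps working after the others have drained their queues, which would look like the tail-off described above.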

comment:40 in reply to: ↑ 39 Changed 2 years ago by oliverhumpage

comment:42 Changed 2 years ago by oliverhumpage

comment:43 Changed 2 years ago by oliverhumpage

Actually, I do have one small request for piwik itself.

Would it be possible to choose on the fly between multiple database options: you see, I'm using one physical install of piwik at 2 different URLs - one for JS-based sites, and one for log-based, and therefore also 2 different sets of db tables so that --add-sites-new-hosts on the log-based system doesn't interfere with the JS websites (they'd have the same URLs). What I've done atm is set an environment var in apache and patch core/Config.php to set $config->database to either $config->database_weblog or $config->database_js depending on that env var.

However, being able to define a constant like DATABASE_CONFIG_SECTION_NAME in bootstrap.php, which Config.php then used to work out which section of the config file to use, would be much easier and more robust. I could of course just have 2 different installs of piwik, but then I have to update it twice with each release. Probably not worth enlarging the codebase just for my weird setup, but thought I'd ask - I can easily submit a patch if you're interested.

comment:45 Changed 2 years ago by Cyril (cbay)

(In [6092]) Refs #703 import-logs.py renamed to import_logs.py and added a mini test suite which tests the format autodetection.

comment:46 Changed 2 years ago by Cyril (cbay)

(In [6093]) Refs #703 Many improvements:

  • '-' can be specified as filename to read from stdin
  • --format is renamed to --log-format-name
  • --log-format-regex was added
  • user agent matching is now case insensitive
  • --enable-static added to track static files
  • --enable-bots added to track robots
  • --strip-query-string added to strip the query string (it was always stripped before, now it's not until this option is specified)
  • show help when the script is called with no filename

comment:47 Changed 2 years ago by Cyril (cbay)

(In [6094]) Refs #703 Added option --output to redirect output to a file.

comment:48 Changed 2 years ago by matt (mattab)

(In [6100]) Refs #703

  • Fixing encoding when tracking 404 and later other errors: by default urllib.quote does not encode the / but for our purposes we want to encode it so that the URL show up nicely in the reports
  • only tracking /From= if the referrer was actually set
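
For reference, a small sketch of the quoting behaviour described in the first bullet. It uses the Python 3 spelling `urllib.parse.quote` (the commit refers to Python 2's `urllib.quote`, which has the same default); the sample path is made up.

```python
from urllib.parse import quote  # Python 2 spelling: urllib.quote

path = "/missing/page.html"  # hypothetical 404 path

# By default "/" is in the safe set, so it is left as-is.
default_quoted = quote(path)

# Passing an empty safe set encodes "/" as well.
fully_quoted = quote(path, safe="")
```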

comment:49 Changed 2 years ago by matt (mattab)

(In [6102]) Refs #703 Add license notice, Shuffle help messages order, remove short notation for clarity, improve help messages, adding Java/ + bot- + bot/ + robot as a bot

comment:51 Changed 2 years ago by oliverhumpage

comment:52 in reply to: ↑ 50 ; follow-up: Changed 2 years ago by Cyril (cbay)

comment:54 in reply to: ↑ 52 ; follow-up: Changed 2 years ago by matt (mattab)

comment:56 Changed 2 years ago by matt (mattab)

(In [6108]) Refs #703 I'm learning Python (NOT!)

comment:57 in reply to: ↑ 54 Changed 2 years ago by Cyril (cbay)

comment:58 in reply to: ↑ 55 Changed 2 years ago by Cyril (cbay)

comment:59 Changed 2 years ago by guardian

comment:61 in reply to: ↑ 60 ; follow-up: Changed 2 years ago by guardian

comment:62 in reply to: ↑ 61 Changed 2 years ago by guardian

comment:63 Changed 2 years ago by guardian

comment:64 Changed 2 years ago by oliverhumpage

comment:65 Changed 2 years ago by oliverhumpage

comment:66 Changed 2 years ago by oliverhumpage

comment:67 Changed 2 years ago by Cyril (cbay)

(In [6128]) Refs #703 Now works with Python 2.5.

comment:68 Changed 2 years ago by Cyril (cbay)

(In [6129]) Refs #703 Show the summary when CTRL+C is pressed.

comment:69 Changed 2 years ago by Cyril (cbay)

(In [6130]) Refs #703 Fixed bug with --log-format-regex (thanks oliverhumpage).

comment:70 Changed 2 years ago by Cyril (cbay)

(In [6131]) Refs #703 Disable buffering when using --output.

comment:71 Changed 2 years ago by Cyril (cbay)

(In [6132]) Refs #703 Added --query-string-delimiter

comment:72 Changed 2 years ago by Cyril (cbay)

(In [6133]) Refs #703 Added --enable-http-errors and --enable-http-redirects

comment:73 Changed 2 years ago by Cyril (cbay)

(In [6134]) Refs #703 Pretty print archives dates.

comment:74 Changed 2 years ago by Cyril (cbay)

oliverhumpage: thanks for the bug report and the suggestions, I believe I've committed everything you asked for :)

Regarding the persistent connections, I haven't patched anything. It's a builtin feature of PHP/mysqli, see:

http://www.php.net/manual/en/mysqli.construct.php

"Prepending host by p: opens a persistent connection."

comment:75 Changed 2 years ago by matt (mattab)

(In [6135]) Refs #703

  • Setting custom var for all errors or redirects
  • fixing typo in output

comment:77 Changed 2 years ago by matt (mattab)

(In [6137]) Refs #703 README update + fixing --enable-reverse-dns now works + adding common bot names

comment:78 Changed 2 years ago by Cyril (cbay)

(In [6140]) Refs #703 Catch URL exceptions during configuration

comment:80 Changed 2 years ago by matt (mattab)

(In [6155]) Adding advanced use case in the README. Thanks Oliver for your help and submission!! Refs #703

comment:82 Changed 2 years ago by oliverhumpage

Cyril:

Have tested using - instead of /dev/stdin, seems to work fine.

Re the regex, I think that's explained in the comments: because I want it to pick up hostnames that are subsites and so have slashes (e.g. I want the hostname 'domain.com/subsite' to be picked up and created with that name in piwik), I needed to amend the normal vhost regex to allow "/" in the host character class. It's also a very good example of what and how to escape shell special characters in apache log pipes :)

(I spent a fun morning with a test apache installation and a perl script testing each special character in turn until I got it working... then a fun afternoon wondering why it wasn't working with import_logs.py, until I realised there wasn't a .compile for the custom regex!)

I did originally put things like "domain.com.subsite" in the hostname so the standard regex would work, but it looks ugly and non-user-friendly in piwik.
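
A minimal sketch of the character-class change described above. Both patterns are simplified stand-ins written for this example, not the regexes shipped in import_logs.py, and the line prefix is made up apart from the 'domain.com/subsite' hostname.

```python
import re

# Hypothetical host patterns: the second one also allows "/" inside the host.
STRICT_HOST = re.compile(r'(?P<host>[\w\-\.]+) ')
SUBSITE_HOST = re.compile(r'(?P<host>[\w\-\./]+) ')

# Start of a made-up vhost-style log line.
line_start = 'domain.com/subsite 192.0.2.10 - - '

strict_match = STRICT_HOST.match(line_start)
subsite_match = SUBSITE_HOST.match(line_start)
```

With the stricter class the line does not match at all, while the relaxed class captures the full 'domain.com/subsite' name.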

comment:84 Changed 2 years ago by oliverhumpage

comment:86 Changed 2 years ago by matt (mattab)

(In [6157]) refs #703 updating README as per feedback. please comment if the code does not work I haven't tested myself

comment:87 Changed 2 years ago by matt (mattab)

  • Description modified (diff)

Updated ticket with suggestions on how to improve script performance (i.e. we should bulk send 50 requests at once in a POST, to make 50 times fewer HTTP requests)!
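
As a sketch of the batching idea only: the helper below groups hits into chunks of 50 so that each chunk could be sent as one request. The function name and the placeholder hits are assumptions, and the sketch says nothing about what the bulk request itself would look like.

```python
def batches(items, size=50):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


hits = list(range(1000))  # placeholder hits, for illustration only
grouped = list(batches(hits))
requests_needed = len(grouped)
```

For 1000 hits this gives 20 requests instead of 1000.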

comment:89 Changed 2 years ago by oliverhumpage

Just regarding the persistent database connections: using "p:localhost" only works for mysqli after PHP 5.3. It didn't work for me since we're still on 5.2 (going to upgrade soon...).

comment:90 Changed 2 years ago by Cyril (cbay)

Matt: that should do it I guess. I'll try to make the changes ASAP.

Sending bulk requests would be great, I'm sure that would improve the performance a lot!

comment:91 follow-up: Changed 2 years ago by tiouk

The script doesn't parse IIS6 or IIS7 log files (not tried IIS8). I tried the following regex that matches the log lines in kiki but no luck with the script. Any pointers?

Also minor change to line 1068

' the --format option'

needs updating to

' either the --log-format-name or --log-format-regex option'

_IIS6_FORMAT = (
    '(?P<date>^\d+[-\d+]+ [\d+:]+) '
    '\S+ \S+ [\d*.]+ \S+ '
    '(?P<path>\S+) '
    '\S+ \d+ \S+ '
    '(?P<ip>[\d*.]*) '
    '\S+ '
    '(?P<user_agent>\S+) '
    '\S+ '
    '(?P<referrer>\S+) '
    '\S+ '
    '(?P<status>\d+) '
    '\S+ \S+ '
    '(?P<length>\S+)'
)

comment:92 in reply to: ↑ 91 Changed 2 years ago by hhamano

+1

The only way I got it to work with IIS7 (just to test out the script) was to convert it to ncsa extended.

comment:93 follow-up: Changed 2 years ago by matt (mattab)

Can you please post example log format that does not work ?

comment:94 Changed 2 years ago by marekbecka

comment:95 Changed 2 years ago by Cyril (cbay)

(In [6165]) Refs #703 Fixed bug: stats.piwik_sites should not have None items.

comment:97 Changed 2 years ago by Cyril (cbay)

(In [6166]) Refs #703 Only show tips in summary if necessary.

comment:98 Changed 2 years ago by Cyril (cbay)

(In [6167]) Refs #703 Added --exclude-path and --exclude-path-from.

comment:99 Changed 2 years ago by Cyril (cbay)

(In [6168]) Refs #703 Replaced tabs with spaces.

comment:100 in reply to: ↑ 93 Changed 2 years ago by tiouk

Replying to matt:

Can you please post example log format that does not work ?

About 20 lines from two logs.

http://mike.org.uk/iis6_iis7_log.txt

comment:101 Changed 2 years ago by Cyril (cbay)

(In [6169]) Refs #703 Added custom variable Not-Bot.

comment:102 Changed 2 years ago by Cyril (cbay)

(In [6170]) Refs #703 Updated error string.

comment:103 Changed 2 years ago by marekbecka

Trunk version works well also without --idsite-fallback.

comment:104 Changed 2 years ago by matt (mattab)

Cyril, thanks for the recent fixes, very nice!!

comment:105 Changed 2 years ago by schrefel

For some years I have been looking for an alternative to awstats, and with your import script I think I've found it. Great work so far.

But I'm having trouble with the log file. We use a Lotus Notes cluster, and for each server in the cluster we have a separate log file per day.
The import works, but the result isn't right, and I think it's because of the log file format.

It looks like this:
192.168.1.1 bene.com - [01/Apr/2012:00:00:01 +0200] "GET /mobiliario-de-oficina/news-filo-design-preis-2009.html HTTP/1.1" 200 20719 "" "Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot@…)" 1453 "" "D:/Notes/Data/benecom/cont_es.nsf"

in awstats I can describe the log file format like this:
LogFormat=%host %virtualname %lognamequot %time1 %methodurl %code %bytesd %refererquot %uaquot %other %other %other

and http error and redirects are also not found:

17758 requests imported successfully
542 requests were downloads
0 requests ignored:

0 invalid log lines
0 requests done by bots, search engines, ...
0 HTTP errors
0 HTTP redirects
0 requests to static resources (css, js, ...)
0 requests did not match any known site
0 requests did not match any requested hostname

See more log at: http://pastebin.com/zSMXqEpu
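
One possible starting point for a custom regex for the Lotus Notes line quoted above. The group names `ip`, `date`, `path`, `status`, `length`, `referrer` and `user_agent` follow the IIS snippet earlier in this ticket; the `host` group and the assumption that --log-format-regex accepts this pattern are guesses that have not been checked against the script.

```python
import re

# Hypothetical pattern: host, virtual name, logname, time, request, status, bytes,
# referrer, user agent; the three trailing fields are ignored.
LOTUS_FORMAT = re.compile(
    r'(?P<ip>\S+) (?P<host>\S+) \S+ '
    r'\[(?P<date>[^\]]+)\] '
    r'"\S+ (?P<path>\S+) [^"]*" '
    r'(?P<status>\d+) (?P<length>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# The sample line from the comment above, unchanged.
line = ('192.168.1.1 bene.com - [01/Apr/2012:00:00:01 +0200] '
        '"GET /mobiliario-de-oficina/news-filo-design-preis-2009.html HTTP/1.1" '
        '200 20719 "" "Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot@…)" '
        '1453 "" "D:/Notes/Data/benecom/cont_es.nsf"')

match = LOTUS_FORMAT.match(line)
fields = match.groupdict() if match else {}
```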

comment:106 Changed 2 years ago by matt (mattab)

  • Also the log format Websphere is not detected:
    • example 20 lines log: http://pastebin.com/eY7N5weT
    • Error is
      Traceback (most recent call last):
        File "C:\Python26\lib\threading.py", line 522, in __bootstrap_inner
          self.run()
        File "C:\Python26\lib\threading.py", line 477, in run
          self.__target(*self.__args, **self.__kwargs)
        File "c:\wamp\www\piwik\misc\log-analytics\import-logs.py", line 756, in _run
          self._record_hit(hit)
        File "c:\wamp\www\piwik\misc\log-analytics\import-logs.py", line 794, in _record_hit
          'url': main_url + hit.path[:1024],
      UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 148: ordinal not in range(128)
      

comment:107 Changed 2 years ago by guardian

I just faced a situation where the first line in an nginx log is an invalid request:

110.164.252.2 - - [14/Apr/2012:06:26:37 +0200] "-" 400 0 "-" "-" 

I know which log format I'm using and I'll just use --log-format-name ncsa_extended but maybe the script could try several lines before giving up?
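
The suggestion could look roughly like this: try each known format against the first few lines instead of only the first one. The regex is a simplified stand-in for an NCSA extended pattern, and `detect_format`, `FORMATS` and `max_lines` are hypothetical names, not taken from import_logs.py.

```python
import re

# Simplified stand-in for an NCSA extended ("combined") pattern.
NCSA_EXTENDED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
    r'"\S+ (?P<path>\S+) [^"]*" (?P<status>\d+) (?P<length>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)
FORMATS = {"ncsa_extended": NCSA_EXTENDED}


def detect_format(lines, formats=FORMATS, max_lines=10):
    """Return the name of the first format matching any of the first `max_lines` lines."""
    for line in lines[:max_lines]:
        for name, regex in formats.items():
            if regex.match(line):
                return name
    return None


sample = [
    # First line: the invalid request quoted in the comment above.
    '110.164.252.2 - - [14/Apr/2012:06:26:37 +0200] "-" 400 0 "-" "-"',
    # Second line: a made-up valid hit, for illustration only.
    '203.0.113.7 - - [14/Apr/2012:06:26:40 +0200] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

first_line_only = detect_format(sample[:1])
first_few_lines = detect_format(sample)
```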

comment:108 follow-up: Changed 2 years ago by Cyril (cbay)

matt: that's because there are UTF8 characters in the logs, and the script expects the logs to be plain ASCII. I suggest we add a new option --encoding that allows to specify if the log files are in a specific encoding rather than ASCII, what do you think?

comment:109 in reply to: ↑ 108 Changed 2 years ago by matt (mattab)

Replying to Cyril:

matt: that's because there are UTF8 characters in the logs, and the script expects the logs to be plain ASCII. I suggest we add a new option --encoding that allows to specify if the log files are in a specific encoding rather than ASCII, what do you think?

That sounds nice, but would it be possible to test both ASCII and UTF-8 automatically when such decoding error occurs?

I suppose most logs these days are in UTF8 so it would be nice to work by default :)

comment:110 Changed 2 years ago by guardian

I'm having problems with the script which enters an infinite loop

6846 lines parsed, 1919 lines recorded, 0 records/sec
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.6/threading.py", line 484, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/share/www/piwik/misc/log-analytics/import_logs.py", line 860, in _run
    self._record_hit(hit)
  File "/usr/local/share/www/piwik/misc/log-analytics/import_logs.py", line 924, in _record_hit
    headers={'User-Agent' : hit.user_agent},
  File "/usr/local/share/www/piwik/misc/log-analytics/import_logs.py", line 694, in call
    return self._call_wrapper(self._call, expected_content, path, args, headers)
  File "/usr/local/share/www/piwik/misc/log-analytics/import_logs.py", line 676, in _call_wrapper
    response = func(*args, **kwargs)
  File "/usr/local/share/www/piwik/misc/log-analytics/import_logs.py", line 625, in _call
    response = urllib2.urlopen(request)
  File "/usr/lib/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.6/urllib2.py", line 391, in open
    response = self._open(req, data)
  File "/usr/lib/python2.6/urllib2.py", line 409, in _open
    '_open', req)
  File "/usr/lib/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.6/urllib2.py", line 1178, in https_open
    return self.do_open(httplib.HTTPSConnection, req)
  File "/usr/lib/python2.6/urllib2.py", line 1143, in do_open
    r = h.getresponse()
  File "/usr/lib/python2.6/httplib.py", line 990, in getresponse
    response.begin()
  File "/usr/lib/python2.6/httplib.py", line 391, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.6/httplib.py", line 355, in _read_status
    raise BadStatusLine(line)
BadStatusLine

6846 lines parsed, 1919 lines recorded, 0 records/sec
6846 lines parsed, 1919 lines recorded, 0 records/sec
6846 lines parsed, 1919 lines recorded, 0 records/sec
6846 lines parsed, 1919 lines recorded, 0 records/sec
6846 lines parsed, 1919 lines recorded, 0 records/sec
...

and it keeps repeating

6846 lines parsed, 1919 lines recorded, 0 records/sec

The offending line 1920 in my nginx log is:

62.83.238.31 - - [17/Apr/2012:12:53:35 +0200] "-" 400 0 "-" "-"

comment:111 Changed 2 years ago by Cyril (cbay)

matt: defaulting to UTF8 is indeed better, since ASCII is UTF8 compatible anyway.
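
A quick check of that claim, using a made-up path that contains the same 0xc3 byte as in the traceback above: decoding as ASCII fails, decoding as UTF-8 works, and pure-ASCII bytes decode identically either way.

```python
raw_path = b"/caf\xc3\xa9.html"  # hypothetical log bytes containing UTF-8 "é"

try:
    raw_path.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

decoded = raw_path.decode("utf-8")

plain = b"/index.html"
same_result = plain.decode("utf-8") == plain.decode("ascii")
```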

comment:112 Changed 2 years ago by Cyril (cbay)

(In [6212]) Refs #703 Catch httplib exceptions raised by urllib2.

comment:113 Changed 2 years ago by Cyril (cbay)

guardian: I don't know why you're getting this BadStatusLine exception, but the latest commit at least detects it correctly. Can you try again?

comment:114 Changed 2 years ago by Cyril (cbay)

(In [6213]) Refs #703 Added --encoding, defaults to UTF8.

comment:115 Changed 2 years ago by julianito

Hi all,

I have big problems with the import: I'm only importing 15-20 lines per second, and I need to import more than 20 million lines. Do I really need more than 10 days to import these logs?

I tried running the command with --dry-run, but that doesn't insert the lines into the database; it only checks that the command works.

How can I improve the import speed? (I also tried --recorders=16, but that didn't work well. I only have 2 CPUs, 4 cores.)

Do you have a guide or something similar I could follow to improve the import speed?

Thanks in advance for your feedback.

I'm using this command:
python /var/www/piwik/misc/log-analytics/import_logs.py --url=http://localhost/piwik access_log.0 --idsite=2 --recorders=4 --enable-http-errors --enable-http-redirects --enable-static --enable-reverse-dns --enable-bots
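
For scale, a back-of-the-envelope calculation using the figures from this ticket (20 million lines, 15-20 records/sec here, and the roughly 300 req/s Cyril reported earlier). The helper name is made up.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_import(lines, records_per_sec):
    """Days needed to import `lines` log lines at a sustained rate."""
    return lines / records_per_sec / SECONDS_PER_DAY


at_15 = days_to_import(20_000_000, 15)    # about 15.4 days
at_20 = days_to_import(20_000_000, 20)    # about 11.6 days
at_300 = days_to_import(20_000_000, 300)  # under one day
```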

comment:116 Changed 2 years ago by guardian

6846 lines parsed, 215 lines recorded, 2 records/sec
6846 lines parsed, 215 lines recorded, 0 records/sec
6846 lines parsed, 215 lines recorded, 0 records/sec
6846 lines parsed, 215 lines recorded, 0 records/sec
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.6/threading.py", line 484, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/share/www/piwik/misc/log-analytics/import_logs.py", line 865, in _run
    self._record_hit(hit)
  File "/usr/local/share/www/piwik/misc/log-analytics/import_logs.py", line 929, in _record_hit
    headers={'User-Agent' : hit.user_agent},
  File "/usr/local/share/www/piwik/misc/log-analytics/import_logs.py", line 699, in call
    return self._call_wrapper(self._call, expected_content, path, args, headers)
  File "/usr/local/share/www/piwik/misc/log-analytics/import_logs.py", line 691, in _call_wrapper
    message = e.reason
AttributeError: 'HTTPError' object has no attribute 'reason'

It's the very same log file, which tells me the problem isn't caused by an invalid line. So far I don't know why it returns an empty reply; the nginx and php-fpm logs remain empty about that.

comment:117 Changed 2 years ago by Cyril (cbay)

(In [6215]) Refs #703 Circumvent a Python bug with urllib2.HTTPErrors.
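
One defensive way to build a message from an exception that may or may not carry a `reason` attribute, as in the traceback above. This is an illustrative sketch, not the change committed in [6215]; `error_message` and the two dummy exception classes are hypothetical.

```python
def error_message(exc):
    """Best-effort message for an exception that may lack `reason`."""
    reason = getattr(exc, "reason", None)
    if reason is not None:
        return str(reason)
    code = getattr(exc, "code", None)
    if code is not None:
        return "HTTP error %s" % code
    return str(exc)


class WithReason(Exception):
    """Dummy exception that exposes `reason`."""
    reason = "connection refused"


class WithCodeOnly(Exception):
    """Dummy exception that exposes only `code`."""
    code = 502
```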

comment:118 Changed 2 years ago by Cyril (cbay)

(In [6216]) Refs #703 Call close() manually instead of relying on the garbage collector. It seems to help reducing the concurrent connections count.

comment:119 Changed 2 years ago by guardian

Thank you Cyril,

Closing the connection indeed mitigates the problem. Still, I'm getting SIGSEGV when using the import script; so far Piwik is the only PHP software causing SIGSEGV on my servers. I'm investigating. As per Piwik's wiki I disabled APC, but that didn't help.

comment:120 Changed 2 years ago by asterixcapri

I tried the importer, but it is too slow for real websites with huge log files. Just an idea: how about rewriting it in PHP, to avoid slow HTTP requests and use internal Piwik classes instead?

If I want to try to do it, do you have any suggestions?
Thank you

comment:121 follow-up: Changed 2 years ago by Cyril (cbay)

asterixcapri: what's your import speed? What's the limiting factor? Python or PHP? How large are your log files?

The Python import script can easily max out 8 Piwik PHP processes on my machines, so I doubt the HTTP requests represent a large overhead. Besides, an easy way to reduce that overhead would be to aggregate hits in a single request, which is already planned.

comment:122 Changed 2 years ago by matt (mattab)

julianito, 15 req per second really is too slow. What kind of server do you use? Is it busy doing other things or is it mostly idle? Piwik is pretty IO intensive.

What req/s do you get in "dry-run"? it should be very high since Piwik does not do any http request then (this would help making sure the problem is with the http requests)

comment:123 Changed 2 years ago by oliverhumpage

Just wanted to add that I get really bad speeds unless I ramp up the recorders way beyond the number of cores: as I may have said above, I get to about 40 (on a 4-core virtual machine) before noticing no further improvements. There is definitely a lag somewhere in the HTTP requests or MySQL connections on some setups. There is also the issue of the "tail end", noted above, where the number of recorders slowly drops off as IPs run out (I did try altering the script to give new IPs to the recorder with the lowest workload, but it didn't make much odds).

Having a PHP script that talked to the piwik system directly, instead of via http requests, would likely speed things up hugely for all users.

comment:124 Changed 2 years ago by oliverhumpage

Could I check something? There's an --enable-static option, which I think you put in at my request (thanks!). I've noticed it seems to put static files (.jpg, .css, .js etc) into both Downloads and Pageviews. I seem to remember without --enable-static they went into neither.

Is it possible to have static files *only* put into Downloads? I'm not running the very very latest version, so apologies if this is already fixed, but I didn't see it in a changelog.

comment:125 Changed 2 years ago by guardian

I also find Piwik is very slow per se :/

But matt saying it's really IO intensive made me think about my PHP-FPM configuration again. The pool used by Piwik has:

php_admin_value[open_basedir] = /var/www:/usr/local/share/www:/usr/share/php5:/tmp

And I guess open_basedir is part of the explanation why Piwik is so slow on my setup. I'm now using database session storage and using piwik feels faster already

comment:126 Changed 2 years ago by tgrondin

The script works fine as an importer, with one exception. We use Varnish/Pound proxies in front of the websites and have to pass the incoming website IP via the X-Forwarded-For variable.

Is there a way to have that picked up in place of the %v?

Example log format: "\"%{X-Forwarded-For}i\" %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined

comment:127 follow-up: Changed 2 years ago by Cyril (cbay)

tgrondin: use mod_rpaf http://stderr.net/apache/rpaf/

comment:128 in reply to: ↑ 127 Changed 2 years ago by tgrondin

Replying to Cyril:

tgrondin: use mod_rpaf http://stderr.net/apache/rpaf/

That works.

Thank you

comment:129 Changed 2 years ago by matt (mattab)

From email report:

I am trying to use import_logs.py for my Apache logs. I have a problem because I use HAProxy, and the script does not seem to take into account my regex for the following log format:

  LogFormat "%v %{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" vhost_combined


X.com 90.28.198.22 - - [11/Apr/2012:20:52:12 +0200] "GET /index.php HTTP/1.1" 200 - "http://www.X.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0) Gecko/20100101 Firefox/10.0"

I launch the script:
python /var/www/piwik/misc/log-analytics/import_logs.py --url=http://127.0.0.1/piwik /var/log/apache2/access_webmail.log --log-format-regex "%v %{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" --add-sites-new-hosts --recorders=1 

0 lines parsed, 0 lines recorded, 0 records/sec
Parsing log /var/log/apache2/access_webmail.log...

Is it the same problem as the IIS log, or maybe a different problem? or is the --log-format-regex wrong maybe?
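One possible explanation, not confirmed in this thread, is that --log-format-regex expects a Python regular expression rather than an Apache LogFormat string. The sketch below shows what a regex equivalent of the LogFormat above might look like, checked against the sample line from the report. The group names (host, ip, date, path, and so on) are assumptions and should be compared with the ones your version of import_logs.py uses.

```python
import re

# Hypothetical regex equivalent of the LogFormat above. The group names
# (host, ip, date, path, status, length, referrer, user_agent) are assumptions
# about what the importer expects; check them against your script version.
VHOST_XFF_REGEX = (
    r'(?P<host>\S+) (?P<ip>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
    r'"\S+ (?P<path>\S+) [^"]*" (?P<status>\d+) (?P<length>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# The sample log line from the report above.
sample_line = (
    'X.com 90.28.198.22 - - [11/Apr/2012:20:52:12 +0200] '
    '"GET /index.php HTTP/1.1" 200 - "http://www.X.com/" '
    '"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0) Gecko/20100101 Firefox/10.0"'
)

match = re.match(VHOST_XFF_REGEX, sample_line)
fields = match.groupdict() if match else {}
for name in ("host", "ip", "path", "status"):
    print(name, "=", fields.get(name))
```

Testing a candidate regex against a handful of real lines like this, before running the importer, makes it easier to tell a format problem from a connection problem.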

comment:130 in reply to: ↑ 121 Changed 2 years ago by matt (mattab)

  • Description modified (diff)

Replying to Cyril:

asterixcapri: what's your import speed? What's the limiting factor? Python or PHP? How large are your log files?

The Python import script can easily max out 8 Piwik PHP processes on my machines, so I doubt the HTTP requests represent a large overhead. Besides, an easy way to reduce that overhead would be to aggregate hits in a single request, which is already planned.

I created a ticket for this specific performance improvement: #3134

comment:131 Changed 2 years ago by ma2thieu

I access Piwik at www.mydomain.com/piwik (so I can use the mydomain.com SSL certificate), but the import script calls www.mydomain.com/piwik/piwik.php, which generates log lines, which are imported, which generate more log lines, and so on... How can I avoid being stuck in a loop like that?

comment:132 Changed 2 years ago by matt (mattab)

ma2thieu, good point, we should probably deal with this issue in the script itself

  • By default the script should discard all log lines containing piwik.php
  • To fix this issue in the meantime, you can setup a different log file for the Piwik Vhost or path, in the apache configuration. Then, only import the main www log file which will not contain the piwik requests.
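The first bullet could be approximated by a small filter like the one below. This is only a sketch: is_tracker_request and the tracker_names default are hypothetical names, not part of import_logs.py.

```python
def is_tracker_request(path, tracker_names=("piwik.php",)):
    """Return True when the request path targets the tracking endpoint."""
    # Drop the query string, then compare the last path segment.
    bare_path = path.split("?", 1)[0]
    return bare_path.rsplit("/", 1)[-1] in tracker_names


paths = [
    "/index.php",
    "/piwik/piwik.php?idsite=1&rec=1",
    "/blog/piwik.php.html",
]
# Keep only the hits that were not generated by the tracker itself.
kept = [p for p in paths if not is_tracker_request(p)]
print(kept)
```

Comparing the last path segment, rather than searching for the substring anywhere in the line, avoids discarding unrelated pages whose URL merely contains "piwik.php".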

comment:133 Changed 2 years ago by matt (mattab)

  • Milestone changed from 1.7.x - Piwik 1.7.x to 1.7.2 - Piwik 1.7.2

The last known important bug is the IIS log parsing.

Apart from that, is everyone here happy with the script as it is?

Apart from performance, which can be slow for some of you, is the script ready for prime time?

Thanks for your feedback

comment:134 Changed 2 years ago by Dominux

Hi, I want to import Lotus Domino logs but I get this error when I launch import_logs.py:

# python26 /var/www/html/piwikbeta/misc/log-analytics/import_logs.py --url=http://www.dominux.fr/dominux/blog.nsf /usr/local/domino/domzi/weblogs/access.log --idsite=1234 --recorders=4 --enable-http-errors --enable-http-redirects --enable-static --enable-bots
Fatal error: Piwik returned an invalid response: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0

:
Any idea ?

PS: sorry, I can't post the whole error because my posts are rejected: too many external URLs...

comment:135 Changed 2 years ago by oliver

Hi, I tried to import apache logs into an empty Piwik database. The log-file has a lot of different hosts, that should be added automatically.

./import_logs.py --login=user --password=pwd --add-sites-new-hosts --url=http://site/piwik ./apache.log

However, the script shows the following error and won't exit.

Purging Piwik archives for dates: 2012-05-10
Traceback (most recent call last):
  File "./import_logs.py", line 1221, in <module>
    main()
  File "./import_logs.py", line 1197, in main
    Recorder.invalidate_reports()
  File "./import_logs.py", line 949, in invalidate_reports
    idSites=','.join(stats.piwik_sites),
TypeError: sequence item 2: expected string or Unicode, int found
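The TypeError above is what str.join raises when the sequence mixes strings and integers. A minimal sketch of the failure and one way around it follows; site_ids is a made-up list, and this is not necessarily how the script was eventually fixed.

```python
# Hypothetical mix of string and integer site IDs, similar to what the
# traceback suggests ("sequence item 2 ... int found").
site_ids = ["1", "2", 3]

try:
    ",".join(site_ids)
except TypeError as err:
    print("join failed:", err)

# Converting every item to a string first makes the join safe.
id_sites = ",".join(str(site_id) for site_id in site_ids)
print(id_sites)
```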

comment:136 Changed 2 years ago by matt (mattab)

@Dominux the --url should point to your Piwik base URL; it seems it's pointing somewhere else.

@oliver and all other users who have problems parsing specific logs. It would be great if you could look at the import_logs.py file and try to patch it to add the format of your logs. Then please submit the patch here once your logs are parsed, we will add it.

The code is simple to understand at the start of: http://dev.piwik.org/svn/trunk/misc/log-analytics/import_logs.py

comment:137 Changed 2 years ago by tiouk

Patch to enable IIS6/7/7.5 log imports, probably needs some work as I don't do python!

IIS can have varied data in the logs, this should work with default log settings for 6/7/7.5 and when selecting all options from IIS6. Log format in default is W3C. No timezone offset in my logs, so TZ = 0.

Seems to work; I am in the process of importing logs, at 1.5M lines parsed at present, 1.47 recorded.

http://mike.org.uk/import_logs_py_diff.txt

comment:138 follow-up: Changed 2 years ago by matt (mattab)

@tiouk, great thanks for the patch!

Can you please confirm you tested your patch on the 3 IIS log formats?

I will test & commit after your confirmation, your patch is very appreciated!

comment:139 in reply to: ↑ 138 Changed 2 years ago by tiouk

Replying to matt:

Can you please confirm you tested your patch on the 3 IIS log formats?

I have tested the iis6_w3c_all format on real logs. I think there may be an issue with IIS7/7.5: although the formats work, the logs I tested against appear to have extra options selected, and I can't find a definitive default format on the MS site except for IIS6. I have some tweaked format regexes, and will post them when tested.

Can anyone using IIS please post the first 5 lines from a few logs and state whether you have added any non default options to the logging.

IIS logs separate the query string from the base URL; I haven't addressed that.

comment:140 Changed 2 years ago by Cyril (cbay)

(In [6260]) Refs #703 Fixed bug when invalidating reports with --add-sites-new-hosts.

comment:141 Changed 2 years ago by Cyril (cbay)

oliver: thanks for the bug report, can you try again? That should be fixed.

comment:142 follow-ups: Changed 2 years ago by Cyril (cbay)

tiouk: thanks. Unless matt insists on doing so, I'd like to commit your patch myself as I'd like to refactor it a bit.

Can you provide some log lines for each format? I'll put them in the tests/logs directory which provides automatic testing for each log format.

comment:143 Changed 2 years ago by Cyril (cbay)

(In [6261]) Refs #703 Typo (renamed file logs/ncsa_extended.log)

comment:144 in reply to: ↑ 142 Changed 2 years ago by tiouk

Replying to Cyril:

tiouk: thanks. Unless matt insists on doing so, I'd like to commit your patch myself as I'd like to refactor it a bit.

Can you provide some log lines for each format? I'll put them in the tests/logs directory which provides automatic testing for each log format.

Will see if I can do tomorrow. There's an updated file on the same URL as the old patch, it has a bit of work to skip lines in an IIS log with --check-iis-logs-format and displays the log options line in --debug along with updated regexs. Cheers Mike

comment:145 Changed 2 years ago by tiouk

Typo above, the IIS log detection should be --check-iis-log-option

Short IIS6 log with all options:
http://mike.org.uk/iis6_all_options.txt

comment:146 in reply to: ↑ 142 Changed 2 years ago by matt (mattab)

Replying to Cyril:

tiouk: thanks. Unless matt insists on doing so, I'd like to commit your patch myself as I'd like to refactor it a bit.

Please commit after checking all is working well I'm glad you're back :)

comment:147 Changed 2 years ago by tiouk

Above diff [post 137] has been updated, small regex changes as status only recorded 2 digits in some log types.

IIS7.5 Default short log (Has extra header lines due to IIS restart and 3 lines IPV6 [recorded as invalid log lines] both could appear in live logs)

http://mike.org.uk/iis75_default_log.txt

comment:148 Changed 2 years ago by tiouk

IIS logs the cs-uri-stem & cs-uri-query separately, do they need concatenating?

Replace \S+ after the <path> match with '(?P<querystr>\S+) ' in the regex's

line 1199 
            if config.options.check_iis_log_format:
                hit = Hit(
                    filename=filename,
                    lineno=lineno,
                    status=match.group('status'),
                    full_path=match.group('path') + config.options.query_string_delimiter + match.group('querystr'),
                    is_download=False,
                    is_robot=False,
                    is_error=False,
                    is_redirect=False,
                )
            else:
                hit = Hit(
                    filename=filename,
                    lineno=lineno,
                    status=match.group('status'),
                    full_path=match.group('path'),
                    is_download=False,
                    is_robot=False,
                    is_error=False,
                    is_redirect=False,
                )

comment:149 follow-up: Changed 2 years ago by aspectra

Hi, I got following error during importing log files (apache):

419375 lines parsed, 6906 lines recorded, 57 records/sec
419703 lines parsed, 6960 lines recorded, 54 records/sec
419896 lines parsed, 6976 lines recorded, 16 records/sec
419896 lines parsed, 6976 lines recorded, 0 records/sec
419896 lines parsed, 6976 lines recorded, 0 records/sec
419896 lines parsed, 6976 lines recorded, 0 records/sec
Fatal error: didn't receive the expected response. Response was <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
        <title>Piwik &rsaquo; Error</title>
        <meta http-equiv="Conte..
You can restart the import of "/opt/piwiktests/logfile.gz" from the point it failed by specifying --skip=296312 on the command line.

with debug:

2012-05-13 12:17:59,837: [DEBUG] Error when connecting to Piwik: <urlopen error didn't receive the expected response. Response was <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
   "SAME URL as above">
<html>
<head>
        <title>Piwik &rsaquo; Error</title>
        <meta http-equiv="Conte.. >

The end of the log file shows:

5190432 lines parsed, 759119 lines recorded, 28 records/sec
5190432 lines parsed, 759147 lines recorded, 28 records/sec
Purging Piwik archives for dates: 2012-05-12
2012-05-13 13:49:58,052: [DEBUG] Error when connecting to Piwik: <urlopen error Piwik returned an invalid response:
<div style='word-wrap: break-word; border: 3px solid red; padding:4px; width:70%; background-color:#FFFF96;'>
            <strong>There is an error. Please report the message and full backtrace in the <a href='?module=Proxy&action=redirect&url=http://forum.piwik.org' target='_blank'>Piwik forums</a> (plea>

I'm running the script on a 12 Core (24 with HT) server with this command:

python import_logs.py --url=http://piwiktest.aspectra.com/piwiklog/ --idsite=6 --recorders=12 --output=logtest_20120513.log --skip=358639 -d -d /opt/piwiktests/logfile.gz &

Sometimes the script recovers from the errors and continues recording; sometimes it stops working and shows the --skip= option. As far as I can see, the script stops working if the error occurs 4 times in a row. Is this some kind of timeout, and can it be configured?

Best regards,
André

comment:150 follow-up: Changed 2 years ago by tiouk

Diff for auto-detecting IIS logs based on log header line 4; this should be able to decode any IIS log file whatever the options selected. One proviso: it does need both the date and time options, which can be deselected. Without these options it is not possible to generate stats, unless you don't mind all site visits occurring at the same time!

http://mike.org.uk/import_logs_py_diff_2.txt

Ready patched file based on version 6170 for people to test on different IIS formats, let me know how you get on. http://mike.org.uk/import_logs_py.txt (rename to .py)
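The idea of deriving the parser from the "#Fields:" header can be sketched as follows. This is an illustration only, not the code from the diff: FIELD_PATTERNS, regex_from_fields_header and the group names are hypothetical, and the header and log line are the IIS 7.5 examples reported further down this ticket.

```python
import re

# Illustrative mapping from W3C field names to named groups (assumed names).
# Any field not listed here is matched but not captured.
FIELD_PATTERNS = {
    "date": r"(?P<date>\d{4}-\d{2}-\d{2})",
    "time": r"(?P<time>\d{2}:\d{2}:\d{2})",
    "cs-uri-stem": r"(?P<path>\S+)",
    "cs-uri-query": r"(?P<query_string>\S+)",
    "c-ip": r"(?P<ip>\S+)",
    "cs(User-Agent)": r"(?P<user_agent>\S+)",
    "cs(Referer)": r"(?P<referrer>\S+)",
    "cs-host": r"(?P<host>\S+)",
    "sc-status": r"(?P<status>\d+)",
    "sc-bytes": r"(?P<length>\S+)",
}


def regex_from_fields_header(header):
    """Build a line regex from an IIS '#Fields:' header."""
    prefix = "#Fields:"
    if not header.startswith(prefix):
        raise ValueError("not a #Fields header")
    names = header[len(prefix):].split()
    # Without date and time, hits cannot be placed on a timeline.
    if "date" not in names or "time" not in names:
        raise ValueError("the date and time fields are required")
    return re.compile(" ".join(FIELD_PATTERNS.get(n, r"\S+") for n in names))


header = ("#Fields: date time s-sitename s-computername s-ip cs-method "
          "cs-uri-stem cs-uri-query s-port cs-username c-ip cs-version "
          "cs(User-Agent) cs(Referer) cs-host sc-status sc-substatus "
          "sc-win32-status sc-bytes cs-bytes time-taken")
line = ("2012-05-13 01:26:04 W3SVC7 XNW-WEB01 46.4.192.150 GET / - 443 - "
        "213.133.113.84 HTTP/1.1 Hetzner+System+Monitoring - www.wolke.com "
        "302 0 0 677 91 31")

hit = regex_from_fields_header(header).match(line).groupdict()
print(hit["date"], hit["time"], hit["ip"], hit["path"], hit["status"])
```

Because the regex is generated from whatever fields the header lists, the same function handles logs with different options selected, as long as date and time are present.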

comment:151 in reply to: ↑ 150 Changed 2 years ago by tiouk

Replying to tiouk:

Ready patched file based on version 6170 for people to test on different IIS formats, let me know how you get on. http://mike.org.uk/import_logs_py.txt (rename to .py)

USE ON TEST SYSTEM ONLY, NOT LIVE. If you must, use --debug to check first!

comment:152 Changed 2 years ago by tiouk

Patch 6213 applied.
Error generated by +íå+íàøëîñü+ôîðìû+äëÿ+îòïðàâêè; in iis log.

  File "/usr/lib64/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xed in position 154: invalid continuation byte

2012-05-03 04:29:30 W3SVC230092276 SERVER1 172.16.65.25 GET /Discussions/tabid/56/forumid/1/postid/36/scope/posts/language/en-GB/Default.aspx+Result:+íå+íàøëîñü+ôîðìû+äëÿ+îòïðàâêè; - 80 - 94.153.71.194 HTTP/1.0 Mozilla/0.6+Beta+(Windows) - http://xxx-xxxxxx.net/Discussions/tabid/56/forumid/1/postid/36/scope/posts/language/en-GB/Default.aspx+Result:+%ED%E5+%ED%E0%F8%EB%EE%F1%FC+%F4%EE%F0%EC%FB+%E4%EB%FF+%EE%F2%EF%F0%E0%E2%EA%E8; xxx-xxxxxx.net 404 0 2 1814 715 62

comment:153 Changed 2 years ago by tiouk

Above patches tweaked for the add-sites-new-hosts option, so please download again.

Issue with option --add-sites-new-hosts.

The following are created as different hosts, they should all be considered as the same, this is with IIS log files with auto IIS log patch.

domain.dom
www.domain.dom
www.domain.dom:80
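One way to treat these three values as a single site is to normalize the hostname before looking it up. The sketch below is an assumption about the desired policy (drop a default port and a leading "www."), and normalize_host is a hypothetical helper, not a function from the script. It does not attempt to handle IPv6 literals.

```python
def normalize_host(host, default_ports=("80", "443")):
    """Lower-case a hostname and drop a default port and a leading 'www.'."""
    host = host.strip().lower()
    name, sep, port = host.rpartition(":")
    # Only strip the port when it is one of the default HTTP(S) ports.
    if sep and port in default_ports:
        host = name
    if host.startswith("www."):
        host = host[len("www."):]
    return host


hosts = ["domain.dom", "www.domain.dom", "www.domain.dom:80"]
print(sorted(set(normalize_host(h) for h in hosts)))
```

Whether "www." should really be folded into the bare domain depends on the site, so a real implementation would probably make that part optional.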

comment:154 Changed 2 years ago by matt (mattab)

Thanks for your work we'll commit the patch soon, please keep posting if you improve it!

To all users with IIS logs please test @tiouk patch above if you can, thanks

comment:155 Changed 2 years ago by tiouk

Diff for IIS patch based on latest version 6260 from trunk.
http://mike.org.uk/import_logs_py_diff_3.txt

IIS patched full version 6260 this is the latest version from trunk with Cyril's latest committed fixes. Please use to test IIS log import, with usual live system warning, although I have used on mine!
http://mike.org.uk/import_logs_py.txt Just rename to .py

comment:156 Changed 2 years ago by Cyril (cbay)

Thanks, I'll take care of it as soon as possible.

comment:157 Changed 2 years ago by tiouk

How do you want to handle decode errors? I have a couple of logs that bomb due to decode errors on a single line; I was thinking of just invalidating the line.

I replaced line 1127:

            line.decode(config.options.encoding)
with
            try:
                line = line.decode(config.options.encoding)
            except UnicodeError as err:
                # Unicode Decode Error, the line is badly formatted.
                logging.debug('Unicode decode error  ' + line)
                logging.debug(err)
                invalid_line(line)
                continue

comment:158 in reply to: ↑ 149 Changed 2 years ago by aspectra

Replying to aspectra:

Hi, I got following error during importing log files (apache):
419375 lines parsed, 6906 lines recorded, 57 records/sec
419896 lines parsed, 6976 lines recorded, 0 records/sec

I changed following constants:

PIWIK_MAX_ATTEMPTS = 9
PIWIK_DELAY_AFTER_FAILURE = 5

The errors are still occurring but the script is able to reconnect and does not exit.
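The behaviour controlled by these two constants can be pictured as a retry loop. The sketch below is an assumption about the general shape, not the script's actual code: call_with_retries and its parameters are hypothetical, and the values 9 and 5 are the ones quoted above. It logs a warning on each failure so that retries do not go unnoticed.

```python
import logging
import time

MAX_ATTEMPTS = 9          # value quoted in the comment above
DELAY_AFTER_FAILURE = 5   # seconds, value quoted in the comment above


def call_with_retries(func, max_attempts=MAX_ATTEMPTS,
                      delay=DELAY_AFTER_FAILURE, sleep=time.sleep):
    """Call func(), retrying on IOError up to max_attempts times."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except IOError as err:
            last_error = err
            logging.warning("attempt %d/%d failed: %s", attempt, max_attempts, err)
            if attempt < max_attempts:
                sleep(delay)
    raise RuntimeError("gave up after %d attempts: %s" % (max_attempts, last_error))


# Demonstration with a simulated endpoint that fails twice, then succeeds.
attempts = []


def flaky_request():
    attempts.append(1)
    if len(attempts) < 3:
        raise IOError("simulated tracker error")
    return "ok"


# A no-op sleep keeps the demonstration instantaneous.
result = call_with_retries(flaky_request, sleep=lambda seconds: None)
print(result, "after", len(attempts), "attempts")
```

Raising the limits lets an import survive transient errors, but every retry costs at least one delay, so frequent failures will still show up as a low records/sec figure.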

comment:159 Changed 2 years ago by reetz

Hi,

I have played around with the logfile importer in Piwik 1.7.2rc8. I was surprised that there were a lot of static files in the results, as I had not specified --enable-static on the command line.

I checked the logfile and looked in the importer code and found out that many static files of (at least) Typo3 websites are not recognized, because Typo3 suffixes them with ?timestamp and the importer regex only checks the end of the filename (e.g. typo3temp/javascript_0b12553063.js?1283017207).

Can this be adjusted? Would be great!

Many thanks and a nice weekend
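A minimal sketch of the requested behaviour: strip the query string before looking at the file extension. STATIC_EXTENSIONS is an illustrative subset and is_static is a hypothetical helper, neither taken from the importer.

```python
import posixpath

# Illustrative subset of extensions treated as static resources.
STATIC_EXTENSIONS = frozenset(["css", "js", "gif", "jpg", "jpeg", "png", "ico"])


def is_static(full_path, query_delimiter="?"):
    """Return True if the path, ignoring any query string, is a static file."""
    path = full_path.split(query_delimiter, 1)[0]
    extension = posixpath.splitext(path)[1].lstrip(".").lower()
    return extension in STATIC_EXTENSIONS


print(is_static("typo3temp/javascript_0b12553063.js?1283017207"))
print(is_static("/index.php?file=a.js"))
```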

comment:160 Changed 2 years ago by tiouk

Version 6260 with IIS patch.

Previously decoded line generates error when posting to Piwik.

2012-05-16 14:15:08,670: [DEBUG] Error when connecting to Piwik: 'ascii' codec can't encode characters in position 126-128: ordinal not in range(128)

Raw log line:
2010-09-29 07:00:25 W3SVC3 172.16.65.22 GET /xxxxxxxxx/xxxxxxxxxxxxxxx/xxxxxxxxxxxxxxxxxxxxxxxxxxxx/tabid/129/language/en-US/Default.aspx+Result:+ýòî+íå+ôîðóì+/+ãîñòåâàÿ+êíèãà+(ëèáî+îòñóòñòâóåò+ïîäêëþ÷åíèå+ê+èíòåðíåòó) 

args from def _call_wrapper(self, func, expected_response, *args, **kwargs):
('/piwik.php', {'cdt': '2010-09-29 07:00:25', '_cvar': u'{"1":["Not-Bot","Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+Deepnet+Explorer+1.5.0;+.NET+CLR+1.0.3705)"]}', 'apiv': '1', 'cip': u'173.75.247.80', 'urlref': '', 'token_auth': u'b13583ea29ddb846f48d1ea1721a8eac', 'idsite': '3', 'url': u'http://xxxxxxxxxxx.xxx.xx/xxxxxxxxx/xxxxxxxxxxxxxxx/xxxxxxxxxxxxxxxxxxxxxxxxxxxx/tabid/129/language/en-US/Default.aspx+Result:+\xfd\xf2\xee+\xed\xe5+\xf4\xee\xf0\xf3\xec+/+\xe3\xee\xf1\xf2\xe5\xe2\xe0\xff+\xea\xed\xe8\xe3\xe0+(\xeb\xe8\xe1\xee+\xee\xf2\xf1\xf3\xf2\xf1\xf2\xe2\xf3\xe5\xf2+\xef\xee\xe4\xea\xeb\xfe\xf7\xe5\xed\xe8\xe5+\xea+\xe8\xed\xf2\xe5\xf0\xed\xe5\xf2\xf3)?-', 'rec': '1', 'dp': '0'}, {'Content-type': 'application/x-www-form-urlencoded', 'User-Agent': u'Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+Deepnet+Explorer+1.5.0;+.NET+CLR+1.0.3705)'})

comment:161 Changed 2 years ago by Cyril (cbay)

(In [6268]) Refs #703 Do not crash on encoding errors (thanks invalid_line(line)

comment:162 Changed 2 years ago by Cyril (cbay)

Oops, I screwed my commit message, I meant to say "thanks tiouk" (bad copy/paste). Centralized version control software sucks :)

comment:163 Changed 2 years ago by Cyril (cbay)

(In [6269]) Refs #703 Added support for IIS logs, thanks to tiouk.

comment:164 Changed 2 years ago by Cyril (cbay)

IIS parsing is now supported in trunk. I had to refactor quite a bit of code, so I strongly suggest that everyone test the script again, as I may have introduced bugs.

The patch was inspired by tiouk's own diff, thanks to him, but the code itself is quite different as I wanted to have a more generic approach. There are no new options, IIS is expected to work just like other log formats.

comment:165 Changed 2 years ago by tiouk

Nice work, glad to see you've integrated rather than included as an addon.

Seems to work fine, so far with logs that I know worked with the old patched version, but I still get the issue in post 160 with lines containing extended chars being posted to Piwik causing the script to choke.

I was going to throw a pile of logs at it, but my test VM tops out at 25 rec/sec and the DL380G7 I got down to install as a dedicated Piwik server has issues, HP engineer on site tomorrow!

comment:166 follow-up: Changed 2 years ago by Cyril (cbay)

I can't reproduce the error you get in post 160. I copy/pasted the line you specified and saved it in an IIS log file; maybe the resulting file has a different encoding than yours. Could you somehow create a minimal file that exhibits this issue and give me a link to download it? What command line did you use?

comment:167 in reply to: ↑ 166 Changed 2 years ago by tiouk

Replying to Cyril:

I can't reproduce the error you get in post 160. I copy/pasted the line you specified and saved it in a IIS log file, maybe the resulting file has a different encoding than yours. Could you somehow create a minimal file that exhibits this issue and give me a link to download it? What command line did you use?

Head, tail & sed = http://mike.org.uk/badiis7_log.txt

comment:168 Changed 2 years ago by tiouk

Forgot the command line:

python /home/piwik/public_html/misc/log-analytics/import_logs.py --url=http://192.168.10.113:88 badiis7.log --idsite=25 --recorders=4 --enable-http-redirects --enable-static --enable-bots --enable-reverse-dns --debug

comment:169 Changed 2 years ago by mlrxnet

Hi, we have some issues with the latest build of the import on our IIS 7.5 logs. The following messages are shown and the import stopped:

...
2012-05-17 22:07:07,918: [DEBUG] Error when connecting to Piwik: 'ascii' codec can't encode character u'\xe4' in position 32: ordinal not in range(128)
Fatal error: 'ascii' codec can't encode character u'\xe4' in position 32: ordinal not in range(128)
You can restart the import of "u_ex120513.log" from the point it failed by specifying --skip=214 on the command line.

Below are lines 213-215, with the field specification, for review:

Fields:

#Fields: date time s-sitename s-computername s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs-version cs(User-Agent) cs(Referer) cs-host sc-status sc-substatus sc-win32-status sc-bytes cs-bytes time-taken

Line 213:

2012-05-13 01:26:04 W3SVC7 XNW-WEB01 46.4.192.150 GET / - 443 - 213.133.113.84 HTTP/1.1 Hetzner+System+Monitoring - www.wolke.com 302 0 0 677 91 31

Line 214:

2012-05-13 01:26:36 W3SVC7 XNW-WEB01 46.4.192.150 GET /imagefilm/deutsch.swf - 80 - 216.246.45.86 HTTP/1.0 gsa-crawler+(Enterprise;+S5-HKQ2CJT3FSJJT;+info@tobaccopeople.com) - www.m600.com 200 0 0 34601 236 468

Line 215:

2012-05-13 01:27:03 W3SVC7 XNW-WEB01 46.4.192.150 GET /de/kontakt/anfahrt - 80 - 95.211.139.1 HTTP/1.0 Mozilla/5.0+(compatible;+AcoonBot/4.10.8;++http://www.acoon.de/robot.asp) - www.wolke.com 200 0 0 21376 193 124

Command Line:

import_logs.py  --debug --url=http://statistik.mlr-xnet.de u_ex120513.log --idsite=8 --recorders=4 --enable-http-errors --enable-http-redirects --enable-static --enable-bots --enable-reverse-dns
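Errors of the form "'ascii' codec can't encode character" typically appear when text containing non-ASCII characters is implicitly converted to bytes. One common remedy, sketched here in Python 3 syntax (the script itself targets Python 2) and not necessarily the fix that was committed, is to encode every value to UTF-8 explicitly before URL-encoding the request body. encode_params and the example URL are hypothetical.

```python
from urllib.parse import urlencode


def encode_params(params):
    """URL-encode a dict of text values, encoding each value to UTF-8 first."""
    encoded = dict((key, value.encode("utf-8")) for key, value in params.items())
    return urlencode(encoded)


# u'\xe4' is the character named in the error message above.
body = encode_params({"url": "http://www.example.com/f\u00e4hre", "rec": "1"})
print(body)
```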

comment:170 Changed 2 years ago by Cyril (cbay)

(In [6270]) Refs #703 Fixed an encoding issue with non-ascii paths or referrers.

comment:171 Changed 2 years ago by Cyril (cbay)

Can you try with the latest commit? That should be fixed.

comment:172 Changed 2 years ago by tiouk

Yes, works for bad log in comment 167.

comment:173 Changed 2 years ago by matt (mattab)

Thanks for all your work and feedback.

As per the last comments, it seems IIS log parsing is now fully working, which was the last critical open bug.

To all users listening here, is your log format now recognized as expected?

As per comments in this thread, the last remaining changes to make are:

  • Change constants

PIWIK_MAX_ATTEMPTS = 9
PIWIK_DELAY_AFTER_FAILURE = 5
The errors are still occurring but the script is able to reconnect and does not exit.

  • recognize static files whose URL does not end with the file extension because a cache-buster query string follows it, e.g. x.js?ts=319743413

I checked the logfile and looked in the importer code and found out that many static files of (at least) Typo3 websites are not recognized, because Typo3 suffixes them with ?timestamp and the importer regex only checks the end of the filename (e.g. typo3temp/javascript_0b12553063.js?1283017207).

These sound like easy changes. Otherwise looks like the script is ready for prime time?

comment:174 Changed 2 years ago by Cyril (cbay)

I don't think changing constants is a good idea. If your Piwik install is returning frequent errors, you'd have to find out why and fix it. Increasing the constants is, in my opinion, sweeping dust under the carpet.

I'll have a look at the static files issue, that doesn't seem normal at all, since the query string was supposed to be trimmed anyway.

comment:175 Changed 2 years ago by matt (mattab)

I'll have a look at the static files issue, that doesn't seem normal at all, since the query string was supposed to be trimmed anyway.

I think query string is not stripped anymore by default (good).

If your Piwik install is returning frequent errors, you'd have to find out why and fix it.

It's true but sometimes PHP errors are random and can happen frequently, so either we allow user to change it easily or better put safe defaults allowing happy user experience : )

comment:176 Changed 2 years ago by Cyril (cbay)

I don't really like PHP, as you must know, but what do you mean by random errors? What kind of fatal error can be triggered randomly and still not be considered a bug?!

Anyway, change the constants if you think it's safer. But please at least change the logging.debug to something like logging.warning so that we don't silently fail. Otherwise, people may (and will) complain that the script is horribly slow, which is expected if it has to make each request to Piwik multiple times.

Regarding the query string, you're right, I'll try to fix that behaviour.

comment:177 Changed 2 years ago by Cyril (cbay)

(In [6271]) Refs #703 Better handling of query strings.

comment:178 follow-up: Changed 2 years ago by Cyril (cbay)

reetz: can you try again with the latest commit? That should be fixed.

comment:179 Changed 2 years ago by mlrxnet

The latest build 6274 works like a charm with the IIS logs. Thanks a lot for the good work.

comment:180 in reply to: ↑ 178 Changed 2 years ago by reetz

Replying to Cyril:

reetz: can you try again with the latest commit? That should be fixed.

Hi, unfortunately now it is not working at all.

First I just replaced import_logs.py, but now I did a complete reinstall with trunk-r6281, and any time I start

python /www/trunk/misc/log-analytics/import_logs.py --url=http://piwik.xxxxx.de /access-log-201203 --idsite=1

I get the following result:

0 lines parsed, 0 lines recorded, 0 records/sec
Parsing log /access-log-201203...
Traceback (most recent call last):
  File "/www/trunk/misc/log-analytics/import_logs.py", line 1287, in <module>
    main()
  File "/www/trunk/misc/log-analytics/import_logs.py", line 1251, in main
    parser.parse(filename)
  File "/www/trunk/misc/log-analytics/import_logs.py", line 1184, in parse
    hit.path, hit.query_string = hit.full_path.split(config.options.query_string_delimiter, 1)
ValueError: need more than 1 value to unpack
1 lines parsed, 0 lines recorded, 0 records/sec
1 lines parsed, 0 lines recorded, 0 records/sec
1 lines parsed, 0 lines recorded, 0 records/sec
1 lines parsed, 0 lines recorded, 0 records/sec
1 lines parsed, 0 lines recorded, 0 records/sec
1 lines parsed, 0 lines recorded, 0 records/sec
1 lines parsed, 0 lines recorded, 0 records/sec
[...repeating...]

The same happens if I try to import your example logs in /trunk/misc/log-analytics/tests/logs

Have there been some changes to the command line? On the original 1.7.2-rc8 there are no problems with my logfile or your test files.
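The ValueError in the traceback is what tuple unpacking raises when split() returns a single element, which happens for any path without a query string. A minimal sketch of a split that tolerates both cases, using str.partition, is shown below. split_path and join_path are hypothetical helpers and may differ from the fix that was committed.

```python
def split_path(full_path, delimiter="?"):
    """Split a request path into (path, query_string); the query may be empty."""
    path, _, query_string = full_path.partition(delimiter)
    return path, query_string


def join_path(path, query_string, delimiter="?"):
    """Rebuild the full path, adding the delimiter only when there is a query."""
    return path + delimiter + query_string if query_string else path


for full_path in ("/robots.txt", "/index.php?id=3&lang=de"):
    path, query_string = split_path(full_path)
    print(repr(path), repr(query_string), join_path(path, query_string))
```

Unlike split(delimiter, 1), partition always returns three items, so the unpacking cannot fail, and join_path avoids leaving a dangling delimiter on URLs that had no query string.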

comment:181 Changed 2 years ago by Cyril (cbay)

(In [6282]) Refs #703 Fixed a bug introduced in the IIS parsing refactoring, thanks reetz.

comment:182 follow-up: Changed 2 years ago by Cyril (cbay)

reetz: oops, that should be fixed, thanks for the report.

comment:183 in reply to: ↑ 182 Changed 2 years ago by reetz

Replying to Cyril:

reetz: oops, that should be fixed, thanks for the report.

Hi,

Yes, it's working now. Many thanks.

Just one little thing: all action URLs in "Visitor Log" have a "?" at the end. It doesn't bother me, but perhaps others will find this confusing.

By the way: is there a possibility to exclude certain paths from being imported?

comment:184 Changed 2 years ago by Cyril (cbay)

(In [6283]) Refs #703 Do not append the query string delimiter if there's no query string, thanks reetz.

comment:185 Changed 2 years ago by Cyril (cbay)

Good catch, the query string delimiter was always appended since the latest refactoring. That should be fixed now.

It's indeed possible to exclude some paths: check out --exclude-path and --exclude-path-from.

comment:186 Changed 2 years ago by tiouk

After 17.3M good lines, got another encode error, script version 6283

Fatal error: 'ascii' codec can't encode character u'\u2013' in position 20: ordinal not in range(128)

One line log: http://mike.org.uk/test5_log.txt

comment:187 Changed 2 years ago by Cyril (cbay)

(In [6295]) Refs #703 Fixed encoding bug with a non-ASCII user-agent.

comment:188 Changed 2 years ago by Cyril (cbay)

tiouk: thanks again for the bug report, it's fixed.

comment:189 Changed 23 months ago by thegcat

Doesn't work on python 3. 2to3 fixes most of the stuff, there's still some little snags like the base64 decoder expecting bytes and getting a string.
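The base64 snag mentioned here is a typical Python 3 porting issue: the base64 functions work on bytes, so text has to be encoded before the call and decoded afterwards. A small sketch follows, with a placeholder credentials value that is purely illustrative.

```python
import base64

credentials = "user:pwd"  # placeholder value, for illustration only

# Python 3: encode the text to bytes before base64, decode the result to text.
token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
round_trip = base64.b64decode(token.encode("ascii")).decode("utf-8")
print(token, round_trip == credentials)
```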

comment:190 Changed 23 months ago by Cyril (cbay)

It's not expected to work with Python 3. It may be supported later, but that's definitely not a requirement for a very first version.

Still, you're welcome to send patches to fix issues with 2to3, as long as they're rather simple (we wouldn't want to add much complexity just to support Python 3, at least not for now).

comment:191 Changed 23 months ago by thegcat

AFAIK everything that runs on 3 should run on 2.7+, so I'd think also developing against 3 would avoid a later bigger update.

Regarding patches: I'm no python dev and I don't have the resources to take care of possible problems with an interpreter not supported by upstream. I'll be happy to help when python 3 is supported, until then I'll use a python 2 slot for this.

Thanks for the effort though, not being able to import log files has been a major blocker for piwik here :-)

comment:192 Changed 23 months ago by matt (mattab)

There are only 2-3 days left before release / freeze - is everyone happy with the script for a V1?

comment:193 Changed 23 months ago by oliverhumpage

I've just updated to the latest version and realised that the addition of the file.seek(0) calls stops you from being able to pipe in logs through stdin: is it possible to disable that with a flag? It's really, really useful being able to pipe logs straight from apache.

comment:194 Changed 23 months ago by Cyril (cbay)

oliverhumpage: seek is actually only used when the log format is autodetected, and I really can't see a way to avoid this (due to IIS). So you have to explicitly specify the format (with --log-format-name) when reading from stdin.

Let me know if that doesn't work (it should).
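The constraint described here can be sketched as follows: autodetection reads the beginning of the stream and then rewinds, which only works on seekable input, so a pipe needs an explicit format. choose_format, the PipeLike class and the toy detection rule are hypothetical and only illustrate the control flow, not the script's detection logic.

```python
import io


def choose_format(stream, format_name=None):
    """Return the format name to use, or raise if it cannot be determined."""
    if format_name is not None:
        return format_name  # explicit format: no need to read ahead or rewind
    if not stream.seekable():
        raise ValueError("cannot autodetect the format of a non-seekable "
                         "stream; pass the format name explicitly")
    first_line = stream.readline()
    stream.seek(0)  # rewind so that parsing starts from the first line
    # Toy rule for illustration: IIS logs start with '#' header lines.
    return "iis" if first_line.startswith("#") else "common"


class PipeLike(io.StringIO):
    """Stand-in for stdin when it is a pipe: readable but not seekable."""

    def seekable(self):
        return False


print(choose_format(io.StringIO("#Software: Microsoft IIS\n")))
print(choose_format(PipeLike("www.example.com 9.8.7.6 - - ...\n"), "common_vhost"))
```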

comment:195 Changed 23 months ago by oliverhumpage

@Cyril

Ah, you're right - if I specify a regex or name then it stops complaining, which is fine. However, there is still a problem that no lines are being read from stdin. If I import from a file with a couple of lines in, I get results (it says lines have been parsed). However, if I specify the file as "-" and copy/paste those same lines, nothing gets logged, not even an error. I switched on --debug and did this for you:

#	/path/to/piwik/misc/log-analytics/import_logs.py --add-sites-new-hosts --config=/path/to/piwik/config/config.ini.php --url='http://piwik.local/' --recorders=1 --enable-static --log-format-name=common_vhost --debug -
2012-05-29 20:56:53,619: [DEBUG] Accepted hostnames: all
2012-05-29 20:56:53,778: [DEBUG] Piwik URL is: http://piwik.local/
2012-05-29 20:56:53,778: [DEBUG] No token-auth specified
2012-05-29 20:56:53,779: [DEBUG] No credentials specified, reading them from "/path/to/piwik/config/config.ini.php"
2012-05-29 20:56:53,780: [DEBUG] Using credentials: (login = admin, password = xxxxxxxxxxxxxxxxxxxxxxxxxx)
2012-05-29 20:56:54,067: [DEBUG] Authentication token token_auth is: xxxxxxxxxxxxxxxxxxxxxxxxxx
2012-05-29 20:56:54,067: [DEBUG] Resolver: dynamic
0 lines parsed, 0 lines recorded, 0 records/sec
2012-05-29 20:56:54,068: [DEBUG] Launched recorder
Parsing log /dev/stdin...
0 lines parsed, 0 lines recorded, 0 records/sec
0 lines parsed, 0 lines recorded, 0 records/sec
www.domain.co.uk 9.8.7.6 - - [29/May/2012:12:41:19 +0100] "GET /robots.txt HTTP/1.1" 403 212 "-" "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT)"
0 lines parsed, 0 lines recorded, 0 records/sec
0 lines parsed, 0 lines recorded, 0 records/sec
www.domain.co.uk 9.8.7.6 - - [29/May/2012:12:41:21 +0100] "GET /robots.txt HTTP/1.1" 403 212 "-" "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT)"
0 lines parsed, 0 lines recorded, 0 records/sec
0 lines parsed, 0 lines recorded, 0 records/sec
^C
Logs import summary
-------------------

    0 requests imported successfully
    0 requests were downloads
    0 requests ignored:
        0 invalid log lines
        0 requests done by bots, search engines, ...
        0 HTTP errors
        0 HTTP redirects
        0 requests to static resources (css, js, ...)
        0 requests did not match any known site
        0 requests did not match any requested hostname

Website import summary
----------------------

    0 requests imported to 0 sites
        0 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:



Performance summary
-------------------

    Total time: 6 seconds
    Requests imported per second: 0.0 requests per second


As you can see, pasting the lines in did nothing. Importing the exact same lines but from a file gave:

Logs import summary
-------------------

    0 requests imported successfully
    2 requests were downloads
    2 requests ignored:
        0 invalid log lines
        0 requests done by bots, search engines, ...
        2 HTTP errors
        0 HTTP redirects
        0 requests to static resources (css, js, ...)
        0 requests did not match any known site
        0 requests did not match any requested hostname

Also, --show-progress appears to have got itself switched on even though it's not specified on the command line (this happens with or without --debug).

Have I got something weird in my installation, or can you reproduce this?

comment:196 Changed 23 months ago by EspadaV8

comment:197 Changed 23 months ago by matt (mattab)

(In [6403]) Fixes #3139 Adding new 'bots' parameter to the Tracking API. When set to 1 Piwik will record the request even if it is made by a bot (currently detected are only Googlebot and some Bing bots)

Refs #703 - Cyril, when --enable-bots is set, can you please make sure the parameter &bots=1 is also added to the piwik.php request? Only in this case, though. Thanks!

comment:198 follow-up: Changed 23 months ago by matt (mattab)

EspadaV8, the bug is my fault: I packaged RC2 with a debug statement. Please try with RC3, it should work OK!

comment:199 in reply to: ↑ 198 Changed 23 months ago by EspadaV8

Replying to matt:

EspadaV8, the bug is my fault: I packaged RC2 with a debug statement. Please try with RC3, it should work OK!

Awesome, RC3 seems to be importing everything nicely :) Thanks

comment:200 Changed 23 months ago by matt (mattab)

  • Description modified (diff)
  • Resolution set to fixed
  • Status changed from reopened to closed

We have now set up a demo of Piwik log analytics.

The demo at: http://demo-log-analytics.piwik.org/ has only 1 day of data for now.

It has 2 websites to show default import mode and full mode (with bots, files, errors, etc.)

I will now close this ticket as it is getting quite long, but I have opened another one to keep track of the next features: #3163

Please post all new bug reports and feature suggestions in that ticket: #3163

comment:201 Changed 23 months ago by Cyril (cbay)

(In [6433]) Refs #703 Set bots=1 accordingly.
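
As an illustration of the behaviour requested in comment:197, the sketch below adds bots=1 to the tracking request only when bots are enabled. It is not the code from [6433]: the function name build_tracking_query and its enable_bots argument are hypothetical, and only a few Tracking API parameters (idsite, rec, url) are included for brevity.

```python
from urllib.parse import urlencode


def build_tracking_query(idsite, url, enable_bots=False):
    """Build the query string for a piwik.php tracking request (sketch)."""
    args = {"idsite": idsite, "rec": 1, "url": url}
    if enable_bots:
        # Ask Piwik to record the hit even if it looks like a bot.
        args["bots"] = 1
    return urlencode(args)
```

Without the flag the query string carries no bots parameter at all, so the default behaviour of the Tracking API is left unchanged.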
