Opened 23 months ago

Last modified 4 days ago

#3163 new New feature

Log analytics list of improvements

Reported by: matt Owned by:
Priority: normal Milestone: Future releases
Component: Core Keywords:
Cc: Sensitive: no

Description (last modified by matt)

In Piwik 1.8 we released a great new feature: importing access logs to generate statistics.

The V1 release works very well (it was tracked in #703), but there are ideas to improve it. This ticket is a placeholder for all ideas and discussions related to the Log Analytics feature!

New features

  • Track non-bot activity only. When --enable-bots is not specified, it would be a nice improvement if we:
    • exclude visits with more than 150 actions per visitorID to block crawlers (detected at the Python level by counting requests for that IP in the queue)
    • exclude visits that have no User-Agent, or only one of the very basic ones used by most bots
    • exclude all requests from a client when one of its first requests is for /robots.txt -- if we see a robots.txt request in the middle of a visit, we could stop tracking its subsequent requests
    • check that /index.php?minimize_js=file.js is counted as a static file since it ends in .js


After that, bot & crawler detection would be much better. A sketch of these heuristics follows.
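A minimal Python-level sketch (all names here - hit, looks_like_bot, BASIC_BOT_AGENTS - are hypothetical, not part of import_logs.py):

    # Hypothetical sketch of the proposed Python-level bot heuristics.
    from collections import defaultdict

    MAX_ACTIONS_PER_VISITOR = 150
    BASIC_BOT_AGENTS = ('', '-')  # placeholder for the "very basic" UAs used by bots

    actions_per_ip = defaultdict(int)
    robots_txt_ips = set()

    def looks_like_bot(hit):
        """Return True if this queued request should be skipped (no --enable-bots)."""
        actions_per_ip[hit.ip] += 1
        if actions_per_ip[hit.ip] > MAX_ACTIONS_PER_VISITOR:
            return True
        if hit.user_agent.strip() in BASIC_BOT_AGENTS:
            return True
        if hit.path == '/robots.txt':
            robots_txt_ips.add(hit.ip)
            return True
        return hit.ip in robots_txt_ips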

  • Support the Accept-Language header and forward it to Piwik via the &lang= parameter. That might also be useful to users who need this data in a custom plugin.


  • we could make it easy to delete logs for one day, so as to reimport a single log file
    • This would be a new option to the Python script. It would reuse the code from the Log Delete feature, but would only delete one day. The Python script would call the CoreAdmin API, for example, deleting this single day for a given website. This would make it easy to re-import data that didn't work the first time or was bogus.
  • Detect when log lines are re-imported and only import them once (see the sketch after this list).
    • Implementation: add a new table piwik_log_lines (hash_tracking_request, day)
    • In the Piwik Tracker, before looping on the bulk requests, SELECT all the log lines that have already been processed on this day (WHERE hash_tracking_request IN (a,b,c,d) AND day=?) and skip these requests during the import
    • After the bulk requests are processed in the piwik.php process, INSERT the (hash, day) pairs in bulk
  • By default this feature would be enabled only for the "Log import" script,
    • via a parameter that identifies the request as coming from the log import (&li=1 / import_logs=1)
    • but it may later be useful to all users of the Tracking API as a general deduping service.
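A rough sketch of the hashing side of the dedup idea (hit.line and hit.date are assumed attribute names; the table and column names are the ones proposed above):

    # Hypothetical: compute the (hash_tracking_request, day) pair for one log line.
    import hashlib

    def dedup_key(hit):
        line_hash = hashlib.md5(hit.line.encode('utf-8')).hexdigest()
        return (line_hash, hit.date.strftime('%Y-%m-%d'))

    # Before sending a bulk payload, the Tracker would SELECT the pairs already in
    # piwik_log_lines (WHERE hash_tracking_request IN (...) AND day = ?) and skip
    # the matching hits; after the bulk insert, it would INSERT the new pairs in bulk.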

Performance

How do you debug performance? First of all, you can run the script with --dry-run to see how many log lines per second are parsed; it should typically be between 2,000 and 5,000. When you don't do a dry run, the script inserts new pageviews and visits by calling the Piwik API.
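For example, a dry run could look like this (the paths and URL are placeholders):

python /path/to/piwik/misc/log-analytics/import_logs.py --url=http://your.piwik.example/ --dry-run /var/log/apache2/access.log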

Other tickets

  • #3867 cannot resume with line number reported by skip for ncsa_extended log format
  • #4045 autodetection hangs on a weird formatted line

Attachments (7)

vhost_combined.patch (639 bytes) - added by aseques 23 months ago.
Document the debian vhost_combined format
force_hostname.patch (2.1 KB) - added by aseques 23 months ago.
Force hostname patch
README.apache_log_recorders.patch (2.4 KB) - added by oliverhumpage 22 months ago.
u_ex120813.zip (1.9 KB) - added by aspectra 21 months ago.
WinZip compressed file
test_c-ip_iis_log.log (1.2 KB) - added by jamesvl011 20 months ago.
Sample IIS file for testing variations of c-ip field
README_nginx_log_format.diff (873 bytes) - added by phikai 18 months ago.
Log Parser README Update with Nginx Log Format for Common Complete
WMS_20130523.log (2.2 KB) - added by brgsousa 11 months ago.
Log for WMS 9.0


Change History (160)

comment:1 Changed 23 months ago by matt (mattab)

  • Description modified (diff)
  • Priority changed from normal to major

comment:2 Changed 23 months ago by oliverhumpage

Could I just re-ask an unanswered problem from ticket #703 - http://dev.piwik.org/trac/ticket/703#comment:195 ? If instead of specifying a file you do

cat /path/to/log | import_logs.py [options] -

then does it work for you, or do you just get 0 lines imported? Because with the latest version I'm getting 0 lines imported, and that means I can't log straight from apache (and hence the README is wrong too).

comment:3 Changed 23 months ago by Cyril (cbay)

oliverhumpage: I couldn't reproduce this issue. Do you get it with --dry-run too? Could you send a minimal log file?

comment:4 Changed 23 months ago by ddeimeke

Counting Downloads:

In a podcast project I want to count only the downloads of file types "mp3" and "ogg". In another project it would be nice to count only the PDF downloads.

Another topic in this area: how are downloads counted? Not every occurrence of the file in the logs is a download. For instance, I am using an HTML5 player. Users might hear one part of the podcast on their first visit and other parts on succeeding visits. All together that would be one download.

A possible "solution" (or maybe a workaround): sum up all the "bytes transferred" and divide by the largest "bytes transferred" for a certain file.
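A hypothetical post-processing sketch of this workaround (parsed_hits, a list of (path, bytes) pairs extracted from the log, is an assumed input; none of this is in import_logs.py):

    # Estimate "full downloads" per file: sum of transferred bytes divided by the
    # largest single transfer seen for that file (assumed to be the full size).
    from collections import defaultdict

    total = defaultdict(int)
    largest = defaultdict(int)
    for path, nbytes in parsed_hits:
        total[path] += nbytes
        largest[path] = max(largest[path], nbytes)

    downloads = {p: total[p] / float(largest[p]) for p in total if largest[p]}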

comment:5 Changed 23 months ago by cdgraff

Feature request: support Icecast logs. Currently we use AWStats, but it would be great if we could move to Piwik.

comment:6 Changed 23 months ago by oliverhumpage

@Cyril

Having spent some time looking into it, and working out exactly which revision caused the problem, I think it's down to the regex I used in --log-format-regex not working any more. It turns out the regex format in import_logs.py has had the group <timezone> added to it, which seems to be required by code further down the script.

Could you update the README so the middle of the regex changes from:

[(?P<date>.*?)]

to

[(?P<date>.*?) (?P<timezone>.*?)]

This will then make it all work.

Thanks,

Oliver.

comment:7 Changed 23 months ago by Cyril (cbay)

(In [6471]) Refs #3163 Fixed regexp in README.

comment:8 Changed 23 months ago by Cyril (cbay)

Oliver: indeed, I've just fixed it, thanks.

comment:9 Changed 23 months ago by aseques

I've been fiddling with this tool and it looks really nice. The biggest issue I've found is when using --add-sites-new-hosts:
it's quite difficult in my case (using a control panel) to add the required %v:%p fields to the custom log format.
What I do have is a log for every domain, so being able to specify the hostname manually would do the trick for me.

In the current situation launching this:

python /var/www/piwik/misc/log-analytics/import_logs.py \
  --url=https://server.example.com/tools/piwik --recorders=4 --enable-http-errors \
  --enable-http-redirects --enable-static --enable-bots --add-sites-new-hosts /var/log/apache2/example.com-combined.log

Just produces this:

Fatal error: the selected log format doesn't include the hostname: 
  you must specify the Piwik site ID with the --idsite argument

Having a --hostname example.com option (the same as the filename in my case) that forces the hostname (such as -idsite-fallback=) would fix my issue.

comment:10 Changed 23 months ago by oliverhumpage

I'm not a piwik dev, but what I think you're trying to do is:

For every logfile, get its filename (which is also the hostname), check if a site with that hostname exists in piwik: if it does exist, import the logfile to it; if it doesn't exist, create it, then import the logfile to it.

The way I'd do this is to write an import script which:

  • looks at all logfile names
  • accesses the piwik API to get a list of all existing websites (URLs and IDs)
  • for any logfile which doesn't appear in the list uses another API call to create the site and get the ID of the newly created site
  • imports all the logfiles with --idsite set to the right value for the logfile

http://piwik.org/docs/analytics-api/reference/ gives the various API calls, looks like SitesManager.getAllSites and SitesManager.addSite will do the job (e.g. call http://your.piwik.install/?module=API&method=SitesManager.getAllSites&format=xml&token_auth=xxxxxxxxxx to get all current sites, etc).
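A minimal sketch of that workflow (the Piwik URL and token are placeholders, and the JSON response handling is assumed; only the getAllSites/addSite method names come from the API reference):

    # Hypothetical helper around the Piwik HTTP API.
    import json
    import urllib.parse
    import urllib.request

    PIWIK = 'http://your.piwik.install/index.php'  # placeholder
    TOKEN = 'xxxxxxxxxx'                           # placeholder token_auth

    def api(method, **params):
        params.update(module='API', method=method, format='json', token_auth=TOKEN)
        url = PIWIK + '?' + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as response:
            return json.load(response)

    existing = api('SitesManager.getAllSites')
    # ...map each logfile's hostname to an idsite here (response shape assumed)...
    created = api('SitesManager.addSite', siteName='example.com',
                  urls='http://example.com')
    # then run import_logs.py with --idsite set accordingly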

HTH (a real piwik person might have a better idea)

Oliver.

comment:11 Changed 23 months ago by aseques

Thanks for your answer Oliver. Your process is perfectly fine, but I'd rather avoid having to code something that could be avoided by extending the functionality of --add-sites-new-hosts just a little.
And thanks for the links too, I'll have a look.

Changed 23 months ago by aseques

Document the debian vhost_combined format

comment:12 Changed 23 months ago by aseques

It would be nice to document the standard format provided (at the moment only on Debian/Ubuntu) that gives Piwik the hostname it requires.

The format is this:

LogFormat "%v:%p %h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" vhost_combined

You can see the latest version from debian's apache2.conf http://anonscm.debian.org/gitweb/?p=pkg-apache/apache2.git;a=blob;f=debian/config-dir/apache2.conf;h=50545671cbaeb1f170d5f3f1acd20ad3978f36ea;hb=HEAD

See attached a small change to the README file.

Changed 23 months ago by aseques

Force hostname patch

comment:13 Changed 23 months ago by aseques

After looking at the code, I created a patch that adds a new option called --force-hostname, which expects a string with the hostname.
When it's set, the value of host will always be the one given by --force-hostname.
This makes it possible to treat logfiles in ncsa_extended or common format as if they were in a complete format (creating idsites when needed, and so on).

comment:14 follow-up: Changed 23 months ago by Cyril (cbay)

(In [6474]) Refs #3163 Added the --log-hostname option, thanks to aseques.

comment:15 Changed 23 months ago by Cyril (cbay)

(In [6475]) Refs #3163 Added a reference in README to the Debian/Ubuntu default vhost_combined, thanks aseques.

comment:16 Changed 23 months ago by Cyril (cbay)

Thanks aseques, both your feature request and your patch were fine; I've just committed them. Attention: I renamed the option to --log-hostname to keep coherence with the --log prefix.

comment:17 Changed 23 months ago by aseques

Great, this will be so useful for me :)

comment:18 Changed 22 months ago by Hexxer

Hi,

I'm not sure whether this is the right place for my ticket or problem.
I have a problem importing access logs from my shared webspace. I've copied the text from here: http://forum.piwik.org/read.php?2,90313


Hi,

I'm on a shared webspace with SSH support. I'm trying your import script to analyse my Apache logs.
I got it to work, but sometimes there are "Fatal errors" and I have no idea why. If I restart it without "skip", it is the same "skip line" every time.

Example:

4349 lines parsed, 85 lines recorded, 0 records/sec

4349 lines parsed, 85 lines recorded, 0 records/sec

4349 lines parsed, 85 lines recorded, 0 records/sec

4349 lines parsed, 85 lines recorded, 0 records/sec

Fatal error: Forbidden

You can restart the import of "/home/log/access_log_piwik" from the point it failed by specifying --skip=326 on the command line.


I tried to figure out at what line the script ends with that fatal error, but I can't. If I restart it at "skip=327", it runs to the end and all works fine. The same problem occurs with some other access logs ("access_log_1.gz" and so on), but I'm not sure why it stops. Could it be a malformed line in the access log? Which line should I check?

Regards

comment:19 Changed 22 months ago by Cyril (cbay)

Hexxer: you're getting a HTTP Forbidden from your Piwik install when importing the logs, you need to find out why.

comment:20 Changed 22 months ago by Hexxer

How do you know that?
It stops every time at the same line, and if I skip that it runs 10 or 15 minutes without a problem (up to this line it needs 2 minutes or so).

Regards

comment:21 Changed 22 months ago by matt (mattab)

Do you know the exact line that causes a problem? if you put only this line, does it also fail directly? thanks!

comment:22 Changed 22 months ago by matt (mattab)

Benaka is implementing Bulk tracking in the ticket #3134 - The python script will simply have to send a JSON array:

["requests":[url1,url2,url3],"token_auth":"xyz"]

I suppose we can do some basic test to see which value works best?
Maybe 50 or 100 tracking requests at once? :)
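For illustration, sending such a payload from Python might look like this sketch (the endpoint and the exact payload shape are assumptions until #3134 is finalized):

    # Hypothetical bulk-tracking POST; the real format is defined in #3134.
    import json
    import urllib.request

    payload = {
        "requests": ["?idsite=1&rec=1&url=http%3A%2F%2Fexample.com%2Fpage1",
                     "?idsite=1&rec=1&url=http%3A%2F%2Fexample.com%2Fpage2"],
        "token_auth": "xyz",
    }
    request = urllib.request.Request(
        'http://your.piwik.example/piwik.php',
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json'})
    urllib.request.urlopen(request)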

comment:23 Changed 22 months ago by Hexxer

Hi,

Replying to matt:

Do you know the exact line that causes a problem? if you put only this line, does it also fail directly? thanks!

No, that's my problem. It stops (see above) with the hint to restart with "--skip=326", but I don't know what that means. Line 326 in the access log looks like all the others.

Replying to matt:

I suppose we can do some basic test to see which value works best?
Maybe 50 or 100 tracking requests at once? :)

Do you mean me? I can't test during the day because I'm sitting behind a proxy at work. I can do something in the evening - but, sorry, I have a 5-month-old young lady who needs my love and attention :-)

Changed 22 months ago by oliverhumpage

comment:24 Changed 22 months ago by oliverhumpage

Could I submit a request for an alteration to the README? I've just had a massive spike in traffic, and --recorders=1 just doesn't cut it when piping directly from Apache's CustomLog :) Because each Apache process hangs around waiting to log its request before moving on to the next request, it started jamming the server.

Setting a higher --recorders seems to have eased it, and there are no side effects that I can see so far.

Suggested patch attached to this ticket.

comment:25 Changed 22 months ago by ludopaquet

Hi,

Is there any documentation about the regex format for import_logs.py?

We would like to import a file with this AWStats LogFormat:

%time2 %other %cluster %other %method %url %query %other %logname %host %other %ua %referer %virtualname %code %other %other %bytesd %other %other
Thanks for your help,

Ludovic

comment:26 Changed 22 months ago by lewmat21

I am trying to set up a daily import of the previous day's log. My issue is that my host date-stamps the log file; how can I set it to import the log file with yesterday's date on it?

Here is the format of my log files:
access.log.%Y-%m-%d.log

comment:27 Changed 22 months ago by sc_

Thanks a lot for all your great work!
The server log file analytics works great on my server.

I am using a lighttpd server and added the Accept-Language header to accesslog.format:

accesslog.format = "%h %V %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Accept-Language}i\""
(see http://redmine.lighttpd.net/projects/lighttpd/wiki/Docs:ModAccessLog)

I wonder if it would be possible to add support for the Accept-Language header to import_logs.py?
So that the country could then be guessed from the Accept-Language header when GeoIP isn't installed.

comment:28 in reply to: ↑ 14 Changed 22 months ago by law

Replying to Cyril:

(In [6474]) Refs #3163 Added the --log-hostname option, thanks to aseques.

Thanks for the possibility to import logs, and also thanks for the log-hostname patch.
Not sure whether it is the patch or whether it is caused by using --recorders > 1, but on the first run with --add-sites-new-hosts I got 13 sites created for the same hostname.

comment:29 Changed 22 months ago by andrewc

I'm having a similar problem to Hexxer. When I do a --dry-run I get no errors, but when adding to Piwik it falls over at about the same spot. It's not one offending log file or line of a log file that's causing it. I'll attach the output with debugging on below. I've run the script multiple times, removing the line where the script fell over, removing the log file where it fell over, etc. It always dies around line 9000-10000 in the 3rd log file.

I'm not sure if this is of interest, but when doing a dry run the script does ~600 lines/sec; when importing to Piwik it does ~16.

comment:30 Changed 22 months ago by andrewc

The output file is here. Akismet was marking the attachment as spam

comment:31 Changed 22 months ago by Cyril (cbay)

(In [6509]) Refs #3163 updated README to suggest increasing the --recorders value.

comment:32 Changed 22 months ago by Cyril (cbay)

oliverhumpage: thanks, I've committed your diff.

ludopaquet: no doc yet, I suggest you take a look at the code, taking _COMMON_LOG_FORMAT as example.

lewmat21: I suppose each log line has its own date anyway, so it doesn't matter what the filename is.

sc_: I don't think using Accept-Language to guess the country is a good idea. As the header name says, it's about languages (locales), not countries. First, many languages are spoken in several countries: if the Accept-Language says you accept English, what country would you pick? Second, people can have an Accept-Language that doesn't match their country. I personally surf with English as my Accept-Language, whereas I'm French and live in France.

law: can you reproduce the issue? If so, can you give me the access log as well as the full command line you used?

andrewc: can you edit line 741 and increase the 200 value to something like 10000? It will print the full error message instead of only the first 200 characters, which is not enough to get the Piwik error.

comment:33 Changed 22 months ago by sc_

@Cyril:
As far as I know Piwik does the same when the GeoIP plugin isn't used:
http://piwik.org/faq/troubleshooting/#faq_65
The location is then guessed from en-us, fr-fr etc.

But the more important point is that it would be useful for website development to know what languages the people who visit my website use. So it would be great if support for the Accept-Language header could be added.

comment:34 Changed 22 months ago by sc_

Sorry for the wrong formatting (the preview didn't work).
Here is the correct link:

http://piwik.org/faq/troubleshooting/#faq_65

comment:35 follow-up: Changed 22 months ago by andrewc

@Cyril:
Here's the output file with the full error messages.

comment:36 in reply to: ↑ 35 Changed 22 months ago by andrewc

Replying to andrewc:

@Cyril:
Here's the output file with the full error messages.

Sorry this is the link https://www.dropbox.com/sh/zat1m6lqphndpny/wH6n4mDaD6/output0907.txt

comment:37 Changed 22 months ago by Cyril (cbay)

sc_: OK, I didn't know about this. Considering GeoIP will be integrated into Piwik soon (see #1823), which is a much better solution, I don't think we should modify the import script to use Accept-Language headers.

andrewc: your Piwik install (the PHP part) is returning errors:
Only one usage of each socket address (protocol/network address/port) is normally permitted

You need to find out why and fix it. It's unrelated to the import script.

comment:38 Changed 22 months ago by fjohn

Thanks for your great work. We've given log import some time now and have a few ideas/problems.

I don't know, should I open new tickets or write here?

One major thing is how to bring the number of visitors/unique visitors down to make it more similar to JavaScript tracking and Google Analytics.

I understand that we don't have cookies and other config information to identify the visitor.

We've managed to bring the number of pageviews/actions down severalfold (from 5 times more than JavaScript tracking to 2 times more), or much more in a few cases (like from 100 times more than JavaScript).

Our ideas and changes include (we assumed that we should get numbers as close as possible to those from JavaScript tracking):

  • counting only GET requests (not HEAD and POST)
  • not counting visits without OS data (we've tracked 5 websites with 10k visits/day and 30k views, and there is a very small number of real users without an OS)


A few workarounds :)

  • limit to 100 actions per visitorID to block crawlers that are not on the list (this also works for Ajax websites that fetch parts of the page with GET and normal PHP files)


  • changed the static file check to block all sorts of minimizers, for example /index.php?minimize_js=file.js (very common)


  • custom code for image thumbnails /img_thumb.php?file=picture.png&w=100&h=300


We ended up with a number of actions (pageviews) about twice the JavaScript number, without affecting the number of visitors (about 50% higher than JavaScript).

Our extreme case is 300 views (JavaScript tracking) versus 30,000 views with the import script; after the changes, about 570 views with the import script.

comment:39 follow-up: Changed 22 months ago by Cyril (cbay)

fjohn:

  • why shouldn't we count POST requests? HEAD, I can agree, but POSTs are legitimate requests made by regular browsers
  • what kind of user-agent doesn't have OS data? Aren't they bots anyway?
  • limiting actions: that's on the PHP-side, I'll let matt answer this

Regarding excluding some specific paths (index.php?minimize_js, img_thumb.php, etc.): there are a gazillion "popular" paths that could be excluded, but I don't think it's a good idea to include those by default, for several reasons:

  • it would be a cumbersome list to maintain, and people could argue what paths deserve to be included or not, depending on how popular the script is
  • there would be false positives (what if I have a legitimate img_thumb.php that should be included in page views?)
  • most importantly, such a list would be quite large, and that would really slow down the importing process (as each hit would have to be compared with all excluded paths).

So it's not something that we should do by default. We have --exclude-path and --exclude-path-from options that allow you to create your own list of paths to exclude, depending on your site.

What we may do is create such a list in Piwik (in an external file), but not enable it by default. People that want to use this could add --exclude-path-from=common_excluded_paths.txt (for instance). What do you think of this, matt?

comment:40 in reply to: ↑ 39 Changed 22 months ago by fjohn

Replying to Cyril:

fjohn:

  • why shouldn't we count POST requests? HEAD, I can agree, but POSTs are legitimate requests made by regular browsers

But POST is also used by Ajax requests all the time (and this is not what we would count with JS). We've just simplified that to drop anything other than GET.

  • what kind of user-agent doesn't have OS data? Aren't they bots anyway?

For me the question is: does a "real user" always send OS data? In our logs there were, for example, curl, Python libraries, XRumer, scrapers and many more odd requests that weren't on the bot list.

  • limiting actions: that's on the PHP-side, I'll let matt answer this

Yes, it is. But we have a lot of bots that were not on the list; I don't know how it works, but they showed up in the log-import profile, not in the JavaScript profile.

Regarding excluding some specific paths (index.php?minimize_js, img_thumb.php, etc.): there are a gazillion "popular" paths that could be excluded, but I don't think it's a good idea to include those by default, for several reasons:

  • it would be a cumbersome list to maintain, and people could argue what paths deserve to be included or not, depending on how popular the script is

I agree with you; we identified 2 of them (thumbnailers and minimizers) and we have very universal code for it - for example, (if picture and &w and &h) identifies the 3 most popular thumbnail scripts (including those in WordPress and osCommerce).

We did it because on oscommerce shop we had 1000 more page views than on javascript - should we accept that?

  • there would be false positives (what if I have a legitimate img_thumb.php that should be included in page views?)

Is it counted with JavaScript tracking? From our tests, no.

  • most importantly, such a list would be quite large, and that would really slow down the importing process (as each hit would have to be compared with all excluded paths).

We have only 2 more "if statements" in the current for loops. Still, you're right, that can grow :)

So it's not something that we should do by default. We have --exclude-path and --exclude-path-from options that allow you to create your own list of paths to exclude, depending on your site.

What we may do is create such a list in Piwik (in an external file), but not enable it by default. People that want to use this could add --exclude-path-from=common_excluded_paths.txt (for instance). What do you think of this, matt?

That could be a good idea; it would be nice to test this on a larger number of websites/scripts. We've tested 5 regular websites and a few other scripts.

comment:41 follow-up: Changed 22 months ago by Cyril (cbay)

Ajax requests do not always use POST. For instance, jQuery (the most popular JavaScript library) uses GET by default:
http://api.jquery.com/jQuery.ajax/

Regarding the rest of the comments: just to make things clear, I wasn't advocating against what you did for your specific site, but against doing this by default in the script. I much prefer adding options to the import script (it has quite a few already) to let users customize it for their own needs, rather than trying to have sane defaults, which we really can't do as there's too much diversity on the Web :)

comment:42 in reply to: ↑ 41 Changed 22 months ago by fjohn

Cyril:

About Ajax - that is why we set a limit of 100 page views per visitor. We found a case where one user made from 700 to 1000 views thanks to Ajax GET requests.

About the whole thing: sure, I understand that. But we want to use it for a hosting company, and we are not making any "special case" - we are trying to test log import on as many websites as we can.

So we just wanted to share some of our tests and ideas. In most cases everything works well, but WordPress and osCommerce are very popular.

Showing customers 30k views instead of 300 is not the best way to prove that log import is working fine. On an IPB forum we had 5 times more pageviews; now it's less than twice the JS figure.

comment:43 Changed 21 months ago by matt (mattab)

@oliverhumpage and to all listening in this ticket: is there any other pending bug or important missing feature in this script?

Are you all happy with it? Note: we are working on performance next.

comment:44 Changed 21 months ago by bjrubble

My Apache log gives hostnames rather than IP addresses. It looks like the import script sends the hostname, which the server side tries to interpret as a numeric IP value, with the result that all hostnames translate to 0.0.0.0. I added a call to socket.gethostbyname() in the import script, but it's undone all the performance gains I got through the bulk request patch.

Is there some simple fix that I'm missing here?

comment:45 follow-up: Changed 21 months ago by jamesvl011

Some IIS logs do the same as bjrubble mentioned in the comment above - in their c-ip section, a host name may be found instead of an IP address.

This causes the regex (which only accepts digits) to fail when parsing that line, and I believe the line gets thrown out, resulting in a bad import.

comment:46 Changed 21 months ago by stefanx

Because Piwik lacks the capability of tracking news feed subscribers (and I don't want to use FeedBurner), I would like to import that particular information from the Apache logs. All other web requests are tracked successfully by Piwik, and I want the feed users' information merged into the same Piwik website. For instance, my news feed is located at www.domain.com/rss.xml; how can I import only that particular information into Piwik?

comment:47 Changed 21 months ago by fjohn

Hi guys,

We found one odd case.

On 2 servers (one dedicated and one VPS), each new visit = new idvisitor (despite the same configId).

BUT with the same log file and the same Piwik (fresh download and installation) on localhost on Mac OS X, unique visitors are counted correctly.

Do you have any idea why, and how is it supposed to work? I've spent some time in visit.php: when there is no cookie and the visit is less than 30 minutes old, a new idvisitor is created.

comment:48 Changed 21 months ago by matt (mattab)

BUT with the same log file and the same Piwik (fresh download and installation) on localhost on Mac OS X, unique visitors are counted correctly.

Could you somehow find an example of the log file showing the problem on both installations, with a few lines (3 or 4), to replicate the bug? This would help in finding the fix. Thanks.

comment:49 Changed 21 months ago by fjohn

Yes matt, I will have them tomorrow (day off today), but how should it work? Should log parsing count unique visitors or not?

comment:50 Changed 21 months ago by geos_one

I have activated log import via an Apache macro to have live stats, but we have 20 sites with high load, and the problem we have now is that access via the URL is blocking (30 or more import_logs.py processes accessing Piwik).
Could we get some direct log import that does not go through the HTTP interface, but directly through a console PHP load?

Thanks, and keep up the great work!
Mario

comment:51 Changed 21 months ago by matt (mattab)

  • Description modified (diff)

comment:52 Changed 21 months ago by aspectra

Hi @all

We are testing the Python import_logs.py script. Currently we are not able to import IIS log files which are compressed with WinZip or 7-Zip. If we unzip the archive before running the script, it works quite well.

It seems the Python script is not able to uncompress the files...

Attached an example archive

Changed 21 months ago by aspectra

WinZip compressed file

comment:53 Changed 20 months ago by capedfuzz (diosmosis)

(In [6734]) Refs #3163, add integration tests (in PHP) for log importer.

comment:54 Changed 20 months ago by capedfuzz (diosmosis)

(In [6737]) Refs #3163, modified log importer to use bulk tracking capability.

Notes:

  • Added 'ua' & 'lang' tracker parameters to override user agent & language present in HTTP header.
  • Modified the tracker so if there's an error when doing bulk tracking, the number of succeeded requests is returned.

comment:55 Changed 20 months ago by matt (mattab)

(In [6739]) Refs #3163 - clarifying this option shouldn't be used by default

comment:56 Changed 20 months ago by capedfuzz (diosmosis)

(In [6740]) Refs #3163, made size of parsing chunk == to max payload size * recorder count.

comment:57 Changed 20 months ago by matt (mattab)

(In [6743]) Refs #3163

  • Fixing Log Analytics integration - adding a new index.php proxy to force to use test DB
  • Refactored call to get browser language forgotten earlier


TODO:

  • Benaka, could you please remove the --tracker-url feature from import_logs.py? It's not used anymore.

comment:58 Changed 20 months ago by matt (mattab)

(In [6745]) Fixing build? Refs #3163

comment:59 Changed 20 months ago by matt (mattab)

(In [6749]) Refs #3163

  • removing tracker-url param
  • fixing build?

comment:60 Changed 20 months ago by capedfuzz (diosmosis)

(In [6756]) Refs #3163, show average records/s along w/ current records/s in log importer.

comment:61 in reply to: ↑ 45 ; follow-up: Changed 20 months ago by matt (mattab)

Replying to jamesvl011:

Some IIS logs do the same as bjrubble mentioned in the comment above - in their c-ip section, a host name may be found instead of an IP address.

This causes the regex (which only accepts digits) to fail when parsing that line, and I believe the line gets thrown out, resulting in a bad import.

@bjrubble and james, could you please submit the correct regex? We would be glad to commit the fix, thanks.

comment:62 Changed 20 months ago by matt (mattab)

  • Description modified (diff)

Adding "Heuristics to not track bot visits" in the ticket description.

If you have a suggestion or request for the script - or any problem or bug, please post a new comment here.

comment:63 Changed 20 months ago by matt (mattab)

  • Description modified (diff)
  • Adding "Support Accept-language" as feature request, since Piwik allows to define user language with the parameter &lang= so this should be easy and useful for some users.

comment:64 in reply to: ↑ 61 Changed 20 months ago by jamesvl011

Replying to matt:

@bjrubble and james, could you please submit the correct regex? We would be glad to commit the fix, thanks.

Matt -

The regex for c-ip (line 134 of import_logs.py when I looked at svn) ought to be like the line for User-Agent, allowing any text string without spaces:

'c-ip': '(?P<ip>\S+)'

I'm assuming the Piwik API can handle a host name passed in place of the IP address? If not, Python will have to do hostname lookups (preferably with its own mini-cache) as it parses the file.

I'll attach a file to this ticket with an example IIS log file that you can use for testing - it will have four rows, three with host names in the c-ip field and one with an IP address.
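For the lookup side, a mini-cache could be as simple as this sketch (the 0.0.0.0 fallback on failed lookups is an assumption):

    # Hypothetical cached resolver for c-ip values that may be host names.
    import socket

    _dns_cache = {}

    def resolve(host_or_ip):
        if host_or_ip not in _dns_cache:
            try:
                _dns_cache[host_or_ip] = socket.gethostbyname(host_or_ip)
            except socket.error:
                _dns_cache[host_or_ip] = '0.0.0.0'
        return _dns_cache[host_or_ip]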

Changed 20 months ago by jamesvl011

Sample IIS file for testing variations of c-ip field

comment:65 follow-up: Changed 20 months ago by oliverhumpage

I've just tried a fresh install of 1.8.3 (to make sure it works before I move everything over from my current 1.7.2rc4 install).

When I import a sample log (for just one vhost) using --add-sites-new-hosts, I get the same "website" created multiple times. It seems that if you set --recorders to something greater than 1, then several recorders will independently create the new vhost's website for you. Changing --recorder-max-payload-size doesn't seem to affect this behaviour, it's just --recorders.

I'm sure this didn't happen in the older 1.7.2 version.

Can you replicate, and if so, is there an easy fix?

Thanks.

comment:66 Changed 20 months ago by capedfuzz (diosmosis)

(In [6824]) Refs #3163, fix concurrency bug in import script where sites get created more than once when --add-sites-new-hosts is used.

comment:67 in reply to: ↑ 65 ; follow-up: Changed 20 months ago by capedfuzz (diosmosis)

Replying to oliverhumpage:

I've just tried a fresh install of 1.8.3 (to make sure it works before I move everything over from my current 1.7.2rc4 install).

When I import a sample log (for just one vhost) using --add-sites-new-hosts, I get the same "website" created multiple times. It seems that if you set --recorders to something greater than 1, then several recorders will independently create the new vhost's website for you. Changing --recorder-max-payload-size doesn't seem to affect this behaviour, it's just --recorders.

I'm sure this didn't happen in the older 1.7.2 version.

Can you replicate, and if so, is there an easy fix?

Just committed a fix for this bug. Can you use the file in svn?

comment:68 Changed 20 months ago by capedfuzz (diosmosis)

(In [6826]) Refs #3163, added more integration tests for log importer & removed some unnecessary xml files.

comment:69 in reply to: ↑ 67 Changed 20 months ago by oliverhumpage

Replying to capedfuzz:

Just committed a fix for this bug. Can you use the file in svn?

Perfect, that's fixed it - thank you.

Oliver.

comment:70 Changed 20 months ago by capedfuzz (diosmosis)

(In [6887]) Refs #3163, #3227, make sure no exception thrown in tracker when no 'ua' parameter & no HTTP_USER_AGENT. (fix for bug in [6737]).

comment:71 follow-up: Changed 20 months ago by unaidswebmaster

I'm trying to import our IIS logs using import_logs.py but it keeps hitting a snag somewhere in the middle. The message simply says:

Fatal error: None
You can restart the import of "d:\tmp\logfiles\ex120803.log" from the point it failed by specifying --skip=215201 on the command line.

When I restart it with the skip parameter, it does not record any more lines and fails again a few lines down (see output below):

C:\Python27>python "d:\websites\piwik\misc\log-analytics\import_logs.py" --url=http://piwikpre.unaids.org/ "d:\tmp\logfiles\ex120803.log" --idsite=2 --skip=215201
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log d:\tmp\logfiles\ex120803.log...
182921 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
218630 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
222550 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
227111 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
231539 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
235666 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
240261 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
244780 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Fatal error: None
You can restart the import of "d:\tmp\logfiles\ex120803.log" from the point it failed by specifying --skip=215225 on the command line.

The format we are using is the W3C Extended Log File Format, and we are tracking extended properties such as Host, Cookie, and Referer. I'd like to send the log file that I used for this example, but it's too big to attach (20 MB even when zipped). Can I send it by some other means?

Thanks a lot!
-Jo

comment:72 Changed 20 months ago by cbernard

Hi,

Nice module; we're currently assessing it.
I have 2 questions:

1/ We have several load-balanced servers. Each server generates its own log files, but for the same FQDN. How can we process and aggregate the log files into the same website, given that the log files need to be ordered by date?

2/ The log files contain consumed bandwidth. Would it be feasible to enhance this module to parse and record this information? Or, if we need this information, should we consider creating a plugin?

Thanks for your feedback.

comment:73 Changed 20 months ago by pmontana

The import_logs.py script should be able to handle and order the dates of your different logs when computing statistics. That's the main purpose of the "invalidate" function within this script.

The best approach would be to import all your logs at once and then run the archive job so that it can compute statistics for the "invalidated" dates.

comment:74 Changed 19 months ago by smartkit

Hi,

I'm trying to use import_logs.py to parse a Java Play log; a sample log line follows:

15.185.97.217 127.0.0.1 - - [Tue Sep 04 18:28:38 PDT 2012] "/facedetect?url_pic=http%3A%2F%2Ffarm4.staticflickr.com%2F3047%2F2699553168_325fb5509b.jpg" 200 345 "" "Jakarta Commons-HttpClient/3.1" 5683 ""

But the Python script reports: "invalid log lines".

The Java Play log file is actually similar to Lighttpd's access.log. Is there an easy way to adapt this Python script to parse other log files?

comment:75 follow-up: Changed 19 months ago by alfred_e_neuman

It was suggested by Matt that I add my issue to this ticket:

I'm running Piwik 1.8.3 on IIS 7. I've installed the GeoIP plugin and also tweaked it based on http://forum.piwik.org/read.php?2,71788. It is working. However, my installation is only tracking US-based visits.

My IIS instance archives its log hourly. I've attached one recent log for review, on the chance that it will contain clues as to why I'm only seeing US-based visits.

comment:76 in reply to: ↑ 75 Changed 19 months ago by alfred_e_neuman

Attached log file is named u_ex12091212.log.

comment:77 Changed 19 months ago by capedfuzz (diosmosis)

[7030] refs this ticket.

comment:78 follow-up: Changed 19 months ago by jason

I have a log in the following format, where www.website.com represents the hostname of one of the sites hosted on the server. I get an error that the log format doesn't include the hostname.

188.165.230.147 www.website.com - [10/Oct/2012:01:49:34 -0400] "GET / HTTP/1.1" 200 10341 "http://www.orangeask.com/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)" "-"

I have tried a series of tests with --log-format-regex= and I can't get it to work. Any help would be greatly appreciated.

Thanks
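For reference, a regex along these lines might match that format (the group names follow the ones used by the script's built-in formats; untested, so treat it as a starting point):

--log-format-regex='(?P<ip>\S+) (?P<host>\S+) \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] "\S+ (?P<path>.*?) \S+" (?P<status>\S+) (?P<length>\S+) "(?P<referrer>.*?)" "(?P<user_agent>.*?)" "\S*"'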

comment:79 Changed 19 months ago by matt (mattab)

To everyone with questions in this ticket: thank you for your bug reports. You can try to modify the Python script to make it work for your log files; it's really simple code at the start of the script.

If you are stuck and need help, Piwik experts can help with any issue related to the log import script. Contact them at: http://piwik.org/consulting/

Otherwise, we may fix some of the requests posted here, but it might take a while...

We hope you enjoy Log Analytics!

Last edited 19 months ago by matt (previous) (diff)

comment:80 in reply to: ↑ 78 Changed 19 months ago by smartkit

Replying to jason:

I have a log in the following format, where www.website.com represents the hostname of one of the sites hosted on the server. I get an error that the log format doesn't include the hostname.

188.165.230.147 www.website.com - [10/Oct/2012:01:49:34 -0400] "GET / HTTP/1.1" 200 10341 "http://www.orangeask.com/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)" "-"

I have tried a series of tests with --log-format-regex= and I can't get it to work. Any help would be greatly appreciated.

Thanks

Last time, I successfully adapted the code in import_logs.py for Java Play log file parsing; I think you could hard-code removing the hostname pattern with the "http://" prefix, or string-replace it.

comment:81 Changed 19 months ago by tim

Had a very minor problem with the script today:
I have daily log rotation enabled, and when no user visits a site on a given day, the log file for that day is empty. This means the log format guessing fails, leading to an error.
Preferably, when a log file is empty, one would like to skip the file without throwing an error. This is easily achieved by changing the line that checks for log file existence to also check whether the log file has contents:

if not os.path.exists(filename) or os.path.getsize(filename) == 0:

Changed 18 months ago by phikai

Log Parser README Update with Nginx Log Format for Common Complete

comment:82 Changed 18 months ago by matt (mattab)

@cyril, in the next update can you please include this patch from @phikai: "Log Parser README Update with Nginx Log Format for Common Complete"?

To everyone else: please consider submitting patches, README improvements, or new log formats for the script; we will make an update in a few days.

comment:83 Changed 18 months ago by matt (mattab)

(In [7313]) Refs #3163 Adding libwww in excluded user agents, since libwww-perl is a common bot
As reported in: http://forum.piwik.org/read.php?3,95844

comment:84 Changed 18 months ago by Cyril (cbay)

(In [7382]) Refs #3163: Log Parser README Update with Nginx Log Format for Common Complete, thanks to phikai.

comment:85 Changed 18 months ago by Cyril (cbay)

(In [7383]) Refs #3163: don't fail to autodetect the format for empty files.

comment:86 Changed 18 months ago by matt (mattab)

Hey guys, there have been many updates to the script in trunk; please let us know if your suggestion or report hasn't been committed yet.

Kudos, Cyril, for the updates!

edit: Check also this ticket: #3558

Last edited 17 months ago by matt (previous) (diff)

comment:87 Changed 18 months ago by Cyril (cbay)

For the record, with the current trunk, I can sustain 2,000 requests/second in dry-run mode on a 2.7 GHz Xeon, and 1,000 requests/second without dry-run, with --recorders=10 and the default payload (Piwik is installed on another server, 4 cores).

That's not to say you should get the same numbers, as it depends on a LOT of factors (raw processing power, number of recorders, payload, PHP configuration, log files, network, etc.), but if you only get 50 requests/second on a strong machine, something is probably wrong.

Running with --dry-run is a good way to know how fast the Python script can go without really importing to Piwik, which already excludes many factors.

comment:88 Changed 17 months ago by ottodude125

I am running Piwik 1.9.2 on a RHEL 5.7 server running Apache.

I am trying to implement the Apache CustomLog that directly imports into Piwik as described in the README (http://dev.piwik.org/svn/trunk/misc/log-analytics/README). I am not sure whether I have a problem with my configuration or whether there is a potential bug in the Piwik import_logs.py script. After some poking around on the command line, it seems that the script works perfectly when it is given an entire file, but crashes when you try to feed it a single line from a log file. I have included my command output below. Any help would be greatly appreciated. Also, if you need any additional information, please let me know!

Firstly let me pull the first line of my logfile to show its syntax:

[katonj@mimir2:log-analytics ] $ head -1 boarddev-beta.teradyne.com.log
boarddev-beta.teradyne.com 131.101.52.31 - - [12/Nov/2012:11:16:24 -0500] "GET /boarddev/ HTTP/1.1" 200 10541 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.94 Safari/537.4"

Now, when I run the file as the Apache configuration suggests, I get the following (note: if I do not put the "-" at the end of the command, the line from the logfile is ignored and the script simply outputs the README file):

[katonj@mimir2:log-analytics ] $ head -1 boarddev-beta.teradyne.com.log | ./import_logs.py  --add-sites-new-hosts --config=../../config/config.ini.php --url='http://boarddev-beta.teradyne.com/analytics/' -
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log (stdin)...
Traceback (most recent call last):
  File "./import_logs.py", line 1462, in <module>
    main()
  File "./import_logs.py", line 1426, in main
    parser.parse(filename)
  File "./import_logs.py", line 1299, in parse
    file.seek(0)
IOError: [Errno 29] Illegal seek

And finally if I run the file itself through the script I get the following showing that it loves processing the logfile as long as it gets an entire file fed to it all at once:

[katonj@mimir2:log-analytics ] $ ./import_logs.py  --add-sites-new-hosts --config=../../config/config.ini.php --url='http://boarddev-beta.teradyne.com/analytics/' boarddev-beta.teradyne.com.log
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log boarddev-beta.teradyne.com.log...
Purging Piwik archives for dates: 2012-11-12
To re-process these reports with your new update data, execute the piwik/misc/cron/archive.php script, or see: http://piwik.org/setup-auto-archiving/ for more info.


Logs import summary
-------------------

    8 requests imported successfully
    0 requests were downloads
    0 requests ignored:
        0 invalid log lines
        0 requests done by bots, search engines, ...
        0 HTTP errors
        0 HTTP redirects
        0 requests to static resources (css, js, ...)
        0 requests did not match any known site
        0 requests did not match any requested hostname

Website import summary
----------------------

    8 requests imported to 1 sites
        1 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:



Performance summary
-------------------

    Total time: 0 seconds
    Requests imported per second: 24.01 requests per second

comment:89 Changed 17 months ago by Cyril (cbay)

ottodude125: log detection + reading from stdin is actually not supported; you have to pick one. I'll fix the bug later on though.

comment:90 Changed 17 months ago by ottodude125

When you set up the Apache CustomLog, you are piping the log messages into the script as soon as they appear. That is the same as stdin, right? I was just trying to simulate that process by running head -1 on a log file to get a log message and piping that into the script.

comment:91 Changed 17 months ago by oliverhumpage

Since auto format detection relies on having several lines to decode, it doesn't work on stdin (it tries to seek to points in the file, hence the "bug" - seek obviously fails on stdin).

When using stdin as the log source you have to use either --log-format-name or --log-format-regex flags on the command line to force a particular format. You might find --log-format-name="common_vhost" is what you want.

comment:92 Changed 17 months ago by ottodude125

You are completely right. Adding --log-format-name='common_vhost' to the command now allows a logfile to be read from stdin on the command line. So running the following command works great from the command line:

[katonj@mimir2:applications ] $ head -8 babyfat | /hwnet/dtg_devel/web/beta/applications/piwik/misc/log-analytics/import_logs.py --add-sites-new-hosts --url='http://mimir2.icd.teradyne.com/analytics' --log-format-name='common_vhost' --output=/tmp/junk.log -

As a side note, I've tried the common_complete name and I tried using the --log-format-regex included in the README, and neither of them had any magical side effects either.

Unfortunately, porting that exact same thing into the Apache httpd.conf file does not work. I have the configuration below, and while the logfile "babyfat" gets populated, Piwik doesn't seem to process any input.

LogFormat "%v %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" baby

CustomLog "|/hwnet/dtg_devel/web/beta/applications/piwik/misc/log-analytics/import_logs.py --add-sites-new-hosts --url='http://mimir2.icd.teradyne.com/analytics' --log-format-name='common_vhost' --output=/tmp/junk.log -" baby

CustomLog logs/babyfat baby

Lastly, the output logfile junk.log gets content when the command is run from the command line, but the only time it gets populated from Apache is when you add several -d flags to the CustomLog command and restart Apache, and then you get:

2012-11-13 15:44:12,517: [DEBUG] Accepted hostnames: all
2012-11-13 15:44:12,517: [DEBUG] Piwik URL is: http://mimir2.icd.teradyne.com/analytics
2012-11-13 15:44:12,517: [DEBUG] No token-auth specified
2012-11-13 15:44:12,517: [DEBUG] No credentials specified, reading them from "/hwnet/dtg_devel/web/beta/applications/piwik/config/config.ini.php"
2012-11-13 15:44:12,520: [DEBUG] Using credentials: (login = piwik, password = a0a582ec5eda9c506a6f30dc8b2bbcf3)
2012-11-13 15:44:13,249: [DEBUG] Accepted hostnames: all
2012-11-13 15:44:13,249: [DEBUG] Piwik URL is: http://mimir2.icd.teradyne.com/analytics
2012-11-13 15:44:13,249: [DEBUG] No token-auth specified
2012-11-13 15:44:13,249: [DEBUG] No credentials specified, reading them from "/hwnet/dtg_devel/web/beta/applications/piwik/config/config.ini.php"
2012-11-13 15:44:13,251: [DEBUG] Using credentials: (login = piwik, password = a0a582ec5eda9c506a6f30dc8b2bbcf3)
2012-11-13 15:44:14,341: [DEBUG] Authentication token token_auth is: 582b588b9568840fa6f1e208a8702b93
2012-11-13 15:44:14,342: [DEBUG] Resolver: dynamic
2012-11-13 15:44:14,342: [DEBUG] Launched recorder
Last edited 17 months ago by ottodude125 (previous) (diff)

comment:93 Changed 17 months ago by matt (mattab)

(In [7490]) Fixes #3548 Refs #3163
Any visitor with a user agent containing "spider" will be classified a bot

comment:94 Changed 17 months ago by elm

I have the same issue as ottodude125. Piping one single line from the access.log into import_logs.py works, but using the same command directly from Apache, nothing gets logged.

EDIT: I noticed the log messages appear in the import_logs log when I restart Apache. So it seems a restart triggers either Apache to send the messages to stdin, or import_logs to read from stdin.

2nd EDIT: CustomLog with rotatelogs works, so the issue must be in import_logs.py.

Last edited 17 months ago by elm (previous) (diff)

comment:95 Changed 17 months ago by oliverhumpage

@elm @ottodude125

I noticed in ottodude125's customlog, there's no path to the config file and no auth token: that would explain the errors shown in junk.log. You need to specify one or the other so that import_logs.py can authenticate itself to the piwik PHP scripts.

I'm wondering if the same problem is happening for elm's logs too? @elm, if that doesn't fix it, could you paste your customlog section here too?

comment:96 Changed 17 months ago by matt (mattab)

There was another user in the forums reporting an error: view post

Could we explain the bug when it happens, and fail with a relevant error/notice message?

comment:97 Changed 17 months ago by elm

Here is my CustomLog line (line breaks for better reading):

CustomLog "|/var/www/piwik.skweez.net/piwik/misc/log-analytics/import_logs.py
--url=http://piwik.skweez.net/ --add-sites-new-hosts
--output=/var/www/update.skweez.net/logs/piwik.log --recorders=4
--log-format-name=common_vhost -dd -" vhost_combined

Here is the log that is generated:

...
2012-11-23 22:35:07,759: [DEBUG] Launched recorder
2012-11-23 22:35:07,761: [DEBUG] Launched recorder
2012-11-23 22:35:07,762: [DEBUG] Launched recorder
2012-11-23 22:35:07,763: [DEBUG] Launched recorder
2012-11-24 06:30:01,375: [DEBUG] Site ID for hostname update.skweez.net not in cache
2012-11-24 06:30:01,378: [DEBUG] Site ID for hostname update.skweez.net not in cache
2012-11-24 06:30:01,633: [DEBUG] Accepted hostnames: all
2012-11-24 06:30:01,633: [DEBUG] Piwik URL is: http://piwik.skweez.net/
2012-11-24 06:30:01,633: [DEBUG] No token-auth specified
2012-11-24 06:30:01,633: [DEBUG] No credentials specified, reading them from "/var/www/piwik.skweez.net/piwik/config/config.ini.php"
2012-11-24 06:30:01,648: [DEBUG] Using credentials: (login = piwikadmin, password = ...)
2012-11-24 06:30:02,065: [DEBUG] Site ID for hostname update.skweez.net: 7
2012-11-24 06:30:02,709: [DEBUG] Site ID for hostname update.skweez.net: 7
Purging Piwik archives for dates: 2012-11-23 2012-11-24
2012-11-24 06:30:02,935: [DEBUG] Authentication token token_auth is: ...
2012-11-24 06:30:02,935: [DEBUG] Resolver: dynamic
2012-11-24 06:30:02,936: [DEBUG] Launched recorder
2012-11-24 06:30:02,938: [DEBUG] Launched recorder
2012-11-24 06:30:02,940: [DEBUG] Launched recorder
2012-11-24 06:30:02,941: [DEBUG] Launched recorder

Logs import summary
-------------------

    5 requests imported successfully
    14 requests were downloads
    15 requests ignored:
        0 invalid log lines
        0 requests done by bots, search engines, ...
        1 HTTP errors
        0 HTTP redirects
        14 requests to static resources (css, js, ...)
        0 requests did not match any known site
        0 requests did not match any requested hostname

Website import summary
----------------------

    5 requests imported to 1 sites
        1 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:



Performance summary
-------------------

    Total time: 28495 seconds
    Requests imported per second: 0.0 requests per second

2012-11-25 06:33:02,723: [DEBUG] Site ID for hostname update.skweez.net not in cache
2012-11-25 06:33:02,723: [DEBUG] Site ID for hostname update.skweez.net not in cache
2012-11-25 06:33:02,724: [DEBUG] Site ID for hostname update.skweez.net not in cache
2012-11-25 06:33:03,104: [DEBUG] Site ID for hostname update.skweez.net: 7
2012-11-25 06:33:03,136: [DEBUG] Site ID for hostname update.skweez.net: 7
2012-11-25 06:33:03,141: [DEBUG] Site ID for hostname update.skweez.net: 7
2012-11-25 06:33:03,372: [DEBUG] Accepted hostnames: all
2012-11-25 06:33:03,372: [DEBUG] Piwik URL is: http://piwik.skweez.net/
2012-11-25 06:33:03,372: [DEBUG] No token-auth specified
2012-11-25 06:33:03,372: [DEBUG] No credentials specified, reading them from "/var/www/piwik.skweez.net/piwik/config/config.ini.php"
2012-11-25 06:33:03,373: [DEBUG] Using credentials: (login = piwikadmin, password = ...)
2012-11-25 06:33:03,492: [DEBUG] Authentication token token_auth is: ...
2012-11-25 06:33:03,492: [DEBUG] Resolver: dynamic
2012-11-25 06:33:03,493: [DEBUG] Launched recorder
2012-11-25 06:33:03,494: [DEBUG] Launched recorder
2012-11-25 06:33:03,495: [DEBUG] Launched recorder
2012-11-25 06:33:03,495: [DEBUG] Launched recorder
Purging Piwik archives for dates: 2012-11-25 2012-11-24

Logs import summary
-------------------

    9 requests imported successfully
    42 requests were downloads
    42 requests ignored:
        0 invalid log lines
        0 requests done by bots, search engines, ...
        3 HTTP errors
        0 HTTP redirects
        39 requests to static resources (css, js, ...)
        0 requests did not match any known site
        0 requests did not match any requested hostname

Website import summary
----------------------

    9 requests imported to 1 sites
        1 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:



Performance summary
-------------------

    Total time: 86580 seconds
    Requests imported per second: 0.0 requests per second


Logs import summary
-------------------

    0 requests imported successfully
    0 requests were downloads
    0 requests ignored:
        0 invalid log lines
        0 requests done by bots, search engines, ...
        0 HTTP errors
        0 HTTP redirects
        0 requests to static resources (css, js, ...)
        0 requests did not match any known site
        0 requests did not match any requested hostname

Website import summary
----------------------

    0 requests imported to 0 sites
        0 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:



Performance summary
-------------------

    Total time: 12 seconds
    Requests imported per second: 0.0 requests per second

2012-11-25 06:33:16,016: [DEBUG] Accepted hostnames: all
2012-11-25 06:33:16,016: [DEBUG] Piwik URL is: http://piwik.skweez.net/
2012-11-25 06:33:16,016: [DEBUG] No token-auth specified
2012-11-25 06:33:16,016: [DEBUG] No credentials specified, reading them from "/var/www/piwik.skweez.net/piwik/config/config.ini.php"
2012-11-25 06:33:16,017: [DEBUG] Using credentials: (login = piwikadmin, password = ...)
2012-11-25 06:33:16,156: [DEBUG] Authentication token token_auth is: ...
2012-11-25 06:33:16,156: [DEBUG] Resolver: dynamic
2012-11-25 06:33:16,157: [DEBUG] Launched recorder
2012-11-25 06:33:16,157: [DEBUG] Launched recorder
2012-11-25 06:33:16,159: [DEBUG] Launched recorder
2012-11-25 06:33:16,159: [DEBUG] Launched recorder

So it is picking up the logs when Apache reloads, which it does at night after logrotate.

comment:98 Changed 17 months ago by aspectra

Hi,
I would be glad if you could add a new option to the script: it should import only the log lines that include a specified path, i.e. do exactly the opposite of the --exclude-path-from option. As far as I understand, we could copy the check_path method and swap the True and False return values. I have posted the modified part below.

    def check_path(self, hit):
        # Import a hit only if its path matches one of the included patterns.
        for included_path in config.options.included_paths:
            if fnmatch.fnmatch(hit.path, included_path):
                return True
        return False

Unfortunately I don't know where to modify the script to add this option.
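
Presumably the option also needs to be registered with the script's option parser. A rough guess, assuming the script uses optparse (I do not know the exact spot, and the names here may be wrong):

    # Guesswork, not a tested patch: register --include-path so that
    # check_path() above can read config.options.included_paths.
    option_parser.add_option(
        '--include-path', dest='included_paths', action='append', default=[],
        help="Only import URLs whose path matches this pattern "
             "(the opposite of --exclude-path; may be given several times)."
    )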

Many thanks for your help.

comment:99 Changed 16 months ago by asentinel

Hi all,
I am new to Piwik. I installed Piwik on an Apache web server and tried to import a log file from a Tomcat web server, but I get the following error:
Fatal error: Cannot guess the logs format. Please give one using either the --log-format-name or --log-format-regex option
This is the command that I used:
python /var/www/piwik/misc/log-analytics/import_logs.py --url=http://192.168.1.100/piwik/ /home/user/app1/catalina.2012-12-10.log --idsite=1 --recorders=1 --enable-http-errors --enable-http-redirects --enable-static --enable-bots
And this is what the log file contains:
Dec 10, 2012 12:02:50 AM org.apache.catalina.core.StandardWrapperValve invoke
INFO: 2012-12-10 00:02:50,000 - DEBUG InOutCallableStatementCreator#<init> - Call: AdminReports.GETAPPLICATIONINFO(?)

I tried googling it but didn't find much, and the Piwik forum didn't help either. Can you help me? Which value should I use with the --log-format-name or --log-format-regex option?

comment:100 Changed 16 months ago by matt (mattab)

  • Priority changed from major to critical

comment:101 Changed 16 months ago by matt (mattab)

In trunk, when I CTRL+C the script, it does not exit directly; it takes 5-10 seconds before the software stops running and then outputs the log. I think it is a recent regression?

comment:102 Changed 16 months ago by hflautert

Suggestion - Bandwidth Usage

I used to see it in AWStats...
http://forum.piwik.org/read.php?2,98279,98330#msg-98330

There is no size information in these logs, but I guess AWStats looks up the files accessed in the logs and counts their sizes.

comment:103 Changed 16 months ago by matt (mattab)

For piwik.php performance improvements and asynchronous data imports, see #3632

comment:104 in reply to: ↑ 71 Changed 16 months ago by mikemarksjr

Has anyone found a solution to this yet? I'm having the same problem with my IIS logs not importing.

0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log Z:\logs\W3SVC14\u_ex121218.log...
1648 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
1648 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
1648 lines parsed, 43 lines recorded, 14 records/sec (avg), 43 records/sec (current)
1648 lines parsed, 43 lines recorded, 10 records/sec (avg), 0 records/sec (current)
Fatal error: None
You can restart the import of "Z:\logs\W3SVC14\u_ex121218.log" from the point it failed by specifying --skip=3 on the command line.

Replying to unaidswebmaster:

I'm trying to import our IIS logs using import_logs.py but it keeps hitting a snag somewhere in the middle. The message simply says:

Fatal error: None
You can restart the import of "d:\tmp\logfiles\ex120803.log" from the point it failed by specifying --skip=215201 on the command line.

When I restart it with the skip parameter, it does not record any more lines and fails again a few lines further down (see output below).

C:\Python27>python "d:\websites\piwik\misc\log-analytics\import_logs.py" --url=http://piwikpre.unaids.org/ "d:\tmp\logfiles\ex120803.log" --idsite=2 --skip=215201
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log d:\tmp\logfiles\ex120803.log...
182921 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
218630 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
222550 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
227111 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
231539 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
235666 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
240261 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
244780 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Fatal error: None
You can restart the import of "d:\tmp\logfiles\ex120803.log" from the point it failed by specifying --skip=215225 on the command line.

The format we are using is the W3C Extended Log File Format, and we are tracking extended properties such as Host, Cookie, and Referer. I'd like to send the log file that I used for this example, but it's too big to attach (20 MB even when zipped). Can I send it by some other means?

Thanks a lot!
-Jo

comment:105 follow-up: Changed 16 months ago by wpballard

Checking in on the IIS logs not importing issue. I'm having the same issue as Jo reported here. The errors are the same.

comment:106 Changed 16 months ago by tannerjt

I am running into the same problem as Jo as well. Please let me know if there are any suggestions or possible solutions. We have been trying to diagnose the problem for a couple of days but still have not found a solution. Thanks.

comment:107 in reply to: ↑ 105 Changed 16 months ago by wpballard

Replying to wpballard:

Checking in on the IIS logs not importing issue. I'm having the same issue as Jo reported here. The errors are the same.

One thing I've noticed is that --dry-run works perfectly. That might help narrow down where the problem is: likely in the code that commits the changes to the DB.

comment:108 Changed 16 months ago by dsampson

Hey Folks,

Glad to see there is good interest in the log file processing.

The first feature I would like to see added is the opposite of --exclude-path: an --include-path option.

In our architecture we have MANY web assets under a single domain, and web logs are kept per domain; this is out of our control. The assets include multiple applications, APIs, and web services. It would be nice to process the log files by including only the paths we want. The exclusion route is cumbersome, as each call would require 5-10 excludes instead of a single include.

comment:109 Changed 16 months ago by dsampson

The second feature I would like to see is support for the XFERLOG format (http://www.castaglia.org/proftpd/doc/xferlog.html) for handling FTP logs.

Much of our business is based on downloading data and files via FTP, so these kinds of stats and analyses are valuable.

comment:110 Changed 16 months ago by dsampson

The third feature I would like to see added is the ability to process log files rotated on a monthly basis. I know this goes contrary to the recommendations; however, in our business we do not manage the IT infrastructure, only the line-of-business services and apps on top of it.

Currently I handle this with a Bash script (the same idea is sketched in Python below). Before I process the log file I count its lines (using wc -l) and store the count in a loglines.log file. The next time the script runs, I tail loglines.log, grab the last line count, and use it to populate the --skip parameter.

To handle the monthly log rotation: if the current wc -l is less than the stored count, I set --skip to zero (0).

It is crude, but it works. Having this built natively into the Python script would be fairly straightforward and would allow support for monthly rotation.

The added bonus is that the same log file can be processed several times a day, even for daily-rotated logs. This is a happy compromise between real-time JavaScript tracking and daily log processing, especially for high-volume sites with huge log files.

Cron is handy for this.
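
For anyone interested, here is the same idea sketched in Python rather than Bash (untested; the Piwik URL and site ID are placeholders):

    import os
    import subprocess

    STATE_FILE = 'loglines.log'  # remembers the line count from the last run

    def import_new_lines(logfile):
        # Count the lines currently in the (monthly-rotated) log file.
        with open(logfile) as f:
            count = sum(1 for _ in f)

        last = 0
        if os.path.exists(STATE_FILE):
            with open(STATE_FILE) as f:
                last = int(f.read().strip() or 0)

        # If the file shrank since last time, it was rotated: start over.
        skip = last if count >= last else 0

        subprocess.check_call([
            'python', 'misc/log-analytics/import_logs.py',
            '--url=http://example.com/piwik/',  # placeholder URL
            '--idsite=1',                       # placeholder site ID
            '--skip=%d' % skip,
            logfile,
        ])

        with open(STATE_FILE, 'w') as f:
            f.write(str(count))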

comment:111 follow-up: Changed 16 months ago by Cyril (cbay)

Those having errors with IIS: please upload a log file with the lines causing the error. A single line is probably the cause, so it would be better to upload just that line (or lines) rather than a big file. The skip value will help you find it.

comment:112 follow-up: Changed 16 months ago by Cyril (cbay)

dsampson: agreed on the --include-path suggestion. I'll add it later.

FTP logs: that's definitely not something that should be included in Piwik. You can define your own log format with a regexp; have you tried?

Log rotation: not easy. Right now the Python script has no memory, so it cannot store data (such as the latest position in each log file). Besides, how would the script know when the log file has been rotated and the position must be reset?

The real solution, to me, would be for Piwik (the PHP/MySQL part) to know whether a log line has already been imported, so that you could reimport any log file at any time and it would skip the lines already imported. It cannot be as fast as --skip=n, but it would be safe and easy to use.

comment:113 in reply to: ↑ 112 ; follow-up: Changed 16 months ago by dsampson

See comments inline...

Replying to Cyril:

dsampson: agree for the --include-path suggestion. I'll add it later.

Thanks for this. Appreciated

FTP logs: that's definitely not something that should be included to Piwik. You can define your own log format with a regexp, have you tried?

For those of us in the big-data business, a FOSS solution offering all the features of Piwik for FTP would be great. An unlikely fork, so I thought it could be a possible feature.

I am working on the regex for XFERLOG and having trouble building a new regex group from the values of other groups. For instance, the date field is not a clean YYYY-MM-DD, so I need to figure out how to construct a date from three other regex groups. I am a regex greenhorn for sure.

Log rotating: not easy. Right now, the Python script has no memory, so it can't store data (such as the latest position for log files). Besides, how would the script know when the log file has been rotated and we must reset the position?

I do it by comparing the last line count to the new one: for instance, #linesyesterday will be greater than #linestoday if the logfile has been rotated. I have done logging in Python using plain text files in the past; they get big, but the head can be severed when they grow too large. A NoSQL DB or a data object could also work.

The real solution, to me, would be that Piwik (the PHP/MySQL part) would know if a log line has already been imported, so that you can basically reimport any log file at any time, and it would skip lines already imported. It cannot be as fast as --skip=n, but it would be safe and easy to use.

This would be a good alternative, with some hit on performance.

Thanks again for the reply

comment:114 in reply to: ↑ 113 Changed 15 months ago by dsampson

Did either of these features make it into the latest 1.10.1 release?

Replying to dsampson:

See comments inline...

Replying to Cyril:

dsampson: agree for the --include-path suggestion. I'll add it later.

Log rotation: The real solution, to me, would be that Piwik (the PHP/MySQL part) would know if a log line has already been imported, so that you can basically reimport any log file at any time, and it would skip lines already imported. It cannot be as fast as --skip=n, but it would be safe and easy to use.

comment:115 Changed 15 months ago by dsampson

Working on the regex for XFERLOG.

Here is my first cut; however, the DATE field is not recognized. Dates in XFERLOG are not like those in Apache logs, and I am not sure how to concatenate a date group from the other named groups.

I included some test strings. Yes, I used the public Google DNS addresses as IPs for privacy reasons.

I captured everything I could according to the XFERLOG documentation. Perhaps overkill, but it was the best way I knew to work through the expression. The manpage for XFERLOG is here (http://www.castaglia.org/proftpd/doc/xferlog.html).

I also provided the example script call and the output from the script.

It looks like the issue is the DATE group. No surprise. But again, I am not sure how to construct it from the input.

Any thoughts are appreciated


Mon Nov 1 04:18:56 2012 4 8.8.4.4 1628134 /pub/geobase/official/cded/250k_dem/026/026a.zip b _ o a User@ ftp 0 *
Thu Nov 10 04:18:56 2012 4 8.8.4.4 1628134 /pub/geobase/official/cded/250k_dem/026/026a.zip b _ o a User@ ftp 0 * c
Tue Jan 1 14:12:36 2013 1 8.8.4.4 88048 /pub/cantopo/250k_tif/MCR2010_01.tif b _ o a ftp@… ftp 0 * i
Tue Jan 1 14:15:57 2013 4 8.8.4.4 8769852 /pub/geott/ess_pubs/211/211354/gscof_3759r_b_2000_mn01.pdf b _ o a googlebot@… ftp 0 * c
Tue Jan 1 16:06:49 2013 11 8.8.4.4 7198877 /pub/toporama/50k_geo_tif/095/d/toporama_095d02_geo.zip b _ o a user@… ftp 0 * c
Tue Jan 1 17:10:54 2013 1 8.8.4.4 168502 /pub/geott/eo_imagery/gcdb/W102/N49/N49d50mW102d12m_2.tif b _ o a googlebot@… ftp 0 * c
Tue Jan 1 17:10:54 2013 1 8.8.4.4 168502 /pub/geott/eo_imagery/gcdb/W102/N49/N49d50mW102d12m_2.tif b _ o a googlebot@… ftp 0 * c
Tue Jan 1 06:59:59 2013 1 8.8.4.4 1679 /pub/geott/eo_imagery/gcdb/W073/N60/N60d50mW073d40m_1.summary b _ o a googlebot@… ftp 0 * c
Tue Jan 1 07:02:53 2013 1 8.8.4.4 168087 /pub/geott/eo_imagery/gcdb/W108/N50/N50d58mW108d28m_3.tif b _ o a googlebot@… ftp 0 * c
Tue Jan 1 07:04:39 2013 1 8.8.4.4 16958 /pub/geott/cli_1m/e00_pro/fcomfins.gif b _ o a googlebot@… ftp 0 * c


(?x)
(?P<weekday>Mon|Tue|Wed|Thu|Fri|Sat|Sun)\s
(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\s
(?P<day>[\d]{1,})\s
(?P<time>[\d+:]+)\s
(?P<year>[\d]{4})\s

(?P<unknown>[\d]+)\s
(?P<ip>[\d]{1,3}.[\d]{1,3}.[\d]{1,3}.[\d]{1,3})\s
(?P<length>[\d]{,})\s
(?P<path>/[\w+/]+)/
(?P<file>[\w\d-]+\.\w+)\s
(?P<type>[a|b])\s
(?P<action>[C|U|T|_])\s
(?P<direction>[o|i|d])\s
(?P<mode>[a|g|r])\s
(?P<user>[\w\d]+@|[\w\d]+@[\w\d.]+)\s
(?P<service>[\w]+)\s
(?P<auth>[0|1])\s
(?P<userid>[*])\s
(?P<status>[c|i])
(?P<stuff>)


./misc/log-analytics/import_logs.py --url=http://PIWIKSERVER --token-auth=AUTHSTRING --output=proclogs/procFtpPiwik.log --enable-reverse-dns --idsite=17 --skip=0 --dry-run --log-format-regex="(?x)(?P<weekday>Mon|Tue|Wed|Thu|Fri|Sat|Sun)\s(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\s(?P<day>[\d]{1,})\s(?P<time>[\d+:]+)\s(?P<year>[\d]{4})\s(?P<unknown>[\d]+)\s(?P<ip>[\d]{1,3}.[\d]{1,3}.[\d]{1,3}.[\d]{1,3})\s(?P<length>[\d]{,})\s(?P<path>/[\w+/]+)/(?P<file>[\w\d-]+\.\w+)\s(?P<type>[a|b])\s(?P<action>[C|U|T|_])\s(?P<direction>[o|i|d])\s(?P<mode>[a|g|r])\s(?P<user>[\w\d]+@|[\w\d]+@[\w\d.]+)\s(?P<service>[\w]+)\s(?P<auth>[0|1])\s(?P<userid>[*])\s(?P<status>[c|i])(?P<stuff>)"-


0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log logs/ftpLogsJunco/xferlog2...
Traceback (most recent call last):
  File "./misc/log-analytics/import_logs.py", line 1411, in <module>
    main()
  File "./misc/log-analytics/import_logs.py", line 1375, in main
    parser.parse(filename)
  File "./misc/log-analytics/import_logs.py", line 1299, in parse
    date_string = match.group('date')
IndexError: no such group
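
One idea I may try next (an untested sketch): rewrite the xferlog timestamp into an Apache-style date before feeding each line in, so that a plain (?P<date>...) group can capture it in a form the importer already understands:

    import re
    import sys
    import time

    # Leading xferlog timestamp, e.g. "Tue Jan  1 14:12:36 2013".
    TIMESTAMP = re.compile(r'^\w{3} (\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4})')

    for line in sys.stdin:
        match = TIMESTAMP.match(line)
        if match:
            parsed = time.strptime(match.group(1), '%b %d %H:%M:%S %Y')
            # Re-emit as 01/Jan/2013:14:12:36 (the weekday is dropped).
            line = TIMESTAMP.sub(time.strftime('%d/%b/%Y:%H:%M:%S', parsed),
                                 line, count=1)
        sys.stdout.write(line)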

comment:116 Changed 14 months ago by motin

@ottodude125 and @elm: I have the same issue and reported it as a separate ticket here: http://dev.piwik.org/trac/ticket/3757#ticket

comment:117 Changed 14 months ago by Spanner

How can I exclude visits with more than 150 actions per visitor from the site?

comment:118 in reply to: ↑ 111 Changed 14 months ago by m81571

Replying to Cyril:

Those having errors with IIS: please upload a log file with lines causing the error. A single line is probably causing it, so it'd be better to upload that single line(s) rather than a big file. The skip value will help you find that line.

My web logs have additional fields logged. Some of these resolve/transfer over when using AWStats; others are excluded in AWStats with %other% values. I tried to exclude the additional field data by creating new but unused entries in the script's IIS format section, but I was not able to get past the error "'IisFormat' object has no attribute 'regex'". Forum/web searches show this is a common problem, but I haven't found a fix. Any suggestions? Sample log file inline.

#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2013-02-23 00:00:01
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken
2013-02-23 00:00:01 192.168.1.202 GET /pages/AllItems.aspx - 443 DOMAIN\username 2.3.4.5 Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+6.1;+WOW64;+Trident/4.0;+chromeframe/24.0.1312.57;+SLCC2;+.NET+CLR+2.0.50727;+.NET+CLR+3.5.30729;+.NET+CLR+3.0.30729;+Media+Center+PC+6.0;+.NET4.0C;+.NET4.0E;+InfoPath.3) 200 0 0 499
2013-02-23 00:00:01 192.168.1.202 GET /pages/logo.jpg - 443 DOMAIN\username 2.3.4.5 Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+6.1;+WOW64;+Trident/4.0;+chromeframe/24.0.1312.57;+SLCC2;+.NET+CLR+2.0.50727;+.NET+CLR+3.5.30729;+.NET+CLR+3.0.30729;+Media+Center+PC+6.0;+.NET4.0C;+.NET4.0E;+InfoPath.3) 304 0 0 312

comment:119 Changed 14 months ago by matt (mattab)

Piwik Log Analytics is now used by hundreds of users and seems to be working well! We are always interested in new feature requests and suggestions. You can post them here, and if you are a developer, please consider opening a pull request.

Last edited 13 months ago by matt

comment:120 Changed 13 months ago by cbernard

Hi,

The log analytics script does not accept any time argument.
Is it therefore assumed that the log files to be processed have already been filtered to a timestamp range, in order to avoid duplicate processing?
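
For context, this is the kind of pre-filtering I mean before piping lines to the importer (a rough sketch; it assumes Apache-style dd/Mon/yyyy:HH:MM:SS timestamps and made-up bounds):

    import re
    import sys
    from datetime import datetime

    START = datetime(2013, 4, 1)             # made-up range start
    END = datetime(2013, 4, 30, 23, 59, 59)  # made-up range end
    DATE = re.compile(r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})')

    for line in sys.stdin:
        match = DATE.search(line)
        if match:
            when = datetime.strptime(match.group(1), '%d/%b/%Y:%H:%M:%S')
            if START <= when <= END:
                sys.stdout.write(line)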

Thanks.

comment:121 follow-up: Changed 13 months ago by lyrrr

Hi

I've been trying to import some logs from a tomcat/valve access log.

According to this http://tomcat.apache.org/tomcat-5.5-doc/config/valve.html, my app's server.xml defines

<Valve className="org.apache.catalina.valves.AccessLogValve" directory="/sillage/logs/performances" pattern="%h %l %u %t %r %s %b %D Referer=[%{Referer}i]" prefix="access." resolveHosts="false" suffix=".log"/>

Here are a couple of lines from one of my access-datetime.log files:

10.10.40.85 - - [08/Apr/2013:11:02:49 +0200] POST /...t.do HTTP/1.1 200 39060 629 Referer=[http://.....jsp]
10.10.40.60 - - [08/Apr/2013:11:02:49 +0200] GET /...e&typ_appel=json HTTP/1.1 200 2895 2 Referer=[-]
10.10.40.85 - - [08/Apr/2013:11:02:48 +0200] POST /...r.jsp?cmd=tracer HTTP/1.1 200 90 63 Referer=[http://....jsp]

In short, trying to get the proper --log-format-regex has been a nightmarish failure. Improving the documentation for this complex but sometimes unavoidable option is necessary. Having a simple table matching the usual

%h => (?P<host>[\\\\w\\\\-\\\\.\\\\/]*)(?::\\\\d+)?

(guessed from reading the README example...) would help. Maybe...
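
For illustration, the kind of mapping I mean (a sketch only; the group names follow the README's conventions, but the patterns are guesses, not the importer's official definitions):

    # Illustrative mapping of common Apache/Valve LogFormat directives to
    # plausible named-group regexes for --log-format-regex.
    DIRECTIVE_PATTERNS = {
        '%h': r'(?P<host>[\w\-\.]*)(?::\d+)?',         # remote host
        '%l': r'\S+',                                   # remote logname, usually -
        '%u': r'\S+',                                   # remote user, usually -
        '%t': r'\[(?P<date>.*?) (?P<timezone>.*?)\]',   # request timestamp
        '%r': r'"\S+ (?P<path>.*?) \S+"',               # request first line
        '%>s': r'(?P<status>\d+)',                      # final status code
        '%b': r'(?P<length>\S+)',                       # response size in bytes
        '%{Referer}i': r'"(?P<referrer>.*?)"',
        '%{User-Agent}i': r'"(?P<user_agent>.*?)"',
    }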

comment:122 in reply to: ↑ 121 Changed 13 months ago by oliverhumpage

Replying to lyrrr:

Shortly said, trying to get the proper --log-format-regex has been a nightmarish failure. Improving the documentation on this complex but sometime unavoidable option is necessary. Having a simple array matching the usual

%h => (?P<host>[\\\\w\\\\-\\\\.\\\\/]*)(?::\\\\d+)?

(guess reading README exemple...) would help. Maybe...

If you're using --log-format-regex on the command line then I don't think the escaping is necessary. It's only if you're piping directly to piwik via (in my case) apache's ability to send logs to programmes that you need to work out how to do the multiple-escape thing.

comment:123 Changed 13 months ago by lyrrr

I'll try tomorrow, but I'm skeptical: I copied the stuff from the README.md example.

comment:124 Changed 13 months ago by oliverhumpage

I've just double-checked the README.md, and the only place I can see that weird escaping is in the bit I wrote called "Apache configuration source code". It's meant to be Apache config, not CLI; apologies if that's not clear.

You may need to add a bit of escaping depending on your shell, but nowhere near the amount that Apache requires (since there you have to escape the initial parsing of the config file, then the shell escaping as it runs the command, and still be left with backslashes).

I think if you single-quote it's mostly OK, i.e. with tcsh or bash

--log-format-regex='(?P<host>[\w...])'

would pass the regex in unscathed, or with my copy of ancient sh you just need one extra backslash, i.e.

--log-format-regex='(?P<host>[\\w...])'

etc.

HTH

comment:125 Changed 13 months ago by matt (mattab)

Maybe we are missing a few examples in the docs for how to call the script. Would you mind sharing your examples if you're reading this?

We will add such help text to the README.

comment:126 Changed 13 months ago by lyrrr

Okay, finally this worked:

python misc/log-analytics/import_logs.py --url=http://localhost/piwik log_analysis/access.2013-04-02.log --idsite=1 --log-format-regex='(?P<ip>\S+) (?P<host>\S+) (?P<user_agent>\S+) \[(?P<date>.*?) (?P<timezone>.*?)\] (?P<query_string>\S*) (?P<path>\S+) HTTP\/1\.1 (?P<status>\S+) (?P<length>\S+) (?P<time>.*?) (?P<referrer>.*?)'

This would be an interesting example for your docs, I guess:

  • matching a regex to a Valve format
  • handling %l (remote logname) and %u (remote user auth name), which default to - if unavailable
  • dealing with %r, the request's first line combining POST/GET + path + protocol, matched here with query_string + path + a hardcoded value to ignore

I now have to play with Piwik to assess the relevance of the tool for my use case (analyzing clients' calls to a server managing schedules, client information, etc., to get a better idea, a big picture, of topics like network/database/CPU).
I guess I'm not being very clear, and I am twisting Piwik away from its intended "web analysis" usage. Any suggestion on this topic is welcome.

One last technical thing for this post: my time field is in milliseconds, not seconds. How do I specify that?

Thanks for the help!

comment:127 Changed 12 months ago by bangpound

I have set this up on a Varnish server that is logging through varnishncsa. However, the requests that Varnish logs include the host name as part of the "request".

123.456.78.9 - - [23/Apr/2013:07:05:51 -0400] "GET http://asite.org/thing/471 HTTP/1.1" 200 13970 "http://www.google.com/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"

When I import this with import_logs.py, Piwik registers hits at http://asite.org/http://asite.org/thing/471, so I worked around this by using the --log-format-regex parameter.

--log-format-regex='(?P<ip>\S+) \S+ \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] "\S+ https?://asite\.org(?P<path>.*?) \S+" (?P<status>\S+) (?P<length>\S+) "(?P<referrer>.*?)" "(?P<user_agent>.*?)"'

It would be great if this (varnishncsa tracking through import_logs.py) were more directly supported and documented. I suspect my method isn't ideal where more than one site is being cached with Varnish, or where visitors to those sites are also being logged by Piwik; this method probably only works with one domain.

comment:128 Changed 12 months ago by oliverhumpage

Hi bangpound,

I'm not a Piwik dev, so I can't comment on including varnishncsa support in import_logs.py itself, but if you change your regex slightly to replace

https?://asite\.org

with

(?P<host>https?://[^/]+)

then that will pick up the hostname of the site and therefore work well with multiple vhosts (either define them in piwik in advance, or use --add-sites-new-hosts to add them automatically).

Hope that helps.

comment:129 Changed 11 months ago by brgsousa

Similar to cdgraff's request:
Feature request: support WMS (Windows Media Services) logs. Currently we use AWStats, but it would be great to be able to move to Piwik.

I have attached a sample of WMS version 9.0 log file: WMS_20130523.log

Changed 11 months ago by brgsousa

Log for WMS 9.0

comment:130 Changed 11 months ago by rvthof

I've noticed that the local time for imported logs is not set correctly. Is this expected, or am I doing something wrong?

It seems as if Piwik uses the timezone of the web server that created the logs to set the local visitor time. I don't know whether this happens in the importer or in Piwik itself, but I would like the local visitor time to reflect the timezone the visitor is actually in, based on a GeoIP lookup of their IP. It should be possible either by approximation from longitude and latitude or by using a database like GeoNames.
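
For the longitude approximation, something along these lines would already beat using the server's timezone (a sketch; it relies on the Earth turning 15 degrees of longitude per hour of UTC offset and ignores political timezone borders):

    from datetime import datetime, timedelta

    def approx_local_time(utc_time, longitude):
        # Crude approximation: the UTC offset is roughly longitude / 15,
        # rounded to whole hours.
        offset_hours = int(round(longitude / 15.0))
        return utc_time + timedelta(hours=offset_hours)

For example, approx_local_time(datetime(2013, 6, 1, 12, 0), -73.9) shifts a New York visit five hours back from UTC.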

comment:131 Changed 10 months ago by dsampson

Hey Folks,

Thought I would inform this thread that I have been working on a batch-loading script for those of us who need some extra features, such as remembering how many lines of a log have already been processed. The major use case is people running the script through cron jobs on log files rotated monthly who want to run the stats daily or more frequently than monthly.

You can check out the branch development of batch-loader.py for piwik here:

https://github.com/drsampson/piwik/tree/batch-loader/misc/log-analytics/batch-loader

I would love some testers and feedback. Read the readme here for an overview:
https://github.com/drsampson/piwik/blob/batch-loader/misc/log-analytics/batch-loader/readme.md

Developer notes:
This work is a branch of a forked version of piwik. My goal is to someday make a pull request to integrate in piwik. So piwik developers are encouraged to comment so I can prepare.

comment:132 Changed 10 months ago by Cyril (cbay)

dsampson: I've had a very quick look at your script. The core feature, keeping track of already-imported log lines, should be done in Piwik itself, as detailed by Matt on this ticket. Using a local SQLite database is an inferior solution.

Your Python code could be better. A few suggestions:

  • follow the PEP8, as it's the de-facto standard in the Python world
  • do not concatenate multiple strings with +. Instead, store your strings in a list and use .join()
  • even better, when you can, use string formatting
  • count = sum(1 for line in open(logFile)) : use len() with xreadlines() instead.
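
For example, for the string tips (variable names made up):

    count, filename = 1648, 'access.log'

    # Avoid repeated concatenation:
    message = 'Parsed ' + str(count) + ' lines from ' + filename

    # Better: join the parts...
    message = ''.join(['Parsed ', str(count), ' lines from ', filename])

    # ...or, better still, use string formatting:
    message = 'Parsed %d lines from %s' % (count, filename)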

comment:133 Changed 10 months ago by dsampson

Thanks for the feedback.

  • I will review PEP8; it has been a while
  • interesting approach with strings, I can give that a shot
  • do you mean string formatting using %s? For DB access the community advised against that, but for running the external command it might make sense.

As for developing in Piwik: Python is the extent of this geographer's hacking skills. Since this was not being done within Piwik, I thought I would create a homebrew solution, and then I convinced myself to offer it back to the community for those who could use it.

Perhaps it will inspire someone to do it the right way within Piwik, which would be awesome. Right now it keeps me out of the Piwik internals, which is probably best for everyone (smile).

comment:134 Changed 10 months ago by Cyril (cbay)

String formatting was a general tip to avoid multiple concatenations. Indeed, it should NOT be used for SQL queries with unfiltered input.

As for having a proper solution to your problem, you might try harassing Matt so that he implements it in Piwik :) Just kidding, but I would LOVE to have it!

comment:135 Changed 9 months ago by matt (mattab)

  • Description modified (diff)

Thanks for your submission of this tool, which enhances log analytics use cases.

As for the particular "log line skip" feature: why in core? Because if several servers call Piwik, you are in trouble with a local SQLite database. Better to reuse the Piwik datastore to keep track of dupes :)

Here is my updated proposal implementation.

  • Detect when log-lines are re-imported and only import them once.
    • Implementation: add a new table piwik_log_lines (hash_tracking_request, day)
    • In Piwik Tracker, before looping on the bulk requests, SELECT all the log lines that have already been processed on this day (WHERE hash_tracking_request IN (a,b,c,d) AND day=?) & Skip these requests from import
    • After bulk requests are processed in piwik.php process, INSERT in bulk (hash, day)
  • By default this feature would be enabled only for "Log import" script,
    • via a parameter that we know is the log import (&li=1 /import_logs=1)
    • but may be later useful to all users of Tracking API for general deduping service.
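
To make the intent concrete, here is the proposed flow in pseudo-Python (a sketch only: the real implementation would live in PHP inside piwik.php, and the db helper used here is hypothetical):

    import hashlib

    def filter_new_requests(bulk_requests, day, db):
        # The hash of a raw tracking request plus the day identifies a log line.
        hashes = {hashlib.sha1(r.encode('utf-8')).hexdigest(): r
                  for r in bulk_requests}

        # SELECT the hashes already processed on this day.
        placeholders = ', '.join(['%s'] * len(hashes))
        seen = {row[0] for row in db.query(
            "SELECT hash_tracking_request FROM piwik_log_lines"
            " WHERE hash_tracking_request IN (" + placeholders + ")"
            " AND day = %s",
            list(hashes) + [day])}

        # Skip already-imported requests...
        fresh = {h: r for h, r in hashes.items() if h not in seen}

        # ...and INSERT the new hashes in bulk once processed.
        db.executemany(
            "INSERT INTO piwik_log_lines (hash_tracking_request, day)"
            " VALUES (%s, %s)",
            [(h, day) for h in fresh])

        return list(fresh.values())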

comment:136 Changed 9 months ago by dsampson

Matt,

I agree with you that getting it into core would be best. Having this solution would mean I could dissolve my forked project. Again, if I were a PHP and MySQL developer I would love to help; as a geographer, scripting is done on the side to handle special use cases.

For clarification of the use case: the script is launched independently of Piwik. By that I mean it will likely reside on a log server somewhere, not the Piwik server, and will likely be called through a cron job. Since only a single instance of the script runs on any server, you won't run into collisions from multiple servers using it; if you need multiple instances, each has its own independent SQLite DB. That is why I used SQLite: only one client accesses the database at any one time.

Let me know when these features are added to core and I will dissolve my fork.

Good luck.

comment:137 Changed 8 months ago by matt (mattab)

  • Description modified (diff)

Updated description, adding:

  • #3867 cannot resume with line number reported by skip for ncsa_extended log format
  • #4045 autodetection hangs on a weird formatted line

comment:138 Changed 8 months ago by gildow

Request for support of X-Forwarded-For when importing logs, for cases where a load balancer sits in front of the web server.

The Apache log format is as follows:

LogFormat "%v %{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" cplus

Sample log:
smartstore.oomph.co.id 10.159.117.216, 202.70.56.129 - - [27/Jun/2013:12:05:28 +0700] "GET /index.php/nav/get_menu/1/ HTTP/1.1" 200 2391 "-" "Apache-HttpClient/UNAVAILABLE (java 1.4)"

As you can see, there are two IPs for the remote host (the X-Forwarded-For field): the first IP is the virtual/local IP and the second is the proxy used on a mobile network.

The regular expression used when importing the log is as follows:

--log-format-regex='(?P<host>[\w\-\.]*)(?::\d+)? (?P<ip>\S+) \S+ \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] "\S+ (?P<path>.*?) \S+" (?P<status>\S+) (?P<length>\S+) "(?P<referrer>.*?)" "(?P<user_agent>.*?)"'

This works for regular log lines where there is only one IP address.

My current workaround is to add an additional field for the proxy in import_logs.py and run the import again with a new regex.

--log-format-regex='(?P<host>[\w\-\.]*)(?::\d+)? (?P<proxy>\S+), (?P<ip>\S+) \S+ \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] "\S+ (?P<path>.*?) \S+" (?P<status>\S+) (?P<length>\S+) "(?P<referrer>.*?)" "(?P<user_agent>.*?)"'

It would be nice to have built-in support for X-Forwarded-For instead.

comment:140 follow-up: Changed 8 months ago by Cyril (cbay)

If you're using a reverse proxy, you really should use something like mod_rpaf so that the IP address Apache records is the correct one (the client, not the proxy). Then you can use the standard log formats.

comment:141 in reply to: ↑ 140 Changed 8 months ago by gildow

Correct me if I am wrong... I'm pretty new to Piwik; I used AWStats previously.

That would be possible if these were not older logs. We are talking about importing existing logs, not tracking new ones; it makes little sense to me to ask users to set up mod_rpaf when their aim is to import older logs created without it.

The aim of the import is to bring in older logs; current tracking can already be done by Piwik itself.

Replying to Cyril:

If you're using a reverse proxy, you really should use something like mod_rpaf so that the recorded IP address for Apache is the correct one (the client, not the proxy). And then you can use the standard log formats.

comment:142 Changed 8 months ago by Cyril (cbay)

I don't see why that wouldn't work with a custom regexp?

comment:143 Changed 8 months ago by oliverhumpage

Assuming you want the last IP in the list (and also that you *trust* the last IP in the list; this is why mod_rpaf is the best idea, since it lets you prevent clients spoofing IPs):

--log-format-regex='(?P<host>[\w\-\.]*)(?::\d+)? (?:\S+?, )*(?P<ip>\S+) …

If you want to capture proxy information, I don't think piwik supports that, so you'd need to set up a separate site with an import regex that captures the first IP in the list instead.

comment:144 Changed 8 months ago by gildow

I think the main point here is to IMPORT existing logs. For new logs this can be handled easily, as it is all done in JavaScript.

As for "I don't get why that won't work with a custom regexp?": any idea what the regexp could be? Sorry, I am no regex expert, which is why I ended up processing the log twice and modifying the Python script.

comment:145 Changed 6 months ago by AxelBrooks

Hi,
I'm testing the import and ran the Python script twice on the same log file. It looks like the same log file was processed twice.

Does that mean I have to track the log file history on my own? In other words, can you confirm that the Piwik log processor does not remember the start and end dates of the log files?

Thanks,
Axel

comment:146 Changed 6 months ago by matt (mattab)

In other words, can you confirm that the Piwik log processor does not remember the start and end dates of the log files?

Correct. We would like to add this feature at some point. If you can sponsor it, get in touch!

comment:148 Changed 6 weeks ago by Jadeham

Hi,

my box won't properly process log entries passed to the stdin of import_logs.py. When I read the exact same entries from a file, everything works great. I am using nginx_json-formatted entries. I have tried dry-run mode and normal mode; each time I read from stdin I get the following output (nothing imported). Can anyone get this setup working via stdin?

Thank you for your help!

Test data:

{"ip": "41.11.12.41","host": "www.mywebsite.com","path": "/","status": "200","referrer": "http://"www.mywebsite.com/previous","user_agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/32.0.1700.107 Chrome/32.0.1700.107 Safari/537.36","length": 3593,"generation_time_milli": 0.275,"date": "2014-03-12T22:41:23+01:00"}

Python script parameters:

--url=http://piwik.mywebsite.com
--idsite=1
--recorders=1
--enable-http-errors
--enable-reverse-dns
--enable-bots
--log-format-name=nginx_json

--output
2014-03-12 23:29:37,251: [DEBUG] Accepted hostnames: all
2014-03-12 23:29:37,252: [DEBUG] Piwik URL is: http://piwik.mywebsite.com
2014-03-12 23:29:37,252: [DEBUG] No token-auth specified
2014-03-12 23:29:37,252: [DEBUG] No credentials specified, reading them from "the config file"
2014-03-12 23:29:37,374: [DEBUG] Authentication token token_auth is: a really beautiful token :)
2014-03-12 23:29:37,375: [DEBUG] Resolver: static
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2014-03-12 23:29:37,532: [DEBUG] Launched recorder
Parsing log (stdin)...
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)

Logs import summary
-------------------

    0 requests imported successfully
    0 requests were downloads
    0 requests ignored:
        0 invalid log lines
        0 requests done by bots, search engines, ...
        0 HTTP errors
        0 HTTP redirects
        0 requests to static resources (css, js, ...)
        0 requests did not match any known site
        0 requests did not match any requested hostname

Website import summary
----------------------

    0 requests imported to 1 sites
        1 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:



Performance summary
-------------------

    Total time: 10 seconds
    Requests imported per second: 0.0 requests per second

comment:149 Changed 5 weeks ago by oliverhumpage

Jadeham,

Try setting --recorder-max-payload-size=1. I remember having issues myself when testing with very small data sets (e.g. just one line).

comment:150 Changed 5 weeks ago by matt (mattab)

  • Milestone changed from 2.x - The Great Piwik 2.x Backlog to Future releases
  • Priority changed from critical to normal

comment:151 Changed 4 days ago by estemendoza

I have a similar problem to Jadeham's.

I have configured nginx to log in JSON format and created the following script, which reads from access.log and passes every line to the importer's stdin:

import sh
from sh import tail

# Build the import_logs.py command once; each call feeds it one log line.
run = sh.Command("/usr/bin/python")
run = run.bake("/var/www/piwik/misc/log-analytics/import_logs.py")
run = run.bake("--output=/home/XXX/piwik_live_importer/piwik.log")
run = run.bake("--url=http://X.X.X.X:8081/piwik/")
run = run.bake("--idsite=1")
run = run.bake("--recorders=1")
run = run.bake("--recorder-max-payload-size=1")
run = run.bake("--enable-http-errors")
run = run.bake("--enable-http-redirects")
run = run.bake("--enable-static")
run = run.bake("--enable-bots")
run = run.bake("--log-format-name=nginx_json")
run = run.bake("-")  # read the log line from stdin

# Follow the JSON access log and pipe each new line into the importer.
for line in tail("-f", "/var/log/nginx/access_json.log", _iter=True):
    run(_in=line)

The problem I'm having is that every record seems to be saved, but if I go to the main panel, today's history is not shown. This is the output when saving each line:

Parsing log (stdin)...
Purging Piwik archives for dates: 2014-04-16
To re-process these reports with your new update data, execute the piwik/misc/cron/archive.php script, or see: http://piwik.org/setup-auto-archiving/ for more info.

Logs import summary
-------------------

    1 requests imported successfully
    2 requests were downloads
    0 requests ignored:
        0 invalid log lines
        0 requests done by bots, search engines, ...
        0 HTTP errors
        0 HTTP redirects
        0 requests to static resources (css, js, ...)
        0 requests did not match any known site
        0 requests did not match any requested hostname

Website import summary
----------------------

    1 requests imported to 1 sites
        1 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:



Performance summary
-------------------

    Total time: 0 seconds
    Requests imported per second: 44.04 requests per second

0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)

Besides that, when running archive.php it's slower than with the default nginx log format, and a lot of lines are marked as invalid:

Logs import summary
-------------------

    94299 requests imported successfully
    145340 requests were downloads
    84140 requests ignored:
        84140 invalid log lines
        0 requests done by bots, search engines, ...
        0 HTTP errors
        0 HTTP redirects
        0 requests to static resources (css, js, ...)
        0 requests did not match any known site
        0 requests did not match any requested hostname

Website import summary
----------------------

    94299 requests imported to 1 sites
        1 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:



Performance summary
-------------------

    Total time: 1147 seconds
    Requests imported per second: 82.21 requests per second

Is there any way to know why these records are not shown, and which records are being marked as invalid?

comment:152 Changed 4 days ago by estemendoza

OK, I figured out the cause of the invalid requests: the user_agent contained a strange character. So maybe the script should be aware of Unicode characters.
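
In the meantime, a possible workaround (an untested sketch) is to normalize the file to valid UTF-8 before importing, replacing any broken bytes:

    import sys

    # Usage: python clean_utf8.py access_json.log > access_json.clean.log
    # Decode each raw line as UTF-8, replacing invalid byte sequences, so a
    # stray byte in a field such as user_agent no longer invalidates the line.
    with open(sys.argv[1], 'rb') as source:
        for raw in source:
            sys.stdout.write(raw.decode('utf-8', errors='replace'))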

comment:153 Changed 4 days ago by matt (mattab)

To see the data in the dashboard, execute the piwik/misc/cron/archive.php script, or see: http://piwik.org/setup-auto-archiving/ for more info.

Ok, I figured out why the invalid requests. It was because the user_agent had a strange character. So, maybe the script should be aware of unicode characters

Sure, please create a new ticket for this bug and attach a log file with one line that showcases it. Thanks.
