bug import_logs.py #4697

Closed
imoullet opened this issue Feb 17, 2014 · 8 comments
Labels
Bug For errors / faults / flaws / inconsistencies etc. Critical Indicates the severity of an issue is very critical and the issue has a very high priority. worksforme The issue cannot be reproduced and things work as intended.

@imoullet

Running
piwik 2.1 RC1
python 2.7.1

I have been discussing this problem in the forum for 4 weeks now (http://forum.piwik.org/read.php?2,110277) without finding a solution, and I really think there is a bug in import_logs.py. All the details are in the discussion linked above.

Here is a summary with some test log files.

I ran the following command:
/usr/bin/python2.7 /var/www/html/piwik/misc/log-analytics/import_logs.py --url=https://w3stat.unil.ch/piwik/ /var/tmp/stats/app/xxxx --idsite=xxx --config=/var/www/html/piwik/config/config.ini.php --recorders=2 --log-hostname=www3.unil.ch --hostname=www3.unil.ch --enable-static --enable-bots --enable-http-errors --enable-http-redirects --enable-reverse-dns --strip-query-string

for the two log files attached (22nd of January and 14th of February).

You can look at the results for this site on our Piwik site: https://w3stat.unil.ch/piwik using piwik/debug4piwik as user/password.

You will see that the Piwik results are wrong both for the visitor log (some IPs are ignored for the 22nd of January and also for the 14th of February) and for the Actions > Pages report.

For example, some IPs are missing from the Visitor Log report for the 22nd of January:

65.55.24.218 and 83.233.207.74 are not there, although they are present in the log files (see my preceding message).

And the Actions > Pages report is empty, while I have requests such as

83.139.189.139 - - +0100 "GET /wpmu/alumnil/participez-a-la-construction-dun-nouvel-avenir-technologique-et-social/ HTTP/1.0" 200 34367 "http://www3.unil.ch/wpmu/alumnil/participez-a-la-construction-dun-nouvel-avenir-technologique-et-social/" "Mozilla/5.0 (Windows NT 5.2; rv:17.0) Gecko/20100101 Firefox/17.0"

in my log file.

I should also mention that for the same site I can see some page views (in the Actions > Pages report), because I use the WP-Piwik plugin for this individual site.

The Actions > Downloads report is the only one that seems to be correct.

So in conclusion, I cannot compare the results for each individual WordPress site (generated with the WP-Piwik plugin) with the results for all my WordPress sites (generated with import_logs.py). Indeed, the totals for all WP sites (i.e. 250 sites!) are much lower than for one individual site. That is what alerted me that something was wrong with import_logs.py.

I am really confused about this.
I get the same results for all my parsed log files. They all come from an Apache web server using the combined (NCSA) format.

Please let me know if you need more information. The Piwik output is correct in the sense that it imports the correct number of lines.

I hope you can help me!

Keywords: import_logs
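
For anyone reproducing this, it can help to check, independently of Piwik, which client IPs and status codes a log file contains before comparing against the Visitor Log. Below is a minimal sketch that assumes the standard Apache combined (NCSA) layout. The regular expression, the `summarize` helper and the sample lines (documentation-range IPs, made-up timestamps) are illustrative and are not taken from import_logs.py.

```python
import re
from collections import Counter

# Assumed layout (Apache "combined"):
# host ident user [time] "request" status bytes "referrer" "user-agent"
COMBINED_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)(?: (?P<protocol>[^"]*))?" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def summarize(lines):
    """Return (hits per IP, hits per status code, number of unparsable lines)."""
    ips, statuses, invalid = Counter(), Counter(), 0
    for line in lines:
        match = COMBINED_RE.match(line)
        if match is None:
            invalid += 1
            continue
        ips[match.group("ip")] += 1
        statuses[match.group("status")] += 1
    return ips, statuses, invalid


# Illustrative sample only; these are not lines from the attached log files.
SAMPLE = [
    '192.0.2.10 - - [22/Jan/2014:10:15:32 +0100] "GET /wpmu/ HTTP/1.0" 200 34367 "-" "Mozilla/5.0"',
    '192.0.2.10 - - [22/Jan/2014:10:15:40 +0100] "GET /doc/ HTTP/1.1" 303 - "-" "ExampleBot/1.0"',
    'not a log line',
]

ips, statuses, invalid = summarize(SAMPLE)
print(dict(ips), dict(statuses), invalid)
```

To run it on a real file, pass an open file object instead of `SAMPLE`, for example `summarize(open("access_test0"))`. Any IP listed here but absent from the Visitor Log is worth reporting with the matching log line.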

@imoullet
Author

Attachment: test log file, 22nd of January
access_test0

@imoullet
Author

Attachment: log file, 14th of February
access_test1.zip

@mattab
Member

mattab commented Feb 20, 2014

65.55.24.218 is excluded because it is an MSN bot IP address, which we exclude by design (unless you add --enable-bots).

IP 83.233.x.x is tracked for me.

After importing the logs you have to re-run the archive.php cron script.

Everything appears to work (I reused your access_test0 log). Please try again with RC5, as I think it will work!

@imoullet
Author

Indeed, I did add the --enable-bots option!

So I retried today with RC6 on access_test0 (a 13-line log file) and I still have the same problem: IP 83.233.x.x is not tracked for me, and neither is 65.55.24.218, even with the --enable-bots option.

Why do you say I have to re-run the archive.php cron script? I don't understand.

And what do you think of the following one-line log file example that I mentioned in the forum?

Last test with Piwik 2.1 RC6 and a one-line log file:

41.107.212.109 - - +0100 "GET /slav/ling/cours/a07-08/SEMI%20UNIL/041207Iva.html HTTP/1.1" 200 6690 "https://www.google.dz/" "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36"

command:
/usr/local/bin/python2.7 /var/www/html/piwik/misc/log-analytics/import_logs.py --url=https://w3stat.unil.ch/piwik/ --idsite=579 --config=/var/www/html/piwik/config/config.ini.php --recorders=2 --log-hostname=www2.unil.ch --hostname=www2.unil.ch --enable-static --enable-bots --enable-http-errors --enable-http-redirects --enable-reverse-dns --strip-query-string /var/tmp/stats/prod/access_slav2 2>&1

Output:
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log /var/tmp/stats/prod/access_slav2...
Purging Piwik archives for dates: 2014-02-17
To re-process these reports with your new update data, execute the piwik/misc/cron/archive.php script, or see: [piwik.org] for more info.

Logs import summary

1 requests imported successfully
0 requests were downloads
0 requests ignored:
0 invalid log lines
0 requests done by bots, search engines, ...
0 HTTP errors
0 HTTP redirects
0 requests to static resources (css, js, ...)
0 requests did not match any known site
0 requests did not match any requested hostname

Website import summary

1 requests imported to 1 sites
1 sites already existed
0 sites were created:

0 distinct hostnames did not match any existing site:

Performance summary

Total time: 0 seconds
Requests imported per second: 11.11 requests per second

And the Actions > Pages report is empty! I just see the IP in the visitor log, but again with 0 actions.

Any idea?

Could it be that my database is corrupted somewhere? I do not understand, and I really have no confidence in the Piwik tracking results.

Thanks for your help.

@mattab
Member

mattab commented Feb 21, 2014

Do you have "Browser trigger archiving" enabled?

See: http://piwik.org/docs/setup-auto-archiving/

Execute:

    php misc/cron/archive.php --url=http://piwik.example.org

after importing the stats.

Can you now see it in the Page URLs report for the 17th of February?
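
The import-then-archive sequence suggested above could be scripted roughly as follows. This is only a sketch: the helper names are hypothetical, the install path and URL are placeholders, and the only flags used are ones that already appear in this thread.

```python
import os
import subprocess


def build_commands(piwik_dir, piwik_url, idsite, logfile):
    """Build the two command lines: log import first, then archiving."""
    import_cmd = [
        "python",
        os.path.join(piwik_dir, "misc", "log-analytics", "import_logs.py"),
        "--url={}".format(piwik_url),
        "--idsite={}".format(idsite),
        "--enable-bots",  # keep bot requests, as in the commands in this thread
        logfile,
    ]
    archive_cmd = [
        "php",
        os.path.join(piwik_dir, "misc", "cron", "archive.php"),
        "--url={}".format(piwik_url),
    ]
    return [import_cmd, archive_cmd]


def run_import_then_archive(piwik_dir, piwik_url, idsite, logfile):
    """Run both steps in order; stop with an exception if either one fails."""
    for cmd in build_commands(piwik_dir, piwik_url, idsite, logfile):
        subprocess.check_call(cmd)


# Placeholder values; adjust them to the actual installation.
commands = build_commands("/var/www/html/piwik", "http://piwik.example.org", 1, "access.log")
for cmd in commands:
    print(" ".join(cmd))
```

Running the archive step directly after the import removes any doubt about whether the cron job has picked up the purged dates yet.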

@imoullet
Author

No, "Browser trigger archiving" is disabled now.
I run archive.php every 2 hours, which means it has run many times since I imported the data,
and I still do not see the pages in the Actions > Pages report. I can see the IP, but with "0 actions".

I am away from my office for one week now, so do not despair if I don't reply to your suggestions right away.

Best regards,

@vspiliop

Hello all!

I am using Piwik for a customer and just found the following very serious issue.

I am using the latest Piwik (2.2.0) and probably have the same issue as imoullet.

PROBLEM:

Lines with HTTP status 200 are ignored! That is, only the first entry is included in both the Visits and the Actions. This happens both before and after I run the archiving, so archiving is irrelevant.

I just import access.log, a file with only 2 lines:

66.249.76.11 - - +0100 "GET /id/resource/013541589 HTTP/1.1" 303 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.76.11 - - +0100 "GET /doc/resource/007667232 HTTP/1.1" 200 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

via the command:

python import_logs.py --url=http://localhost:83/analytics/ access.log --idsite=1 --recorders=2 --enable-http-errors --enable-http-redirects --enable-static --enable-bots --add-sites-new-hosts

Result:

0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log access_006_bl.services.tso.co.uk.2014.05.12.log...
Purging Piwik archives for dates: 2014-05-11
To re-process these reports with your new update data, execute the following command:
piwik/console core:archive --url=http://example/piwik/
Reference: http://piwik.org/docs/setup-auto-archiving/

Logs import summary

2 requests imported successfully
0 requests were downloads
0 requests ignored:
    0 invalid log lines
    0 requests done by bots, search engines, ...
    0 HTTP errors
    0 HTTP redirects
    0 requests to static resources (css, js, ...)
    0 requests did not match any known site
    0 requests did not match any requested hostname

Website import summary

2 requests imported to 1 sites
    1 sites already existed
    0 sites were created:

0 distinct hostnames did not match any existing site:

Performance summary

Total time: 0 seconds
Requests imported per second: 3.29 requests per second
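
One way to cross-check a run like this is to turn the importer's summary into numbers and compare them with what the Visits and Actions reports show. The sketch below assumes the summary layout printed above; `parse_summary` is a hypothetical helper, not part of import_logs.py.

```python
import re

# A summary line is assumed to look like "<count> <label>", optionally
# indented and optionally ending with a colon.
COUNT_LINE = re.compile(r"^\s*(\d+) (.+?):?\s*$")


def parse_summary(text):
    """Map each summary label to its integer count."""
    counts = {}
    for line in text.splitlines():
        match = COUNT_LINE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


# Excerpt of the "Logs import summary" section shown above.
SUMMARY = """\
2 requests imported successfully
0 requests were downloads
0 requests ignored:
    0 invalid log lines
    0 requests done by bots, search engines, ...
    0 HTTP errors
    0 HTTP redirects
"""

counts = parse_summary(SUMMARY)
print(counts["requests imported successfully"], counts["requests ignored"])
```

If the importer reports 2 requests imported and 0 ignored, but the reports show only one action, the discrepancy lies after the import step rather than in the log parsing.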

Kind Regards,
Vassilis

@mattab
Member

mattab commented May 14, 2014

> I am using the latest piwik (2.2.0) and have probably the same issue with imoullet.

The latest Piwik is 2.2.2, and this bug should be fixed.

Please try the latest beta: http://piwik.org/faq/how-to-update/faq_159/
and create a new ticket here if you still have the bug with that version. Thanks.

@imoullet imoullet added this to the 2.1 - Piwik 2.1 milestone Jul 8, 2014
This issue was closed.

3 participants