Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

import_logs.py cannot resume with line number reported by skip for ncsa_extended log format #3867

Closed
anonymous-matomo-user opened this issue Apr 3, 2013 · 4 comments
Labels
Bug For errors / faults / flaws / inconsistencies etc. worksforme The issue cannot be reproduced and things work as intended.
Milestone

Comments

@anonymous-matomo-user
Copy link

others seem to be experiencing the same problem:

http://forum.piwik.org/read.php?2,94797,102560

When the import fails, trying to resume with line reported by skip fails as well. However, I noticed the number of lines parsed was typically the same and supplied that value to skip which allowed the import to resume. Without debugging the issue, I cannot know if this is a valid work around i.e. does the script resume skipping over potentially valid log lines
Keywords: import_logs.py ncsa_extended resume skip lines

@anonymous-matomo-user
Copy link
Author

beyond failing to resume, for those looking to automate the import it is not ideal to have high probability of the import failing so possibly some issue parsing the ncsa_extended format

initially experienced with 1.11.1 release (python 2.6 and python 2.7.3 with php 5.3.14 and 21,22,23)

I've used the most current version of the script from the git repo and same issue

I have 7GB of log data to import split into reasonable size files and the import fails on a few of them now whereas without specifying an exclude list (save on processing requests to api) it failed more often

@mattab
Copy link
Member

mattab commented Apr 3, 2013

Thanks for the report. Could you take a look at the REGEX used for this log format, and existing tests in: https://github.com/piwik/piwik/tree/master/misc/log-analytics/tests/logs
and possibly upgrade the test or maybe it helps finding the issue?

post here if you have findings... thanks!

@anonymous-matomo-user
Copy link
Author

Could you take a look at the REGEX used for this log format
Verified the log format (apache configuration as "combined" including visual inspection) against the regex. http://httpd.apache.org/docs/1.3/logs.html#combined (Logs being imported are from an OpenBSD web server which uses a forked version of apache 1.3)

existing tests
Not sure how the piwik log tests will be any help in this situation. I tried to make sure I didn't have a configuration issue and combed the forums and bug tracker prior to posting the ticket.

Attempting skip after import quit...

continuing the import doesn't get far when reattempting using the value reported for skip. However, I am able to get the import to complete in most cases using last report value of lines parsed. Other cases, the import continues recording but quits again. Specifying the skip value results in quitting without recording any line but providing the last reported value of lines parsed results in the import completing.

Start import of log
...
289630 lines parsed, 33871 lines recorded, 577 records/sec (avg), 535 records/sec (current)
2013-04-07 21:16:31,512: [DEBUG] Error when connecting to Piwik: <urlopen error None>
293091 lines parsed, 34032 lines recorded, 570 records/sec (avg), 161 records/sec (current)
293091 lines parsed, 34238 lines recorded, 564 records/sec (avg), 206 records/sec (current)
2013-04-07 21:16:33,588: [DEBUG] Error when connecting to Piwik: <urlopen error None>
299982 lines parsed, 34954 lines recorded, 567 records/sec (avg), 716 records/sec (current)
303928 lines parsed, 35229 lines recorded, 562 records/sec (avg), 275 records/sec (current)
2013-04-07 21:16:35,670: [DEBUG] Error when connecting to Piwik: <urlopen error None>
Fatal error: None
You can restart the import of "log-to-import" from the point it failed by specifying --skip=273790 on the command line.

1st attempt to continue import using value reported by skip
...
2013-04-07 21:18:31,927: [DEBUG] Resolver: static
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2013-04-07 21:18:31,998: [DEBUG] Launched recorder
2013-04-07 21:18:31,998: [DEBUG] Launched recorder
2013-04-07 21:18:31,999: [DEBUG] Launched recorder
2013-04-07 21:18:31,999: [DEBUG] Launched recorder
Parsing log log-to-import...
273944 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2013-04-07 21:18:33,741: [DEBUG] Error when connecting to Piwik: <urlopen error None>
2013-04-07 21:18:33,779: [DEBUG] Error when connecting to Piwik: <urlopen error None>
2013-04-07 21:18:33,808: [DEBUG] Error when connecting to Piwik: <urlopen error None>
2013-04-07 21:18:33,889: [DEBUG] Error when connecting to Piwik: <urlopen error None>
287931 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
301444 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2013-04-07 21:18:35,863: [DEBUG] Error when connecting to Piwik: <urlopen error None>
2013-04-07 21:18:35,868: [DEBUG] Error when connecting to Piwik: <urlopen error None>
2013-04-07 21:18:35,931: [DEBUG] Error when connecting to Piwik: <urlopen error None>
2013-04-07 21:18:35,992: [DEBUG] Error when connecting to Piwik: <urlopen error None>
305730 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
305730 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2013-04-07 21:18:38,024: [DEBUG] Error when connecting to Piwik: <urlopen error None>
Fatal error: None
You can restart the import of "log-to-import" from the point it failed by specifying --skip=274057 on the command line.

2nd attempt to continue import using value reported by skip
...
2013-04-07 21:19:23,406: [DEBUG] Resolver: static
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2013-04-07 21:19:23,477: [DEBUG] Launched recorder
2013-04-07 21:19:23,478: [DEBUG] Launched recorder
2013-04-07 21:19:23,478: [DEBUG] Launched recorder
2013-04-07 21:19:23,478: [DEBUG] Launched recorder
Parsing log /var/apache-log-archive/apache-log-import/crs.com.aa...
274368 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2013-04-07 21:19:25,176: [DEBUG] Error when connecting to Piwik: <urlopen error None>
2013-04-07 21:19:25,244: [DEBUG] Error when connecting to Piwik: <urlopen error None>
2013-04-07 21:19:25,245: [DEBUG] Error when connecting to Piwik: <urlopen error None>
2013-04-07 21:19:25,332: [DEBUG] Error when connecting to Piwik: <urlopen error None>
288769 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
302306 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2013-04-07 21:19:27,351: [DEBUG] Error when connecting to Piwik: <urlopen error None>
2013-04-07 21:19:27,413: [DEBUG] Error when connecting to Piwik: <urlopen error None>
2013-04-07 21:19:27,449: [DEBUG] Error when connecting to Piwik: <urlopen error None>
306365 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2013-04-07 21:19:27,626: [DEBUG] Error when connecting to Piwik: <urlopen error None>
306365 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2013-04-07 21:19:29,402: [DEBUG] Error when connecting to Piwik: <urlopen error None>
Fatal error: None
You can restart the import of "lot-to-import" from the point it failed by specifying --skip=274066 on the command line.

Continuing with last reported value of lines parsed results in import completing
...
2013-04-07 21:19:55,738: [DEBUG] Resolver: static
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2013-04-07 21:19:55,808: [DEBUG] Launched recorder
2013-04-07 21:19:55,808: [DEBUG] Launched recorder
2013-04-07 21:19:55,808: [DEBUG] Launched recorder
2013-04-07 21:19:55,809: [DEBUG] Launched recorder
Parsing log log-to-import...
269907 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
317242 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
328403 lines parsed, 325 lines recorded, 107 records/sec (avg), 325 records/sec (current)
328942 lines parsed, 962 lines recorded, 238 records/sec (avg), 637 records/sec (current)
334000 lines parsed, 1846 lines recorded, 365 records/sec (avg), 884 records/sec (current)
338542 lines parsed, 1910 lines recorded, 315 records/sec (avg), 64 records/sec (current)
343848 lines parsed, 2684 lines recorded, 379 records/sec (avg), 774 records/sec (current)
348330 lines parsed, 3435 lines recorded, 424 records/sec (avg), 751 records/sec (current)
349526 lines parsed, 3592 lines recorded, 394 records/sec (avg), 157 records/sec (current)
355884 lines parsed, 4337 lines recorded, 428 records/sec (avg), 745 records/sec (current)
358584 lines parsed, 5089 lines recorded, 457 records/sec (avg), 752 records/sec (current)
...
1491586 lines parsed, 159321 lines recorded, 492 records/sec (avg), 816 records/sec (current)
1494986 lines parsed, 159583 lines recorded, 491 records/sec (avg), 262 records/sec (current)
1499957 lines parsed, 160213 lines recorded, 492 records/sec (avg), 630 records/sec (current)
1499997 lines parsed, 160826 lines recorded, 492 records/sec (avg), 613 records/sec (current)
1499997 lines parsed, 161206 lines recorded, 492 records/sec (avg), 380 records/sec (current)
1499997 lines parsed, 162168 lines recorded, 493 records/sec (avg), 962 records/sec (current)
1499997 lines parsed, 162254 lines recorded, 492 records/sec (avg), 86 records/sec (current)
Purging Piwik archives for dates: 2012-07-31 2012-07-24 2012-07-25 2012-07-26 2012-07-21 2012-07-27 2012-07-22 2012-07-28 2012-07-23 2012-07-29 2012-07-30
To re-process these reports with your new update data, execute the piwik/misc/cron/archive.php script, or see: http://piwik.org/setup-auto-archiving/ for more info.

Logs import summary

162403 requests imported successfully
13742 requests were downloads
333619 requests ignored:
    184411 invalid log lines
    0 requests done by bots, search engines, ...
    84022 HTTP errors
    0 HTTP redirects
    65186 requests to static resources (css, js, ...)
    0 requests did not match any known site
    0 requests did not match any requested hostname

Website import summary

162403 requests imported to 1 sites
    1 sites already existed
    0 sites were created:

0 distinct hostnames did not match any existing site:

Performance summary

Total time: 330 seconds
Requests imported per second: 492.0 requests per second

Importing to reproduce the failure again...

I now noticed that for the piwik server nginx logged a few 499s just before the script terminated. That's probably normal I suppose being that the import script quits after a fixed number of errors? There are 4 recorders, assuming one of the four encounters the error and retries then the script decides to quit having the other three close their connections? If not, then the question is why the connection was being closed by the import script. So possibly an NGINX configuration problem then? I'm seeing anything in to indicate an configuration problem in the nginx, fpm, or php logs.

options being used:

--url=piwk url \
--idsite=9 \
--output=${LOGFILE} \
--hostname=hostname \
--log-format-name=ncsa_extended \
--encoding=ASCII \
--recorders=4 \
--enable-bots \
--enable-http-redirects \
--exclude-path-from=PIWIK_EXCLUDE_PATH.txt \

...
1.2.3.4 - - -0500 "POST /piwik.php HTTP/1.1" 200 39 "-" "Python-urllib/2.7"
1.2.3.4 - - -0500 "POST /piwik.php HTTP/1.1" 200 54 "-" "Python-urllib/2.7"
1.2.3.4 - - -0500 "POST /piwik.php HTTP/1.1" 200 54 "-" "Python-urllib/2.7"
1.2.3.4 - - -0500 "POST /piwik.php HTTP/1.1" 499 0 "-" "Python-urllib/2.7"
1.2.3.4 - - -0500 "POST /piwik.php HTTP/1.1" 499 0 "-" "Python-urllib/2.7"
1.2.3.4 - - -0500 "POST /piwik.php HTTP/1.1" 499 0 "-" "Python-urllib/2.7"
1.3.3.4 - - -0500 "POST /piwik.php HTTP/1.1" 200 39 "-" "Python-urllib/2.7"

...
273753 lines parsed, 33009 lines recorded, 572 records/sec (avg), 476 records/sec (current)
282152 lines parsed, 33336 lines recorded, 568 records/sec (avg), 327 records/sec (current)
293091 lines parsed, 33871 lines recorded, 567 records/sec (avg), 535 records/sec (current)
2013-04-07 16:07:30,306: [DEBUG] Error when connecting to Piwik: <urlopen error None>
293091 lines parsed, 34032 lines recorded, 560 records/sec (avg), 161 records/sec (current)
293091 lines parsed, 34238 lines recorded, 555 records/sec (avg), 206 records/sec (current)
2013-04-07 16:07:32,395: [DEBUG] Error when connecting to Piwik: <urlopen error None>
299982 lines parsed, 34954 lines recorded, 557 records/sec (avg), 716 records/sec (current)
303928 lines parsed, 35229 lines recorded, 553 records/sec (avg), 275 records/sec (current)
2013-04-07 16:07:34,472: [DEBUG] Error when connecting to Piwik: <urlopen error None>
Fatal error: None
You can restart the import of "log-to-import" from the point it failed by specifying --skip=273790 on the command line.
...

Importing suspect portion of log...

Additionally, I tried to reproduce the failure with the suspect portion of the log but that went without issue.

sed -n '273790,303928p; 303929q' log-to-import > problem.log

...
2013-04-07 16:32:08,985: [DEBUG] Resolver: static
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2013-04-07 16:32:09,056: [DEBUG] Launched recorder
2013-04-07 16:32:09,056: [DEBUG] Launched recorder
2013-04-07 16:32:09,056: [DEBUG] Launched recorder
2013-04-07 16:32:09,057: [DEBUG] Launched recorder
Parsing log problem.log...
14047 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
27843 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
30139 lines parsed, 596 lines recorded, 196 records/sec (avg), 596 records/sec (current)
30139 lines parsed, 1026 lines recorded, 254 records/sec (avg), 430 records/sec (current)
30139 lines parsed, 1325 lines recorded, 262 records/sec (avg), 299 records/sec (current)
30139 lines parsed, 2148 lines recorded, 354 records/sec (avg), 823 records/sec (current)
30139 lines parsed, 2483 lines recorded, 351 records/sec (avg), 335 records/sec (current)
30139 lines parsed, 2795 lines recorded, 346 records/sec (avg), 312 records/sec (current)
30139 lines parsed, 2795 lines recorded, 307 records/sec (avg), 0 records/sec (current)
Purging Piwik archives for dates: 2012-07-21 2012-07-20
To re-process these reports with your new update data, execute the piwik/misc/cron/archive.php script, or see: http://piwik.org/setup-auto-archiving/ for more info.

Logs import summary

2977 requests imported successfully
309 requests were downloads
10716 requests ignored:
    6058 invalid log lines
    0 requests done by bots, search engines, ...
    3124 HTTP errors
    0 HTTP redirects
    1534 requests to static resources (css, js, ...)
    0 requests did not match any known site
    0 requests did not match any requested hostname

Website import summary

2977 requests imported to 1 sites
    1 sites already existed
    0 sites were created:

0 distinct hostnames did not match any existing site:

Performance summary

Total time: 9 seconds
Requests imported per second: 327.27 requests per second

@anonymous-matomo-user
Copy link
Author

Update: I forgot to post back but ultimately the problem had to do with server configuration timeouts and the size of logs I was importing... although it wasn't captured as an error in the logs for nginx or php.

Increasing the values for nginx and php in the php-fpm pool for piwik allowed successful import of logs, the only other issue I observed was the archiving had been timing out after importing a large amount of data. For that, I further increase the values.

http://nginx.org/en/docs/http/ngx_http_fastcgi_module.html

pass timeout responsibility to upstream (php)

fastcgi_read_timeout 14400; # 4 hrs

PHP FPM (FastCGI) pool config values for piwik

; 30mins for archive.php to generate reports and allow sizable log file imports
php_admin_value[max_execution_time] = 1800
php_admin_value[max_input_time] = 1800

These values worked for the size of log file I was importing given the server hardware piwik was running and a separate database server.

@anonymous-matomo-user anonymous-matomo-user added this to the 2.x - The Great Piwik 2.x Backlog milestone Jul 8, 2014
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug For errors / faults / flaws / inconsistencies etc. worksforme The issue cannot be reproduced and things work as intended.
Projects
None yet
Development

No branches or pull requests

2 participants