Log analytics list of improvements #3163

Closed
mattab opened this issue May 31, 2012 · 162 comments
Labels
duplicate For issues that already existed in our issue tracker and were reported previously. Enhancement For new feature suggestions that enhance Matomo's capabilities or add a new report, new API etc.

Comments

@mattab
Member

mattab commented May 31, 2012

In Piwik 1.8 we released a great new feature: importing access logs and generating statistics from them.

The V1 release works very well (it was tracked in #703), but there are ideas to improve it. This ticket is a placeholder for all ideas and discussions related to the Log Analytics feature!

New features

  • Track non-bot activity only. When --enable-bots is not specified, it would be a nice improvement if we:

    • exclude visits with more than 150 actions per visitor ID to block crawlers (detected at the Python level by counting requests for that IP in the queue)
    • exclude visits that have no User-Agent, or only one of the very basic User-Agent strings used by bots
    • exclude all requests from a visitor when one of its first requests is for /robots.txt; if a request for robots.txt appears in the middle of a visit, we could stop tracking the subsequent requests
    • check that /index.php?minimize_js=file.js is counted as a static file, since it ends in .js

    After that, detection of bots and crawlers would be much better.

  • Support the Accept-Language header and forward it to Piwik via the &lang= parameter. That might also be useful to some users who need to use this data in a custom plugin.

  • Make it easy to delete the logs for one day so that one log file can be re-imported.

    This would be a new option to the Python script. It would reuse the code from the Log Delete feature, but would delete only one day. The Python script would call, for example, the CoreAdmin API to delete this single day for a given website. This would make it easy to re-import data that didn't work the first time or was bogus.

  • Detect when log-lines are re-imported and only import them once.

    • Implementation: add a new table piwik_log_lines (hash_tracking_request, day)
    • In the Piwik Tracker, before looping over the bulk requests, SELECT all the log lines that have already been processed on this day (WHERE hash_tracking_request IN (a,b,c,d) AND day=?) and skip these requests during import
    • After the bulk requests are processed in piwik.php, INSERT the (hash, day) pairs in bulk
  • By default this feature would be enabled only for the "Log import" script,

    • via a parameter that tells us the request comes from the log import (&li=1 / import_logs=1)
    • but it may later be useful to all users of the Tracking API as a general deduplication service.
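
The re-import detection idea at the end of this list can be sketched in a few lines. This is only an illustration: it uses Python and an in-memory SQLite table as a stand-in for the proposed piwik_log_lines table (the proposal places the real logic in the PHP tracker), and the helper names `line_hash` and `filter_new_requests` are hypothetical.

```python
import hashlib
import sqlite3


def line_hash(tracking_request):
    """Hash one tracking request so it can be looked up cheaply."""
    return hashlib.sha1(tracking_request.encode("utf-8")).hexdigest()


def filter_new_requests(conn, day, requests):
    """Return the requests not yet recorded for `day`, then record them."""
    # Keyed by hash, so duplicates inside the same batch collapse too.
    by_hash = {line_hash(r): r for r in requests}
    if not by_hash:
        return []
    placeholders = ",".join("?" * len(by_hash))
    rows = conn.execute(
        "SELECT hash_tracking_request FROM piwik_log_lines "
        "WHERE hash_tracking_request IN (%s) AND day = ?" % placeholders,
        [*by_hash, day],
    ).fetchall()
    seen = {row[0] for row in rows}
    fresh = [(h, r) for h, r in by_hash.items() if h not in seen]
    # Bulk INSERT of the (hash, day) pairs that were just accepted.
    conn.executemany(
        "INSERT INTO piwik_log_lines (hash_tracking_request, day) VALUES (?, ?)",
        [(h, day) for h, _ in fresh],
    )
    return [r for _, r in fresh]


# Stand-in for the proposed table (assumption: the real one would live in MySQL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE piwik_log_lines ("
    "hash_tracking_request TEXT, day TEXT, "
    "PRIMARY KEY (hash_tracking_request, day))"
)
```

Calling `filter_new_requests` twice with overlapping batches for the same day returns the overlapping requests only the first time. Very large batches would need to be split to stay under the database's limit on bound parameters.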

Performance

How to debug performance? First of all, you can run the script with --dry-run to see how many log lines per second are parsed. It should typically be between 2,000 and 5,000. When you don't do a dry run, the script also inserts new pageviews and visits by calling the Piwik API.

Other tickets

@oliverhumpage

Could I just re-ask an unanswered question from ticket #703? If instead of specifying a file you do

cat /path/to/log | import_logs.py [options] -

then does it work for you, or do you just get 0 lines imported? Because with the latest version I'm getting 0 lines imported, and that means I can't log straight from apache (and hence the README is wrong too).

@cbay
Contributor

cbay commented May 31, 2012

oliverhumpage: I couldn't reproduce this issue. Do you get it with --dry-run too? Could you send a minimal log file?

@ddeimeke

ddeimeke commented Jun 4, 2012

Counting Downloads:

In a podcast project I want to count only the downloads of file type "mp3" and "ogg". In another project it would be nice to count only the PDF downloads.

Another topic in this area is how downloads are counted. Not every occurrence of the file in the logs is a download. For instance, I am using an HTML5 player. Users might hear one part of the podcast on their first visit and other parts on subsequent visits. All together, that would be one download.

A possible "solution" (or may be a workaround): Sum up all the "bytes transferred" and divide it by the largest "bytes transferred" for a certain file.
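
The workaround above can be sketched as follows. This is a minimal illustration, assuming the log has already been parsed into (path, bytes transferred) pairs; the function name `estimate_downloads` and the default extension filter are hypothetical.

```python
from collections import defaultdict


def estimate_downloads(entries, extensions=(".mp3", ".ogg")):
    """Estimate full downloads per file from partial transfers.

    `entries` is an iterable of (path, bytes_transferred) pairs. For each
    file, the sum of all bytes transferred is divided by the largest single
    transfer seen for that file, which is used as a proxy for the file size.
    """
    total = defaultdict(int)
    largest = defaultdict(int)
    for path, sent in entries:
        if not path.lower().endswith(extensions):
            continue  # only count the file types of interest
        total[path] += sent
        largest[path] = max(largest[path], sent)
    return {p: total[p] / largest[p] for p in total if largest[p] > 0}


hits = [
    ("/ep1.mp3", 1000),  # one complete transfer
    ("/ep1.mp3", 400),   # two partial transfers from the HTML5 player
    ("/ep1.mp3", 600),
    ("/ep2.ogg", 500),
    ("/index.html", 300),
]
print(estimate_downloads(hits))  # {'/ep1.mp3': 2.0, '/ep2.ogg': 1.0}
```

The estimate is too high for a file that nobody ever fetched completely, because the largest transfer is then smaller than the real file size.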

@anonymous-matomo-user

Feature request: support Icecast logs. We currently use AWStats, but it would be great to be able to move to Piwik.

@oliverhumpage

@cyril

Having spent some time looking into it, and working out exactly which revision caused the problem, I think it's down to the regex I used in --log-format-regex not working any more. Turns out the regex format in import_logs.py has had the group <timezone> added to it, which seems to be required by code further down the script.

Could you update the readme so the middle of the regex changes from:

\[(?P<date>.*?)\]

to

\[(?P<date>.*?) (?P<timezone>.*?)\]

This will then make it all work.
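
To see what the extra group changes, here is a small check in Python. It assumes the intended pattern puts the date and the timezone inside the same brackets, separated by a space; the sample log line is made up.

```python
import re

# The bracketed part of the regex before and after the change.
OLD_DATE_PART = r"\[(?P<date>.*?)\]"
NEW_DATE_PART = r"\[(?P<date>.*?) (?P<timezone>.*?)\]"

sample = '1.2.3.4 - - [31/May/2012:10:15:00 +0200] "GET / HTTP/1.1" 200 512'

old = re.search(OLD_DATE_PART, sample)
new = re.search(NEW_DATE_PART, sample)

print(old.groupdict())  # {'date': '31/May/2012:10:15:00 +0200'}
print(new.groupdict())  # {'date': '31/May/2012:10:15:00', 'timezone': '+0200'}
```

With the old pattern there is no timezone group at all, so any code that reads that group has nothing to work with.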

Thanks,

Oliver.

@cbay
Contributor

cbay commented Jun 8, 2012

(In [6471]) Refs #3163 Fixed regexp in README.

@cbay
Contributor

cbay commented Jun 8, 2012

Oliver: indeed, I've just fixed it, thanks.

@anonymous-matomo-user

I've been fiddling with this tool and it looks really nice. The biggest issue I've found is with --add-sites-new-hosts:
it's quite difficult in my case (using a control panel) to add the required %v:%p fields to the custom log format.
What I do have is one log per domain, so being able to specify the hostname manually would do the trick for me.

In the current situation launching this:

python /var/www/piwik/misc/log-analytics/import_logs.py \
  --url=https://server.example.com/tools/piwik --recorders=4 --enable-http-errors \
  --enable-http-redirects --enable-static --enable-bots --add-sites-new-hosts \
  /var/log/apache2/example.com-combined.log

Just produces this:

Fatal error: the selected log format doesn't include the hostname: 
  you must specify the Piwik site ID with the --idsite argument

Having a --hostname example.com option (the same as the filename in my case) that fixes the hostname, similar to --idsite-fallback=, would solve my issue.

@oliverhumpage

I'm not a piwik dev, but what I think you're trying to do is:

For every logfile, get its filename (which is also the hostname), check if a site with that hostname exists in piwik: if it does exist, import the logfile to it; if it doesn't exist, create it, then import the logfile to it.

The way I'd do this is to write an import script which:

  • looks at all logfile names
  • accesses the piwik API to get a list of all existing websites (URLs and IDs)
  • for any logfile which doesn't appear in the list, uses another API call to create the site and get the ID of the newly created site
  • imports all the logfiles with --idsite set to the right value for the logfile

http://piwik.org/docs/analytics-api/reference/ gives the various API calls, looks like SitesManager.getAllSites and SitesManager.addSite will do the job (e.g. call http://your.piwik.install/?module=API&method=SitesManager.getAllSites&format=xml&token_auth=xxxxxxxxxx to get all current sites, etc).
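
A rough sketch of such a script, using the two API methods named above. Several details are assumptions rather than documented behaviour: that logfiles are named `<hostname>-combined.log`, that getAllSites returns either a list or a dict of sites with `idsite` and `main_url` fields, and that addSite returns the new ID either bare or wrapped as `{"value": ...}`. The helper names are hypothetical.

```python
import json
import os
import urllib.parse
import urllib.request


def call_piwik(base_url, token_auth, method, **params):
    """Call the Piwik HTTP API and return the decoded JSON response."""
    query = urllib.parse.urlencode(
        dict(module="API", method=method, format="json",
             token_auth=token_auth, **params))
    with urllib.request.urlopen(base_url + "?" + query) as response:
        return json.load(response)


def site_ids_for_logs(logfiles, api, suffix="-combined.log"):
    """Map each logfile to a site ID, creating missing sites on the way.

    `api` is any callable taking (method, **params), for example
    functools.partial(call_piwik, "http://your.piwik.install/", "xxxxxxxxxx").
    """
    sites = api("SitesManager.getAllSites")
    if isinstance(sites, dict):  # assumption: may be keyed by site ID
        sites = list(sites.values())
    known = {urllib.parse.urlparse(s["main_url"]).hostname: int(s["idsite"])
             for s in sites}
    result = {}
    for path in logfiles:
        name = os.path.basename(path)
        host = name[:-len(suffix)] if name.endswith(suffix) else name
        if host not in known:
            created = api("SitesManager.addSite",
                          siteName=host, urls="http://" + host)
            known[host] = int(created["value"]
                              if isinstance(created, dict) else created)
        result[path] = known[host]
    return result
```

Each logfile can then be imported with --idsite set to the ID returned for it.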

HTH (a real piwik person might have a better idea)

Oliver.

@anonymous-matomo-user

Thanks for your answer, Oliver. Your process is perfectly fine, but I'd rather avoid having to code something when a small extension to the functionality of --add-sites-new-hosts would do.
And thanks for the links too, I'll have a look.

@anonymous-matomo-user

Attachment: Document the debian vhost_combined format
vhost_combined.patch

@anonymous-matomo-user

It would be nice to document the standard format provided (at the moment only debian/ubuntu) that would give piwik the hostname that is required.

The format is this:

LogFormat "%v:%p %h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" vhost_combined

You can see the latest version from debian's apache2.conf [http://anonscm.debian.org/gitweb/?p=pkg-apache/apache2.git;a=blob;f=debian/config-dir/apache2.conf;h=50545671cbaeb1f170d5f3f1acd20ad3978f36ea;hb=HEAD]

See attached a small change to the README file.

@anonymous-matomo-user

Attachment: Force hostname patch
force_hostname.patch

@anonymous-matomo-user

After looking at the code, I created a patch that adds a new option called --force-hostname, which expects a string with the hostname.
When it is set, the value of host will ALWAYS be the one given by --force-hostname.
This makes it possible to handle logfiles in the ncsa_extended or common formats as if they were complete formats (creating site IDs when needed, and so on).

@cbay
Contributor

cbay commented Jun 13, 2012

(In [6474]) Refs #3163 Added the --log-hostname option, thanks to aseques.

@cbay
Contributor

cbay commented Jun 13, 2012

(In [6475]) Refs #3163 Added a reference in README to the Debian/Ubuntu default vhost_combined, thanks aseques.

@cbay
Contributor

cbay commented Jun 13, 2012

Thanks aseques, both your feature request and your patch were fine, I've just committed it. Attention: I renamed the option to --log-hostname to keep coherence with the --log prefix.

@anonymous-matomo-user

Great, this will be so useful for me :)

@anonymous-matomo-user

Hi,

I'm not sure whether this is the right place for a ticket or a problem.
I have a problem importing access logs from my shared webspace. I copied the text from here: http://forum.piwik.org/read.php?2,90313


Hi,

I'm on a shared webspace with SSH support. I tried your import script to analyse my Apache logs.
I got it to work, but sometimes there are "Fatal errors" and I have no idea why. If I restart it without "--skip", it fails at the same skip line every time.

Example:

4349 lines parsed, 85 lines recorded, 0 records/sec

4349 lines parsed, 85 lines recorded, 0 records/sec

4349 lines parsed, 85 lines recorded, 0 records/sec

4349 lines parsed, 85 lines recorded, 0 records/sec

Fatal error: Forbidden

You can restart the import of "/home/log/access_log_piwik" from the point it failed by specifying --skip=326 on the command line.


I tried to figure out on which line the script ends with that fatal error, but I can't. If I restart it with "--skip=327", it runs to the end and everything works fine. The same problem occurs with some other access logs ("access_log_1.gz" and so on), but I'm not sure why it stops. Is it a malformed line in the access log? Which line should I check?

Regards

@cbay
Contributor

cbay commented Jun 18, 2012

Hexxer: you're getting an HTTP Forbidden from your Piwik install when importing the logs, you need to find out why.

@anonymous-matomo-user

How do you know that?
It stops at the same line every time, and if I skip that line it runs for 10 or 15 minutes without a problem (up to this line it needs 2 minutes or so).

Regards

@mattab
Member Author

mattab commented Jun 18, 2012

Do you know the exact line that causes a problem? if you put only this line, does it also fail directly? thanks!

@mattab
Member Author

mattab commented Jun 19, 2012

Benaka is implementing bulk tracking in ticket #3134. The Python script will simply have to send a JSON object:

{"requests":[url1,url2,url3],"token_auth":"xyz"}

I suppose we can do some basic test to see which value works best?
Maybe 50 or 100 tracking requests at once? :)
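
A sketch of how the importer could batch its tracking requests, to make it easy to try different batch sizes. It assumes the payload is a JSON object with `requests` and `token_auth` keys; the function name `bulk_payloads` and the sample tracking URLs are illustrative only.

```python
import json


def bulk_payloads(tracking_requests, token_auth, batch_size=100):
    """Yield one JSON payload per batch of tracking requests."""
    for start in range(0, len(tracking_requests), batch_size):
        batch = tracking_requests[start:start + batch_size]
        yield json.dumps({"requests": batch, "token_auth": token_auth})


# 250 made-up tracking requests, split into batches of 100.
requests_to_send = [
    "?idsite=1&rec=1&url=http://example.com/page%d" % i for i in range(250)
]
payloads = list(bulk_payloads(requests_to_send, "xyz", batch_size=100))
print(len(payloads))  # 3
```

Timing the import with batch_size set to 50, 100 and a few larger values would show which one works best on a given server.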

@anonymous-matomo-user

Hi,

.............
Do you know the exact line that causes a problem? if you put only this line, does it also fail directly? thanks!
.............

No, that's my problem. It stops (see above) with the hint to restart with "--skip=326", but I don't know what that means. Line 326 in the access log looks like all the others.

Replying to matt:

I suppose we can do some basic test to see which value works best?
Maybe 50 or 100 tracking requests at once? :)

Do you mean me? I can't test during the day because I'm sitting behind a proxy at work. I can do something in the evening, but, sorry, I have a 5-month-old young lady who needs my love and attention :-)

@oliverhumpage

Attachment:
README.apache_log_recorders.patch

@oliverhumpage

Could I submit a request for an alteration to the README? I've just had a massive spike in traffic, and --recorders=1 just doesn't cut it when piping directly from apache's customlog :) Because each apache process hangs around waiting to log its request before moving onto the next request, it started jamming the server.

Setting a higher --recorders seems to have eased it, and there are no side effects that I can see so far.

Suggested patch attached to this ticket.

@anonymous-matomo-user

Hi,

Is there a doc about the regex format for import_logs.py ?

We would like to import a file with awstat logFormat :

%time2 %other %cluster %other %method %url %query %other %logname %host %other %ua %referer %virtualname %code %other %other %bytesd %other %other
Thanks for your help,

Ludovic

@anonymous-matomo-user

I am trying to set up a daily import of the previous day's log. My issue is that my host date-stamps the log file. How can I set it up to import the log file with yesterday's date on it?

Here is the format of my log files
access.log.%Y-%m-%d.log
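
One way to do this is a small wrapper that works out yesterday's file name and hands it to import_logs.py. A minimal sketch of the file name part, where the function name `yesterdays_log` is hypothetical:

```python
from datetime import date, timedelta


def yesterdays_log(today=None, pattern="access.log.%Y-%m-%d.log"):
    """Return the name of the log file stamped with yesterday's date."""
    today = today or date.today()
    return (today - timedelta(days=1)).strftime(pattern)


print(yesterdays_log(date(2014, 4, 17)))  # access.log.2014-04-16.log
```

A daily cron job could then pass the result as the logfile argument to import_logs.py, for example through `subprocess.call`.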

@anonymous-matomo-user

Thanks a lot for all your great work!
The server log file analytics works great on my server.

I am using a lighttpd server and added the Accept-Language header to accesslog.format:

accesslog.format = "%h %V %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Accept-Language}i\""
(see http://redmine.lighttpd.net/projects/lighttpd/wiki/Docs:ModAccessLog)

I wonder if it would be possible to add support for the Accept-Language header to import_logs.py?
So that the country could then be guessed from the Accept-Language header when GeoIP isn't installed.
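
As an illustration of what the importer would have to do with that extra field, here is a sketch that picks the preferred language from an Accept-Language value and derives a rough country hint from its region subtag. Both function names are hypothetical, and the country guess is only a heuristic, far less reliable than GeoIP.

```python
def preferred_language(header):
    """Return the language tag with the highest q-value, or None."""
    best_tag, best_q = None, 0.0
    for part in header.split(","):
        piece = part.strip()
        if not piece or piece == "-":  # "-" is what logs show for a missing header
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0  # a tag without an explicit q-value counts as 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                continue  # skip malformed entries
        if quality > best_q:
            best_tag, best_q = tag.strip().lower(), quality
    return best_tag


def country_hint(tag):
    """Return the two-letter region of a tag such as 'de-de', else None."""
    if not tag or "-" not in tag:
        return None
    region = tag.split("-")[-1]
    return region.upper() if len(region) == 2 and region.isalpha() else None


lang = preferred_language("de-DE,de;q=0.8,en-US;q=0.6,en;q=0.4")
print(lang, country_hint(lang))  # de-de DE
```

The language tag could be forwarded to Piwik in the &lang= parameter mentioned at the top of this ticket.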

@anonymous-matomo-user

Replying to Cyril:

(In [6474]) Refs #3163 Added the --log-hostname option, thanks to aseques.

Thanks for the ability to import logs, and thanks also for the --log-hostname patch.
I'm not sure whether it is caused by the patch or by using --recorders > 1, but on the first run with --add-sites-new-hosts, 13 sites were created for the same hostname.

@oliverhumpage

Jadeham,

Try setting --recorder-max-payload-size=1 . I remember having issues myself when testing with very small data sets (e.g. just 1 line).

@estemendoza

I have a similar problem to Jadeham.

I have configured nginx to log in JSON format and created the following script, which reads from access.log (in JSON format) and passes every line to import_logs.py on stdin:

import sh
from sh import tail

# Build the import_logs.py command once; the trailing "-" makes it read the log from stdin.
run = sh.Command("/usr/bin/python").bake(
    "/var/www/piwik/misc/log-analytics/import_logs.py",
    "--output=/home/XXX/piwik_live_importer/piwik.log",
    "--url=http://X.X.X.X:8081/piwik/",
    "--idsite=1",
    "--recorders=1",
    "--recorder-max-payload-size=1",
    "--enable-http-errors",
    "--enable-http-redirects",
    "--enable-static",
    "--enable-bots",
    "--log-format-name=nginx_json",
    "-",
)

# Follow the nginx log and start one importer process per new line.
for line in tail("-f", "/var/log/nginx/access_json.log", _iter=True):
    run(_in=line)

The problem I'm having is that every record seems to be saved, but when I go to the main panel, today's history is not shown. This is the output when saving every line:

Parsing log (stdin)...
Purging Piwik archives for dates: 2014-04-16
To re-process these reports with your new update data, execute the piwik/misc/cron/archive.php script, or see: http://piwik.org/setup-auto-archiving/ for more info.

Logs import summary
-------------------

    1 requests imported successfully
    2 requests were downloads
    0 requests ignored:
        0 invalid log lines
        0 requests done by bots, search engines, ...
        0 HTTP errors
        0 HTTP redirects
        0 requests to static resources (css, js, ...)
        0 requests did not match any known site
        0 requests did not match any requested hostname

Website import summary
----------------------

    1 requests imported to 1 sites
        1 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:



Performance summary
-------------------

    Total time: 0 seconds
    Requests imported per second: 44.04 requests per second

0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)

Besides that, when running archive.php, it's slower than parsing default nginx log format and a lot of lines are marked as invalid:

Logs import summary
-------------------

    94299 requests imported successfully
    145340 requests were downloads
    84140 requests ignored:
        84140 invalid log lines
        0 requests done by bots, search engines, ...
        0 HTTP errors
        0 HTTP redirects
        0 requests to static resources (css, js, ...)
        0 requests did not match any known site
        0 requests did not match any requested hostname

Website import summary
----------------------

    94299 requests imported to 1 sites
        1 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:



Performance summary
-------------------

    Total time: 1147 seconds
    Requests imported per second: 82.21 requests per second

Is there any way to find out why these records are not shown, and which records are being marked as invalid?

@estemendoza

OK, I figured out the reason for the invalid requests: the user_agent contained a strange character. So maybe the script should be aware of Unicode characters.
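
For anyone hunting for such lines, a quick check outside the importer can help. The sketch below assumes that a line is rejected when it is not valid JSON, which may not be the only reason import_logs.py marks a line as invalid; the function name `find_invalid_json_lines` is hypothetical.

```python
import json


def find_invalid_json_lines(lines):
    """Return (line_number, error) pairs for lines that are not valid JSON."""
    bad = []
    for number, line in enumerate(lines, start=1):
        try:
            json.loads(line)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError
            bad.append((number, str(exc)))
    return bad


sample = [
    '{"ip": "1.2.3.4", "user_agent": "Mozilla/5.0"}',
    '{"ip": "1.2.3.4", "user_agent": "odd \\x22 agent"}',  # \x22 is not a JSON escape
]
for number, error in find_invalid_json_lines(sample):
    print("line %d: %s" % (number, error))
```

Running it over the file, for example `find_invalid_json_lines(open("/var/log/nginx/access_json.log"))`, lists the line numbers worth attaching to a bug report.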

@mattab
Member Author

mattab commented Apr 17, 2014

To see the data in the dashboard, execute the piwik/misc/cron/archive.php script, or see: http://piwik.org/setup-auto-archiving/ for more info.

OK, I figured out the reason for the invalid requests: the user_agent contained a strange character. So maybe the script should be aware of Unicode characters.

Sure, please create a new ticket for this bug and attach a log file with 1 line that showcases the bug. Thanks

@anonymous-matomo-user

Replying to Hexxer:

Hi,

.............
Do you know the exact line that causes a problem? if you put only this line, does it also fail directly? thanks!
.............

No, that's my problem. It stops (see above) with the hint to restart with "--skip=326", but I don't know what that means. Line 326 in the access log looks like all the others.

Replying to matt:

I suppose we can do some basic test to see which value works best?
Maybe 50 or 100 tracking requests at once? :)

Do you mean me? I can't test during the day because I'm sitting behind a proxy at work. I can do something in the evening, but, sorry, I have a 5-month-old young lady who needs my love and attention :-)

Wow. 23 months have passed, and still no solution to this problem???

I'm getting the same error, and there's no docco anywhere to tell me how to fix it:

The url is correct (I copy and paste it into my browser, and it gives me the Piwik login screen), and the apache error logs show nothing from today. Here's my console output:

$./import_logs.py --url=https://www.mysite.com/pathto/piwik/ /var/log/apache/access.log --debug
2014-04-28 00:10:29,205: [DEBUG] Accepted hostnames: all
2014-04-28 00:10:29,205: [DEBUG] Piwik URL is: http://www.mysite.com/piwik/
2014-04-28 00:10:29,205: [DEBUG] No token-auth specified
2014-04-28 00:10:29,205: [DEBUG] No credentials specified, reading them from ".../config/config.ini.php"
2014-04-28 00:10:29,347: [DEBUG] Authentication token token_auth is: REDACTED
2014-04-28 00:10:29,347: [DEBUG] Resolver: dynamic
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2014-04-28 00:10:29,349: [DEBUG] Launched recorder
Parsing log [...]/log/apache/access.log...
2014-04-28 00:10:29,350: [DEBUG] Detecting the log format
2014-04-28 00:10:29,350: [DEBUG] Check format icecast2
2014-04-28 00:10:29,350: [DEBUG] Format icecast2 does not match
2014-04-28 00:10:29,350: [DEBUG] Check format iis
2014-04-28 00:10:29,350: [DEBUG] Format iis does not match
2014-04-28 00:10:29,351: [DEBUG] Check format common
2014-04-28 00:10:29,351: [DEBUG] Format common does not match
2014-04-28 00:10:29,351: [DEBUG] Check format common_vhost
2014-04-28 00:10:29,351: [DEBUG] Format common_vhost matches
2014-04-28 00:10:29,351: [DEBUG] Check format nginx_json
2014-04-28 00:10:29,351: [DEBUG] Format nginx_json does not match
2014-04-28 00:10:29,351: [DEBUG] Check format s3
2014-04-28 00:10:29,352: [DEBUG] Format s3 does not match
2014-04-28 00:10:29,352: [DEBUG] Check format ncsa_extended
2014-04-28 00:10:29,352: [DEBUG] Format ncsa_extended does not match
2014-04-28 00:10:29,352: [DEBUG] Check format common_complete
2014-04-28 00:10:29,352: [DEBUG] Format common_complete does not match
2014-04-28 00:10:29,352: [DEBUG] Format common_vhost is the best match
2014-04-28 00:10:29,424: [DEBUG] Site ID for hostname www.mysite.com not in cache
2014-04-28 00:10:29,563: [DEBUG] Error when connecting to Piwik: HTTP Error 403: Forbidden
2504 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2504 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2014-04-28 00:10:31,612: [DEBUG] Error when connecting to Piwik: HTTP Error 403: Forbidden
2504 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2504 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2014-04-28 00:10:33,657: [DEBUG] Error when connecting to Piwik: HTTP Error 403: Forbidden
Fatal error: Forbidden
You can restart the import of "[...]/var/log/apache/access.log" from the point it failed by specifying --skip=5 on the command line.

And of course, trying with --skip=5 produces the same error.

I have googled, I have searched the archives, the bug tracker contains no clue. Would really appreciate some kind soul taking mercy on me here.

@mattab
Member Author

mattab commented Apr 28, 2014

Piwik: HTTP Error 403: Forbidden

Please check your webserver error logs, there should be an error 403 logged in there that will maybe tell you why the Piwik API is failing to return data (maybe a server misconfiguration?).
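
When the server logs are not accessible, repeating an API request outside import_logs.py can at least show the status code and the start of the response body. This is only a diagnostic sketch: the helper names are hypothetical, SitesManager.getAllSites is used simply because it is mentioned earlier in this ticket, and the URL and token are placeholders.

```python
import urllib.error
import urllib.parse
import urllib.request


def api_url(piwik_url, method, token_auth):
    """Build an API URL for the given Piwik install."""
    query = urllib.parse.urlencode({
        "module": "API",
        "method": method,
        "format": "json",
        "token_auth": token_auth,
    })
    return piwik_url.rstrip("/") + "/index.php?" + query


def probe(url):
    """Return (status, first 500 characters of the body), also for HTTP errors."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.status, response.read(500).decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        return err.code, err.read(500).decode("utf-8", "replace")


url = api_url("https://www.mysite.com/pathto/piwik/",
              "SitesManager.getAllSites", "REDACTED")
print(url)
# status, body = probe(url)  # uncomment to send the request
```

If this request also returns 403, the body usually says whether the web server or Piwik itself refused it.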

@anonymous-matomo-user

Replying to matt:

Piwik: HTTP Error 403: Forbidden

Please check your webserver error logs, there should be an error 403 logged in there that will maybe tell you why the Piwik API is failing to return data (maybe a server misconfiguration?).

Apache error log shows only a restart once every hour. I am unable to configure Apache directly, as I am running Piwik on Gandi.net's "Simple Hosting" service. I have repeatedly begged gandi support to look into this matter, but their attitude is (and not unreasonably) that their job is not to support user installation issues like this. If you can give me ammunition that shows it really is Gandi's fault, then maybe we can move forward here.

Or maybe it's just a Piwik bug. Or I'm doing something wrong. I don't know.

f

@mattab
Member Author

mattab commented Apr 30, 2014

@foobard I suggest you create a new ticket for your particular issue, and we will try to help you troubleshoot it (maybe we need to get access to the server to reproduce and investigate). Cheers!

@mattab
Member Author

mattab commented May 28, 2014

Please do not comment on this ticket anymore. Instead, create a new ticket and assign it to the component "Log Analytics (import_logs.py)".

Here is the list of all tickets related to Log Analytics improvements: http://dev.piwik.org/trac/query?status=!closed&component=Log+Analytics+(import_logs.py)

@mattab
Member Author

mattab commented Mar 12, 2015

Issue was moved to the new repository for Piwik Log Analytics: https://github.com/piwik/piwik-log-analytics/issues

refs #7163

@mattab mattab closed this as completed Mar 12, 2015
@mattab mattab added the duplicate For issues that already existed in our issue tracker and were reported previously. label Mar 12, 2015
@matomo-org matomo-org locked and limited conversation to collaborators Jul 12, 2016