Ticket #653 (closed New feature: fixed)

Opened 3 years ago

Last modified 19 months ago

The <noscript> image call doesn't currently record any visit, but it could

Reported by: matt Owned by:
Priority: critical Milestone: Piwik 0.7 - DigitalVibes
Component: Core Keywords: bots noscript
Cc: Sensitive: no

Description (last modified by matt)

Currently the Piwik tracking code includes a <noscript> tag, which could be used to record visits from people without Javascript enabled.

There is some work required to

  • filter out search engine bots
  • filter out spam bots
  • filter out all other types of bots

Of course this could also be used to log bots and show them in a specific Piwik report "Bot activity".

The initial design decision was to not record any visitor without Javascript, as it is a lot of work to ensure that the data coming from Javascript-disabled devices is accurate and not bot-initiated.

To record a visit without JS, you must call

piwik.php?idsite=$ID_SITE
          &rec=1
          &action_name=$ACTION_NAME
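
For illustration, the request above could be assembled like this. It is only a minimal sketch in Python: the helper name build_tracking_url and the base URL http://example.org/piwik/piwik.php are hypothetical, and only the three query parameters (idsite, rec, action_name) come from this description.

```python
from urllib.parse import urlencode


def build_tracking_url(base_url, id_site, action_name):
    """Return a piwik.php request URL that records one page view.

    base_url is assumed to point at the piwik.php of an installation.
    """
    params = {
        "idsite": id_site,           # numeric ID of the tracked website
        "rec": 1,                    # rec=1 asks the tracker to record the visit
        "action_name": action_name,  # page title shown in the reports
    }
    # urlencode escapes spaces and special characters in the page title
    return f"{base_url}?{urlencode(params)}"


url = build_tracking_url("http://example.org/piwik/piwik.php", 1, "Home page")
print(url)
# http://example.org/piwik/piwik.php?idsite=1&rec=1&action_name=Home+page
```

The resulting URL is what a <noscript> image tag or a server-side request would point at.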

See also PUSH API without Javascript #134

Change History

  Changed 3 years ago by matt

  • type changed from Bug to New feature

  Changed 3 years ago by albass

  Changed 2 years ago by joux

To me this is a major issue, as non-Javascript users still give us valuable information. I would love to see this implemented before 1.0.

Suggestion 1

Couldn't the http:BL service by Project Honey Pot be used to filter out bots? It offers an API to identify search engines, spammers and other bots by IP address.

Piwik could work like this:

  • Javascript enabled:
    • count users as usual
  • Javascript disabled:
    • Discard all users that are known search engine bots (by User-Agent)
    • Check IP of all remaining users against http:BL, discard if known bot, count otherwise.

This way traffic for the blacklist server would be kept low. I still think every Piwik installation would need their own API key, though.
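
The decision flow of Suggestion 1 could be sketched roughly as follows. This is not Piwik code: the function names, the short User-Agent token list and the is_blacklisted callable (a stand-in for a real http:BL lookup, which is not implemented here) are all assumptions made for illustration.

```python
# Illustrative User-Agent tokens only; a real list would be much longer.
KNOWN_SEARCH_BOT_TOKENS = ("googlebot", "msnbot", "slurp")


def is_known_search_bot(user_agent):
    """Return True if the User-Agent contains a known search bot token."""
    ua = user_agent.lower()
    return any(token in ua for token in KNOWN_SEARCH_BOT_TOKENS)


def should_count_visit(js_enabled, user_agent, ip, is_blacklisted):
    """Decide whether a request is counted as a visit.

    is_blacklisted is a hypothetical callable(ip) -> bool standing in for
    a query against a blacklist service such as http:BL.
    """
    if js_enabled:
        return True  # Javascript visitors are counted as usual
    if is_known_search_bot(user_agent):
        return False  # discarded locally, no blacklist lookup needed
    # Only the remaining requests reach the blacklist, keeping traffic low
    return not is_blacklisted(ip)
```

Because the User-Agent check runs first, the blacklist is only queried for Javascript-disabled requests that do not identify themselves as search bots.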

Suggestion 2

Piwik should include its own tiny honeypot. The <noscript> tag should include a link that is invisible to the user and that has rel=nofollow.

<a href="http://domain/piwik/honeypot.php" rel="nofollow">&nbsp;</a>

Only malicious crawlers will follow this link, so Piwik can exclude their IPs from tracking. Known, well-behaving search bots can still be identified by User-Agent. This way, most bots will probably get identified.
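
On the server side, the honeypot idea might look something like the sketch below. Everything in it is hypothetical (the Honeypot class, its method names, the in-memory set); it only illustrates "remember the IPs that requested the honeypot URL and skip them when tracking".

```python
class Honeypot:
    """Remembers IPs that requested the honeypot URL (in-memory sketch)."""

    def __init__(self):
        self.trapped_ips = set()

    def record_hit(self, ip):
        """Called when the honeypot link is requested."""
        self.trapped_ips.add(ip)

    def is_trapped(self, ip):
        """Return True if this IP followed the honeypot link earlier."""
        return ip in self.trapped_ips


def should_track(ip, honeypot):
    """Track a request only if its IP never followed the honeypot link."""
    return not honeypot.is_trapped(ip)


honeypot = Honeypot()
honeypot.record_hit("203.0.113.9")  # a crawler ignored rel=nofollow
```

A real implementation would need persistent storage and some expiry for trapped IPs, since addresses get reassigned.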

follow-up: ↓ 5   Changed 2 years ago by matt

  • sensitive unset
  • description modified

in reply to: ↑ 4   Changed 2 years ago by lioman

Replying to matt: I want this feature too. Users are not the only interesting visitors; I also want to see which bots crawl my site.

  Changed 2 years ago by philmck

Can I add my vote for this as well? We're missing out on visits from many mobile phone users and disabled people using screen readers, for example, because they don't have javascript. And there are legitimate reasons for disabling javascript in a normal browser as well. I agree we need to separate out the bots somehow for the statistics, but really that's a separate issue. I'd like the option of counting all visitors, even if that includes bots.

  Changed 2 years ago by Charles Belov

The code

<a href="http://domain/piwik/honeypot.php" rel="nofollow">&nbsp;</a>

will be visible to blind persons using screen-reader software. It would be better to code this as

<a href="http://domain/piwik/honeypot.php" rel="nofollow" style="display:none;">&nbsp;</a>

which will also hide it from the screen readers.

Hope this helps, Charles Belov SFMTA Webmaster www.sfmta.com/webmaster

  Changed 2 years ago by vipsoft

Charles: that's not our tracking code. Piwik's tracking code doesn't contain an anchor link (honeypot or otherwise).

  Changed 2 years ago by vipsoft

re: comment:3 - The idea behind the noscript tag is to track Javascript-disabled visitors. We'll provide a hook here so third-party plugins can implement suggestion 2.

  Changed 23 months ago by matt

In order to report search engine bot activity, we could reuse some of the GPL code from http://www.crawltrack.net/, which is a PHP bot tracking tool. The logic could sit in a Piwik plugin. There could be a new sub tab that would report activity for each bot seen during the selected date range.

Bots would be identified by user agents and/or IPs, see e.g. the list at CrawlTrack: http://www.crawltrack.net/crawlerlist.php

Additional features could include:

  • give the ratio of bot vs. human activity on the website (what percentage of traffic comes from bots vs. humans)
  • for a given bot on a given day, list all pages crawled
  • list bot crawling frequency in a new column (next to Visits, Page views, etc.), e.g. Google might crawl one page every 10 s while other bots crawl one page every minute
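
To make the three metrics above concrete, here is a rough sketch of how they could be computed from a list of logged hits. The data layout (tuples of timestamp in seconds, bot name or None for humans, and URL) and all function names are assumptions for illustration, not an existing Piwik or CrawlTrack structure.

```python
from collections import defaultdict


def bot_share(hits):
    """Fraction of hits that came from bots (0.0 if there are no hits)."""
    if not hits:
        return 0.0
    bot_hits = sum(1 for _, bot, _ in hits if bot is not None)
    return bot_hits / len(hits)


def pages_crawled(hits, bot_name):
    """URLs requested by one bot, in the order they were logged."""
    return [url for _, bot, url in hits if bot == bot_name]


def mean_crawl_interval(hits):
    """Average seconds between consecutive hits, per bot.

    Bots with fewer than two hits are skipped (no interval to measure).
    """
    stamps_by_bot = defaultdict(list)
    for ts, bot, _ in hits:
        if bot is not None:
            stamps_by_bot[bot].append(ts)
    intervals = {}
    for bot, stamps in stamps_by_bot.items():
        if len(stamps) < 2:
            continue
        stamps.sort()
        intervals[bot] = (stamps[-1] - stamps[0]) / (len(stamps) - 1)
    return intervals


# Made-up sample data: three Googlebot hits 10 s apart and one human hit
hits = [
    (0, "Googlebot", "/a"),
    (5, None, "/a"),
    (10, "Googlebot", "/b"),
    (20, "Googlebot", "/c"),
]
```

With this sample, bot_share(hits) gives 0.75 and mean_crawl_interval(hits) gives 10 seconds for Googlebot.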

  Changed 23 months ago by matt

  • priority changed from normal to critical

  Changed 23 months ago by matt

  • description modified

  Changed 23 months ago by jr-ewing

I think it would also be interesting to track robots, e.g. for big sites. With this feature you could see how many bots are scraping your site. It makes sense to see Googlebot, Msnbot and maybe Slurp (Yahoo), but these should be tracked in a separate table by a dedicated plugin, like Live Bots ;-)

In my tool http://www.spider-trap.de/en_index.html I ban a lot of bad bots. Maybe Piwik could notify the webmaster when a bot is crawling.

  Changed 19 months ago by matt

  • status changed from new to closed
  • resolution set to fixed
  • milestone changed from Features requests - after Piwik 1.0 to 1 - Piwik 0.7 - DigitalVibes

The Tracking API has been released. It can help track visitors without Javascript, or even track visits from mobile apps, desktop apps and more.

http://piwik.org/docs/tracking-api/
