Opened 4 years ago

Closed 4 years ago

Last modified 3 years ago

#1694 closed Task (fixed)

Rethink: SearchEngines.php

Reported by: vipsoft
Owned by:
Priority: normal
Milestone: Piwik 1.1
Component: Core
Keywords:
Cc:
Sensitive: no

Description (last modified by vipsoft)

The current data file:

  • requires exact match on domains; as a result, we have numerous 'holes':
    • country code tlds (e.g., www.example.ca, www.example.us)
    • country subdomains (e.g., ca.example.com, us.example.com)
    • wildcard subdomains (e.g., *.example.com)

Proposal:

  • use Public Suffix List and/or regular expressions
  • backfill data from the master record at runtime to avoid unnecessary duplication
    • we already do this to some extent; in which case, we just need to prune some entries (e.g., Baidu)
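
A minimal sketch of the backfill idea, written in Python for illustration rather than in Piwik's PHP. The table layout (host mapped to engine name, keyword parameter, and path template), the sample entries, and the function name `backfill` are assumptions and not the contents of SearchEngines.php. The first entry seen for an engine name is treated as the master record, and later entries inherit whatever fields they leave out.

```python
# Hypothetical data layout: host -> (engine name, keyword parameter, path template).
# Only the first ("master") entry for a name has to carry every field.
ENGINES = {
    "www.google.com": ("Google", "q", "search?q={k}"),
    "www.google.ca": ("Google",),
    "www.baidu.com": ("Baidu", "wd", "s?wd={k}"),
    "www1.baidu.com": ("Baidu", "word"),
}


def backfill(engines):
    """Return a copy of the table with missing fields copied from each engine's master record."""
    masters = {}
    filled = {}
    for host, record in engines.items():
        name = record[0]
        # The first record seen for a name becomes its master record.
        master = masters.setdefault(name, record)
        # Keep the fields this record defines, then append the master's remaining fields.
        filled[host] = record + master[len(record):]
    return filled


if __name__ == "__main__":
    for host, record in backfill(ENGINES).items():
        print(host, record)
```

With this shape, a pruned entry only has to list the fields that differ from the master, which is what makes the duplicated entries removable.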

Affects:

  • Piwik_Common::extractSearchEngineInformationFromUrl()
  • plugins/Referers/functions.php

ToDo:

Change History (26)

comment:1 Changed 4 years ago by vipsoft (robocoder)

I see we call strtolower on the keywords. This may not be safe to do with the 'C' locale unless it happens to be UTF-8 aware.

comment:2 Changed 4 years ago by vipsoft (robocoder)

Task: review the iconv() code in extractSearchEngineInformationFromUrl(). The keywords from naver.com are showing up empty. The encoding in SearchEngines.php is specified as x-windows-949 (which I gather is a superset of the search page's charset, euc-kr).
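
To make the charset issue concrete, here is a small Python sketch (not the PHP iconv() code under review). It pulls a keyword out of a referrer URL and decodes it with the charset declared for the engine. The function name `extract_keyword`, the alias table, and the example host are assumptions; the only facts taken from this ticket are the x-windows-949 label and the euc-kr page charset. Python calls the Windows-949 codec `cp949`, and because it is a superset of euc-kr, bytes produced as euc-kr decode under it.

```python
from urllib.parse import parse_qs, quote, urlsplit

# Assumed mapping from charset labels used in a data file to Python codec names.
CODEC_ALIASES = {"x-windows-949": "cp949"}


def extract_keyword(url, param, charset="utf-8"):
    """Return the decoded value of `param` in the URL's query string, or None if absent."""
    codec = CODEC_ALIASES.get(charset.lower(), charset)
    query = urlsplit(url).query
    # parse_qs percent-decodes to bytes, then decodes them with `encoding`;
    # errors="strict" makes a bad byte sequence raise instead of being replaced silently.
    values = parse_qs(query, encoding=codec, errors="strict").get(param)
    return values[0] if values else None


if __name__ == "__main__":
    # Build a referrer whose keyword is percent-encoded as euc-kr (hypothetical host).
    url = "http://search.example.com/search?query=" + quote("서울", encoding="euc-kr")
    print(extract_keyword(url, "query", "x-windows-949"))
```

An empty keyword would be consistent with a conversion that fails and returns nothing, so the failure path is worth checking first.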

comment:3 Changed 4 years ago by vipsoft (robocoder)

(In [3136]) refs #1694 - detect powered by google custom search

comment:4 Changed 4 years ago by vipsoft (robocoder)

(In [3141]) refs #1694 - prune arrays (these will be backfilled from the master record)

Separate "Powered by Google" (i.e., uses Google exclusively for search) from "Enhanced by Google" (uses Google in addition to other search engines); the latter are treated as separately branded (meta) search engines.

comment:5 Changed 4 years ago by vipsoft (robocoder)

(In [3142]) refs #1694 - add unit tests for missing and obsolete search engine icons; adapted from halfdan's ProcessFavIcons.php in #1350
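
A rough Python sketch of what such a check can look like; it is not the unit test from [3142]. It assumes one icon file per search engine host, named `<host>.png`, which is a guess at the naming scheme. "Missing" means an engine without an icon and "obsolete" means an icon without an engine.

```python
def check_icons(engine_hosts, icon_files):
    """Compare engine hosts with icon file names; return (missing, obsolete) as sorted lists."""
    # Assumed naming scheme: one "<host>.png" per engine, compared case-insensitively.
    expected = {host.lower() + ".png" for host in engine_hosts}
    present = {name.lower() for name in icon_files}
    missing = sorted(expected - present)   # engines with no icon
    obsolete = sorted(present - expected)  # icons with no engine
    return missing, obsolete
```

A test would then assert that both lists are empty.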

comment:6 Changed 4 years ago by vipsoft (robocoder)

  • Milestone changed from Features requests 1.x or 2.x to 1.1 - Piwik 1.1

comment:7 Changed 4 years ago by vipsoft (robocoder)

(In [3144]) refs #1694 - add Piwik_Common::getLossyUrl($url) to reduce referrer URLs to a more basic form/pattern. I'll prune the Google and Yahoo entries later.
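
For illustration, a Python sketch of the kind of reduction meant here. It is not the PHP body of getLossyUrl(); the rules, the short country-code list, and the name `lossy_host` are assumptions. The only detail taken from this ticket is that `{}` stands in for the country part of the host.

```python
import re

# Illustrative subset; a real list would come from the country table.
COUNTRY_CODES = ("ca", "us", "de", "fr", "uk", "jp")
_CC = "|".join(COUNTRY_CODES)

_RULES = [
    # Drop a leading "www.", "www2.", or "search." label.
    (re.compile(r"^(w+[0-9]*|search)\."), ""),
    # Country-code TLD, optionally after a generic second level: ".co.uk", ".ca" -> ".{}"
    (re.compile(r"(\.(com|org|net|co))?\.(%s)(/|$)" % _CC), r".{}\4"),
    # Country-code subdomain: "ca.example.com" -> "{}.example.com"
    (re.compile(r"(^|\.)(%s)\." % _CC), r"\1{}."),
]


def lossy_host(host):
    """Reduce a host (optionally followed by a path) to a pattern with {} for the country part."""
    host = host.lower()
    for pattern, replacement in _RULES:
        host = pattern.sub(replacement, host)
    return host


if __name__ == "__main__":
    for sample in ("www.example.ca", "ca.example.com", "www.example.co.uk"):
        print(sample, "->", lossy_host(sample))
```

One pattern entry such as `example.{}` can then cover every country-code variant of a host, which is what allows the per-country entries to be pruned.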

comment:8 Changed 4 years ago by vipsoft (robocoder)

(In [3145]) refs #1694 - fix forestle.org and add unit test (i.e., {} can't appear in master record)

comment:9 Changed 4 years ago by vipsoft (robocoder)

(In [3146]) refs #1694 - update favicon names

comment:10 follow-up: Changed 4 years ago by vipsoft (robocoder)

(In [3149]) refs #1694 - applied lossy {} tld to 123people, google, lycos, and yahoo

comment:11 in reply to: ↑ 10 Changed 4 years ago by SteveG (sgiehl)

Replying to vipsoft:
Some Google entries are duplicates and can be removed (lines 381 and 382).

comment:13 Changed 4 years ago by vipsoft (robocoder)

(In [3151]) refs #1694 - lossy Bing images URL

comment:14 Changed 4 years ago by vipsoft (robocoder)

Note: when users view a cached page from Bing search results, the pageview is recorded against cc.bingj.com. I've suggested that they add the original web site's URL (URL-encoded, of course) to the link. That way we can parse it out (similar to webcache.googleusercontent.com).

comment:15 Changed 4 years ago by vipsoft (robocoder)

(In [3161]) refs #1694 - add bing cache

comment:16 Changed 4 years ago by vipsoft (robocoder)

  • Description modified (diff)

I'm thinking of adding a hook so plugins can implement their own search engine detection as there are requests for sites to be added that don't quite fit the traditional definition of a search engine.

comment:17 Changed 4 years ago by vipsoft (robocoder)

(In [3162]) refs #1694 - remove fix-up for webcache.googleusercontent.com; moving the logic to piwik.js

comment:18 Changed 4 years ago by vipsoft (robocoder)

(In [3163]) refs #739, refs #1694 - instead of a bing cache buster, detect when page is loaded from google or bing cache, and apply a fix to the url

comment:19 Changed 4 years ago by vipsoft (robocoder)

(In [3164]) refs #739, refs #1694 - fallback to the cache url if we can't parse it
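
The detect, fix, and fall-back behaviour from [3163] and [3164] can be sketched as follows, in Python rather than the JavaScript of piwik.js. The function name `cache_fixup` and especially the query parameter `u` are placeholders: this ticket does not record how either cache encodes the original address, so treat the extraction step as hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# Cache hosts named in this ticket. The query parameter that carries the original
# address is a placeholder assumption ("u"), not the documented format of either service.
CACHE_HOSTS = {
    "webcache.googleusercontent.com": "u",
    "cc.bingj.com": "u",
}


def cache_fixup(url):
    """Return the original page URL for a cached view, or `url` unchanged if it can't be recovered."""
    parts = urlsplit(url)
    param = CACHE_HOSTS.get((parts.hostname or "").lower())
    if param is None:
        return url  # not served from a known cache
    values = parse_qs(parts.query).get(param)
    if not values:
        return url  # fall back to the cache URL when parsing fails
    original = values[0]
    # Cached links may omit the scheme; assume http in that case.
    if "://" not in original:
        original = "http://" + original
    return original
```

Keeping this a pure function of the URL is also what makes it straightforward to unit test.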

comment:20 Changed 4 years ago by vipsoft (robocoder)

(In [3165]) refs #1694, refs #739 - make cacheFixup() testable

comment:21 Changed 4 years ago by vipsoft (robocoder)

Yahoo's Bing-powered search has an even weirder cache url.

comment:22 Changed 4 years ago by vipsoft (robocoder)

(In [3167]) refs #1694, refs #739 - Yahoo's cache result is served from Inktomi allocated ip addresses

comment:23 Changed 4 years ago by vipsoft (robocoder)

  • Resolution set to fixed
  • Status changed from new to closed

(In [3168]) fixes #1694 - misc fixes

  • if iconv fails, we use the original key
  • use mb_strtolower if available; apply this conversion after iconv
  • provide an icon when the referrer is www.google.com/cse
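
The first two fixes amount to an ordering rule: convert first, fall back to the original key if conversion fails, and lowercase only afterwards. A Python sketch of that order, standing in for the PHP iconv()/mb_strtolower() calls; the function name `normalize_keyword` and the choice to read the fallback bytes as UTF-8 are assumptions.

```python
def normalize_keyword(raw, charset):
    """Decode raw keyword bytes with `charset`, then lowercase the resulting text."""
    try:
        text = raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Conversion failed (unknown charset or invalid bytes): keep the original key,
        # read here as UTF-8 with undecodable bytes replaced so the result is still text.
        text = raw.decode("utf-8", errors="replace")
    # Lowercase only after conversion, on text rather than on encoded bytes.
    return text.lower()
```

Lowercasing after conversion means the case mapping sees whole characters and not individual bytes of a multibyte encoding.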

comment:24 Changed 3 years ago by matt (mattab)

Great work :) this will make maintenance a lot less tedious.

Is there a reason www.google.cat is still listed, or can it be removed?

comment:25 Changed 3 years ago by vipsoft (robocoder)

Technically, .cat isn't an ISO country code. But since I've already added the MaxMind codes to Countries.php, I guess it won't hurt to add this one too.

comment:26 Changed 3 years ago by vipsoft (robocoder)

(In [3319]) refs #1694 - treat .cat as a pseudo country tld
