Rethink: SearchEngines.php #1694

Closed
robocoder opened this issue Sep 11, 2010 · 25 comments
Labels
Task Indicates an issue is neither a feature nor a bug and it's purely a "technical" change.
Comments

@robocoder
Contributor

The current data file:

  • requires exact match on domains; as a result, we have numerous 'holes':
    • country code tlds (e.g., www.example.ca, www.example.us)
    • country subdomains (e.g., ca.example.com, us.example.com)
    • wildcard subdomains (e.g., *.example.com)

Proposal:

  • use Public Suffix List and/or regular expressions
  • backfill data from the master record at runtime to avoid unnecessary duplication
    • we already do this to some extent; in which case, we just need to prune some entries (e.g., Baidu)
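
To illustrate the backfill idea, here is a minimal sketch in Python (not Piwik's PHP). The table contents, the field order (name, keyword parameter, URL template), and the function name `backfill` are assumptions made for the example; the first record seen for a given engine name is treated as the master record.

```python
# Illustrative data only: this mirrors the general shape of a
# host -> record table, not the real contents of SearchEngines.php.
ENGINES = {
    "www.google.com": ["Google", "q", "search?q={k}"],  # master record
    "www.google.ca": ["Google"],                        # pruned entry
    "www.baidu.com": ["Baidu", "wd", "s?wd={k}"],       # master record
    "www1.baidu.com": ["Baidu"],                        # pruned entry
}


def backfill(engines):
    """Return a copy in which pruned records inherit missing fields.

    The first record encountered for an engine name acts as its master;
    later records take any trailing fields they lack from that master.
    """
    masters = {}
    filled = {}
    for host, record in engines.items():
        master = masters.setdefault(record[0], record)
        filled[host] = record + master[len(record):]
    return filled


filled = backfill(ENGINES)
print(filled["www.google.ca"])
```

With this approach, only the master record needs the keyword parameter and URL template; every other host for the same engine can be reduced to its name.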

Affects:

  • Piwik_Common::extractSearchEngineInformationFromUrl()
  • plugins/Referers/functions.php

ToDo:

  • strtolower in comment:1
  • iconv in comment:2
@robocoder
Contributor Author

I see we call strtolower on the keywords. This may not be safe to do with the 'C' locale unless it happens to be UTF-8 aware.
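
One way the problem can show up is sketched below in Python (an illustration, not Piwik code). It assumes a single-byte Latin-1 locale, which is only one possible configuration, and the helper names `latin1_lower` and `is_valid_utf8` are made up for the example. Lowercasing byte by byte under that assumption rewrites the lead byte of a UTF-8 sequence and leaves invalid UTF-8 behind. A lowercasing routine that only touches ASCII would avoid the corruption but would leave non-ASCII uppercase letters unchanged.

```python
def latin1_lower(raw: bytes) -> bytes:
    """Lowercase bytes the way a single-byte Latin-1 locale would."""
    return raw.decode("latin-1").lower().encode("latin-1")


def is_valid_utf8(raw: bytes) -> bool:
    """Report whether the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


keyword = "ÉCOLE".encode("utf-8")  # b'\xc3\x89COLE'

# Byte-wise lowercasing maps 0xC3 to 0xE3 and breaks the sequence.
print(is_valid_utf8(latin1_lower(keyword)))

# Character-aware lowercasing keeps the keyword intact.
print(keyword.decode("utf-8").lower())
```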

@robocoder
Contributor Author

Task: review the iconv() code in extractSearchEngineInformationFromUrl(). The keywords from naver.com are showing up empty. The encoding in SearchEngines.php is specified as x-windows-949 (which I gather is a superset of the search page's charset, euc-kr).

@robocoder
Contributor Author

(In [3136]) refs #1694 - detect powered by google custom search

@robocoder
Contributor Author

(In [3141]) refs #1694 - prune arrays (these will be backfilled from the master record)

Separate "Powered by Google" (i.e., uses Google exclusively for search) from "Enhanced by Google" (uses Google in addition to other search engines); the latter are treated as separately branded (meta) search engines.

@robocoder
Contributor Author

(In [3142]) refs #1694 - add unit tests for missing and obsolete search engine icons; adapted from halfdan's ProcessFavIcons.php in #1350

@robocoder
Contributor Author

(In [3144]) refs #1694 - add Piwik_Common::getLossyUrl($url) to reduce referrer URLs to a more basic form/pattern. I'll prune the Google and Yahoo entries later.
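
The general idea of reducing a referrer host to a pattern can be sketched as follows. This is a simplified Python illustration, not the code committed in [3144]: the function name `lossy_host`, the rule set, and the six-entry country list are assumptions. It uses `{}` as the placeholder for a country code, matching the "lossy {} tld" wording used elsewhere in this issue.

```python
import re

# Tiny illustrative subset; a real list would cover all country codes.
COUNTRIES = "ca|us|de|fr|uk|jp"

_RULES = [
    # Drop a leading www., www2., or search. label.
    (re.compile(r"^(w+[0-9]*|search)\."), ""),
    # Country code TLD, optionally preceded by .com/.co/.org/.net.
    (re.compile(r"(\.(com|co|org|net))?\.(%s)$" % COUNTRIES), ".{}"),
    # Country code used as a subdomain label.
    (re.compile(r"(^|\.)(%s)\." % COUNTRIES), r"\1{}."),
]


def lossy_host(host: str) -> str:
    """Reduce a hostname to a coarser pattern with {} for the country."""
    host = host.lower()
    for pattern, replacement in _RULES:
        host = pattern.sub(replacement, host)
    return host


for example in ("www.example.ca", "www.example.co.uk", "ca.example.com"):
    print(example, "->", lossy_host(example))
```

A single table entry keyed on `example.{}` or `{}.example.com` can then stand in for every country variant, which is what closes the "holes" listed in the issue description.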

@robocoder
Contributor Author

(In [3145]) refs #1694 - fix forestle.org and add unit test (i.e., {} can't appear in master record)

@robocoder
Contributor Author

(In [3146]) refs #1694 - update favicon names

@robocoder
Contributor Author

(In [3149]) refs #1694 - applied lossy {} tld to 123people, google, lycos, and yahoo

@sgiehl
Member

sgiehl commented Sep 13, 2010

Replying to vipsoft:
Some Google entries are duplicates and can be removed (lines 381 and 382).

@robocoder
Contributor Author

(In [3150]) refs #1694

@robocoder
Contributor Author

(In [3151]) refs #1694 - lossy Bing images URL

@robocoder
Contributor Author

Note: when a user views a cached page from Bing search results, the pageview is recorded on cc.bingj.com. I've suggested that they add the original web site's URL (URL-encoded, of course) to the link. That way we can parse it out (similar to webcache.googleusercontent.com).

@robocoder
Contributor Author

(In [3161]) refs #1694 - add bing cache

@robocoder
Contributor Author

I'm thinking of adding a hook so plugins can implement their own search engine detection as there are requests for sites to be added that don't quite fit the traditional definition of a search engine.

@robocoder
Contributor Author

(In [3162]) refs #1694 - remove fix-up for webcache.googleusercontent.com; moving the logic to piwik.js

@robocoder
Contributor Author

(In [3163]) refs #739, refs #1694 - instead of a bing cache buster, detect when page is loaded from google or bing cache, and apply a fix to the url

@robocoder
Contributor Author

(In [3164]) refs #739, refs #1694 - fallback to the cache url if we can't parse it

@robocoder
Contributor Author

(In [3165]) refs #1694, refs #739 - make cacheFixup() testable

@robocoder
Contributor Author

Yahoo's Bing-powered search has an even weirder cache url.

@robocoder
Contributor Author

(In [3167]) refs #1694, refs #739 - Yahoo's cache result is served from Inktomi allocated ip addresses

@robocoder
Contributor Author

(In [3168]) fixes #1694 - misc fixes

  • if iconv fails, we use the original key
  • use mb_strtolower if available; apply this conversion after iconv
  • provide an icon for www.google.com/cse
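
The first two fixes (fall back when the charset conversion fails, and lowercase only after converting) can be sketched like this. It is a Python illustration under stated assumptions, not the PHP committed in [3168]: the function name `decode_keyword` and its signature are hypothetical, and Python's `cp949` codec stands in for the x-windows-949 charset mentioned earlier in this issue.

```python
from urllib.parse import quote, unquote, unquote_to_bytes


def decode_keyword(raw_value: str, charset: str = "utf-8") -> str:
    """Convert a percent-encoded keyword to text, then lowercase it.

    If the conversion fails (bad bytes or unknown charset), fall back
    to the unconverted keyword instead of returning an empty string.
    """
    raw = unquote_to_bytes(raw_value)
    try:
        text = raw.decode(charset)
    except (UnicodeDecodeError, LookupError):
        text = unquote(raw_value)  # keep the original keyword
    # Lowercase after conversion so it operates on characters, not bytes.
    return text.lower()


# Build a query value the way a cp949-encoded search page might send it.
encoded = quote("검색 Piwik".encode("cp949"))
print(decode_keyword(encoded, "cp949"))

# An unknown charset name falls back rather than dropping the keyword.
print(decode_keyword("Hello", "no-such-charset"))
```

Doing the case conversion last matters because lowercasing the raw bytes of a multibyte encoding could alter bytes that are part of a character.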

@mattab
Member

mattab commented Nov 16, 2010

Great work :) this will make maintenance a lot less tedious.

Is there a reason www.google.cat is still listed, or can it be removed?

@robocoder
Contributor Author

Technically, .cat isn't an ISO country code. But since I've already added the MaxMind codes to Countries.php, I guess it won't hurt to add this one too.

@robocoder
Contributor Author

(In [3319]) refs #1694 - treat .cat as a pseudo country tld

@robocoder robocoder added this to the Piwik 1.1 milestone Jul 8, 2014
This issue was closed.