Opened 5 years ago

Closed 5 years ago

#589 closed Bug (fixed)

Piwik fails to properly decode and store some chinese keywords (eg. from baidu.com)

Reported by: matt
Owned by:
Priority: major
Milestone: RobotRock
Component: Core
Keywords:
Cc:
Sensitive:

Description

Baidu is the biggest search engine in China, and Piwik currently fails to detect keywords from baidu.com.

Example queries:

http://www.baidu.com/s?lm=0&si=&rn=10&ie=gb2312&ct=0&wd=%BF%DA%D3%EF+%CD%F2%C4%DC&pn=10&ver=0&cl=3&uim=0&usm=0


http://www.baidu.com/s?kw=&sc=web&cl=3&tn=sitehao123&ct=0&rn=&lm=&ie=gb2312&rs2=&myselectvalue=&f=&pv=&z=&from=&word=%B7%E8%BF%F1%CB%B5%D3%A2%D3%EF+%D4%DA%CF%DF%B9%DB%BF%B4
http://www.baidu.com/s?wd=%C1%F7%D0%D0%C3%C0%D3%EF%CF%C2%D4%D8
http://www.baidu.com/s?wd=%C1%F7%D0%D0%C3%C0%D3%EF+%CF%C2%D4%D8&lm=0&si=&rn=10&ie=gb2312&ct=0&cl=3&f=1&rsp=3&oq=VOA%C1%F7%D0%D0%C3%C0%D3%EF
http://web.gougou.com/search?search=%e6%b5%81%e8%a1%8c%e7%be%8e%e8%af%ad%20%e4%b8%8b%e8%bd%bd

Resolving this issue involves writing unit tests to cover this code.
We should also check whether the code path around line 715 in core/Tracker/Visit.php is useful; if it is not, fix it or delete it.
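
To illustrate what the example queries require, here is a minimal Python sketch (not Piwik's PHP code). It assumes the keyword may arrive either as UTF-8 or as GB2312 and tries a strict UTF-8 decode first, falling back to GB2312. The function name `decode_keyword` and the fallback list are hypothetical, and byte-sniffing like this is only a heuristic that can misfire on short or ambiguous inputs.

```python
from urllib.parse import unquote_to_bytes


def decode_keyword(raw, fallback_charsets=("gb2312",)):
    """Percent-decode a raw query value and return it as a Unicode string.

    Hypothetical helper: strict UTF-8 is tried first, then each fallback
    charset in order. This is a heuristic, not a reliable detector.
    """
    # In a query string, "+" stands for a space.
    data = unquote_to_bytes(raw.replace("+", " "))
    for charset in ("utf-8",) + tuple(fallback_charsets):
        try:
            return data.decode(charset)
        except UnicodeDecodeError:
            continue
    # Last resort: keep whatever is decodable and mark the rest.
    return data.decode("utf-8", errors="replace")


# Keyword values taken from the baidu.com and gougou.com examples above.
baidu = decode_keyword("%C1%F7%D0%D0%C3%C0%D3%EF+%CF%C2%D4%D8")
gougou = decode_keyword(
    "%e6%b5%81%e8%a1%8c%e7%be%8e%e8%af%ad%20%e4%b8%8b%e8%bd%bd"
)
```

Both calls should yield the same string, which is the kind of equality a unit test for this ticket could assert.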

Change History (3)

comment:2 Changed 5 years ago by vipsoft (robocoder)

The problems with baidu might be more complex than at first glance:

  • the second URL uses the variable name "word" instead of "wd"
  • gb2312 is an encoding; are the keywords not utf-8?
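
Both points can be handled together. The following is a rough Python sketch, not the actual Piwik fix: it assumes the keyword can appear under either "wd" or "word", and that the "ie" parameter, when present, names the charset. The names `KEYWORD_PARAMS` and `extract_keyword` are hypothetical, and the UTF-8 then GB2312 fallback order is an assumption for URLs that carry no "ie" hint.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical list of parameter names that may carry the keyword.
KEYWORD_PARAMS = ("wd", "word")


def extract_keyword(url):
    """Return the search keyword as a Unicode string, or None if absent."""
    query = urlsplit(url).query
    # Decoding as latin-1 maps every byte to one code point, so the raw
    # bytes can be recovered losslessly with .encode("latin-1") below.
    params = dict(parse_qsl(query, keep_blank_values=True, encoding="latin-1"))
    declared = params.get("ie") or None  # charset hint, e.g. "gb2312"
    for name in KEYWORD_PARAMS:
        value = params.get(name)
        if not value:
            continue
        data = value.encode("latin-1")
        # Try the declared charset first, then the assumed fallbacks.
        for charset in filter(None, (declared, "utf-8", "gb2312")):
            try:
                return data.decode(charset)
            except (UnicodeDecodeError, LookupError):
                continue  # wrong or unknown charset, try the next one
    return None
```

With this approach the second example URL (keyword under "word", ie=gb2312) and the third (keyword under "wd", no "ie" parameter) both come back as Unicode strings that can be stored as UTF-8.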

comment:3 Changed 5 years ago by matt (mattab)

Also see #435, which is closely related.

comment:3 Changed 5 years ago by matt (mattab)

  • Resolution set to fixed
  • Status changed from new to closed

(In [1014]) - cleaning up the search engine parsing code, adding tests, recording UTF8 keywords in the DB rather than encoded (as tables are now utf8, refs #310)

  • adding tests in url.test.php and fixed double encoding in some edge cases
  • fixed #589 Piwik fails to properly decode and store some chinese keywords (eg. from baidu.com)
  • fixed #435 Exotic encoded keywords should be stored as utf-8 in the DB
  • refs #575 hopefully fixed, will give it a few days of tests on piwik.org