Piwik fails to properly decode and store some chinese keywords (eg. from baidu.com) #589

Closed
mattab opened this issue Mar 9, 2009 · 3 comments
Labels
- Bug: For errors / faults / flaws / inconsistencies etc.
- Major: Indicates the severity or impact or benefit of an issue is much higher than normal but not critical.
- not-in-changelog: For issues or pull requests that should not be included in our release changelog on matomo.org.

Comments

@mattab
Member

mattab commented Mar 9, 2009

Baidu is the biggest search engine in China, and Piwik currently fails to detect keywords from Baidu.

Example queries:

```
http://www.baidu.com/s?lm=0&si=&rn=10&ie=gb2312&ct=0&wd=%BF%DA%D3%EF+%CD%F2%C4%DC&pn=10&ver=0&cl=3&uim=0&usm=0
http://www.baidu.com/s?kw=&sc=web&cl=3&tn=sitehao123&ct=0&rn=&lm=&ie=gb2312&rs2=&myselectvalue=&f=&pv=&z=&from=&word=%B7%E8%BF%F1%CB%B5%D3%A2%D3%EF+%D4%DA%CF%DF%B9%DB%BF%B4
http://www.baidu.com/s?wd=%C1%F7%D0%D0%C3%C0%D3%EF%CF%C2%D4%D8
http://www.baidu.com/s?wd=%C1%F7%D0%D0%C3%C0%D3%EF+%CF%C2%D4%D8&lm=0&si=&rn=10&ie=gb2312&ct=0&cl=3&f=1&rsp=3&oq=VOA%C1%F7%D0%D0%C3%C0%D3%EF
http://web.gougou.com/search?search=%e6%b5%81%e8%a1%8c%e7%be%8e%e8%af%ad%20%e4%b8%8b%e8%bd%bd
```

Resolving this issue involves writing unit tests to cover these bits of code.
We should also check whether the code path around line 715 in core/Tracker/Visit.php is useful; if it is not, fix it or delete it.
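To make the failure concrete, here is a small Python sketch (an illustration only, not Piwik's PHP code) that takes the `wd` value from the third example URL and shows that its bytes are not valid UTF-8 but decode cleanly as GB2312. The helper name `decodes_as` is made up for this example.

```python
from urllib.parse import unquote_to_bytes

# Percent-encoded `wd` value copied from the third example URL above.
raw = unquote_to_bytes("%C1%F7%D0%D0%C3%C0%D3%EF%CF%C2%D4%D8")


def decodes_as(data: bytes, charset: str) -> bool:
    """Return True if `data` is a valid byte sequence in `charset`."""
    try:
        data.decode(charset)
        return True
    except UnicodeDecodeError:
        return False


# The bytes are GB2312, so a strict UTF-8 decode rejects them.
utf8_ok = decodes_as(raw, "utf-8")
gb2312_ok = decodes_as(raw, "gb2312")

# Decode with the right charset, then re-encode as UTF-8 for storage.
keyword = raw.decode("gb2312")
stored = keyword.encode("utf-8")
```

A unit test for the tracker could assert the same thing: the keyword recovered from the URL, once converted, must round-trip through UTF-8 unchanged.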

@robocoder
Contributor

The problems with Baidu might be more complex than they appear at first glance:
- the second URL uses the parameter name “word” instead of “wd”
- gb2312 is an encoding; are the keywords not UTF-8?
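Both points could be handled together: read the keyword from whichever parameter is present, and decode it with the charset named by `ie` before falling back to guesses. A minimal Python sketch follows (not Piwik's PHP code; `KEYWORD_PARAMS`, `extract_keyword`, and the fallback order are assumptions made for illustration):

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical list of keyword parameter names, taken from the example URLs.
KEYWORD_PARAMS = ("wd", "word", "search")


def extract_keyword(url, fallbacks=("utf-8", "gb2312")):
    """Return the search keyword as a Unicode string, or None if not found."""
    query = urlsplit(url).query

    # First pass with latin-1, which accepts any byte, only to read the
    # ASCII `ie` parameter that declares the charset (if it is present).
    declared = parse_qs(query, encoding="latin-1").get("ie", [])

    # Try the declared charset first, then the assumed fallbacks in order.
    for charset in [*declared, *fallbacks]:
        try:
            params = parse_qs(query, encoding=charset, errors="strict")
        except (LookupError, UnicodeDecodeError):
            continue  # unknown charset name, or bytes invalid in it
        for name in KEYWORD_PARAMS:
            if name in params:
                return params[name][0]
        return None  # query decoded fine but carries no keyword parameter
    return None


baidu = extract_keyword(
    "http://www.baidu.com/s?wd=%C1%F7%D0%D0%C3%C0%D3%EF+%CF%C2%D4%D8"
    "&lm=0&si=&rn=10&ie=gb2312&ct=0&cl=3&f=1&rsp=3&oq=VOA%C1%F7%D0%D0%C3%C0%D3%EF"
)
gougou = extract_keyword(
    "http://web.gougou.com/search?search="
    "%e6%b5%81%e8%a1%8c%e7%be%8e%e8%af%ad%20%e4%b8%8b%e8%bd%bd"
)
```

With the URLs from the report, the Baidu query that declares `ie=gb2312` and the gougou query (plain UTF-8) both come out as `流行美语 下载`. Trying strict UTF-8 before GB2312 is a heuristic: GB2312 bytes are rarely valid UTF-8, so the third example URL, which has no `ie` parameter, still falls through to GB2312.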

@mattab
Member Author

mattab commented Mar 20, 2009

Also see #435, which is closely related.

@mattab
Member Author

mattab commented Mar 24, 2009

(In 1014) – cleaning up the search engine parsing code, adding tests, recording UTF8 keywords in the DB rather than encoded (as tables are now utf8, refs #5730)
- adding tests in url.test.php and fixed double encoding in some edge cases
- fixed #589 Piwik fails to properly decode and store some chinese keywords (eg. from baidu.com)
- fixed #435 Exotic encoded keywords should be stored as utf-8 in the DB
- refs #575 hopefully fixed, will give it a few days of tests on piwik.org

@mattab added this to the RobotRock milestone Jul 8, 2014
@mattab closed this as not planned Mar 10, 2023
@bx80 added the not-in-changelog label Mar 16, 2023