Ticket #2761 (closed Bug: fixed)

Opened 7 months ago

Last modified 4 months ago

Mail.ru search engine results encoding has changed

Reported by: Erazor Owned by:
Priority: normal Milestone: 1.7 Piwik 1.7
Component: Core Keywords:
Cc: Sensitive: no

Description

File: SearchEngines.php

Original (shows incorrect encoding): // Mail.ru 'go.mail.ru' => array('Mailru', 'q', 'search?q={k}', 'windows-1251'),

I changed to: // Mail.ru 'go.mail.ru' => array('Mailru', 'q', 'search?rch=e&q={k}'),

And now it seems to work correctly.

Change History

Changed 7 months ago by vipsoft

  • status changed from new to closed
  • resolution set to fixed

(In [5413]) fixes #2761 - confirmed that go.mail.ru search results are now utf-8

Changed 7 months ago by vipsoft

  • keywords mail.ru mailru removed
  • milestone set to 1.6.x Piwik 1.6.x

Changed 7 months ago by vipsoft

  • summary changed from Mail.ru search engine broken to Mail.ru search engine results encoding has changed

Changed 4 months ago by kiav

As for now, Mail.ru uses UTF-8 in most cases. But rarely it still uses windows-1251 too.

I had to change extractSearchEngineInformationFromUrl function in /core/Common.php

if(function_exists('iconv')
	&& isset($searchEngines[$refererHost][3]))
{
	// accepts string, array or comma separated list string in preferred order
	if (!is_array($searchEngines[$refererHost][3]))
		$charsets = explode(',', $searchEngines[$refererHost][3]);
	else
		$charsets = $searchEngines[$refererHost][3];
		
	if(!empty($charsets))
	{
		$charset = mb_detect_encoding($key, $charsets);
		if ($charset === false)
			$charset = $charsets[0];
		
		$newkey = @iconv($charset, 'UTF-8//IGNORE', $key);
		if(!empty($newkey))
		{
			$key = $newkey;
		}
	}
}

It works with

'go.mail.ru' => array('Mailru', 'q', 'search?q={k}', array('UTF-8', 'windows-1251')),

in /core/DataFiles/SearchEngines.php

Changed 4 months ago by vipsoft

  • status changed from closed to reopened
  • resolution fixed deleted

Thanks for the patch.

I don't think we need to support comma separated list. We do have to check for mbstring and have a unit test.

Changed 4 months ago by kiav

Comma separated list is already supported by mb_detect_encoding.

By the way, mb_strtolower is already used in Common.php (in original Piwik code in the extractSearchEngineInformationFromUrl function) without any checks tests.

Changed 4 months ago by vipsoft

Can you provide a sample referrer url with windows-1251 encoding?

I've done some refactoring and added some more tests, but can never have enough.

Changed 4 months ago by vipsoft

Awesome! Thanks!

Changed 4 months ago by vipsoft

  • status changed from reopened to closed
  • resolution set to fixed

(In [5682]) fixes #2761

Note: See TracTickets for help on using tickets.