Opened 2 years ago

Closed 2 years ago

#2761 closed Bug (fixed)

Mail.ru search engine results encoding has changed

Reported by: Erazor Owned by:
Priority: normal Milestone: 1.7 Piwik 1.7
Component: Core Keywords:
Cc: Sensitive: no

Description

File: SearchEngines.php

Original (shows incorrect encoding):
Mail.ru
'go.mail.ru' => array('Mailru', 'q', 'search?q={k}', 'windows-1251'),

I changed to:
Mail.ru
'go.mail.ru' => array('Mailru', 'q', 'search?rch=e&q={k}'),

And now it seems to work correctly.

Change History (10)

comment:1 Changed 2 years ago by vipsoft (robocoder)

  • Resolution set to fixed
  • Status changed from new to closed

(In [5413]) fixes #2761 - confirmed that go.mail.ru search results are now utf-8

comment:2 Changed 2 years ago by vipsoft (robocoder)

  • Keywords mail.ru mailru removed
  • Milestone set to 1.6.x Piwik 1.6.x

comment:3 Changed 2 years ago by vipsoft (robocoder)

  • Summary changed from Mail.ru search engine broken to Mail.ru search engine results encoding has changed

comment:4 Changed 2 years ago by kiav

As for now, Mail.ru uses UTF-8 in most cases. But rarely it still uses windows-1251 too.

I had to change extractSearchEngineInformationFromUrl function in /core/Common.php

if(function_exists('iconv')
	&& isset($searchEngines[$refererHost][3]))
{
	// accepts string, array or comma separated list string in preferred order
	if (!is_array($searchEngines[$refererHost][3]))
		$charsets = explode(',', $searchEngines[$refererHost][3]);
	else
		$charsets = $searchEngines[$refererHost][3];
		
	if(!empty($charsets))
	{
		$charset = mb_detect_encoding($key, $charsets);
		if ($charset === false)
			$charset = $charsets[0];
		
		$newkey = @iconv($charset, 'UTF-8//IGNORE', $key);
		if(!empty($newkey))
		{
			$key = $newkey;
		}
	}
}

It works with

'go.mail.ru' => array('Mailru', 'q', 'search?q={k}', array('UTF-8', 'windows-1251')),

in /core/DataFiles/SearchEngines.php

comment:5 Changed 2 years ago by vipsoft (robocoder)

  • Resolution fixed deleted
  • Status changed from closed to reopened

Thanks for the patch.

I don't think we need to support comma separated list. We do have to check for mbstring and have a unit test.

comment:6 Changed 2 years ago by kiav

Comma separated list is already supported by mb_detect_encoding.

By the way, mb_strtolower is already used in Common.php (in original Piwik code in the extractSearchEngineInformationFromUrl function) without any checks tests.

comment:7 Changed 2 years ago by vipsoft (robocoder)

Can you provide a sample referrer url with windows-1251 encoding?

I've done some refactoring and added some more tests, but can never have enough.

comment:9 Changed 2 years ago by vipsoft (robocoder)

Awesome! Thanks!

comment:10 Changed 2 years ago by vipsoft (robocoder)

  • Resolution set to fixed
  • Status changed from reopened to closed

(In [5682]) fixes #2761

Note: See TracTickets for help on using tickets.