Opened 22 months ago

Closed 18 months ago

Last modified 4 months ago

#3259 closed Bug (fixed)

Warning: htmlspecialchars() Invalid multibyte sequence in argument in core/DataTable/Renderer.php on line 223

Reported by: matt Owned by: capedfuzz
Priority: normal Milestone: 1.10 - Piwik 1.10
Component: Core Keywords:
Cc: Sensitive: no

Description

Reported in forum
Input data:

				<url>http://example/?c_id=6&amp;sc_id=97&amp;chars=%EC%E5%F8%EA%EE%E2%FB%E5&amp;brand=ZELMER</url>

				<url>http:/example/?c_id=7&amp;sc_id=206&amp;chars=%E3%E0%E7%EE%E2%FB%E5</url>

with API request: index.php?module=API&method=Actions.getPageUrls&idSite=1&period=day&date=today&format=xml&token_auth=x

results in:

Warning: htmlspecialchars() [<a href='function.htmlspecialchars'>function.htmlspecialchars</a>]: Invalid multibyte sequence in argument in //public_html/core/DataTable/Renderer.php on line 223

Backtrace -->
#0 Piwik_ErrorHandler(...) called at [:]#1 htmlspecialchars(...) called at [//public_html/core/DataTable/Renderer.php:223]#2 Piwik_DataTable_Renderer::formatValueXml(...) called at [//public_html/core/DataTable/Renderer/Xml.php:342]#3 Piwik_DataTable_Renderer_Xml->renderDataTable(...) called at [//public_html/core/DataTable/Renderer/Xml.php:337]#4 Piwik_DataTable_Renderer_Xml->renderDataTable(...) called at [//public_html/core/DataTable/Renderer/Xml.php:142]#5 Piwik_DataTable_Renderer_Xml->renderTable(...) called at [//public_html/core/DataTable/Renderer/Xml.php:33]#6 Piwik_DataTable_Renderer_Xml->render(...) called at [//public_html/core/API/ResponseBuilder.php:221]#7 Piwik_API_ResponseBuilder->getRenderedDataTable(...) called at [//public_html/core/API/ResponseBuilder.php:325]#8 Piwik_API_ResponseBuilder->handleDataTable(...) called at [//public_html/core/API/ResponseBuilder.php:79]#9 Piwik_API_ResponseBuilder->getResponse(...) called at [//public_html/core/API/Request.php:130]#10 Piwik_API_Request->process(...) called at [//public_html/plugins/API/Controller.php:27]#11 Piwik_API_Controller->index(...) called at [:]#12 call_user_func_array(...) called at [//public_html/core/FrontController.php:138]#13 Piwik_FrontController->dispatch(...) called at [//public_html/index.php:53]

Bug fix ?
I'm not sure if we can reproduce this error with this input data, and I'm not sure which fix is correct in this case. Is it a bug in PHP? is it that we truncate the input data too early?

we could fix it by keeping the original input data if the result of htmlspecialchars is empty. But then it could maybe cause some XSS or other security issue if someone manages to generate bogus data leading to this error in order to print out a xss?

Change History (21)

comment:1 Changed 22 months ago by matt (mattab)

A temporary quick fix for users with this issue is as follwos:

Index: core/DataTable/Renderer.php
===================================================================
--- core/DataTable/Renderer.php	(revision 6483)
+++ core/DataTable/Renderer.php	(working copy)
@@ -220,7 +220,7 @@
 			&& !is_numeric($value)) 
 		{
 			$value = html_entity_decode($value, ENT_COMPAT, 'UTF-8');
-			$value = htmlspecialchars($value, ENT_COMPAT, 'UTF-8');
+			$value = @htmlspecialchars($value, ENT_COMPAT, 'UTF-8');

But this is only temporary, definitely not the ideal fix.

if it fails and returns empty string, maybe we could filte rthe initial input to a-zA-Z0-9 or similar and keep only the safe characters?

comment:2 Changed 21 months ago by matt (mattab)

Fix suggested by Anthon: if(function_exists('iconv')) $value = @iconv('UTF-8', 'UTF-8//IGNORE', $value);

I see in the php doc: http://www.php.net/manual/en/function.iconv.php#108643

that maybe this doesn't work in all php versions?

The comment suggests:

ini_set('mbstring.substitute_character', "none");
$text= mb_convert_encoding($text, 'UTF-8', 'UTF-8');

Any particular thoughts?

comment:3 Changed 21 months ago by vipsoft (robocoder)

We can try to run some tests between php versions and compare.

The more portable, the better.

comment:4 Changed 20 months ago by maze

I have a similar error at some actions at "Visitors --> Visitor Log":

There is an error. Please report the message (Piwik 1.8.3) and full backtrace in the Piwik forums (please do a Search first as it might have been reported already!).

Warning: htmlspecialchars(): Invalid multibyte sequence in argument in /srv/www/htdocs/piwik/core/SmartyPlugins/modifier.escape.php on line 31

Backtrace -->
#0 Piwik_ErrorHandler(...) called at [:]#1 htmlspecialchars(...) called at [/srv/www/htdocs/piwik/core/SmartyPlugins/modifier.escape.php:31]#2 smarty_modifier_escape(...) called at [/srv/www/htdocs/piwik/tmp/templates_c/%%46^462^46227F0A%%visitorLog.tpl.php:337]#3 include(...) called at [/srv/www/htdocs/piwik/libs/Smarty/Smarty.class.php:1263]#4 Smarty->fetch(...) called at [/srv/www/htdocs/piwik/core/View.php:133]#5 Piwik_View->render(...) called at [/srv/www/htdocs/piwik/core/Controller.php:153]#6 Piwik_Controller->renderView(...) called at [/srv/www/htdocs/piwik/plugins/Live/Controller.php:84]#7 Piwik_Live_Controller->getVisitorLog(...) called at [:]#8 call_user_func_array(...) called at [/srv/www/htdocs/piwik/core/FrontController.php:138]#9 Piwik_FrontController->dispatch(...) called at [/srv/www/htdocs/piwik/index.php:53]

comment:5 Changed 19 months ago by matt (mattab)

There is a similar bug, reported in the forums

comment:6 Changed 18 months ago by matt (mattab)

  • Owner set to matt

comment:7 Changed 18 months ago by matt (mattab)

  • Owner changed from matt to capedfuzz
  • Priority changed from normal to major

@capedfuzz maybe similar to the other bug you're dealing with.

comment:8 Changed 18 months ago by matt (mattab)

  • Priority changed from major to critical

comment:9 Changed 18 months ago by capedfuzz (diosmosis)

This bug is due to query parameter values being encoded in non-UTF-8 charsets. When archiving, there's no way to know what this other encoding is, and so we have to use UTF-8 w/ htmlspecialchars. I came up w/ two possible ways of fixing this one:

  • Either associate the charset w/ the URLs in the DB when tracking (using the 'cs' query param I added in #3450). (Not sure how to do this efficiently)
  • Or, add a charset field to the piwik_site table and let users set it if their site does not use UTF-8. We can display an info here saying it's better to just use UTF-8.

comment:10 follow-up: Changed 18 months ago by matt (mattab)

when the charset is defined and not utf-8, maybe in the Tracker, we should also re-encode the URL in UTF-8, since it's likely the URL parameters are encoded in the page charset...?

Last edited 18 months ago by matt (previous) (diff)

comment:11 in reply to: ↑ 10 ; follow-up: Changed 18 months ago by capedfuzz (diosmosis)

Replying to matt:

when the charset is defined and not utf-8, maybe in the Tracker, we should also re-encode the URL in UTF-8, since it's likely the URL parameters are encoded in the page charset...?

I tried that and it works in that it displays the label correctly, but then the URL becomes invalid. As in, if you visit the link, it won't work anymore. As far as I can tell, there has to be some way to store the charset of the query params somewhere (like in the piwik_site table).

comment:12 in reply to: ↑ 11 ; follow-up: Changed 18 months ago by matt (mattab)

How can we quickly fix this problem, since the proper solution would have performance overhead which we can't consider for this use case.

The goal is to see a quick fix, for example, would if(function_exists('iconv')) $value = @iconv('UTF-8', 'UTF-8//IGNORE', $value); work to force convert to Utf8 the url when the htmlspecialchars return empty?

comment:13 in reply to: ↑ 12 Changed 18 months ago by capedfuzz (diosmosis)

Replying to matt:

How can we quickly fix this problem, since the proper solution would have performance overhead which we can't consider for this use case.

The goal is to see a quick fix, for example, would if(function_exists('iconv')) $value = @iconv('UTF-8', 'UTF-8//IGNORE', $value); work to force convert to Utf8 the url when the htmlspecialchars return empty?

From what I remember of my test, this would strip the non-UTF8 encoded characters from the string, so I guess "www.site.com?q=%EC%ED..." would turn into "www.site.com?q=". Since this only affects the query string, we might be able to do this to each individual query param and for stuff that doesn't have UTF8 encoding, we just display the encoded data (ie, %EC%ED)? Not sure if it'll work, though, but when I get a chance I'll check it out.

comment:14 Changed 18 months ago by capedfuzz (diosmosis)

(In [7336]) Refs #3259, strip non-UTF-8 chars from strings in outputted XML/HTML so htmlspecialchars won't fail.

comment:15 Changed 18 months ago by capedfuzz (diosmosis)

I've applied one part of a fix for this. This will make sure the warning doesn't display and will make sure at least SOMETHING is shown as the label for strings w/ non-UTF-8 chars. The non-UTF-8 stuff won't show though.

The only way I can think of to fix that is to look for each %.. encoded item and check if the decoded char is invalid UTF-8. If so, it gets double encoded so html_entity_decode won't turn it into the non-UTF-8 value. This seems excessive, but I can't think of another way of fixing this (specifying a charset column for piwik_site might work, but would likely take a while to implement).

comment:16 follow-up: Changed 18 months ago by matt (mattab)

Looks good to me! Thanks for researching and finding this fix, which is better than before. We cant really afford to store charsets per website... too complicated.

However, I have an idea for a possible "ultimate" solution: could we use http://php.net/manual/en/function.mb-detect-encoding.php ?? :) that looks interesting. Maybe we could run this function when htmlspecialchars return empty string (prior to your commmit) ? would it work to detect encoding of the URL containing non utf8 characters?

comment:17 in reply to: ↑ 16 Changed 18 months ago by capedfuzz (diosmosis)

Replying to matt:

Looks good to me! Thanks for researching and finding this fix, which is better than before. We cant really afford to store charsets per website... too complicated.

However, I have an idea for a possible "ultimate" solution: could we use http://php.net/manual/en/function.mb-detect-encoding.php ?? :) that looks interesting. Maybe we could run this function when htmlspecialchars return empty string (prior to your commmit) ? would it work to detect encoding of the URL containing non utf8 characters?

We can tell if it's not UTF-8, but that doesn't mean we can find out what the true encoding is. Some non-UTF8 strings can still be interpreted as UTF-8 though the result will be gibberish.

comment:18 Changed 18 months ago by matt (mattab)

This function will find out the encoding I think: http://php.net/manual/en/function.mb-detect-encoding.php -- it might work on the non utf8 URL, so you could display a label that works with non UTF8 URLs without modifying the URLs, simply detect encoding > convert from detectedEncoding to UTF8 > Parse URL To get the Pretty name <label>

comment:19 Changed 18 months ago by capedfuzz (diosmosis)

  • Resolution set to fixed
  • Status changed from new to closed

The warning is gone, for now I think that's good enough.

comment:20 Changed 17 months ago by matt (mattab)

  • Priority changed from critical to normal

comment:21 Changed 4 months ago by matt (mattab)

We have a variation of similar error in: #3859

Note: See TracTickets for help on using tickets.