[Piglit] [PATCH 2/2] Fix JSON problems with non-ASCII characters, by explicitly decoding test output.

Mon Aug 8 10:52:57 PDT 2011

Due to a design flaw in Python 2.x, if an 8-bit string containing a
non-ASCII character is ever used in a context requiring a unicode
string, the Python interpreter will raise an exception.  This happens
because Python lazily decodes 8-bit strings to unicode as needed, and
when it does the decoding it assumes the 8-bit string is in ASCII
format.  As a result, 8-bit strings containing non-ASCII characters
are unsafe in any Python program that also uses unicode.
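
As a rough illustration of the failure described above, the sketch
below performs by hand the ASCII decode that Python 2.x applies
implicitly.  The helper name implicit_decode is hypothetical and is
not part of Piglit; it only mimics the interpreter's behavior so the
example runs the same way on 2.x and 3.x.

```python
# UTF-8 bytes for "cafe" with an acute accent on the final letter;
# the last two bytes are outside the ASCII range.
raw = b'caf\xc3\xa9'


def implicit_decode(data):
    # Hypothetical helper: mimics the ASCII decode that Python 2.x
    # performs implicitly when an 8-bit string is used where a
    # unicode string is required.
    return data.decode('ascii')


try:
    implicit_decode(raw)
    failed = False
except UnicodeDecodeError:
    # This is the exception Python 2.x raises at the point of use,
    # which may be far from where the 8-bit string was created.
    failed = True
```

Pure-ASCII input passes through the same helper without error, which
is why the problem can stay hidden until a test happens to print a
non-ASCII byte.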

Since Python 3.x doesn't have 8-bit strings (it has "bytes" objects,
which don't autoconvert to unicode strings), the most
forward-compatible way to address this problem is to find the source
of the unsafe 8-bit strings, and introduce an explicit decoding step
that translates them to unicode strings and handles errors properly.

This problem manifested in Piglit if the output of a test ever
contained a non-ASCII character.  Due to Python 2.x's lazy string
decoding semantics, the exception wouldn't occur until attempting to
serialize the test output as JSON.

This patch introduces an explicit step, right after running a test,
which decodes the test output from 8-bit strings into unicode.  The
decoding assumes UTF-8, and silently replaces invalid UTF-8 sequences
with the unicode "replacement character" rather than raising an
exception.
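
A minimal sketch of the decoding step, assuming the output arrives as
an 8-bit string as it does from proc.communicate().  The helper name
decode_output is hypothetical; the patch itself decodes inline.  The
error handler is passed positionally here so the call does not depend
on keyword-argument support in decode().

```python
import json


def decode_output(raw):
    # Hypothetical helper: decode as UTF-8 and substitute U+FFFD for
    # any byte sequence that is not valid UTF-8, instead of raising.
    return raw.decode('utf-8', 'replace')


valid = decode_output(b'caf\xc3\xa9')         # well-formed UTF-8
invalid = decode_output(b'bad byte: \xff')    # 0xff is never valid UTF-8

# Both results are unicode strings, so JSON serialization succeeds.
serialized = json.dumps({'out': valid, 'err': invalid})
```

Because the invalid byte becomes U+FFFD rather than being dropped, the
stored output still shows where the malformed data appeared.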
---
 framework/exectest.py |   17 +++++++++++++++++
 1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/framework/exectest.py b/framework/exectest.py
index 91a674e..e3327b1 100644
--- a/framework/exectest.py
+++ b/framework/exectest.py
@@ -55,6 +55,23 @@ class ExecTest(Test):
 				)
 			out, err = proc.communicate()
 
+			# proc.communicate() returns 8-bit strings, but we need
+			# unicode strings.  In Python 2.x, this is because we
+			# will eventually be serializing the strings as JSON,
+			# and the JSON library expects unicode.  In Python 3.x,
+			# this is because all string operations require
+			# unicode.  So translate the strings into unicode,
+			# assuming they are using UTF-8 encoding.
+			#
+			# If the subprocess output wasn't properly UTF-8
+			# encoded, we don't want to raise an exception, so
+			# translate the strings using 'replace' mode, which
+			# replaces erroneous characters with the Unicode
+			# "replacement character" (a white question mark inside
+			# a black diamond).
+			out = out.decode('utf-8', 'replace')
+			err = err.decode('utf-8', 'replace')
+
 			results = TestResult()
 
 			out = self.interpretResult(out, results)
-- 
1.7.6