tomclegg.net


character encoding bugs break Biblio / nusoap
Posted December 12, 2010

Problem: MediaWiki Biblio extension fails.

  • Rendered page says: "Error fetching PMID 20445623:" (no further details).
  • Server error log says: "PHP Warning: Attempt to modify property of non-object in /path/to/mediawiki/extensions/nusoap/nusoap.php on line 4151" (but this is a red herring).
  • Server error log says: "PHP Notice: Undefined index: ERROR in /path/to/mediawiki/extensions/Biblio.php on line 744" (also a red herring).

The relevant error message, which nusoap generates but which never gets reported, is:

XML error parsing SOAP payload on line 306: Invalid character

This is caused by the confluence of two problems:

  1. The server is using ISO-8859-1 encoding, but does not mention this fact in its XML declaration.
    • According to my reading of the XML 1.0 recommendation, this isn't necessarily a problem. "In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 must begin with a text declaration [...] containing an encoding declaration." (emphasis added)
    • However, in this case, the external character encoding information is incorrect; the HTTP response headers received from www.ncbi.nlm.nih.gov say charset="UTF-8".
  2. As noted in PHP bug #36785, there is no way to communicate the "external character encoding information" to PHP's XML parser.
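
The failure can be reproduced without nusoap or NCBI. What follows is a minimal sketch, assuming PHP 7 or later with the xml extension; the helper name xml_is_parseable and the one-element payload are made up for illustration. It feeds the parser the same ISO-8859-1 body twice, once with a version-only XML declaration and once with an explicit encoding declaration.

```php
<?php
// Hypothetical helper (not part of nusoap): true if PHP's XML parser
// accepts the whole document.
function xml_is_parseable(string $xml): bool
{
    $parser = xml_parser_create();
    return xml_parse($parser, $xml, true) === 1;
}

// "cafe" with an accented final letter, stored as the single
// ISO-8859-1 byte 0xE9. That byte is not valid UTF-8 on its own.
$body = "<doc>caf\xE9</doc>";

$undeclared = '<?xml version="1.0"?>' . $body;
$declared   = '<?xml version="1.0" encoding="ISO-8859-1"?>' . $body;

// The /u modifier makes preg_match() fail on invalid UTF-8, which gives a
// cheap check of whether a "charset=UTF-8" header is telling the truth.
var_dump(preg_match('//u', $undeclared) === 1);

var_dump(xml_is_parseable($undeclared));
var_dump(xml_is_parseable($declared));
```

The first two checks should come out false and the third true: the payload is not valid UTF-8, the parser rejects it when nothing declares the encoding, and the same bytes parse once the declaration is present.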

Horrible workaround for both problems at once:

--- nusoap.php~ 2006-10-05 19:28:36.000000000 -0400
+++ nusoap.php  2010-12-12 21:29:11.000000000 -0500
@@ -5874,6 +5874,10 @@
                                $this->debug('No XML declaration');
                        }
                        $this->debug('Entering soap_parser(), length='.strlen($xml).', encoding='.$encoding);
+
+                       $tried_fudge = false;
+
+                   xmlparse:
                        // Create an XML parser - why not xml_parser_create_ns?
                        $this->parser = xml_parser_create($this->xml_encoding);
                        // Set the options for parsing the XML data.
@@ -5888,6 +5892,16 @@

                        // Parse the XML file.
                        if(!xml_parse($this->parser,$xml,true)){
+                           if (!$tried_fudge) {
+                               $fudgexml = preg_replace ('{^<\?xml version="1.0"}', '<?xml version="1.0" encoding="ISO-8859-1"', $xml);
+                               $fudgeparser = xml_parser_create ($this->xml_encoding);
+                               if (xml_parse ($fudgeparser, $fudgexml, true)) {
+                                   $xml = $fudgexml;
+                                   $tried_fudge = true;
+                                   goto xmlparse;
+                               }
+                           }
+
                            // Display an error message.
                            $err = sprintf('XML error parsing SOAP payload on line %d: %s',
                            xml_get_current_line_number($this->parser),
--- Biblio.php~ 2006-10-05 19:28:36.000000000 -0400
+++ Biblio.php  2010-12-12 22:37:23.000000000 -0500
@@ -266,6 +266,7 @@
       new nusoapclient($server_url, true,
                       $proxyhost, $proxyport,
                       $proxyusername, $proxypassword);
+    $client->decode_utf8 = false;
     $err = $client->getError();

     if (!$err) {
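
The retry logic in the nusoap.php hunk can also be written without goto. This is only a sketch, assuming PHP 7.1 or later; xml_with_encoding_fallback is a hypothetical name, not a nusoap function, and the ISO-8859-1 default is a guess that happens to suit this server. It returns whichever version of the document parses, so the caller can hand that text to its real parser with handlers attached.

```php
<?php
// Hypothetical helper: returns XML text that PHP's parser accepts,
// or null if neither the original nor the patched copy is well-formed.
function xml_with_encoding_fallback(string $xml, string $fallback = 'ISO-8859-1'): ?string
{
    // First attempt: the document exactly as received.
    if (xml_parse(xml_parser_create(), $xml, true) === 1) {
        return $xml;
    }

    // Second attempt: only when the document starts with a version-only
    // declaration, rewrite it to name the fallback encoding.
    $count = 0;
    $fudged = preg_replace(
        '{^<\?xml version="1.0"\s*\?>}',
        '<?xml version="1.0" encoding="' . $fallback . '"?>',
        $xml,
        1,
        $count
    );
    if ($count === 1 && xml_parse(xml_parser_create(), $fudged, true) === 1) {
        return $fudged;
    }

    return null;
}

// Usage: a Latin-1 byte (0xE9) in a document that declares no encoding.
$bad   = "<?xml version=\"1.0\"?><doc>caf\xE9</doc>";
$fixed = xml_with_encoding_fallback($bad);
var_dump($fixed !== null);
```

Like the patch, this parses the payload twice in the failure case, and it trusts the fallback encoding blindly. A document that is broken for some other reason still ends up as null, so the original error path stays reachable.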

MediaWiki test block:

<biblio>
#Barash2010 pmid=20445623
#MultipleFluors pmid=15558047
#SmolkeNAR2010 pmid=20385591
#IntronSizeDist pmid=16980575
#Kangueane2004 pmid=15217358
#SmithReview2005 pmid=15956978
#SplicingNomenclature pmid=18688268
</biblio>

Broken server response:

HTTP/1.1 200 OK
Date: Mon, 13 Dec 2010 02:04:11 GMT
Server: Apache
Content-length: 16028
Content-Type: text/xml; charset="UTF-8"
Vary: Accept-Encoding
Connection: close

<?xml version="1.0"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
 xmlns:xsd="http://www.w3.org/2001/XMLSchema"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" >
<SOAP-ENV:Body><eSummaryResult xmlns="http://www.ncbi.nlm.nih.gov/soap/eutils/esummary">