Feature Request #92

Allow the XML parser to be more lax

Added by Anonymous 699 days ago. Updated 429 days ago.

Status: New
Start: 2008-08-29
Priority: Medium
Due date:
Assigned to: -
% Done: 0%
Category: Data Input
Target version: 1.3

Description

Some feeds contain white space at the beginning of the feed, and on PHP 4.4.9 on Debian etch, this is correctly flagged as malformed XML. Likewise, the parser chokes in the presence of otherwise harmless control characters - quite correctly, given the XML specs.

PHP 5 on MacOS X 10.5 parses these feeds, skipping over the offending characters.

This code (a subclass of SimplePie_Parser) fixes it, even if it has to steal some code from the superclass to get past the BOMs:

class CleanerParser extends SimplePie_Parser {

    function parse(&$data, $encoding) {
        // Remove illegal control chars - leave only 0x09, 0x0A, 0x0D
        $data = preg_replace('/[\x01-\x08\x0B\x0C\x0E-\x1F]/', '', $data);
        // Strip BOM:
        // UTF-32 Big Endian BOM
        if (substr($data, 0, 4) === "\x00\x00\xFE\xFF") {
            $data = substr($data, 4);
        }
        // UTF-32 Little Endian BOM
        elseif (substr($data, 0, 4) === "\xFF\xFE\x00\x00") {
            $data = substr($data, 4);
        }
        // UTF-16 Big Endian BOM
        elseif (substr($data, 0, 2) === "\xFE\xFF") {
            $data = substr($data, 2);
        }
        // UTF-16 Little Endian BOM
        elseif (substr($data, 0, 2) === "\xFF\xFE") {
            $data = substr($data, 2);
        }
        // UTF-8 BOM
        elseif (substr($data, 0, 3) === "\xEF\xBB\xBF") {
            $data = substr($data, 3);
        }
        // Remove white space (empty lines, etc.) before the first XML decl.
        $data = trim($data);
        return SimplePie_Parser::parse($data, $encoding);
    }
}

History

Updated by Geoffrey Sneddon 699 days ago

trim() doesn't work, as it'll break any UTF-16LE feed (as the first word in any UTF-16LE feed is 00 3E), and trim() strips any null bytes. If we trim() whitespace, it should be just the S production in XML.
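
One way to limit trimming to the S production (space, tab, carriage return, line feed) is to pass an explicit character list to ltrim(), so that null bytes are left alone. This is only a minimal sketch, assuming the data is already in UTF-8 or another ASCII-compatible encoding; the function name strip_leading_xml_space is hypothetical and not part of SimplePie.

```php
<?php
// Hypothetical helper, not part of SimplePie.
// Strips only leading XML white space (the S production: #x20, #x9, #xD, #xA).
// Unlike trim() with its default character list, this never removes
// NUL (0x00) or vertical tab (0x0B) bytes.
function strip_leading_xml_space($data)
{
    return ltrim($data, "\x20\x09\x0D\x0A");
}

// Example: blank lines and indentation before the root element.
$feed = "\n  \t<rss version=\"2.0\"/>";
echo strip_leading_xml_space($feed), "\n";
```

Only the leading side is trimmed here, since the problem described is white space before the XML declaration.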

Updated by Morten Norby Larsen 699 days ago

Good catch. I had intended to add the code after the conversion to utf-8, but decided to go with the subclass, so I didn't have to make any changes to simplepie.inc itself.

At any rate, removing control chars that are not allowed by the XML standard anyway, should not make bad XML worse.
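
The order matters for this step: once the data is UTF-8, every byte below 0x80 is an ASCII character in its own right, so illegal control characters can be removed byte by byte without touching multi-byte sequences. The following is a minimal sketch under that assumption. The name strip_illegal_xml_controls is hypothetical, and unlike the subclass in the description it also removes 0x00, which XML 1.0 does not allow either.

```php
<?php
// Hypothetical helper, not part of SimplePie.
// Assumes $utf8 is already UTF-8: bytes 0x00-0x7F then never occur inside
// a multi-byte sequence, so byte-wise removal is safe.
// Keeps tab (0x09), line feed (0x0A) and carriage return (0x0D).
function strip_illegal_xml_controls($utf8)
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $utf8);
}

// U+0101 is the byte pair C4 81 in UTF-8 and survives the filter.
// In UTF-16 the same character is the bytes 01 01, which a byte-wise
// filter like this one would delete.
$dirty = "caf\x01e\x0B \xC4\x81\tok\n";
echo strip_illegal_xml_controls($dirty);
```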

Updated by Geoffrey Sneddon 698 days ago

  • Category set to Data Input
  • Status changed from Unconfirmed to New
  • Target version set to 1.2

There was some bug about UTF-16LE being trimmed before — I just removed the trim call entirely then.

Updated by Geoffrey Sneddon 522 days ago

Does this even affect any feeds? Examples?

Updated by Geoffrey Sneddon 429 days ago

  • Target version changed from 1.2 to 1.3
