XML Canonical Forms

DRAFT 1

As with many kinds of structured information, an XML document carries several categories of information that may be deemed "important" for a given task. Canonical forms are standard ways to represent such classes of information. For testing XML, and potentially for other purposes, three XML Canonical Forms have been defined as of this writing: the first, second, and third forms described in the sections below.

Recanonicalizing a document that is already in a given canonical form to that same form changes nothing. Canonicalizing a document in the second or third form to the first canonical form discards all declarations. Canonicalizing a document in the second form to the third, or in the third form to the second, has no effect.

The author is pleased to acknowledge help from James Clark in defining the additional canonical forms.

First XML Canonical Form

This description has been extracted from the version at http://www.jclark.com/xml/canonxml.html.

Every well-formed XML document has a unique structurally equivalent canonical XML document. Two structurally equivalent XML documents have a byte-for-byte identical canonical XML document. Canonicalizing an XML document requires only information that an XML processor is required to make available to an application.

A canonical XML document conforms to the following grammar:

CanonXML    ::= Pi* element Pi*
element     ::= Stag (Datachar | Pi | element)* Etag
Stag        ::= '<'  Name Atts '>'
Etag        ::= '</' Name '>'
Pi          ::= '<?' Name ' ' (((Char - S) Char*)? - (Char* '?>' Char*)) '?>'
Atts        ::= (' ' Name '=' '"' Datachar* '"')*
Datachar    ::= '&amp;' | '&lt;' | '&gt;' | '&quot;'
                 | '&#9;'| '&#10;'| '&#13;'
                 | (Char - ('&' | '<' | '>' | '"' | #x9 | #xA | #xD))
Name        ::= (see XML spec)
Char        ::= (see XML spec)
S           ::= (see XML spec)

Attributes are in lexicographical order (in Unicode bit order).

A canonical XML document is encoded in UTF-8.

Ignorable white space is considered significant and is treated equivalently to data.
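
The rules above can be sketched with Python's standard xml.sax module. This is a minimal illustration, not a reference implementation: the names escape_datachar, FirstFormWriter, and canonicalize_first_form are hypothetical, and sorting attribute names by code point is an assumed reading of "Unicode bit order".

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Replacements required by the Datachar production.
_ESCAPES = {
    "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;",
    "\t": "&#9;", "\n": "&#10;", "\r": "&#13;",
}


def escape_datachar(text):
    """Escape character data or an attribute value per Datachar."""
    return "".join(_ESCAPES.get(ch, ch) for ch in text)


class FirstFormWriter(ContentHandler):
    """Collects first-form output from SAX events (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def startElement(self, name, attrs):
        self.parts.append("<" + name)
        # Assumption: code point order stands in for "Unicode bit order".
        for attr_name in sorted(attrs.getNames()):
            value = escape_datachar(attrs.getValue(attr_name))
            self.parts.append(' %s="%s"' % (attr_name, value))
        self.parts.append(">")

    def endElement(self, name):
        # Empty elements are always written as a start tag plus an end tag.
        self.parts.append("</%s>" % name)

    def characters(self, content):
        self.parts.append(escape_datachar(content))

    def ignorableWhitespace(self, whitespace):
        # Ignorable white space is treated the same as data in this form.
        self.parts.append(escape_datachar(whitespace))

    def processingInstruction(self, target, data):
        self.parts.append("<?%s %s?>" % (target, data))


def canonicalize_first_form(xml_bytes):
    """Return the first canonical form of a document as UTF-8 bytes."""
    handler = FirstFormWriter()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.parts).encode("utf-8")
```

Comments, the XML declaration, and the document type declaration produce no events in this handler, so they do not appear in the output. Feeding the output back into canonicalize_first_form should return the same bytes.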

Second XML Canonical Form

Modified to ensure that literals are surrounded by single quotes.

This canonical form is identical to the first form, with one significant addition. All XML processors are required to report the name and external identifiers of notations that are declared and referred to in an XML document (section 4.7 of the XML specification); those reports are reflected in declarations in this form, presented in lexicographic order.

Note that all public identifiers must be normalized before being presented to applications (section 4.2.2 of the XML specification).

System identifiers are normalized on output to be relative to the input document, if that is possible, with the shortest such relative URI. All other URIs must be absolute. Any hash mark and fragment ID, if erroneously present on input, are removed. Any non-ASCII characters in the URI must be escaped as specified in the XML specification (section 4.2.2).
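
One possible reading of this normalization is sketched below in Python. The function names are hypothetical. The sketch assumes the system identifier has already been resolved to an absolute URI, and it treats "relative to the input document" as "same scheme and host, expressed as a path relative to the document's directory"; whether that always yields the shortest relative URI is not settled by the text above.

```python
import posixpath
from urllib.parse import urldefrag, urlsplit


def escape_non_ascii(uri):
    """Escape each non-ASCII character as %HH for every byte of its UTF-8 form."""
    escaped = []
    for ch in uri:
        if ord(ch) < 128:
            escaped.append(ch)
        else:
            escaped.extend("%%%02X" % byte for byte in ch.encode("utf-8"))
    return "".join(escaped)


def normalize_system_id(system_id, document_uri):
    """Normalize an absolute system identifier against the input document URI."""
    # Drop any hash mark and fragment ID.
    uri, _fragment = urldefrag(system_id)
    uri = escape_non_ascii(uri)

    target = urlsplit(uri)
    base = urlsplit(document_uri)
    # Assumption: only URIs on the same scheme and host can be made relative.
    if (target.scheme, target.netloc) != (base.scheme, base.netloc):
        return uri

    base_dir = posixpath.dirname(base.path) or "/"
    relative = posixpath.relpath(target.path, start=base_dir)
    if target.query:
        relative += "?" + target.query
    return relative
```

For example, under these assumptions the identifier http://example.org/docs/img/logo.gif#top in a document at http://example.org/docs/test.xml becomes img/logo.gif, while an identifier on a different host is left absolute.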

CanonXML2    ::= DTD2? CanonXML
DTD2         ::= '<!DOCTYPE ' Name ' [' #xA Notations? ']>' #xA
Notations    ::= ( '<!NOTATION ' Name ' '
			(('PUBLIC ' PubidLiteral ' ' SystemLiteral)
			|('PUBLIC ' PubidLiteral)
			|('SYSTEM ' SystemLiteral))
			'>' #xA )*
PubidLiteral ::= "'" (PubidChar - "'")* "'"
SystemLiteral ::= "'" [^']* "'"
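
As an illustration of the productions above, the following Python sketch writes the DTD2 prefix from a mapping of notation names to external identifiers. All function names are hypothetical, system identifiers are assumed to be normalized already, and rejecting literals that contain an apostrophe is an assumption made here because such a literal cannot be placed inside single quotes.

```python
def normalize_public_id(public_id):
    """Collapse runs of white space to one space and trim both ends."""
    return " ".join(public_id.split())


def single_quoted(literal):
    """Wrap a literal in single quotes, as this canonical form requires."""
    if "'" in literal:
        # Assumption: the form does not say how to handle this case.
        raise ValueError("literal contains an apostrophe: %r" % literal)
    return "'" + literal + "'"


def notation_declaration(name, public_id=None, system_id=None):
    """Render one alternative of the Notations production."""
    if public_id is not None:
        ext = "PUBLIC " + single_quoted(normalize_public_id(public_id))
        if system_id is not None:
            ext += " " + single_quoted(system_id)
    elif system_id is not None:
        ext = "SYSTEM " + single_quoted(system_id)
    else:
        raise ValueError("notation %r has no external identifier" % name)
    return "<!NOTATION %s %s>\n" % (name, ext)


def doctype_second_form(root_name, notations):
    """Render DTD2; `notations` maps name -> (public_id, system_id)."""
    body = "".join(
        notation_declaration(name, *notations[name])
        for name in sorted(notations)  # lexicographic order by name
    )
    return "<!DOCTYPE %s [\n%s]>\n" % (root_name, body)
```

The returned string would be written, encoded as UTF-8, ahead of the first-form output for the same document.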

The requirement of this canonical form differs slightly from that of the XML specification itself in that all declared notations must be listed, not just those which were referred to. Should that change? SAX supports it easily.

Third XML Canonical Form

This canonical form is identical to the second form, with two significant exceptions reflecting requirements placed on validating XML processors:

This builds on the grammar productions included above.

CanonXML3    ::= DTD3? CanonXML
DTD3         ::= '<!DOCTYPE ' Name ' [' #xA Notations? Unparsed? ']>' #xA
Unparsed    ::= ( '<!ENTITY ' Name ' '
			(('PUBLIC ' PubidLiteral ' ' SystemLiteral)
			|('SYSTEM ' SystemLiteral))
			' NDATA ' Name
			'>' #xA )*

The requirement of this canonical form differs slightly from that of the XML specification itself in that all declared unparsed entities must be listed, not just those which were referred to. Should that change? SAX supports it easily.
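
To illustrate the remark about SAX, the sketch below uses the DTDHandler interface in Python's xml.sax to collect every declared notation and unparsed entity, whether or not the document refers to it, and then renders the Unparsed production. The names DeclarationCollector, collect_declarations, and unparsed_declarations are hypothetical, and identifiers are written as reported by the parser without the normalization described for the second form.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, DTDHandler


class DeclarationCollector(DTDHandler):
    """Records all notation and unparsed entity declarations."""

    def __init__(self):
        self.notations = {}  # name -> (public_id, system_id)
        self.unparsed = {}   # name -> (public_id, system_id, notation_name)

    def notationDecl(self, name, publicId, systemId):
        self.notations[name] = (publicId, systemId)

    def unparsedEntityDecl(self, name, publicId, systemId, ndata):
        self.unparsed[name] = (publicId, systemId, ndata)


def collect_declarations(xml_bytes):
    """Parse a document and return the collector with its declarations."""
    collector = DeclarationCollector()
    parser = xml.sax.make_parser()
    parser.setContentHandler(ContentHandler())  # element content is ignored here
    parser.setDTDHandler(collector)
    parser.parse(io.BytesIO(xml_bytes))
    return collector


def unparsed_declarations(unparsed):
    """Render the Unparsed production, sorted by entity name."""
    lines = []
    for name in sorted(unparsed):
        public_id, system_id, ndata = unparsed[name]
        if public_id is not None:
            ext = "PUBLIC '%s' '%s'" % (public_id, system_id)
        else:
            ext = "SYSTEM '%s'" % system_id
        # XML syntax requires white space before the NDATA keyword.
        lines.append("<!ENTITY %s %s NDATA %s>\n" % (name, ext, ndata))
    return "".join(lines)
```

Because the handler is called once per declaration, listing only the notations and entities that are actually referred to would take extra bookkeeping, while listing all of them takes none.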

xml-feedback@java.sun.com