Current Status of XSV: Coverage, Known Bugs, etc.

Applies to XSV 2.10-1 of 2005-04-22 13:10:49

Henry S. Thompson
Richard Tobin
22 April 2005

1.   What is XSV

XSV (XML Schema Validator) is an open source (GPLed) work-in-progress attempt at a conformant schema-aware processor, as defined by the W3C Recommendation (REC) XML Schema Part 1: Structures of 2 May 2001. It has been developed at the Language Technology Group of the Human Communication Research Centre in the Division of Informatics at the University of Edinburgh, with support for one of us (Thompson) from the World Wide Web Consortium.

2.   How can I use XSV

2.1.   Using XSV online

The simplest way to use XSV is via a form-based interface on the web.

Please note the description below of a major change in the handling of numeric exponents in this release of XSV.

2.2.   Running XSV at your own installation

2.2.1.   Win32 one-click installation

I've packaged the current version as a self-installing package for Win32 platforms: just fetch it, run it, add the installation directory to your PATH, and then run:

> xsv [flags] target [schemas ...]
target
The document to be processed. It must be a URL, relative or absolute, which means forward slashes only, even on Win32 (e.g. file:///C:/Project/xxx.xml).
schemas
Schema documents to process it with, also URLs.
-o errfile
Output error file to errfile rather than stderr.
-s stylefile
Include an XSL style PI to stylefile in the error output.
-r [alt|ind]
Reflect the augmented document infoset as an XML file to stdout. Follow with alt to force old-style (alternating normal form) reflection, or ind (the default) for new-style (individual normal form) reflection. Use -r -r to get all schema components other than those of the schema for schemas, and -r -r -r to get the complete PSVI reflection including the schema for schemas.
-w
Include warnings in error output.
-t
Show stage timings.
-k
Attempt instance validation even if schema(s) has/have errors.
-i
All inputs should be schemas; assume they are meant to be complete and check them as such.
-D
Use DTD to pre-validate, not built-in schema-for-schemas.
-l
Scan the whole document for schema location hints, not just root and new-namespace-binding-introducers.
-E elt
Force document element to be named elt, an expanded name (i.e. either an unqualified simple name in no known namespace, or a name of the form {namespaceName}localName).
-T type
Force document element to be validated against the type definition named type, an expanded name as for -E.
-N
Don't dereference namespace URIs looking for schema documents.
-e
Preserve the low-level error transcript file.
-n
Output the input document with normalized values and defaults.
-u URI
Provide a base URI for target and schemas.
-d
Show backtrace if crash occurs.
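
The URL and expanded-name conventions above are easy to get wrong when scripting XSV, so here is a minimal Python sketch of assembling an argument list. The helper names (windows_path_to_file_url, expanded_name, build_xsv_args), the example paths and the example namespace are all hypothetical, introduced only for illustration; only the flags (-o, -E) and the argument order come from the list above. The sketch assumes an absolute drive-letter path, and it only builds the list; it does not run XSV.

```python
from urllib.parse import quote


def windows_path_to_file_url(path):
    """Turn an absolute Win32 path such as C:\\Project\\xxx.xml into a file: URL."""
    forward = path.replace("\\", "/")
    # Percent-encode anything unusual, but keep '/' and the drive-letter ':'.
    return "file:///" + quote(forward, safe="/:")


def expanded_name(local_name, namespace_name=None):
    """Format a name the way -E and -T expect: {namespaceName}localName."""
    if namespace_name is None:
        return local_name
    return "{%s}%s" % (namespace_name, local_name)


def build_xsv_args(target, schemas=(), errfile=None, element=None):
    """Assemble an argument list of the form: xsv [flags] target [schemas ...]."""
    args = ["xsv"]
    if errfile is not None:
        args += ["-o", errfile]
    if element is not None:
        args += ["-E", element]
    args.append(target)
    args.extend(schemas)
    return args


target = windows_path_to_file_url(r"C:\Project\xxx.xml")
args = build_xsv_args(
    target,
    schemas=["file:///C:/Project/xxx.xsd"],
    errfile="errors.xml",
    element=expanded_name("order", "http://example.org/ns"),
)
print(args)
```

Passing the resulting list to something like subprocess.run avoids shell quoting problems with the curly braces in expanded names.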

2.2.2.   Source distributions for the more adventurous

You can download the (Python) sources from the W3C public CVS repository, install Python 2.3, and install PyLTXML to get the necessary XML validating library. Be sure to use the most recent PyLTXML release, currently PyLTXML-1.3, release 7; RPMs for a number of architectures are now available. Then do:

> [set PYTHONPATH to wherever you installed XSV sources]
> python .../XSV/commandLine.py ...

No, the above instructions aren't sufficiently detailed, but you probably don't want the sources unless you can figure out how to make it work :-)

2.2.3.   Linux RPMs and DEBs

Packages are now available for those running some versions of Linux:

Redhat installable
Debian installable
Linux source

These depend on PyLTXML-1.3 (and Python itself); see above.

2.2.4.   Source tarball

A simple tarball is also available, suitable for installation using Python's distutils:

> [cd to wherever you unpacked the tarball]
> python setup.py install

3.   What is implemented

The basic framework of schema checking and instance schema-validation is implemented. Some details of both are not yet filled in.

Potential breaking change: I've implemented a new bounded-cost approach to translating content models with numeric exponents. In obscure corner cases involving numeric ranges nested inside numeric ranges, not known to occur in any existing schema documents, some valid element sequences will be labelled invalid. Content models which are 'at risk' from this behaviour are noted with a schemaWarning element which says (exaggerating somewhat) "violation of constraints on exponents". Please let me know if you see this message.
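
To make "numeric ranges nested inside numeric ranges" concrete, here is a small Python sketch. It does not reproduce XSV's translation in any way: it uses Python's re module on strings of a single letter as a stand-in for a content model, and naive_unrolled_size is a hypothetical helper that only illustrates why fully unrolling such models gets expensive.

```python
import re

# Stand-in for a content model: an inner particle 'a' with minOccurs=2,
# maxOccurs=3, wrapped in a group that itself has minOccurs=2, maxOccurs=3.
NESTED = re.compile(r"(?:a{2,3}){2,3}")


def accepted_lengths(limit):
    """Return the lengths n <= limit for which n consecutive 'a's are valid."""
    return [n for n in range(limit + 1) if NESTED.fullmatch("a" * n)]


def naive_unrolled_size(ranges):
    """Copies of the innermost particle if every finite range is unrolled.

    ranges is a list of (min, max) pairs, innermost first.
    """
    size = 1
    for _minimum, maximum in ranges:
        size *= maximum
    return size


print(accepted_lengths(12))                        # lengths 4 through 9
print(naive_unrolled_size([(2, 3), (2, 3)]))       # 9
print(naive_unrolled_size([(1, 100), (1, 100)]))   # 10000
```

The unrolled size is the product of the maxima, so it grows multiplicatively with each level of nesting; that is the kind of cost a bounded approach has to avoid.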

Here's a brief tabulation of implemented and unimplemented aspects of the REC:

3.1.   Implemented at least in part

Content-model validation
Attribute validation
Include
Import
Equivalence classes
Local and global element and attribute declarations
Type definition derivation by extension and restriction (constraints on valid restrictions not completely enforced for simple types)
Identity-constraint checking (key/unique/keyref)
Content and attribute wildcards
xsi:schemaLocation, xsi:noNamespaceSchemaLocation and, as last resort, dereferencing of namespace URIs to find schema documents
xsi:null
xsi:type
Opportunistic validation inside <any> and <anyAttribute>
Redefinition
Whitespace processing
Partial support for simple types, including enumeration, pattern (partially), length of lists, min/max, unions
ID/IDREF
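
For readers unfamiliar with the schema location hints listed above, the following Python sketch shows how they are laid out in an instance document: xsi:schemaLocation holds whitespace-separated pairs of namespace name and schema document URL, and xsi:noNamespaceSchemaLocation holds a single URL. This is an illustration using the standard library's ElementTree, not XSV's own code; the function name, the sample document and its example.org namespaces are assumptions made for the example, and only the root element is inspected.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

DOC = """<po:order xmlns:po="http://example.org/po"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://example.org/po po.xsd
                        http://example.org/addr addr.xsd"/>"""


def schema_location_hints(root):
    """Return (namespace, location) pairs plus the no-namespace hint, if any."""
    tokens = root.get("{%s}schemaLocation" % XSI, "").split()
    # Pair up even and odd tokens: namespace name, then schema document URL.
    pairs = list(zip(tokens[0::2], tokens[1::2]))
    no_ns = root.get("{%s}noNamespaceSchemaLocation" % XSI)
    return pairs, no_ns


pairs, no_ns = schema_location_hints(ET.fromstring(DOC))
print(pairs)
print(no_ns)
```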

3.2.   Not implemented yet

The rest of simple type conformance, esp. duration types

3.3.   Recent Changes

Correct validation with <xs:all> is finally implemented
New switch: -d, see above
Major bug fixes wrt checking restriction of complex type definitions involving numeric exponents.
Change in support for numeric exponents, see above
New switches: -u and -n, see above
Bug fixes wrt base URIs
Keyword-spotting with NSA/MI6 backdoor [backed out on 2 April]
Better behaviour wrt missing schema documents
Default no longer preserves low-level error transcript file
Better enforcement of block/final wrt type derivation
Bug fixes wrt PSVI, mixed derivation errors
Handle HTTP redirects better
Improved robustness when running with broken schemas (-k)
Bug fixes in the area of patterns, mixed-content derivations
Various modest small bug fixes
Changed alternating reflection to not dump all the schemas. See discussion of -r above
Added -N switch to suppress dereferencing of namespace URIs
Allow single schema on [stdin] with -i
Fixed infinite loop bug if namespace pointed to plain XHTML
List- and date-valued keys now work
Now uses values for key/unique/keyref checking
Support for ID/IDREF/IDREFS checking added
Major change to using values for enumeration and fixed checks
Modest support for date, time and dateTime
Improved handling of default and fixed values, including implementing them for mixed-content elements
Changed version numbering
Implemented group redefinition
Fix obscure bugs if namespace URIs had chars >127
Fail more gracefully in absence of a base type for restriction
Change 'rel' command-line arg to 'alt', for alternating normal form, which is what it always was
Major source-code restructuring
Checking of NMTOKEN, Name and NCName built-in types
Partial support for pattern facet (basically everything except named classes other than \d \s \w \D \S \W)
Support -i with -r (just the schema components get reflected)
Add top-level force element and/or type params
Change reflection format
Support enumeration and length facets on lists, and enumeration on unions
Fix bug which prevented explicit derivation by extension from anySimpleType
Fix pair of bugs wrt attribute wildcards and urtype
Handle redefining type defs with simple content
Improve management of file redundancy
Allow for xml:base by updating builtin schema for schemas
Fix bug in handling recursive keys
Fix some inadvertent bugs introduced by speedup
Performance improvements -- approx. 25% speedup
Flag attempt to define a complex type by restricting a simple type
Fix corner case bug when schema doc has schemaLoc
Fix longstanding bug with multiple local element decls
Enforce that types of elements in content model derived by restriction must be restrictions of their corresponding base types
Introduce FSM-based checking of derivation by restriction -- big change, potential backwards incompatibilities!
Fix bug introduced by include checking
Enforce include restrictions more carefully, enforce <import> required for QName reference.
Check for schemaLoc on root of schema docs before validating them
Fixed some base URL bugs breaking e.g. relative schemaLoc URLs, and then a bug in one of the fixes
Fix re-introduced multiple include bug
Switched from using DTD to pre-validate schema documents to using schema-for-schemas -- big change, potential backwards incompatibilities!
Fix crash when whiteSpace specified explicitly
Improved reflection of model group definitions, identity constraint definitions
Fix intersection of ##local and ##other bug
Fix two bugs wrt max/min constraints
Fixed problem with multiple chameleon includes
Fixed long-standing failure to propagate keys and uniques upwards
Fixed long-standing failure to correctly implement reference to key/unique from keyref using QNames
Changed schema location policy -- now will only look for schemaLoc attrs on doc elt and on change to previously unseen namespace. Such namespaces will also be dereferenced in the absence of a schemaLoc (this is new).
Made all schema document access messages consistent
Fixed bug which hid error message when non-schema document presented as schema; fixed crash when list of union failed
Fixed bug in type definition derivation chain checking; added better support for attribute defaults.
Fixed some crashes with -i and an obscure chameleon include bug
Reflected infoset in line with published schemas
Support RDDL at e.g. namespace URIs
bug fixes wrt min/max, element fixed/default
Reflect annotations, oob attributes properly
Handle extending with empty content model correctly
Reflect more thoroughly
Complete standalone schema checking, i.e. assuming this is all you're going to get
Multiple keys is error, not warning
Handle bogus xsi attrs
Better crash logging
Obscure bug in defaulted NS attributes fixed
Added full independent schema check switch (-i)
Fixed bug in use of DTD to pre-validate schemas; this caused serious and baffling problems, sorry.
Supports 'decimal', not 'number'
Supports all renamings, http://www.w3.org/2001/XMLSchema namespace
Improve stylesheet wrt XML parsing error output
XPath implementation now implements NS prefixes properly
Improve efficiency for large schemas
support for whitespace processing added
PSVI now mostly supported, can be reflected with -r switch
Fixed missing DTD complaint
Vintage 2000-09-22 changes restricting what can be specified in conjunction with element declarations of the form <element ref="..."> now implemented
Important: All schemas are now validated against the DTD for schemas before being loaded, even if they lack a DOCTYPE of their own. This may mean errors are found where none were before, or a change in error message. Feedback on this change is welcome.
Handle shadowing of e.g. elementFormDefault in <include>d schemas correctly
(Partial) support for restricting a simpleContent complexType with a nested simpleType
Check 'fixed' attribute values for correctness
Provide control of whether instance validation warnings appear (default is that they don't)
Improve bomb-proofing and recovery after crashes
Backlog of bug fixes, including forestalling crashes when <restriction>/<extension> are missing
Allow references to 'anyType' to work (oops)
Display more (more useful) output even if validator crashes
add -o outfile for command line invocation on e.g. win98 where capturing stderr is hard
Fix xmlschema-instance namespace, so xsi:schemaLocation works (oops)
losing itemType no longer causes crash
fix bug in restriction of lists
new version syntax, support for redefine of simple and complex type defns
Chameleon include: including a schema doc't with no target namespace into one with a target namespace does the right thing
No files on command line causes read from stdin
Bug fixes: handle nested attribute group references, more than two explicitly supplied schemas
Bug fixes: handle lists as element content correctly; allow simple types as type for document element; don't crash on restriction of type with simple content; don't crash on min/max for date/time types
Bug fixes: don't crash on missing group defn or attr. type or bogus XPath
Bug fixes: allow (but still don't implement) minInclusive on string types, don't crash if XPath ends with a '/'
Bug fixes: catch bad 'content' attr, don't crash after empty <group>, missing base
Better handling of some content-model errors in schemas
Allow appropriate facets on types derived by 'list' (still not actually enforced :-()
Bug fixes: require 'value' on facets; don't crash if simple type has element content; handle min/max facets on float/double
More bug fixes: catch complex type used for attributes or base of simple type cleanly; don't die if xsi:type encountered during lax validation
Don't require 'fixed' attributes to appear, fix obscure bug in use of xsi:type and no-target-namespace schemas
Allow <unique> fields to be missing without comment
More thorough enforcement of Unique Attribution (== determinism) constraint on content models, including checking for <any>-derived ambiguities
Bug-fixes to catch logged crashes: missing basetypes, bogus attr type derivation, minInclusive
Upgraded stylesheets to report schema errors and warnings properly
Fixed bugs in cases of missing attribute group definition, missing attribute type definition, missing base type definition
16-bit, XML output version now the default
now checks enumerated types, user-defined min-max bug fixed
Partial check of QName simple type conformance
Fix bad knock-on effect of failed import
Improve fidelity of lax validation; validate laxly instead of throwing error if no declaration found for document element
Fix bug in opportunistic validation of attribute values
Support for xsi:null
Fix bug causing bogus errors when restricting elements declared to have the urtype definition
Fix bug which made the <anyAttribute namespace="other"> in the schema for schemas overly generous
Try to catch all 404 (not found) errors better
Fixed ref-to-undeclared-elt bug
Fixed a bug causing a crash if you used an element with no content model at all == the ur-type by default, manifesting itself as an 'Attribute Error, note' crash
Now checks for and handles gracefully case where supplied file is not a schema document

3.4.   Known bugs/features

Patterns with more than 100 curly-braces ({}) will break the underlying Python substrate.
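
If you suspect a schema is close to this limit, a rough pre-check along the following lines may help. It is only a sketch: the function names are hypothetical, and it assumes that every '{' and '}' character counts individually towards the limit, which the note above does not actually specify. It also makes no attempt to skip escaped braces.

```python
def count_curly_braces(pattern):
    """Count '{' and '}' characters in a pattern facet value."""
    return pattern.count("{") + pattern.count("}")


def risky_patterns(patterns, limit=100):
    """Return the patterns whose brace count exceeds the limit."""
    return [p for p in patterns if count_curly_braces(p) > limit]


short = r"\d{3}-\d{4}"
long_ = "".join(r"[a-z]{2}" for _ in range(60))
print(count_curly_braces(short))
print(len(risky_patterns([short, long_])))
```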