Other features

Schema objects and package APIs include a number of additional features, each introduced in a specific release. They are controlled through arguments, alternative classes or module parameters.

XSD 1.0 and 1.1 support

Since release v1.0.14 the library supports XSD 1.1 through the class xmlschema.XMLSchema11. Use this class for XSD 1.1 schemas instead of the default class xmlschema.XMLSchema, which is bound to the XSD 1.0 validator xmlschema.XMLSchema10.

The XSD 1.1 validator can also be used for validating XSD 1.0 schemas, except for a restricted set of cases related to content extension in a complexType (the extension of a complex content with simple base is allowed in XSD 1.0 and forbidden in XSD 1.1).

CLI interface

Starting with version v1.2.0, the package provides a command-line interface with three console scripts:

xmlschema-validate: Validate a set of XML files.
xmlschema-xml2json: Decode a set of XML files to JSON.
xmlschema-json2xml: Encode a set of JSON files to XML.

XSD validation modes

Since version v0.9.10 the library uses the XSD validation modes strict/lax/skip, both for schemas and for XML instances. Each validation mode defines a specific behaviour:

strict: Schemas are validated against the meta-schema. The processor stops when an error is found in a schema or during the validation/decode of XML data.
lax: Schemas are validated against the meta-schema. The processor collects the errors and continues, replacing missing parts with wildcards where necessary. Undecodable XML data are replaced with None.
skip: Schemas are not validated against the meta-schema. The processor does not collect any errors. Undecodable XML data are replaced with the original text.

The default mode is strict, both for schemas and for XML data. The mode is set with the validation argument, provided either when creating the schema instance or when validating/decoding XML data. For example, you can build a schema in strict mode and then decode XML data with the validation argument set to ‘lax’.

Note

From release v1.1.1 the iter_decode() and iter_encode() methods also propagate errors in skip validation mode. The errors generated in skip mode are discarded by the top-level methods decode() and encode().

Namespaces mapping options

Since the earliest releases the validation/decoding/encoding methods have included the optional namespaces argument, which can be used to provide a custom namespace mapping. In versions of the library prior to 3.0 the XML declarations are loaded and merged over the custom mapping while traversing the XML document, using alternative prefixes in case of collision.

Version 3.0 improved the processing of the XML document's namespace information: by default an exact namespace mapping is maintained between the XML source and the decoded data.

The feature is available in both the decoding and encoding APIs through the new converter option xmlns_processing, which changes how the namespace declarations of the XML document are processed.

The preferred mode is ‘stacked’, which maintains a stack of namespace mapping contexts, with the active context always matching the namespace declarations defined in the XML document. In this case the namespace map is updated dynamically, adding and removing the XML declarations found in internal elements. This choice provides the most accurate mapping of the namespace information of the XML document.

Use the option value ‘collapsed’ to load all namespace declarations into a single map. In this case the declarations are merged into the namespace map of the converter, using alternative prefixes in case of collision. This is the legacy behaviour of versions of the library prior to 3.0.

With ‘root-only’ only the namespace declarations of the XML document root are loaded. In this case you are expected to provide the internal namespace information with the namespaces argument.

Use ‘none’ to skip loading the namespace declarations of the XML document altogether. Use this option if you don’t want to map namespaces to prefixes or if you want to provide a fully custom namespace mapping.

By default the xmlns_processing option is set automatically, depending on the capability of the converter class and on the XML data source. The option is also available for encoding with updated converter classes that can retrieve xmlns declarations from decoded data (e.g. xmlschema.JsonMLConverter or the default converter). For decoding the default is set to ‘stacked’ or ‘collapsed’; for encoding the default can also be ‘none’ if no namespace declaration can be retrieved from XML data (e.g. xmlschema.ParkerConverter).
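
To make the idea behind the ‘stacked’ mode more concrete, here is a standard-library-only sketch that tracks a stack of namespace contexts while traversing a document. It is not the library's implementation and does not call xmlschema at all; the function name and the sample document are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document: the prefix 'p' is redeclared inside <child>.
XML_SOURCE = b"""<root xmlns:p="urn:one">
  <child xmlns:p="urn:two"><p:leaf/></child>
  <p:leaf/>
</root>"""


def stacked_contexts(source):
    """Yield (tag, active namespace map) at the start of each element."""
    stack = [{}]   # one namespace context per open element
    pending = {}   # declarations reported just before the next 'start'
    events = ('start-ns', 'start', 'end')
    for event, payload in ET.iterparse(source, events=events):
        if event == 'start-ns':
            prefix, uri = payload
            pending[prefix] = uri
        elif event == 'start':
            # Inner declarations shadow the outer ones for this subtree.
            context = {**stack[-1], **pending}
            stack.append(context)
            pending = {}
            yield payload.tag, context
        else:
            # Leaving the element restores the enclosing context.
            stack.pop()


result = list(stacked_contexts(io.BytesIO(XML_SOURCE)))
for tag, context in result:
    print(tag, context)
```

In this sketch the prefix p resolves to urn:two only inside child and to urn:one elsewhere. A single collapsed map cannot hold both bindings under the same prefix, which is why that mode falls back to alternative prefixes on collision.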

Lazy validation

From release v1.0.12 the document validation and decoding APIs have an optional argument lazy=False, which can be changed to True to operate with a lazy xmlschema.XMLResource. Lazy mode is useful for validating and decoding big XML data files because it consumes less memory.

From release v1.1.0 the lazy mode can also be set with a non-negative integer. Zero is equivalent to False; a positive value activates lazy mode and also defines the lazy depth to use for traversing the XML data tree.

Lazy mode works better with validation, because validation does not need converters for shaping decoded data.

XML entity-based attacks protection

The loading of XML data resources is protected by the SafeXMLParser class, a subclass of the pure Python version of XMLParser that forbids the use of entities. The protection is applied both to XSD schemas and to XML data. This feature is controlled by the defuse argument of XMLSchema.

By default this argument has the value ‘remote’, which means that the protection is applied only to XML data loaded from remote locations. With ‘nonlocal’ all XML data are defused except local files. The other accepted values are ‘always’ and ‘never’, which apply the protection to all XML data or to none, respectively.
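
The general technique of forbidding entities can be sketched with the standard library alone. This is not the code of SafeXMLParser; the exception class and the function below are hypothetical names, and the sketch only shows how expat handlers can reject a document as soon as it declares an entity.

```python
import xml.parsers.expat


class EntityForbiddenError(ValueError):
    """Raised by this sketch when a document declares an entity."""


def reject_entities(xml_bytes):
    """Parse xml_bytes, raising EntityForbiddenError on entity declarations."""
    parser = xml.parsers.expat.ParserCreate()

    def forbid_entity(name, *_details):
        raise EntityForbiddenError(f"entity declaration {name!r} is not allowed")

    # expat calls these handlers for every declaration found in the DTD,
    # before any entity reference in the content is expanded.
    parser.EntityDeclHandler = forbid_entity
    parser.UnparsedEntityDeclHandler = forbid_entity
    parser.Parse(xml_bytes, True)


SAFE_XML = b"<note>plain text</note>"
UNSAFE_XML = b'<!DOCTYPE note [<!ENTITY big "aaaaaaaaaa">]><note>&big;</note>'

reject_entities(SAFE_XML)  # passes silently

try:
    reject_entities(UNSAFE_XML)
    blocked = False
except EntityForbiddenError:
    blocked = True

print(blocked)
```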

Access control on accessing resources

From release v1.2.0 the schema class includes an argument named allow for protecting access to XML resources identified by a URL or filesystem path. By default all types of URLs are allowed. Provide a different value to restrict the set of URLs that the schema instance can access:

all: All types of URL and file paths are allowed.
remote: Only remote resource URLs are allowed.
local: Only file paths and file-related URLs are allowed.
sandbox: Only the file paths and URLs under the directory path identified by the source argument or the base_url argument are allowed.
none: No URL based or file path access is allowed.

Warning

To protect services that are freely accessible for validation (e.g. an online web validator with a form for loading a schema and/or an XML instance), the recommendation is to provide ‘always’ for the defuse argument and ‘none’ for the allow argument. These settings prevent attacks on your local filesystem, whether through direct paths or through injection in XSD schema imports or includes.

For XSD schemas, if you want to permit imports of namespaces located on other web services, you can provide ‘remote’ for the allow argument and pass an XMLResource instance, initialized with allow=’none’, as the source argument for the main schema.

Processing limits

Since release v1.0.16 the package includes a module that groups the constants defining processing limits, which generally protect against attacks designed to exhaust system resources. These limits usually don’t need to be changed, but they can be set at the module level for situations where a different setting is needed.

Limit on XSD model groups checking

Model groups of the schemas are checked against restriction violations and Unique Particle Attribution violations. To avoid XSD model recursion attacks a depth limit of 15 levels is set. If this limit is exceeded an XMLSchemaModelDepthError is raised, the error is caught and a warning is generated. If you need to set a higher limit for checking all your groups you can import the library and change the value of MAX_MODEL_DEPTH in the limits module:

>>> import xmlschema
>>> xmlschema.limits.MAX_MODEL_DEPTH = 20

Limit on XML data depth

A limit of 9999 on maximum depth is set for XML validation/decoding/encoding to avoid attacks based on extremely deep XML data. To increase or decrease this limit, change the value of MAX_XML_DEPTH in the limits module after importing the package:

>>> import xmlschema
>>> xmlschema.limits.MAX_XML_DEPTH = 1000

Limit on parsable XML elements

A limit of 1,000,000 on the maximum number of elements parsable by a non-lazy xmlschema.XMLResource instance is set to avoid attacks based on heavy loads of XML data. Lazy resources are not bound by this limit. To increase or decrease this limit, change the value of MAX_XML_ELEMENTS in the limits module after importing the package:

>>> import xmlschema
>>> xmlschema.limits.MAX_XML_ELEMENTS = 10 ** 8

Limit on loadable schema sources

A limit of 1,000 schemas per global map is set. To increase or decrease this limit, change the value of MAX_SCHEMA_SOURCES in the limits module after importing the package:

>>> import xmlschema
>>> xmlschema.limits.MAX_SCHEMA_SOURCES = 1500

Translations of parsing/validation error messages

From release v1.11.0 translation of parsing/validation error messages can be activated:

>>> import xmlschema
>>> xmlschema.translation.activate()

Note

Activation depends on the default language of your environment and on whether it matches one of the translations provided with the library. You can build a custom translation from the template included in the repository (xmlschema/locale/xmlschema.pot) and then use it in your runs by providing the localedir and languages arguments to the activation call. See Translation API for information.

By default, translations do not interfere with other translations installed at runtime, and they can be deactivated afterwards:

>>> xmlschema.translation.deactivate()

Schema loaders

From v4.0 onwards it is possible to customize the loading phase. When a schema instance is initialized, it connects to or creates an xmlschema.XsdGlobals instance and an xmlschema.SchemaLoader instance for processing declared or explicit imports and includes. The default loader class processes namespace imports, ignoring further import statements for the same namespace. This strategy safely avoids component collisions, considering that schemas in other namespaces are usually edited and changed by others. If you need a loader that imports every declared location, you can provide xmlschema.LocationSchemaLoader through the loader_class option. For the same strategy you can alternatively provide xmlschema.SafeSchemaLoader, which tries all the unloaded locations without raising in case of collision, although in this case the loading phase could be slower.

Schema settings

From v4.2 each composition of schema instances is based on an immutable xmlschema.settings.SchemaSettings instance stored in the global maps, in order to handle schema and XML resource options in a secure way.

These settings are created when a new schema instance is created. They are based on the default schema settings, which are stored at package level, and are overridden by the provided arguments. The arguments for xmlschema.XMLResource are now accepted as optional keyword arguments.

New schema instances can also be created with xmlschema.XMLSchemaBase.from_settings(), which offers a more flexible way of creating schemas that share the same settings.

Default schema settings can be changed after package import using the class method xmlschema.settings.SchemaSettings.update_defaults() and restored to the library defaults with xmlschema.settings.SchemaSettings.reset_defaults(). This way of managing settings may be refined in future releases, but these methods can already be used to build configuration management in applications that use this library, if needed.