This post is a response to Mark Baker’s post, Validation Considered Harmful. He is partly right about the harm validation does in the specific use case he discusses, but I disagree with the generalization he implies. Whether validation is useful depends on the schema design, and document validation is not a sacred cow worth picking on.
Mark considers a scenario in which party B sends a document to party A for some processing. The schema A and B are following has been updated to accept a new value for some field. Old values are still allowed.
Assuming that A and B are using W3C’s XML Schema, the schema may have been changed from
    <simpleType name='someInt'>
      <restriction base='integer'>
        <minInclusive value='1'/>
        <maxInclusive value='3'/>
      </restriction>
    </simpleType>

to the following:

    <simpleType name='someInt'>
      <restriction base='integer'>
        <minInclusive value='1'/>
        <maxInclusive value='4'/>
      </restriction>
    </simpleType>
Party B uses the new schema and sends a value of “4”, which the new schema allows but the old version does not. Party A does not understand the new schema yet, and hence its schema validator rejects the new message. Mark concludes that validation is therefore harmful, and I disagree.
This example is an instance of the general schema extensibility problem. Although the schema change was compatible (i.e., the new value space for this integer is a superset of the old value space), A and B are using different versions of the schema, so validation will fail whenever the sending party uses constructs from a later version of the schema.
My first comment is that A may fail in this example whether or not it validates. Even if A does not validate the incoming XML, several things can still fail within the code that processes the message. For example, a database constraint somewhere in the back end may fail when it sees a value out of range. Or some GUI may fail to render because it did not expect the new value. In that sense, validation provides a first line of defense, and there is nothing wrong with having one. Failing early is better than failing late.
Secondly, in this particular example, there are two ways to address the schema extensibility.
If party A is the only one processing the XML and B is always the sender, A needs to own the schema. That lets A control the evolution of the schema along with the code changes in its implementation. If some other arbitrary party is allowed to extend the schema in the way described in Mark’s post, the receiving party can easily be made to fail.
If the schema needs to be controlled by some other party, then that party needs to take an entirely different approach to extensibility. It could, for example, use extensibility points in the schema (using xs:any) to introduce changes. Alternatively, when the changes are significant, it could create a new version of the schema in a new namespace. Both these approaches will make B’s messages schema-valid, and preserve forwards and backwards compatibility. See this and this for some background.
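To make the extensibility-point idea concrete, here is a rough sketch of what such a schema might look like. Everything besides the someInt type is my own assumption for illustration: the message and level elements, and the example.com namespaces, are hypothetical. Note that a wildcard admits new elements rather than new values for an existing type, so in this sketch the widened value has to travel in an extension element.

```xml
<!-- Hypothetical v1 schema: someInt keeps its original range, and the
     message content model ends with an extension point. -->
<schema xmlns='http://www.w3.org/2001/XMLSchema'
        xmlns:tns='http://example.com/msg/v1'
        targetNamespace='http://example.com/msg/v1'
        elementFormDefault='qualified'>

  <simpleType name='someInt'>
    <restriction base='integer'>
      <minInclusive value='1'/>
      <maxInclusive value='3'/>
    </restriction>
  </simpleType>

  <element name='message'>
    <complexType>
      <sequence>
        <element name='level' type='tns:someInt'/>
        <!-- Extension point: any number of elements from other namespaces.
             processContents='lax' validates them only if a declaration
             for them is available to the validator. -->
        <any namespace='##other' processContents='lax'
             minOccurs='0' maxOccurs='unbounded'/>
      </sequence>
    </complexType>
  </element>
</schema>
```

Under these assumptions, B could send something like the following, and A's validator would accept it even though A has never seen the extension namespace:

```xml
<message xmlns='http://example.com/msg/v1'
         xmlns:ext='http://example.com/msg/ext'>
  <!-- Still within the range A already understands. -->
  <level>3</level>
  <!-- Hypothetical extension element carrying the new value. -->
  <ext:extendedLevel>4</ext:extendedLevel>
</message>
```

Restricting the wildcard to ##other keeps the content model unambiguous and makes it obvious which parts of a message are extensions. The v1 suffix in the namespace hints at the other option: a significant change would get its own namespace instead.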
My take is that schema validation is useful and can help catch bugs early. Whether an application should fail immediately after detecting a schema violation is an implementation choice. In some cases it can simply ignore the failures and continue processing; in others it needs to fail immediately. For instance, if party B sends some new XML elements that party A does not understand and does not care about, A could simply ignore the validation errors. But if A receives a different set of values for some elements that it needs to decipher, it may choose to fail early.