Attention

This article is out of date. Check out the updated version.

If you're familiar with SQL you will know how schemas work; a definition of how data is stored will be the first thing you think about when creating a database. For various reasons it's now popular to store structured data in text files such as JSON or YAML. However, in a lot of cases, the schema of these files is not explicit. A format will be expected by the application which loads that data, but you may only find out your input is invalid when the application throws an error message. The expected format is implicit; there may be examples given with the program, but there is no formal specification of the format of the data expected.

There are two main validators I've looked at: kwalify for YAML and json-schema for JSON. There's also Rx, which I haven't investigated yet. Rx's website says "Since Rx itself is still undergoing design refinements, relying on any implementation in production would be a bad idea." However, it may still be worth investigating.

Schema validation in theory

A schema validator takes two files as input - a schema and a data file to validate. Usually the schema file is written in the same language as the data file. This is an example from PyKwalify, a tool which validates YAML files against a schema.

---

type: map
mapping:
  name:
    type: str
    pattern: N.*n

Above is the schema. This says that the layout of the file is a mapping - which is analogous to an associative array, or a dictionary object in Python. There is only one recognised key in this mapping, which is 'name'. The value attached to the 'name' key must be a string, and must fit the pattern "N.*n". Below is some sample data we want to check:

---
name: Néron

We could check this by running pykwalify -s schema.yaml -d data.yaml.

The schema validator will tell you whether the data file fits the schema, and if it doesn't, point out what's at fault. The schema file is also of course written to a schema. For json-schema at least, the schema in which other schemas are written is self-validating; it is written according to its own schema, making it similar to a quine in programming languages.
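To make the idea concrete, here is a minimal sketch of what a validator does with the schema and data shown above. It is not PyKwalify's implementation: the validate function is a hypothetical helper, it only understands the 'map' and 'str' types, and it assumes the pattern is applied as an unanchored regular expression search. The schema and data are written as Python dictionaries, which is what a YAML parser would hand back after loading the two files.

```python
import re


def validate(data, schema, path=""):
    """Return a list of error strings; an empty list means the data fits."""
    errors = []
    kind = schema.get("type")
    if kind == "map":
        if not isinstance(data, dict):
            return [f"{path or '/'}: expected a mapping"]
        rules = schema.get("mapping", {})
        for key, value in data.items():
            if key not in rules:
                # Only keys named in the schema are recognised.
                errors.append(f"{path}/{key}: key is not defined in the schema")
            else:
                errors.extend(validate(value, rules[key], f"{path}/{key}"))
    elif kind == "str":
        if not isinstance(data, str):
            errors.append(f"{path}: expected a string")
        elif "pattern" in schema and not re.search(schema["pattern"], data):
            # Assumption: the pattern is an unanchored regex search.
            errors.append(
                f"{path}: {data!r} does not match pattern {schema['pattern']!r}"
            )
    else:
        errors.append(f"{path or '/'}: unsupported type {kind!r}")
    return errors


# The schema and data from the example above, in parsed form.
schema = {"type": "map", "mapping": {"name": {"type": "str", "pattern": "N.*n"}}}

good = validate({"name": "Néron"}, schema)
bad = validate({"name": "Claudius", "age": 54}, schema)

print(good)  # no errors: the data fits the schema
for message in bad:
    print(message)  # one message per fault found
```

A real validator covers many more types and constraints, but the shape is the same: walk the data and the schema together, and collect a message for every place where they disagree.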

In practice

Baserock and json-schema

Baserock uses json-schema to validate its morph files, which is an unusual choice, since both the schema and the morph files are actually YAML. YAML is almost a superset of JSON. Thus, while you might reasonably expect a YAML validator to work on JSON data, the reverse isn't true. YAML is favoured in Baserock and other communities because the files are easier for people to read. Also, although not all YAML is valid JSON, converting from YAML to JSON is pretty trivial and error-free, so it's not unreasonable to write a schema for validation by json-schema and store the underlying data in YAML.

OpenControl and Kwalify

OpenControl uses Kwalify to validate its schema. Kwalify is a YAML validation tool written in Ruby. In theory, a YAML validator should be the best thing to use: since almost all valid JSON is valid YAML, a YAML validator could validate data in either language.

Unfortunately, Kwalify's idea of valid YAML is quite restricted. This is an example piece of data from the OpenControl project - an example of a (fake) government standard:

name:
  FRIST-800-53
AU-1:
  family: AU
  name: Audit and Accountability Policy and Procedures
AU-2:
  family: AU
  name: Audit Events
AU-2 (3):
  family: AU
  name: Audit Events | Reviews and Updates

Kwalify, unfortunately, doesn't believe that you can have spaces or brackets in a key name, so "AU-2 (3)" is unacceptable. Every other YAML standard and validator thinks this is fine, though. Kwalify is also a dead project; it hasn't been updated for about five years, so fixing it seems unlikely.

The more hopeful alternative is PyKwalify, which is a reimplementation of Kwalify. However, it isn't capable of validating the above data yet. Kwalify has the '=:' operator, which means 'match anything which isn't otherwise caught' - so you could use it to match 'AU-1', 'AU-2' and 'AU-2 (3)' with one sub-schema and specify 'name:' to match the special case. '=:' isn't in PyKwalify yet. PyKwalify does have regex matching for keys, but this is also broken in subtle ways. At least PyKwalify is maintained, so I've raised a pull request for it and hopefully this will be fixed soon. Annoyingly, though, the original Kwalify doesn't support regexes, so you can't have a schema that can be validated by both.
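The 'match anything which isn't otherwise caught' behaviour can be sketched as a fallback lookup when choosing the rule for a key. The code below is an illustration only, not Kwalify or PyKwalify code: validate and DEFAULT_KEY are hypothetical names, only the 'map' and 'str' types are handled, and the schema and data are written as the Python dictionaries a YAML parser would produce. The schema gives 'name' its own rule and lets the '=' entry cover every control key in the OpenControl sample.

```python
DEFAULT_KEY = "="  # parsed form of the '=:' entry in a Kwalify-style schema


def validate(data, schema, path=""):
    """Return a list of error strings; an empty list means the data fits."""
    errors = []
    if schema["type"] == "map":
        if not isinstance(data, dict):
            return [f"{path or '/'}: expected a mapping"]
        rules = schema.get("mapping", {})
        for key, value in data.items():
            # Prefer an exact rule; otherwise fall back to the default rule.
            rule = rules.get(key, rules.get(DEFAULT_KEY))
            if rule is None:
                errors.append(f"{path}/{key}: key is not defined in the schema")
            else:
                errors.extend(validate(value, rule, f"{path}/{key}"))
    elif schema["type"] == "str":
        if not isinstance(data, str):
            errors.append(f"{path}: expected a string")
    return errors


# Sub-schema shared by every control entry such as 'AU-1' or 'AU-2 (3)'.
control_schema = {
    "type": "map",
    "mapping": {"family": {"type": "str"}, "name": {"type": "str"}},
}

standard_schema = {
    "type": "map",
    "mapping": {
        "name": {"type": "str"},      # the special case
        DEFAULT_KEY: control_schema,  # everything else
    },
}

standard = {
    "name": "FRIST-800-53",
    "AU-1": {"family": "AU", "name": "Audit and Accountability Policy and Procedures"},
    "AU-2": {"family": "AU", "name": "Audit Events"},
    "AU-2 (3)": {"family": "AU", "name": "Audit Events | Reviews and Updates"},
}

ok = validate(standard, standard_schema)
broken = validate(dict(standard, **{"AU-3": {"family": 3}}), standard_schema)

print(ok)      # no errors: every control key falls through to the default rule
print(broken)  # one error, for the non-string 'family' value
```

Because the key is only ever used as a dictionary lookup here, spaces and brackets in 'AU-2 (3)' need no special handling.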

Conclusion

Attention

The state of the art has improved since we wrote this in 2016. Check out the updated version of this article.

In short, schema validation for YAML documents is a mess at the moment. For this reason I'd recommend picking JSON over YAML for data storage, if you have a choice. json-schema appears to be the most mature project available at the moment.

