Attention
This article is out of date. Check out the updated version.
If you're familiar with SQL you will know how schemas work; a definition of how data is stored will be the first thing you think about when creating a database. For various reasons it's now popular to store structured data in text files such as JSON or YAML. However, in a lot of cases, the schema of these files is not explicit. A format will be expected by the application which loads that data, but you may only find out your input is invalid when the application throws an error message. The expected format is implicit; there may be examples given with the program, but there is no formal specification of the format of the data expected.
There are two main validators I've looked at: kwalify for YAML and json-schema for JSON. There's also Rx, which I haven't investigated yet. Rx's website says "Since Rx itself is still undergoing design refinements, relying on any implementation in production would be a bad idea." However, it may still be worth investigating.
Schema validation in theory
A schema validator takes two files as input - a schema and a data file to validate. Usually the schema file is written in the same language as the data file. This is an example from PyKwalify, a tool which validates YAML data against a schema.
---
type: map
mapping:
  name:
    type: str
    pattern: N.*n
Above is the schema. It says that the file is laid out as a mapping, which is analogous to an associative array or a dictionary object in Python. The only recognised key in this mapping is 'name'. The value attached to a name key must be a string, and must fit the pattern "N.*n". Below is some sample data we want to check:
---
name: Néron
We can check this by running pykwalify -s schema.yaml -d data.yaml.
The schema validator will tell you whether the data file fits the schema, and if it doesn't, point out what's at fault. The schema file is also, of course, written to a schema. For json-schema at least, the schema in which other schemas are written is self-validating; it is written according to its own schema, making it similar to a quine in programming languages.
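To make the idea concrete, here is a minimal sketch of what a validator does with the schema and data above. It is not how PyKwalify is implemented: it is a toy that only understands the `map` and `str` types and the `pattern` rule, and it assumes the pattern may match anywhere in the string. The function name `validate` and the `bad` sample are made up for illustration, and the schema and data are written as the Python objects a YAML loader would hand back.

```python
import re


def validate(data, schema, path=""):
    """Return a list of error strings; an empty list means the data fits."""
    errors = []
    kind = schema.get("type")
    if kind == "map":
        if not isinstance(data, dict):
            return [f"{path or '/'}: expected a mapping"]
        rules = schema.get("mapping", {})
        for key, value in data.items():
            if key not in rules:
                errors.append(f"{path}/{key}: key is not in the schema")
            else:
                errors.extend(validate(value, rules[key], f"{path}/{key}"))
    elif kind == "str":
        if not isinstance(data, str):
            return [f"{path or '/'}: expected a string"]
        pattern = schema.get("pattern")
        # Assumption: the pattern may match anywhere in the value.
        if pattern is not None and re.search(pattern, data) is None:
            errors.append(f"{path or '/'}: {data!r} does not match {pattern!r}")
    else:
        errors.append(f"{path or '/'}: unsupported type {kind!r}")
    return errors


# The schema and data from the text, as a YAML loader would return them.
schema = {"type": "map", "mapping": {"name": {"type": "str", "pattern": "N.*n"}}}
good = {"name": "Néron"}

# A hypothetical bad file: the name does not fit and 'age' is not in the schema.
bad = {"name": "Caligula", "age": 28}

print(validate(good, schema))
for problem in validate(bad, schema):
    print(problem)
```

The useful property is the one described above: the result is not just pass or fail, but a list saying where each fault is.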
In practice
Baserock and json-schema
Baserock uses json-schema to validate its morph files, which is an unusual choice, since both the schema and the morph files are actually YAML. YAML is almost a superset of JSON. Thus, while you might reasonably expect a YAML validator to work on JSON data, the reverse isn't true. YAML is favoured in Baserock and other communities because the files are easier for people to read. Also, although not all YAML is valid JSON, converting from YAML to JSON is pretty trivial and error free, so it's not unreasonable to write a schema for validation by json-schema and store the underlying data in YAML.
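The conversion step can be sketched with nothing but the standard library, assuming the YAML has already been loaded into Python objects (for example with PyYAML's `yaml.safe_load`, which is not used here). The helper name `to_json` and the sample fields are illustrative, not taken from Baserock. The one check it adds is that every mapping key is a string, because JSON objects only allow string keys while YAML mappings are more permissive.

```python
import json


def to_json(data):
    """Serialise YAML-loaded data to JSON, refusing non-string mapping keys."""

    def check(node, path="/"):
        if isinstance(node, dict):
            for key, value in node.items():
                if not isinstance(key, str):
                    raise TypeError(f"{path}: key {key!r} is not a string")
                check(value, f"{path}{key}/")
        elif isinstance(node, list):
            for index, item in enumerate(node):
                check(item, f"{path}{index}/")

    check(data)
    return json.dumps(data, indent=2, sort_keys=True)


# Illustrative stand-in for a loaded YAML file; the field names are assumptions.
morph = {"name": "example", "kind": "chunk", "build-commands": ["make"]}

print(to_json(morph))
```

The JSON text produced this way can then be handed to whichever json-schema implementation you prefer.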
OpenControl and Kwalify
OpenControl uses Kwalify to validate its schema. Kwalify is a YAML validation tool written in Ruby. In theory, a YAML validator should be the best thing to use: since almost all valid JSON is valid YAML, a YAML validator could validate data in either language.
Unfortunately, Kwalify's idea of valid YAML is quite restricted. This is an example piece of data from the OpenControl project - an example of a (fake) government standard:
name:
  FRIST-800-53
AU-1:
  family: AU
  name: Audit and Accountability Policy and Procedures
AU-2:
  family: AU
  name: Audit Events
AU-2 (3):
  family: AU
  name: Audit Events | Reviews and Updates
Kwalify, unfortunately, doesn't believe that you can have spaces or brackets in a key name, so "AU-2 (3)" is unacceptable. Every other YAML standard and validator thinks this is fine, though. Kwalify is also a dead project; it hasn't been updated for about five years, so fixing it seems unlikely.
The more hopeful alternative is PyKwalify, which is a reimplementation of Kwalify. However, it isn't yet capable of validating the above data. Kwalify has the '=:' operator, which means 'match anything which isn't otherwise caught' - so you could use it to match 'AU-1', 'AU-2' and 'AU-2 (3)' with one sub-schema and specify 'name:' to match the special case. '=:' isn't in PyKwalify yet. PyKwalify does have regex matching for keys, but this is also broken in subtle ways. At least PyKwalify is maintained, so I've raised a pull request for it and hopefully this will be fixed soon. Annoyingly, though, the original Kwalify doesn't support regexes, so you can't have a schema that can be validated by both.
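The two approaches to matching keys can be sketched as follows. This is an illustration of the semantics described above, not code from Kwalify or PyKwalify; the helper names (`pick_rule`, `check_mapping`, `is_title`, `is_control`) and the regular expression for control keys are assumptions made for the example.

```python
import re


def pick_rule(key, explicit, default=None, patterns=()):
    """Choose the sub-schema for a key: exact name, then regex, then catch-all."""
    if key in explicit:
        return explicit[key]
    for regex, rule in patterns:
        if re.fullmatch(regex, key):
            return rule
    return default  # None means no rule covers this key


def check_mapping(data, explicit, default=None, patterns=()):
    """Return error strings for keys with no rule or values failing their rule."""
    errors = []
    for key, value in data.items():
        rule = pick_rule(key, explicit, default, patterns)
        if rule is None:
            errors.append(f"{key!r}: no rule matches this key")
        elif not rule(value):
            errors.append(f"{key!r}: value does not fit its sub-schema")
    return errors


def is_title(value):
    """Sub-schema for the special 'name' key: a plain string."""
    return isinstance(value, str)


def is_control(value):
    """Sub-schema for a control entry: a mapping with string 'family' and 'name'."""
    return (
        isinstance(value, dict)
        and set(value) == {"family", "name"}
        and all(isinstance(v, str) for v in value.values())
    )


# The OpenControl sample from the text, as a YAML loader would return it.
standard = {
    "name": "FRIST-800-53",
    "AU-1": {"family": "AU", "name": "Audit and Accountability Policy and Procedures"},
    "AU-2": {"family": "AU", "name": "Audit Events"},
    "AU-2 (3)": {"family": "AU", "name": "Audit Events | Reviews and Updates"},
}

# Catch-all style: 'name' is explicit, every other key falls through to one rule.
catch_all_errors = check_mapping(standard, {"name": is_title}, default=is_control)

# Regex style: control keys must look like 'AU-2' or 'AU-2 (3)' (assumed pattern).
control_key = r"[A-Z]{2}-\d+( \(\d+\))?"
regex_errors = check_mapping(
    standard, {"name": is_title}, patterns=[(control_key, is_control)]
)

print(catch_all_errors, regex_errors)
```

The catch-all form accepts any key at all, so a typo in a key name is only caught if its value is the wrong shape; the regex form also rejects keys that do not look like control identifiers.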
Conclusion
Attention
The state of the art has improved since we wrote this in 2016. Check out the updated version of this article.
In short, schema validation for YAML documents is a mess at the moment. For this reason I'd recommend picking JSON over YAML for data storage, if you have a choice. json-schema appears to be the most mature project available at the moment.