Attention
This article is out of date. Check out the updated version.
If you're familiar with SQL, you will know how schemas work: a definition of how data is stored is the first thing you think about when creating a database. For various reasons it's now popular to store structured data in text files such as JSON or YAML. However, in a lot of cases, the schema of these files is not explicit. The application which loads the data expects a particular format, but you may only find out your input is invalid when the application throws an error. The expected format is implicit; there may be examples given with the program, but there is no formal specification of the format of the data expected.
There are two main validators I've looked at: Kwalify for YAML and json-schema for JSON. There's also Rx, which I haven't investigated yet. Rx's website says "Since Rx itself is still undergoing design refinements, relying on any implementation in production would be a bad idea." However, it may still be worth investigating.
Schema validation in theory
A schema validator takes two files as input: a schema and a data file to validate. Usually the schema file is written in the same language as the data file. This is an example from PyKwalify, a tool which validates YAML files against a schema.
---
type: map
mapping:
  name:
    type: str
    pattern: N.*n
Above is the schema. This says that the layout of the file is a mapping, which is analogous to an associative array, or a dictionary object in Python. There is only one recognised key in this mapping, which is 'name'. The value attached to a name key must be a string, and must fit the pattern "N.*n". Below is some sample data we want to check:
---
name: Néron
We can check this by running pykwalify -s schema.yaml -d data.yaml.
The schema validator will tell you whether the data file fits the schema, and if it doesn't, point out what's at fault. The schema file is of course also written to a schema. For json-schema at least, the schema in which other schemas are written is self-validating; it is written according to its own schema, making it similar to a quine in programming languages.
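To make the idea concrete, here is a minimal sketch of what a validator does with the schema and data above. It is not PyKwalify's implementation: the function name validate is made up for illustration, only the 'map' and 'str' types are handled, and the pattern is applied with re.search, which is an assumption about how patterns should match. The schema and data are written as the Python objects a YAML loader would produce.

```python
import re

def validate(data, schema, path=""):
    """Return a list of error strings; an empty list means the data fits."""
    errors = []
    kind = schema.get("type", "str")
    if kind == "map":
        if not isinstance(data, dict):
            return [f"{path or '/'}: expected a mapping"]
        rules = schema.get("mapping", {})
        for key, value in data.items():
            if key not in rules:
                # Only keys named in the schema are recognised.
                errors.append(f"{path}/{key}: key is not defined in the schema")
            else:
                errors.extend(validate(value, rules[key], f"{path}/{key}"))
    elif kind == "str":
        if not isinstance(data, str):
            errors.append(f"{path or '/'}: expected a string")
        elif "pattern" in schema and not re.search(schema["pattern"], data):
            errors.append(
                f"{path}: {data!r} does not match pattern {schema['pattern']!r}"
            )
    else:
        errors.append(f"{path or '/'}: unsupported type {kind!r} in this sketch")
    return errors

# The schema from the example above, as a YAML loader would return it.
schema = {"type": "map", "mapping": {"name": {"type": "str", "pattern": "N.*n"}}}

print(validate({"name": "Néron"}, schema))     # []
print(validate({"name": "Claudius"}, schema))  # one error: pattern mismatch
print(validate({"nom": "Néron"}, schema))      # one error: unknown key
```

Returning a list of errors with a path to each fault, rather than a single true or false, is what lets a validator point out where the data went wrong.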
In practice
Baserock and json-schema
Baserock uses json-schema to validate its morph files, which is an unusual choice, since both the schema and the morph files are actually YAML. YAML is almost a superset of JSON. Thus, while you might reasonably expect a YAML validator to work on JSON data, the reverse isn't true. YAML is favoured in Baserock and other communities because the files are easier for people to read. Also, although not all YAML is valid JSON, converting from YAML to JSON is pretty trivial and error-free, so it's not unreasonable to write a schema for validation by json-schema and store the underlying data in YAML.
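As a rough illustration of why the conversion is simple: once a YAML parser has loaded a file, the result is ordinary mappings, lists and scalars, which the standard json module can serialise directly. The sketch below skips the YAML parsing step and starts from a hand-written dictionary standing in for a loaded file; its field names are hypothetical and are not taken from Baserock. The round-trip check guards against the YAML features JSON cannot represent, such as non-string keys.

```python
import json

def yaml_data_to_json(loaded):
    """Serialise already-loaded YAML data as JSON, refusing lossy conversions."""
    text = json.dumps(loaded, indent=2, sort_keys=True)
    if json.loads(text) != loaded:
        raise ValueError("data does not survive a round trip through JSON")
    return text

# Stand-in for the result of loading a YAML file (hypothetical field names).
loaded = {"name": "hello", "kind": "chunk", "build-commands": ["make", "make install"]}
print(yaml_data_to_json(loaded))

# YAML permits non-string keys; JSON does not, so this case is rejected.
try:
    yaml_data_to_json({1: "one"})
except ValueError as err:
    print("rejected:", err)
```

The JSON text produced this way can then be handed to a JSON validator along with the schema.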
OpenControl and Kwalify
OpenControl uses Kwalify to validate its schema. Kwalify is a YAML validation tool written in Ruby. In theory, a YAML validator should be the best thing to use: since almost all valid JSON is valid YAML, a YAML validator could validate data in either language.
Unfortunately, Kwalify's idea of valid YAML is quite restricted. This is a piece of data from the OpenControl project, an example of a (fake) government standard:
name:
  FRIST-800-53
AU-1:
  family: AU
  name: Audit and Accountability Policy and Procedures
AU-2:
  family: AU
  name: Audit Events
AU-2 (3):
  family: AU
  name: Audit Events | Reviews and Updates
Kwalify, unfortunately, doesn't believe that you can have spaces or brackets in a key name, so "AU-2 (3)" is unacceptable. Every other YAML standard and validator thinks this is fine, though. Kwalify is also a dead project; it hasn't been updated for about five years, so fixing it seems unlikely.
The more hopeful alternative is PyKwalify, which is a reimplementation of Kwalify. However, it isn't capable of validating the above data yet. Kwalify has the '=:' operator, which means 'match anything which isn't otherwise caught', so you could use it to match 'AU-1', 'AU-2' and 'AU-2 (3)' with one sub-schema and specify 'name:' to match the special case. '=:' isn't in PyKwalify yet. PyKwalify does have regex matching for keys, but this is also broken in subtle ways. At least PyKwalify is maintained, so I've raised a pull request for it and hopefully this will be fixed soon. Annoyingly, though, the original Kwalify doesn't support regexes, so you can't have a schema that can be validated by both.
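The two approaches to keys can be sketched in plain Python, independent of either tool. In the sketch below, check_standard treats 'name' as the special case and applies one sub-schema to every other key, which is the catch-all behaviour described for '=:'. Passing a compiled pattern switches to regex key matching instead. The function names and the CONTROL_KEY pattern are hypothetical, chosen only to fit the sample data above; neither is Kwalify or PyKwalify syntax.

```python
import re

# Hypothetical pattern for control identifiers such as "AU-1" or "AU-2 (3)".
CONTROL_KEY = re.compile(r"[A-Z]+-\d+( \(\d+\))?")

def check_control(value):
    """Sub-schema for one control: a mapping with string 'family' and 'name'."""
    if not isinstance(value, dict):
        return ["expected a mapping"]
    return [
        f"missing or non-string field {field!r}"
        for field in ("family", "name")
        if not isinstance(value.get(field), str)
    ]

def check_standard(data, key_pattern=None):
    """Validate a standard; with no key_pattern, every non-'name' key is a control."""
    errors = []
    for key, value in data.items():
        if key == "name":  # the one special-case key
            if not isinstance(value, str):
                errors.append("name: expected a string")
        elif key_pattern is None or key_pattern.fullmatch(key):
            errors.extend(f"{key}: {msg}" for msg in check_control(value))
        else:
            errors.append(f"{key}: key does not look like a control identifier")
    return errors

standard = {
    "name": "FRIST-800-53",
    "AU-1": {"family": "AU", "name": "Audit and Accountability Policy and Procedures"},
    "AU-2": {"family": "AU", "name": "Audit Events"},
    "AU-2 (3)": {"family": "AU", "name": "Audit Events | Reviews and Updates"},
}

print(check_standard(standard))               # catch-all keys
print(check_standard(standard, CONTROL_KEY))  # regex-matched keys
```

The catch-all form accepts any key as long as its value has the right shape, while the regex form also rejects keys that don't look like control identifiers.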
Conclusion
Attention
The state of the art has improved since we wrote this in 2016. Check out the updated version of this article
In short, schema validation for YAML documents is a mess at the moment. For this reason I'd recommend picking JSON over YAML for data storage, if you have a choice. json-schema appears to be the most mature project available at the moment.