Attention
This article is out of date. Check out the updated version.
If you're familiar with SQL, you will know how schemas work: a definition of how data is stored is the first thing you think about when creating a database. For various reasons it's now popular to store structured data in text files such as JSON or YAML. However, in a lot of cases, the schema of these files is not explicit. The application which loads the data expects a particular format, but you may only find out your input is invalid when the application throws an error. The expected format is implicit; there may be examples given with the program, but there is no formal specification of the format of the data expected.
There are two main validators I've looked at: Kwalify for YAML and json-schema for JSON. There's also Rx, which I haven't investigated yet. Rx's website says "Since Rx itself is still undergoing design refinements, relying on any implementation in production would be a bad idea." However, it may still be worth investigating.
Schema validation in theory
A schema validator takes two files as input: a schema and a data file to validate. Usually the schema file is written in the same language as the data file. This is an example from PyKwalify, a tool which validates YAML files against schemas.
---
type: map
mapping:
  name:
    type: str
    pattern: N.*n
Above is the schema. This says that the layout of the file is a mapping, which is analogous to an associative array, or a dictionary object in Python. There is only one recognised key in this mapping, which is 'name'. The value attached to the 'name' key must be a string, and must fit the pattern "N.*n". Below is some sample data we want to check:
---
name: Néron
We could check this by running pykwalify -s schema.yaml -d data.yaml.
The schema validator will tell you whether the data file fits the schema, and if it doesn't, point out what's at fault. The schema file is also of course written to a schema. For json-schema at least, the schema in which other schemas are written is self-validating; it is written according to its own schema, making it similar to a quine in programming languages.
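To make the idea concrete, here is a minimal sketch of what a validator does with the schema and data above. It is a toy written for this article, not PyKwalify's implementation: the validate function is hypothetical, it only understands the 'map' and 'str' types and 'pattern', and it assumes the pattern may match anywhere in the string. The schema and data are written as the Python structures a YAML parser would produce.

```python
import re

# Toy validator for a small Kwalify-like subset: "map" and "str" types,
# plus "pattern". Hypothetical helper, not PyKwalify's implementation.
def validate(data, schema, path=""):
    """Return a list of error strings; an empty list means the data is valid."""
    errors = []
    kind = schema.get("type")
    if kind == "map":
        if not isinstance(data, dict):
            return [f"{path or '/'}: expected a mapping"]
        rules = schema.get("mapping", {})
        for key, value in data.items():
            if key not in rules:
                errors.append(f"{path}/{key}: key is not defined in the schema")
            else:
                errors.extend(validate(value, rules[key], f"{path}/{key}"))
    elif kind == "str":
        if not isinstance(data, str):
            errors.append(f"{path or '/'}: expected a string")
        # Assumption: the pattern may match anywhere in the value (re.search).
        elif "pattern" in schema and not re.search(schema["pattern"], data):
            errors.append(f"{path}: {data!r} does not match pattern {schema['pattern']!r}")
    else:
        errors.append(f"{path or '/'}: unsupported type {kind!r}")
    return errors

# The schema and data from above, as a YAML parser would load them.
schema = {"type": "map", "mapping": {"name": {"type": "str", "pattern": "N.*n"}}}

good = validate({"name": "Néron"}, schema)
bad = validate({"name": "Claudius", "age": 54}, schema)

print(good)
for error in bad:
    print(error)
```

The first document passes with no errors. The second is reported twice: once because the name does not fit the pattern, and once because 'age' is not a recognised key.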
In practice
Baserock and json-schema
Baserock uses json-schema to validate its morph files, which is an unusual choice, since both the schema and the morph files are actually YAML. YAML is almost a superset of JSON. Thus, while you might reasonably expect a YAML validator to work on JSON data, the reverse isn't true. YAML is favoured in Baserock and other communities because the files are easier for people to read. Also, although not all YAML is valid JSON, converting from YAML to JSON is pretty trivial and error-free, so it's not unreasonable to write a schema for validation by json-schema and store the underlying data in YAML.
OpenControl and Kwalify
OpenControl uses Kwalify to validate its YAML files. Kwalify is a YAML validation tool written in Ruby. In theory, a YAML validator should be the best thing to use: since almost all valid JSON is valid YAML, a YAML validator could validate data in either language.
Unfortunately, Kwalify's idea of valid YAML is quite restricted. This is an example piece of data from the OpenControl project, an example of a (fake) government standard:
name:
  FRIST-800-53
AU-1:
  family: AU
  name: Audit and Accountability Policy and Procedures
AU-2:
  family: AU
  name: Audit Events
AU-2 (3):
  family: AU
  name: Audit Events | Reviews and Updates
Kwalify, unfortunately, doesn't believe that you can have spaces or brackets in a key name, so "AU-2 (3)" is unacceptable. Every other YAML standard and validator thinks this is fine, though. Kwalify is also a dead project; it hasn't been updated for about five years, so fixing it seems unlikely.
The more hopeful alternative is PyKwalify, which is a reimplementation of Kwalify. However, it isn't capable of validating the above data yet. Kwalify has the '=:' operator, which means 'match anything which isn't otherwise caught', so you could use it to match 'AU-1', 'AU-2' and 'AU-2 (3)' with one sub-schema and specify 'name:' to match the special case. '=:' isn't in PyKwalify yet. PyKwalify does have regex matching for keys, but this is also broken in subtle ways. At least PyKwalify is maintained, so I've raised a pull request for it and hopefully this will be fixed soon. Annoyingly, though, the original Kwalify doesn't support regexes, so you can't have a schema that can be validated by both.
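The two approaches described above, a catch-all default and regex-matched keys, can be sketched in a few lines of Python. This is only an illustration of the matching logic, not Kwalify or PyKwalify schema syntax: check_standard, CONTROL_KEY and CONTROL_FIELDS are hypothetical names, and the regular expression is my own guess at what a control identifier looks like.

```python
import re

# Hypothetical key pattern for control identifiers such as "AU-2 (3)".
CONTROL_KEY = re.compile(r"[A-Z]{2}-\d+( \(\d+\))?")
# Fields that every control entry is assumed to need.
CONTROL_FIELDS = {"family", "name"}

def check_standard(document, use_regex_keys=False):
    """Return a list of error strings for a standards document.

    'name' is the special case. Every other key is treated as a control,
    either by a catch-all default or, with use_regex_keys=True, only if
    the key matches CONTROL_KEY.
    """
    errors = []
    for key, value in document.items():
        if key == "name":
            if not isinstance(value, str):
                errors.append("name: expected a string")
            continue
        if use_regex_keys and not CONTROL_KEY.fullmatch(key):
            errors.append(f"{key}: key does not look like a control identifier")
            continue
        if not isinstance(value, dict):
            errors.append(f"{key}: expected a mapping")
            continue
        missing = CONTROL_FIELDS - set(value)
        if missing:
            errors.append(f"{key}: missing {sorted(missing)}")
    return errors

# The OpenControl example from above, as a YAML parser would load it.
standard = {
    "name": "FRIST-800-53",
    "AU-1": {"family": "AU", "name": "Audit and Accountability Policy and Procedures"},
    "AU-2": {"family": "AU", "name": "Audit Events"},
    "AU-2 (3)": {"family": "AU", "name": "Audit Events | Reviews and Updates"},
}
# A deliberately broken document to show the difference between the two modes.
broken = {
    "name": "FRIST-800-53",
    "AU-3": {"family": "AU"},
    "notes": {"family": "AU", "name": "Not a control"},
}

catch_all_ok = check_standard(standard)
regex_ok = check_standard(standard, use_regex_keys=True)
catch_all_broken = check_standard(broken)
regex_broken = check_standard(broken, use_regex_keys=True)

for error in regex_broken:
    print(error)
```

Both modes accept the example, including the 'AU-2 (3)' key. On the broken document the catch-all only notices the missing field, while the regex version also rejects the 'notes' key, which is the extra strictness regex keys buy you.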
Conclusion
Attention
The state of the art has improved since we wrote this in 2016. Check out the updated version of this article.
In short, schema validation for YAML documents is a mess at the moment. For this reason I'd recommend picking JSON over YAML for data storage, if you have a choice. json-schema appears to be the most mature project available at the moment.