YAML Schemas: Validating Data without Writing Code

In the 20 years since YAML first appeared, its flexible and approachable way of representing data has become ubiquitous. Codethink is driven by YAML: we manage our infrastructure using Ansible, we integrate software stacks using BuildStream, we define CI pipelines using GitLab CI or GitHub Actions, we run safety analysis using STPA tools, and the list goes on. YAML has even made it to Mars.

The "human friendly" design principle of YAML means it's convenient to read and write by hand, and is most likely the secret to its dominance over competing formats. This convenience comes at a cost for developers who need to work with YAML data, though. Read on to find out more.

Safety and Validation

YAML's data model can represent arbitrary data, so when an app parses a YAML document it might get back anything. It's up to the developer to check that the data is structured how the app expects and to control what happens when it isn't. Does it report an error to the user? Is the behaviour undefined? Does it crash?

The YAML data model has strong, implicit typing. Each time it reads a value, the YAML parser will guess the type tag in a process called "tag resolution". The following example shows the "Norway problem", an exciting edge case present in YAML 1.1 and earlier:

country-list: [DK, NO, SE]

That's a list of two-letter ISO country codes. Let's load this into Python using pyyaml:

>>> import yaml
>>> yaml.safe_load("country-list: [DK, NO, SE]")
{'country-list': ['DK', False, 'SE']}

If your program expects two-letter country codes, what happens if it gets the boolean False instead? This happens because YAML 1.1 specifies that unquoted YES and NO resolve to boolean values. You can work around this by quoting 'NO' or by explicitly setting the type tag with !!str NO, but it's easy to forget and there are many similar edge cases.
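As a quick check of both workarounds, the sketch below reuses pyyaml's safe_load from the example above. Only the variable names are new, and it assumes pyyaml is installed.

```python
import yaml

# Unquoted NO is resolved to a boolean under YAML 1.1 rules.
plain = yaml.safe_load("country-list: [DK, NO, SE]")

# Quoting the scalar forces it to be read as a string.
quoted = yaml.safe_load("country-list: [DK, 'NO', SE]")

# An explicit !!str tag has the same effect.
tagged = yaml.safe_load("country-list: [DK, !!str NO, SE]")

print(plain)   # {'country-list': ['DK', False, 'SE']}
print(quoted)  # {'country-list': ['DK', 'NO', 'SE']}
print(tagged)  # {'country-list': ['DK', 'NO', 'SE']}
```

Both fixes rely on whoever writes the file remembering them, which is why validation on the reading side still matters.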

One way or another, your app must validate the data it receives before processing it.

Schemas

You can write code to manually verify the data structure and report any issues to the user. The more times you implement this the more boring and error-prone it becomes, and you think: is there a library to make this more convenient?
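To give a feel for what manual verification involves, here is a minimal sketch for the country-list document. The function name validate_country_list and the exact rules it applies are assumptions made for illustration; it works on data that has already been parsed, so it needs nothing beyond the standard library.

```python
def validate_country_list(data):
    """Return a list of human-readable problems found in `data`."""
    if not isinstance(data, dict):
        return ["top level must be a mapping"]
    countries = data.get("country-list")
    if not isinstance(countries, list):
        return ["'country-list' must be a list"]

    problems = []
    for index, code in enumerate(countries):
        if not isinstance(code, str):
            kind = type(code).__name__
            problems.append(f"item {index}: expected a string, got {kind}")
        elif len(code) != 2 or not code.isalpha() or not code.isupper():
            # Simplified check: two uppercase letters, not a real ISO lookup.
            problems.append(f"item {index}: {code!r} is not a two-letter code")
    return problems


good = validate_country_list({"country-list": ["DK", "NO", "SE"]})
bad = validate_country_list({"country-list": ["DK", False, "SE"]})
print(good)  # []
print(bad)   # ['item 1: expected a string, got bool']
```

Even for a one-key document this is a fair amount of code, and every new key needs its own checks and error messages.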

The answer is yes, but the correct choice depends on the language you're working with.

Some languages have built-in types that easily map to YAML's data model. JavaScript is a good example: when you load a YAML document using js-yaml you receive a tree of JavaScript objects and values which hasn't been validated against any schema. Your code must handle situations where the input data is not formed how you expect, and you can use a general purpose validator that operates on JavaScript objects directly.

In many cases a validator supports a reasonable subset of the YAML data model, which is fine when you know what to expect. JSON-Schema is one example: the JSON data model is very close to YAML, and it's commonly used to validate data read from YAML files. We might validate our country-list above with the following JSON-Schema:

$id: https://example.com/schema.json
$schema: https://json-schema.org/draft/2020-12/schema
type: object
properties:
  country-list:
    type: array
    items: { "type": "string" }
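As one way to apply the schema above, the sketch below uses the Python jsonschema package, assuming it is installed. The schema is written out as a Python dict, the helper name check is hypothetical, and the second input is the value pyyaml returned in the Norway example.

```python
from jsonschema import ValidationError, validate

# The JSON-Schema from above, expressed as a Python dict.
schema = {
    "$id": "https://example.com/schema.json",
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "country-list": {"type": "array", "items": {"type": "string"}},
    },
}


def check(document):
    """Return None if `document` matches the schema, else an error message."""
    try:
        validate(instance=document, schema=schema)
    except ValidationError as error:
        return error.message
    return None


ok_result = check({"country-list": ["DK", "NO", "SE"]})
bad_result = check({"country-list": ["DK", False, "SE"]})
print(ok_result)   # None
print(bad_result)  # a message describing the non-string item
```

The schema does not stop the parser from turning NO into False, but it does make sure the mistake is reported before the rest of the program sees it.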

Other languages, such as C, have no built-in way to represent YAML's model. A YAML library for a language like C must define its own types to represent the data, or use an alternative approach: the popular libyaml C library provides a low-level token-based API, which can be inconvenient to use. Some libraries improve on this by allowing developers to define a schema and data model in code. This has two benefits: the library can validate the data before returning it to your app, and it can return it using types defined by you. Examples of this approach are libcyaml (for C), go-yaml (Go) and serde_yaml (Rust).
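The idea of defining the schema and the data model together in code can be sketched in Python with dataclasses. This is only an analogy and not the API of libcyaml, go-yaml or serde_yaml: the Config class, its fields and the load_typed helper are all hypothetical, and the sketch handles only flat fields and lists of simple values.

```python
from dataclasses import dataclass, fields
from typing import get_args, get_origin


@dataclass
class Config:
    """Hypothetical data model: the field types double as the schema."""
    name: str
    retries: int
    countries: list[str]


def load_typed(cls, data):
    """Build `cls` from a parsed mapping, raising TypeError on any mismatch."""
    if not isinstance(data, dict):
        raise TypeError("expected a mapping at the top level")
    unknown = set(data) - {f.name for f in fields(cls)}
    if unknown:
        raise TypeError(f"unexpected keys: {sorted(map(str, unknown))}")

    values = {}
    for f in fields(cls):
        if f.name not in data:
            raise TypeError(f"missing key: {f.name}")
        value = data[f.name]
        if get_origin(f.type) is list:
            (item_type,) = get_args(f.type)
            ok = isinstance(value, list) and all(
                type(item) is item_type for item in value
            )
        else:
            # Exact type match, so a bool is not accepted where int is declared.
            ok = type(value) is f.type
        if not ok:
            raise TypeError(f"{f.name}: expected {f.type}, got {value!r}")
        values[f.name] = value
    return cls(**values)


config = load_typed(Config, {"name": "demo", "retries": 3, "countries": ["DK", "SE"]})
print(config.countries)  # ['DK', 'SE']

try:
    load_typed(Config, {"name": "demo", "retries": 3, "countries": ["DK", False]})
    error_message = None
except TypeError as error:
    error_message = str(error)
print(error_message)
```

The caller gets back either a fully typed object or an error, and never a half-checked tree of values.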

Libraries

So what's the safest and most convenient library to use for loading and validating YAML? Here are some recommendations grouped by language ecosystem.

C

Developed by Codethink engineer Michael Drake, libcyaml wraps the standard libyaml parser with a strongly-typed API for defining your data schema and loading it into structures defined by you.

See the guide to find out more about how it works.

C++

There is no schema-driven YAML loader for C++ that we know of. Both yaml-cpp and rapidyaml return a tree of C++ objects, and it's up to you to validate and process the results.

There are several libraries that handle JSON-Schema for C++, but most are tied to specific JSON parser libraries and can't be extended to YAML. The exception is valijson, whose flexible adaptor-based API integrates with many JSON libraries and could be extended to cover YAML libraries too.

If you are using Boost and can represent your data using Boost.PropertyTree, you can already validate it using valijson.

Go

Go libraries can see your program's type information at runtime using the powerful reflect package. The yaml package and the newer go-yaml both use this feature to make schema validation simple: you define the expected structure of the data using Go's built-in types, and the library will return an error if anything in the file doesn't match.

JavaScript

The most popular JavaScript YAML loader is js-yaml. There are several other options too, and all of them return the data as JavaScript objects without validating that it's what you want.

Once you have the data, you can check it against a JSON-Schema using ajv, or the newer djv library.

Python

The safest way to load YAML in Python is with strictyaml. When you know what structure the input data should have, strictyaml is perfect: it's specifically designed to avoid the "Norway problem", and provides its own way to define a schema using Python code so it can verify the structure of the input and raise exceptions if it sees a problem.

You should be aware that it imposes some limits on the incoming data, so for some use cases you'll want pyyaml or ruamel.yaml instead. These libraries work well with the Python jsonschema package for data validation.

Rust

With its emphasis on performance, Rust requires that any type introspection is done at compile-time. The Serde crate provides Serialize and Deserialize traits which your data structures can implement in a single line of code, using #[derive] macros.

This enables the serde_yaml library to check data loaded from a YAML file against the data structures you expect to hold it, and return an error if there is anything unexpected.

See the tutorial How to read and write YAML in Rust using Serde for a detailed example.

Other languages

We've listed some common languages that we use here at Codethink. Many more language ecosystems have some way to parse YAML. You can start by checking the list at yaml.org, and if you need an easy way to validate the data, look at the list of JSON-Schema validator implementations for your language.

Whatever your language, see if you can use a library to validate incoming data instead of checking everything by hand.