Avram Specification

Avram is a schema language for field-based data formats such as key-value records or library formats MARC and PICA.

Introduction

MARC and related formats such as PICA and MAB have been used for decades as the basis for library automation. Several variants, dialects and profiles exist for different applications. The Avram schema language allows individual formats to be specified for documentation, validation, and requirements engineering. The schema language is named after Henriette D. Avram (1919-2006), who devised MARC as the first automated cataloging system in the 1960s.

The Avram specification consists of a schema format based on JSON and validation rules to validate records against individual schemas. The format can also be used to express results of record analysis. Avram schemas cover library formats based on MARC and PICA as well as simple key-value structures.

The document is managed in a git repository at https://github.com/gbv/avram together with test files for implementations.

Conformance requirements

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Data types

A string is a sequence of Unicode code points.

A single character is a string consisting of exactly one Unicode code point.

A timestamp is a date or datetime as defined by the XML Schema datatypes dateTime (-?YYYY-MM-DDThh:mm:ss(\.s+)?(Z|[+-]hh:mm)?), date (-?YYYY-MM-DD(Z|[+-]hh:mm)?), gYearMonth (-?YYYY-MM), or gYear (-?YYYY).

A regular expression is a non-empty string that conforms to the ECMA 262 (2015) regular expression grammar. The expression is interpreted as a Unicode pattern with . matching all characters, including newlines.

A language is a natural language identifier as defined with XML Schema datatype language.

A non-negative integer is a natural number (0, 1, 2...).

A URI is a valid URI string according to RFC 3986.

A URL is a URI starting with http:// or https://.

A range is a sequence of digits, or a sequence of digits followed by a dash (-) and a second sequence of digits. The second sequence, if given, SHOULD have the same length as the first. The numeric values of the sequences are called start number and end number, respectively. The end number, if given, MUST be larger than the start number. Examples of valid ranges include 0, 00, 3-7, 03-12, and 1-09, but not 7-2. A string matches a range if it is a sequence of digits of the same length as the longest sequence in the range and its numeric value lies between the start number and the end number, inclusive. For instance, 7 matches range 0-9 but neither 1-3 nor 03-10, and 07 matches 03-10 but not 0-9.
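The matching rule for ranges can be sketched in a few lines of Python. This is only an illustration: the function names parse_range and matches_range are made up for this example, and the sketch assumes that a range without a second sequence covers only its start number.

```python
import re

RANGE_RE = re.compile(r"([0-9]+)(?:-([0-9]+))?")

def parse_range(text):
    """Return (start, end, width) of a range, or raise ValueError."""
    m = RANGE_RE.fullmatch(text)
    if not m:
        raise ValueError(f"not a range: {text!r}")
    first, second = m.group(1), m.group(2)
    start = int(first)
    # Assumption: without a second sequence the range covers only the start number
    end = int(second) if second is not None else start
    if second is not None and end <= start:
        raise ValueError(f"end number must be larger than start number: {text!r}")
    # A matching string must be as long as the longest sequence of the range
    width = max(len(first), len(second or ""))
    return start, end, width

def matches_range(value, range_text):
    """Check whether a string matches a range as defined above."""
    start, end, width = parse_range(range_text)
    if len(value) != width or not re.fullmatch(r"[0-9]+", value):
        return False
    return start <= int(value) <= end

print(matches_range("7", "0-9"))     # True
print(matches_range("7", "03-10"))   # False, wrong length
print(matches_range("07", "03-10"))  # True
```

The length check comes before the numeric comparison, which is why 7 does not match 03-10 although its numeric value lies within the range.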

Records

Avram schemas are used to validate and analyze records. A record is a non-empty sequence of fields, each consisting of a tag, being a non-empty string and

Fields with subfields, also called variable fields, MAY also have

In addition, each record has a set of record types, each being a non-empty string. By default this set is empty and applications MAY choose to not support record types at all (see validation with record types).

The record model can further be restricted by a format family.

The encoding of records in JSON or other individual serialization formats such as MARCXML, ISO 2709, or PICA JSON is out of the scope of this specification.

Examples

Format families

The record model can be restricted by a format family, identified by a non-empty string. The following format families are part of this specification:

Restrictions on records by a format family imply restrictions on schemas for this format family.

Examples

Schema format

An Avram Schema is a JSON object given as a serialized JSON document or in any other format that encodes a JSON document. In contrast to RFC 7159, all object keys MUST be unique. String values SHOULD NOT be the empty string. Applications MAY remove keys with empty string value.
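As an illustration of the two requirements above, the following Python sketch loads a schema, rejects duplicate object keys, and removes keys with an empty string value. The function names are hypothetical, and removing empty strings is only one of the behaviours the specification permits.

```python
import json

def reject_duplicate_keys(pairs):
    """Build a dict from key/value pairs, failing on repeated keys."""
    obj = {}
    for key, value in pairs:
        if key in obj:
            raise ValueError(f"duplicate key: {key!r}")
        obj[key] = value
    return obj

def drop_empty_strings(node):
    """Recursively remove object keys whose value is the empty string."""
    if isinstance(node, dict):
        return {k: drop_empty_strings(v) for k, v in node.items() if v != ""}
    if isinstance(node, list):
        return [drop_empty_strings(v) for v in node]
    return node

def load_schema(text):
    """Parse a serialized schema (hypothetical helper, not part of Avram)."""
    schema = json.loads(text, object_pairs_hook=reject_duplicate_keys)
    if not isinstance(schema, dict):
        raise ValueError("an Avram schema must be a JSON object")
    return drop_empty_strings(schema)

print(load_schema('{"fields": {}, "title": ""}'))  # {'fields': {}}
```

The standard json module silently keeps the last of several duplicate keys, so the object_pairs_hook is needed to detect them.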

A schema MUST contain key

A schema SHOULD contain keys documenting the format defined by the schema:

The schema MAY contain keys:

Example
{
  "fields": { },
  "title": "MARC 21 Format for Classification Data",
  "description": "MARC format for classification numbers and captions associated with them",
  "url": "https://www.loc.gov/marc/classification/",
  "profile": "http://format.gbv.de/marc/classification",
  "language": "en",
  "$schema": "https://format.gbv.de/schema/avram/schema.json"
}

Field schedule

A field schedule is a JSON object that maps field identifiers to field definitions.

Example
{
  "010": { "label": "Library of Congress Control Number" },
  "084": { "label": "Classification Scheme and Edition" }
}

Field identifiers of a field schedule MUST NOT overlap. Two field identifiers overlap when it is possible to match a field with both.

Field identifier

A field identifier is a non-empty string that can be used to match fields. The identifier consists of a tag, optionally followed by the slash (/) and

Applications MAY further allow a tag followed by the slash and two zeroes (/00) as alias for a bare tag.

A field matches a field identifier if the tag of the field is equal to the tag of the field identifier, and

Examples

Field definition

A field definition is a JSON object that SHOULD contain key:

The field definition MAY further contain keys:

A typed field definition is a JSON object with optional keys positions, pattern, groups, codes, label, description, and url, each defined identically to the key of the same name allowed in a field definition (see validation with record types).

If a field definition is given in a field schedule, each of tag, occurrence and counter MUST either be missing or have the same value as used to construct the corresponding field identifier.

If a field definition contains the key subfields, indicating a variable field, it MUST NOT contain keys for flat fields (positions, pattern and/or codes).

Applications MAY allow and remove occurrence keys with value two zeroes (00) as alias for a field definition without occurrence.

Example

Positions

Subfield values and flat field values can be specified with positions, a JSON object that maps character positions to data element definitions. A character position is a range. It is RECOMMENDED to use sequences of two digits.

A data element definition is a JSON object that SHOULD contain key:

The data element definition MAY further contain keys:

Character positions of a positions object MUST NOT overlap. Two character positions overlap if there is a string that matches both of them.
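Read literally, the overlap rule combines with the definition of range matching: a string can only match two character positions if both demand the same string length and their numeric intervals intersect. The following Python sketch checks a positions object on that reading. The helper names are hypothetical and the sketch does not validate that the end number is larger than the start number.

```python
import re

def bounds(position):
    """Return (start, end, width) of a character position such as '03-10'."""
    m = re.fullmatch(r"([0-9]+)(?:-([0-9]+))?", position)
    if not m:
        raise ValueError(f"not a character position: {position!r}")
    first = m.group(1)
    second = m.group(2) or first  # a single sequence covers one number only
    return int(first), int(second), max(len(first), len(second))

def positions_overlap(a, b):
    """True if some digit string matches both character positions."""
    start_a, end_a, width_a = bounds(a)
    start_b, end_b, width_b = bounds(b)
    # Matching strings must have the width of the range, so ranges of
    # different width can never be matched by the same string
    return width_a == width_b and start_a <= end_b and start_b <= end_a

def find_overlaps(positions):
    """List all pairs of overlapping keys in a positions object."""
    keys = sorted(positions)
    return [(a, b) for i, a in enumerate(keys)
            for b in keys[i + 1:] if positions_overlap(a, b)]

print(find_overlaps({"00-04": {}, "03": {}, "05-06": {}}))  # [('00-04', '03')]
```

One consequence of this reading is that 0-3 and 02-05 do not overlap, which is one more reason to follow the recommendation of using two-digit sequences throughout.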

Examples

Pattern groups

Pattern groups are a JSON object that maps numbers of capturing groups (starting with "1") of a regular expression to documentation objects, each with optional keys:

Subfield schedule

A subfield schedule is a JSON object that maps subfield codes to subfield definitions. A subfield code is a single character. A subfield definition is a JSON object that SHOULD contain keys:

The subfield definition MAY further contain keys:

The subfield definition MAY but SHOULD NOT contain an additional, deprecated key

Example

Indicator definition

An indicator definition is a JSON object that SHOULD contain key

and further MAY contain keys:

Example
{
  "label": "Type",
  "codes": {
    " ": "Abbreviated key title",
    "0": "Other abbreviated title"
  }
}

Codelist

A codelist is

A code is a non-empty string. A code definition is either a string or a JSON object with optional keys:

Optional key code of a code definition MUST be equal to the key of the code definition in its codelist.

A code definition being a string MUST be treated identically to a code definition being a JSON object with only key label having the value of the string.
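A minimal Python sketch of this normalization, assuming an explicit codelist is available as a dictionary that maps codes to code definitions. The function name normalize_codelist is chosen for illustration only.

```python
def normalize_codelist(codelist):
    """Expand string code definitions and check the optional key 'code'."""
    normalized = {}
    for code, definition in codelist.items():
        if isinstance(definition, str):
            # A string is shorthand for an object with only key label
            entry = {"label": definition}
        else:
            entry = dict(definition)
        if "code" in entry and entry["code"] != code:
            raise ValueError(f"code {entry['code']!r} given under key {code!r}")
        normalized[code] = entry
    return normalized

codes = {" ": "Abbreviated key title", "0": {"label": "Other abbreviated title"}}
print(normalize_codelist(codes))
# {' ': {'label': 'Abbreviated key title'}, '0': {'label': 'Other abbreviated title'}}
```

Normalizing once after loading a schema means that later validation steps only have to deal with the object form.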

A codelist directory is a JSON object that maps codelist references to JSON objects each having at least the mandatory key codes with a codelist and optional keys:

A codelist reference can be resolved by looking up its value as key in the codelist directory to get the corresponding explicit codelist.

Examples

External validation rules

An Avram Schema MAY include references to additional validation rules with key rules at the root level, at field schedules, and at subfield schedules to check additional data types or integrity constraints. The value of this key MUST be an array of strings or arbitrary JSON objects. String elements MUST NOT be equal to names of validation rules but they SHOULD be URIs.

Example
{
  "fields": {
    "birth": {
      "subfields": {
        "Y": { "label": "year" },
        "M": { "label": "month" },
        "D": { "label": "day" }
      },
      "rules": ["http://example.org/valid-date"]
    },
    "death": {
      "subfields": {
        "Y": { "label": "year" },
        "M": { "label": "month" },
        "D": { "label": "day" }
      },
      "rules": ["http://example.org/valid-date"]
    },
    "age": {
      "rules": ["xsd:nonNegativeInteger"]
    }
  },
  "rules": [
    "death must not be earlier than birth, except for time-travelers",
    "age must equal death minus birth, if given",
    {
      "id": "privacy",
      "if": "birth?",
      "then": "birth.Y < 1950",
      "description": "birth only allowed before 1950 for privacy reasons"
    }
  ]
}

Restrictions by format family

A format family restricts the model of records that can be described by an Avram schema. Known values of schema key family imply restrictions on field identifiers and field definitions.

flat formats

Field identifiers are plain tags. Field definitions MUST NOT include keys occurrence, counter, indicator1, indicator2, or subfields.

marc formats

Field identifiers are plain tags and MUST either be the string LDR or three digits. Field definitions MUST NOT include keys occurrence or counter. Field definitions of flat fields MUST NOT have keys indicator1 or indicator2.

pica formats

Field identifiers MUST NOT include a field counter if their tag starts with digit 0 or 1 and MUST NOT include a field occurrence if their tag starts with digit 2. Tags MUST match the regular expression ^[012][0-9][0-9][A-Z@]. Field definitions MUST NOT include keys indicator1 or indicator2.

mab formats

Field identifiers are plain tags and MUST consist of exactly three digits. Field definitions MUST NOT include keys indicator2, occurrence, or counter.
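Part of these restrictions can be expressed as lookup tables. The Python sketch below covers only the forbidden keys of each family and the tag syntax of marc and mab formats. It does not cover the rules on counters and occurrences in pica field identifiers or the rule on indicators of flat fields in marc formats. The table and function names are hypothetical.

```python
import re

# Keys a field definition must not include, by format family
FORBIDDEN_KEYS = {
    "flat": {"occurrence", "counter", "indicator1", "indicator2", "subfields"},
    "marc": {"occurrence", "counter"},
    "pica": {"indicator1", "indicator2"},
    "mab": {"indicator2", "occurrence", "counter"},
}

# Syntax of field identifiers for families with plain, restricted tags
TAG_PATTERNS = {
    "marc": re.compile(r"LDR|[0-9]{3}"),
    "mab": re.compile(r"[0-9]{3}"),
}

def family_violations(family, fields):
    """Return messages for violations found in a field schedule."""
    problems = []
    forbidden = FORBIDDEN_KEYS.get(family, set())
    tag_pattern = TAG_PATTERNS.get(family)
    for identifier, definition in fields.items():
        if tag_pattern and not tag_pattern.fullmatch(identifier):
            problems.append(f"{identifier}: invalid field identifier for {family}")
        for key in sorted(forbidden.intersection(definition)):
            problems.append(f"{identifier}: key {key} not allowed for {family}")
    return problems

fields = {"LDR": {}, "01": {}, "245": {"occurrence": "01"}}
print(family_violations("marc", fields))
# ['01: invalid field identifier for marc', '245: key occurrence not allowed for marc']
```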

Metaschema

A JSON Schema to validate Avram Schemas is available at https://format.gbv.de/schema/avram/schema.json.

Applications MAY extend the metaschema for particular format families and formats, for instance by further restriction of the allowed set of field identifiers.

Validation rules

Avram schemas can be used to validate records based on validation rules specified in this section (marked in bold and numbered from 1 to 23). Rules 1 to 19 refer to validation of individual records, fields, and subfields. Rules 20 to 22 refer to validation of sets of records. Rule 23 can refer to both.

An Avram validator MAY choose to support only a limited set of validation rules. It SHOULD allow selected rules to be enabled and disabled, and it MAY disable selected rules by default. It is RECOMMENDED to disable counting rules (20 to 22) and external rules (23) by default. Support and selection of validation rules MUST be documented.

An Avram validator MAY limit validation to selected format families.

  1. invalidRecord: A set of records is valid against a schema, if all of its records pass record validation against the field schedule of the schema.

Record validation

A record is valid against a field schedule if the following rules are met and every field passes field validation against its corresponding field definition from the field schedule. If rule undefinedField is disabled, fields without corresponding field definition are assumed to be valid.

  2. undefinedField: Every field matches a field identifier in the field schedule.

  3. deprecatedField: The matched field definition must not have key deprecated set to true.

  4. nonrepeatableField: The record does not contain more than one field matching the same field definition with repeatable being false.

  5. missingField: The record contains at least one field for each field definition with required being true.
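The four record-level rules can be sketched as follows. The sketch makes several assumptions that are not part of the specification: a record is a list of dictionaries with key tag, field identifiers are plain tags (as in flat, marc and mab formats), and the keys repeatable and required are only evaluated when given explicitly, so default values are not modelled. Field validation of the individual fields is left out.

```python
from collections import Counter

def validate_record(record, fields, check_undefined=True):
    """Return (rule, tag) pairs for violated record-level rules."""
    errors = []
    seen = Counter()
    for field in record:
        tag = field["tag"]
        definition = fields.get(tag)
        if definition is None:
            # With undefinedField disabled, unknown fields count as valid
            if check_undefined:
                errors.append(("undefinedField", tag))
            continue
        seen[tag] += 1
        if definition.get("deprecated") is True:
            errors.append(("deprecatedField", tag))
    for tag, definition in fields.items():
        if definition.get("repeatable") is False and seen[tag] > 1:
            errors.append(("nonrepeatableField", tag))
        if definition.get("required") is True and seen[tag] == 0:
            errors.append(("missingField", tag))
    return errors

fields = {"010": {"repeatable": False}, "084": {"required": True}}
record = [{"tag": "010"}, {"tag": "010"}, {"tag": "999"}]
print(validate_record(record, fields))
# [('undefinedField', '999'), ('nonrepeatableField', '010'), ('missingField', '084')]
```

Collecting all violations instead of stopping at the first one lets a validator report every problem of a record in a single pass.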

Field validation

A field is valid against a field definition if the following rules are met:

  6. invalidFieldValue: If the field is a flat field, its field value must be valid by value validation and by validation with record types.

  7. invalidIndicator: If the field contains indicators, their values must be valid by value validation against the corresponding indicator definition indicator1 (first indicator) and indicator2 (second indicator).

If the field is a variable field:

  8. undefinedSubfield: Every subfield has a corresponding subfield definition.

  9. deprecatedSubfield: The matched subfield definition must not have key deprecated set to true.

  10. nonrepeatableSubfield: For subfield definitions with repeatable being false, the field MUST NOT contain more than one subfield.

  11. missingSubfield: For subfield definitions with required being true, the field MUST contain at least one subfield.

  12. invalidSubfieldValue: Every subfield value is valid by value validation against its corresponding subfield definition.

Tag and occurrence of a field are not included in field validation as they are part of record validation.
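The subfield rules of a variable field can be sketched in the same style. The sketch assumes that subfields are given as a list of (code, value) pairs and that keys repeatable and required are only evaluated when given explicitly. It reads nonrepeatableSubfield as applying to definitions whose repeatable is false, in line with nonrepeatableField, and it leaves out invalidSubfieldValue.

```python
from collections import Counter

def validate_subfields(subfields, schedule):
    """Return (rule, code) pairs for violated subfield rules."""
    errors = []
    seen = Counter(code for code, _value in subfields)
    for code in seen:
        definition = schedule.get(code)
        if definition is None:
            errors.append(("undefinedSubfield", code))
        elif definition.get("deprecated") is True:
            errors.append(("deprecatedSubfield", code))
    for code, definition in schedule.items():
        if definition.get("repeatable") is False and seen[code] > 1:
            errors.append(("nonrepeatableSubfield", code))
        if definition.get("required") is True and seen[code] == 0:
            errors.append(("missingSubfield", code))
    return errors

schedule = {"a": {"repeatable": False}, "c": {"required": True}}
print(validate_subfields([("a", "x"), ("a", "y")], schedule))
# [('nonrepeatableSubfield', 'a'), ('missingSubfield', 'c')]
```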

Value validation

A value (given as string), is valid if it conforms to a definition (given as field definition, subfield definition, indicator definition, or data element definition) by meeting the following rules:

  13. patternMismatch: If the definition contains key pattern, the value must match its regular expression. The pattern is not anchored by default, so ^ and/or $ must be included to match start and/or end of the value.

  14. invalidPosition: If the definition contains key positions, the value must be valid against its positions.

If the definition contains key codes, the value must further be valid against its codelist (see corresponding rules below).

A value is always valid if the definition contains none of the keys pattern, positions, and codes.
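The effect of unanchored patterns is easy to overlook, so here is a small Python sketch limited to rule patternMismatch. It uses Python's re module as a stand-in for an ECMA 262 engine, which is an approximation: the two dialects differ in details, for instance Python lets $ match before a trailing newline. Keys positions and codes are not handled here.

```python
import re

def validate_value(value, definition):
    """Return violated rule names for a value (pattern check only)."""
    errors = []
    pattern = definition.get("pattern")
    if pattern is not None:
        # re.search does not anchor the pattern, and re.DOTALL lets "."
        # match all characters including newlines
        if re.search(pattern, value, flags=re.DOTALL) is None:
            errors.append("patternMismatch")
    return errors

print(validate_value("abc1234x", {"pattern": "[0-9]{4}"}))    # []
print(validate_value("abc1234x", {"pattern": "^[0-9]{4}$"}))  # ['patternMismatch']
print(validate_value("anything", {}))                         # []
```

The first call succeeds because four digits occur somewhere in the value. Only the anchored pattern of the second call requires the whole value to consist of four digits.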

Validation with record types

Record types are arbitrary strings attached to a record as flags. An Avram validator SHOULD support the following rule to enable additional validation depending on record types.

  15. recordTypes: If a field definition contains key types with a JSON object, the record types of a record must be looked up in this object to corresponding typed field definitions. All resulting typed field definitions must then be used for additional value validation with their keys pattern, positions, and codes.

Example

Validation with positions

A string value is valid against positions if all substrings defined by character positions of the positions are valid by value validation against the corresponding data element definitions. Character positions are counted by Unicode code points.

  16. invalidFlag: If a data element definition contains key flags, the substring MUST also consist of a concatenation of codes from the codelist defined by flags.
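Validation with positions can be sketched as slicing the value and validating each slice. Several points in the sketch below are assumptions rather than statements of the specification: character positions are taken as zero-based with an inclusive end, a value too short to cover a position is reported as invalidPosition, and codes and flags are given as explicit codelists. Python strings are indexed by code point, which fits the counting rule above.

```python
import re

def split_position(position):
    """Return (start, end) of a character position such as '00-03'."""
    m = re.fullmatch(r"([0-9]+)(?:-([0-9]+))?", position)
    if not m:
        raise ValueError(f"not a character position: {position!r}")
    return int(m.group(1)), int(m.group(2) or m.group(1))

def is_concatenation(text, codes):
    """True if text can be written as a concatenation of the given codes."""
    reachable = [True] + [False] * len(text)
    for i in range(len(text)):
        if reachable[i]:
            for code in codes:
                if code and text.startswith(code, i):
                    reachable[i + len(code)] = True
    return reachable[len(text)]

def validate_positions(value, positions):
    """Return (rule, position) pairs for violations found in value."""
    errors = []
    for position, element in positions.items():
        start, end = split_position(position)
        part = value[start:end + 1]
        if len(part) != end - start + 1:
            # Assumption: a value that does not cover the position is invalid
            errors.append(("invalidPosition", position))
            continue
        if "pattern" in element and re.search(element["pattern"], part, re.DOTALL) is None:
            errors.append(("patternMismatch", position))
        if "codes" in element and part not in element["codes"]:
            errors.append(("undefinedCode", position))
        if "flags" in element and not is_concatenation(part, element["flags"]):
            errors.append(("invalidFlag", position))
    return errors

positions = {
    "00-03": {"pattern": "^[0-9]{4}$"},
    "04": {"codes": {"a": "first", "b": "second"}},
    "05-06": {"flags": {"x": "flag X", "y": "flag Y"}},
}
print(validate_positions("2024axy", positions))  # []
print(validate_positions("20x4cxz", positions))
# [('patternMismatch', '00-03'), ('undefinedCode', '04'), ('invalidFlag', '05-06')]
```

The flag check uses a small reachability table instead of fixed-size chunks, so it also works when the codes of a codelist differ in length.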

Validation with codelists

  17. undefinedCode: A string value is valid against an explicit codelist if the value is a defined code in this codelist.

  18. deprecatedCode: A string value is not valid against an explicit codelist if its code has key deprecated set to true.

  19. undefinedCodelist: A string value is valid against a codelist reference if the codelist reference can be resolved and the value is defined in the resolved explicit codelist.

Applications MAY also resolve codelist references against externally defined explicit codelists by implicitly extending the codelist directory of the schema. If so, the application MUST make clear whether codelists directly defined in the codelist directory are overridden or extended.
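The three codelist rules can be combined in one function. The Python sketch below assumes that a codelist is either a dictionary (an explicit codelist) or a string (a codelist reference), that the codelist directory is passed in as a plain dictionary, and that the codes of a directory entry are always an explicit codelist. Reporting undefinedCode for an unknown value in a resolved codelist is a choice made for this example.

```python
def resolve_codelist(codelist, directory):
    """Return an explicit codelist, or None if a reference is unknown."""
    if isinstance(codelist, dict):
        return codelist
    entry = directory.get(codelist)
    return entry["codes"] if entry is not None else None

def validate_code(value, codelist, directory=None):
    """Return the violated codelist rule as a list with zero or one name."""
    explicit = resolve_codelist(codelist, directory or {})
    if explicit is None:
        return ["undefinedCodelist"]
    if value not in explicit:
        return ["undefinedCode"]
    definition = explicit[value]
    # A code definition may be a plain string, which cannot be deprecated
    if isinstance(definition, dict) and definition.get("deprecated") is True:
        return ["deprecatedCode"]
    return []

directory = {
    "languages": {
        "codes": {"en": "English", "la": {"label": "Latin", "deprecated": True}}
    }
}
print(validate_code("en", "languages", directory))  # []
print(validate_code("la", "languages", directory))  # ['deprecatedCode']
print(validate_code("en", "countries", directory))  # ['undefinedCodelist']
```

An application that resolves references against external codelists could merge them into the directory argument before calling the function, which keeps the decision about overriding or extending in one place.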

Counting

Avram schemas can also be used to give or expect a number of elements with keys records at root level and keys records and total at field definitions, subfield definitions and code definitions. Support of the following counting rules in Avram validators is OPTIONAL. An Avram validator MUST document whether it supports counting rules or not.

Validation rules for counting are:

  20. countRecord to enable counting the total number of records, and the number of records in which each field with a field definition, each subfield with a subfield definition, and each code with a code definition is found.

  21. countField to enable counting the total number of times each field from the field schedule is found.

  22. countSubfield to enable counting the total number of times each subfield from a subfield schedule is found.
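For fields, the difference between the two kinds of numbers can be sketched as follows: records counts the records a field occurs in, while total counts every occurrence. The sketch assumes that records are lists of dictionaries with key tag and that field identifiers are plain tags. The shape of the returned object is hypothetical, and subfields and codes are not counted.

```python
from collections import Counter

def count_fields(records, fields):
    """Count records and field occurrences for fields of a field schedule."""
    total_records = 0
    records_with_field = Counter()
    field_totals = Counter()
    for record in records:
        total_records += 1
        tags = [f["tag"] for f in record if f["tag"] in fields]
        field_totals.update(tags)             # every occurrence counts
        records_with_field.update(set(tags))  # each record counts once per tag
    return {
        "records": total_records,
        "fields": {
            tag: {"records": records_with_field[tag], "total": field_totals[tag]}
            for tag in fields
        },
    }

records = [
    [{"tag": "010"}, {"tag": "084"}, {"tag": "084"}],
    [{"tag": "084"}],
]
print(count_fields(records, {"010": {}, "084": {}}))
# {'records': 2, 'fields': {'010': {'records': 1, 'total': 1}, '084': {'records': 2, 'total': 3}}}
```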

If selected counting rules are supported and enabled, then the following must be checked by an Avram validator:

Validation with external validation rules

By default external validation rules are ignored for validation because their semantics is out of the scope of this specification. The following rule can be enabled to require records to meet all external rules:

  23. externalRule: Requires an Avram validator to process all external rules and reject input data as invalid if a rule is violated or cannot be checked.

References

Normative references

Informative references

Implementations and public Avram schemas

Appendix

Acknowledgments

Thanks to Péter Király for picking up the idea and for collaborative development. Thanks to Carsten Klee, Ed Summers, Harry Gegic, Johann Rolschewski, Stefan Majewski, Thomas Frings, and Timothy Thompson for comments, code and contributions.

Changes

0.9.6 (2024-01-19)

0.9.5 (2024-01-12)

0.9.4 (2024-01-02)

0.9.3 (2023-12-22)

0.9.2 (2023-11-29)

0.9.1 (2023-11-27)

0.9.0 (2023-10-27)

0.8.2 (2022-09-01)

0.8.1 (2022-06-20)

0.8.0 (2022-04-25)

0.7.1 (2021-10-01)

0.7.0 (2021-09-29)

0.6.0 (2020-09-15)

0.5.0 (2020-08-04)

0.4.0 (2019-05-09)

0.3.0 (2018-03-16)

0.2.0 (2018-03-09)

0.1.0 (2018-02-20)