Avram Specification

Avram is a schema language for field-based data formats such as key-value records or the library formats MARC and PICA.

Introduction

MARC and related formats such as PICA and MAB have been used for decades as the basis for library automation. Several variants, dialects and profiles exist for different applications. The Avram schema language makes it possible to specify individual formats for documentation, validation, and requirements engineering. The schema language is named after Henriette D. Avram (1919-2006), who devised MARC as the first automated cataloging system in the 1960s.

The Avram specification consists of a schema format based on JSON and validation rules to validate records against individual schemas. The format can also be used to express results of record analysis. Avram schemas cover library formats based on MARC and PICA as well as simple key-value structures.

The document is managed in a git repository at https://github.com/gbv/avram together with test files for implementations.

Conformance requirements

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Data types

A string is a sequence of Unicode code points.

A timestamp is a date or datetime as defined by the XML Schema datatypes dateTime (-?YYYY-MM-DDThh:mm:ss(\.s+)?(Z|[+-]hh:mm)?), date (-?YYYY-MM-DD(Z|[+-]hh:mm)?), gYearMonth (-?YYYY-MM), or gYear (-?YYYY).
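As a rough illustration, the four shapes listed above can be checked with regular expressions. The following is a minimal sketch, not part of the specification: the function name looks_like_timestamp is hypothetical, the patterns are transcribed from the shapes given in the text, and only the syntactic form is checked (a month 13 would pass).

```python
import re

# Optional timezone part: "Z" or an offset such as "+02:00".
_TZ = r"(Z|[+-][0-9]{2}:[0-9]{2})?"

# Syntactic shapes only, transcribed from the patterns in the text.
# Calendar validity (e.g. month 13, day 32) is NOT checked in this sketch.
_TIMESTAMP_PATTERNS = [
    r"-?[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}(\.[0-9]+)?" + _TZ,  # dateTime
    r"-?[0-9]{4}-[0-9]{2}-[0-9]{2}" + _TZ,  # date
    r"-?[0-9]{4}-[0-9]{2}",                 # gYearMonth
    r"-?[0-9]{4}",                          # gYear
]


def looks_like_timestamp(value: str) -> bool:
    """Return True if value has one of the four timestamp shapes."""
    return any(re.fullmatch(p, value) for p in _TIMESTAMP_PATTERNS)
```

An application that needs real date checking would additionally have to validate the calendar values, which this sketch leaves out.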

A regular expression is a non-empty string that conforms to the ECMA 262 (2015) regular expression grammar. The expression is interpreted as a Unicode pattern with . matching all characters, including newlines.

A language is a natural language identifier as defined with XML Schema datatype language.

A non-negative integer is a natural number (0, 1, 2...).

A range is a sequence of digits, optionally followed by a dash (-) and a second sequence of digits with the same length and a numerical value larger than the first sequence (examples: 0, 00, 3-7, 03-12, 01-09...). A string matches a range if it is a sequence of digits of the same length as the sequence(s) in the range and its numerical value is equal to or between the numerical value(s) of the range. Applications MAY accept and normalize two sequences of different length to valid ranges.
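The range rules can be sketched in a few lines. This is an illustrative sketch only: the function names parse_range and matches_range are hypothetical, and the optional normalization of sequences with different lengths is not implemented (such input is rejected).

```python
import re


def parse_range(rng: str) -> tuple[int, int, int]:
    """Return (length, low, high) for a range string or raise ValueError."""
    m = re.fullmatch(r"([0-9]+)(?:-([0-9]+))?", rng)
    if not m:
        raise ValueError(f"not a range: {rng!r}")
    first, second = m.group(1), m.group(2)
    if second is None:
        # A single sequence of digits matches only its own value.
        return len(first), int(first), int(first)
    # Both sequences need the same length, the second a larger value.
    if len(first) != len(second) or int(second) <= int(first):
        raise ValueError(f"not a valid range: {rng!r}")
    return len(first), int(first), int(second)


def matches_range(value: str, rng: str) -> bool:
    """Check whether a string of digits matches the given range."""
    length, low, high = parse_range(rng)
    if len(value) != length or not re.fullmatch(r"[0-9]+", value):
        return False
    return low <= int(value) <= high
```

For instance, matches_range("05", "03-12") holds, while matches_range("5", "03-12") does not because the lengths differ.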

Records

Avram schemas are used to validate and analyze records. A record is a non-empty sequence of fields, each consisting of a tag, being a non-empty string and

Fields with subfields, also called variable fields, MAY also have

The record model can further be restricted by a format family, identified by a non-empty string. The following format families are part of this specification and imply restrictions on schemas for this format family:

The encoding of records in individual serialization formats such as MARCXML, ISO 2709, or PICA JSON is out of the scope of this specification.

Schema format

An Avram Schema is a JSON object given as a serialized JSON document or in any other format that encodes a JSON document. In contrast to RFC 7159, all object keys MUST be unique. String values SHOULD NOT be the empty string. Applications MAY remove keys with empty string value.
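Many JSON parsers silently keep the last of several duplicate keys, so the uniqueness requirement needs an explicit check. The following sketch shows one way to do this with the object_pairs_hook parameter of Python's json module; the function names are hypothetical and the optional removal of empty strings is left out.

```python
import json


def reject_duplicate_keys(pairs):
    """Build a dict from key/value pairs, refusing repeated keys."""
    obj = {}
    for key, value in pairs:
        if key in obj:
            raise ValueError(f"duplicate object key: {key!r}")
        obj[key] = value
    return obj


def load_schema(text: str) -> dict:
    """Parse a serialized schema, enforcing unique keys at every level."""
    schema = json.loads(text, object_pairs_hook=reject_duplicate_keys)
    if not isinstance(schema, dict):
        raise ValueError("an Avram schema must be a JSON object")
    return schema
```

The hook is called for every JSON object in the document, so duplicates inside nested objects are detected as well.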

A schema MUST contain key

A schema SHOULD contain keys documenting the format defined by the schema:

The schema MAY contain keys:

Example
{
  "fields": { },
  "title": "MARC 21 Format for Classification Data",
  "description": "MARC format for classification numbers and captions associated with them",
  "url": "https://www.loc.gov/marc/classification/",
  "profile": "http://format.gbv.de/marc/classification",
  "language": "en",
  "$schema": "https://format.gbv.de/schema/avram/schema.json"
}

Field schedule

A field schedule is a JSON object that maps field identifiers to field definitions.

Example
{
  "010": { "label": "Library of Congress Control Number" },
  "084": { "label": "Classification Scheme and Edition" }
}

Field identifiers of a field schedule SHOULD NOT overlap. Two field identifiers overlap when it is possible to match a field with both. Applications MUST either detect and reject overlapping field identifiers or match fields to the first of multiple overlapping field identifiers in alphabetical sort order.

Field identifier

A field identifier is a non-empty string that can be used to match fields. The identifier consists of a tag, optionally followed by

Applications MAY further allow a tag followed by the slash and two zeroes (/00) as alias for a bare tag.

A field matches a field identifier if the tag of the field is equal to the tag of the field identifier, and

Examples

Field definition

A field definition is a JSON object that SHOULD contain key:

The field definition MAY further contain keys:

A field definition MUST NOT contain both keys for flat fields (positions, pattern, codes and/or deprecated-codes) and keys for variable fields (subfields and/or deprecated-subfields) together.
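The rule above amounts to a simple key check. A minimal sketch, with a hypothetical function name and the key names taken from the sentence above:

```python
# Keys that only make sense for flat fields and for variable fields.
FLAT_KEYS = {"positions", "pattern", "codes", "deprecated-codes"}
VARIABLE_KEYS = {"subfields", "deprecated-subfields"}


def mixes_flat_and_variable(field_definition: dict) -> bool:
    """Return True if the definition uses keys of both kinds together."""
    keys = set(field_definition)
    return bool(keys & FLAT_KEYS) and bool(keys & VARIABLE_KEYS)
```

A definition such as {"pattern": "^[0-9]+$", "subfields": {}} would be reported by this check, while a definition with only a label would not.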

If a field definition is given in a field schedule, each of tag, occurrence and counter MUST either be missing or have the same value as used to construct the corresponding field identifier.

Applications MAY allow and remove occurrence keys with value two zeroes (00) as alias for a field definition without occurrence.

Example

Positions

Subfield values and flat field values can be specified with positions, being a JSON object that maps character positions to data element definitions. A character position is a range not consisting of zeroes only. It is RECOMMENDED to use sequences of two digits.

A data element definition is a JSON object that SHOULD contain key:

The data element definition MAY further contain keys:

Example

Subfield schedule

A subfield schedule is a JSON object that maps subfield codes to subfield definitions. A subfield code is a single character. A subfield definition is a JSON object that SHOULD contain keys:

The subfield definition MAY further contain keys:

Example

Indicator definition

An indicator definition is a JSON object that SHOULD contain key

and further MAY contain keys:

Example
{
  "label": "Type",
  "codes": {
    " ": "Abbreviated key title",
    "0": "Other abbreviated title"
  }
}

Codelist

A codelist is

A code is a non-empty string. A code definition is a JSON object with optional keys:

A codelist directory is a JSON object that maps referenced codelists to explicit codelists.

Examples (explicit, referenced, and truncated codelist directory)
{
  " ": {
    "label": "No specified type"
  },
  "a": {
    "label": "Archival"
  },
  "x": { }
}

"http://id.loc.gov/vocabulary/languages"

{
  "http://id.loc.gov/vocabulary/languages": {
    "eng": { "label": "English" },
    "fre": { "label": "French" }
  }
}

External validation rules

An Avram Schema MAY include references to additional validation rules with key checks at the root level, at field schedules, and at subfield schedules. The value of this key MUST be a string or a JSON array of strings. Strings SHOULD be URIs.

Example
{
    "fields": {
        "birth": {
            "subfields": {
                "Y": { "label": "year" },
                "M": { "label": "month" },
                "D": { "label": "day" }
            },
            "checks": "http://example.org/valid-date"
        },
        "death": {
            "subfields": {
                "Y": { "label": "year" },
                "M": { "label": "month" },
                "D": { "label": "day" }
            },
            "checks": "http://example.org/valid-date"
        }
    },
    "checks": [
        "death must not be earlier than birth",
        "birth only allowed before 1950 for privacy reasons"
    ]
}
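Because the value of checks can be either a single string or an array of strings, an application will usually normalize it before use. A minimal sketch with a hypothetical function name:

```python
def normalize_checks(value) -> list[str]:
    """Turn a checks value (string or list of strings) into a list."""
    if isinstance(value, str):
        return [value]
    if isinstance(value, list) and all(isinstance(v, str) for v in value):
        return list(value)
    raise TypeError("checks must be a string or an array of strings")
```

Applied to the example above, the checks of field birth become a list with one URI and the checks at the root level stay a list of two strings.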

Restrictions by format family

A format family restricts the model of records that can be described by an Avram schema. Known values of schema key family imply restrictions on field identifiers and field definitions.

flat formats

Field identifiers are plain tags. Field definitions MUST NOT include keys occurrence, counter, indicator1, indicator2, subfields, or deprecated-subfields.

marc formats

Field identifiers are plain tags and MUST either be the character sequence LDR or three digits. Field definitions MUST NOT include keys occurrence or counter. Field definitions of flat fields MUST NOT have keys indicator1 or indicator2.

pica formats

Field identifiers MUST NOT include a field counter if their tag starts with digit 0 or 1 and MUST NOT include a field occurrence if their tag starts with digit 2. Tags MUST match the regular expression ^[012][0-9][0-9][A-Z@]. Field definitions MUST NOT include keys indicator1 or indicator2.

mab formats

Field identifiers are plain tags and MUST consist of exactly three digits. Field definitions MUST NOT include keys indicator2, occurrence, or counter.
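Part of these restrictions can be expressed as lookup tables. The sketch below is illustrative only and rests on several assumptions: the family names flat, marc, pica and mab are taken from the headings above, the function name family_violations is hypothetical, and the PICA pattern is applied to the whole tag. The rules that depend on the syntax of field identifiers (counter and occurrence in pica formats) and the indicator rule for flat fields in marc formats are not covered.

```python
import re

# Keys a field definition must not contain, per format family.
FORBIDDEN_KEYS = {
    "flat": {"occurrence", "counter", "indicator1", "indicator2",
             "subfields", "deprecated-subfields"},
    "marc": {"occurrence", "counter"},
    "pica": {"indicator1", "indicator2"},
    "mab": {"indicator2", "occurrence", "counter"},
}

# Allowed tag shapes; flat formats do not restrict the tag.
# Assumption: patterns are matched against the whole tag.
TAG_PATTERNS = {
    "marc": r"LDR|[0-9]{3}",
    "pica": r"[012][0-9][0-9][A-Z@]",
    "mab": r"[0-9]{3}",
}


def family_violations(family: str, tag: str, field_definition: dict) -> list[str]:
    """List violations of the family restrictions covered by this sketch."""
    problems = []
    pattern = TAG_PATTERNS.get(family)
    if pattern and not re.fullmatch(pattern, tag):
        problems.append(f"tag {tag!r} not allowed in {family} formats")
    forbidden = FORBIDDEN_KEYS.get(family, set()) & set(field_definition)
    for key in sorted(forbidden):
        problems.append(f"key {key!r} not allowed in {family} formats")
    return problems
```

An empty result means that none of the checked restrictions is violated; it does not prove full conformance with the format family.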

Metaschema

A JSON Schema to validate Avram Schemas is available at https://format.gbv.de/schema/avram/schema.json.

Applications MAY extend the metaschema for particular format families and formats, for instance by further restriction of the allowed set of field identifiers.

Validation rules

The rules for validating records against Avram Schemas have not been fully specified yet!

Avram schemas can be used to validate records (see record validation). An Avram validator MAY limit validation to selected format families. Validation can be configured with validation options.

Record validation

A record is valid against a field schedule if

If validation option ignore_unknown_fields is enabled, all fields not matching a field identifier in the field schedule are valid by definition.

A record is valid against a schema if it is valid against the field schedule given with key fields of the schema. If validation option allow_deprecated is enabled, a record is also valid if it is valid against the field schedule given with key deprecated-fields.

Field validation

A field is valid against a field definition if the following rules are met:

Field validation of variable fields can be configured:

Tag and occurrence of a field are not included in field validation as they are part of record validation.

Subfield validation

A subfield is valid if it conforms to its corresponding subfield definition:

Subfield validation can be configured:

Value validation

A value (given as string) is valid if it conforms to a definition (given as field definition, subfield definition, indicator definition, or data element definition):

A value is always valid if the definition contains none of the keys pattern, positions, and codes.

If validation option ignore_values is enabled, all non-indicator values are valid.

Validation with positions

A string value is valid against positions if all substrings defined by character positions of the positions are valid against the corresponding data element definitions. Character positions are counted by Unicode code points.

Substrings can be empty, for instance when the value is shorter than some character position. An empty substring can be valid, depending on the data element definition.

Positions can recursively contain other positions via their data element definitions.

Validation with codelists

A string value is valid against an explicit codelist if the value is a defined code in this codelist.

To check whether a string value is valid against a referenced codelist, the codelist is resolved with the codelist directory of the Avram schema. Applications MAY resolve referenced codelists against externally defined explicit codelists. If so, the application MUST make clear whether codelists defined in the codelist directory are overridden or extended.
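The two cases can be sketched as follows. This is a minimal illustration with hypothetical function names: the codelist directory is passed in explicitly instead of being read from the schema, and raising LookupError for a referenced codelist that cannot be resolved is a choice made for this sketch, not a behavior defined above.

```python
def resolve_codelist(codelist, directory: dict):
    """Return an explicit codelist (dict) or None if it cannot be resolved."""
    if isinstance(codelist, dict):
        return codelist             # already an explicit codelist
    return directory.get(codelist)  # referenced codelist: look it up


def valid_code(value: str, codelist, directory=None) -> bool:
    """Check whether value is a defined code of the (resolved) codelist."""
    explicit = resolve_codelist(codelist, directory or {})
    if explicit is None:
        # Assumption of this sketch: unresolved references are an error.
        raise LookupError(f"cannot resolve codelist {codelist!r}")
    return value in explicit


# Codelist directory from the example in section Codelist.
directory = {
    "http://id.loc.gov/vocabulary/languages": {
        "eng": {"label": "English"},
        "fre": {"label": "French"},
    }
}
```

With this directory, valid_code("eng", "http://id.loc.gov/vocabulary/languages", directory) holds, while the code "ger" is rejected because it is not defined in the truncated list.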

Validation can further be configured:

Counting

An Avram schema can contain key records at root level and keys records and total at field definitions, subfield definitions and code definitions. By default these keys are ignored; validation can be configured to compare them with the actual number of records, subfields, and/or codes found in input data. Validation options are:

Enabling count_subfields implies count_fields and enabling count_fields or count_codes implies count_records.
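The implications between the counting options can be resolved before validation starts. A minimal sketch with a hypothetical function name, using the option names from the sentence above:

```python
def effective_count_options(enabled: set[str]) -> set[str]:
    """Add the counting options implied by the enabled ones."""
    options = set(enabled)
    if "count_subfields" in options:
        options.add("count_fields")
    if options & {"count_fields", "count_codes"}:
        options.add("count_records")
    return options
```

The order of the two steps matters: count_subfields first adds count_fields, which in turn adds count_records.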

Validation options

An Avram validator MAY support selected validation options to configure how validation rules are applied. All options MUST be disabled by default, and an option that is not supported MUST be treated as disabled. An Avram validator MUST document which options it supports. The following validation options are defined:

| Option | Aspect | Implication |
|---|---|---|
| ignore_unknown_fields | record validation | ignore fields without field definition |
| allow_deprecated | record validation, field validation, validation with codelists | include deprecated fields, subfields, and/or codelists |
| ignore_subfields | field validation | ignore subfields |
| ignore_unknown_subfields | field validation | ignore subfields without subfield definition |
| check_subfield_order | field validation | additionally validate order of subfields |
| ignore_values | value validation | ignore all flat field values and subfield values |
| ignore_codes | validation with codelists | don't validate with codelists |
| ignore_unknown_codelists | validation with codelists | don't validate with unresolveable referenced codelists |
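The requirement that options are disabled by default and when not supported can be sketched as a small resolution step. The function name resolve_options and its parameters are hypothetical; the option names are the ones defined above.

```python
# Validation options defined in this section.
DEFINED_OPTIONS = (
    "ignore_unknown_fields",
    "allow_deprecated",
    "ignore_subfields",
    "ignore_unknown_subfields",
    "check_subfield_order",
    "ignore_values",
    "ignore_codes",
    "ignore_unknown_codelists",
)


def resolve_options(requested=None, supported=DEFINED_OPTIONS) -> dict:
    """Return the state of every defined option.

    An option is enabled only if it was requested AND the validator
    supports it; everything else stays disabled.
    """
    requested = requested or {}
    return {
        name: name in supported and bool(requested.get(name, False))
        for name in DEFINED_OPTIONS
    }
```

A validator that supports only a subset of the options would pass that subset as supported, so that requests for other options have no effect.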

References

Normative references

Informative references

Implementations

Background information

Changes

0.8.2 (2022-09-01)

0.8.1 (2022-06-20)

0.8.0 (2022-04-25)

0.7.1 (2021-10-01)

0.7.0 (2021-09-29)

0.6.0 (2020-09-15)

0.5.0 (2020-08-04)

0.4.0 (2019-05-09)

0.3.0 (2018-03-16)

0.2.0 (2018-03-09)

0.1.0 (2018-02-20)