Requirements on XML Schema to enable it to be used for data models developed using the EXPRESS information specification language
(ISO 10303-11; 1994)

Summary

This document provides a brief description of the requirements on XML Schema that arise from the need to express, using XML Schema constructs, models that have been developed using the EXPRESS language (ISO 10303-11; 1994).

Background

EXPRESS is a rich and mature language for the definition of data schemas. It is part of the STEP standard (ISO 10303), and is in widespread use to define data models for large-scale industrial applications.

In 1984 an ISO effort was initiated to deal with data exchange of product information between design systems and from design to other systems, such as manufacturing. This was the start of the STEP project under ISO TC184/SC4. Then and since, the effort has been driven primarily by user companies from a wide variety of industries: aerospace (Boeing, Lockheed, British Aerospace), automotive (Ford, GM, BMW, VW, Volvo, Bosch…), process (Shell, BP), shipbuilding, construction, defence. STEP is now in production use in many of these and other companies.

The original intent was to free CAD data from dependence on proprietary computer systems and formats, and thus to enable data describing manufactured products to be exchanged between systems during the design and manufacturing process. Its usefulness has led to it needing to meet challenges beyond those originally envisaged. As well as the shape and configuration of a product, its design requirements, operating instructions, maintenance history and use within a business also need to be described. For large or expensive manufactured products, lifecycle information can be at least as important as the description of the physical configuration. The information may also need to be shared in real time between multiple users in a networked environment, not just exchanged periodically as a snapshot, which was the original intent of STEP.

The effort has resulted in several standards, the first of which is called STEP (ISO 10303). It describes a technology for describing and exchanging product data and then uses the description capability (called EXPRESS) to define many standard schemas for product data and necessary related topics.

(A prior data exchange mechanism had taught the valuable lesson that the description of the content of some data has to be independent of the syntax used for data transmission. So EXPRESS is designed to enable multiple implementation forms.)

Three other standards are also in development under TC184/SC4 using EXPRESS. EXPRESS has also been used as the definition form by other standards groups, such as the Petrotechnical Open Software Corporation (POSC), dealing with geophysical data, and EDIF (Electronic Design Interchange Format).

The STEP community and related communities have a major investment (hundreds of man-years) in the models defined in EXPRESS, and are therefore extremely keen that the XML Schema capability be able to deal with the information content of EXPRESS schemas.

The EXPRESS language

EXPRESS is a general information description language and a full ISO standard, defined in the EXPRESS Language Reference Manual (ISO 10303-11; 1994). It is a rich and powerful language for defining the structure to which a set of data should conform. The mechanisms it provides for defining the types of object used in a given data set, their properties, and the constraints to which those objects should conform are far richer than those provided by the DTD syntax of SGML and XML.

The requirements given below are abstracted from the full manual in order to provide the essential characteristics that are required to enable a future XML Schema capability to deal with schemas defined in EXPRESS. These are presented in a way that avoids the need to deal with the syntax defined by EXPRESS.

Although EXPRESS was defined specifically to deal with the specification of the information around products, it is not specific to product data and has been applied in other areas.

(Note: Unlike JTC1 and W3C standards, ISO imposes its copyright on standards. As a consequence the EXPRESS Language Reference Manual is not available on the web.)

Note on Vocabulary

EXPRESS is essentially an Entity-Attribute language. EXPRESS uses the term "entity" to refer to a type of data object, and "attribute" to refer to a property of that object. It uses the term "entity instance" to refer to an object that belongs to a given entity (type). This usage is clearly very confusing when used in an SGML or XML context, where the same terms have another meaning.

EXPRESS's "entity" corresponds roughly to SGML's "element type", and EXPRESS's "entity instance" to SGML's "element". It could be thought that EXPRESS's "attribute" is equivalent to SGML's "attribute", but this is not so, given that EXPRESS attributes can themselves be complex data objects, and cannot be constrained within the rules of SGML or XML attributes.

In this document, we shall therefore avoid the use of these terms. Instead, we shall talk of "object types" (EXPRESS entities) and "object instances" (EXPRESS entity instances), which have "properties" (EXPRESS attributes).

In this way we shall avoid the risk of assuming that an EXPRESS attribute will necessarily be encoded using an XML attribute. It is clear that an object's "property" could well be encoded as an XML element, which could either be a subelement of the object of which it is a property, or linked to it by some form of XLink, for example.

General requirements

The underlying requirement is the ability to use XML Schema to describe, without loss of information, an EXPRESS data model.

The data corresponding to the model is not necessarily encoded in XML syntax. However, it should be possible to create an XML representation of that data, or a subset of it.

Arising from this are three initial requirements:

  1. XML Schema should not assume any particular syntax in which the data is encoded.
  2. An XML instance, or a set of several XML instances, that contains data conforming to the model should be able to state explicitly whether it contains all, or only part, of the data population described by the model.
    Example: If the model describes people, organisations, and the contractual relationships between them, it should be permissible and meaningful to generate a valid XML instance which contains all the people and only the people, or one person and all the other people or organisations with whom that person has a contractual relationship.
  3. It should be possible to determine, by examining an XML instance or set of instances, whether it is meaningful to check their conformance to constraints that the schema defines on the entire data population.
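Requirements 2 and 3 can be illustrated with a minimal sketch in Python. The names Coverage, InstanceSet and population_checks_meaningful are hypothetical and chosen only for illustration; the sketch assumes that an instance carries an explicit statement of whether it holds the whole population.

```python
from dataclasses import dataclass, field
from enum import Enum


class Coverage(Enum):
    ALL = "all"    # the instance claims to hold the entire data population
    PART = "part"  # the instance holds only a subset of the population


@dataclass
class InstanceSet:
    """Hypothetical stand-in for one XML instance, or a set of them."""
    coverage: Coverage
    objects: list = field(default_factory=list)


def population_checks_meaningful(instance_set):
    """Population-wide constraints are only worth checking on a full population."""
    return instance_set.coverage is Coverage.ALL


# One instance holding only the people, and one holding everything.
people_only = InstanceSet(Coverage.PART, ["person_1", "person_2"])
everything = InstanceSet(Coverage.ALL, ["person_1", "person_2", "organisation_1"])
```

A processor receiving people_only would still be able to check constraints on each individual object, but would skip constraints that apply to the entire population.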

Requirements derived from EXPRESS

The following sections give different areas of requirements, all of which arise from the underlying requirement to be able to use XML Schema to describe a complete EXPRESS model.

Some additional requirements are included which stem from lessons learned but which are not part of EXPRESS now. Such requirements are annotated with (LL).

Declarative aspects

 

It should be possible for a data model to extend across multiple schemas. Schemas are named when they are declared, and it should be possible for object types from one schema to be used or referenced from within another.

Example: A schema for 2-dimensional geometry might have a data type point. Another schema, for geography, might use objects of type 2dgeometry.point, under an alias location to identify geographical locations.

 

Object types are named when declared. Their names are unique within a schema but need not be unique across multiple schemas.
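As a rough illustration of these two requirements, the following Python sketch models schemas as named scopes. The class Schema and its methods declare, use_from and resolve are assumptions made for this sketch and are not part of EXPRESS or of XML Schema.

```python
class Schema:
    """A named scope in which object type names are unique."""

    def __init__(self, name):
        self.name = name
        self.types = {}    # local name -> definition declared in this schema
        self.aliases = {}  # local name -> (other schema, name in that schema)

    def _check_free(self, local_name):
        if local_name in self.types or local_name in self.aliases:
            raise ValueError(f"{local_name!r} is already declared in {self.name}")

    def declare(self, type_name, definition):
        self._check_free(type_name)
        self.types[type_name] = definition

    def use_from(self, other, type_name, alias=None):
        """Make a type from another schema usable here, optionally under an alias."""
        if type_name not in other.types:
            raise KeyError(f"{other.name} does not declare {type_name!r}")
        local_name = alias or type_name
        self._check_free(local_name)
        self.aliases[local_name] = (other, type_name)

    def resolve(self, name):
        if name in self.types:
            return self.types[name]
        other, original = self.aliases[name]
        return other.resolve(original)


geometry = Schema("2dgeometry")
geometry.declare("point", ("x_coord", "y_coord"))

geography = Schema("geography")
geography.use_from(geometry, "point", alias="location")
# The same name may be declared independently in another schema.
geography.declare("point", ("latitude", "longitude"))
```

Here geography.resolve("location") yields the definition of point from the 2dgeometry schema, while geography's own point is a separate type.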

 

Declaring an object type does not imply that there need be any instances of the type in a data set corresponding to the schema.

 

When declaring an object type, it must be possible to state constraints that apply to every instance of that object type (see the section below on Constraint specifications).

 

For each object type, it is possible to declare zero or more properties that objects of that type may have. Each property has a name, which is unique among the properties of that object type, but need not be unique within the schema. Properties can be declared as optional. Every property has a declared type.

 

A schema can define data types, based on the underlying data types provided by EXPRESS, subject to specific constraints.

Example: The data type length might be defined as a REAL (underlying data type) which is non-negative (constraint)
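The length example can be sketched in Python as follows. The helper make_defined_type is hypothetical, and the Python type float is used here as a stand-in for REAL.

```python
def make_defined_type(name, underlying, constraint):
    """Return a checker for a defined data type: an underlying type plus a constraint."""
    def check(value):
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, underlying):
            raise TypeError(f"{name} requires a value of type {underlying.__name__}")
        if not constraint(value):
            raise ValueError(f"{value!r} violates the constraint on {name}")
        return value
    check.__name__ = name
    return check


# length: a REAL (float in this sketch) that is non-negative.
length = make_defined_type("length", float, lambda v: v >= 0.0)
```

Calling length(2.5) returns the value unchanged, whereas length(-1.0) raises ValueError.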

The following underlying data types are available:

  1. INTEGER
  2. REAL
  3. NUMBER (may be either integer or real)
  4. BINARY
  5. STRING (may be empty)
  6. BOOLEAN (TRUE or FALSE)
  7. LOGICAL (TRUE, FALSE or UNKNOWN)

STRING and BINARY types can be declared with fixed lengths or maximum lengths.

For REALs a precision can be provided indicating the number of significant digits that are expected for the mantissa.

 

A property's type can be one of the underlying data types, or a defined data type, or an object type.

Example: An object of type circle might have two properties: centre and radius, with radius defined as being of type length (a defined data type, declared elsewhere in the schema), and centre as being of type point (an object type). Elsewhere in the schema, point is declared, with properties x_coord and y_coord, both of type REAL (an underlying data type).
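One possible rendering of this example in Python is sketched below. The optional property label is an assumption added to illustrate optional properties, and the defined type length is approximated by a non-negativity check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Point:
    x_coord: float  # REAL, an underlying data type
    y_coord: float


@dataclass
class Circle:
    centre: Point                # property whose type is an object type
    radius: float                # property of defined type length
    label: Optional[str] = None  # hypothetical optional property

    def __post_init__(self):
        # The constraint carried by the defined type length.
        if self.radius < 0:
            raise ValueError("radius is a length and must be non-negative")


unit_circle = Circle(centre=Point(0.0, 0.0), radius=1.0)
```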

 

A property's type can also be an aggregation, of which the following kinds exist:

  1. BAG (the most general and least constrained form of aggregate)
  2. SET (all members different)
  3. LIST (ordered)
  4. ARRAY (defined size, indexed)

It is possible to define minimum and maximum sizes for aggregates.

For LIST and ARRAY, it is possible to specify that all members shall be different. (In the case of object instances, this test requires that all the members of the aggregate are different instances, rather than that they have different property values.)

Aggregates may be nested to arbitrary depth.
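The rules for aggregates can be sketched as a single checking function. This is a simplified illustration in Python: the names all_different and check_aggregate are hypothetical, and an ARRAY of defined size is approximated by giving equal minimum and maximum sizes.

```python
def all_different(members, by_identity=False):
    """True if no two members are the same, by value or by object identity."""
    keys = [id(m) for m in members] if by_identity else list(members)
    seen = []
    for key in keys:
        if key in seen:
            return False
        seen.append(key)
    return True


def check_aggregate(kind, members, min_size=0, max_size=None,
                    unique=False, by_identity=False):
    """Check size bounds and, where required, that all members are different."""
    if kind not in ("BAG", "SET", "LIST", "ARRAY"):
        raise ValueError(f"unknown aggregate kind {kind!r}")
    if len(members) < min_size:
        return False
    if max_size is not None and len(members) > max_size:
        return False
    # A SET always requires different members; LIST and ARRAY only on request.
    must_differ = kind == "SET" or (unique and kind in ("LIST", "ARRAY"))
    if must_differ:
        return all_different(members, by_identity)
    return True
```

The by_identity flag reflects the distinction made above for object instances: two distinct instances with equal property values count as different members.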

 

A property's type can also be a constructed type, of which the following kinds exist:

  1. ENUMERATION (a list of named values for the type)
    Example: primary_colour could be defined as having the allowed values red, green and blue.
  2. SELECT (defines a domain that is the union of the domains of several other types)
    Example: the employer property of a person could be defined as a SELECT of person or organisation.

SELECTs may be nested.
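The two constructed types have close analogues in many programming languages. The following Python sketch uses Enum for ENUMERATION and Union for SELECT; the helper is_valid_employer is hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union


class PrimaryColour(Enum):
    # ENUMERATION: a closed list of named values.
    RED = "red"
    GREEN = "green"
    BLUE = "blue"


@dataclass
class Organisation:
    name: str


@dataclass
class Person:
    name: str
    # SELECT of person or organisation, here also made optional.
    employer: Optional[Union["Person", Organisation]] = None


def is_valid_employer(value):
    """A value is in the SELECT's domain if it is in the domain of any member type."""
    return isinstance(value, (Person, Organisation))


acme = Organisation("Acme")
alice = Person("Alice", employer=acme)
bob = Person("Bob", employer=alice)
```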

 

An object type can be declared to be a subtype of one or more other object types. Subtyping implies an IS-A relationship between object instances. (Although often described as multiple inheritance, the use of inheritance is just one possible implementation approach.)

 

The populations of subtypes of a given object type can be specified as:

  1. Mutually exclusive, as is typical of most object-oriented programming languages (ONEOF)
  2. Potentially overlapping in population (ANDOR)
  3. Always common (AND)

Example: male, female, citizen and alien could all be declared as being subtypes of person. It is possible to state the constraint on the populations of these subtypes, that every person is ONEOF (male, female) AND ONEOF (citizen, alien).
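The constraint in this example can be sketched in Python by representing each object instance as the set of subtype names to which it belongs. The function names oneof, andor and all_of are hypothetical, and the sketch reads the example as requiring exactly one subtype from each group.

```python
def oneof(*names):
    """Rule satisfied when exactly one of the named subtypes applies."""
    return lambda subtypes: sum(name in subtypes for name in names) == 1


def andor(*names):
    """Rule placing no restriction: the named subtypes may overlap freely."""
    return lambda subtypes: True


def all_of(*rules):
    """Combine several rules, all of which must hold for the instance."""
    return lambda subtypes: all(rule(subtypes) for rule in rules)


# Every person is ONEOF (male, female) AND ONEOF (citizen, alien).
person_rule = all_of(oneof("male", "female"), oneof("citizen", "alien"))
```

With this rule, {"female", "citizen"} is an acceptable combination, while {"male", "female", "citizen"} and {"male"} are not.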

 

An object type that is a supertype can be defined as ABSTRACT. This means that it can only be instantiated as one of its subtypes.
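Python's abstract base classes give a rough analogue of an ABSTRACT supertype. In this sketch the type names Shape and Circle, and the method describe, are hypothetical.

```python
from abc import ABC, abstractmethod


class Shape(ABC):
    """ABSTRACT supertype: it can only be instantiated through a subtype."""

    @abstractmethod
    def describe(self):
        ...


class Circle(Shape):
    def describe(self):
        return "circle"


circle = Circle()  # allowed: Circle is a concrete subtype
# Shape() would raise TypeError, because Shape is abstract.
```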

 

Properties can be constrained to have unique values across the known population of object instances.

 

Combinations of properties of an object type can be constrained to have unique values across the known population of object instances.
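Both forms of uniqueness constraint can be checked with one function, sketched here in Python. Object instances are represented as dictionaries, the function name duplicate_keys and the sample data are hypothetical, and property values are assumed to be hashable.

```python
def duplicate_keys(population, property_names):
    """Return the property-value combinations that occur more than once."""
    seen = set()
    duplicates = []
    for obj in population:
        key = tuple(obj[name] for name in property_names)
        if key in seen:
            duplicates.append(key)
        else:
            seen.add(key)
    return duplicates


people = [
    {"family_name": "Smith", "given_name": "Ann", "badge": 1},
    {"family_name": "Smith", "given_name": "Bob", "badge": 2},
    {"family_name": "Smith", "given_name": "Ann", "badge": 3},
]
```

A single property is simply the case where property_names has one entry. Note that the check is made against the known population only, in line with the wording of the requirement.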

 

A property can be declared as DERIVEd and an expression provided which returns the value for the property.

Example: The diameter property for circle could be declared as derived, using the expression: twice the value of the radius property.
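A derived property corresponds closely to a computed, read-only property in an object-oriented language. A minimal Python sketch of the circle example:

```python
from dataclasses import dataclass


@dataclass
class Circle:
    radius: float

    @property
    def diameter(self):
        # Derived: computed from radius on each access, never stored.
        return 2 * self.radius


circle = Circle(radius=1.5)
```

Because diameter is never stored, it cannot become inconsistent with radius.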

 

For any given property it is assumed that the inverse property always exists. It can be named and have a cardinality specified.

Example: If the object type door has a property handle, of type knob, then the object type knob can have an inverse property ishandleof, whose value is a SET of 0 or 1 doors. This would mean that a knob can be the handle of at most one door.
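The door and knob example can be sketched in Python by computing the inverse property from the population of doors. The function names is_handle_of and inverse_cardinality_ok are hypothetical; instances are compared by identity, not by value.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # eq=False: instances are compared by identity
class Knob:
    name: str


@dataclass(eq=False)
class Door:
    name: str
    handle: Knob


def is_handle_of(knob, doors):
    """Inverse of Door.handle: the doors that have this knob as their handle."""
    return [door for door in doors if door.handle is knob]


def inverse_cardinality_ok(knobs, doors, low=0, high=1):
    """Check that each knob is the handle of between low and high doors."""
    return all(low <= len(is_handle_of(knob, doors)) <= high for knob in knobs)


brass = Knob("brass")
steel = Knob("steel")
front = Door("front", handle=brass)
back = Door("back", handle=brass)
```

A population containing only front satisfies the cardinality of 0 or 1, whereas a population containing both front and back does not, since brass would be the handle of two doors.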

 

It is possible to declare a constant - that is, a specific object instance that can be used by name in expressions.

Example: The point whose x_coord and y_coord properties are both zero, could be declared as a constant named origin, and then used to declare a subtype of circle consisting of all those circles whose centre is the origin.
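The origin example might be sketched in Python as follows. The function is_origin_circle is a hypothetical membership test for the subtype of circles centred on the origin.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a constant must not be modified
class Point:
    x_coord: float
    y_coord: float


@dataclass(frozen=True)
class Circle:
    centre: Point
    radius: float


# A named constant that can be used in expressions.
ORIGIN = Point(0.0, 0.0)


def is_origin_circle(circle):
    """Membership test for the subtype of circles whose centre is the origin."""
    return circle.centre == ORIGIN
```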

Constraint specifications

It must be possible to declare constraints on the form, values and combinations of valid instance populations. Such constraints can range from simple ones, such as requiring that an integer value be positive, to complex ones, such as requiring that the representation of a shape describes a well-formed solid.

The questions of when to apply the constraints and what to do if they are not satisfied are left open. These are considered to be business process dependent and not definable at the level of an information description standard.

EXPRESS provides a full expression language for use in constraint specifications and DERIVE expressions. It includes a number of standard functions. The usual arithmetic operators are provided. A query expression is provided which allows aggregates (including populations) to be filtered according to a Boolean expression.

Against this background, we can state the following additional requirements:

 

It must be possible to define two types of constraint: those that apply to each instance of a given object type, and those that apply to one or more entire populations of objects.

 

Constraints can be named.

 

Constraints can take the following forms:

  1. an expression on the values of properties of an object which returns a logical value. Such an expression can be a test on the result of running a query on a population of object instances.
    Example: It could be specified that the result set from a particular query must have at least one member, or must have no members, or must have more members than the result of running some other query.
  2. a function which returns a LOGICAL value
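The two types of constraint, and the use of a query within a constraint, can be illustrated with the following Python sketch. All names here (query, check, and the two example rules) are hypothetical, object instances are represented as dictionaries, and Python's two-valued booleans stand in for LOGICAL values.

```python
def query(aggregate, predicate):
    """Filter an aggregate (or a population) according to a Boolean expression."""
    return [member for member in aggregate if predicate(member)]


# A named constraint that applies to each instance.
def radius_is_positive(circle):
    return circle["radius"] > 0


# A named constraint that applies to the entire population.
def has_unit_circle(circles):
    return len(query(circles, lambda c: c["radius"] == 1.0)) >= 1


def check(population, instance_rules, population_rules):
    """Return (constraint name, offending object or None) for every failure."""
    failures = []
    for name, rule in instance_rules.items():
        for obj in population:
            if not rule(obj):
                failures.append((name, obj))
    for name, rule in population_rules.items():
        if not rule(population):
            failures.append((name, None))
    return failures


circles = [{"radius": 1.0}, {"radius": -2.0}]
failures = check(
    circles,
    {"radius_is_positive": radius_is_positive},
    {"has_unit_circle": has_unit_circle},
)
```

The function only reports failures. What to do about them is left to the caller, in keeping with the position that this is dependent on the business process.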

Metadata

The following requirements concern comments, annotations and metadata:

 

It must be possible to include comments within a schema definition.

 

It would be useful to be able to add annotations to a schema definition. (LL)

 

It should be possible to attach metadata, such as author, approval status, version number, etc. to a schema definition. (LL)

Change Management (LL)

The following requirements concern the management of schemas over time, given that data conforming to an earlier version of a schema may need to be used after the schema has changed.

 

It must be possible to add new object types to a model.

 

It must be possible to mark an object type as logically deleted from a model.

 

It should be possible to provide a mapping to the structures that should be used in the revised model for data that uses the deleted object types. (Note that there is a mapping language defined which allows the description of how data defined according to one schema shall be mapped so as to correspond to a second schema. This has direct parallels with capabilities provided in DSSSL.)

 

It should be possible to date-stamp changes and to relate them to a specific version number of the model (as mentioned under Metadata above).
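As a rough sketch of how these change management requirements might fit together, the following Python fragment records, for each object type, the version in which it was added, the version in which it was logically deleted, and a mapping to the revised model. The names TypeRecord, upgrade and registry, the version numbers and the handle type are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TypeRecord:
    name: str
    added_in: str                     # model version in which the type was added
    deleted_in: Optional[str] = None  # set when the type is logically deleted
    migrate: Optional[Callable[[dict], dict]] = None  # mapping to the revised model


def upgrade(instance, registry):
    """Map an instance of a logically deleted type onto the revised model."""
    record = registry[instance["type"]]
    if record.deleted_in is None:
        return instance
    if record.migrate is None:
        raise LookupError(f"no mapping provided for deleted type {record.name!r}")
    return record.migrate(instance)


registry = {
    "knob": TypeRecord(
        "knob", added_in="1.0", deleted_in="2.0",
        migrate=lambda old: {"type": "handle", "style": "knob", "name": old["name"]},
    ),
    "handle": TypeRecord("handle", added_in="2.0"),
}
```

Because deleted types remain in the registry, data conforming to the earlier version of the model can still be recognised and mapped.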

 

Acknowledgements

Various people from within SC4 WGs have contributed to this document. These include:

Nigel Shaw nigel.shaw@EuroSTEP.com

Daniel Rivers-Moore daniel.rivers-moore@rivcom.com

Robin Lafontaine robin@monsell.co.uk

Peter Bergstrom peter.bergstrom@EuroSTEP.com