Structured Datatypes and the Web Ontology Language

A proposed issue for the Web Ontology Working Group

Integrating structured (e.g. XML and multimedia) datatypes into the Web Ontology Language (OWL) falls within the working group's charter and is an explicit requirement for the language.

Dan Brickley has posted a terrific (IMHO) summary of the requirement: http://lists.w3.org/Archives/Public/public-webont-comments/2002Apr/0004.html

The XML community and, before it, the SGML community have long been interested in the graphical representation and manipulation of structured information, including multimedia, an approach that has been called "Groves" (Graphical Representation Of property ValuES) [1,2]. Such representations lend themselves naturally to RDF descriptions [3]. There has been an explicit hope that an RDF Schema description of the XML Infoset will allow "validation" of an RDF/Infoset representation of an XML document [4].

I propose that WebOnt accept this challenge (my preliminary work suggests that we are up to the task). Integrating XML and XML Schema datatypes in this fashion will give the XML community a concrete and tangible benefit from OWL, and will allow OWL to reason properly about structured XML and multimedia datatypes.

RDF Core is developing model theory (MT) extensions for simple or concrete datatypes. The proposal outlined below does not duplicate this effort; rather, it is directed at complex or structured datatypes. I will discuss why the approach taken by RDF Datatypes, while perfectly reasonable for concrete datatypes, cannot be directly extended to structured datatypes, primarily because of some technical details of XML Schema.

To summarize:

1) There is a desire to incorporate and reason about structured datatypes (e.g. XML Schema complexTypes).

2) RDF Datatypes, and by extension OWL's DatatypeProperty, deal with concrete or string based datatypes (e.g. XML Schema simpleTypes). A preliminary WD is at http://www-nrc.nokia.com/sw/rdf-datatyping.html

3) Technical issues involved with integration of general XML types, XML Schema datatypes and XQuery formal types are discussed below.

4) A proposed solution to the problem of integrating general XML Schema datatypes is presented.

Issues involved with integration of XML types and XML Schema datatypes into OWL:

In a perfect RDF world there would be a URIreference for each XML Schema type (otherwise known as an XML Schema particle). It turns out that XML Schema has defined URIs for a fixed set of basic datatypes but this involves doing a bit of weirdness with internal XML subsets and labelling these specific XML Schema particles with "id"s. Suffice to say that 99.9% of XML Schemas in the wild don't go to this effort, nor should our solution mandate it. See http://www.w3.org/2001/XMLSchema.xsd for details.

For those of you at home, XML Schema type names are XML QNames (e.g. xsd:string), and at face value it should be, and is, possible to derive a URIreference from a QName. The problem is that an XML Schema may use the same QName for an element declaration, an attribute declaration and a type definition (elements, attributes and types each have their own symbol space, although simple and complex type definitions share one). That is, the QName does not uniquely identify an XML Schema particle.

RDF Datatypes assume XML Schema simple types, so for this specific purpose a URIreference would work -- although there is nothing in the XML world connecting an XML Schema particle name="foo" attribute value to a URIreference but that is another issue.

XML Schema's overloading of particle names was an explicit design decision taken directly from how XML 1.0 itself defines types and type names. XML 1.0 (http://www.w3.org/TR/REC-xml) defines an element type as the GI (generic identifier), or name, of the element. Element and attribute names are, however, not disjoint. For example, the following is perfectly legal XML:

<foo foo="12345" />

An attribute itself has a type: CDATA, which is text; ID, which is a unique identifier; IDREF, whose value references an element carrying such a uniquely identifying attribute; NMTOKEN, which places constraints on the string (e.g. no whitespace); NMTOKENS, which allows multiple NMTOKEN values; IDREFS; and so on.

It is apparent that creating a URIreference by composing an XML document's base URI with the element or attribute name will not uniquely identify the element or attribute type definition, i.e. the corresponding declaration in the DTD (document type definition). This is because element and attribute names are not disjoint: the same name may be declared as both, so the name alone does not say which declaration is meant. This has been carried over to XML Schema.

In XML Schema:

<xsd:element name="foo" />

<xsd:attribute name="foo" />

<xsd:simpleType name="foo" />

<xsd:complexType name="foo" />

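may share the name "foo" in the same schema, with the caveat that simple and complex type definitions share one symbol space, so only one of the two type definitions above may appear in a given schema. Indeed: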

<xsd:element name="foo" type="foo" />

defines an element "foo" which has a type defined by the complex type whose name="foo".
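The name clash described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the target namespace http://example.org/ns, the "namespace#name" convention in naive_uri, and the "kind(name)" disambiguation are all hypothetical and are not taken from XML Schema or from any RDF specification.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# A small schema in which an element, an attribute and a complex type
# are all named "foo". The target namespace is a made-up example.
SCHEMA = """\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns="http://example.org/ns"
            targetNamespace="http://example.org/ns">
  <xsd:element name="foo" type="foo"/>
  <xsd:attribute name="foo"/>
  <xsd:complexType name="foo"/>
</xsd:schema>
"""


def naive_uri(namespace, local_name):
    """Assumed convention: namespace + '#' + local name."""
    return f"{namespace}#{local_name}"


def named_components(schema_text):
    """Return (kind, namespace, name) for each named top-level component."""
    root = ET.fromstring(schema_text)
    namespace = root.get("targetNamespace", "")
    found = []
    for child in root:
        # ElementTree tags look like "{namespace-uri}localname".
        if child.tag.startswith("{" + XSD_NS + "}") and child.get("name"):
            kind = child.tag.split("}", 1)[1]
            found.append((kind, namespace, child.get("name")))
    return found


components = named_components(SCHEMA)

# URIs built from the QName alone: every component maps to the same URI.
naive = {naive_uri(ns, name) for _, ns, name in components}

# Adding the component kind (a hypothetical scheme) keeps them apart.
qualified = {f"{ns}#{kind}({name})" for kind, ns, name in components}

print(len(components), len(naive), len(qualified))
```

Three components collapse to a single URI when only the QName is used, and stay distinct once the kind of component is part of the identifier. Whether the kind should be carried in the URI itself or elsewhere is a design question this sketch does not try to settle.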

XML Schema does, however, define a type hierarchy, and it is the goal of this proposal to seamlessly integrate the XML Schema type hierarchy into the OWL class hierarchy. Indeed an XML Schema processor, which accepts an input XML infoset and adorns it with types (and other bits of information) to produce a "post schema validation infoset" or PSVI in XML Schema terms, can be seen as a specialized 'classifier' that operates on 'StructuredProperty' values.

A proposed solution

Class membership of instances can be represented by a subClassOf relationship between the class composed of a single individual and a particular super class. An individual represents some particular RDF graph. In the case of an XML document, or part of an XML document, there exists an Infoset representation. The infoset is modelled as an RDF graph in a very straightforward fashion. Indeed a simple XSLT transform converts an arbitrary XML document into the RDF graph form (e.g. http://www.openhealth.org/WOWG/XMLtoSchema.xsl).
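To make the idea of an infoset as an RDF graph concrete, here is a minimal Python sketch that walks an XML fragment and emits triples. It is not the XSLT transform referenced above; the "infoset:" property names and the blank-node labels are assumptions made for illustration, and the sketch ignores text content, namespaces and the document order of children.

```python
import xml.etree.ElementTree as ET
from itertools import count


def infoset_triples(xml_text):
    """Return a list of (subject, predicate, object) triples for the XML."""
    ids = count(1)
    triples = []

    def visit(elem, parent):
        # Each element information item becomes a blank node.
        node = f"_:n{next(ids)}"
        triples.append((node, "rdf:type", "infoset:Element"))
        triples.append((node, "infoset:name", elem.tag))
        if parent is not None:
            triples.append((parent, "infoset:child", node))
        # Each attribute information item becomes its own blank node,
        # so an attribute named like its element stays distinguishable.
        for attr_name, value in elem.attrib.items():
            attr = f"_:n{next(ids)}"
            triples.append((node, "infoset:attribute", attr))
            triples.append((attr, "rdf:type", "infoset:Attribute"))
            triples.append((attr, "infoset:name", attr_name))
            triples.append((attr, "infoset:value", value))
        for child in elem:
            visit(child, node)

    visit(ET.fromstring(xml_text), None)
    return triples


triples = infoset_triples('<foo foo="12345"/>')
for triple in triples:
    print(triple)
```

Because the element and the attribute are typed separately in the graph, the shared name "foo" no longer causes ambiguity at the instance level.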

An XML Schema type [5], an XQuery formal type [6], or a type in another schema language represented as a DOM Abstract Schema [7] may each express constraints on a particular piece of XML. Such a type defines a class whose instance set is the set of XML data values whose Infoset conforms to the constraints defined by the type declaration.

As such, one can in principle develop an OWL class definition such that infoset graphs representing pieces of XML that conform to a particular type are members of the class.
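The view of a type as a class whose members are the conforming pieces of XML can be sketched as a toy classifier in Python. Everything here is hypothetical: StructuredType, its two kinds of constraint, and the example type names are stand-ins chosen for illustration, and they are far weaker than what an XML Schema processor or an OWL reasoner would actually check.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class StructuredType:
    """Hypothetical stand-in for a class derived from a structured type."""
    required_children: tuple = ()
    attribute_checks: dict = field(default_factory=dict)

    def has_instance(self, elem):
        """True if the element satisfies every constraint of this type."""
        child_names = [child.tag for child in elem]
        if any(name not in child_names for name in self.required_children):
            return False
        for attr, check in self.attribute_checks.items():
            value = elem.get(attr)
            if value is None or not check(value):
                return False
        return True


def classify(elem, types):
    """Return the sorted names of all types the element is a member of."""
    return sorted(name for name, t in types.items() if t.has_instance(elem))


TYPES = {
    # No constraints: every element is a member.
    "AnyRecord": StructuredType(),
    # Requires a <name> child and an all-digit "id" attribute.
    "NamedRecord": StructuredType(
        required_children=("name",),
        attribute_checks={"id": str.isdigit},
    ),
}

conforming = ET.fromstring('<record id="12345"><name>example</name></record>')
other = ET.fromstring('<record id="abc"/>')

print(classify(conforming, TYPES))
print(classify(other, TYPES))
```

The first element is classified under both types and the second only under the unconstrained one, which loosely mirrors how a more specific class sits below a more general one in a class hierarchy.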

This work has begun with the development of XSLT transforms that convert XML Schemas and XQuery formal language type declarations into OWL Class definitions: http://www.openhealth.org/WOWG/XSDtoSchema.xsl and http://www.openhealth.org/WOWG/RNGtoSchema.xsl. Although these transforms are not yet complete, they serve as an outline of how the proposed solution would work and how an OWL processor might actually go about deciding, for example, whether a piece of XML belongs to a particular class. It should be noted that this approach will work both for classes defined by an XML Schema QName and for classes written directly in OWL.

         Jonathan Borden, M.D.
         Assistant Professor of Neurosurgery
         Tufts-New England Medical Center
         Boston MA
         jonathan@openhealth.org
         The Open Health Care Group

References: