CS835 - Data and Document Representation & Processing

Lecture 14 – Document Engineering


Why Businesses use Documents


Documents are Everywhere

• Documents are a purposeful and self-contained collection of information.

–       Interfaces for people.

–       Interfaces to business processes.

• Documents cover a spectrum of types to suit their target audience.

• Using documents as interfaces allows for loosely coupled business processes.

–       The document (and only the document) connects the processes.



Using documents for exchanging business information is natural and intuitive.



Doing business by document exchange

• Every major advance in technology has brought a corresponding evolution in business processes and the document exchanges they require.

• We don’t use pottery, papyrus, and parchment anymore, and electronic versions have replaced many paper documents.


• The basic idea of document exchange has changed very little.


Pottery Tax Receipt 2500 BCE

XML Tax Receipt 2001 CE


<?xml version="1.0" ?>

<!DOCTYPE NAT4191-9.2001>







<EFT_CODE> 97999 999 999 9360</EFT_CODE>







Document Exchange

• Document exchange is the mother of all business patterns.

• Business model patterns:

– marketplace, auction, supply chain, build to order, drop shipment, vendor managed inventory, etc.

• Business process patterns:

– procurement, payment, shipment, reconciliation, etc.

• Document patterns:

– catalog, purchase order, invoice, etc.

• Some new business models are only document exchanges.



Motivating "Document Engineering"


·        Customer selects book from catalog on an online bookstore

·        Customer pays with credit card

·        Book arrives via express shipper two days later

·        From the customer's perspective there is only one "transaction"

·        But the bookstore is a virtual enterprise that follows the drop shipment pattern to coordinate the activities of four different service providers transacting with each other

·        This coordination - or choreography - is carried out with document exchanges
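The choreography above can be sketched as an ordered list of document exchanges. This is only an illustration: the slides do not name the four service providers or the documents they exchange, so the roles (Bookstore, CreditAuthority, Distributor, Shipper) and the document types below are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """One document exchanged between two parties (all names are hypothetical)."""
    order_id: str
    doc_type: str
    sender: str
    receiver: str


def drop_shipment(order_id: str) -> list[Document]:
    """Return the document exchanges triggered by one customer order."""
    steps = [
        ("PurchaseOrder", "Customer", "Bookstore"),
        ("PaymentAuthorizationRequest", "Bookstore", "CreditAuthority"),
        ("PaymentAuthorization", "CreditAuthority", "Bookstore"),
        ("PurchaseOrder", "Bookstore", "Distributor"),
        ("ShippingRequest", "Distributor", "Shipper"),
        ("ShipmentNotice", "Shipper", "Customer"),
    ]
    return [Document(order_id, doc_type, sender, receiver)
            for doc_type, sender, receiver in steps]


exchanges = drop_shipment("order-001")
for doc in exchanges:
    print(f"{doc.sender} -> {doc.receiver}: {doc.doc_type}")

# The customer takes part in only the first and last exchange,
# which is why the purchase looks like a single transaction from outside.
customer_facing = [d for d in exchanges if "Customer" in (d.sender, d.receiver)]
providers = {p for d in exchanges for p in (d.sender, d.receiver)} - {"Customer"}
```

Only documents connect the parties here; no provider calls into another provider's internal process, which is the loose coupling the earlier slides describe.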


Example: Buying a Book Online - GMBooks.com


The Real Question is “What” not “How”

• But what should these documents contain?

• The real challenge is understanding what the content of documents means.

– We call this “interoperability”

• The technologies for Web Services don’t address interoperability.

–       They ignore it.



• A basic requirement for two businesses to conduct business is that their business systems interoperate.

– The meaning of the information exchanged is understood as intended.

• This has always been true, regardless of the technology used.

• Interoperability requires:

– that parties can exchange information and use the information they exchange.

– that the information being exchanged is conceptually equivalent.

• Easy to express but hard to achieve.

– Variations in strategies, technology platforms, legacy applications, business processes, and terminology.

–       Different “contexts of use”







Approaches to Interoperability

• Interoperability doesn’t require that business systems be identical.

• There are two alternatives:

1. Build expensive, customized, tightly coupled solutions, or


2. Engineer equivalent conceptual models.


What Do We Mean by a Conceptual Model?

• Simplified description of a subject:

– abstracts from its complexity.

– emphasizes some features or characteristics.

– intentionally de-emphasizes others.

• Remove the features that depend on implementation technology and focus on meaning.

– Not “how” but “what” (again!)

• Note: an XML schema is a “physical” model

– Describes the XML expression of documents that share a common information model.

–       Limited by the technology.
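The difference between a conceptual model and a physical model can be sketched by serializing one conceptual description into two syntaxes. This is a minimal illustration under assumptions: the `TaxReceipt` class, its field names, and the sample values are invented for the example and are not taken from the receipt shown earlier.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import asdict, dataclass


@dataclass
class TaxReceipt:
    """Conceptual model: what the document means, independent of any syntax."""
    payer_name: str
    amount: str
    currency: str


def to_xml(receipt: TaxReceipt) -> str:
    """One physical model: an XML expression of the receipt."""
    root = ET.Element("TaxReceipt")
    for name, value in asdict(receipt).items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


def to_json(receipt: TaxReceipt) -> str:
    """Another physical model: a JSON expression of the same receipt."""
    return json.dumps(asdict(receipt), sort_keys=True)


receipt = TaxReceipt(payer_name="A. Payer", amount="120.00", currency="AUD")
xml_form = to_xml(receipt)
json_form = to_json(receipt)
print(xml_form)
print(json_form)
```

The two strings look nothing alike, yet both carry the same information because they share one conceptual model.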


Problems with Modeling Documents

• The names of components are only a small part of their semantic definition:

– XML is not self-describing.

– modelers will often choose different names for the same component.

• Different document samples can lead to incompatible models.

• All model expressions have technological limitations.

– XML schemas cannot do everything.

• So how do we solve this challenge?


Def: Document Engineering

• An approach to modeling the document exchanges between enterprises as a means of customizing them for particular industries or domains (contexts of use).

• Comprises a set of analysis and design techniques that yield meaningful models of document exchanges.

• Encourages re-use of common patterns for models.

• Synthesizes ideas from:

– business process analysis.

– task analysis.

– document analysis.

– data analysis.


Encouraging the use of Patterns

·        Patterns are models that are sufficiently general, adaptable, and worthy of imitation that we can use them over and over again.

·        Document exchanges for businesses follow common patterns.

·        Using patterns ensures applications and services are robust but adaptable when technology or business conditions change (as they inevitably will).


·        Business model or organizational patterns: marketplace, auction, supply chain, build to order, drop shipment, vendor managed inventory, etc.

·        Business process patterns: procurement, payment, shipment, reconciliation, etc.

·        Business information patterns: catalog, purchase order, invoice, etc. and the components they contain for party, time, location, measurement, etc.


Sidenote on Standards

·        Standards are common patterns that have sanction and/or traction.

·        Sanction – de jure (ISO/UN/IEC)

·        Traction – de facto (widely used)

·        History tells us traction is more important than sanction (HTML, TCP/IP, MS Word, etc.)

·        Sanction is a means to achieve traction – not a goal in itself!

·        I prefer the general term “pattern”. The rest is marketing and politics.


Patterns Promote Interoperability

·        Interoperability requires all members of a trading community to understand the documents.

·        This is facilitated when their syntax and semantics conform to common patterns.

·        XML has become the preferred syntax for representing information in documents.

·        Now we need to define common patterns for the semantics of business documents using XML syntax.

·        a “universal” business language


The Model Matrix


The "Pattern Compass"


Patterns in Document Engineering

·        The essence of Document Engineering is its systematic approach for discovering and exploiting the relationships between patterns of different types

·        Working from the top down to ensure that a business model is feasible

·        Working from the bottom up to ensure that we are designing and optimizing the activities that add the most value

·        We need models of the desired business processes and the documents that they will produce and consume at the same level of detail and implementability


Meeting in the Middle

·        We need to achieve both business and technical interoperability – the former is necessary but insufficient for the latter

·        We need models of the desired business processes and the documents that they will produce and consume at the same level of detail and implementability

·        This is represented in the Model Matrix as "meeting in the middle"

·        Document Engineering is a systematic approach for "getting to the middle"



The Document Engineering Approach


A Checklist for Describing Projects and Case Studies

·        D -- data types and document types

·        O -- organizational processes

·        C -- context (types of products or services, industry, geography, regulatory considerations)

·        U -- user types and special user requirements

·        M -- models, patterns, or standards that apply

·        E -- enterprises and ecosystems (e.g., trading communities, standards bodies)

·        N -- the needs (business case) driving the enterprise(s)

·        T -- technology constraints and opportunities


D-O-C-U-M-E-N-T in the Document Engineering Approach



Modeling Documents {and,vs,or} Modeling Processes

·        Documents are always the result of some process and often the input to another one

·        This is most evident for transactional documents where patterns of paired document exchange are the building blocks for supply chains, marketplaces, auctions and other business patterns

·        By understanding the information in the documents, we learn what kinds of processes are possible

·        By understanding the processes, we learn what kinds of information are needed


A Process-Centric Depiction



A Document-Centric Depiction


Benefits of a Document-Centric Modeling Approach

·        Documents are more tangible than processes, easier to analyze and communicate

·        SOA emphasizes documents as the public interfaces to private processes

·        David Cohn: 100,000 nouns enable us to understand the meanings of 10,000 verbs


The Equivalence Problem


The Target Model For The Interoperability Scenarios


Validation Does Not Imply Interoperability

·        After all these cases where interoperability may or may not be possible because the conceptual or implementation models differ, we need to talk about the "easy" case and make sure you recognize that it might not be easy

·        Suppose the document validates against the recipient's schema

o       The semantics can still be different in important ways (the ID SSN example) – the strongest level of validation can fall short of establishing that the "same tags" have exactly the "same meaning" to the sender and recipient

o       Furthermore, the recipient may not be able to validate all of the business rules that are important

o       This is a good argument for using industry standards or reference models in your conceptual models, or for using XML vocabularies that represent them in authoritative ways
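The gap between validation and shared meaning can be sketched in a few lines. The slides mention an "ID SSN example" without spelling it out, so the reading below (sender means an employee number, recipient assumes a Social Security number) is an assumption, as are the structural rule and the sample message.

```python
import re
import xml.etree.ElementTree as ET


def is_structurally_valid(xml_text: str) -> bool:
    """Hypothetical recipient rule: a Person holds exactly one nine-digit ID."""
    person = ET.fromstring(xml_text)
    ids = person.findall("ID")
    return (
        person.tag == "Person"
        and len(ids) == 1
        and re.fullmatch(r"\d{9}", ids[0].text or "") is not None
    )


# What each party believes the ID tag means (assumed for illustration).
SENDER_MEANING = {"ID": "employee number"}
RECIPIENT_MEANING = {"ID": "Social Security number"}

message = "<Person><Name>George Tirebiter</Name><ID>123456789</ID></Person>"

valid = is_structurally_valid(message)
same_meaning = SENDER_MEANING["ID"] == RECIPIENT_MEANING["ID"]
print(valid, same_meaning)  # prints "True False"
```

The structural check has no access to the two meaning tables; only an agreed conceptual model or reference vocabulary can close that gap.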




Documents vs. Data

·        Many people have contrasted "documents" and "data" and concluded that documents and data cannot be understood and handled with the same terminology, techniques, and tools.

·        This document vs. data distinction is embedded and reinforced in XML textbooks, technology, and product marketing

·        And it doesn't always help


Mixing Data and Documents


Data or Document?


The Document Type Spectrum


Crossing the Chasm with Document Engineering

·        Document Engineering harmonizes the terminology of documents and data and emphasizes what they have in common rather than highlighting their differences:

·        Identifying the presentational, content, and structural components

·        Eliminating synonymy and homonymy

·        Identifying and organizing the "good" content components

·        Assembling hierarchical document models to organize components to meet requirements for a specific context for information exchange


Three Types of Information In Documents

·        We need a vocabulary to classify different kinds of information that we find in documents and sets of data


·        Content – "what does it mean" information

·        Structure – "where is it" or "how it is organized or assembled" information

·        Presentation – "how does it look" or "how is it displayed" information

·        The amount of and relationships among these three kinds of information vary in different kinds of documents



Presentation Information

·        Human-oriented attributes for visual (or other sensory) differentiation (type font, type size, color, background, indentation, pitch, ...)

·        Good user interface design correlates this with structural or content information

·        May be concealing structural or content information


Presentation View of a Lecture Slide


Analyzing Presentation Components

·        Presentation affects structure and content by applying transformation rules to them

·        To understand the structure and content we must identify and record these rules

·        Some transform rules are explicit

·        Some transform rules are implicit or ambiguous or misleading
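Recording presentation rules can be as simple as a lookup table from formatting to the component it signals. This is a sketch under assumptions: the styles, sizes, and component names are invented, and real documents rarely apply formatting this consistently.

```python
# Hypothetical rules recorded during analysis: (style, point size) -> component.
PRESENTATION_RULES = {
    ("bold", 24): "SlideTitle",
    ("regular", 18): "BulletPoint",
    ("italic", 12): "Footer",
}


def infer_component(style: str, size: int) -> str:
    """Return the component a formatting combination signals, if a rule exists."""
    return PRESENTATION_RULES.get((style, size), "Unclassified")


slide_runs = [
    ("bold", 24, "Analyzing Presentation Components"),
    ("regular", 18, "Some transform rules are explicit"),
    ("regular", 14, "Draft only"),
]

labeled = [(infer_component(style, size), text) for style, size, text in slide_runs]
for component, text in labeled:
    print(f"{component}: {text}")
```

The last run falls through to "Unclassified", which is how an implicit or ambiguous rule shows up in practice: the analyst has to decide what that formatting was meant to convey.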


Structural Information

·        Physical piece of a document (e.g. table, section, title, header, footer)

·        Frequently a close relationship between structural and presentation items, especially in a paper document. This goes some way to explaining why the document-centric school places such strong emphasis on structural components.

·        Embody the rules on how content components fit together, often hierarchical

·        Often driven by context of document use (e.g. overseas address vs. local address)


Structural View of a Lecture Slide


Analyzing Structural Components

·        The structural components provide the hierarchical "skeleton" or "scaffold" into which the content components are arranged

·        Structural components are often identified by the names attached to pieces of information – think of the outline or table of contents or lists of various kinds

·        Metadata to capture

o       Depth of hierarchy

o       Sub-structures included within a structural container

o       Rules for applying numbers or names to content in the hierarchy
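Two of the metadata items listed above, depth of hierarchy and the sub-structures inside each container, can be computed directly once a document is in XML. The sketch below uses an invented `Lecture` document; the element names are assumptions.

```python
import xml.etree.ElementTree as ET


def depth(element: ET.Element) -> int:
    """Depth of the hierarchy rooted at element (a leaf has depth 1)."""
    return 1 + max((depth(child) for child in element), default=0)


def substructures(element: ET.Element) -> dict[str, list[str]]:
    """Map each container tag to the distinct child tags found inside it."""
    result: dict[str, list[str]] = {}
    for node in element.iter():
        seen = None
        for child in node:
            if seen is None:
                seen = result.setdefault(node.tag, [])
            if child.tag not in seen:
                seen.append(child.tag)
    return result


lecture = ET.fromstring(
    "<Lecture><Slide><Title/><List><Item/><Item/></List></Slide>"
    "<Slide><Title/></Slide></Lecture>"
)

print(depth(lecture))
print(substructures(lecture))
```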


Content Components

·        Content components are the "nouns" in our documents or sets of data – things like "topic," "summary," "name," "address," "price"

·        In publications a lot of the content isn't easily identified by "component type" – it may be "just text" that could be playing any of a very large number of roles in the document

·        And sometimes you get no help from the set of style or formatting tags in word processors or in HTML, which are very format or structure oriented and not content oriented at all

·        We need XML so we can invent the vocabulary of tags needed to describe component content in a specific document type


Relationships Among Content Components

Content components can be related to one another

·                 Derivational relationships

·                 Referential relationships
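The slide does not define the two relationship types, so the reading used here is an assumption: a derivational relationship means one component's value is computed from others, and a referential relationship means one component points at another defined elsewhere. The order lines and catalog below are invented.

```python
from decimal import Decimal

order_lines = [
    {"line_id": "1", "item_ref": "B-100", "quantity": 2, "unit_price": Decimal("12.50")},
    {"line_id": "2", "item_ref": "B-200", "quantity": 1, "unit_price": Decimal("30.00")},
]
catalog = {"B-100": "Paperback", "B-200": "Hardcover"}

# Derivational: the order total is computed from quantity and unit price.
order_total = sum(line["quantity"] * line["unit_price"] for line in order_lines)

# Referential: each item_ref must resolve to an entry in the catalog.
unresolved = [line["item_ref"] for line in order_lines
              if line["item_ref"] not in catalog]

print(order_total, unresolved)
```

Recording both kinds of relationship during analysis tells us which components can be dropped from a model (derived ones can be recomputed) and which ones need an identifier.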


Analyzing Content Components

·                 What attributes about each type of content should we record in our analysis?

·                 Names/synonyms/homonyms (what it is called)

·                 Definition (what it "means")

·                 Roles (what it does)

·                 Cardinality/Optionality (occurrence rules)

·                 Restricted values, code sets, defaults

·                 Data Type (text, numbers, date, video)

·                 Relationships/Associations

·                 Origin (Is this new information, or from some other source? Who maintains it?)

·                 Access (who is allowed to view/change/copy/etc. it)

·                 Context (is this information "general purpose" or "core" or is its use limited to specific contexts?)

·                 Permanence (is it static or dynamic? how often does it change?)
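The attributes listed above can be captured in one record per component. The sketch below is one possible layout, not a prescribed one: the field names, defaults, and the `CurrencyCode` example are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ContentComponent:
    """Analysis record for one content component (field names are illustrative)."""
    name: str
    definition: str
    data_type: str
    role: str = ""
    min_occurs: int = 1
    max_occurs: Optional[int] = 1  # None means unbounded
    synonyms: list[str] = field(default_factory=list)
    allowed_values: list[str] = field(default_factory=list)
    default: Optional[str] = None
    origin: str = "new"
    access: str = "unrestricted"
    context: str = "core"
    permanence: str = "static"

    def is_optional(self) -> bool:
        return self.min_occurs == 0

    def accepts(self, value: str) -> bool:
        """A value is acceptable if no code set is defined or it is in the set."""
        return not self.allowed_values or value in self.allowed_values


currency = ContentComponent(
    name="CurrencyCode",
    definition="The currency in which an amount is expressed",
    data_type="text",
    min_occurs=0,
    synonyms=["Currency", "Curr"],
    allowed_values=["AUD", "EUR", "USD"],
    default="USD",
)

print(currency.is_optional(), currency.accepts("EUR"), currency.accepts("XYZ"))
```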


Content View of a Lecture Slide


Harvesting and Consolidation

·                 Harvesting – Create a set of candidate content components by extracting them from the information sources while removing presentation and structure

·                 Consolidation – Identify synonyms and homonyms among the candidate content components, assigning a unique name to each unique meaning as part of a controlled vocabulary


Table of Content Components


Consolidating The Harvest

·                 We can begin our consolidation with the candidate components from any of the information sources, but we recommend using the one you believe is the most authoritative or that yielded the most components

·                 The goal is to combine components that are synonyms (different names for the same meaning) and to distinguish any homonyms (same names for different meanings)

·                 It is desirable for a set of components to enable one and only one way to describe something because duplication or redundancy implies choices that could lead to inconsistent models and non-interoperable schemas
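Finding synonyms and homonyms in a harvest is a matter of grouping candidates two ways: by meaning and by name. The sketch below assumes the analyst has already written a definition for each candidate; the sources, names, and definitions are invented course-related examples.

```python
from collections import defaultdict

# Harvested candidates: (source, name used in that source, analyst's definition).
harvest = [
    ("catalog", "Course Number", "identifier of a course"),
    ("schedule", "Course ID", "identifier of a course"),
    ("catalog", "Title", "name of a course"),
    ("roster", "Title", "honorific of a person"),
]

names_by_meaning = defaultdict(set)
meanings_by_name = defaultdict(set)
for _source, name, meaning in harvest:
    names_by_meaning[meaning].add(name)
    meanings_by_name[name].add(meaning)

# Synonyms: several names for one meaning. Homonyms: one name, several meanings.
synonyms = {m: sorted(n) for m, n in names_by_meaning.items() if len(n) > 1}
homonyms = {n: sorted(m) for n, m in meanings_by_name.items() if len(m) > 1}

# Controlled vocabulary: exactly one name per meaning (names chosen by the analyst).
controlled_vocabulary = {
    "identifier of a course": "CourseIdentifier",
    "name of a course": "CourseTitle",
    "honorific of a person": "PersonTitle",
}

print(synonyms)
print(homonyms)
```

The grouping only works as well as the definitions do, which is why the definition, not the name, is the part of the analysis that carries the semantics.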


Example Consolidation Table -- Courses


Motivating Aggregate Components

·                 Atomic components that hold individual pieces of information

o       Especially in transactional documents, where atomic components have a natural representation as primitive data types ("string," "Boolean," "date") or as datatypes that are derived from these by restriction

·                 Document components that assemble smaller components into the set of information needed to carry out a self-contained purposeful activity

o       Especially in transactional contexts, where documents have a natural correspondence to some unit of work that initiates, records, or responds to a clearly-defined event

·                 Aggregate components are composed of atomic ones and are reused in the assembly of document components

o       They are easier to identify in transactional contexts because they are often the key information that flows from one document to another

o       "Address" or "Person" are obvious examples of aggregates composed of smaller ones

o       Two key questions:

§        How do we select and group atomic components into aggregates?

§        How many aggregates should we create?
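The three levels described above (atomic, aggregate, document) map naturally onto nested record types. This is a sketch only: the class and field names are assumptions, loosely echoing the "Address" and party examples in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Address:
    """Aggregate composed of atomic components."""
    street: str
    city: str
    postal_code: str


@dataclass(frozen=True)
class Party:
    """Aggregate that reuses another aggregate."""
    name: str
    address: Address


@dataclass(frozen=True)
class Order:
    """Document component assembled from aggregates and atomic components."""
    buyer: Party
    seller: Party
    issue_date: str


@dataclass(frozen=True)
class Invoice:
    """A second document component that reuses the same aggregates."""
    buyer: Party
    seller: Party
    issue_date: str


buyer = Party("George Tirebiter", Address("1 Example Rd", "Springfield", "00000"))
seller = Party("Bills Microdevices", Address("Spring St", "Springfield", "00000"))

order = Order(buyer, seller, "2004-05-01")
# The parties flow unchanged from the order into the invoice.
invoice = Invoice(order.buyer, order.seller, "2004-05-08")
```

Because `Party` is defined once, the order and the invoice cannot drift apart in how they describe a buyer, which is the reuse argument for aggregates.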


Identifying Two Kinds of Component Aggregates

·                 Structural Aggregates -- sets of components defined by parent-child or containment relationships

·                 Conceptual Aggregates -- sets of components that "go together" because of logical dependency


Identifying Aggregate Components in Non-Transactional Documents

o       Aggregates are more elusive on the narrative end of the Document Type Spectrum (DTS) because there are limits to the rigor with which components can be grouped

o       "Mixed content" models arise when there are few or weak constraints on where atomic components can appear

o       Presentation often masks the atomic components in potential aggregates

o       Structures are often based more on conventions for organization and presentation than on semantic relationships

o       But there will still generally be components that "go together" to form reusable structures

o       And "going together" means different things for each set of components

o       Aggregates can be created in two "bottom-up" ways that focus on the atomic components:

o       The first is by rebuilding or making explicit the structures that we took apart in document analysis

o       The second is by creating structure in "blobs" of poorly structured information written in an overly narrative style (with mixed content at best)

o       A more modular style for the information will increase its regularity and reusability; it will eliminate content that has little value to users and reinforce its use as "boilerplate" or via links


The Universal Business Language

·        International, royalty-free library of electronic business document patterns.

·        Piloted many of the ideas for Document Engineering.

·        Designed in an open, accountable, vendor-neutral OASIS Technical Committee.

·        Fills the “payload” slot in B2B web services frameworks


UBL is a business vocabulary for XML

HTTP           +  HTML  =  Web Publishing

ebXML/WS +  UBL      =  Web Commerce


UBL: the “Fifth Generation” B2B language

• UBL represents over six years of continuous development in the creation of a standard XML business syntax.

• G1 (1Q 1998): CBL 1.0 (Veo/NIST)

• G2 (2Q 1999): CBL 2.0 (Commerce One) – plus over 20 years of EDI standards development

• G3 (4Q 2000): xCBL 3.0 (Commerce One and SAP)

• G4 (1Q 2003): UBL 0.7 (OASIS)

• G5 (4Q 2004): UBL 1.0 (OASIS)


Example of a UBL Conceptual Model


Example of a UBL schema

<xsd:element name="Order" type="OrderType"/>
<xsd:complexType name="OrderType">
  <xsd:sequence>
    <xsd:element name="BuyersID" type="udt:IdentifierType" minOccurs="0"/>
    <xsd:element name="SellersID" type="udt:IdentifierType" minOccurs="0"/>
    <xsd:element ref="cbc:IssueDate" minOccurs="1" maxOccurs="1"/>
    <xsd:element ref="cbc:Note" minOccurs="0" maxOccurs="1"/>
    <xsd:element ref="EarliestDate" minOccurs="0" maxOccurs="1"/>
    <xsd:element ref="cbc:ExpiryDate" minOccurs="0" maxOccurs="1"/>
    <xsd:element ref="ValidityDurationMeasure" minOccurs="0" maxOccurs="1"/>
    <xsd:element ref="cbc:TaxTotalAmount" minOccurs="0" maxOccurs="1"/>
    <xsd:element ref="cbc:LineExtensionTotalAmount" minOccurs="0" maxOccurs="1"/>
    <xsd:element ref="TotalPackagesQuantity" minOccurs="0" maxOccurs="1"/>
    <xsd:element ref="cac:BuyerParty" minOccurs="1" maxOccurs="1"/>
    <xsd:element ref="cac:SellerParty" minOccurs="1" maxOccurs="1"/>
    <xsd:element ref="OriginatorParty" minOccurs="0" maxOccurs="1"/>
    <xsd:element ref="FreightForwarderParty" minOccurs="0" maxOccurs="1"/>
    <xsd:element ref="cac:Delivery" minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element ref="cac:DeliveryTerms" minOccurs="0" maxOccurs="1"/>
    <xsd:element ref="cac:SalesConditions" minOccurs="0" maxOccurs="1"/>
    <xsd:element ref="DestinationCountry" minOccurs="0" maxOccurs="1"/>
    <xsd:element ref="cac:OrderLine" minOccurs="1" maxOccurs="unbounded"/>
    <xsd:element ref="cac:PaymentMeans" minOccurs="0" maxOccurs="1"/>
  </xsd:sequence>
</xsd:complexType>
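The occurrence rules in the schema fragment can be read as a checklist. The sketch below checks a handful of them by hand; it is not XSD validation (it ignores namespaces, element order, and data types), and the rule table and sample documents are assumptions built from a few lines of the fragment above.

```python
import xml.etree.ElementTree as ET

# (element name, minOccurs, maxOccurs); None stands for "unbounded".
RULES = [
    ("IssueDate", 1, 1),
    ("Note", 0, 1),
    ("BuyerParty", 1, 1),
    ("SellerParty", 1, 1),
    ("OrderLine", 1, None),
]


def cardinality_errors(order: ET.Element) -> list[str]:
    """List every rule in RULES that the order's direct children violate."""
    errors = []
    for name, min_occurs, max_occurs in RULES:
        count = len(order.findall(name))
        if count < min_occurs:
            errors.append(f"{name}: expected at least {min_occurs}, found {count}")
        elif max_occurs is not None and count > max_occurs:
            errors.append(f"{name}: expected at most {max_occurs}, found {count}")
    return errors


good = ET.fromstring(
    "<Order><IssueDate>2004-05-01</IssueDate><BuyerParty/><SellerParty/>"
    "<OrderLine/><OrderLine/></Order>"
)
bad = ET.fromstring("<Order><BuyerParty/><SellerParty/></Order>")

print(cardinality_errors(good))
print(cardinality_errors(bad))
```

A real implementation would hand the document and the published schemas to a schema-aware validator; the point here is only that `minOccurs` and `maxOccurs` encode the cardinality and optionality recorded during content analysis.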






Example of a UBL Document



<cbc:LineExtensionTotalAmount amountCurrencyCodeListVersionID="0.3"





<cbc:Name>Bills Microdevices</cbc:Name>



<cbc:StreetName>Spring St</cbc:StreetName>







<cbc:Name>George Tirebiter</cbc:Name>





Example of a UBL Implementation


UBL 1.0

·        Released May 1st 2004

·        Basic Procurement Business Process Model

·        Order to Invoice

·        XML Schemas (W3C XSD)

·        Order to Invoice Documents

·        Re-usable Common Component Library

·        XML (XSD) Naming and Design Rules

·        Guidelines for schema customization.

·        Pattern library of conceptual models.

·        Forms, presentations, and sample documents

·        Download at: http://www.oasis-open.org/committees/ubl


Recent UBL Developments

• UBL International Data Dictionary:

– 600 elements translated into Chinese (Simplified and Traditional), Japanese, Korean, and Spanish.

• UBL Naming and Design Rules (NDR)

– adopted by chemical industry (CIDX), petroleum (PIDX), agriculture (RAPID), real estate (OSCRE/PISCES), U.S. Department of the Navy (DON), U.S. Taxation (IRS).

• UBL Invoice used by the Danish Govt.

– February to April 2005, more than one million invoices exchanged.

– Estimated savings 94 million Euro annually.

• UBL Invoice used by the Swedish Govt.

– Announced September 2005.

• Small Business Subset

–       Simple implementation guide for SMEs.


Work Plan for UBL 2.0

• Extended library.

–       Extended Procurement Process (Europe).

–       Transportation Process Documents (Asia).

–       Electronic Catalogue process (Europe).

• Improved library.

–       Improve architecture.

–       Better document engineering.

• Aligning with UN/CEFACT projects.

–       Core Component Type library.


• Release early 2006.



• The basic idea of document exchange has changed very little.

• Using documents as interfaces allows for loosely coupled business processes.

• Some new business models are only document exchanges.

• The real challenge is understanding what the content of documents means.

• Interoperability requires that parties can exchange information and use the information they exchange.

• The best way to support interoperability is to engineer equivalent conceptual models.

• Document Engineering comprises a set of analysis and design techniques that yield meaningful models of document exchanges.

• Document exchanges for businesses follow common patterns.

• Interoperability requires all members of a trading community to understand common patterns of document syntax and semantics.

• The Universal Business Language is an international, royalty-free library of electronic business document patterns.