SE735 - Data and Document Representation & Processing

Lecture 2 - Introduction to Document Engineering


The D-I-K-W Pyramid

Defining data

     Data has no meaning or value because it is without context and interpretation.

     Data are discrete, objective facts or observations, which are unorganized and unprocessed, and do not convey any specific meaning.

     Data items are an elementary and recorded description of things, events, activities and transactions


Defining information

     data that have been processed so that they are meaningful

     data that have been processed for a purpose

     data that have been interpreted and understood by the recipient.


Defining knowledge

     Knowledge is the combination of data and information, to which is added expert opinion, skills, and experience, to result in a valuable asset which can be used to aid decision making

     Knowledge is data and/or information that have been organized and processed to convey understanding, experience, accumulated learning, and expertise as they apply to a current problem or activity


Defining wisdom

     Wisdom is the highest level of abstraction, with vision, foresight, and the ability to see beyond the horizon

     Wisdom is the ability to act critically or practically in any given situation. It is based on ethical judgement related to an individual’s belief system


The wisdom hierarchy mapping to types of information systems:


Three principal uses of the word "information":

1.   Information-as-process: When someone is informed, what they know is changed.

·         In this sense "information" is "The act of informing...; communication of the knowledge or 'news' of some fact or occurrence; the action of telling or fact of being told of something." (Oxford English Dictionary, 1989, vol. 7, p. 944).

2.   Information-as-knowledge: "Information" is also used to denote that which is perceived in "information-as-process": the "knowledge communicated concerning some particular fact, subject, or event; that of which one is apprised or told; intelligence, news." (Oxford English Dictionary, 1989, vol. 7, p. 944).

·         The notion of information as that which reduces uncertainty could be viewed as a special case of "information-as-knowledge". Sometimes information increases uncertainty.

3.   Information-as-thing: The term "information" is also used attributively for objects, such as data and documents, that are referred to as "information" because they are regarded as being informative, as "having the quality of imparting knowledge or communicating information; instructive." (Oxford English Dictionary, 1989, vol. 7, p. 946).

Key characteristic of "information-as-knowledge" – intangibility

·         Cannot touch it or measure it in any direct way

·         Knowledge, belief, and opinion are personal, subjective, and conceptual

·         To communicate them, they have to be expressed, described, or represented in some physical way, as a signal, text, or communication.

·         Any such expression, description, or representation would be "information-as-thing".

Why examine the notion of "information-as-thing"?:

(1)        Clarify its meaning in relation to other uses of the term "information;"

(2)        Affirm the fundamental role of "information-as-thing" in information systems; and

(3)        Speculate on possible use of the notion of "information-as-thing" in bringing theoretical order to the heterogeneous, ill-ordered fields associated with "information science."

Two distinctions:

(1) Between entities and processes.

(2) Between intangibles and tangibles.

These two distinctions yield four different aspects of information and information systems:



(1)        All information systems deal directly with "information-as-thing"

(2)        Information storage and retrieval systems can deal directly only with "information-as-thing"

(3)        Representations of knowledge form a distinguishable subset of information-as-thing and so could, in principle, be used to identify and define another class of information systems in which the primary concern is based on the knowledge represented

(4)        Information-as-process could be the basis for defining a class of information-related studies


What is a Document?

Paul Otlet (1934): Objects as Documents:

Graphic and written records are representations of ideas or of objects, he wrote, but the objects themselves can be regarded as "documents" if you are informed by observation of them.

Walter Schurmeyer (1935):

Nowadays one understands as a document any material basis for extending our knowledge which is available for study or comparison

International Institute for Intellectual Cooperation (1937):

Document: Any source of information, in material form, capable of being used for reference or study or as an authority. Examples: manuscripts, printed matter, illustrations, diagrams, museum specimens, etc.

Suzanne Briet (1951):

A document is evidence in support of a fact.

A document is "any physical or symbolic sign, preserved or recorded, intended to represent, to reconstruct, or to demonstrate a physical or conceptual phenomenon"

Inferring from her discussion that:

(1) There is materiality: Physical objects and physical signs only;

(2) There is intentionality: It is intended that the object be treated as evidence

(3) The objects have to be processed: They have to be made into documents; and, we think,

(4) There is a phenomenological position: The object is perceived to be a document.


Donker Duyvis (1942):

A document is the repository of an expressed thought. Consequently its contents have a spiritual character. The danger that blunt unification of the outer form exercises a repercussion on the contents in making the latter characterless and impersonal, is not illusory. . . . In standardizing the form and layout of documents it is necessary to restrict this activity to that which does not affect the spiritual contents and which serves to remove a really irrational variety.

Levy (1994):

A conventional document, such as a mail message or a technical report, exists physically in digital technology as a string of bits, but so does everything else in a digital environment. In this sense, any distinctiveness of a document as a physical form is further diminished, and discussion of "What is a digital document?" becomes even more problematic unless we remember the path of reasoning underlying the largely forgotten discussions of Otlet’s objects and Briet’s antelope.


Introduction to Document Engineering

We need Document Engineering to help us specify, design, and implement the documents that are the inputs and outputs of business services.

Document Engineering synthesizes complementary ideas from information and systems analysis, electronic publishing, business process analysis, and business informatics. Its unifying document-centric perspective helps us conceive and understand the new network-based business models made possible by the Internet and supporting technologies.

The essence of Document Engineering is the analysis and design methods that yield

                    Precise specifications or models for the information that business processes require

                    Rules by which related processes are coordinated, whether between different firms to create composite services or virtual enterprises or within a firm to streamline information flow between organizations.

Document Engineering provides the concepts and methods needed to align business strategy and information technology, to bridge the gap between what we want to do and how to do it. Describing business processes in terms of the more abstract notion of document exchanges makes it easier to understand the constraints imposed by legacy systems and technologies and to recognize the opportunities created by new ones if we focus on conceptual models of the exchanges rather than on how they are implemented. The expressiveness of XML for implementation models bridges the traditional gap between business strategy and its technology realization.

A Simple Business Transaction?

The "Virtual Bookstore"


1.   Customer selects book from catalog on an online bookstore

2.   Customer pays with credit card

3.   Book arrives via express shipper two days later


·         From the customer's perspective there is only one "transaction"

·         But the bookstore is a "virtual enterprise" or "composite application" that follows the drop shipment pattern to coordinate the activities of 4 different service providers transacting with each other

·         This coordination - or choreography - is carried out with document exchanges


Two Ways to Analyze the Virtual Bookstore

1.   In terms of the "drop shipment" business pattern

2.   In terms of the overlapping semantic components that "glue" the business processes together


The Drop Shipment Pattern

Drop shipping is a supply chain management technique in which the retailer does not keep goods in stock, but instead transfers customer orders and shipment details to either the manufacturer or a wholesaler, who then ships the goods directly to the customer.


The virtual bookstore is a composite service in which at least three separate transactions or exchanges of information occur:

1.   Your interaction with the catalog to select the book you want to order.

2.   A document exchange between the bookstore and a credit authority (a bank or an authorization network like VISA or MasterCard) to verify your creditworthiness and charge your account.

3.   A document exchange between the bookstore and the delivery service with the instructions for picking up and delivering your book.


This pattern is typical of many Internet retailers and is completely invisible to you as the customer.
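The exchanges listed above can be sketched in a few lines of Python. This is only an illustrative sketch: the class names, field names, and the `drop_ship` function are hypothetical and do not come from any real bookstore system. It shows how a single customer order is split into the documents sent to the credit authority and to the delivery service, with the shared `order_id` acting as one of the overlapping components that "glue" the exchanges together.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Order:
    """Hypothetical order document captured from the catalog interaction."""
    order_id: str
    customer: str
    address: str
    card_number: str
    item_id: str
    price: float


@dataclass
class PaymentRequest:
    """Hypothetical document sent to the credit authority."""
    order_id: str
    card_number: str
    amount: float


@dataclass
class ShippingInstruction:
    """Hypothetical document sent to the delivery service."""
    order_id: str
    item_id: str
    deliver_to: str


def drop_ship(order: Order) -> Tuple[PaymentRequest, ShippingInstruction]:
    """Derive the two outgoing documents from one incoming order."""
    payment = PaymentRequest(order.order_id, order.card_number, order.price)
    shipping = ShippingInstruction(order.order_id, order.item_id, order.address)
    return payment, shipping


# Placeholder values only; none of this is real customer data.
order = Order("A-1", "Pat Reader", "12 Elm St", "CARD-PLACEHOLDER", "BOOK-42", 25.0)
payment, shipping = drop_ship(order)
print(payment)
print(shipping)
```

Each outgoing document carries only the components its recipient needs: the delivery service never sees the card number, and the credit authority never sees the delivery address.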

Example: Patterns

·         Amazon started out as a drop shipper

·         It was forced to rapidly expand its own warehousing and distribution capabilities after Barnes and Noble moved to buy Ingram Books as a competitive response to Amazon

·         Amazon had to build a configurable "merchant platform" to handle its growth in product categories, transaction volume, and associated services

·         In 2003 Amazon began to use this platform to host various combinations of these services for other businesses



Amazon Web Store


Fulfillment by Amazon



Amazon Drop Shipment


The Extended or Virtual Enterprise

The Virtual Enterprise

·         Coordination by the exchange of information between the retailer and book distributors, shippers, and credit authorities.

·         The retailer doesn’t need its own books and delivery trucks—it can replace inventory and equipment with information

·         Coordination is usually invisible from the customer’s perspective

·         It requires a complex and carefully managed series of document exchanges (often called a document choreography) over a period that may range from hours to weeks

It’s All About Exchanging Documents


The Document Type Spectrum


A Narrative Example


A Transactional Example


A Hybrid Example: Catalog = Data (Document)


Systematic Variation across the Spectrum

§  Homogeneity

§  Prescriptiveness

§  Variety of data types

§  Precision of data types

§  Correlation between presentation, structure, and content


Processes that Involve Both Data and Documents


Document Exchange as a Building Block in Business Model Patterns

·         Doing business by document exchange is natural and intuitive.

·         Businesses use documents to organize their interactions with each other and to package the information needed to carry out a transaction or other meaningful unit of business

·         Exchanges of documents that take place to carry out business models follow common patterns

o   e.g. Supply chains, business-to-business marketplaces, auctions, information brokers or aggregators, and content syndication networks

·         e.g. Order Management Cycle

1.     Order Planning

2.     Order Generation

3.     Cost estimation and pricing

4.     Order receipt and entry

5.     Order selection and prioritization

6.     Scheduling

7.     Fulfillment

·         Procurement

·         Manufacturing

·         Assembling

·         Testing

·         Shipping

·         Installation

8.     Billing

9.     Returns and Claims

10.    Post-sales Services


o   Complete cycle involves documents that flow between sales, engineering, finance, logistics, customer service, and other divisions within the organization.

o   The specific documents and divisions vary in different contexts, but the general pattern is ubiquitous.



o   The alignment of business models and technology is easier to achieve when an organization systematically structures its business capabilities as self-contained resources or processes so they can efficiently interact and recombine to meet changing business requirements.

o   Using standard documents as interfaces for business processes is a natural outcome of organizing business functions as more discrete and flexible components.


Business Model Example: Business reference model

o   A business reference model is a reference model that concentrates on the functional and organizational aspects of the core business of an enterprise, service organization, or government agency.

o   A business reference model is a means to describe the business operations of an organization, independent of the organizational structures that perform them.

o   The most familiar business reference model is the "Business Reference Model", one of five reference models of the Federal Enterprise Architecture of the US Federal Government.

o   It is a function-driven framework for describing the business operations of the Federal Government independent of the agencies that perform them.



Document Exchange as Loose Coupling

o   In computing and systems design a loosely coupled system is one in which each of its components has, or makes use of, little or no knowledge of the definitions of other separate components.

o   Two business organizations must agree on what the documents mean and on the business processes they expect each other to carry out with them

o   They don’t need to agree on or even know anything about the technology they use to create and process the documents.

o   The documents are the only visible interfaces to their respective business processes.
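A minimal sketch of this kind of loose coupling, using only the Python standard library. The `Invoice` vocabulary and both function names are assumptions made for illustration. The seller builds the document with `xml.etree.ElementTree`, while the buyer reads it with `xml.dom.minidom`; the two sides share nothing except the agreed document.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom


def seller_create_invoice(number: str, total: str) -> str:
    """Seller side: serialize an invoice using ElementTree."""
    invoice = ET.Element("Invoice")
    ET.SubElement(invoice, "Number").text = number
    ET.SubElement(invoice, "Total", currency="USD").text = total
    return ET.tostring(invoice, encoding="unicode")


def buyer_read_invoice(document: str) -> dict:
    """Buyer side: read the same document with a different library (minidom)."""
    dom = minidom.parseString(document)
    total = dom.getElementsByTagName("Total")[0]
    return {
        "number": dom.getElementsByTagName("Number")[0].firstChild.data,
        "total": float(total.firstChild.data),
        "currency": total.getAttribute("currency"),
    }


# The XML string is the only thing that crosses the boundary between the parties.
message = seller_create_invoice("INV-7", "25.00")
invoice = buyer_read_invoice(message)
print(message)
print(invoice)
```

Either side could replace its implementation with a different library, language, or platform without the other noticing, as long as the document and its agreed meaning stay the same.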


Understanding the Meaning of Documents

o   Required:  a complete view of how information is defined and used in different business contexts—what is often called the information model—a formal representation of the structure and semantics of information.

o   There are at least two sides to every document exchange, and all parties need to ensure that they understand the documents in the same way.

o   Solution 1:  every enterprise adopts a common data model and uses exactly the same definitions for the document components of their applications.

o   But that’s inconceivable; enterprises, applications, and people just don’t stand still long enough to make it possible.

o   Solution 2:  two parties each create conceptual information models to help them translate or transform the documents they receive so their applications can understand them

o   e.g. US Government Data and Information Reference Model:

o   The Data Reference Model (DRM) is one of the five reference models of the Federal Enterprise Architecture (FEA).

o   The DRM is a framework whose primary purpose is to enable information sharing and reuse across the United States federal government via the standard description and discovery of common data and the promotion of uniform data management practices

o   DRM structure

The DRM provides a standard means by which data may be described, categorized, and shared. These are reflected within each of the DRM’s three standardization areas:

§  Data Description: Provides a means to uniformly describe data, thereby supporting its discovery and sharing.

§  Data Context: Facilitates discovery of data through an approach to the categorization of data according to taxonomies. Additionally, enables the definition of authoritative data assets within a community of interest (COI).

§  Data Sharing: Supports the access and exchange of data where access consists of ad-hoc requests (such as a query of a data asset), and exchange consists of fixed, re-occurring transactions between parties.

In this scenario:

1. Organization B determines that it has the need for a particular set of data that might be available from Organization A.

2. Organization A uses the DRM to categorize its data (using the Business Reference Model) into a business context.

3. Organization B identifies Organization A’s available data through its business context.

4. Organization A uses the DRM to publish the detailed structure of the actual data element (in support of the business context).

5. Organizations A and B determine if, in fact, the data produced by Organization A will meet the needs of Organization B.

6. Once Organizations A and B determine that the data can be re-used, the information exchange package is used to transmit the data.
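The translation approach described under Solution 2, and the final exchange step of the scenario, could look roughly like the following sketch. The element names, the field names, and the `A_TO_B` mapping are all hypothetical; a real mapping would be derived from the two organizations' conceptual information models.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from Organization A's element names
# to the field names used in Organization B's own model.
A_TO_B = {
    "FamilyName": "last_name",
    "GivenName": "first_name",
    "DOB": "birth_date",
}


def translate(document: str, mapping: dict) -> dict:
    """Turn A's XML document into a record in B's vocabulary."""
    root = ET.fromstring(document)
    record = {}
    for child in root:
        # Elements that B has no mapping for are ignored in this sketch.
        if child.tag in mapping:
            record[mapping[child.tag]] = (child.text or "").strip()
    return record


# Placeholder exchange package; the values are invented for illustration.
package = (
    "<Person>"
    "<FamilyName>Doe</FamilyName>"
    "<GivenName>Jane</GivenName>"
    "<DOB>1990-01-01</DOB>"
    "<InternalCode>X9</InternalCode>"
    "</Person>"
)
record = translate(package, A_TO_B)
print(record)
```

Neither organization has to change its internal data model; only the mapping has to be maintained when one side's documents change.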


XML as an Enabling Technology

o   To exchange documents, computers or business applications require a precise and unambiguous language for describing information models.

o   Since its emergence in the late 1990s, XML—the Extensible Markup Language—has become the preferred format for representing information in documents both on and off the Internet.

o   Web publishers view XML as an improvement on HTML that enables greater automation and consistency in formatting.

o   Programmers see XML as an Internet-friendly, easy-to-parse, and nonproprietary data format that they can use instead of ad hoc languages for application configuration and interprocess communication.

o   Electronic data interchange (EDI) developers see XML as a more expressive, maintainable, and therefore lower cost syntax for creating business messages.

o   XML’s broad impact in publishing, programming, and EDI has made it a unifying technology for implementing applications that use Internet protocols, especially for those that span enterprise boundaries, such as web services.


Using XML-Encoded Models to Design and Drive Applications

o   Using XML to encode implementation models yields an overall rigor, reusability, and programmability unmatched by other representations

o   XML’s facility for document encoding is an excellent match for the document exchange architecture of the Internet

o   Programming paradigms are emerging in which XML schemas, programming language objects, database schemas, and UML models can be treated as equivalent because XML schemas can be used to generate any of the other formats if required
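As a rough illustration of a model driving code, the sketch below generates a Python class from an XML-encoded model. The `<model>` and `<field>` vocabulary is a made-up stand-in and is not XML Schema; the standard library's `dataclasses.make_dataclass` does the generation.

```python
import xml.etree.ElementTree as ET
from dataclasses import asdict, make_dataclass

# Hypothetical, deliberately tiny model vocabulary (not XML Schema).
MODEL = """
<model name="BookOrder">
  <field name="item_id" type="str"/>
  <field name="quantity" type="int"/>
</model>
"""

# Type names allowed in the hypothetical model, mapped to Python types.
TYPES = {"str": str, "int": int}


def class_from_model(model_xml: str):
    """Generate a dataclass whose fields mirror the XML-encoded model."""
    root = ET.fromstring(model_xml.strip())
    fields = [
        (field.get("name"), TYPES[field.get("type")])
        for field in root.findall("field")
    ]
    return make_dataclass(root.get("name"), fields)


BookOrder = class_from_model(MODEL)
order = BookOrder(item_id="BOOK-42", quantity=2)
print(BookOrder.__name__, asdict(order))
```

Changing the model text changes the generated class, so in this sketch the model, rather than hand-written code, is the single point of definition.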


Document Models as Interfaces

o   Simplest example: a document displayed in a browser that users interact with in ways that are determined by the document’s model

o   Narrative end of the Document Type Spectrum: E-books or other structured publications in which user interface features, including tables of contents, hypertext links, and navigation aids, are generated from the names or attributes of the information components in the document.

o   Transactional end of the Document Type Spectrum: E-forms in which applications collect the information specified in the document’s model to automate processes that previously have relied on printed forms
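For the narrative end of the spectrum, a navigation aid can be derived directly from the document's components. The following sketch assumes a hypothetical `book`/`chapter`/`title` structure and builds a table of contents from element names and `id` attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured publication; element names are assumptions.
BOOK = """<book>
  <chapter id="c1"><title>Data</title></chapter>
  <chapter id="c2"><title>Documents</title></chapter>
</book>"""


def table_of_contents(book_xml: str) -> list:
    """Build one table-of-contents entry per chapter component."""
    root = ET.fromstring(book_xml)
    return [
        f"{number}. {chapter.findtext('title')} (#{chapter.get('id')})"
        for number, chapter in enumerate(root.findall("chapter"), start=1)
    ]


toc = table_of_contents(BOOK)
for entry in toc:
    print(entry)
```

Nothing in the function is specific to this book: any document that follows the same model gets the same interface feature for free.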


e.g. Virtual Enterprise Example


Web Services and Service Architectures

o   Using documents as interfaces underlies the idea of service-oriented architectures (SOA) as a way to create new applications or systems such as web services by integrating or combining components of other ones

o   Web service: “an interface that describes a collection of operations that are network accessible through standardized XML messaging”

o   Implies that any self-contained application functionality or information resource is turned into a service by packaging it so that it exposes only input and output XML documents.

o   At the abstraction level introduced by document exchange, a service can be:

o   Anything that can send or receive a document

o   Anything that can accept a document, process it and return a result

o   Anything that can accept a document and allow the user to act on it

o   Anything that can accept a document and forward it to some other application or destination

o   Anything that can generate a document as a result of user interaction, processing a received document, or some other event.



o   Because web services are loosely coupled and hide implementations, document interfaces allow firms to maintain a clean and stable relationship to partners and customers.

o   By using independent components, web services also make it easier and cheaper to adopt new technologies incrementally without affecting any existing business functionality.

o   Because document interfaces can be implemented in any technology, platform compatibility concerns are lessened. (e.g. cloud services)
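One way to picture a service that "accepts a document, processes it, and returns a result" is as a function from one XML document to another. The sketch below is an assumption-laden illustration: the `QuoteRequest` and `Quote` vocabularies and the `price_service` name are invented, and no network transport is shown.

```python
import xml.etree.ElementTree as ET


def price_service(request_xml: str) -> str:
    """Accept a request document, process it, and return a result document."""
    request = ET.fromstring(request_xml)
    quantity = int(request.findtext("Quantity"))
    unit_price = float(request.findtext("UnitPrice"))

    response = ET.Element("Quote")
    ET.SubElement(response, "Total").text = f"{quantity * unit_price:.2f}"
    return ET.tostring(response, encoding="unicode")


# The caller sees only the input and output documents, never the implementation.
request_document = (
    "<QuoteRequest><Quantity>3</Quantity><UnitPrice>9.50</UnitPrice></QuoteRequest>"
)
reply = price_service(request_document)
print(reply)
```

Because the caller depends only on the two document formats, the body of `price_service` could be rewritten or moved to another platform without affecting it.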


Document Specifications and Document Engineering


The Data/Document Chasm

§  It's obviously a continuum... but transactional and narrative documents have traditionally been analyzed with different disciplines and use different tools, terminology, and techniques – little intersection

§  Very different intellectual and domain roots

§  "Document analysis" for narrative documents: literary criticism, graphical design; electronic publishing

§  "Data modeling" for transactional documents: philosophy, linguistics, systems analysis; data automation


Crossing the Data/Document Chasm

§  Document Engineering harmonizes the terminology and emphasizes what they have in common rather than highlighting their differences

§  Identifying the presentational, content, and structural components and defining their relationships to each other

§  Identifying "good" content components

§  Designing, describing, and organizing components to facilitate their reuse

§  Assembling hierarchical document models that organize components according to the requirements of a specific context for information exchange


Data/Document Driven {and,or,vs} Business Process Analysis

§  The document analysis and data modeling approaches focus from the beginning on the structure and content of the "document payload" that will be exchanged – a "bottom up" approach that emphasizes "Does this work from a technical perspective?"

§  In contrast, business process analysis begins with an abstract or broadly scoped perspective on business activities

§  Emphasizes "Does this work from a business perspective?"

§  Inherently a "top down" approach that starts with business models and processes and gets to the "document payloads" only at the end


{and,or,vs} User Task Analysis

§  Task analysis (or user analysis) is the observation of people performing the tasks or use cases when the application or system must support human interfaces and not just other applications

§  Task analysis and document analysis are closely related; document analysis reveals candidate information components and task analysis reveals rules about their intent and usage.

§  Task analysis is especially important when few documents or information sources exist because human problems or errors can suggest that important information is missing


The Document Engineering Approach

Document Engineering organizes its modeling approach into 8 phases:



First phase - Analyzing the Context of Use:

Business and task analysis techniques establish the context for a Document Engineering effort by identifying the requirements and rules that must be satisfied to provide an acceptable solution.

Phases 2 & 3 - Analyzing Business Processes and Applying Patterns:

Business process analysis is applied to identify the requirements for the document exchange patterns needed to carry out the desired processes, collaborations, and transactions in the context of use.

Phase 4 —Document Analysis:

Involves identifying a representative set of documents or information sources (including people) and analyzing them to harvest all the meaningful information components and business rules

Phase 5 — Component Assembly :

A document component model is developed that represents structures and their associations and content that define the common rules for the possible contexts of use.

Phase 6 — Document Assembly :

The document component model is used to create document assembly models for each type of document required.
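The relationship between the component model and the document assembly models can be sketched as follows. The component names, their contents, and the two document types are hypothetical examples chosen for illustration, not the output of an actual analysis.

```python
# Hypothetical document component model: reusable components and their content.
COMPONENTS = {
    "Party": ["Name", "Address"],
    "Item": ["Identifier", "Description", "Quantity"],
    "Payment": ["Amount", "Currency"],
}

# Hypothetical assembly models: which components each document type uses.
DOCUMENT_TYPES = {
    "Order": ["Party", "Item"],
    "Invoice": ["Party", "Item", "Payment"],
}


def assemble(document_type: str) -> dict:
    """Build the structure of one document type from shared components."""
    return {name: COMPONENTS[name] for name in DOCUMENT_TYPES[document_type]}


order_model = assemble("Order")
invoice_model = assemble("Invoice")
print(order_model)
print(invoice_model)
```

Both document types reuse the same `Party` and `Item` definitions, so a change to a component is picked up by every document model that includes it.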

Phase 7 — Implementation :

Conceptual models are first encoded using a suitable language to support their physical implementation


Document Engineering in the News


"Tailing Virulent Veggies" (WSJ, 3/13/07)

Dole Food Co., the world's largest producer of fresh vegetables, recently started using radio-frequency identification tags to track leafy greens as they move from fields to trucks and through processing facilities

Scanning RFID Tags on Vegetable Boxes:



"SPEEDy Airline Self-Service" (Hong Kong News, 1/07)



"More Clicks at the Bricks" (Business Week, 12/6/07)

The Web and other information and communications technologies have substantially affected consumer behavior

This story analyzes some of these new patterns of information use (between people and information systems/resources, and between people)

Retail stores are offering new kinds of information services


"Landstuhl to use Electronic Health Records" (Army Times, 1/16/08)


With the new system, health care workers will create PDFs (portable document formats) for inpatient records, allowing them to be sent to the next hospital electronically.

Within a year, CliniComp officials hope to have all PDF records within its system at Landstuhl easily available to the Veterans Affairs Department as service members move from military to civilian status.

This has been a problem in the past; the Defense Department and VA medical records systems are not compatible.

CliniComp does not have the capability to scan and create PDFs for old medical records, so it only applies to records being created now.


"Big Employers Plan EHRs" (WSJ, 11/29/06)


"Insurers Lack Governance" (Gartner, 1/9/08)


What Are the Common Themes in These News Items?

Enormous amounts of existing (paper) documents and legacy processes would benefit from automation, process re-engineering, transformation to SOA

New business processes are created / coordinated / choreographed via the management and exchange of electronic documents

Standards / patterns for documents and business processes are essential

Information technology and business processes are co-evolving with many ways to create business value

But projects can be challenging, and their success depends on many factors besides technology