SE735 - Data and Document Representation & Processing

Lecture 7 - The Document Engineering Approach


An Approach, Not a Methodology

Document Engineering is a coherent and pragmatic approach to modeling documents and services that yields practical and effective solutions.


Three key factors shape the concepts and methods of the Document Engineering Approach:

1.   End-to-end scope.

a.   Must be able to:

i.   describe the information content and processes in a document exchange

ii.  identify the context of use and its relevant requirements and constraints

iii. analyze and design a solution

iv.  implement and deploy that solution.

b.   Must expect that the requirements and constraints will change, so our solution must be evolvable.

i.   Pointless to develop a solution that can’t be adapted to changing environments, no matter how theoretically elegant or powerful it might be.

2.   The breadth of documents that must be analyzed, designed, and implemented.

a.   Document Type Spectrum that spans from narrative, publication-style documents to transactional, data-intensive ones.

b.   These contrasting types of documents have traditionally been analyzed and designed using substantially different approaches; Document Engineering unifies those approaches by emphasizing what they have in common.

3.   The requirement that document exchanges must be implementable in a loosely coupled, technology-independent manner.

a.   A fundamental principle of distributed and service-oriented architectures is that the relationships between organizations or service providers must be adaptable and flexible because only the document interfaces are visible.

b.   It is neither necessary nor desirable for each party to know anything about the implementation on the other side of the exchange.


Modeling Methodologies

·                    There is no single correct way to create document and process models.

·                    This approach embodies the principle of loose coupling through document exchanges.

·                    Once the model is defined and communicated, the process taken to develop it is invisible.


·                    Every modeling methodology proposes a set of modeling activities.


·        They may differ in the order in which the activities are carried out, or in how prescriptive they are about the activities and the descriptions of their results.

·        How the models are described reflects the metamodel adopted by the methodology.

·        Metamodels define the kinds of information that models contain.

·        Common metamodels also provide a useful basis for libraries of reusable patterns because the models they contain can be interpreted by anyone or any application that understands the metamodel.



The Document Engineering Approach

·       Document Engineering relies on the skills and tools of business process, document, data, and task analysts.

·       Document Engineering exploits different analysis techniques for reaching the same goal.

·       The goal is to develop models that are abstract enough to be reused as patterns but concrete enough to be implemented.

Analysis Perspectives Used in Document Engineering

·        Business analysis - starts with abstract views of business models and processes

o   This high-level analysis establishes the context for understanding the semantics of the information in the other sections of the matrix.

·        Task analysis (or user analysis) - the observation of people performing the tasks or use cases when the application or system must support human interfaces and not just other applications.

o   Task analysis identifies the specific steps and information that people need to carry out a task, so it is based on actual artifacts and activities, which are represented on the right side of the matrix.

o   Task analysis reveals rules about their intent and usage.

o   Task analysis is important when few documents or information sources exist.

·        Document analysis - starts from analysis of document instances.

o   These techniques extract or disentangle the presentational, structural, and content components of documents or other information sources.

·        Data analysis (or object analysis) - often starts from a conceptual perspective about a domain and yields an abstract view of the information components revealed by document analysis.


Artifact-Focused View of Modeling

·                    Concentrates on producing modeling artifacts and reusing existing modeling artifacts when appropriate.


The Modeling Phases, Tasks, and Artifacts


Phases of the Document Engineering Approach

·        Analyzing the Context of Use - involves identifying strategic business objectives in terms of business model requirements and the rules they must satisfy.

·        Analyzing Business Processes - creating As-Is process models.

·        Applying Patterns to Process Models - designing business processes.

·        Analyzing Documents - describing the actual documents needed by a business model.

o   The To-Be process model identifies the roles that documents will play; document analysis exposes the specific business rules that govern the content, structure, presentation, syntax, and semantics of the information contained in the documents.

·        Analyzing Document Components - starts with the harvesting task.

o   Identify the individual semantic components contained in each of the selected documents or information sources.

·        Assembling Document Components - assembling sets of information components into meaningful structures to create a coherent conceptual view we call the document component model.

·        Assembling Document Models - creating models for new types of documents based on the components, structures, and associations in our document component model.

o   Apply the rules for assembling the information components necessary for each different type of document required for the given context of use.


Implementing Models in Applications

·        Conceptual models represent substantial investments in understanding sets of business rules and capturing contextual requirements.

·        Model-based applications transform conceptual models into working applications.

·        Document implementation model - the realized artifact.

o   Document implementation models realized in markup languages are more commonly known as schemas.

·        For models of business processes, realization means adopting a suitable metamodel (such as the ebXML BPSS) to encode the specific rules and the requirements for our given context of use.

·        Business process implementation model - the modeling artifact itself encoded as a document


Analyzing the Context of Use

·                    Document Engineering project - complex set of processes and information components.

·                    Cannot analyze it in its entirety.

·                    Must define a subset of the most important components and processes.



·        Requirements are constraints on possible solutions that must be satisfied for the solution to be acceptable.

·        They are most often expressed as functional descriptions of what the solution must do.

·        Can also include performance characteristics, quality attributes, or conformance to regulations or standards.

·        Many requirements will be expressed as rules about the content, structure, and presentation of documents and their components.

o   Used to identify and design new types of documents.

·        Other requirements will be expressed as usage rules or policies about access to information or control of its processing.

o   Used to formalize the definitions of the context in which the documents are used.

·        Collecting requirements and rules is a heuristic and iterative exercise

o   Archaeologist - search for artifacts and try to interpret them even though the organizations or people who created them might be extinct and no longer available to help.

§  Might discover legacy formats and paper documents whose processes have been frozen in time. 

o   Anthropologist - locate people who work with the artifacts, and they may refer or link us to other people, who help us find more artifacts and people.


Analyzing Business Processes

·        Business process - a chain of related activities or events that take specified inputs, add value to them, and yield a specific service or product that can be the input to another business process

·        Two businesses might use different levels of abstraction or granularity to describe the processes they need to connect, making their process descriptions incompatible. To keep the descriptions compatible:


§  Use the concepts and components provided by a business reference model, whose hierarchical organization of processes has been rigorously designed to reinforce granularity.

§  Express all process models at the granularity where we can identify the documents that they produce and consume.


Analyzing Documents

·        The objective of document analysis is to create a conceptual model that encompasses all the information requirements within the required context of use.

·        The phase begins by determining which documents and information sources we need to analyze.


Create the Document Inventory

·        A document inventory is the collection of documents and related artifacts analyzed.

·        The inventory lists the sources identified along with metadata about their purposes, origins, and other attributes that will help us select a subset to analyze in detail.

·        The richer the document inventory, the more effective any analysis will be.

·        Issues:

o   much of what must be analyzed may not be in a traditional document form

o   not all information requirements are necessarily recorded in documents themselves

o   useful metadata about documents and their components may be in the form of document definitions, data models and schemas.

o   additional metadata can be found in style guides, industry standards for the domain, application interfaces, and artifacts from previous studies and analyses

o   inventory should include any undocumented information from the people involved in the exchange of documents


Sample the Document Inventory

·        Need to take a representative sample of the inventory

·        Not everything in the inventory is equally valuable

·        May also want to emphasize or give more weight to documents that are especially important or authoritative
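The inventory and sampling steps above can be sketched in a few lines of Python. This is only an illustration: the `InventoryEntry` fields, the example sources, and the numeric `weight` scale are assumptions, not part of the Document Engineering approach itself. The sketch keeps the highest-weighted entries of each kind so that every kind of source stays represented.

```python
from dataclasses import dataclass


@dataclass
class InventoryEntry:
    name: str
    kind: str      # hypothetical category, e.g. "form" or "schema"
    origin: str    # where the source came from
    weight: float  # analyst-assigned importance (assumed scale 0.0 to 1.0)


# Hypothetical inventory; real entries would come from the project itself.
inventory = [
    InventoryEntry("Purchase order form", "form", "Sales", 0.9),
    InventoryEntry("Old fax order sheet", "form", "Archive", 0.2),
    InventoryEntry("Order database schema", "schema", "IT", 0.8),
    InventoryEntry("Corporate style guide", "style guide", "Marketing", 0.4),
    InventoryEntry("Industry order standard", "standard", "Trade body", 1.0),
]


def sample_inventory(entries, per_kind=1):
    """Keep the `per_kind` highest-weighted entries of every kind."""
    by_kind = {}
    for entry in entries:
        by_kind.setdefault(entry.kind, []).append(entry)
    sample = []
    for group in by_kind.values():
        ranked = sorted(group, key=lambda e: e.weight, reverse=True)
        sample.extend(ranked[:per_kind])
    return sample


sample = sample_inventory(inventory)
for entry in sample:
    print(entry.kind, "->", entry.name)
```

Raising `per_kind` widens the sample; changing the weights is how an analyst would favor authoritative sources.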


Analyzing Document Components


Harvest the Components

·        Harvesting the components - isolating the semantic components that the sampled documents contain

·        Two distinct tasks involved in harvesting:

o   Separating the underlying meaning from presentational components

§  Involves recognizing the stylistic conventions or presentational components being applied to information in its various formats

§  Presentational structures are usually needed by people rather than by business applications, which are indifferent to presentation

§  Presentational structures are often the most salient patterns in narrative documents.

§  Identifying presentation components and presentational structures allows determination of whether stylistic characteristics are necessary to understand the information contained in the document

o   Disaggregating existing structures
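As a rough illustration of the first harvesting task, the sketch below separates presentational markup from content in an HTML fragment using Python's standard `html.parser` module. The fragment, the `ContentHarvester` class, and the set of tags treated as presentational are all assumptions chosen for the example; real sources need an analyst's judgment about which conventions carry meaning.

```python
from html.parser import HTMLParser

# Assumption: these tags carry only styling in our hypothetical sources.
PRESENTATIONAL_TAGS = {"b", "i", "u", "font", "center"}


class ContentHarvester(HTMLParser):
    """Records presentational tags and text content separately."""

    def __init__(self):
        super().__init__()
        self.presentation = []  # presentational tags, in order of appearance
        self.content = []       # text with the styling stripped away

    def handle_starttag(self, tag, attrs):
        if tag in PRESENTATIONAL_TAGS:
            self.presentation.append(tag)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.content.append(text)


# Hypothetical fragment of a legacy invoice page.
fragment = (
    "<p><b>Invoice Date:</b> <i>2024-03-01</i></p>"
    "<center>Total: 120.00</center>"
)

harvester = ContentHarvester()
harvester.feed(fragment)
harvester.close()

print("presentation:", harvester.presentation)
print("content:", harvester.content)
```

The list of presentational tags is then reviewed to decide whether any of that styling is needed to understand the information.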


Name the Content Components

·        Content components are given names to distinguish them and suggest their meaning.

·        Naming components is a contentious, iterative and ongoing activity.

·        Primary modeling artifact from the analysis of information components is a Table of Candidate Content Components.

o   Aligns the components harvested from all the document sources so as to identify synonyms, homonyms, and semantic overlaps.


Consolidate the Candidate Components

·        Need to merge any synonyms (components with different names and the same meaning) by selecting a single term to replace the different ones.

·        Need to split the different senses of homonyms (components with the same name but different meanings) by assigning more distinctive names to each one.

·        This consolidation activity merges the separate sets of candidate components created from each source during the harvesting activity into a master or combined set.

·        The modeling artifact produced is called a Consolidated Table of Content Components.
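The consolidation step can be sketched as follows. The candidate components, their meanings, and the agreed names are hypothetical, and in practice the analyst decides which names mean the same thing; the code only shows the mechanics of detecting synonyms and homonyms and of keying the consolidated table on meaning rather than on the original name.

```python
from collections import defaultdict

# Hypothetical candidate components: (source, name in source, meaning).
candidates = [
    ("purchase_order", "Zip", "postal code of the delivery address"),
    ("invoice", "PostCode", "postal code of the delivery address"),
    ("purchase_order", "Date", "date the order was issued"),
    ("invoice", "Date", "date payment is due"),
]

# Assumed analyst decision: one agreed name per distinct meaning.
agreed_names = {
    "postal code of the delivery address": "DeliveryPostalCode",
    "date the order was issued": "OrderIssueDate",
    "date payment is due": "PaymentDueDate",
}


def find_synonyms(rows):
    """Meanings that appear under more than one name."""
    names_by_meaning = defaultdict(set)
    for _, name, meaning in rows:
        names_by_meaning[meaning].add(name)
    return {m: n for m, n in names_by_meaning.items() if len(n) > 1}


def find_homonyms(rows):
    """Names that are used for more than one meaning."""
    meanings_by_name = defaultdict(set)
    for _, name, meaning in rows:
        meanings_by_name[name].add(meaning)
    return {n: m for n, m in meanings_by_name.items() if len(m) > 1}


def consolidate(rows, names):
    """Merge synonyms and split homonyms by grouping on meaning."""
    table = defaultdict(list)
    for source, name, meaning in rows:
        table[names[meaning]].append((source, name))
    return dict(table)


synonyms = find_synonyms(candidates)
homonyms = find_homonyms(candidates)
consolidated = consolidate(candidates, agreed_names)
for name, origins in consolidated.items():
    print(name, "<-", origins)
```

Each row of the result records where a consolidated component came from, which keeps the naming decisions traceable.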


Assembling Document Components

·        First step in creating models of documents from this set is to establish the required structures and identify any associations between them


Formalize the Component Model

·        More rigorous techniques for assembling structural components produce more predictable results.

·        Assemble components based on the concept of functional dependency.

·        These are the techniques used by database designers to yield relational models that minimize redundancy and maintain information integrity.

·        This modeling artifact is called a document component model but it may be more familiar to data analysts as a domain model

o   This model presents an overall conceptual view of all the information components required for a given context of use.

o   It is convenient to represent this model as a UML class diagram.

·        From this set of associated semantic structures we can assemble all our new document models that may span the transactional and narrative ends of the document type spectrum
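A minimal sketch of assembling components by functional dependency: each component is placed in the structure identified by the component that determines its value. The dependency list below is hypothetical, and the sketch ignores composite determinants and the finer points of normalization that a database designer would also consider.

```python
from collections import defaultdict

# Hypothetical functional dependencies as (determinant, dependent) pairs:
# knowing the determinant's value fixes the dependent's value.
dependencies = [
    ("OrderID", "OrderIssueDate"),
    ("OrderID", "BuyerID"),
    ("BuyerID", "BuyerName"),
    ("BuyerID", "DeliveryPostalCode"),
    ("ItemID", "ItemDescription"),
    ("ItemID", "UnitPrice"),
]


def group_by_determinant(deps):
    """Put each component into the structure its determinant identifies."""
    structures = defaultdict(list)
    for determinant, dependent in deps:
        structures[determinant].append(dependent)
    return dict(structures)


def find_associations(structures):
    """A dependent that is itself a determinant links two structures."""
    return [
        (owner, dependent)
        for owner, dependents in structures.items()
        for dependent in dependents
        if dependent in structures
    ]


structures = group_by_determinant(dependencies)
associations = find_associations(structures)

for key, members in structures.items():
    print(key, "determines", members)
print("associations:", associations)
```

Here `BuyerName` is stored once with the buyer rather than repeated on every order, which is the redundancy the technique is meant to remove.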


Associations between Structures

·        The document component model that emerges from analysis does not describe a single document structure.

o   It defines a network of all potential document structures that might be required within the context of use.

·        Specific types of documents are designed by organizing their structural components into document assembly models.

o   A document assembly model is created by defining a specific path through this network of associations.


Assembling Document Models

·        First consideration in designing documents is that they are hierarchical in their structure.

·        A document can be seen as a set of nested structures of components.

o   This is why models of documents are often expressed as tree diagrams: a tree represents such a hierarchy directly.

·        But … the document component model represents a network, not a hierarchy.

o   It cannot define a document because it has no definite roots, branches, or leaves.

·        To create a suitable hierarchical model of a document:

o   First select the entry point - the structural component required as the root of the hierarchy.

o   Then assemble a document model by adding the required roles and associations as dictated by the business rules and requirements of the document’s context of use. We refer to this task as.
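The step from network to hierarchy can be sketched as a walk over the component model that starts at the chosen entry point and follows only the associations the document needs. The `network` contents, the `required` associations, and the `assemble` function are hypothetical; the sketch also refuses to re-enter a component already on the current path, which is one simple way to keep the result a tree.

```python
# Hypothetical document component model: a network of associations.
# Note the cycle between Order and Buyer, so no component is a natural root.
network = {
    "Order": ["Buyer", "OrderLine"],
    "Buyer": ["Address", "Order"],
    "OrderLine": ["Item"],
    "Item": [],
    "Address": [],
    "Invoice": ["Order", "Buyer"],
}


def assemble(model, entry_point, required, path=()):
    """Return the nested children of `entry_point` as a tree of dicts.

    Only associations listed in `required` are followed, and a component
    already on the current path is skipped to avoid looping.
    """
    path = path + (entry_point,)
    children = {}
    for target in model[entry_point]:
        if (entry_point, target) in required and target not in path:
            children[target] = assemble(model, target, required, path)
    return children


# Assumed business rules for an order document.
order_rules = {
    ("Order", "Buyer"),
    ("Buyer", "Address"),
    ("Order", "OrderLine"),
    ("OrderLine", "Item"),
}
order_document = {"Order": assemble(network, "Order", order_rules)}

# A different entry point and rule set yields a different document
# from the same network.
invoice_rules = {("Invoice", "Buyer"), ("Buyer", "Address")}
invoice_document = {"Invoice": assemble(network, "Invoice", invoice_rules)}

print(order_document)
print(invoice_document)
```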


Implementing Models

·        Need to create physical, computable artifacts from our models to realize model-based applications.

·        The best available way to realize physical models from conceptual ones is to encode them in an XML schema language.


Encoding Document Implementation Models

·        Document assembly models are realized by encoding them as document implementation models.

·        The implementation language influences the potential to reuse existing patterns.
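To make the encoding step concrete, the sketch below turns a small hierarchical assembly model into a W3C XML Schema using Python's standard `xml.etree.ElementTree` module. The `assembly_model` contents are hypothetical, and the sketch assumes that every leaf component is a string and that children appear in a fixed sequence; a real implementation model would carry data types and occurrence rules from the business rules.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)

# Hypothetical document assembly model: nested dicts, leaves are empty dicts.
assembly_model = {
    "Order": {
        "Buyer": {"Name": {}, "PostalCode": {}},
        "OrderLine": {"Item": {}},
    }
}


def add_element(parent, name, children):
    """Append an xs:element for `name`, recursing into its children."""
    element = ET.SubElement(parent, f"{{{XS}}}element", {"name": name})
    if not children:
        # Assumption: leaves hold plain strings.
        element.set("type", "xs:string")
        return
    complex_type = ET.SubElement(element, f"{{{XS}}}complexType")
    sequence = ET.SubElement(complex_type, f"{{{XS}}}sequence")
    for child_name, grandchildren in children.items():
        add_element(sequence, child_name, grandchildren)


def to_schema(model):
    """Encode the assembly model as an xs:schema element tree."""
    schema = ET.Element(f"{{{XS}}}schema")
    for root_name, children in model.items():
        add_element(schema, root_name, children)
    return schema


schema = to_schema(assembly_model)
xsd_text = ET.tostring(schema, encoding="unicode")
print(xsd_text)
```

The same model could be encoded in another schema language; only `add_element` would need to change, which is one reason to keep the conceptual model separate from its encoding.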


Encoding Business Process Implementation Models

·        Business process implementation models encode the To-Be process, collaboration, and transaction models defined together with any patterns adopted or adapted for our new designs.

·        Business service interfaces can then interpret these documents to guide the processing of the documents they receive


7.11 Summary of Modeling Phases and Artifacts




Phase - Modeling artifacts

·        Analyzing the Context - UML use case diagrams

·        Analyzing/Designing Business Processes - Business Domain View Worksheet; UML use case diagrams

·        Analyzing/Designing Business Collaborations - Business Process Area Worksheet; UML activity diagrams

·        Analyzing/Designing Business Transactions - Business Transaction View Worksheet; UML sequence diagrams

·        Applying Patterns to Business Processes - Document checklist

·        Analyzing Documents - Document inventory

·        Analyzing Document Components - Consolidated table of content components

·        Assembling Document Components - UML class diagram

·        Assembling Document Models - UML class diagram or spreadsheet assembly model

·        Implementing Model-Based Applications - XML schema for document models; XML schema for process models