SE735 - Data and Document Representation & Processing
Lecture 2 - Introduction to Document Engineering
The D-I-K-W Pyramid
Defining data
• Data has no meaning or value because it is without context and interpretation.
• Data are discrete, objective facts or observations, which are unorganized and unprocessed, and do not convey any specific meaning.
• Data items are an elementary and recorded description of things, events, activities and transactions.
Defining information
• Data that have been processed so that they are meaningful
• Data that have been processed for a purpose
• Data that have been interpreted and understood by the recipient
Defining knowledge
• Knowledge is the combination of data and information, to which is added expert opinion, skills, and experience, to result in a valuable asset which can be used to aid decision making.
• Knowledge is data and/or information that have been organized and processed to convey understanding, experience, accumulated learning, and expertise as they apply to a current problem or activity.
Defining wisdom
• Wisdom is the highest level of abstraction, with vision, foresight, and the ability to see beyond the horizon.
• Wisdom is the ability to act critically or practically in any given situation. It is based on ethical judgement related to an individual’s belief system.
The wisdom hierarchy mapping to types of information
systems:
Three principal uses of the word "information":
1. Information-as-process: When someone is informed, what they know is changed.
· In this sense "information" is "The act of informing...; communication of the knowledge or `news' of some fact or occurrence; the action of telling or fact of being told of something." (Oxford English Dictionary, 1989, vol. 7, p. 944).
2. Information-as-knowledge: "Information" is also used to denote that which is perceived in "information-as-process": the "knowledge communicated concerning some particular fact, subject, or event; that of which one is apprised or told; intelligence, news." (Oxford English Dictionary, 1989, vol. 7, p. 944).
· The notion of information as that which reduces uncertainty could be viewed as a special case of "information-as-knowledge". Sometimes information increases uncertainty.
3. Information-as-thing: The term "information" is also used attributively for objects, such as data and documents, that are referred to as "information" because they are regarded as being informative, as "having the quality of imparting knowledge or communicating information; instructive." (Oxford English Dictionary, 1989, vol. 7, p. 946).
Key characteristic of "information-as-knowledge" – intangibility
· Cannot touch it or measure it in any direct way
· Knowledge, belief, and opinion are personal, subjective, and conceptual
· To communicate them, they have to be expressed, described, or represented in some physical way, as a signal, text, or communication.
· Any such expression, description, or representation would be "information-as-thing".
Why examine the notion of "information-as-thing"?
(1) Clarify its meaning in relation to other uses of the term "information";
(2) Affirm the fundamental role of "information-as-thing" in information systems; and
(3) Speculate on possible use of the notion of "information-as-thing" in bringing theoretical order to the heterogeneous, ill-ordered fields associated with "information science."
Two distinctions:
(1) Between entities and processes.
(2) Between intangibles and tangibles.
These two distinctions yield four different aspects of
information and information systems:
INFORMATION, INFORMATION SYSTEMS, INFORMATION SCIENCE
(1) All information systems deal directly with "information-as-thing"
(2) Information storage and retrieval systems can deal directly only with "information-as-thing"
(3) Representations of knowledge form a distinguishable subset of information-as-thing and so could, in principle, be used to identify and define another class of information systems in which the primary concern is based on the knowledge represented
(4) Information-as-process could be the basis for defining a class of information-related studies
Paul Otlet (1934): Objects as Documents:
Graphic and written
records are representations of ideas or of objects, he wrote, but the objects
themselves can be regarded as ‘‘documents’’ if you are informed by observation
of them.
Walter Schurmeyer (1935):
Nowadays one understands
as a document any material basis for extending our knowledge which is available
for study or comparison
International Institute for Intellectual Cooperation (1937):
Document: Any source of
information, in material form, capable of being used for reference or study or
as an authority. Examples: manuscripts, printed matter, illustrations,
diagrams, museum specimens, etc.
Suzanne Briet (1951):
A document is evidence in
support of a fact.
A document is ‘‘any
physical or symbolic sign, preserved or recorded, intended to represent, to reconstruct,
or to demonstrate a physical or conceptual phenomenon’’
We can infer from her discussion that:
(1) There is materiality: Physical objects and physical signs only;
(2) There is intentionality: It is intended that the object be treated as evidence;
(3) The objects have to be processed: They have to be made into documents; and, we think,
(4) There is a phenomenological position: The object is perceived to be a document.
Donker Duyvis (1942):
A document is the
repository of an expressed thought. Consequently its
contents have a spiritual character. The danger that blunt unification of the
outer form exercises a repercussion on the contents in making the latter
characterless and impersonal, is not illusory. . . . In standardizing the form
and layout of documents it is necessary to restrict this activity to that which
does not affect the spiritual contents and which serves to remove a really
irrational variety.
Levy (1994):
A conventional document,
such as a mail message or a technical report, exists physically in digital
technology as a string of bits, but so does everything else in a digital
environment. In this sense, any distinctiveness of a document as a physical
form is further diminished, and discussion of ‘‘What
is a digital document?’’ becomes even more problematic unless we remember the
path of reasoning underlying the largely forgotten discussions of Otlet’s objects and Briet’s
antelope.
We need Document Engineering to help us specify,
design, and implement the documents that are the inputs and outputs of business
services.
Document Engineering synthesizes complementary ideas
from information and systems analysis, electronic publishing, business process
analysis, and business informatics. Its unifying document-centric perspective
helps us conceive and understand the new network-based business models made
possible by the Internet and supporting technologies.
The essence of Document Engineering is the analysis and design methods that yield
• Precise specifications or models for the information that business processes require
• Rules by which related processes are coordinated, whether between different firms to create composite services or virtual enterprises or within a firm to streamline information flow between organizations.
Document Engineering provides the concepts and methods
needed to align business strategy and information technology, to bridge the gap
between what we want to do and how to do it. Describing business processes in
terms of the more abstract notion of document exchanges makes it easier to
understand the constraints imposed by legacy systems and technologies and to
recognize the opportunities created by new ones if we focus on conceptual
models of the exchanges rather than on how they are implemented. The
expressiveness of XML for implementation models bridges the traditional gap
between business strategy and its technology realization.
A Simple Business Transaction?
The "Virtual Bookstore"
Scenario:
1. Customer selects book from catalog on an online bookstore
2. Customer pays with credit card
3. Book arrives via express shipper two days later
· From the customer's perspective there is only one "transaction"
· But the bookstore is a "virtual enterprise" or "composite application" that follows the drop shipment pattern to coordinate the activities of 4 different service providers transacting with each other
· This coordination - or choreography - is carried out with document exchanges
Two Ways to Analyze the Virtual Bookstore
1. In terms of the "drop shipment" business pattern
2. In terms of the overlapping semantic components that "glue" the business processes together
The Drop Shipment Pattern
Drop shipping is a supply
chain management technique in which the retailer does not keep goods in stock,
but instead transfers customer orders and shipment details to either the
manufacturer or a wholesaler, who then ships the goods directly to the
customer.
This composite service involved at least three separate transactions or exchanges of information:
1. Your interaction with the GMBooks.com catalog to select the book you want to order.
2. A document exchange between GMBooks.com and a credit authority (a bank or authorization network like VISA or MasterCard) to verify your creditworthiness and charge your account.
3. A document exchange between GMBooks.com and the delivery service with the instructions for picking up and delivering your book.
Pattern
typical of many Internet retailers and completely invisible to you as the
customer.
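The three exchanges above can be sketched as a short Python program. This is only an illustration of the choreography, not how GMBooks.com or any real retailer works: the class, the function names, the document fields, and the approval rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Order:
    """Document produced when the customer selects a book (exchange 1)."""
    order_id: str
    isbn: str
    price: float
    ship_to: str


def authorize_payment(order: Order) -> dict:
    """Exchange 2: retailer -> credit authority, returns an authorization document."""
    # Hypothetical rule: approve any positive amount.
    return {"order_id": order.order_id, "amount": order.price, "approved": order.price > 0}


def request_shipment(order: Order, authorization: dict) -> dict:
    """Exchange 3: retailer -> delivery service, returns a shipping instruction document."""
    if not authorization["approved"]:
        raise ValueError("payment was not authorized")
    return {"order_id": order.order_id, "isbn": order.isbn, "deliver_to": order.ship_to}


def place_order(order: Order) -> list:
    """Run the choreography and return the documents exchanged, in sequence."""
    authorization = authorize_payment(order)
    shipment = request_shipment(order, authorization)
    return [order, authorization, shipment]


# Placeholder values, invented for the example.
documents = place_order(Order("o-1", "0000000000", 24.95, "1 Example Street"))
for document in documents:
    print(document)
```

The customer sees only the call to `place_order`; the authorization and shipping documents stay behind the scenes, which is the sense in which the pattern is invisible.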
Example: Amazon.com Patterns
· Amazon.com started out as a drop shipper
· It was forced to rapidly expand its own warehousing and distribution capabilities after Barnes and Noble bought Ingram Books as a competitive response to Amazon
· Amazon had to build a configurable "merchant platform" to handle its growth in product categories, transaction volume, and associated services
· In 2003 Amazon began to use this platform to host various combinations of these services for other businesses
Amazon Web Store
Fulfillment by Amazon
Amazon Drop Shipment
The GMBooks.com Virtual Enterprise
· Coordination by the exchange of information between the retailer and book distributors, shippers, and credit authorities.
· The retailer doesn’t need its own books and delivery trucks; it can replace inventory and equipment with information
· Coordination is usually invisible from the customer’s perspective
· It requires a complex and carefully managed series of document exchanges (often called a document choreography) over a period that may range from hours to weeks
The Document Type Spectrum
A Narrative Example
A Transactional Example
A Hybrid Example: Catalog
= Data (Document)
Systematic Variation across the Spectrum
§ Homogeneity
§ Prescriptiveness
§ Variety of data types
§ Precision of data types
§ Correlation between presentation, structure, and content
Processes that Involve Both Data and Documents
· Doing business by document exchange is natural and intuitive.
· Businesses use documents to organize their interactions with each other and to package the information needed to carry out a transaction or other meaningful unit of business
· Exchanges of documents that take place to carry out business models follow common patterns
o e.g. Supply chains, business-to-business marketplaces, auctions, information brokers or aggregators, and content syndication networks
· e.g. Order Management Cycle
1. Order Planning
2. Order Generation
3. Cost estimation and pricing
4. Order receipt and entry
5. Order selection and prioritization
6. Scheduling
7. Fulfillment
· Procurement
· Manufacturing
· Assembling
· Testing
· Shipping
· Installation
8. Billing
9. Returns and Claims
10. Post-sales Services
o Complete cycle involves documents that flow
between sales, engineering, finance, logistics, customer service, and other
divisions within the organization.
o The specific documents and divisions vary in
different contexts, but the general pattern is ubiquitous.
Standards:
o The alignment of business models and
technology is easier to achieve when an organization systematically structures
its business capabilities as self-contained resources or processes so they can
efficiently interact and recombine to meet changing business requirements.
o Using standard documents as interfaces for
business processes is a natural outcome of organizing business functions as
more discrete and flexible components.
Business Model
Example: Business reference
model
o A business reference model is a reference model,
concentrating on the functional and organizational aspects of the core business
of an enterprise, service organization or government agency.
o A business reference model is a means to
describe the business operations of an organization, independent of the
organizational structures that perform them.
o The most familiar business reference model is
the "Business Reference Model", one of five reference models
of the Federal Enterprise Architecture of the US Federal Government.
o It is a function-driven framework for describing the business operations of the Federal Government
independent of the agencies that perform them.
o In computing and systems design a loosely
coupled system is one in which each of its components has, or makes use of,
little or no knowledge of the definitions of other separate components.
o Two business organizations must agree on what the documents mean and on the business processes they expect each other to carry out with them
o They don’t need to agree on or even know anything about the technology they use to create and process the documents.
o The documents are the only visible interfaces to their respective business processes.
o Required: a complete
view of how information is defined and used in different business contexts—what
is often called the information model—a formal representation of the structure
and semantics of information.
o There are at least two sides to every document
exchange, and all parties need to ensure that they understand the documents in
the same way.
o Solution 1: every enterprise adopts a common data model and uses exactly
the same definitions for the document components of its applications.
o But that’s inconceivable; enterprises, applications, and
people just don’t stand still long enough to make it possible.
o Solution 2: two parties each
create conceptual information models to help them translate or transform the
documents they receive so their applications can understand them
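A minimal sketch of Solution 2, assuming each party's conceptual model can be reduced to a mapping between field names. Both vocabularies below are invented for illustration and do not come from any real standard.

```python
# Hypothetical mapping from the sender's field names to the receiver's.
SELLER_TO_BUYER = {
    "PurchaseOrderNumber": "order_id",
    "BuyerName": "customer",
    "TotalAmount": "total",
}


def translate(document: dict, mapping: dict) -> dict:
    """Rename the fields of a received document using a conceptual mapping.

    Fields with no counterpart are reported rather than silently dropped,
    because they signal a gap between the two information models.
    """
    unmapped = sorted(set(document) - set(mapping))
    if unmapped:
        raise KeyError(f"no mapping for fields: {unmapped}")
    return {mapping[name]: value for name, value in document.items()}


received = {"PurchaseOrderNumber": "PO-17", "BuyerName": "A. Reader", "TotalAmount": "24.95"}
translated = translate(received, SELLER_TO_BUYER)
print(translated)
```

Real transformations usually have to reconcile structure, units, and code lists as well as names, so a flat renaming like this is only the simplest case.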
o e.g. US Government Data
and Information Reference Model:
o The Data Reference Model (DRM) is one of the five
reference models of the Federal Enterprise Architecture (FEA).
o The DRM is a framework whose primary purpose is to
enable information sharing and reuse across the United States federal
government via the standard description and discovery of common data and the
promotion of uniform data management practices
o DRM structure
The
DRM provides a standard means by which data may be described, categorized, and
shared. These are reflected within each of the DRM’s three standardization
areas:
§ Data Description: Provides a means to uniformly
describe data, thereby supporting its discovery and sharing.
§ Data Context: Facilitates discovery of data through an
approach to the categorization of data according to taxonomies. Additionally,
enables the definition of authoritative data assets within a community of interest (COI).
§ Data Sharing: Supports the access and exchange of data,
where access consists of ad hoc requests (such as a query of a data asset), and
exchange consists of fixed, recurring transactions between parties.
In this scenario:
1.
Organization B determines that it has the need for a particular set of data
that might be available from Organization A.
2.
Organization A uses the DRM to categorize its data (using the Business
Reference Model) into a business context.
3.
Organization B identifies Organization A’s available data through its business context.
4.
Organization A uses the DRM to publish the detailed structure of the actual
data element (in support of the business context).
5.
Organizations A and B determine if, in fact, the data produced by Organization
A will meet the needs of Organization B.
6.
Once Organizations A and B determine that the data can be re-used, the information
exchange package is used to transmit the data.
o To exchange documents, computers or business applications require a
precise and unambiguous language for describing information models.
o Since its emergence in the late 1990s, XML—the Extensible Markup
Language—has become the preferred format for representing information in
documents both on and off the Internet.
o Web publishers view XML as an improvement on HTML that enables greater
automation and consistency in formatting.
o Programmers see XML as an Internet-friendly, easy-to-parse, and
nonproprietary data format that they can use instead of ad hoc languages for
application configuration and interprocess
communication.
o Electronic data interchange (EDI) developers see XML as a more
expressive, maintainable, and therefore lower cost syntax for creating business
messages.
o XML’s broad impact in publishing, programming, and EDI has made it a unifying
technology for implementing applications that use Internet protocols,
especially for those that span enterprise boundaries, such as web services.
o Using XML to encode implementation models yields an overall rigor, reusability, and
programmability unmatched by other representations
o XML’s facility for document encoding is an excellent match for the
document exchange architecture of the Internet
o Programming paradigms are emerging in which XML schemas, programming
language objects, database schemas, and UML models can be treated as equivalent
because XML schemas can be used to generate any of the other formats if
required
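To make the points about XML concrete, here is a small sketch that encodes an order as XML and processes it with Python's standard `xml.etree.ElementTree` module. The element names, attribute names, and values are assumptions made up for this example, not part of any published schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical order document; the vocabulary is invented for illustration.
ORDER_XML = """
<Order id="PO-17">
  <Buyer>A. Reader</Buyer>
  <Item isbn="0000000000" quantity="2">
    <Title>Example Book</Title>
    <UnitPrice currency="USD">12.50</UnitPrice>
  </Item>
</Order>
""".strip()


def order_total(xml_text: str) -> float:
    """Sum quantity * unit price over every Item element in the order."""
    root = ET.fromstring(xml_text)
    total = 0.0
    for item in root.findall("Item"):
        quantity = int(item.get("quantity"))
        unit_price = float(item.findtext("UnitPrice"))
        total += quantity * unit_price
    return total


total = order_total(ORDER_XML)
print(total)
```

The same text is readable by a person, parseable by a program, and neutral about the technology on either side of the exchange, which is the combination of properties the bullets above describe.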
Document Models as Interfaces
o Simplest example: a
document displayed in a browser that users interact with in ways that are
determined by the document’s model
o Narrative end of the Document Type Spectrum: E-books or other structured publications in
which user interface features, including tables of contents, hypertext links, and
navigation aids, are generated from the names or attributes of the information
components in the document.
o Transactional end of the Document Type Spectrum: E-forms in which applications collect the
information specified in the document’s model to automate processes that
previously have relied on printed forms
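As an illustration of the narrative case, the sketch below generates a table of contents from the names and attributes of a document's components. The `Book`, `Chapter`, and `Section` elements and the `title` attribute are hypothetical choices for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured publication; element and attribute names are assumptions.
BOOK_XML = """<Book>
  <Chapter title="Defining Data">
    <Section title="Facts and Observations"/>
  </Chapter>
  <Chapter title="Defining Information">
    <Section title="Information-as-Thing"/>
    <Section title="Information-as-Process"/>
  </Chapter>
</Book>"""


def table_of_contents(xml_text: str) -> list:
    """Build numbered table-of-contents lines from chapter and section titles."""
    root = ET.fromstring(xml_text)
    lines = []
    for c_no, chapter in enumerate(root.findall("Chapter"), start=1):
        lines.append(f"{c_no}. {chapter.get('title')}")
        for s_no, section in enumerate(chapter.findall("Section"), start=1):
            lines.append(f"  {c_no}.{s_no} {section.get('title')}")
    return lines


toc = table_of_contents(BOOK_XML)
print("\n".join(toc))
```

The navigation aid is derived entirely from the document's model, so adding a chapter to the document updates the interface without any change to the code.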
e.g. Virtual Enterprise Example
Web Services and Service Architectures
o Using documents as interfaces underlies the idea of service-oriented architectures (SOA) as a way to create new applications or systems such as web services by integrating or combining components of other ones
o Web service: “an interface that describes a collection of operations that are network accessible through standardized XML messaging”
o This implies that any self-contained application functionality or information resource is turned into a service by packaging it so that it exposes only input and output XML documents.
o At the abstraction level introduced by document exchange, a service can be:
o
Anything that can send or
receive a document
o
Anything that can accept
a document, process it and return a result
o
Anything that can accept
a document and allow the user to act on it
o
Anything that can accept
a document and forward it to some other application or destination
o
Anything that can
generate a document as a result of user interaction, processing a received
document, or some other event.
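The second case in the list above (accept a document, process it, and return a result) can be sketched as a plain function whose only interface is an input and an output XML document. The document vocabulary and the approval threshold are hypothetical, and a real web service would add a network transport around this function.

```python
import xml.etree.ElementTree as ET


def credit_check_service(request_xml: str) -> str:
    """Accept a request document, process it, and return a response document."""
    request = ET.fromstring(request_xml)
    amount = float(request.findtext("Amount"))
    # Hypothetical rule standing in for a real authorization decision.
    approved = amount <= 500.0
    response = ET.Element("CreditResponse", {"orderId": request.get("orderId")})
    ET.SubElement(response, "Approved").text = "true" if approved else "false"
    return ET.tostring(response, encoding="unicode")


reply = credit_check_service(
    '<CreditRequest orderId="PO-17"><Amount>24.95</Amount></CreditRequest>'
)
print(reply)
```

Callers depend only on the two document formats, so the rule inside the function can be replaced without affecting them.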
Advantages:
o Because web services are loosely coupled and
hide implementations, document interfaces allow firms to maintain a clean and
stable relationship to partners and customers.
o By using independent components, web services
also make it easier and cheaper to adopt new technologies incrementally without
affecting any existing business functionality.
o Because document interfaces can be implemented
in any technology, platform compatibility concerns are lessened. (e.g. cloud
services)
The Data/Document Chasm
§ It's obviously a continuum... but transactional and narrative documents have traditionally been analyzed with different disciplines and use different tools, terminology, and techniques – little intersection
§ Very different intellectual and domain roots
§ "Document analysis" for narrative documents: literary
criticism, graphical design; electronic publishing
§ "Data modeling" for transactional documents: philosophy,
linguistics, systems analysis; data automation
Crossing the Data/Document Chasm
§ Document Engineering harmonizes the terminology and emphasizes what they
have in common rather than highlighting their differences
§ Identifying the presentational, content, and structural components and defining their
relationships to each other
§ Identifying "good" content components
§ Designing, describing, and organizing components to facilitate their
reuse
§ Assembling hierarchical document models that organize components
according to the requirements of a specific context for information exchange
Data/Document Driven {and,or,vs} Business Process Analysis
§ The document analysis and data modeling approaches focus from the
beginning on the structure and content of the "document payload" that
will be exchanged – a "bottom up" approach that emphasizes "Does
this work from a technical perspective?"
§ In contrast, business process analysis begins with an abstract or
broadly scoped perspective on business activities
§ Emphasizes "Does this work from a business perspective?"
§ Inherently a "top down" approach that starts with business
models and processes and gets to the "document payloads" only at the
end
{and,or,vs} User Task Analysis
§ Task analysis (or user analysis) is the observation of people performing
the tasks or use cases when the application or system must support human
interfaces and not just other applications
§ Task analysis and document analysis are closely related; document
analysis reveals candidate information components and task analysis reveals
rules about their intent and usage.
§ Task analysis is especially important when few documents or information
sources exist because human problems or errors can suggest that important
information is missing
Document Engineering organizes its modeling approach into 8 phases:
First phase - Analyzing the Context of Use:
Business and task analysis techniques establish the context
for a Document Engineering effort by identifying the requirements and rules
that must be satisfied to provide an acceptable solution.
Phases 2 & 3 - Analyzing Business Processes and Applying Patterns:
Business process analysis is applied to identify the
requirements for the document exchange patterns needed to carry out the desired
processes, collaborations, and transactions in the context of use.
Phase 4 - Document Analysis:
Involves identifying a representative set of documents
or information sources (including people) and analyzing them to harvest all the
meaningful information components and business rules
Phase 5 - Component Assembly:
A document component model is developed that represents
structures and their associations and content that define
the common rules for the possible contexts of use.
Phase 6 - Document Assembly:
The document component model is used to create document
assembly models for each type of document required.
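One way to picture the relationship between phases 5 and 6 is a shared pool of components from which each document type is assembled. The sketch below is a deliberate simplification under that assumption; the component names, fields, and document types are all invented.

```python
# Hypothetical component model: reusable components and the fields each contains.
COMPONENTS = {
    "Party": ["name", "address"],
    "LineItem": ["isbn", "quantity", "unit_price"],
    "Payment": ["card_type", "amount"],
}

# Hypothetical document assembly models: each document type lists the components it reuses.
DOCUMENT_MODELS = {
    "Order": ["Party", "LineItem", "Payment"],
    "ShippingNote": ["Party", "LineItem"],
}


def assemble(document_type: str) -> dict:
    """Expand a document assembly model into its components and their fields."""
    return {name: list(COMPONENTS[name]) for name in DOCUMENT_MODELS[document_type]}


def shared_components(first: str, second: str) -> list:
    """List the components that two document types have in common."""
    return sorted(set(DOCUMENT_MODELS[first]) & set(DOCUMENT_MODELS[second]))


print(assemble("ShippingNote"))
print(shared_components("Order", "ShippingNote"))
```

Because both document types draw on the same `Party` and `LineItem` definitions, the overlapping information means the same thing in each document.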
Phase 7 - Implementation:
Conceptual models are first encoded using a suitable
language to support their physical implementation
"Tailing
Virulent Veggies" (WSJ, 3/13/07)
Dole Food Co., the world's largest producer of fresh
vegetables, recently started using radio-frequency identification tags to track
leafy greens as they move from fields to trucks and through processing
facilities
Scanning RFID Tags on Vegetable Boxes:
"SPEEDy Airline Self-Service" (Hong Kong News, 1/07)
"More
Clicks at the Bricks" (Business Week, 12/6/07)
The Web and other information and communications
technologies have substantially affected consumer behavior
This story analyzes some of these new patterns of
information use (between people and information systems/resources, and between people)
Retail stores are offering new kinds of information
services
"Landstuhl to use Electronic Health Records" (Army
Times, 1/16/08)
With the new system, health care workers will create
PDFs (portable document formats) for inpatient records, allowing them to be
sent to the next hospital electronically.
Within a year, CliniComp
officials hope to have all PDF records within its system at Landstuhl
easily available to the Veterans Affairs Department as service members move
from military to civilian status.
This has been a problem in the past; the Defense
Department and VA medical records systems are not compatible.
CliniComp
does not have the capability to scan and create PDFs for old medical records,
so it only applies to records being created now.
"Big
Employers Plan EHRs" (WSJ, 11/29/06)
"Insurers
Lack Governance" (Gartner, 1/9/08)
What Are the Common Themes in These News Items?
Enormous amounts of existing (paper) documents and
legacy processes would benefit from automation, process re-engineering, transformation
to SOA
New business processes are created / coordinated /
choreographed via the management and exchange of electronic documents
Standards / patterns for documents and business
processes are essential
Information technology and business processes are
co-evolving with many ways to create business value
But projects can be challenging, and their success
depends on many factors besides technology