SE735
- Data and Document Representation & Processing |
Lecture 10 -
Analyzing Documents and Document
Components |
Analyzing Documents
Classifying Requirements in
Document-Intensive Contexts
·
SOLUTION requirements – the
functional, performance, quality attributes
·
INFORMATION or DATA requirements – what
information is needed, what are its datatypes,
possible values
·
DOCUMENT or STRUCTURE requirements
– how is the information organized / assembled / packaged into sets of related
information
·
PRESENTATION or SYNTACTIC
requirements – how is the information presented or formatted or rendered – the
physical or output model
·
PROCESSING and USAGE requirements –
what relationships between documents have a business purpose
From Chapter 8 of Document
Engineering
Rules that Apply to Conceptual
Models
· Semantic
· Structural
· Usage
Rules that (Can Also) Apply to
Physical Models
· Syntactic
· Processing
· Presentational
Rules that Apply to Instances or
Implementations
·
Content
Requirements in the Model Matrix
Context Dimensions x Rule Types
Rigorous Requirements Processes are
Document-Intensive
·
Occasionally requirements are
developed in an extremely rigorous way
·
Large companies or organizations
(example: government, military, General Motors) may conduct a "contract
definition phase" or issue a "Request for Information" (RFI)
document in which they engage multiple companies or consultants to define a
problem and come up with some preliminary solution or design concepts
·
These requirements may be heavily
constrained by legacy technology or processes unique to the customer
·
The best ideas then get turned into
a "contract specification" or a "Request for Quote" (RFQ)
document.
A More Typical Requirements Process
·
Most companies carry out IT
planning activities (in software companies these are called "product
management" or "product marketing" activities) to define
requirements for systems or applications
·
The results of this activity are
recorded in a "Product Requirements Document" or PRD
·
These requirements are then
negotiated with the organization that will design and develop them (engineering
or other R&D organization) the requirements specifications can provide a
roadmap for future versions or products
When There Isn't A
Requirements Process There Are No Documents
·
Some companies – especially new or
undisciplined ones – don't bother with requirements
·
Many people want to "get on
with it" and are biased toward working on the "end product"
(doing the programming or implementation activity) rather than working on less
tangible intermediate artifacts (like those produced by requirements, analysis,
and design activities)
·
So they don't specify requirements
in any rigorous way and no models of information or process are created
· Not saying
that "prototyping is bad" or "agile methods don't work" but
"model-guided prototyping" is a lot better -- it is essential to
systematically explore some design alternatives
Who Performs Document Analysis?
·
Standard approach is facilitation
by document analysis experts in face-to-face "workshops" with broad
participation
·
Document creators/users reach
consensus with expert help, and then experts systematize it into models and
schemas
·
Document analysis is often carried
out as a consulting engagement – with all the complications of defining the
project, managing expectations and relationships, and packaging the results for
effective use
Creators/Users in Document Analysis
·
What will they know?
·
What won't they know?
· What factors will constrain their interactions with you?
Experts / Consultants in Document
Analysis
·
This is YOUR role
·
What will you know?
·
What won't you know?
·
What factors will constrain your
interactions with others?
Generic Requirements in
Document-Intensive Environments
· Automated information capture --
Eliminate manual entry (or reentry) of information when documents are created
· Straight-through processing --
Minimize the need for any human intervention as a document flows through some
specified processes.
· Timeliness -- Make
information available to those who need it when it is needed and when promised,
and update it promptly when it changes.
· Accuracy -- Ensure that every piece of
information in a document is correct.
· Completeness -- Ensure
that a document contains all the information it should or that its recipient
(person or application) expects.
· Automated validation -- Provide
a schema or specification that enables information to be validated.
·
Interoperability -- Enable
information to be used "as is" or via automated transformation by
other systems or applications
·
Standards compliance -- Conform
to regulations or standards for information structure, accessibility,
availability, security, and privacy.
·
Customizability --
Facilitate the internationalization, localization, and subsetting
of information.
·
Usability -- Present information in a format
or medium that is easy to use and understand by its intended users.
·
Identifiability -- Ensure that the design or appearance of a document signals
that it comes from our organization or company ("branding")
What a "Document" Is
·
Every major advance in transportation,
communications, manufacturing, financial technology or "governance"
has required new types of documents
·
But the basic idea of a document
has been surprisingly stable for a couple of millennia
o
A document is a self-contained
package of related information
o Documents organize business interactions around the information
needed to carry out transactions
o Documents are the inputs and outputs of business processes
·
In most Document Engineering
efforts a critical step is creating a document inventory and classifying the
"documents" you locate
·
You need to take a very broad view
about what's a document because much of what's important to analyze isn't a
traditional document
·
Much of what we analyze comes from
people or systems or machines, and the lines between "requirements
analysis," "document analysis," and "user-centered
design" aren't always sharp
·
You can think of what you learn
from people as instances of "interview" or "observation"
document types
Recognizing Documents
·
Documents are packages used for exchanging
information.
·
Packages may be:
o Paper form (printed/written, formal/informal)
o Digital form (computer files, structured/unstructured, databases)
·
Exchanges may be:
o Messages (emails, EDI)
o Online or Web
o Postal, Fax
·
Sets of data in databases, spreadsheets,
accounting systems
·
Completed Printed forms
·
Job aids, "cheat sheets,"
sticky notes and other informal or unofficial documents
·
Lots of undocumented information in
people's heads that you write down after talking to them
Document Types (vs
Instances)
·
Blank Printed forms
·
Web forms
·
Database schemas
·
Documents that describe APIs or
maybe even the code that implements them
·
Style sheets or templates in office
applications
Finding the Right Documents for the
Inventory
·
Not all types of documents are equally
important; is a document intrinsic to a business process or a
derivative/aggregate of it?
·
If there are many instances of a
particular type, we might have to be concerned about representiveness
and selection biases
·
Don't assume that job titles and formal
organizational structure reflect what people actually do
·
Don't assume that the names given
to documents fit the people, tasks, and organizations in which we locate them
·
Regardless of its title, make sure
a document is being used before you conclude it is important
Names for Document Types and
Instances
·
Sometimes there are rules for names
of document types
·
Sometimes there are rules for names
of document instances
·
Sometimes the names of document
types or instances aren't informative
· Names are just one kind of metadata attached to document
instances; there is lots more
Iteration in Document Inventory
· Identifying
all the potentially relevant documents or information sources is inherently an
iterative task
o
Documents may refer or link to other documents
o
Documents may refer to people, who can refer to other documents or
people
· Developing
a causal model of the domain can help identify the intrinsic documents
o
Where are the "headwaters" for the information -- what
events or processes cause it to be created?
o
A causal analysis can suggest other correlated information
"streams" that merge with the primary source you've identified
Using Process Patterns to Find
Documents: The Document Checklist
Analyzing the Document Inventory
·
You need to arrange the results of
your inventory so you can think about it as a whole and in parts
·
What aspects of documents vary
systematically across the categories in the inventory?
·
What other aspects of documents
vary, but not systematically across the categories?
·
We need some concepts and
vocabulary for answering these questions
Categories of Document Types
· There are a
few hundred common types of documents used in business transactions
· But
transactions are just one category of document types
· Other categories
with many distinct types include:
o
Software and system documentation
o
Procedures, policies, laws, and regulations
o
Reference books, encyclopedias, dictionaries
o
Catalogs
· Organizations
often use or produce multiple document types within the same category
Document "Collections" or
"Chains" or "Clusters" or "Complements"
· Some sets
of document types in an inventory are related to each other
· Some
document types are themselves sets of documents of another type
· Other
document types fit together in a kind of sequential or process relationship
where information flows from one to another in the normal way in which they are
used or created
· Transactional
documents often come in pairs that must be correlated
· Documents can
have complementary (they are useful together) or uncomplementary
(they are not useful together) relationships, and the relationships aren't
necessarily symmetric
The Document Type Spectrum
Systematic Variation in Document
Types Across the Spectrum
·
Instances more heterogeneous on
narrative end
·
Types are "broader" and
more descriptive, less prescriptive on narrative end
·
The set of content types within a
document type is much greater on the transactional end because the leaves
aren't "just text"
·
More need for "metadata"
augmentation of documents on narrative end, because on transactional end what
would be metadata is more likely to be explicitly contained in the content
already
·
Presentational information more
likely to be correlated with content and structure on narrative end
Organizing the Inventory
For every document or information
source you should collect:
· Name
· Source
(where/who found)
· Definition
· ?
· ?
· Any
metadata that helps you decide whether to analyze it
Sampling the Inventory
·
Sample from all parts of the
document type spectrum
·
Sample more from heterogeneous
categories
·
Sample documents based on priority
of requirements
·
Sample based on importance or
authoritativeness
Generic Inventory Questions
For documents received by an organization, we can ask the following
questions:
·
What is
the official name of the document? Does it also have other informal or
unofficial names?
·
From whom
(or in what process) do you receive the document?
·
Why do
you receive it?
·
What are
you expected to do with it?
·
How often
do you receive it?
·
What
events trigger the sender’s actions?
·
Does the
document contain all the information you need for the process for which the
document is received?
·
Does the
document contain information that is unnecessary for the process?
·
What do
you do with the document after your process has been carried out?
·
To whom
(or to which organization) do you send it?
For documents sent by the organization, we ask:
o
What is
the official name of the document? Does it also have other informal or
unofficial names?
o
To whom
(or what process) do you send the document?
o
Why do
you send it?
o
What do
you expect the recipient to do with it?
o
How often
do you send it?
o
What
events trigger your actions?
o
Does the document
need to conform to any standards for content, structure, or presentation?
o
What does
the recipient do with the document after their process has been carried out?
Organizational Issues in Document
Analysis
·
Org charts can suggest business
processes (and their associated documents), people who can tell us about them,
and the context boundaries we can enforce
·
The level at which you interact
with an organization - the kinds of people you interact with - strongly shapes
what you learn about it
· The concreteness of document analysis makes it more "bottom
up" than business process analysis
Strategic Document Analysis
·
Document Analysis IN a
Strategic Effort:
o Microsoft merges with Yahoo! and assesses how each side does
business to decide what practices/ orgs/people should be retained
o One of the last phases of efforts like these is Document Analysis
to ensure that the "keeper" processes of the merging firms are
effectively combined
·
Document Analysis AS a
Strategic Effort:
o Analyze the information creation, management, processing, and
distribution activities of an enterprise or organization to support the
development of a data and process dictionary, an information architecture, or
an enterprise data model
o Often the foundation activity for introducing a "content management"
or "knowledge management" system
Information as a Strategic Asset
· Identify
"overlaps, gaps, and opportunities" in alignment of information
assets with goals of the enterprise
· Eliminate redundancy,
identify what information must be collected that isn't, and that which might be
· Increase
reuse
· Increase
consistency
· Enable
flexible creation of customized/personalized information products
· There will
be lots of documents and data sets to analyze, but this kind of effort will be
much less focused on these existing information artifacts than a tactical
document analysis project is
Tactical Document Analysis
·
Analyze the existing information
used by some constrained set of processes in an enterprise so that the
processes can be improved, automated, re-engineered, re-purposed
·
Two most common tactical efforts:
o
Document automation
o
Online publishing
Document Automation
· Transforming
printed transactional documents or forms into electronic versions
· The
business driver is often a "request" by a dominant business to its
partner to automate the exchange of transactional information in conformance
with its proprietary document specifications
· This means
that the real goal can be to take an existing process (often, someone else's)
and encode it in electronic documents
Document Components
Document Analysis: From Physical to
Conceptual Models
· When we analyze
information sources: interviews, documents, sets of data whatever - our goal is
to identify and describe the "significant things" or the
"information components" and their characteristics or attributes
· But when
you analyze documents the information components aren't as immediately apparent
because they are contained in structures and rendered in some presentation
· So we have
to remove the presentational information and dis-assemble the structural
information to find the content information that is our highest priority
· As we take
away presentation and structure, we are abstracting away or generalizing from a
physical implementation and creating our first conceptual or logical model of
the information components
Three Types of
Information In Documents
We need a vocabulary to
classify different kinds of information that we find in documents and sets of
data
· Content
–
"what does it mean" information
· Structure
–
"where is it" or "how it is organized or assembled"
information
· Presentation
–
"how does it look" or "how is it displayed" information
Components
·
Components – the units of content
o
Any piece of information that has a unique label or
identifier is a candidate component
o
Any piece of information that is self-contained and
comprehensible on its own is a candidate component
o
A component is a logical unit, with no presentation
implied; it may be organized structurally
· These
definitions are very helpful for finding components in some types of documents
but less so in others
· It
depends on the presence of, and relationships with, the structural and
presentational information*
Content Components
We can identify components as the
separate units of content to be organized -- "pure content" with no
structure or presentation assigned or implied
Document Engineering: Optimizing
"Content + Structure + Presentation"
The "Document Engineering
Methodology" can be thought of as:
1. Distinguishing
the three kinds of information in instances or artifacts
2. Carefully
describing their current and desired relationships
3. Creating
conceptual models that describe the content information as it is and as it
could be
4. Using
principles of "good design" and patterns to refine the conceptual
model
5. Reassembling
or recombining the three kinds of information to achieve the desired
relationships in the "instances" or "artifacts," beginning
with the conceptual model and then adding structure (creating document schemas)
and then adding presentation (with transforms or stylesheets)
Document Engineering and
Information Architecture
This formulation of the Document
Engineering approach is essentially equivalent to how Information Architecture
is defined:
Information
Architecture = (((content + information structure) + navigation structure) +
presentation structure) + presentation design
The Most Important Principle for
Information Architecture
· We say
"the document is about … the photograph is about… the movie is about"
· We're
expressing a distinction between information as conceptual or as content: and
the physical container or medium, format, or technology in which the
information is conveyed
· It is very
useful to think abstractly about "information content" without making
any assumptions or statements about the "presentation" or
"rendition" or "implementation"
· Separating
content from its structure and presentation is the most important principle of
Information Architecture
Presentation Information
· Human-oriented
attributes for visual (or other sensory) differentiation (type font, type size,
color, background, indentation, pitch, ...)
· In general,
presentation information is the least important stuff you find in documents but
o
Good information architecture and user interface design correlates
this with structural or content information
o
You might have a requirement to preserve it or make it more
consistent
Presentation Fidelity and Integrity
·
Presentation Fidelity is a requirement to preserve the original presentation, often
exactly
o For example, with International Letters of Credit and Bills of
Lading you can readily imagine a bank or customs inspector carefully comparing
computer-generated and original printed documents.
o More common is the requirement to replace ad hoc, inconsistent or
incomplete presentation components with rule-governed presentation
·
Presentation Integrity is a requirement to assemble the document model in "document
order" – that is, to organize the elements so that their valid order
matches the order in which they would want them to appear in a document
instance
Extracting Presentation
Rules
· Presentation
affects structure and content by applying transformation rules to them
· To
understand the structure and content we must identify and record what the rules
of the transformation were
· Explicit
transform rules can be encoded in templates, stylesheets
or source code?
But Sometimes Rules Can't be Extracted
·
No access to source formats or
source code
·
Rules may be inaccessible in source
formats ("override" formatting in word processors instead of style
tags)
·
Rules don't exist or are
inconsistently followed (author has "fontitis"
with "ransom note" presentation style)
Correlations or Conventions with
Presentation Information
·
Color, pitch, other perceptual
dimensions can be correlated with semantic distinctions
·
Type size is usually correlated
with the structural hierarchy
·
Content types can have
characteristic layouts or text attributes
·
Adjacency can suggest a semantic
relationship, like that between figure and caption
·
Presentation order is sometimes
semantically significant
Binding Structure to Presentation –
Alternatives
Gestalt Principles -- Reinforcing
Structure with Presentation
Structural Information
· Physical
piece of a document or user interface (e.g. table, section, header, footer,
panel, window)
· Embodies
the rules on how content components fit together, often hierarchical
· Often
driven by context of document use
· Most
applications and web sites are organized with a small set of structures:
o
Lists/hierarchies
o
Networks/links
Structural Integrity
A requirement to preserve some
aspects of structure, but not necessarily any presentation:
· Identical
page boundaries for the electronic and printed versions of documents,
especially when document revisions are highly localized (as in "looseleaf" publications with their placeholder pages
that say "this page intentionally left blank"
· Chronological
order for a narrative biography or history
· "Putting
it together" instructions (don't want to say "assembly" here)
for a bicycle or piece of furniture need to follow the order in which they are
most easily or safely put together.
Analyzing Structural Components
·
The structural components can
provide the hierarchical "skeleton" or "scaffold" into
which the content components are arranged
·
Presentational Structures provide a
framework for presentation -- table, section, title, header, footer
·
Semantic Structures are logical
groups of conceptually-related components - parts of an Address, Phone number
·
Structural components are often
identified by the names attached to pieces of information – think of the
outline or table of contents or lists of various kinds
·
Metadata to capture
o
Depth of hierarchy
o
Sub-structures included within a structural container
o
Rules for applying numbers or names to content in the hierarchy
Content Components
· Content
components are the "nouns" in our documents or sets of data – things
like "topic," "summary," "name,"
"address," "price"
· In
publications a lot of the content isn't easily identified by "component
type" – it may be "just text" that could be playing any of a
very large number of roles in the document
· And
sometimes you get no help from the set of style or formatting tags in word
processors or in HTML, which are very format or structure oriented and not
content oriented at all
· We need XML
so we can invent the vocabulary of tags needed to describe component content in
a specific document type
Identifying Content Components
· Easier in
Transactional-type documents:
o
Documents designed to convey explicit content
o
Strong data typing with metadata for field length, range and
value, other restrictions.
· Few and
somewhat arbitrary presentational characteristics
· Information
about content components in:
o
Physical implementation models (schemas)
o
Source code of any relevant applications that process documents
Relationships Among
Content Components
Content components can be related
to one another
· Derivational
relationships
· Referential
relationships
Links
· Links are
relationships between components that can express content as well as structural
information
· A link is
represented in a logical model by its:
o
Anchors -- the point, region, or span within the components to
which it refers
o
Type -- the semantics that the link relationship represents; not
always explicit
o
Directionality -- is the link one or two-way? Is the relationship
meaningful in both directions? Does the reverse direction link mean the
inverse?
o
Cardinality -- 1 to 1 to many?
"Mixed Content"
· Narrative
documents can hide or obscure candidate components in paragraphs or other
blocks of text
· Document
analysts refer to these as "Mixed Content" components because they
are mixed into surrounding text that may be more generic or untyped
· A common
form of mixed content is an otherwise unstructured text paragraph that contains
emphasized words, glossary terms, references to tables or figures, citations to
supporting documents, or links to footnotes or endnotes
Analyzing Content Components
What attributes about each type of
content should we record in our analysis?
· Names/synonyms/homonyms
(what it is called)
· Definition
(what it "means")
· Cardinality/Optionality
(occurrence rules)
· Restricted
values, code sets, defaults
· Data Type
(text, numbers, date, video)
· Relationships/Associations
· Origin (Is
this new information, or from some other source? Who maintains it?)
Systematic
Variation in Document Types Across the Spectrum
· Instances
more heterogeneous on narrative end
· Types are
"broader" and more descriptive, less prescriptive on narrative end
· The set of content
types within a document type is much greater on the transactional end because
the leaves aren't "just text"
· More need
for "metadata" augmentation of documents on narrative end, because on
transactional end what would be metadata is more likely to be explicitly
contained in the content already
· Presentational
information more likely to be correlated with content and structure on
narrative end
Relationships
Between Text and Non-text
Another useful dimension for
thinking about content considers the relationship in documents between the text
and non-text information that they contain
· Text-dominated – most of
the content is conveyed by text components, with non-text components
unnecessary or in an incidental role (examples: legal documents, accounting information,
invoice)
· Text-framework – the
document reflects the organization defined by the text components, but non-text
components provide content enhancements (examples: encyclopedia, maintenance
manual, product catalog, purchase order)
· (multimedia) Non-text dominated or
text-enhanced – most of the content conveyed by non-text components; which
provide the framework for the text; text components carry metadata, annotate or
explain intrinsically non-textual content (examples: photos, video, engineering
drawing, atlas, art book)
The
relationship between text and non-text information can vary at all points on
the document type spectrum
·
Narrative document type can
be philosophy (all text) or anatomy (lots of non-text)
·
Transactional document type can be
invoice (all text) or RFQ (lots of non-text)
Dictionaries, Encyclopedias, and
Reference Books
· Usually
very carefully designed, with regular structure that is exploited in
information access and navigation features to enhance usability
· Often have
rich repertoire of content component types (pictures, maps, charts, formulas,
tables)
· Mixed
content in paragraphs or other text blocks will contain numerous content
types
Engineering Compendium – Typical
Entry
Encyclopedia Entry
Oxford English Dictionary – Typical
Entry
Procedures, Policies, Laws, and
Regulations
· Usually
mostly text, created and used by people Information that is often extremely
important to companies and highly-paid professionals because the cost of
finding (or not finding) information can be high
· Often has
high "intrinsic hypertext" character with many explicit and implicit
links between content components
· Often
follow structural conventions and standards with regular numbering and naming
schemes
· Versioning
and configuration requirements can pose problems
· Making this
type of content computable or executable is a huge R&D area (XML standards
like XACML, policy engines and wizards, expert systems)
Catalogs
· Many
different types
· Some are
extracted from ERP system or product database
· Often
contain a mixture of structured and unstructured content
· Often a
challenge to match the user's vocabulary and ontology for a product domain
Industrial Parts
Software "Man Page"
Home Blueprint
Recipe
Transaction
Documents
· Printed or
electronic forms
· Data-intensive,
designed to capture and present small information components
· Inputs and
outputs of business processes and often created and consumed by computers
· Few and
somewhat arbitrary presentational characteristics
· Strongly datatyped with field length, range and value, other
restrictions
Tax Form
Harvesting
and Consolidation
Harvesting – Create a
set of candidate content components by extracting them from the information sources
while removing presentation and structure
· As we
identify candidate content components, we need to record its properties (or
attributes or behaviors) that let us understand it and distinguish it from
other ones
· A practical
way to do this for each document or information source being analyzed, create a
table or spreadsheet containing the candidate component and the useful metadata
Consolidation– Identify
synonyms and homonyms among the candidate content components, assigning a
unique name to each unique meaning as part of a controlled vocabulary
· How
rigorously we must assign "good names" and "good
definitions" depends on the size of the document inventory and the scope
of the project
· Names might
follow precise rules to ensure that they can be reliably stored and located in
a data dictionary a la ISO 11179
Seek Semantic Clarity
and Precision
"What's in a
Name?"
(http://www.vertaasis.com/articles/whats_in_a_name.htm
recommends three "levels" of models (or names)
· Business names
–
a format that lets the semantics be easily readable and verifiable by a
business person (not a modeling or XML expert). This should use familiar words
and be completely technology-independent
· Logical names
–
a format optimized for the expression of the design or model; essential that
they are expressive enough to reflect the relationships between model
components.
· Physical names
–
the format required by the implementation technology for the model
Defining
What Something Means
·
Definitions
·
Definitions in a controlled
vocabulary
·
Data types
·
Metadata
·
Metamodels
·
Formal assertions
·
Ontologies and thesauri
The Simplest Information Component
Model
· The simplest
or minimal information component model is a glossary – a list of the words used
to describe or name the "things of significance" and what they mean
· This simple
data model is augmented as attributes or characteristics of the significant
things are identified and recorded
· The model
is further developed as relationships or associations or links between the
"significant things" are identified and recorded
What Metadata to Record About Candidate Components
What attributes about each type of
content might we record in our analysis?
· Names/synonyms/homonyms
(what it is called)
· Definition
(what it "means")
· Identifiers
· Cardinality/Optionality
(occurrence rules)
· Restricted
values, code sets, defaults
· Data Type
(text, numbers, date, video)
· Relationships/Associations
(participation in structures)
· Origin (Is
this new information, or from some other source? Who maintains it?)
· Access (who
is allowed to view/change/copy/etc. it)
· Permanence
(is it static or dynamic? how often does it change?)
· Business
processes in which it participates
Analyzing Tables
· A table is
a systematic pattern of relationships among content, structure, and
presentation information, typically represented in a set of embedded
rectangular grids
· A table presents
information by organizing some set of meaningful elements to emphasize the
relationships between the elements and the manner in which combinations of
elements interact
· Most tables
(90% of them?) follow regular matrix or structural patterns in which the
organization of information (and the presentation applied to it) is consistent
with (or reinforces) the relationships between the content that is contained in
the cells or regions defined by the matrix
· The nature
of these relationships is often explicitly represented in the headings for
rows, columns, or other structural elements
· When the
relationships are not explicit, they can often be determined by analyzing the datatypes and content of the cells or the manner in which
the content varies from cell to cell
· The mere
existence or non-existence of values within the cells can have semantic
significance
Document Types as Tables, Tables as
Document Types
· A
transactional document type is often little more than a table (of items
ordered, purchased, shipped, etc.) with some additional information about the
parties to the transaction
· A table
embedded in another document might be best understood as a "mini-document
type" of its own, especially when the "containing" document type
is more narrative than transactional
The Trouble with Tables
·
The obvious and optimal analysis of
the information in a table is in terms of these content relationships:
· Unfortunately,
the predictable geometry for organizing their content has led to tables being
analyzed and implemented in terms of the structure of their presentation rather
than a set of content relationships
· And in addition
to the problem that most tables aren't represented in ways that capture their
"tablehood" essence, it has been estimated
that 95% of the information marked up as <TABLE> on the web is not really
a table
· And some small
percentage of things that are tables according to our definition defy content
encoding because they combine content, structure, and presentation in ways that
are often impossible to untangle or that are highly idiosyncratic but
conventional
Transforming Presentation to
Content
·
Deconstructing tables into their
content types is an instance of the more general goal of transforming
presentation to content
·
Other presentation components and
conventions that carry semantic information should be made explicit as content
components
o
The mere existence or non-existence of values within the cells of
a table can have semantic significance.
o
Color coding: Red text or box around text -> warning
o
Adjacency: figure and caption -> illustration aggregate
Analyzing "Possible
Values"
· It is
critical to capture any rules governing the possible values for a component
· Sometimes
possible values are conventional, fixed, and span the entire semantic range for
some domain (days of week, AM/PM)
· Determine
who can control the value sets (internal [Manufacturer part #s] vs external [Bar codes])
· Patterns
like regular expressions are often useful but not sufficient for validation
· And if the
set of possible values is just historical and not well motivated, fix it in
your component design
Code Sets
· Codes are
constrained sets of values
· Codes
establish their meaning by reference to those values, often by abbreviations
· Using codes
in vocabularies and metadata promotes consistency and makes meaning unambiguous
· You especially
want to avoid doing a partial enumeration in a domain where a standard set of
enumerated values already exists
· Most
organizations have internal code sets or business rules that implicitly define
them
External and Internal
Codes
· External
codes are those maintained by some entity or organization outside of your
control (ISO, ANSI, etc.)
· The
ISO
code sets for countries (3166), currencies (4217), quantities
and units of measure (31) are the bedrock ones that you should generally defer
to without question
o
ISO 639 - language codes
o
ISO 3166 - country codes
o
ISO 4217 - currency codes
o
IATA port codes (e.g. airport)
· Internal
codes are code sets that you can define and control
How This All Relates to Content
Models in Vocabularies
· EXAMPLE:
"country code" or "currency code" are "Fregan" and can be reduced to context-free
enumerations, but "country" or "money" can't began they're
"Wittgensteinian"
· Put very
simply: The meaning of a tag can rarely be defined in terms of its legal values
· This
doesn't mean that we can't use money as a component in an information model,
but it warns us that we can be more precise if we pretend that money can be
understood as "currency code" and an "amount"
· And
whenever a "code set" exists in the world, make sure you capture it
in your semantic description
Consolidating The
Harvest
· We can
begin our consolidation with the candidate components from any of the
information sources, but we recommend using the one you believe is the most
authoritative or that yielded the most components
· The goal is
to combine components that are synonyms (different names for the same [or
highly similar] meaning) and to distinguish any homonyms (same names for
different meanings)
Guidelines for Minimizing Synonymy
· Components
that are similar but not identical in semantics often pose the most problems
because they encourage multiple inconsistent ways to tag the same content
o This is not
only not a good thing, it is a very bad thing
· Synonymous components
often arise in harvests from information sources from different authors,
organizations, and perspectives on the domain
o
Are the differences between the proposed components substantive
(that you can explain using the metadata in your harvest table) or stylistic
(based on writing or encoding style)?
o
Are the differences "real" but "unimportant"
to users or applications? (spurious precision)
Example:
Structured
Product Labeling in Pharmaceutical Industry (Thomas,
XML 2004)
· Rationale
and description of SPL
· Insights
about document architectures and standards activities Impact of SPL on the
"label life cycle"
o
Internal format, or interchange format?
o
Process conversion or content conversion?
Structured Product Labeling
· "Labels"
are complex document types to model. Why?
· Drugs have
been regulated a long time, and the standards for labels have also evolved.
How?
SPL Resources
· Executive Summary
An Introduction to Structured Product Labeling
· SPL
Schemas and Documentation
· Labeling
Regulations in 21 CFR 201
· National
Drug Code Directory
21 CFR 201
Label Content Mandated in 21 CFR
201.56
· Quality
o
Chemical composition
o
Strength and physical form
o
Rules for storage and handling
· Efficacy
o
Medical conditions for which it is indicated
o
Therapeutic conditions of use
o
Proper dosage for accepted indications
· Safety
o
Potential side effects
o
Contraindications of use
o
Rules for monitoring patients
Drug Label Distribution
Evolving Healthcare
Standards and Document Architectures
· It
has been recognized for a long time that standards in healthcare and pharma were important; first standards were for paper documents
· SPL's
development was begun by the Pharmaceutical Research and
Manufacturing Association but moved into HL7,
an ANSI-accredited standards organization focusing on clinical and
administrative information for healthcare
· The
HL7 Version 2.x series contains 100s of separate messages and is the most
common standard used for patient medical records, with legal status in the US
and several other countries
· Until
recently HL7 committees developed document type standards in a "bottom
up" way, roughly one at a time, with whatever tags a specific document
type needed
SPL in the HL7 RIM
· The
RIM uses very abstract structures to specify the information hierarchy in a
document type: structured body -> component -> section
· All
components derived from six base ones in an O-O type hierarchy, ACTS, ENTITIES,
ROLES, PARTICIPATION, ACTRELATIONSHIP, ROLELINK
· Using
the RIM establishes ontological relationships among all the elements used in
any HL7 v 3.0 document type
· This
goes way beyond even the approach in UBL and makes reuse robust for automated
processes
· But
this means that an SPL instance doesn't have the tag names
you'd expect
Drug Label Life Cycle
SPL and the Labeling Process
·
SPL could be used as an internal
format in the authoring, revision, publishing processes or simply used as the
submission format to the FDA
·
Arguments in favor of internal SPL
include...
· Arguments
in favor of interchange SPL include...
Drug Label Life Cycle - Where's
SPL?
Common Data
Model for Design Document Exchange (HICSS 2005)
NOTE: The
design documents themselves are exchanged as files; all of this work is to
harmonize the "wrapper" metadata to enable synchronized document
management, version control, etc.
·
How does document exchange in
outsourced product development differ from in-house product management?
·
What are the advantages of a common
data model compared with pairwise document transformation?
·
What are the disadvantages of a
common data model?
· The priorities for the common data model were PDM integration,
simplicity of implementation, and reuse of the model. How did this influence
the model?
The Scale of the Challenge
· The
existing document models of just 6 different firms have:
o
from 21 to 111 information components
o
250 different components overall
o
only 25 (after semantic consolidation) that are used by three or
more firms
· One company
had 127 different types of design documents – pretty clear why we can't come up
with a single model for them
The Consolidated Harvest Table
The Common Data Model - Table
Format
The Common Data Model - Table
Format