SE735 - Data and Document Representation & Processing

Lecture 11 - Assembling Document Components and Document Models



After Consolidation, Then What?


§  We have now reached the point where we have captured the business rules and content components of the domain / document inventory in which we're working


§  We have separated the Presentational, Structural and Content Components


§  We have developed a conceptual model of our consolidated and essential "atomic" content components -- semantic equivalence classes


§  We will have some sense of the distribution of the content components and can distinguish those that are "core" -- that appear in all or almost all contexts in the domain -- from those that are more context-dependent


§  Now we have to ensure that we can reuse these components when we assemble document models from them



Why Analysis Models Aren't Good Enough



§  Document artifacts differ a great deal in how they combine content, structure, and presentation components


§  Some combinations are idiosyncratic and ad hoc or represent compromises between incompatible requirements that make structures less than optimal


§  If we are completely constrained by the artifacts as they exist in our component model, we will preserve both their good and bad aspects


§  So our analysis models of components and aggregates may need to be revised to allow alternative ways of satisfying our requirements



Design and Re-design with Conceptual Models



§  The component model may present many attractive options for re-design and reuse of our content components


§  Design means changing our model, not simply improving the way we view it. This is when we actually get to apply our insights about reuse and patterns


§  During design we can devise more consistent component names, remove repeating or reoccurring content and structure, increase reuse of standard patterns or components, replace implicit components with explicit ones, and otherwise create a more abstract, concise, and context-free representation of the essential characteristics



Analogy: The Build-to-Order Computer Factory



§  Designing a factory that makes "build-to-order" computers:


§  You might start with some collection of computers and take them apart to see what pieces are needed to assemble them (ANALYSIS)


§  Because you want to be able to make these items with reasonable quality but at less cost and at greater efficiency, you redesign the computers to use standard components (DESIGN FOR REUSE)


§  You organize the components and the assembly lines to make it easy to locate components when you get an order (ORGANIZE FOR REUSE)



Analogy: Designing a Flexible Domain Model



§  Analyze a set of hand-crafted applications with printed or online data entry forms to identity pieces of information each contains (ANALYSIS)


§  You want the complete "enterprise model" for the domain to be able to represent any application or form with less cost and at greater efficiency, you redesign the pieces of information from analysis to be more standard and context-free (DESIGN FOR REUSE)


§  Organize the components to make it easy to locate the components for the specific contextualized model for an application or form (ORGANIZE FOR REUSE)



Generalizing and Specializing Components




§  "Delivery Date" and "Ship Date" suggest a "Date" component




The Contextualization Continuum



o   Your set of components has to find a balance between precision and generality (or flexibility)


o   You want a set of components that can be reused across related document types in some context (or group of related contexts)


o   Contexts fit into a continuum





Components: A Reminder



o   Components – the units of content


o   Any piece of information that has a unique label or identifier is a candidate component

o   Any piece of information that is self-contained and comprehensible on its own is a candidate component


o   A component is a logical unit, with no presentation implied; it may be organized structurally

o   These definitions are very helpful for finding (aggregate) components in some types of documents but less so in others


o   It depends on the presence of, and relationships with, the structural and presentational information




Motivating Aggregate Components



o   ATOMIC components that hold individual pieces of information


§  Especially in transactional documents,where atomic components have a natural representation as primitive data types or as datatypes that are derived from these by restriction


o   DOCUMENT components that assemble smaller components into the set of information needed to carry out a self-contained purposeful activity


§  Especially in transactional contexts, where documents have a natural correspondence to some unit of work that initiates, records, or responds to a clearly-defined event

o   It is much more difficult to define the components of "in between" sizes, especially as you move away from the transactional end of the document type spectrum


o   AGGREGATE components are composed of atomic ones and are reused in the assembly of document components


o   They are easier to identify in transactional contexts because they are often the key information that flows from one document to another


o   "Address" or "Person" are obvious examples of aggregates composed of smaller ones


o   Two key questions:

§  How do we select and group atomic components into aggregates?

§  How many aggregates should we create?




Identifying Two Kinds of Component Aggregates


o   Structural Aggregates -- sets of components defined by parent-child or containment relationships


o   One way to do this is by putting all the unique components on index cards and then sort them into clumps or clusters.

o   First sort all the components that go together because of containment or structural rules (X contains Y and Z).


o   Conceptual Aggregates -- sets of components that "go together" because of logical dependency


o   After you've identified all the structural aggregates, you can start to further cluster those intermediate clusters on the basis of dependency rules - what things "go together" logically




Identifying Aggregate Components in Non-Transactional Documents [1]


o   Aggregates are more elusive on the narrative end of the DTS because there are limits to the rigor with which components can be grouped


o   "Mixed content" models arise when there are few or weak constraints on where atomic components can appear


o   Presentation often masks the atomic components in potential aggregates


o   Structures are often based on conventions for organization and presentation than on semantic relationships


o   But there will still generally be components that "go together" to form reusable structures







Identifying Aggregate Components in Non-Transactional Documents [2]



o   Aggregates can be created in two "bottom-up" ways that focus on the atomic components:


o   The first is by rebuilding or making explicit the structures that we took apart in document analysis


o   The second is by creating structure in "blobs" of poorly structured information written in an overly narrative style


o   A more modular style for the information will increase its regularity and reusability; it will eliminate content that has little value to users and reinforce its use as "boilerplate" or via links





Home Blueprint: Analysis Model


o   The layout doesn't reinforce the idea of reusable components and aggregates...


o   Every house has an exterior view, and an interior view for each floor


o   Each of these views has a caption/annotation


o   houseplan(theme,views,specs), view, (illustration+), illustration(figure,description+)







House Plan Redesign - Logical Aggregates

















































House Plan Redesign - Revised Presentation


















































Extracting Repetitions


o   Different aggregates might have the same components


o   "Contract" and "Shipment" might both have "Start Date," "End Date" and "Duration"


o   The repeated components can be extracted and created as a reusable aggregate


o   In this example we might call the common pattern the "Period"



Reuse of Existing or External Patterns



o   Many of the patterns that you might identify as repetitions would have also been identified in a previous analysis of your domain or context


o   You should determine if their analysis yielded components for you to reuse


o   Pay particular attention to "standards" if they come from credible sources


o   But don't accept someone else's analysis and models if you don't understand them


o   And NEVER assume that a component model is appropriate solely on the basis of its name because:


o   Names aren't "self-describing"


o   Otherwise, why are we bothering to collect all the bits of metadata about each candidate component?





Identifying Aggregate Components in Non-Transactional Documents


o   We also can identify aggregates via a kind of reverse engineering of the document models required or suggested by the context


o   We call this CORE PLUS CONTEXTS -- factoring a set of related document models into the aggregates that are needed to assemble them






Identifying Aggregates in More Formal Ways



o   The heuristic and informal approach we just followed was called "document analysis" in the SGML/publishing/content management tradition


o   But these techniques are shaped by the skill and biases of the document analyst; they don't yield uniform results and don't scale well


o   If there is to be a discipline of Document Engineering, we need a more formal and deterministic modeling approach that is consistent with classical document analysis





Data Modeling Principles for Designing Content Components

o   The more transactional our design situation is the more we can (and need) to apply additional rules or requirements that apply to "data-intensive" components


o   Data Integrity– the information must be correct


o   Referential Integrity


    repeated data components (in an "information supply chain" or set of documents) must agree in values


o   Identification Integrity – the identify of data should be unambiguous


o   These three principles embody the concept of Essentiality – the model contains only the essential information with no duplication


o   Relational Theory is the basis for identifying essential components



Relational Theory



§  Mother of all data modeling approaches (Codd, 1970s; Date, 1980s)


§  Relational theory gets its name from the fact it defines relations, two dimensional matrix views of data, which we know better as tables


§  The focus of relational theory is …

§  Keys (essential data identification) - uniquely referencing structures- prevents identification integrity problems

§  Functional Dependency (essential data aggregates) - aggregating data into logical groups or sets (or relations, tables, entities, object classes)- prevents referential integrity problems

§  Normalization – formal techniques for identifying and defining functional dependencies; yields a set of progressively more rigorous "normal forms"









First Normal Form


§  The consolidated list of unique candidate components is equivalent to 1NF in relational theory


§  Make all of the components discrete – only take a single value in their set


§  Remove repeating sets of data into their own (new) set


§  Identify components that are keys


§  For example, an Order may contain components for item descriptions, prices and quantities.

·       Because there can be many repetitions of these components, we need to introduce a component like LineItem as an aggregate data set that contains them

·       Now the model for Order is that it contains one of more unique LineItems (discrete with no repetition)




Second Normal Form



§  2NF separates all non-dependent components


§  Split off into separate (new) sets any components that do not wholly depend on the entire key


§  Focus on sets of components with composite keys


§  For example, for a given LineItem, the Description and Price may be the same each time, but the Quantity depends on the particular occurrence of a LineItem.


§  2NF would separate these non-dependent components into their own set, possibly called Product or Item




Third Normal Form


§  Ensure that all non-key components are independent of one another


§  As for 2NF but for components that are not keys.


§  For example, Order may contain a CustomerName and their AccountCode

·       These non-key values may have some dependency on each other and should be separated into another set, e.g. Customer.

§  3NF also involves removing any derived or calculated components




Normalization Example: Model of Lecture Notes






Content View of a Lecture Slide










































Candidate Components from Lecture Slide















































Candidate Components as Relation

























Normalized Model


Representations of Normalized Model – Primary Key Path



Representations of Normalized Models – UML Class Diagram






Document Model Assembly






Where We Are in the Methodology



§  We have identified primitive and aggregate components


§  We've used heuristic or formal means (or both)


§  The methods we used and the results reflect the mixture of transactional and non-transactional documents in our context

·       Number of components

·       Size of components

·       Precision of rules for datatypes, associations, cardinality






Why We Need Hierarchical Models









The Component Model: A Set of Relations






































Multiple Paths













































Book / Author / Edition / Publisher


































Hierarchical Interpretation [1]







































Hierarchical Interpretation [2]
















































Hierarchical Interpretation [3]
















































Document Model Assembly










Document Model Assembly and the Document Type Spectrum






Document Model Assembly – Transactional Document Types


§  Since transactional documents and data-intensive contexts tend to have more rules, their component models are more complex and there are more alternative document assembly models


§  These alternate assembly models may differ in which information from the instance they present (they may be queries or views of the instance rather than a one-to-one rendering) and in the order or structure with which they present it.


§  If the sequence is important it should be a component in the model and assembled in our logical documents




The Rules of Assembly


The rules represented in the component model must be followed during any document model assembly:












Assembly Order and Containership






































Document Model Assembly - Non-transactional Document Types









A Common Document Assembly Pattern


































Event Calendars: Assembly














































Time-based Calendar Assembly















































Time-based Calendar Model















































Location-based Calendar Assembly















































Location-based Calendar Model















































Event-based Calendar Assembly













































Event-based Calendar Model















































How Many Document Models?




§  In many domains because of the rich network of associations you can assemble a large number of different document models from the same component model


§  Determining how many document types to assemble is another design problem in its own right


§  During a document analysis phase you will create an inventory of existing document types but this isn't necessarily the set of logical document types you'll end up with after you design.


§  There may be several types of documents that you want to treat as equivalent by assembling a single more general document model, or you could assemble several separate models





"Core + Contexts" Modeling & Assembly











The Customization / Contextualization Challenge









The Components in a Domain












































Subtractive Refinement












































Interoperability Challenges with Subtractive Refinement [1]


Interoperability Challenges with Subtractive Refinement [2]







































o  Customization by subtraction doesn’t work because the overlapping information isn’t explicitly identified.


Core plus Contextualization Approach

o  Alternative and more attractive approach to customization is to organize sets of document assembly components into a core library that can be reused by each contextualized document assembly.

o  The basic goal is to create a family of related document assembly models that share a common set of structures.


Core and Context Components



Core plus Contextualization in a Schedule of Classes


Core plus Contextualization in a Course Catalog


o The distinction between core and contextualized components is the degree of reuse.


Limitations of "Core + Contexts"

o   One approach is to create the smallest possible core components and leave room for them to be customized or contextualized by additional components

o   But the set of contexts that emerge is strongly shaped by the document types you are expecting to assemble, and models shaped by implementation considerations are less robust

o   Furthermore, the criteria or heuristics used to decide what "goes together" are informal and don't yield consistent results

o   But it isn't a question of "either/or" here between the traversal and the c+c approach. Think of them as influences or philosophies or approaches for document assembly that you need to balance.


Documenting the Model

o   The process of building document assembly models facilitated by using a formal notation to describe the resulting hierarchy of components.

o   Common notations are UML class diagrams, ELM tree diagrams, or tables.

o   If these notational forms are rich enough in their metadata, often encoding them into a language for implementation (such as XML Schema) can be formalized or even automated using an application program.

o   Document assembly models described as UML Class Diagrams commonly use aggregations and composite associations.

o   Aggregations (denoted by an open diamond on the parent end of the association) describe an assembly path that is optional.

o   Compositions (denoted by a filled diamond at the parent end of the association) are a specialized form of aggregation where the assembly path is mandatory.


Calendar by Event Assembly Model as a UML Class Diagram


Sample Event Calendar Assembly Model as an ELM Diagram


o   Most graphical notations do not describe all the metadata needed to describe a complete document assembly model.

o   Using tables or spreadsheet formats for document assembly models overcomes some of the limitations of graphical representations.

o   The flexibility of a customizable metadata allows for more complete semantic descriptions and customized assembly metamodels.

o   Tables are a more familiar syntax for a nontechnical audience than either the UML or ELM notations.