SE735 - Data and Document Representation & Processing
Assignment #4
Due: April 25, 2013
This assignment is related to your project.
ACTIVITY 1:
Exactly which information sources you will analyze depends on your project,
but you should find at least three types of documents or information sources
and analyze them in detail.
For some of your document
types, there is just one instance - the particular printed form that is
currently in use. So there is only one thing to analyze. But for most of your
document types, you should be able to find multiple instances (e.g., job
descriptions, job applicant descriptions, graduation requirements, hotel
descriptions, map APIs, etc.) and analyzing more than one of them will give you
a more complete and robust model. In the ideal project you would follow the
"law of diminishing returns" and continue analyzing instances of each
type until you are no longer identifying new candidate components, but for this
assignment you should try to analyze at least three instances of each document
type.
For each of the three types
of documents or information sources, create a separate "Harvest
Table" for the content components in each instance you analyze. You can
use a spreadsheet, HTML table, or any other mechanism you choose to organize
the information. You should have as many columns of metadata as you need to
understand the semantics of your components (use the checklists from the
lectures), but at a minimum you should record:
· the component physical name (e.g., a label or field heading) if it has one
· the component logical name (which you won't know at first)
· a definition
· its datatype (if it has one)
· any other restrictions on possible values (e.g., relevant code lists)
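One possible way to keep a Harvest Table with the minimum columns listed above is sketched below in Python. This is only an illustration, not a required format: the class name HarvestRow, the extra "instance" column, and the hotel sample rows are all assumptions made for the example, and a spreadsheet or HTML table works just as well.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields
from typing import Optional


@dataclass
class HarvestRow:
    """One content component harvested from one document instance."""
    instance: str                 # assumed extra column: which instance it came from
    physical_name: Optional[str]  # label or field heading, if the component has one
    logical_name: Optional[str]   # unknown at first, filled in during consolidation
    definition: str
    datatype: Optional[str]       # e.g. "date", "string"; None if not applicable
    restrictions: Optional[str]   # e.g. a relevant code list


def to_csv(rows):
    """Serialize harvest rows to CSV text so they open in any spreadsheet."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(HarvestRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))  # None values are written as empty cells
    return buf.getvalue()


# Hypothetical sample rows from two hotel description instances.
rows = [
    HarvestRow("hotel_site_A", "Check-in", None, "Date the guest arrives", "date", None),
    HarvestRow("hotel_site_B", "Arrival", None, "First night of the stay", "date", None),
]
harvest_csv = to_csv(rows)
```

Leaving logical_name empty at harvest time reflects the note above that you won't know it at first.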
ACTIVITY 2:
Create a "Consolidated
Table of Content Components" like those in Figure 12-13 of the Document
Engineering text for each of your three (or more) document types. Revise the
definitions for any of the affected components to ensure that they are
effective in bounding the new equivalence classes that you're creating by
merging synonyms and splitting out homonyms. For each merger or split, please
write a one-sentence explanation: for synonyms, point out the distinction that
was explicit in the harvest table and that you are ignoring; for homonyms, the
distinction that was implicit and that you are emphasizing. If you can, make the very
coarse distinction between components that are essential or mandatory in the
model and "everything else."
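The merging and splitting described in this activity can be sketched as a small Python script. It is a rough illustration under stated assumptions, not the method from the Document Engineering text: the sample rows, the logical names, and the helper functions consolidate, synonyms, and homonyms are all hypothetical. The analyst's decisions are recorded as a mapping from each harvested (instance, physical name) pair to a logical name; the script then reports which logical names merged several physical names (synonyms) and which physical names were split across several logical names (homonyms).

```python
from collections import defaultdict

# Hypothetical harvested rows: (instance, physical_name, definition).
harvested = [
    ("site_A", "Check-in", "Date the guest arrives"),
    ("site_B", "Arrival", "Date the guest arrives"),
    ("site_A", "Rate", "Price per night"),
    ("site_C", "Rate", "Guest review score"),
]

# Analyst decisions (assumed): each harvested component gets a logical name.
logical_name_for = {
    ("site_A", "Check-in"): "ArrivalDate",
    ("site_B", "Arrival"): "ArrivalDate",   # synonym merged
    ("site_A", "Rate"): "NightlyPrice",     # homonym split
    ("site_C", "Rate"): "ReviewScore",      # homonym split
}


def consolidate(rows, mapping):
    """Group harvested components under their assigned logical names."""
    table = defaultdict(list)
    for instance, physical, _definition in rows:
        table[mapping[(instance, physical)]].append((instance, physical))
    return dict(table)


def synonyms(table):
    """Logical names that merge more than one distinct physical name."""
    result = {}
    for logical, sources in table.items():
        physical_names = sorted({physical for _, physical in sources})
        if len(physical_names) > 1:
            result[logical] = physical_names
    return result


def homonyms(table):
    """Physical names that were split across more than one logical name."""
    by_physical = defaultdict(set)
    for logical, sources in table.items():
        for _, physical in sources:
            by_physical[physical].add(logical)
    return {p: sorted(names) for p, names in by_physical.items() if len(names) > 1}


consolidated = consolidate(harvested, logical_name_for)
```

Each entry reported by synonyms or homonyms is a place where a one-sentence explanation is expected.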
ACTIVITY 3:
Identify and analyze any
existing standards, domain-specific languages or code sets or other
specifications that can suggest business rules or controls on possible values
for information components. Explain what you found and your decision about
whether or not to use this information in your consolidated model.
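As a hedged illustration of how a code set can act as a control on possible values, the sketch below checks harvested values against a code list. The three currency codes are a deliberately tiny subset of ISO 4217 chosen for the example; the component, the sample values, and the function name violations are assumptions, and your project may call for an entirely different standard.

```python
# Illustrative subset of ISO 4217 currency codes (not the full list).
CURRENCY_CODES = frozenset({"USD", "EUR", "JPY"})


def violations(values, code_list):
    """Return the values that are not members of the code list, in input order."""
    return [value for value in values if value not in code_list]


# Hypothetical values harvested for a "price currency" component.
harvested_values = ["USD", "usd", "EUR", "Dollars"]
bad_values = violations(harvested_values, CURRENCY_CODES)
```

Values that fail the check, such as lowercase or free-text variants, are useful evidence when deciding whether to adopt the code list in the consolidated model or to record a looser restriction.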
ACTIVITY 4: It is likely
that as a result of this document analysis activity, you will have identified
some processes or services that can make use of the information components in
your domain in ways you hadn't yet considered. You should give each process
or service a short name and a one-sentence description.
ACTIVITY 5: Identifying new
processes usually goes hand-in-hand with revising existing ones. So you'll need
to reconcile the list of processes or transactions you've developed to this
point. For each of the processes that you now have left, identify an existing
or potential document type that would "package" the information
produced by the process. This should leave you with a coherent view of how
documents and processes fit together and get you ready to assemble your
documents, which you'll do in the last assignment.
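A minimal sketch of the reconciliation in Activities 4 and 5 follows, assuming a hotel booking domain. The process names, descriptions, and document types are invented for illustration, as are the helper functions. It records each process with its short name and one-sentence description, maps processes to the document type that packages their output, and lists any process that still lacks one.

```python
# Hypothetical processes: short name -> one-sentence description.
processes = {
    "SearchRooms": "Find rooms that are available for the requested dates.",
    "BookRoom": "Reserve a selected room for a guest.",
    "CancelBooking": "Release a previously reserved room.",
}

# Hypothetical packaging decisions: process -> document type it produces.
packaged_by = {
    "SearchRooms": "AvailabilityResponse",
    "BookRoom": "ReservationConfirmation",
}


def unpackaged(processes, packaged_by):
    """Return the processes that have no packaging document type yet."""
    return sorted(name for name in processes if name not in packaged_by)


def documents_to_processes(packaged_by):
    """Invert the mapping to see which processes feed each document type."""
    result = {}
    for process, doc_type in packaged_by.items():
        result.setdefault(doc_type, []).append(process)
    return result


missing = unpackaged(processes, packaged_by)
by_document = documents_to_processes(packaged_by)
```

An empty result from unpackaged suggests every remaining process has a document type to package its information.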