SE735 - Data and Document Representation & Processing

Assignment #4


Due April 25, 2013


This assignment is related to your project.

Analyzing Information Sources and Harvesting Components



Exactly which information sources you will analyze depends on your project, but you should find at least three types of documents or information sources and analyze them in detail.

For some of your document types, there is just one instance - the particular printed form that is currently in use. So there is only one thing to analyze. But for most of your document types, you should be able to find multiple instances (e.g., job descriptions, job applicant descriptions, graduation requirements, hotel descriptions, map APIs, etc.), and analyzing more than one of them will give you a more complete and robust model. In the ideal project you would follow the "law of diminishing returns" and continue analyzing instances of each type until you are no longer identifying new candidate components, but for this assignment you should try to analyze at least three instances of each document type.
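The "diminishing returns" stopping rule can be sketched in a few lines of Python. This is only an illustration, not a required tool: the function name analyze_until_stable, the min_instances parameter, and the sample job-description components are all hypothetical, and the sketch assumes you have already normalized component names so that the same component has the same name in every instance.

```python
def analyze_until_stable(instances, min_instances=3):
    """Analyze instances in order; stop once an instance adds no new components.

    instances: iterable of sets of (already normalized) component names.
    Returns (number of instances analyzed, set of all components seen).
    """
    seen = set()
    analyzed = 0
    for components in instances:
        new_components = set(components) - seen
        seen |= new_components
        analyzed += 1
        # Stop only after the assignment's minimum AND a "dry" instance.
        if analyzed >= min_instances and not new_components:
            break
    return analyzed, seen


# Hypothetical components harvested from four job descriptions.
job_descriptions = [
    {"title", "salary"},
    {"title", "location"},
    {"title", "salary", "location"},
    {"title"},
]
count, components = analyze_until_stable(job_descriptions)
print(count, sorted(components))
```

In this made-up data the third instance contributes nothing new, so the fourth is never examined. In practice, a single "dry" instance is weak evidence, and you may want to require more than one before stopping.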

For each of the three types of documents or information sources, create a separate "Harvest Table" for the content components in each instance you analyze. You can use a spreadsheet, HTML table, or any other mechanism you choose to organize the information. You should have as many columns of metadata as you need to understand the semantics of your components (use the checklists from the lectures), but at a minimum you should record:

         the component physical name (e.g., a label or field heading) if it has one

         the component logical name (which you won't know at first)

         a definition

         its datatype (if it has one)

         any other restrictions on possible values (e.g., relevant code lists)
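If you prefer to keep your harvest tables in code rather than in a spreadsheet, the minimum columns listed above might be modeled as follows. This is a minimal sketch under stated assumptions: the class name HarvestRow, the function harvest_table_to_csv, and the hotel-description rows are invented for illustration, and you would add more metadata columns as your analysis requires.

```python
import csv
import io
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class HarvestRow:
    """One content component harvested from one document instance."""
    physical_name: Optional[str]   # label or field heading, if it has one
    logical_name: Optional[str]    # usually unknown at first
    definition: str
    datatype: Optional[str] = None
    restrictions: Optional[str] = None  # e.g., a relevant code list


def harvest_table_to_csv(rows):
    """Render rows as CSV text so the table can be opened in a spreadsheet."""
    column_names = [f.name for f in fields(HarvestRow)]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(column_names)
    for row in rows:
        # Missing values become empty cells.
        writer.writerow([getattr(row, name) or "" for name in column_names])
    return buffer.getvalue()


# Hypothetical rows from a single hotel description.
rows = [
    HarvestRow("Check-in", None, "Earliest time a guest may occupy the room", "time"),
    HarvestRow("Stars", None, "Quality rating of the hotel", "integer", "1 to 5"),
]
print(harvest_table_to_csv(rows))
```

Keeping one table per instance, as the assignment asks, would mean one list of rows (or one CSV file) for each document you analyze.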


Consolidating Components



Create a "Consolidated Table of Content Components" like those in Figure 12-13 of the Document Engineering text for each of your three (or more) document types. Revise the definitions for any of the affected components to ensure that they are effective in bounding the new equivalence classes that you're creating by merging synonyms and splitting out homonyms. For each merger or split, please write a one-sentence explanation. For example, for synonyms, point out the distinction that was explicit in the harvest table and that you are now ignoring; for homonyms, point out the distinction that was implicit and that you are now emphasizing. If you can, make the very coarse distinction between components that are essential or mandatory in the model and "everything else."
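Once you have assigned a logical name to each harvested component, finding the mergers and splits that need an explanation is mechanical. The sketch below is one possible way to do it, not a prescribed method: the function find_synonyms_and_homonyms and the hotel data are hypothetical. It treats two physical names that share a logical name as synonyms, and one physical name that maps to several logical names as a homonym.

```python
from collections import defaultdict


def find_synonyms_and_homonyms(assignments):
    """assignments: iterable of (source, physical_name, logical_name) tuples.

    Returns (synonyms, homonyms):
      synonyms: logical name -> sorted physical names merged into it
      homonyms: physical name -> sorted logical names it was split into
    """
    physical_by_logical = defaultdict(set)
    logical_by_physical = defaultdict(set)
    for _source, physical, logical in assignments:
        physical_by_logical[logical].add(physical)
        logical_by_physical[physical].add(logical)
    synonyms = {name: sorted(group)
                for name, group in physical_by_logical.items() if len(group) > 1}
    homonyms = {name: sorted(group)
                for name, group in logical_by_physical.items() if len(group) > 1}
    return synonyms, homonyms


# Hypothetical decisions for three hotel descriptions.
assignments = [
    ("hotel_a", "Rate", "NightlyRoomPrice"),
    ("hotel_b", "Price per night", "NightlyRoomPrice"),
    ("hotel_c", "Rate", "GuestRating"),
]
synonyms, homonyms = find_synonyms_and_homonyms(assignments)
print(synonyms)
print(homonyms)
```

Every entry in either result corresponds to one merger or split, so each one should get its own one-sentence explanation in your consolidated table. The code only reports the decisions you made; deciding what counts as a synonym or homonym remains an analysis task.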


Standards and Code Sets



Identify and analyze any existing standards, domain-specific languages or code sets or other specifications that can suggest business rules or controls on possible values for information components. Explain what you found and your decision about whether or not to use this information in your consolidated model.
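As an illustration of how a code set can act as a control on possible values, the following sketch checks harvested values against a code list. The three currency codes are a small excerpt of ISO 4217 chosen only as an example; the function name values_outside_code_list and the sample values are hypothetical, and the choice to ignore case and surrounding whitespace is an assumption you would need to decide on for your own model.

```python
# Small excerpt of ISO 4217 currency codes, for illustration only.
CURRENCY_CODES = frozenset({"USD", "EUR", "JPY"})


def values_outside_code_list(values, code_list):
    """Return the values that do not appear in the code list.

    Comparison ignores case and surrounding whitespace (an assumption).
    """
    return [v for v in values if v.strip().upper() not in code_list]


# Hypothetical values harvested from a "Currency" field.
harvested = ["usd", "EUR ", "dollars"]
print(values_outside_code_list(harvested, CURRENCY_CODES))
```

A high proportion of nonconforming values in your sources is useful evidence when you explain your decision about whether to adopt the code set.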


Identifying Additional Processes or Services


ACTIVITY 4: It is likely that as a result of this document analysis activity, you will have identified some processes or services that can make use of the information components in your domain in ways you hadn't yet considered. You should give each process or service a short name and a one-sentence description.


ACTIVITY 5: Identifying new processes usually goes hand-in-hand with revising existing ones. So you'll need to reconcile the list of processes or transactions you've developed to this point. For each of the processes that you now have left, identify an existing or potential document type that would "package" the information produced by the process. This should leave you with a coherent view of how documents and processes fit together and get you ready to assemble your documents, which you'll do in the last assignment.