SE735 - Data and Document Representation & Processing

Lecture 1 - Convergence : Data, Documents, Delivery

Brief History of Documents


1.    Clay tablets (Mesopotamia - 3rd millennium BC): Babylonia: Early Accounting Practice

Administrative tablet with cylinder seal impression of a male figure, hunting dogs, and boars, 3100–2900 B.C.

Jemdet Nasr period (Uruk III script), Mesopotamia, H. 2 in. (5.3 cm)

Metropolitan Museum of Art, Purchase, Raymond and Beverly Sackler Gift, 1988 (1988.433.1)

This tablet most likely documents grain distributed by a large temple, although the absence of verbs in early texts makes them difficult to interpret with certainty.


Drawing of a tablet from the Uruk III period (ca. 3300-3000 BC) containing an accounting of deliveries of barley and malt from two individuals for the production of beer.

·        The bottom row bears the name of the official in charge.

·        The tablet is read from right-to-left and top-down.

·        Each row corresponds to an individual, with the first two columns containing entries for malt, followed by a column for barley.

·        Subtotals are given in the third column (barley groats (top) and malt (bottom)).

·        The left-most box displays the grand total.

·        No formal language was used to express the relationship between the signs and symbols in the tablet.

2.     Sumerian Accounting Practice

A record of sources of revenue and monthly disbursements to forty-six temple personnel by its bursar Ḫunabi for the year 1295 BCE

Table attributes:

·        Column headings and row titles.

·        Column headings at the top of the table specify month names.

·        Names and professions are shown in the right-hand column (e.g. seeress, weaver, overseer, temple servant).

·        Eighteen of the individuals listed receive no payment for all or half the year. (Notice the blank “smooth” cells along rows.)

·        These individuals are classified as either dead or fugitive.

·        Grid locations within the table contain numerical information that is part of calculations, flowing first down a column, and then across a row.

·        Subtotals for each individual are given every six months, culminating with a yearly total adjacent to row labels.

·        The table is annotated with explanatory interpolations under columns containing totals, and a summary column at the table's end.
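The arithmetic described in the table attributes above (half-year subtotals per individual, a yearly total beside each row label, blank cells for people who received nothing) can be sketched as a small program. This is only an illustration: the professions are reused from the list above, but the amounts and the layout of the `ledger` dictionary are invented, and blank cells are modelled as `None`.

```python
from typing import Optional

# One row per person: 12 monthly disbursements. None models a blank cell.
# All amounts are invented for illustration; they are not taken from the tablet.
Row = list[Optional[int]]

ledger: dict[str, Row] = {
    "weaver":   [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
    "overseer": [3, 3, 3, 3, 3, 3, None, None, None, None, None, None],
    "seeress":  [None] * 12,
}

def half_year_subtotals(row: Row) -> tuple[int, int]:
    """Sum months 1-6 and months 7-12, treating blank cells as zero."""
    first = sum(v or 0 for v in row[:6])
    second = sum(v or 0 for v in row[6:])
    return first, second

def yearly_total(row: Row) -> int:
    """Yearly total: the two half-year subtotals added together."""
    return sum(half_year_subtotals(row))

def column_totals(table: dict[str, Row]) -> list[int]:
    """Total disbursed in each month, summing down each column."""
    return [sum(row[m] or 0 for row in table.values()) for m in range(12)]

def unpaid_for_half_or_full_year(table: dict[str, Row]) -> list[str]:
    """People with at least one half-year in which nothing was paid."""
    return [name for name, row in table.items() if 0 in half_year_subtotals(row)]
```

With this toy data, `unpaid_for_half_or_full_year(ledger)` picks out the overseer and the seeress, which mirrors the blank cells along rows in the original table.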

3.     Egypt: Papyrus

·        In ancient Egypt, papyrus was used for writing; the earliest evidence comes from the account books of King Neferirkare Kakai of the Fifth Dynasty (about 2400 BC).

·        Papyrus books were in the form of a scroll of several sheets pasted together, for a total length of up to 10 meters or even more.

·        The spread of books, attention to their cataloging and conservation, and literary criticism all developed during the Hellenistic period, when large libraries were created in response to the desire for knowledge exemplified by Aristotle.

·        e.g.

o   The Library of Alexandria, created by Ptolemy Soter and set up by Demetrius Phalereus, contained 500,900 volumes (in the Museion section) and 40,000 at the Serapis temple (Serapeion).

o   All books in the luggage of visitors to Egypt were inspected, and could be held for copying.

·        Papyrus supported book production in Rome in the 1st century BC with Latin literature that had been influenced by the Greek.

4.     The Codex

·        Forerunner of the contemporary book.

·        Invented by the Romans (1st century AD)

·        Originally constructed by binding together waxed wooden writing tablets, later papyrus, and then parchment sheets (animal membrane).

·        The codex is more practical than a scroll, given that it allows random information access, as opposed to a scroll's sequential access; and unlike the scroll, both sides of a sheet may be used for writing.
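The difference between a scroll's sequential access and a codex's random access can be sketched in code. The model below is an assumption made for illustration: a scroll is treated as a forward-only iterator over its sheets, a codex as an indexable list, and the "cost" of a lookup is simply the number of sheets passed. The function names are hypothetical.

```python
from typing import Iterator

# 200 numbered sheets standing in for the content of a book.
pages = [f"page {n}" for n in range(1, 201)]

def scroll(sheets: list[str]) -> Iterator[str]:
    """A scroll only unrolls forward: sheets come out one after another."""
    yield from sheets

def read_from_scroll(sheets: list[str], wanted: int) -> tuple[str, int]:
    """Unroll until sheet `wanted` (1-based); return it and the sheets passed."""
    for steps, sheet in enumerate(scroll(sheets), start=1):
        if steps == wanted:
            return sheet, steps
    raise IndexError(wanted)

def read_from_codex(sheets: list[str], wanted: int) -> tuple[str, int]:
    """A codex opens directly at the wanted leaf: a single step."""
    return sheets[wanted - 1], 1
```

Reading sheet 150 takes 150 steps with `read_from_scroll` and one step with `read_from_codex`, which is the practical advantage the bullet above describes.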


5.     Medieval Manuscripts

Carolingian Renaissance (Emperor Charlemagne (c. 742 – 814))

·        As the lands under his dominion continued to grow during the eighth century, there were insufficient literate individuals to help administer the expanding state.

·        The decline of the Roman Empire had engendered a regionalization of Latin dialects, the forerunners of the modern Romance languages, which seriously impeded communication across Europe.

·        During the latter quarter of the eighth century Charlemagne executed a program of reforms that would transform the state and become known as the Carolingian Renaissance.

·        A major part of his program was to attract many of the leading scholars of his day to his court.

·        With the aid of one of these scholars, the English monk Alcuin of York (c. 735 – 804), who arrived at his court in 782, a program of cultural revitalization and educational transformation was undertaken to restore old schools and found new ones throughout his empire under the guidance of a monastery, cathedral, or noble court.

·        A standard curriculum was developed that established the trivium (grammar, logic, and rhetoric) and quadrivium (arithmetic, geometry, music, and astronomy) as the basis for education, and writing of textbooks was undertaken.

·        A standardized version of Latin was also developed that became the common language of scholarship and supported pan-European administration of the empire. Writing was standardized too.

·        The Carolingian minuscule was introduced to increase the uniformity, clarity, and legibility of handwriting.

o   It was used between 800 and 1200 to write codices, pagan and Christian manuscripts, and educational texts.

·        Rise of scriptoria for writing and copying manuscripts.


6.     China: The Technological Roots of Printing

·         Paper - rag paper and silk paper by 105 AD

·         Paper's migration to Europe - passed through the Middle East by the 9th century AD and reached Europe via the Crusades by the 13th century.

·         Large ideographic alphabet

·         Invented movable type


7.     Gutenberg and the Historical Moment in Western Europe


8.     Print and Modern Thought

o    Less expensive and more portable books lent themselves to solitary and silent reading

9.     Advances in Print Technology

·         Teletypewriter (1913) - could be attached directly to a Linotype machine to control composition by means of a perforated tape

o    Tape was punched on a separate keyboard unit

o    A tape reader translated the punched code into electrical signals that could be sent by wire to tape-punching units in many cities simultaneously

·         Xerography (1938) – uses photoconductivity

·         Computer-based printer technology


Desktop Publishing (DTP)

Definition: Preparation of typeset or near typeset documents on desktop computers (personal computers). All text composition, page makeup, manipulation of digitized graphics and integration of text and graphics are performed on desktop computers.

Three activities of DTP

  1. Pure text preparation
  2. Creation and manipulation of graphic images, where text plays only a minor role
  3. Complex page makeup, in which text and graphic elements are united in a harmonious way within the confines of a single page


Components of a DTP system

Key stages in the process of DTP

  1. Need for publication: Conduct appropriate analysis to determine need for publication
  2. Purpose and audience: Consider the audience, content, style, language, purpose.
  3. Create text: Word processed, scanned or directly typed into the program. Proofread the text to ensure the content is correct.
  4. Create graphics: Graphics created with appropriate software, scanner, tablet or digitiser.
  5. Design format: Determine grid, columns, headers and footers, page numbers, text style, design final layout.
  6. Load files and lay out publication: Text and graphics are combined, formatted, scaled and positioned.
  7. Print: Choice of a suitable high-resolution printer, e.g. a laser printer or imagesetter


1973 - Alto - Xerox PARC


1981 - Model 8010 (Star) - Xerox PARC



1983 - Canon develops the 'engine' used in low cost laser printers

1983 - Lisa - Apple




1984 - Hewlett-Packard produces the HP LaserJet

1984 - Macintosh - Apple

·         The Mac had excellent graphics and a 300 dpi laser printer

1984 - Adobe introduces the PostScript page description language (PDL)


1985 - Aldus develops PageMaker for Mac


1985 - Adobe builds PostScript hardware/software interface to Apple LaserWriter (cost $5000)


1986 - Microsoft releases Windows 1.1


Mark-up Languages

Definition: A notation for identifying the components of a document to enable each component to be appropriately formatted, displayed, or used.

1967 - William Tunnicliffe - paper titled "The Separation of Information Content of Documents from their Format" - separates content from formatting

1969 - Charles Goldfarb - GenCode project at IBM expanded this work to develop the Generalized Markup Language (GML) - by 1980, 90% of IBM documents were formatted in GML

1973 - Joe Ossanna - nroff and troff for the Unix operating system (PDP-11)

·         nroff produced text output suitable for terminals and line printers

·         troff generated a graphical output for a Wang typesetter

·         1979 – troff modified to work with different output devices.

1977 - Donald Knuth - TeX - begun in 1977, evolved through the early '80s - gives detailed control over text layout and font descriptions in order to typeset mathematical books at professional quality.

1980 – Brian Reid – Scribe : a document specification language and its compiler

·         Prepare a manuscript file using a text editor.

·         Process this manuscript file through Scribe to generate a document file, which is then printed on some convenient printing device.

·         Scribe controls the words, lines, pages, spacing, headings, footings, footnotes, numbering, tables of contents, indexes and more.

·         Scribe has a database of document format definitions which tell it the rules for formatting a document in a particular style.

·         Under normal circumstances, writers need not concern themselves with the details of formatting, because Scribe does it for them.

·         The manuscript document an author creates has markup statements throughout.

o    These statements describe the various components of the document to the Scribe processor.

o    The descriptive markup the author places in the document is interpreted and formatted by the Scribe document processor.
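The division of labour described above (the author marks up components, a database of format definitions decides how each component looks) can be sketched as follows. This is a toy under stated assumptions, not Scribe itself: the one-statement-per-line `@Name(text)` form is a simplified Scribe-style notation, and the style names `report` and `letter`, the rules in `FORMATS`, and the function `format_manuscript` are all invented for the example.

```python
import re

# Toy "database of document format definitions": one entry per document style.
# Style names and formatting rules are invented for this sketch.
FORMATS = {
    "report": {
        "Heading": lambda t: t.upper(),
        "Quotation": lambda t: "    " + t,
    },
    "letter": {
        "Heading": lambda t: t + "\n" + "-" * len(t),
        "Quotation": lambda t: '"' + t + '"',
    },
}

# A markup statement occupies a whole line: @Name(text)
MARKUP = re.compile(r"^@(\w+)\((.*)\)$")

def format_manuscript(manuscript: str, style: str) -> str:
    """Apply the chosen style's rules to every marked-up component."""
    rules = FORMATS[style]
    out = []
    for line in manuscript.splitlines():
        m = MARKUP.match(line)
        if m and m.group(1) in rules:
            out.append(rules[m.group(1)](m.group(2)))
        else:
            out.append(line)  # plain text passes through unchanged
    return "\n".join(out)

manuscript = (
    "@Heading(Clay Tablets)\n"
    "Tablets recorded deliveries.\n"
    "@Quotation(Read right to left.)"
)
```

The same `manuscript` yields different output for `format_manuscript(manuscript, "report")` and `format_manuscript(manuscript, "letter")` without the author touching the text, which is the point of descriptive markup.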

1986 - Standard Generalized Markup Language (SGML) extended GML and was accepted as an ISO standard

·         1st working document by Charles Goldfarb in 1980

·         Influenced by Scribe

·         Focused on the structural aspects of a document and left the visual presentation of that structure to the interpreter

·         Specified a syntax for including markup in documents and a "metalanguage" for separately describing what the markup means.

·         Allowed authors to create and use any markup they wished, selecting tags that made the most sense to them

·         Issues:

o    Generally found to be cumbersome, a side effect of attempting to do too much and be too flexible

o    Unknown to the masses

o    Too few tools to create files

o    Tools are expensive

o    Companion norms for style or hypertext are not ready

o    Not well supported by the major editors of the software market

1991 - Tim Berners-Lee and Robert Cailliau - HyperText Markup Language (HTML) - some SGML syntax, without the meta-language

·         HTML consists of a set of "known" tags that handle common formatting tasks

·         Originally created to mark up simple scientific papers, and therefore needed to be expanded in order to offer the rich content the web has today

·         As a result additions often follow no logical design, although recent efforts have attempted to address this.

·         Advantages

o    Simple to learn and to use

o    Easy to create from scratch or by converting legacy text files

o    Easy to parse

·         Disadvantages

o    Syntaxless

o    Much more a presentation language than a structural language

o    Too limited
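Two of the points above, the fixed set of "known" tags and the complaint that HTML is "syntaxless", can be illustrated with the tokenizer in Python's standard library. This is a sketch, not a model of how a browser behaves: `KNOWN_TAGS` is a small hand-picked subset chosen for the example, and `TagCollector` and `collect_tags` are hypothetical names.

```python
from html.parser import HTMLParser

# Small illustrative subset of HTML's fixed vocabulary (an assumption of this sketch).
KNOWN_TAGS = {"html", "head", "title", "body", "h1", "p", "a", "ul", "li", "b", "i"}

class TagCollector(HTMLParser):
    """Record every start tag, split into known and unknown names."""

    def __init__(self):
        super().__init__()
        self.known = []
        self.unknown = []

    def handle_starttag(self, tag, attrs):
        (self.known if tag in KNOWN_TAGS else self.unknown).append(tag)

def collect_tags(markup: str) -> tuple[list[str], list[str]]:
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.known, collector.unknown

# No closing tags for h1, p or b, plus a made-up <invoice> element:
# the tokenizer reports the tags without raising any error.
sloppy = "<h1>Notes<p>First paragraph<p>Second <b>bold<invoice>42</invoice>"
```

Nothing here rejects the unclosed elements or the invented `invoice` tag, which is the laxness the "Disadvantages" list refers to.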

1998 - XML - Extensible Markup Language

·         XML is a strict subset of SGML

·         Like SGML, XML is a grammar (or a metalanguage) and NOT a language

·         XML extends SGML features

·         Out of date SGML features are eliminated

·         Well-formed document syntax
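Well-formedness can be checked mechanically. The sketch below uses Python's standard `xml.etree.ElementTree` parser, which raises `ParseError` on input that is not well formed; the helper name `is_well_formed` and the sample element names (`invoice`, `item`) are made up for the example. Note that the tags are author-chosen, as XML, being a metalanguage, prescribes no vocabulary.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Every element closed and properly nested, attribute value quoted.
good = "<invoice><item qty='2'>paper</item></invoice>"

# <item> is never closed.
bad_unclosed = "<invoice><item>paper</invoice>"

# Elements overlap instead of nesting.
bad_overlap = "<b><i>text</b></i>"
```

Here `is_well_formed(good)` is true, while both `bad_unclosed` and `bad_overlap` are rejected.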

 Other Languages

Adobe PostScript


Adobe PDF

·         Optimized PostScript

·         PDF document attributes:

o    external links

o    article threads

o    security features

o    device independent colour

o    notes

Shared Documents: Groupware and Computer-Supported Cooperative Work (CSCW)

·         Computer-assisted coordinated activity carried out by groups of collaborating individuals

·         e.g.

o    communication

o    problem solving

o    co-authoring a document

The term Computer-Supported Cooperative Work was coined in 1984 by Greif and Cashman

o    Today it serves as a forum; "cooperative" is a metaphor, since the same systems could equally support competition.

A Paradigm Shift for Computing

Transformation from human-machine to human-human interaction

Results from several convergent phenomena:

 Widespread groupware:

CSCW Taxonomy

·         Today systems are moving toward anytime/anyplace


Synchronous Communications, One Meeting Site: Face-to-Face Interactions

  • Public Computer Displays
  • Electronic Meeting Rooms
  • Group Decision Support Systems

Synchronous Communications, Multiple Meeting Sites: Remote Interactions

  • Shared View Desktop Conferencing Systems
  • Desktop Conferencing with Collaborative Editors
  • Video Conferencing
  • Media Spaces

Asynchronous Communications, One Meeting Site: Ongoing Tasks

  • Team Rooms
  • Group Displays
  • Project Management

Asynchronous Communications, Multiple Meeting Sites: Communication and Coordination

  • Vanilla email
  • Async conferencing bulletin boards
  • Structured messaging systems
  • Workflow management
  • Version Control
  • Meeting Schedulers
  • Cooperative hypertext, organizational memory


Asynchronous Groupware

·         Supports communication and problem solving among groups of individuals who contribute at different times


·         Email and Computer Conferencing


Structured Messages, Agents and Workflow

·         Workflow - focus on messages that define processes - sets of rules which create conversations
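The idea of rules that turn individual messages into a conversation can be sketched as a small state machine. The process below (a request that is approved or rejected, then delivered and acknowledged) is a hypothetical example, and the state names, message types and function names are all invented; a real workflow system would attach people, deadlines and documents to each step.

```python
# Hypothetical rule set: (current state, message type) -> next state.
RULES = {
    ("start", "request"): "requested",
    ("requested", "approve"): "approved",
    ("requested", "reject"): "closed",
    ("approved", "deliver"): "delivered",
    ("delivered", "acknowledge"): "closed",
}

def run_conversation(messages: list[str]) -> str:
    """Apply each message in turn; refuse any message the rules do not allow."""
    state = "start"
    for msg in messages:
        key = (state, msg)
        if key not in RULES:
            raise ValueError(f"message {msg!r} not allowed in state {state!r}")
        state = RULES[key]
    return state

def allowed_messages(state: str) -> list[str]:
    """Message types the rules permit in the given state."""
    return sorted(m for (s, m) in RULES if s == state)
```

For instance, `allowed_messages("requested")` offers only `approve` or `reject`, so the rules rather than the participants decide what the next message in the conversation may be.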

Cooperative Hypertext and Organizational Memory

        1. Retention of knowledge
        2. Support for global collaboration and global discussion
        3. Enhanced communication


Synchronous Groupware



o    Two approaches to developing groupware:

1.      Collaboration transparency - single-user software made available to a group

2.      Collaboration awareness - software rewritten for group use

o    Approach 1 is the simplest - the single-user software runs on multiple workstations under the control of screen-sharing software
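Collaboration transparency can be sketched as a wrapper around an unmodified single-user program. Everything in the example is an assumption made for illustration: `SingleUserEditor` stands in for any existing application, `ScreenSharer` plays the role of the screen-sharing software, and the user names are invented. The editor has no notion of a group; the wrapper merges everyone's input into one stream and copies the resulting display to every participant.

```python
class SingleUserEditor:
    """Unmodified single-user application: it knows nothing about groups."""

    def __init__(self):
        self.text = ""

    def key(self, ch: str) -> None:
        self.text += ch

    def render(self) -> str:
        return self.text


class ScreenSharer:
    """Collaboration transparency: merge input, broadcast the display."""

    def __init__(self, app: SingleUserEditor, users: list[str]):
        self.app = app
        self.screens = {user: "" for user in users}

    def input(self, user: str, ch: str) -> None:
        if user not in self.screens:
            raise KeyError(user)
        self.app.key(ch)           # the application sees one input stream
        frame = self.app.render()
        for u in self.screens:     # every participant sees the same frame
            self.screens[u] = frame


session = ScreenSharer(SingleUserEditor(), ["ann", "bob"])
session.input("ann", "h")
session.input("bob", "i")
```

After the two keystrokes both screens show the same text. A collaboration-aware editor would instead be rewritten to know who typed what, for example to show separate cursors, which is why approach 2 requires new software.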