The Catasto Datamining Project: Version 0

Part 1: The Objective

Introduction

The Catasto Datamining Project tests the utility of datamining algorithms for analyzing historical datasets. Historical datasets differ from other datasets commonly used in datamining applications in the following ways:

These challenges raise an important question about the intersection of historical data and datamining: Can datamining techniques be successfully applied to historical data?

General Thesis

Datamining techniques can be successfully applied to historical data.

Specific Thesis

Datamining techniques can be successfully applied to the Catasto, an early fifteenth-century tax document from Florence.

Evaluation of the Specific Thesis

Success for datamining techniques in this knowledge domain is defined as follows:

Verification
Datamining techniques are successful to the extent that they verify, support, or extend evidence for well-known facts about the Catasto dataset as represented in the historical literature.
Extension
Datamining techniques are successful to the extent that they generate facts about the Catasto dataset that historians can verify in the historical literature.
Direction
Datamining techniques are successful to the extent that they suggest useful lines of inquiry for historical research related to the Catasto.

Background: What is the Catasto?

In 1427, the Priors of the Florentine Republic used a new tax survey to assess the wealth of Florence's inhabitants. The survey is of great historical interest because it records economic and demographic data for the city of Florence itself as well as for the Florentine domains in Tuscany.
Online sources include:

Online Catasto of 1427

The Online Catasto provides a brief but useful introduction to the history of the Catasto, an interface for querying the Catasto data, the code book for data files, and important notes concerning the state of the data files.

Catasto Study: Census and Property Survey for Florentine Domains

The Catasto Study Website is the online data archive that distributes the data used for this study. The site includes data files, documentation files, an online codebook, references, and a description of the Catasto study.

Justification: Why the Catasto?

The Catasto dataset was chosen as the focus of this study for the following reasons:

Part 2: The Input

Introduction

The data cleansing process consists of three stages:

  1. Examine the Raw Format
  2. Parse the Raw Format
  3. Confirm the consistency of the ARFF Format

The Raw Format

The Catasto Study code book provides the information needed to understand and parse the data files. The data files contain two types of records: economic and demographic. The image below shows the first complete record in the dataset.

[Image: Catasto Raw Data]

The first line represents the economic record. Every entry in the data file has one economic record, 80 characters in length. The economic record consists of the following fields:

Field Name | Number of Bytes | Description | Example
Series Number | 2 | Describes the survey in space and time. | Series Number 1 records data for the City of Florence in 1427.
Household Identification (Sequence Number) | 4 | The unique identifier of the record. | In the example above, the sequence number is 1.
Location | 5 | The specific geographic location of the assessment. | In the example above, the location number is 00011, which represents Gonfalone di Scala in the Quartiere di S. Spirito in Florence.
Name of Head of Family | 10 | The first name of the head of the family. | In the example above, the first name is ANTONIO.
Name of the Father of Head of the Family | 10 | The first name of the father of the head of the family (the patronymic). For widows, the name of the deceased husband is generally given. | In the example above, the patronymic is LUIGI.
Family Name | 10 | The name of the family. | In the example above, the family name is CANIGIANI.
Source | 3 | The number of the volume in the archival series containing the original declaration. | In the example above, the source number is 64.
Page | 3 | The number of the folio where the declaration begins. | In the example above, the page number is 1.
Type of Household | 1 | Comment on the type of household, that is, the type of declaration. | In the example above, there is no data for this field.
Type of Dwelling | 1 | Comment on the type of dwelling: home owner, renter, or lives in home rent free (for example, peasants or servants). | In the example above, there is no data for this field.
Ownership of Animals | 1 | Comment on whether the family owns or rents animals. | In the example above, there is no data for this field.
Emigration-Immigration | 1 | Comment on the origin of the head of the household. | In the example above, the field value is 2, which denotes that the head of the household is living and residing in a locality of the district of Florence, but elsewhere than where he is liable for the tax.
Trade Comment | 1 | Comment on the trade practiced, such as employee, widow or daughter of a tradesman, or occupation no longer exercised. | In the example above, there is no data for this field.
Trade or Occupation | 2 | The occupation code for the taxpayer. | In the example above, the field value is 23, denoting that the taxpayer is a Money Changer.
Value of Private Investments | 5 | The sum of the value of movable property, business credits, and cash, measured in Florins (rounded to the nearest Florin). | In the example above, the field value is 300 Florins.
Value of Public Investments | 5 | The sum of the value of investments in the public debt, measured in Florins (rounded to the nearest Florin). | In the example above, the field value is 6 Florins.
Total Value of All Assets | 6 | The total value of investments, measured in Florins (rounded to the nearest Florin). This value includes private and public investment and the value of real property with the house deducted. | In the example above, the field value is 355 Florins.
Deductions | 5 | The total value of deductions (debts and charges), measured in Florins (rounded to the nearest Florin). | In the example above, the field value is 2000 Florins.
Tax | 5 | Tax (the total value of the taxable fortune less the deductions), measured in Florins (rounded to the nearest Florin). | In the example above, the field value is 0 Florins.
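
Because the economic record is a fixed-width line, it can be split by slicing at the byte widths listed above. The project's own parsers are written in Perl and are not reproduced here; the Python sketch below is an independent illustration. The field identifiers, the function name, and the padding of the example line (names left-justified, numbers right-justified) are assumptions, and the example line is rebuilt from the values in the table rather than copied from the data file.

```python
# Field names (hypothetical identifiers) and byte widths as listed in the table above.
ECONOMIC_FIELDS = [
    ("series", 2), ("household_id", 4), ("location", 5),
    ("head_name", 10), ("patronymic", 10), ("family_name", 10),
    ("source", 3), ("page", 3), ("household_type", 1),
    ("dwelling_type", 1), ("animals", 1), ("migration", 1),
    ("trade_comment", 1), ("occupation", 2),
    ("private_investments", 5), ("public_investments", 5),
    ("total_assets", 6), ("deductions", 5), ("tax", 5),
]
RECORD_LENGTH = sum(width for _, width in ECONOMIC_FIELDS)  # 80


def parse_economic_record(line):
    """Slice one fixed-width economic record into a dict of stripped strings."""
    line = line.rstrip("\r\n")
    if len(line) != RECORD_LENGTH:
        raise ValueError(f"expected {RECORD_LENGTH} characters, got {len(line)}")
    record, start = {}, 0
    for name, width in ECONOMIC_FIELDS:
        record[name] = line[start:start + width].strip()
        start += width
    return record


# Hypothetical record rebuilt from the example values in the table.
# Assumption: names are left-justified and all other fields right-justified.
example_values = ["1", "1", "00011", "ANTONIO", "LUIGI", "CANIGIANI", "64", "1",
                  "", "", "", "2", "", "23", "300", "6", "355", "2000", "0"]
TEXT_FIELDS = {"head_name", "patronymic", "family_name"}
example_line = "".join(
    value.ljust(width) if name in TEXT_FIELDS else value.rjust(width)
    for (name, width), value in zip(ECONOMIC_FIELDS, example_values)
)

record = parse_economic_record(example_line)
print(record["family_name"], record["occupation"], record["total_assets"])
# prints: CANIGIANI 23 355
```

Blank fields come back as empty strings, so a later stage can decide how to represent missing data.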

The demographic records will be addressed in version 1 of this project.

The Parsing Programs

Perl programs parse the raw data into the comma-delimited file needed to build a file in the *.arff format.

Sample Perl program

The Perl program produces a comma-delimited file that forms the data section of the *.arff file.

Sample output from Perl program
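
As a rough illustration of this step (in Python rather than the project's Perl), the sketch below writes parsed records as comma-delimited lines. The choice of exported fields is an assumption based on the fields named in the results section, the function name is hypothetical, and the sample records are invented. Blank fields are written as `?`, the marker ARFF uses for missing values.

```python
import csv
import io

# Hypothetical subset of nominal fields to export; the original selection is not listed.
EXPORT_FIELDS = ["dwelling_type", "animals", "migration"]


def records_to_csv(records, fields=EXPORT_FIELDS, missing="?"):
    """Return comma-delimited text, one line per record, with '?' for blank fields."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for record in records:
        writer.writerow([record.get(name) or missing for name in fields])
    return buffer.getvalue()


# Invented records for illustration only; these are not Catasto data.
sample = [
    {"dwelling_type": "1", "animals": "", "migration": "2"},
    {"dwelling_type": "2", "animals": "1", "migration": ""},
]
print(records_to_csv(sample), end="")
```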

The ARFF Format

The final ARFF Format combines the instance data from the comma delimited file with the appropriate header information.

Sample *.arff file
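
Combining a header with the instance data can be sketched as follows. ARFF (the file format used by Weka) begins with a relation name and attribute declarations, followed by an `@data` section. The relation name, attribute names, and value codes below are placeholders, not the project's actual header.

```python
def build_arff(relation, attributes, data_lines):
    """Assemble ARFF text from a relation name, nominal attributes, and CSV lines.

    attributes is a list of (name, allowed_values) pairs.
    """
    lines = [f"@relation {relation}", ""]
    for name, values in attributes:
        lines.append(f"@attribute {name} {{{','.join(values)}}}")
    lines += ["", "@data"]
    lines += list(data_lines)
    return "\n".join(lines) + "\n"


# Placeholder attribute names and codes (assumptions, not the code book's values).
attributes = [
    ("dwelling_type", ["1", "2", "3"]),
    ("animals", ["1", "2"]),
    ("migration", ["1", "2", "3"]),
]
arff_text = build_arff("catasto", attributes, ["1,?,2", "2,1,?"])
print(arff_text)
```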

Note that the ARFF file contains far fewer fields than the raw data. The reduction was necessary to produce Apriori output, because the Apriori algorithm accepts only nominal data.
In future versions, the ARFF file will contain the entire dataset, and filters will make the adjustments that specific algorithms require.
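
The third stage of the cleansing process, confirming the consistency of the ARFF file, might look something like the sketch below. It assumes a nominal-only file with well-formed `@attribute` lines, and checks that every data row has one value per attribute and that each value is either `?` or one of the declared values. The function name and the sample text are hypothetical.

```python
def check_arff(text):
    """Return a list of problems found in a nominal-only ARFF text."""
    attributes, problems, in_data = [], [], False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            if len(values) != len(attributes):
                problems.append(
                    f"line {number}: expected {len(attributes)} values, got {len(values)}"
                )
                continue
            for (name, allowed), value in zip(attributes, values):
                if value != "?" and value not in allowed:
                    problems.append(f"line {number}: {value!r} not declared for {name}")
        elif lower.startswith("@attribute"):
            # "@attribute name {a,b,c}" -> name and the set of allowed values
            _, name, spec = line.split(None, 2)
            allowed = {v.strip() for v in spec.strip("{}").split(",")}
            attributes.append((name, allowed))
        elif lower.startswith("@data"):
            in_data = True
    return problems


# Invented sample with two deliberate errors (an undeclared value and a short row).
sample_arff = """@relation catasto
@attribute dwelling_type {1,2,3}
@attribute animals {1,2}
@data
1,?
2,9
3
"""
for problem in check_arff(sample_arff):
    print(problem)
```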

Part 3: The Output

Introduction

In future versions, several datamining algorithms will be applied to the data. In this version, only the Apriori method is examined.

The Apriori Output

Sample output from the Apriori datamining method

Interpreting the Results

The results demonstrate a connection between home ownership, animal ownership, and migration patterns. These relationships, however, are not conclusive. Significant tuning and data cleansing are still required to obtain meaningful output that can be used to verify the specific thesis.
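
Apriori scores association rules by support (how often the items occur together) and confidence (how often the consequent holds when the antecedent does). The sketch below computes only these two measures; it is not the full Apriori search, and the records are toy values invented for illustration, not Catasto results.

```python
def support(records, items):
    """Fraction of records that contain every (attribute, value) pair in items."""
    matches = sum(1 for r in records if all(r.get(k) == v for k, v in items.items()))
    return matches / len(records)


def confidence(records, antecedent, consequent):
    """Share of records matching the antecedent that also match the consequent."""
    both = support(records, {**antecedent, **consequent})
    base = support(records, antecedent)
    return both / base if base else 0.0


# Toy records invented for illustration; these are not Catasto results.
toy = [
    {"dwelling": "owner", "animals": "owns"},
    {"dwelling": "owner", "animals": "owns"},
    {"dwelling": "owner", "animals": "none"},
    {"dwelling": "renter", "animals": "none"},
]
rule_support = support(toy, {"dwelling": "owner", "animals": "owns"})
rule_confidence = confidence(toy, {"dwelling": "owner"}, {"animals": "owns"})
print(rule_support, rule_confidence)
```

In this toy set, the rule "owner implies owns animals" holds in two of the four records (support 0.5) and in two of the three owner records (confidence about 0.67). Low values on either measure are one reason a reported relationship may not be conclusive.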

The Development Plan for Version 1

The following improvements are planned for the next iteration of the project:

Part 4: The Presentation

This website constitutes a large part of the project's presentation. This section will be expanded as needed.

References

Online Catasto of 1427. Version 1.3. Edited by David Herlihy, Christiane Klapisch-Zuber, R. Burr Litchfield and Anthony Molho. [Machine readable data file based on D. Herlihy and C. Klapisch-Zuber, Census and Property Survey of Florentine Domains in the Province of Tuscany, 1427-1480.] Florentine Renaissance Resources/STG: Brown University, Providence, R.I., 2002.

Herlihy, David and Christiane Klapisch-Zuber. Census and property survey of Florentine domains and the city of Verona in fifteenth century Italy [machine-readable data file]. Cambridge, Mass.: David Herlihy, Harvard University, Department of History and Paris, France: Christiane Klapisch-Zuber, Ecole Pratique des Hautes Etudes [producers], 1977. Madison, Wis.: University of Wisconsin, Data and Program Library Service [distributor], 1988 and 1996. (12 September 1999)

J. Paul Bischoff and Robert Darcy, "Reformatting the Florentine Catasto for use by Standard Statistical Analysis Programs," Computers and Medieval Data Processing, XI, (October, 1981):5-6.

Herlihy, David and Christiane Klapisch-Zuber, Les Toscans et leurs familles: Une étude du catasto florentin de 1427, Paris: Presses de la Fondation Nationale des Sciences Politiques, 1978.

Herlihy, David and Christiane Klapisch-Zuber, Tuscans and Their Families: A Study of the Florentine Catasto of 1427, New Haven: Yale University Press, 1985.