The Catasto Datamining Project tests the utility of datamining algorithms for analyzing historical datasets. Historical datasets differ from other datasets commonly used in datamining applications in the following ways:
These challenges raise an important question about the intersection of historical data and datamining: Can datamining techniques be successfully applied to historical data?
Datamining techniques can be successfully applied to historical data.
Datamining techniques can be successfully applied to the Catasto, an early fifteenth-century tax document from Florence.
Success for datamining techniques in this knowledge domain is defined as follows:
In 1427, the Priors of the Florentine Republic used a new tax survey
to assess the wealth of Florence's inhabitants. The survey is of great historical
interest because it records economic and demographic data pertaining to
the city of Florence itself as well as the Florentine domains in Tuscany.
Online sources include:
The Online Catasto provides a brief but useful introduction to the history of the Catasto, an interface for querying the Catasto data, the code book for data files, and important notes concerning the state of the data files.
Catasto Study: Census and Property Survey for Florentine Domains
The Catasto Study Website is the online data archive that distributes the data used for this study. The site includes data files, documentation files, an online codebook, references, and a description of the Catasto study.
The Catasto dataset was chosen as the focus of this study for the following reasons:
The data cleansing process consists of three stages:
The Catasto Study code book provides the information necessary to
comprehend and parse the data files. The data files consist of two
different types of records: economic and demographic. The image
below is the first complete record in the dataset.
The first line represents the economic record. Every entry in the
data file has one economic record (80 characters in length). The economic
record consists of the following fields:
Field Name | Number of Bytes | Description | Example |
---|---|---|---|
Series Number | 2 | The Series Number describes the survey in space and time. | Series Number 1 records data for the City of Florence in 1427. |
Household Identification (Sequence Number) | 4 | The unique identifier of the record. | In the example above, the sequence number is 1. |
Location | 5 | The specific geographic location of the assessment. | In the example above, the location number is 00011 which represents Gonfalone di Scala in the Quartiere di S. Spirito in Florence. |
Name of head of Family | 10 | The first name of the head of the family. | In the example above, the first name is ANTONIO. |
Name of the Father or head of the Family | 10 | The first name of the Father or the head of the family -- patronymic. For widows, the name of the deceased Husband is generally given. | In the example above, the patronymic is LUIGI. |
Family name | 10 | The name of the family. | In the example above, the family name is CANIGIANI. |
Source | 3 | The number of the volume in the archival series containing the original declaration. | In the example above, the source number is 64. |
Page | 3 | The number of the folio where the declaration begins. | In the example above, the page number is 1. |
Type of Household | 1 | Comment on the type of household - the type of declaration. | In the example above, there is no data for this field. |
Type of Dwelling | 1 | Comment on the type of dwelling: home owner, renter, lives in home rent free, i.e. peasants or servants. | In the example above, there is no data for this field. |
Ownership of Animals | 1 | Comment on whether the family owns or rents animals. | In the example above, there is no data for this field. |
Emigration-Immigration | 1 | Comment on the origin of the head of the household. | In the example above, the field value is 2, which denotes that the head of the household lives in a locality of the district of Florence, but somewhere other than where he is liable for the tax. |
Trade Comment | 1 | Comment on the trade practiced such as employee, widow or daughter of a tradesman, and occupation no longer exercised. | In the example above, there is no data for this field. |
Trade or Occupation | 2 | The field encodes the occupation code for the taxpayer. | In the example above, the field value is 23 denoting that the taxpayer is a Money Changer. |
Value of Private Investments | 5 | The field sums the value of mobile property, business credits, cash measured in Florins (rounded to the nearest Florin). | In the example above, the field value is 300 Florins. |
Value of Public Investments | 5 | The field sums the value of investments in the public debt measured in Florins (rounded to the nearest Florin). | In the example above, the field value is 6 Florins. |
Total Value of all assets | 6 | The field sums the total value of investments measured in Florins (rounded to the nearest Florin). This value includes private and public investment and the value of real property with the house deducted. | In the example above, the field value is 355 Florins. |
Deductions | 5 | The total value of deductions (debts and charges) measured in Florins (rounded to the nearest Florin). | In the example above, the field value is 2000 Florins. |
Tax | 5 | Tax (the total value of the taxable fortune less the deductions) measured in Florins (rounded to the nearest Florin). | In the example above, the field value is 0 Florins. |
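The field widths in the table add up to the 80 characters of one economic record, so a record can be split by position. The following is a minimal sketch of that idea, written in Python rather than the Perl used by the project. The field labels are illustrative names chosen here, the sketch assumes the fields are stored back to back in table order, and the sample line is synthetic: it is assembled from the example values in the table, and the justification of values inside each field is a guess.

```python
# Illustrative field labels paired with the byte widths listed in the table.
# Assumption: fields are contiguous and appear in table order.
ECONOMIC_FIELDS = [
    ("series", 2),
    ("household_id", 4),
    ("location", 5),
    ("first_name", 10),
    ("patronymic", 10),
    ("family_name", 10),
    ("source_volume", 3),
    ("page", 3),
    ("household_type", 1),
    ("dwelling_type", 1),
    ("animals", 1),
    ("migration", 1),
    ("trade_comment", 1),
    ("trade", 2),
    ("private_investments", 5),
    ("public_investments", 5),
    ("total_assets", 6),
    ("deductions", 5),
    ("tax", 5),
]

RECORD_LENGTH = sum(width for _, width in ECONOMIC_FIELDS)


def parse_economic_record(line):
    """Split one fixed-width economic record into a dict of stripped strings."""
    line = line.rstrip("\n").ljust(RECORD_LENGTH)
    record = {}
    offset = 0
    for name, width in ECONOMIC_FIELDS:
        record[name] = line[offset:offset + width].strip()
        offset += width
    return record


# Synthetic line built from the example values in the table (not raw data).
example_values = ["1", "1", "00011", "ANTONIO", "LUIGI", "CANIGIANI", "64", "1",
                  "", "", "", "2", "", "23", "300", "6", "355", "2000", "0"]
sample_line = "".join(
    value.ljust(width) for value, (_, width) in zip(example_values, ECONOMIC_FIELDS)
)
sample = parse_economic_record(sample_line)
```

Empty strings in the result stand for fields with no data, such as the household type in the example record.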
The demographic records will be addressed in version 1 of this project.
Perl programs are used to parse the raw data into a comma-delimited file, which is needed to build a file in the *.arff format.
Sample Perl program
The Perl program produces a comma-delimited file that forms the data section of the *.arff file.
Sample output from Perl program
The final ARFF file combines the instance data from the comma-delimited file with the appropriate header information.
Sample *.arff file
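The step of combining header information with comma-delimited instance data can be sketched as follows. This is a Python illustration, not the project's Perl code. The relation name, the attribute names, and the code values are hypothetical placeholders; the real values would come from the Catasto code book. The sketch handles nominal attributes only and writes missing values as `?`, which is the ARFF convention.

```python
def build_arff(relation, attributes, rows):
    """Return ARFF text for a dataset with nominal attributes only.

    attributes: list of (name, [allowed values]) pairs.
    rows: list of lists of strings, one list per instance; "" means missing.
    """
    lines = ["@relation " + relation, ""]
    for name, values in attributes:
        lines.append("@attribute %s {%s}" % (name, ",".join(values)))
    lines += ["", "@data"]
    for row in rows:
        # ARFF marks a missing value with a question mark.
        lines.append(",".join(value if value != "" else "?" for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical attributes and code values, for illustration only.
attributes = [
    ("dwelling", ["1", "2", "3"]),
    ("animals", ["1", "2"]),
    ("migration", ["1", "2", "3"]),
]
rows = [["1", "", "2"], ["2", "1", "1"]]
arff_text = build_arff("catasto", attributes, rows)
```

Keeping the header generation separate from the row data means the same comma-delimited output can be reused if the set of attributes changes.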
Please note that the ARFF file has far fewer fields than the raw
data. This reduction was required to run Apriori, which accepts
only nominal data.
In future versions, the arff format will contain the entire dataset
and filters will be used to make adjustments necessary for specific
algorithms.
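One adjustment of this kind is turning a numeric field, such as a value in florins, into a nominal one so that Apriori can use it. The sketch below shows the idea in Python. The cut points and labels are hypothetical and were chosen only for illustration; they are not taken from the project or from the Catasto code book.

```python
import bisect

# Hypothetical cut points in florins and the labels for the bands they create.
CUTOFFS = [1, 100, 1000]
LABELS = ["none", "low", "middle", "high"]


def wealth_band(florins):
    """Map a numeric florin amount to a nominal label."""
    return LABELS[bisect.bisect_right(CUTOFFS, florins)]
```

With these cut points, `wealth_band(355)` returns `"middle"` and `wealth_band(0)` returns `"none"`.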
In future versions, several datamining algorithms will be applied to the data. In this version, only the Apriori algorithm is examined.
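For readers unfamiliar with Apriori, the following Python sketch shows the core of the method: it finds itemsets that occur in at least a minimum fraction of the records, building larger itemsets only from smaller ones that are already frequent. It is a teaching sketch, not the implementation used in this project. The transactions are invented toy data, not Catasto records, and the item names are placeholders.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(items): support} for every frequent itemset.

    transactions: list of sets of items; min_support: fraction in (0, 1].
    """
    total = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / total

    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    size = 1
    while current:
        # Keep only the candidates that meet the support threshold.
        level = {}
        for candidate in current:
            s = support(candidate)
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        size += 1
        # Join step: merge frequent itemsets into candidates one item larger.
        joined = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: every subset one item smaller must itself be frequent.
        current = {
            c for c in joined
            if all(frozenset(sub) in level for sub in combinations(c, size - 1))
        }
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent from stored supports."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]


# Invented toy data; each set is one household described by nominal items.
toy = [
    {"dwelling=owner", "animals=owns", "migration=resident"},
    {"dwelling=owner", "animals=owns", "migration=resident"},
    {"dwelling=renter", "animals=none", "migration=elsewhere"},
    {"dwelling=owner", "animals=owns", "migration=elsewhere"},
]
frequent_sets = apriori(toy, min_support=0.5)
rule_confidence = confidence(frequent_sets, ["dwelling=owner"], ["animals=owns"])
```

In this toy data every owner household also owns animals, so the rule from `dwelling=owner` to `animals=owns` has a confidence of 1.0. Real output depends on the support and confidence thresholds chosen.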
The results suggest a connection between home ownership, animal ownership, and migration patterns. These relationships, however, are not conclusive. Significant tuning and data cleansing are still required to obtain meaningful output that can be used to verify the specific thesis.
The following improvements are planned for the next iteration of the project:
This website obviously constitutes a large part of the presentation of the project. The section will be expanded as needed.
Online Catasto of 1427. Version 1.3. Edited by David Herlihy, Christiane Klapisch-Zuber, R. Burr Litchfield and Anthony Molho. [Machine readable data file based on D. Herlihy and C. Klapisch-Zuber, Census and Property Survey of Florentine Domains in the Province of Tuscany, 1427-1480.] Florentine Renaissance Resources/STG: Brown University, Providence, R.I., 2002.
Herlihy, David and Christiane Klapisch-Zuber. Census and property survey of Florentine domains and the city of Verona in the fifteenth century Italy [machine-readable data file]. Cambridge, Mass.: David Herlihy, Harvard University, Department of History and Paris, France: Christiane Klapisch-Zuber, Ecole Pratique des Hautes Etudes [producers], 1977. Madison, Wis.: University of Wisconsin, Data and Program Library Service [distributor], 1988 and 1996.
J. Paul Bischoff and Robert Darcy, "Reformatting the Florentine Catasto for use by Standard Statistical Analysis Programs," Computers and Medieval Data Processing, XI, (October, 1981):5-6.
Herlihy, David and Christiane Klapisch-Zuber, Les Toscans et leurs familles: Une étude du catasto florentin de 1427, Paris: Presses de la Fondation Nationale des Sciences Politiques, 1978.
Herlihy, David and Christiane Klapisch-Zuber, Tuscans and Their Families: A Study of the Florentine Catasto of 1427, New Haven: Yale University Press, 1985.