Like most things, the birth and development of the project were the result of many influences and many accidents. Part of it stemmed from the historical background of the 1960's. It was obvious to anyone who was engaged in social history during that decade that the records for Britain were outstanding. In terms of their duration, variety and accuracy, England has possibly the best local historical records in the world, stretching back to the fourteenth century. The growth of County Record Offices from the 1950's, the reorganization of the Public Record Office, and the consequent depositing of many local records made these documents accessible in a new way. Thus the very old tradition of local history in Britain, dating back into the seventeenth century at least, could be enriched.
Alan Macfarlane had spent many months researching witchcraft in the Essex Record Office in the early 1960's for a doctoral thesis, so he was aware of the richness and diversity of the documents. Social anthropologists were showing that intensive studies of small communities could show a 'world in a grain of sand' and tell us something much more general about how a society worked and changed. At the same time historians of population, in France and then in Cambridge, were starting to show how valuable it was to link records together, particularly baptisms, marriages and burials - and he thought that this might be extended to other types of record.
While Alan Macfarlane was working in the Essex Record Office he came across a typescript of the diary of Ralph Josselin, Vicar of Earls Colne from 1640 to 1683. The many references to Josselin's family and wider kin intrigued him, and later, when he was studying anthropology, he wrote a book on Josselin's family life. As a result he was asked to edit a full edition of the diary for the British Academy. The many references to fellow villagers made him wonder how much he could find out about them from other records. So he decided to try, in collaboration with Sarah Harrison, to gather together everything about the village of Earls Colne that might refer to Josselin and his contemporaries. They were amazed by just how much there was and decided to use this village as an experiment to see whether it was possible to reconstruct an historical community.
This was not a simple task. Many of the documents were in Latin, and some of them, for example the manor court records, were hundreds of feet long. The archives were scattered in record offices across the country and in private hands. It was clearly going to be a large task to bring together all the surviving records of a parish for a period of hundreds of years. They did not realize, in 1973, when this project formally began, that it would take twenty-seven years.
It has always been a team effort. Early on we were encouraged to use computers but this was long before the desk-top and Windows revolution. We had to type data onto paper tape, which was then fed into the Cambridge mainframe computer. For some years editing was done a line at a time. We could not see a full screen of text to edit until we got an early freestanding computer in the late 1970's. But we were fortunate to be working at a time and place where, alongside the rapid developments in archives and social and demographic history, there was a computer revolution occurring. Another strand of the project thus needs to be outlined to understand the situation as the project emerged in 1973.
The project continued, in a new context, a number of earlier projects supported by King's College Research Centre at Cambridge. In the early 1970s Nick Jardine and Robin Sibson ran a project on Numerical Taxonomy with Keith van Rijsbergen as the Research Assistant. Later Keith van Rijsbergen registered as a research student and completed a Ph.D. on Information Retrieval (IR) under the supervision of Ken Moody at King's College. This then linked into the later project.
Alan Macfarlane had worked on the historical records of various English parishes but had also done anthropological fieldwork in a village in the Annapurna Mountains in Nepal among the Gurungs. An application for a four-year grant was made to the Social Science Research Council (S.S.R.C.) to last from September 1972 to September 1976. The aim of the project was to 'further integrate the historical and sociological study of pre-industrial societies by undertaking a total reconstruction of three communities.' These were the Nepalese village, three contiguous villages of Boreham, Little Baddow and Hatfield Peverel in Essex, and Kirkby Lonsdale in Westmorland. The work was to complement that of the Cambridge Group for the History of Population and Social Structure, which in the persons of Peter Laslett, Tony Wrigley and Roger Schofield, was developing new ways of linking together records through the method of 'family reconstitution' and the analysis of listings of inhabitants. The topics to be studied included kinship, marriage, sex, mortality patterns and domestic economics in a local setting. One full-time person (Sarah Harrison) and one half-time (Iris Macfarlane) were to be employed for four years.
During the first year of the project, 1972-3, we gathered a good deal of Kirkby Lonsdale material and finished editing the diary of Ralph Josselin. We abandoned the study of the other three Essex villages, whose records were not wholly satisfactory, and decided to concentrate on Earls Colne, which not only had Josselin's diary but also an excellent sixteenth century map and very good manorial records. During this period the aim was to create a manual index by person, place and subject, and then consider how to approach the computerization of the data.
The computerization was a very difficult task and with the help and encouragement of Dr. Ken Moody of King's College, who acted throughout the project as our formal adviser, we began to consider how this might be accomplished. King's College Research Centre agreed to fund a computer Analyst/Programmer for one year. Charles Jardine was appointed to start in November 1973.
By the second year (1973-4) the immense size of the material became apparent. We decided to concentrate all our efforts on Earls Colne. Originally we had intended to limit ourselves to the period 1500-1750, but since there were excellent manorial records from 1375 onwards it became clear that we would need to start earlier. We also needed to look at some of the records of neighbouring communities because of the high mobility we found in a single parish. We concluded that 'data collection is a far more laborious task than anticipated'. The manual indexing of the material continued and by the end of the year an early version of the Earls Colne name index had been created.
The major theoretical advances in the year were in the development of methods of storing and analysing historical documents with a computer. Originally we had merely seen the computer as a statistical tool; we would feed in linked abstracts about individuals and then run statistical programs on them. In his first year, Charles Jardine, with support from Ken Moody, transformed these aims into something much more ambitious. He began to think in terms of automatic record linkage, of inputting the full text of the documents. In order to do this in 1974, many things which we now take for granted had to be invented. We started with a very ancient paper-tape device called a flexowriter. Charles Jardine had to design and implement his own system of indexed sequential access to large files on disc. He had begun to design and implement a pre-processing program which would accept 'almost uncoded historical records'. Thus the important co-operation between historians and computer experts had begun.
In the third year (1974-5) we continued to fill in the gaps in the Earls Colne records and by the end of the year the bulk of the records had been collected. The amount of the material, we reported, 'continues to amaze us'. We made extensive use of photographic and tape-recording methods in collecting material. During this year the manual name index more than doubled in size. It was divided into surnames, though individuals were not then linked. The place index was extended, subdivided into 'from' and 'to' indexes, and indexes to rentals. Certain subject indexes had been shortened and we had begun to map the fields.
The major theoretical advance was the growing realization that it is absolutely essential not to alter the original records. Sarah Harrison had spent much time typing documents such as manor court rolls in a restructured form, altering the order of the information so that it was easier to read. Yet when we came to try to put the logical structure of the documents into the computer, we realized that this was often impossible from the form we had devised. Ambiguities which were absent in the original document began to creep in. It became apparent that most of the original records were written in a highly precise, unambiguous way and their terminology and syntax could not be improved. We therefore reverted to typing as close a copy of the original as possible. This meant a considerable amount of re-working of the data, but nevertheless by the end of the year, Sarah Harrison had typed about three-quarters of the Earls Colne manorial material for the period 1550-1750.
There were continued advances in the application of computers. King's College had extended funding for a second year and Charles Jardine had continued to work with us. During this year it began to appear essential to explore the possibility of typing the whole of the documents, however long and complex they were, in a machine-readable form. One would hope to create within the machine an exact, if artificial, representation of the semantic structure of all the links and attributes implicit in the historical records. The attempt was based on the belief that there really is a translatable logical structure in historical records, which can be precisely and unambiguously defined. Charles Jardine decided to follow the more difficult course of looking at the more complex classes of social and economic records and to work out a way of putting the tangled legal processes and descriptions of property into the computer. A number of months, for example, were spent in trying to specify the semantic structure of a conditional surrender to several heirs. A formal language and syntax into which the historical documents needed to be converted before being put into the machine was worked out. The system had to be literally invented from scratch.
In the final year of this first project (1975-6) we continued to collect material for Earls Colne and to index it by hand. The greatest amount of effort went into collaborating with Charles Jardine on the computing side, funded by the Research Centre at King's. By the end of 1975 a working input system had been devised, but it suffered from two disadvantages, one practical and one theoretical. The practical one was that since the documents had to be re-ordered while being typed into the machine, only someone who fully understood them could prepare them for the computer. The theoretical one was that while the data model dealt with land and people very well, it left court cases, particularly the rich material in ecclesiastical and leet courts, practically untouched. To deal with both these problems an attempt was made both to improve the model and also to divide the typing in of the documents into two stages. The first consisted of an exact, verbatim transcription, with only spelling (except names, or odd words) modernised, but no re-ordering of the text. Someone who did not understand the documents could, in theory, do this. The meaning would then be conveyed by special syntactic marks inserted as a second stage by an expert. The information put in this form would, it was now hoped, be held in a database, and an enquiry language for searching it would be developed.
At the end of the four-year SSRC project (September 1976), we concluded that the major practical problem we had encountered was the sheer size and amount of data. This was combined with higher standards of indexing and the shift from abstracting only a part of a document to doing the whole document. These changes meant that, whereas we had originally estimated that a single parish would take about five person years to analyse from start to finish, we now estimated that 'twenty man-years are required to undertake a total reconstruction' of a parish of a thousand or more persons. It was clear that this was work for a team. So we applied for a continuation of the project, with funding from the S.S.R.C., later to become the E.S.R.C. (Economic and Social Research Council).
The aims of the continued project were put in the abstract of research as follows: 'to exploit local historical records in order to use them for the study of landholding and inheritance, kinship, marriage, fertility, mortality, geographical mobility, sexual behaviour, crime and social control... The method will consist of entering the material from the records into a computer, altering the original text only by adding punctuation to indicate the logical structure. Programs will be written to perform record linkage and to store the material as a structured database.' The staff consisted initially of Charles Jardine and Sarah Harrison, both full-time, the applicants being Alan Macfarlane and Charles Jardine. In the event, the SSRC agreed to fund the research for three years in the first instance.
In the first year of this project (1976-7), computing again formed the bulk of our work. We had developed an acceptable input format; now there was the task of converting our huge mass of Earls Colne documents into machine-readable text. This was to be done by using a nested bracketing structure devised by Charles Jardine by which the text could be broken into meaningful parts. For instance, persons, land, and specific actions such as surrendering land in the manorial court could be clearly marked. We were fortunate to obtain permission (through the help of Dr. Moody and others) to use the Computing Service PDP11/45 and Vector General graphics computer, with light pen and keyboard, which made editing in the brackets much quicker. A special purpose editor was written by Charles Jardine. We also obtained the use of a visual display terminal with local editing facilities. Even with these more sophisticated tools, it was clear that putting in and bracketing the data was a huge task.
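The idea of nested bracketing can be illustrated with a minimal sketch. The account does not record the actual bracket syntax, so the `[tag: ...]` notation, the tag names, and the sample sentence below are all invented for illustration and are not the project's own format. The sketch shows only the general principle: the words of the document stay in their original order, and brackets added around them mark persons, land and actions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A bracketed span: a tag plus its words and nested spans."""
    tag: str
    children: list = field(default_factory=list)  # items are str or Node


def parse(text: str) -> Node:
    """Parse the hypothetical '[tag: ...]' notation into a tree."""
    root = Node("document")
    stack = [root]
    buf = []

    def flush():
        # Attach any pending plain words to the innermost open bracket.
        words = "".join(buf).strip()
        if words:
            stack[-1].children.append(words)
        buf.clear()

    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "[":
            flush()
            colon = text.index(":", i)  # the tag runs up to the first colon
            node = Node(text[i + 1:colon].strip())
            stack[-1].children.append(node)
            stack.append(node)
            i = colon + 1
            continue
        if ch == "]":
            flush()
            if len(stack) == 1:
                raise ValueError("unmatched closing bracket")
            stack.pop()
        else:
            buf.append(ch)
        i += 1
    flush()
    if len(stack) != 1:
        raise ValueError("unclosed bracket")
    return root


def plain_text(node: Node) -> str:
    """Recover the transcribed words with all markup removed."""
    return " ".join(
        c if isinstance(c, str) else plain_text(c) for c in node.children
    )


def find(node: Node, tag: str) -> list:
    """List the text of every span carrying the given tag, at any depth."""
    found = []
    for c in node.children:
        if isinstance(c, Node):
            if c.tag == tag:
                found.append(plain_text(c))
            found.extend(find(c, tag))
    return found


# Invented sample entry, not taken from the Earls Colne records.
sample = ("[surrender: [person: John Smith] surrenders "
          "[land: one croft called Hobbs] to the use of [person: Mary Smith]]")
tree = parse(sample)
```

Here `find(tree, "person")` picks out the two people named in the entry, while `plain_text(tree)` returns the sentence exactly as transcribed, which reflects the principle of marking the logical structure without re-ordering the text.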
It was clear that the typing and editing of the data was a lengthy business. We were fortunate to find someone who had finished a degree in anthropology and was interested in learning about historical documents. We decided to employ Jessica Styles (later King). With her help, by the end of September 1977 we had typed three and a half million characters of data into the computer and had edited part of this.
We now had a satisfactory input system, but as data poured into the computer a new problem loomed ahead: how to hold it so that it could be searched rapidly. It was a lucky coincidence that it was at this very time that one of the most exciting developments in computing was occurring, namely research into database systems. When we wrote the 1976 application we did not have access to a suitable database system and thought we should either have to write our own or wait for one to come on the market. In fact, nothing suitable appeared until after this phase of the project was over. To write one ourselves would have been difficult, for Charles Jardine was already fully occupied on data input and problems of record linkage. We were therefore extremely fortunate that through the advice of Dr Ken Moody the project was joined by Tim King, who started a Ph.D. on database systems in October 1976. He was funded by a CASE (Collaborative Award in Science and Engineering) studentship which gave him special access to and collaboration with IBM. As a result of an agreement between Cambridge and IBM, we were able to make use of an experimental database system which had taken IBM many computer-programming man-years to develop at IBM UKSC at Peterlee. This system, PRTV, was run on the IBM 370 at Cambridge.
Tim King then built on this earlier research and started to write a database system in BCPL, which used a command language similar to PRTV. Jardine and King spent a week in Peterlee and began to collaborate closely in the work. The parsing and input programmes began to be fitted with a potential database system. Thus by the middle of 1977 the project had reached its maximum size.
The collection and transcription of data for Earls Colne was continuing alongside the input of the data into the machine. The two major sets of records which had not been transcribed by the start of this phase of the work were the manorial records for the period 1375-1550 and many of the longer documents in the central courts, deposited in the Public Record Office. Neither of these types of record was easy to use: the former because of the abbreviated Latin in which they were written, often in faded script, the latter because court cases were difficult to find and extremely lengthy when found. We were able to solve the former problem through the help of Cherry Bryant. She magnified the medieval account and court rolls by using a slide projector at home and tape-recorded a translation. The contents of the tape were then typed into the computer by the team at Cambridge. The accurate and steady flow of this material was extremely valuable. Over the next three years we would also come to grips with some of the voluminous records in the Public Record Office, and particularly some very long and important cases in the courts of Star Chamber, Requests and Chancery. This was only possible by tape-recording them and then typing the material directly into the computer, since this was some years before the invention of portable computers.
The project was now a large and complex one, with six people associated with it full or part time. It was allocated increased space in the Department of Social Anthropology when the basement became available in March 1978. In this Department it always received the strongest backing from the Head of Department, Professor Jack Goody, who was himself making innovations in the use of computers in anthropology. The Computer Service loaned us a private disc pack for our data, since it was too large (about 40 megabytes!) to fit on the Cambridge mainframe public space. The S.S.R.C. recognized this expansion and its importance by appointing a steering committee for the duration of the project. Two or three times a year over the next five years it would visit Cambridge for the day and hear reports on our work and advise us on how to proceed. We found the necessity to justify what we were doing to a group of experts appointed by both the history and computing committees of the S.S.R.C. most valuable. Various members of the S.S.R.C. secretariat attended meetings and gave us advice, including David Allen, Chris Caswell and Michael Wood. David Allen throughout this and earlier phases showed an especial interest in the project and was most helpful in suggesting ways to go. Of the members of the Committee, the most active, naturally enough, were the three Chairmen, who put a great deal of effort into guiding us: these were Roderick Floud, Tony Coxon and Michael Drake. Others who played an active part included Roger Schofield and Joan Thirsk.
In the second year of this phase (1977-8) we set ourselves the ambitious target of typing and parsing approximately one hundred years of Earls Colne data. This was achieved. In this year we realized the importance of making a distinction between surnames and forenames, and a program was written to distinguish them. We were beginning to interrogate and check the data. We continued collecting and transcribing central and early records. Several students, including Mary Bouquet and Rab Houston, used the system on similar data.
During this year we loaded about 250,000 words of text interspersed with 67,000 bracket pairs, about 8% of our total records, into two databases. It was loaded into the IBM UKSC Peterlee Relational Test Vehicle (PRTV) and a system (CRTV) written by Tim King. The result of the tests showed that CRTV was more reliable than PRTV and was running nearly twice as fast in Cambridge. IBM withdrew PRTV, but generously encouraged us to continue work based on it. Charles Jardine used CRTV for some preliminary work on automatic record linkage and produced some abstracts from records which, incidentally, were also extremely useful for our hand indexes.
By midsummer 1978 Tim King had started to develop a more sophisticated database system, called CODD (COroutine Driven Database), which made use of coroutine processes to provide both greater efficiency and more flexibility of use. Charles Jardine collaborated with Tim King on this. The Dictionary system used for storing the text of the input was re-written and a spelling check program was written. A five-fold increase in efficiency was achieved. King and Jardine also started to develop a query language. This language, Cambridge Historical Information Programming System (CHIPS), was, as far as we know, the only procedural one available for querying a relational database system. The existence of a procedure (or subroutine) definition mechanism was crucial for our purposes, enabling us to define procedures tailor-made for our particular application of our database.
By the end of the third year (Aug. 1979) when Charles Jardine finally left the project, we had reached the following stage. We had developed the input format and parsing programs. We had developed a prototype database system. We had begun to work on a query language for interrogating the material. We had collected the material for Kirkby Lonsdale 1500-1750, Earls Colne for 1375-1750 and for the Nepalese village. We had put about 100 years of the Earls Colne data into the computer. Alan Macfarlane and Tim King applied for a two-year extension so that we could complete the task. We hoped to input all the Earls Colne materials back to 1375. We would refine the database system and the query language and we would begin to undertake substantive analysis.
The team worked on for another two years. These continued to be productive and busy years with an enormous amount of work being put in by the full-time employees in the team. By the end of the project, we had typed in all the materials for Earls Colne, 1375-1750, roughly 3,200,000 words of text, into the computer. All had been bracketed, cleaned up and checked. The database had been improved and the query language was operational. Further analysis of the problems of record linkage had been undertaken.
The project was now at the stage where it was beginning to be possible to really use the carefully collected material. It was at this point that the research environment and particularly the funding of the S.S.R.C. changed dramatically. When the project was set up and when we applied for the last two-year extension we envisaged that the study we were making would be the first of many. We believed we would be pioneering methods which could be used by other similar groups. This now appeared unlikely. It became obvious that this was a unique study. We also believed during the middle of the project that the S.S.R.C. itself would be able to provide some permanent support for the suite of programs we were writing and the set of data which we were accumulating. The S.S.R.C., aware of this problem, commissioned a report, and we held various meetings and made a detailed application for an interim solution by asking for a five-year computer post. Yet nothing emerged from all this in the worsening financial situation. Once it became apparent that future maintenance and use of our system would largely depend on the private initiative of members of the team and could not be guaranteed, we were forced to devote more attention to providing data and indexes in a non-computerized form. This would make it possible for as many historians and others as possible to use the material. Hence the microfiche publication of the data, with certain indexes, which was published a year later by Charles Chadwyck Healey Ltd.
Finally, in order to tidy up loose ends, we requested and obtained a small grant for one year from April 1982. This was to enable us to use the query system and to write a manual describing how it worked. This was done. It was also to enable us to see, now that all the data was in machine readable form, whether computer record linkage was possible. It finally emerged that it was not feasible, but we were fortunate that Dr. Tim King, then teaching at Bath, and Sarah Harrison, were prepared to give a considerable amount of time to complete record linkage. This was done by using the computer to provide the material upon which the human takes the decision to link or not to link.
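The division of labour described above, in which the computer provides the material and a person takes the decision to link or not to link, can be sketched as follows. This is an illustrative sketch only: the `Reference` fields, the similarity measure, the thresholds and the sample names are assumptions made for the example, and nothing here reproduces the programs actually used by the project.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations


@dataclass(frozen=True)
class Reference:
    """One mention of a person in a document (hypothetical fields)."""
    doc_id: str
    forename: str
    surname: str
    year: int


def similarity(a: str, b: str) -> float:
    """Crude spelling similarity between two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_pairs(refs, min_score=0.75, max_year_gap=60):
    """Propose pairs of references that might be the same person.

    The thresholds are arbitrary illustrative values.
    """
    pairs = []
    for a, b in combinations(refs, 2):
        if abs(a.year - b.year) > max_year_gap:
            continue  # too far apart in time to be plausible
        score = min(similarity(a.surname, b.surname),
                    similarity(a.forename, b.forename))
        if score >= min_score:
            pairs.append((score, a, b))
    # Most similar pairs first, so the reviewer sees the likeliest links early.
    pairs.sort(key=lambda p: p[0], reverse=True)
    return pairs


def link(refs, decide):
    """Keep only the candidate pairs that the human reviewer accepts.

    `decide(a, b)` stands in for the person's judgement and returns a bool.
    """
    return [(a, b) for _, a, b in candidate_pairs(refs) if decide(a, b)]


# Invented sample references, not taken from the Earls Colne records.
refs = [
    Reference("d1", "John", "Smith", 1600),
    Reference("d2", "John", "Smyth", 1610),
    Reference("d3", "Mary", "Harlakenden", 1605),
]
```

In interactive use, `decide` would display the two references and their source documents and wait for the reviewer's answer; the program narrows the field, but it never makes a link on its own.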
Thus we finally achieved our aim. We had devised a way of getting historical records into a computer without pre-structuring them, of holding them in a database, of linking together references to people to provide a linked database, and of querying this data. We hoped that, in years to come, the cost of hardware would drop so much that it would be possible to revive all the data and programs and run them on a small computer. Without support, the system soon became unusable in computerized form and was only available as microfiche. Despite the difficulty of using it in this form, a number of students found it such a rich resource that they based various pieces of work on it (these and other works emerging from the project are listed at the end of this account).
Alan Macfarlane and Sarah Harrison then applied the methods and a probabilistic retrieval system, working with Dr Porter, to a number of other sets of historical and anthropological materials in the 1980's. This included a project at King's Research Centre on the archives of the Portuguese Inquisition, and a project funded by the Nuffield Foundation and Leverhulme Trust to construct the first multi-media videodisc in the University (on the Nagas of Assam), which was linked to a book and Museum exhibition.
Meanwhile the unique set of historical records on Earls Colne awaited new developments in computing which would make our 1970's vision of what should be done a possibility. These developments occurred in the early 1990's as the power of micro-computers increased dramatically and, in particular, as the implications of the Internet became apparent. In 1994 we had considered the possibility of publishing the records on a CDROM, but when Dr Moody suggested that another of his Ph.D. students, Tim Mills, might help develop a Web Site for the Earls Colne data, it became immediately obvious that this was the way to proceed.
Since late 1995, with funding from the Renaissance Trust, Sarah Harrison worked with Tim Mills to link, order and adapt the Earls Colne data for a web site. This is not a trivial task for various reasons. Data preparation had previously ended in 1981 when computing facilities were relatively crude. With the possibility of modern computing and in particular using the data on line to cross-check, a number of errors, ambiguities and other remediable features were discovered. Furthermore, there are now new things we can do with maps and photographs. A few new sources have been discovered by other researchers, and the partially processed materials for the period 1750-1854, which were excluded from the previous microfiche, but were used to confirm linkages in the earlier data, were added. We also added the full modern English text of Josselin's Diary.
We tried to make a polished GUI (Graphical User Interface) and to keep updating this interface, and the system as a whole, as the computing environment changed. This was expensive and needed a well-resourced professional operation to support it. We worked out a collaborative project with a new, middle-sized computer company called Persimmon, whose headquarters were in North Carolina but which set up a branch in Cambridge in the summer of 1996, directed by Dr. Mike Challis. Persimmon housed and supported the Web Site physically, though access to it was routed through the University. This arrangement worked for a while but terminated in 1998 when the Persimmon research operation in Cambridge ceased. This experience made us realize that we would finally have to house the site within the University. But considerable developments were still needed before that, which called particularly on the constructive and continuing involvement of Tim Mills, who, on gaining his Ph.D., had joined the Olivetti and Oracle Research Lab (ORL), which was later taken over by AT&T to become AT&T Laboratories Cambridge. The Earls Colne database software was ported to the Windows operating system, on which it ran until the closure of the AT&T Cambridge Laboratory in 2002.
With the advent of XML, the data set was converted into this format, and XSLT scripts were written to transform the data into the current web site. This work was aided by CARET (Educational Technology Services). At this point, the representation of document identifiers was simplified from the form 0xxx.yyyyy to xxxyyyyy.
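The identifier change described above is a simple mechanical rewrite, sketched below. The sketch assumes that the `x` and `y` placeholders stand for decimal digits, which the account does not state, and the function name and the sample identifier are invented for illustration.

```python
import re

# Old form: a leading zero, three digits, a dot, then five digits.
# That x and y are digits is an assumption; the account gives only the pattern.
_OLD_ID = re.compile(r"^0(\d{3})\.(\d{5})$")


def simplify_id(old_id: str) -> str:
    """Convert an identifier from the form 0xxx.yyyyy to xxxyyyyy."""
    match = _OLD_ID.match(old_id)
    if match is None:
        raise ValueError(f"not in the form 0xxx.yyyyy: {old_id!r}")
    # Drop the leading zero and the dot, keeping the remaining eight digits.
    return match.group(1) + match.group(2)
```

With a made-up identifier, `simplify_id("0123.45678")` yields `"12345678"`; anything not matching the old pattern is rejected rather than silently passed through.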
There is further background material on Alan Macfarlane's web site under 'Projects - Earls Colne'.