Overrated your job title

From implicit to explicit structures: A historical address book in TEI #dhmasterclass

Address books can be a real treasure for a variety of reasons. Data brokers, for example, use addresses to derive an indicator of creditworthiness by means of geoscoring. This is why data protection advocates take to the barricades when addresses are passed on, or call for greater transparency about the purposes for which they are used. But data brokers are not the only ones who see addresses as a resource: historical address books are also of great interest to historians as a mirror of socio-historical conditions and developments.

The entry on Heinrich Heine in the historical address book - and in TEI-XML

As part of the #dhmasterclass, we dealt (among many other things) with the “Address Book of Germans in Paris for the Year 1854”. From a data-processing point of view, address books like this one contain biographical snapshots. In a one-line short story, they bring together names, places of residence and sometimes occupations. In this strongly compressed form, they unfold as biographies of possibilities: names sometimes hint at social status (nobiliary particles), marital connections (double names) and cultural roots (language). Street names or districts may indicate wealth or poverty. Job titles conjure up genre scenes in the reader's mind's eye and prompt thoughts about the everyday activities and living conditions of police officers, box makers, harmonica manufacturers or bacon dealers.

The address book also lists well-known personalities such as Heinrich Heine (see illustration), Adolphe Sax, the inventor of the saxophone (who was Belgian), and the composer Julie Massenet, who worked as a piano teacher (her family came partly from Alsace, and her youngest brother was the opera composer Jules Massenet).

These address books implicitly contain entire networks of professional groups and neighborhoods. People may well have known each other, especially if they lived in the same house or worked in the same trade within a district. For logistical reasons alone, it is plausible that businesses in the same trade, such as drapers and dyers, cluster together. Sophie Schneider visualized the content of the address book on a map with the help of georeferencing and was able to show which trades were widespread in which districts.

The address book currently exists as a table. After the #dhmasterclass, I asked myself whether a table is the appropriate form of data representation for this content (spoiler: yes and no). Tables are practical and feared, underestimated and overrated at the same time. Why? As a basic structure, a table can be very helpful at first: one unit per row, i.e. one address book entry as delimited by the layout, and then several properties per row. In the address book these are clearly name, address and occupation. The properties are uniform in every row. The researcher's first approach to the data consists in this strictly schematic segmentation into units of meaning. Sooner or later, however, you are very likely to come across an entry where the table principle reaches its limits. Classifying the names and distinguishing the name parts becomes a borderline case, for example, in the following entries:

ID | First name        | Surname  | Name suffix
1  |                   | Baden    | Gräfin von
3  | F.                | Boringer | und Comp.
4  | Joseph und Julius | Köchlin  | Gebr., aus Mühlhausen

In line (1) it is unclear whether “von” should be part of the surname and whether “Gräfin” (Countess) should not rather be classified as a title. In line (2) you should ask whether “Fr[äu]l[ein].” is not more a statement of status than an addition to a name (even if it was of course used that way). Line (3) seems to be less about a person and more about a company, and line (4) contains two names (possibly also a business name) with an indication of origin. This hair-splitting matters if one wants to do differentiated research with the data: it would be much easier to calculate the overall proportion of aristocratic persons or of companies if one could define separate categories such as “title of nobility” or “company name”, or if the properties could be differentiated by entry type (person: first name, surname, title, additional information / company: name, legal form, etc.). If you stick to the table model, this soon degenerates into an unmanageable number of columns.

Other special cases (not shown in the example) are entries that contain two addresses, e.g. place of residence and place of business. In some cases, job titles contain a description of patents. The handling of normalizations made by the editor, for example expanding different abbreviations for “Miss” (“Fräulein”) to one uniform spelling, also proves difficult. Additions in square brackets contaminate the machine-readable text because they are not made at the markup level; on the other hand, documenting normalizations is a scholarly necessity.

It therefore made sense to convert the table to TEI. Information that deviates from the strict tabular scheme can be differentiated and meaningfully described with TEI, and abbreviations can be documented in both their original and their normalized form. Returning to a tabular presentation remains easy to accomplish. A "TEIsated" entry could look something like this:

<person xml:id="row_1718">
  <forename>Lina</forename> <surname>Weigel</surname> <addName>Frl.</addName>
  <occupation ana="#trade">Modehändlerin</occupation>
  <street corresp="#rue-des-petits-champs">
    <name><supplied>rue</supplied> Neuve-des-Petits-Champs</name>
  </street>
  <district corresp="#arrondissement_2">2</district>
</person>

A company entry could look like this:

<org xml:id="row_1930">
  <orgName><surname>Köchlin</surname> <addName>Gebrüder</addName></orgName>
  <origPlace>aus <placeName>Mühlhausen</placeName></origPlace>
  <affiliation ana="#trade">Gedruckte Stoffe</affiliation>
  <street corresp="#rue-du-sentier">
    <name><supplied>rue du</supplied> Sentier</name>
  </street>
  <district corresp="#arrondissement_2">3</district>
</org>
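
The earlier remark that abbreviations can be documented in original and normalized form could, for instance, be realized with TEI's choice element, which pairs abbr (the printed form) with expan (the expansion). The following Python sketch is only one possible approach and not necessarily the encoding used in this project; the EXPANSIONS lookup table and both function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup table: abbreviation as printed -> normalized form.
EXPANSIONS = {"Frl.": "Fräulein", "Gebr.": "Gebrüder"}


def encode_abbreviation(printed: str) -> ET.Element:
    """Wrap a printed abbreviation in <choice><abbr/><expan/></choice>."""
    choice = ET.Element("choice")
    ET.SubElement(choice, "abbr").text = printed
    ET.SubElement(choice, "expan").text = EXPANSIONS[printed]
    return choice


def reading(element: ET.Element, normalized: bool) -> str:
    """Return the element's text, taking one branch of every <choice>."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == "choice":
            branch = child.find("expan" if normalized else "abbr")
            parts.append(branch.text or "")
        else:
            parts.append(reading(child, normalized))
        parts.append(child.tail or "")
    return "".join(parts)


add_name = ET.Element("addName")
add_name.append(encode_abbreviation("Frl."))

print(ET.tostring(add_name, encoding="unicode"))
print(reading(add_name, normalized=False))  # text as printed
print(reading(add_name, normalized=True))   # normalized text
```

With this kind of markup the square brackets disappear from the text itself: the printed and the normalized reading sit side by side, and each can be extracted on demand.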

The XML elements can be read like annotations on the text and, at the same time, used as a technical structure. The hierarchical structures are closely adapted to their respective content and therefore differ considerably from one another. During processing, flexible XPath expressions are then needed, for example to extract address information regardless of the entry type.
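
Extracting address information regardless of the entry type, and thereby returning to a tabular view, could be sketched as follows with Python's standard xml.etree.ElementTree module. The sketch assumes that the two entries shown above are properly closed, carry no TEI namespace, and are wrapped in a hypothetical root element called entries; the function names are likewise made up. ElementTree supports only a subset of XPath (no union operator), so the entries are selected in a loop, while the path .//street/name works for persons and organizations alike.

```python
import xml.etree.ElementTree as ET

# xml:id lives in the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The <entries> wrapper is a hypothetical container, not a TEI element.
SAMPLE = """
<entries>
  <person xml:id="row_1718">
    <forename>Lina</forename> <surname>Weigel</surname> <addName>Frl.</addName>
    <occupation ana="#trade">Modehändlerin</occupation>
    <street corresp="#rue-des-petits-champs">
      <name><supplied>rue</supplied> Neuve-des-Petits-Champs</name>
    </street>
    <district corresp="#arrondissement_2">2</district>
  </person>
  <org xml:id="row_1930">
    <orgName><surname>Köchlin</surname> <addName>Gebrüder</addName></orgName>
    <origPlace>aus <placeName>Mühlhausen</placeName></origPlace>
    <affiliation ana="#trade">Gedruckte Stoffe</affiliation>
    <street corresp="#rue-du-sentier">
      <name><supplied>rue du</supplied> Sentier</name>
    </street>
    <district corresp="#arrondissement_2">3</district>
  </org>
</entries>
"""


def text_of(element):
    """Join all nested text of an element and collapse whitespace."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def to_rows(root):
    """Flatten person and org entries into uniform table rows."""
    rows = []
    for entry in root.iter():
        if entry.tag == "person":
            forename = text_of(entry.find("forename"))
            surname = text_of(entry.find("surname"))
            name = f"{forename} {surname}".strip()
        elif entry.tag == "org":
            name = text_of(entry.find("orgName"))
        else:
            continue  # skip the wrapper and all nested elements
        rows.append({
            "id": entry.get(XML_ID),
            "type": entry.tag,
            "name": name,
            # The same relative paths work for both entry types.
            "street": text_of(entry.find(".//street/name")),
            "district": text_of(entry.find(".//district")),
        })
    return rows


for row in to_rows(ET.fromstring(SAMPLE)):
    print(row)
```

A full XPath processor could select both entry types in one expression such as //person | //org; the loop is merely a workaround for the standard library's limited XPath support.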

The data quality can be increased even further, for example by adding authority data for personal names or standardized designations for occupations. As a small experiment, I used a (by now somewhat dated) web service that suggests authority records based on basic information (e.g. name, period, place of activity). Unfortunately, this was unsuccessful for the majority of the entries, since most of these people were too little known ever to have been recorded in an authority file (and in many cases it remains doubtful whether they ever will be). On the other hand, the system produced interesting suggestions for aristocrats and for well-known artists and politicians, i.e. public figures.

Incidentally, making a TEI document out of a table is more difficult than expected. With a "character (sic) separated list", problems arise quickly if commas or quotation marks were used within individual cells, because escape characters or entities do not always work the way you want them to. Conversion routines also like to skip empty table rows to save memory and record the omission in some other way instead. If you overlook this, the renaming of the columns will go wrong at first.
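
The two pitfalls just mentioned can be illustrated with a small Python sketch based on the standard csv module. It is not the conversion routine actually used here: the RAW export is made up for the demonstration (the last data row is entirely invented), the function name is hypothetical, and every row is naively mapped to a person element, although deciding between person and company is exactly the editorial question discussed earlier.

```python
import csv
import io
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Made-up export: one cell contains a comma, one contains quotation
# marks (doubled inside a quoted cell), and one row is empty.
RAW = (
    'ID,First name,Surname,Name suffix\n'
    '3,F.,Boringer,und Comp.\n'
    '\n'
    '4,Joseph und Julius,Köchlin,"Gebr., aus Mühlhausen"\n'
    '5,Anna,Muster,"genannt ""die Lange"""\n'
)


def rows_to_tei(raw: str) -> list[ET.Element]:
    """Turn CSV text into a list of (naive) TEI-like person elements."""
    reader = csv.reader(io.StringIO(raw, newline=""))
    header = next(reader)
    entries = []
    for cells in reader:
        if not cells:
            continue  # empty row: skip it explicitly instead of misaligning
        if len(cells) != len(header):
            # A broken quote or stray comma shifts the columns; fail loudly.
            raise ValueError(f"unexpected number of cells: {cells!r}")
        record = dict(zip(header, cells))
        person = ET.Element("person", {XML_ID: f"row_{record['ID']}"})
        for column, tag in (("First name", "forename"),
                            ("Surname", "surname"),
                            ("Name suffix", "addName")):
            if record[column]:
                ET.SubElement(person, tag).text = record[column]
        entries.append(person)
    return entries


for entry in rows_to_tei(RAW):
    print(ET.tostring(entry, encoding="unicode"))
```

Checking the number of cells against the header catches shifted columns early, which is where a later renaming of the columns would otherwise silently go wrong.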

All in all, dealing with this address book has already turned out to be a complex task that poses numerous challenges in terms of content, structure and technology. The work on it is still ongoing. And this is just one of many historical address books which, with their concentrated information in TEI-encoded form, represent an immensely valuable source for digital historical research, and which are likely to unfold further synergies when linked or combined with other resources (lists of foreigners, court records, etc.).