Strategies for the Development of Databases

History of Science, Medicine and Technology.
Bibliography of Primary Sources

Auxiliary databases

Introduction

    In many bibliographical databases there are codes and other pieces of information that appear in many different entries and that are best treated as independent auxiliary units. In an actual implementation, they may be separate databases linked to the main databases, or tables belonging to the main databases. The implementation will not be discussed here; only the main concepts will be presented.

Authorities database

    Library catalogues usually rely on authorities databases to produce uniform personal names in their entries. The auxiliary database proposed here is somewhat more complex, and more useful for historians of science.

    We suggest that the database should contain the following fields:

1. Personal code
    A unique code identifying a single person. It may be a simple numeric code, or a mnemonic code. One possibility is a code using the first letter of the first name, the following consonant of the first name, the first letter of the family name, the following consonant of the family name and, whenever necessary, a number. "Roberto Martins" would have a code such as rbmr3, for instance. This code will be used in the main databases, to provide a link to the auxiliary database.

2. Name(s) of the person
    Yes, a single person can have several names. This may occur in several circumstances:
* personal name and title (for instance: William Thomson = Lord Kelvin)
* literary name, pseudonym
* nickname, short name
* personal name in different languages (for instance: Descartes and Cartesius)
* old and new forms of the same name (for instance: in Portuguese, José is the current form of the older Joseph)
* variant orthographies
* names originally written in non-European languages
    My suggestion is that all known forms of a person's name should be entered in the authorities database. Subfields should identify the different cases described above. The "standard name" used by librarians should also be added.
    In the main bibliographic databases, my suggestion is that each personal name should be entered exactly as it appears in the corresponding publication (on the title page, by default).

3. Date of birth and death
    If the exact day is known, it should be entered in the database. Otherwise, the year should be given. In many cases even the years are unknown or doubtful, but it is still possible to enter the decade or century.

4. Occupation(s) or profession(s) of the person
    This is useful information for historians.

5. Cities or countries associated with the person
    City and/or country and/or region where the person was born and/or died and/or produced his or her works. This field will use codes from the geographical database.

6. Sources of information about the person
    It is necessary to identify the work from which the biographical information was obtained, and to provide a specific reference (volume and page, or sometimes a reference item number). Each source of information will be identified by a code, and will be described by the sources of information auxiliary database. Of course, it is not necessary to produce a detailed bibliography on each person.
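
    The mnemonic personal code described in field 1 can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, accents are dropped before the letters are picked, "y" is treated as a consonant, and the number added to distinguish identical codes starts at 2. None of these details is fixed by the proposal itself.

```python
import unicodedata

VOWELS = set("aeiou")  # assumption: "y" and all other letters count as consonants


def ascii_letters(name: str) -> str:
    """Lowercase the name and drop accents, spaces and punctuation."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed.lower() if "a" <= ch <= "z")


def two_letters(name: str) -> str:
    """Return the first letter plus the next consonant that follows it."""
    letters = ascii_letters(name)
    if not letters:
        raise ValueError(f"no usable letters in {name!r}")
    following = next((ch for ch in letters[1:] if ch not in VOWELS), "")
    return letters[0] + following


def personal_code(first_name: str, family_name: str, taken: set) -> str:
    """Build a code such as 'rbmr', adding a number only when it is already taken."""
    base = two_letters(first_name) + two_letters(family_name)
    if base not in taken:
        return base
    number = 2  # assumption: the second person with the same base gets the number 2
    while f"{base}{number}" in taken:
        number += 1
    return f"{base}{number}"


print(personal_code("Roberto", "Martins", set()))
print(personal_code("Roberto", "Martins", {"rbmr", "rbmr2"}))
```

    With two earlier holders of the same base code, the second call yields rbmr3, the form used in the example given for field 1.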

    Historians should be able to search this database, finding relevant authors, and then use the author entries to search the main bibliographical databases. Also, when searching for specific documents, it should be possible to use any variant of the author's name.
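
    A minimal sketch of such a search, assuming the authorities database is available as a mapping from personal codes to typed name forms, and that each bibliographic entry stores the personal code next to the name as printed. The record layout, the codes, and the placeholder titles are hypothetical.

```python
from collections import defaultdict

# Hypothetical authority records: personal code -> list of (kind of name, name).
AUTHORITIES = {
    "wlth": [("personal", "William Thomson"), ("title", "Lord Kelvin")],
    "rnds": [("personal", "Descartes"), ("latin", "Cartesius")],
}

# Hypothetical main-database entries; titles are placeholders, not real works.
BIBLIOGRAPHY = [
    {"title": "(work A)", "name_as_printed": "William Thomson", "author_code": "wlth"},
    {"title": "(work B)", "name_as_printed": "Lord Kelvin", "author_code": "wlth"},
    {"title": "(work C)", "name_as_printed": "Cartesius", "author_code": "rnds"},
]


def build_name_index(authorities):
    """Map every known form of a name (case-insensitively) to personal codes."""
    index = defaultdict(set)
    for code, names in authorities.items():
        for _kind, name in names:
            index[name.casefold()].add(code)
    return index


def find_entries(query, name_index, bibliography):
    """Return all entries whose author is known under the queried name form."""
    codes = name_index.get(query.casefold(), set())
    return [entry for entry in bibliography if entry["author_code"] in codes]


index = build_name_index(AUTHORITIES)
for entry in find_entries("Lord Kelvin", index, BIBLIOGRAPHY):
    print(entry["title"], "/", entry["name_as_printed"])
```

    Because the lookup goes through the personal code, a query for "Lord Kelvin" also retrieves the entry printed under "William Thomson", while the main database keeps each name exactly as it appears in the publication.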

Geographical database

    Library databases usually ascribe codes to countries and other geographical regions. Our proposal is a little bit different from that. The aim of this auxiliary geographical database is to allow geographical searches by country, region, and by any variant of a city's name.

    We suggest that the database should contain the following fields:

1. Place code
    A code identifying the city, state, province, country, region, continent, etc.
    There are standard library codes (sometimes national ones) for many of these.

2. Level code
    A code identifying the type of geographical information (city, state, province, country, region, continent, etc.)

3. Names of the place
    A geographical place can have several names, especially under the following circumstances:
* places that have different names in different periods
* geographical name in different languages (for instance: Lutetia = Paris)
* old and new form of the same name
* variant orthographies
* names originally written in non-European languages
    The "standard name" used by librarians should also be added.
    In the case of the most famous cities, and of countries, it is common to have different "translations" of the name. This auxiliary database should contain the geographical names in all languages that can be used when the main databases are searched. For instance: if the main databases can be searched in English, French, and Portuguese, there will be several translations of the name of the Italian city "Firenze": Florence (English), Florence (French), Florença (Portuguese). A sub-field should identify the language of each translation, using the codes of the languages database.

4. Upper level connections
    In the case of cities, the code of the state or province to which they belong (or of the country, if the country has no subdivisions); in the case of states and provinces, the code of the country to which they belong; etc.
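
    The four fields above are enough to support searches by any variant of a name and by any enclosing region. The following sketch assumes hypothetical place codes, three-letter language codes, and, for brevity, cities linked directly to their countries.

```python
# Hypothetical records: place code -> level, names by language code, upper level.
PLACES = {
    "FIR": {"level": "city", "upper": "IT",
            "names": {"ita": ["Firenze"], "eng": ["Florence"],
                      "fre": ["Florence"], "por": ["Florença"]}},
    "PAR": {"level": "city", "upper": "FR",
            "names": {"fre": ["Paris"], "lat": ["Lutetia"]}},
    "IT": {"level": "country", "upper": "EUR",
           "names": {"eng": ["Italy"], "fre": ["Italie"], "por": ["Itália"]}},
    "FR": {"level": "country", "upper": "EUR",
           "names": {"eng": ["France"], "fre": ["France"], "por": ["França"]}},
    "EUR": {"level": "continent", "upper": None,
            "names": {"eng": ["Europe"]}},
}


def codes_for_name(name):
    """Find the place codes that carry the given name in any language."""
    wanted = name.casefold()
    return {
        code
        for code, place in PLACES.items()
        for variants in place["names"].values()
        for variant in variants
        if variant.casefold() == wanted
    }


def is_within(code, ancestor):
    """Follow the upper level connections from code until ancestor is met."""
    while code is not None:
        if code == ancestor:
            return True
        code = PLACES[code]["upper"]
    return False


def places_within(name):
    """All place codes equal to, or contained in, the named place."""
    ancestors = codes_for_name(name)
    return sorted(c for c in PLACES if any(is_within(c, a) for a in ancestors))


print(codes_for_name("Lutetia"))
print(places_within("Italy"))
```

    A search for documents published in "Italy" can then be expanded into the list of place codes returned by places_within before the main databases are queried.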

CAVEAT: Due to political changes, a city may belong to some country during some period, and to a different country during another. Also, countries appear and disappear in time. It would be possible to circumvent this problem by specifying the period during which a city belonged to a given country, etc. (but this would be very complicated).
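
    To give an idea of what period-dependent connections would involve, here is a minimal sketch. It assumes a hypothetical city CITY1 that belonged to country XY up to 1900 and to country WZ from 1901 onwards; the record layout and the year-level granularity are assumptions, not part of the proposal.

```python
from typing import NamedTuple, Optional


class Membership(NamedTuple):
    place: str            # code of the lower-level place
    upper: str            # code of the upper-level place
    start: Optional[int]  # first year of the period, None if open-ended
    end: Optional[int]    # last year of the period, None if open-ended


# Hypothetical data: one city, two successive countries.
MEMBERSHIPS = [
    Membership("CITY1", "XY", None, 1900),
    Membership("CITY1", "WZ", 1901, None),
]


def upper_level_in(place, year, memberships=MEMBERSHIPS):
    """Return the upper-level code valid for the place in the given year."""
    for m in memberships:
        if m.place != place:
            continue
        after_start = m.start is None or m.start <= year
        before_end = m.end is None or year <= m.end
        if after_start and before_end:
            return m.upper
    return None


print(upper_level_in("CITY1", 1850))
print(upper_level_in("CITY1", 1950))
```

    The extra cost is that every geographical search must also carry a date, normally the publication date of the document.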

Languages database

    Librarians use a standard set of codes to represent different languages. We suggest that the same set of codes should be used, but together with the identification of the language names in different languages:

1. Language code
    Standard (Library of Congress, or any other) code identifying a language.

2. Language names
    The full name of each language, in all languages that can be used when the main databases are searched. For instance: if the main databases can be searched in English, French, and Portuguese, there will be several translations of the language "French": French (English), Français (French), Francês (Portuguese). A sub-field should identify the language of each translation.

    When searching the main databases for specific documents, it should be possible to use any variant of the language's name.
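
    A minimal sketch of this lookup, assuming three-letter library language codes and a small hypothetical table of names; a real table would cover every language present in the main databases.

```python
# Hypothetical table: language code -> name of that language, keyed by the
# code of the language in which the name is written.
LANGUAGES = {
    "fre": {"eng": "French", "fre": "Français", "por": "Francês"},
    "eng": {"eng": "English", "fre": "Anglais", "por": "Inglês"},
    "por": {"eng": "Portuguese", "fre": "Portugais", "por": "Português"},
}


def language_code(name):
    """Return the code of the language known under the given name, if any."""
    wanted = name.casefold()
    for code, names in LANGUAGES.items():
        if any(variant.casefold() == wanted for variant in names.values()):
            return code
    return None


for query in ("French", "français", "Francês"):
    print(query, "->", language_code(query))
```

    All three queries resolve to the same code, which is then used to search the main databases.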

Subjects database

    Library databases usually classify books using a standard set of subject entries, sometimes with a numeric code (for instance, Dewey Decimal Classification, Universal Decimal Classification, Library of Congress subject classification). It is very useful to adopt one of those standard classification schemes, together with its codes, because it is easier to control the subject entries and because the standard classification subjects have already been translated into several languages. Therefore, once the code is known, it is possible to know the equivalent subject in several languages. Conversely, if someone searches the database using subjects described in any of the available languages, the subject codes will allow him or her to find the relevant entries. Besides that, the structure of those classification schemes makes it possible to start a search with a very general subject (for instance, medicine) and then move to increasingly specific subjects.

    The structure of this database is very simple:

1. Subject code
    Standard (DDC, UDC, Library of Congress, or any other) code identifying the subject.

2. Subject, in several languages
    The full subject, in all languages that can be used when the main databases are searched. A sub-field should identify the language of each translation.
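
    The two uses described above, searching by a subject name in any language and moving from general to specific subjects, can be sketched as follows. The sketch assumes three-digit Dewey-style codes in which trailing zeros are padding, so that a narrower subject shares the leading digits of a broader one; the subject labels and the entries are illustrative, not official captions.

```python
# Hypothetical table: subject code -> subject name by language code.
SUBJECTS = {
    "610": {"eng": "Medicine", "fre": "Médecine", "por": "Medicina"},
    "616": {"eng": "Diseases", "fre": "Maladies", "por": "Doenças"},
    "530": {"eng": "Physics", "fre": "Physique", "por": "Física"},
}

# Hypothetical main-database entries, each carrying one subject code.
ENTRIES = [
    {"id": 1, "subject": "616"},
    {"id": 2, "subject": "530"},
    {"id": 3, "subject": "610"},
]


def subject_code(term):
    """Return the code of the subject known under the given name, if any."""
    wanted = term.casefold()
    for code, names in SUBJECTS.items():
        if any(name.casefold() == wanted for name in names.values()):
            return code
    return None


def entries_for_subject(term, entries=ENTRIES):
    """Entries classified under the subject or under any narrower subject."""
    code = subject_code(term)
    if code is None:
        return []
    prefix = code.rstrip("0")  # assumption: trailing zeros are only padding
    return [entry for entry in entries if entry["subject"].startswith(prefix)]


print([e["id"] for e in entries_for_subject("Medicina")])
print([e["id"] for e in entries_for_subject("Maladies")])
```

    Searching for the general subject returns both the general and the more specific entries, whereas searching for the specific subject returns only the latter.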

Libraries and archives database

    The bibliographical databases will contain information about documents that can be found in several different libraries and archives, all over the world. Each repository should be identified by a code, in the bibliographic entries, but it is necessary to have an auxiliary database (or table) containing the full description of the repository.

    The structure of this database is also very simple:

1. Library or archive code
    We suggest a mnemonic code, built from the initial letters of the library or archive name. The British Library will be identified as BL, the Library of Congress as LC, the Bibliothèque Nationale de France as BNF, and so on. When necessary, a number can be added to the initials, to distinguish similar codes.

2. Library or archive name, in several languages
    The full name of the repository, in all languages that can be used when the main databases are searched. A sub-field should identify the language of each translation. Besides that, it is useful to add variant names, when a library or archive had different names in different times.

3. City code of the library or archive
    The geographical code (from the geographical database) identifying the place where the repository is situated.

4. Library or archive address
    The full address of the library or archive. Street addresses are usually long-lived. Telephone numbers, e-mail addresses, fax numbers and similar information change quickly, and keeping the database up to date becomes difficult if they are included.

5. Link to the library or archive
    The locator (URL) of the Internet site of the library or archive should also be included, when known, even if it is likely to change in the future.
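
    The mnemonic code of field 1 can be produced mechanically. The sketch below is one possible reading of the rule: it assumes that short connecting words are skipped (the stop-word list is hypothetical) and that the distinguishing number starts at 2.

```python
# Assumption: connecting words such as "of" and "de" do not contribute initials.
STOP_WORDS = {"of", "the", "de", "du", "la", "le", "da", "do"}


def repository_code(name, taken):
    """Build a code from the initials of the name, numbered if already taken."""
    words = [w for w in name.split() if w.casefold() not in STOP_WORDS]
    base = "".join(word[0].upper() for word in words)
    if base not in taken:
        return base
    number = 2  # assumption: numbering of similar codes starts at 2
    while f"{base}{number}" in taken:
        number += 1
    return f"{base}{number}"


taken = set()
for name in ("British Library", "Library of Congress",
             "Bibliothèque Nationale de France", "Bodleian Library"):
    code = repository_code(name, taken)
    taken.add(code)
    print(code, "=", name)
```

    The last name in the loop collides with the code already given to the British Library and therefore receives a number.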

Sources of information database

    This is an auxiliary bibliographical database describing the secondary and tertiary printed sources used to obtain the information entered in the main databases. Suppose, for instance, that Bolton's Catalogue of Scientific and Technical Periodicals is one of the sources of information used while building the periodicals database.

    In that case, for each entry including information extracted from that source, in the "source of information" field one should enter the code corresponding to Bolton's book, and a specific reference (volume, page, reference number, etc.). In this way, it will be possible to check the information contained in the database.

    The structure of this database will be similar to that of the main books and articles databases, since it should contain the bibliographical description of the source. However, it is necessary to introduce one additional field:

1. Source of information code
    We suggest a mnemonic code, built from the initial letters of the work title. For instance, Bolton's book could have the code CSTP.

    This database need not contain all fields used in the main database. A very simple bibliographic description will be enough for the sources of information.

Special problems: merging auxiliary databases

    When a set of databases is produced at a single institution, coherence and compatibility can be easily obtained. However, in an international project, several problems can arise when databases of different origins are merged. Some of them may have used different codes for the same information, or the same code with different meanings.

    In the best of all worlds, every country would use the same codes; but what can be done if they did not?

    Let us suppose that in country XY, the code STP was used to represent Bolton's Catalogue of Scientific and Technical Periodicals. In country WZ, the code STP was used to represent Kronick's Scientific and Technical Periodicals, and the code CSTP was used to represent Bolton's Catalogue of Scientific and Technical Periodicals.

    We have the same work (Bolton's book) represented by two different codes (STP and CSTP), and the same code (STP) used to represent two different books. Of course, if the databases from countries XY and WZ are merged without any change, many problems will arise.

    The problem can be solved if all peculiar codes of each country (or each project) are identified by an additional piece of information, corresponding to the country (or project) from which the information came. For instance: suppose that we identify the whole database coming from country XY by the code XYDB, and the database coming from country WZ by the code WZDB. All codes used in XYDB should then be renamed as XYDB+code, and all codes coming from WZDB as WZDB+code. The code that was identical in both databases (STP) will become XYDB+STP in the first one and WZDB+STP in the second one. Each entry in the main databases will now point to the correct source of information, and no conflict will arise. Of course, the same source of information may end up with several different codes, but this will not create any conflict or misunderstanding.
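
    The renaming step can be sketched as follows, using the XY and WZ example. The dictionaries stand in for the sources of information tables of the two countries, and the entry layout with its "source_code" field is hypothetical.

```python
# Sources of information tables as they exist in each country.
XY_SOURCES = {
    "STP": "Bolton, Catalogue of Scientific and Technical Periodicals",
}
WZ_SOURCES = {
    "STP": "Kronick, Scientific and Technical Periodicals",
    "CSTP": "Bolton, Catalogue of Scientific and Technical Periodicals",
}

# Hypothetical main-database entries pointing to their source of information.
XY_ENTRIES = [{"id": "xy-1", "source_code": "STP"}]
WZ_ENTRIES = [{"id": "wz-1", "source_code": "STP"},
              {"id": "wz-2", "source_code": "CSTP"}]


def rename_sources(db_code, sources):
    """Prefix every source code with the code of its database of origin."""
    return {f"{db_code}+{code}": text for code, text in sources.items()}


def rename_entries(db_code, entries):
    """Apply the same prefix to the codes used inside the main entries."""
    return [{**entry, "source_code": f"{db_code}+{entry['source_code']}"}
            for entry in entries]


merged_sources = {**rename_sources("XYDB", XY_SOURCES),
                  **rename_sources("WZDB", WZ_SOURCES)}
merged_entries = (rename_entries("XYDB", XY_ENTRIES)
                  + rename_entries("WZDB", WZ_ENTRIES))

for entry in merged_entries:
    print(entry["id"], "->", merged_sources[entry["source_code"]])
```

    After the merge, Bolton's book is reachable under both XYDB+STP and WZDB+CSTP, and every entry still resolves to the source it originally cited.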
 




Roberto de Andrade Martins
roberto.andrade.martins@gmail.com
Group of History, Theory of Science and Teaching
Document version 1, 21 April 2003