Meaningful filenames

From Gramps
Revision as of 21:22, 2 August 2008

After thinking about the limits to how we can structure our files and folders (see Portable_Filenames), the next step is developing a semantic controlled vocabulary.

Before launching too deep into this, let's look at what we want to achieve.

  • Understandable filenames
  • Computer readable filenames
  • A system simple enough to remember

To be understandable we need to be able to use full words where appropriate.

To be computer readable we need to separate the parts in a way that a script can easily recognise and, more importantly, in a way that would never occur in real language. It would be no good to mark a name section with the word name if the word name can also appear elsewhere in the file name where it is not meant to be a marker.

To be simple enough to remember the system should not be too complicated. After all, GRAMPS is meant to store the real information; this is just a supplement.

What's in a name?

It would be nice if we could have files called

Marriage of Mary Angus Jones and Matthew Williams, 2nd Dec 1923 (William Angus is to Mary's right).jpg

But this meets only one of the criteria above, that of understandable filenames. How can a computer know who got married, what their surnames are, and so on? And anyway, because of the limitations described in Portable_Filenames, we can't have file names like that. We have to drop the reliance on capitalisation, drop the spaces, drop the comma and drop the brackets. To be computer readable we need to separate the sections with a system of markers to indicate where the surname, event name etc. are.

So what sections do we want to be able to identify? Here's a basic list that should be enough for most situations. Remember that GRAMPS stores the more complex information; we're just trying to give a useful structure to our files.

  • Surname
  • Firstname
  • Date
  • Event type
  • Place
  • Source
  • Note

Some more important criteria. All file names:

  • Must be unique
  • Must have all necessary information
  • Must have no more information than necessary

So if I find a file somewhere strange in my system, or if someone I sent a file to seven years ago says "that file you sent me - that's not Jean it's her daughter" I know where my archive copy of that file will be.

GEDCOM based

This is a system contributed by Duncan Lithgow.

Tags

If we base a naming system on the 3 and 4 letter Lineage-Linked GEDCOM Tag Definition used in the GEDCOM 5.5 standard, we have a good long list of tags to choose from. By limiting the GEDCOM tags list we can make the following shortlist (which does not include events):

AUTH-- Author "The name of the individual who created or compiled information."
DATE-- Date
EVEN-- Event "A noteworthy happening related to an individual, a group, or an organization."
GIVN-- Given name "A given or earned name used for official identification of a person."
NAME-- Name, use only if GIVN and SURN are not known "A word or combination of words used to help identify an individual, title, or other item. More than one NAME line should be used for people who were known by multiple names."
NOTE-- Note "Additional information provided by the submitter for understanding the enclosing data."
PLAC-- Place "A jurisdictional name to identify the place or location of an event."
REFN-- Reference "A description or number used to identify an item for filing, storage, or other reference purposes."
SOUR-- Source "The initial or original material from which information was obtained."
SURN-- Surname "A family name passed on or used by members of a family."
TITL-- Title "A description of a specific writing or other work, such as the title of a book when used in a source context, or a formal designation used by an individual in connection with positions of royalty or other social status, such as Grand Duke."

Each marker ends with two hyphens (--). Two are needed because we can't rely on the marker being recognised by its capitalisation: with a single hyphen, a surname like Besour-Jean could be read as beSOUR-Jean, and the system would think that SOUR- marks a source section.
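The reasoning about the double hyphen can be checked with a short sketch. It assumes Python and a case-insensitive search for just two of the markers; the variable names are only for illustration and are not part of GRAMPS.

```python
import re

# A surname section whose value happens to contain the letters "sour-".
name = "SURN--besour-jean"

# With a single hyphen as marker ending, "sour-" inside the surname is
# picked up as if it were a SOUR marker.
single = re.findall(r"(SOUR|SURN)-", name, re.IGNORECASE)

# With two hyphens, only the real marker at the start is found.
double = re.findall(r"(SOUR|SURN)--", name, re.IGNORECASE)

print(single)  # ['SURN', 'sour']
print(double)  # ['SURN']
```

The double hyphen only works as long as values themselves never contain two hyphens in a row.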

Punctuation

For the file name to be parsed as meaningful text, I think we would also need:

_ Underscore to represent a space
__ Double underscore to represent a comma followed by a space
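As a minimal sketch of how these two punctuation rules might be reversed by a script, assuming Python (the function name `decode_punctuation` is hypothetical):

```python
def decode_punctuation(text):
    """Turn file-name punctuation back into readable text."""
    # Replace the double underscore first; otherwise each of its halves
    # would be read as a single space.
    return text.replace("__", ", ").replace("_", " ")


print(decode_punctuation("uk__england__london"))  # uk, england, london
```

The order of the two replacements matters, which is the only subtle point in this rule.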

Source events

The GEDCOM 5.5 standard defines so few events as to be useless. The GRAMPS XML schema defines no events, as these can be made by the user. This all seems fair enough, since events are highly culture based. The situations where I think a set of events should be defined are those connected with source records. GEDCOM has a reasonable group of those, but they are heavily based in Western Christian culture. The solution must be language and culture dependent. Here's my list:

marriage is for an actual marriage event and all the associated documentation, including possible divorce and separation documentation
birth is for actual birth records and christening records
death is for death records
census is for census records
civic is for military service records and government records of any type
health is for health records

An event image file

File name:

EVEN--marriage_SURN--jones_GIVN--mary-jean_SURN--williams_GIVN--matthew_DATE--1923-12-02_NOTE--william_angus_to_right_of_mary.jpg

This could be parsed (by GRAMPS?) as the description:

Event: Marriage
Surname: Jones
Given name: Mary-jean
Surname: Williams
Given name: Matthew
Date: 2nd Dec, 1923
Note: William angus to the right of mary

or it could make the text:

Mary-jean Jones and Matthew Williams, marriage 2nd Dec 1923. (William angus to the right of mary)
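A rough sketch of how a script might split such a file name into its tagged sections, assuming Python. The names `TAGS`, `MARKER` and `parse_tagged_name` are hypothetical, and GRAMPS itself is not known to offer such a parser. Markers are matched without regard to case, since capitalisation cannot be relied on.

```python
import re
from pathlib import PurePath

# The shortlist of GEDCOM tags used as section markers.
TAGS = ("AUTH", "DATE", "EVEN", "GIVN", "NAME", "NOTE",
        "PLAC", "REFN", "SOUR", "SURN", "TITL")

# A marker is a tag followed by two hyphens, found at the start of the
# name or after the single underscore that ends the previous section.
MARKER = re.compile(r"(?:^|_)(" + "|".join(TAGS) + r")--", re.IGNORECASE)


def parse_tagged_name(filename):
    """Return the sections of a tagged file name as (tag, value) pairs."""
    stem = PurePath(filename).stem  # drop the extension
    pieces = MARKER.split(stem)
    # pieces[0] is whatever precedes the first marker (normally empty);
    # after that, tags and values alternate.
    return [(tag.upper(), value)
            for tag, value in zip(pieces[1::2], pieces[2::2])]


for tag, value in parse_tagged_name(
        "EVEN--marriage_SURN--jones_GIVN--mary-jean_DATE--1923-12-02.jpg"):
    print(tag, value)
```

A list of pairs is used rather than a dictionary because tags such as SURN and GIVN repeat when a record involves more than one person.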

A source image file

File name:

SOUR--census_PLAC--uk__england__london_DATE--1840-03-21_SURN--jones_GIVN--mary-jean.pdf

This could be parsed (by GRAMPS?) as the description:

Source: Census
Place: Uk, england, london
Date: 21st March, 1840
Surname: Jones
Given name: Mary-jean

or it could make the text:

Census, Place: Uk, england, london, on 21st March 1840. This is a source connected to Mary-jean Jones

A source text

File name:

SOUR--publication_TITL--the_jones_family_from_1735_AUTH--mary_jean_jones_DATE--1872.pdf

This could be read as the description:

Source: Publication
Title: The Jones Family from 1735
Author: Mary Jean Jones
Date: 1872

Or it could make the text:

"The Jones Family from 1735" by Mary Jean Jones, 1872

SWOT analysis

Over at Wikipedia there is a good explanation of a SWOT analysis.

Aspect: File length
Strengths: Holds a lot of information
Weaknesses: All the information is already in the genealogy software
Opportunities: Easily recognised. Easy to search for files with a certain tag
Threats: ?

GRAMPS ID based

This is another attempt by Duncan Lithgow to find a good system. It is not finished, so feel free to add comments and correct any obvious mistakes.

Here are the records we'll use as examples. They involve Mary Agnes Williams (daughter of John Williams and Anna Matthews). She married Anders Sørensen (son of Anders Sørensen and Anna ?) and they had a daughter, Anna Sorensen; note the spelling change.

  • Census record: mentioning her and her siblings and parents. It is from the 1810 census in the London parish of Dangerfield, on Saint John's Road.
  • Portrait: a hand-drawn portrait of Mary, undated, assumed to be from before her marriage.
  • House picture: her parents' Saint John's Road row house in London, from some time around the 1810s.
  • Court record: Anders Sørensen was before the district court for drunk and unbecoming behaviour on January 3rd, 1820. Engelfield, London.
  • Marriage certificate: She married Anders Sørensen, 2nd December 1823, in London.
  • Wedding portrait: in the picture is Anders Sørensen's father, also called Anders Sørensen (on the back it says that the mother of Anders Sørensen, the son, is called Anna).
  • Birth certificate: of Anna Sorensen (daughter of Mary and Anders), dated January 18th, 1824.
  • Family tree: a handwritten family tree called "The Dean family from 1735" by an Angus Dean, written in 1972, which connects the families Dean and Williams.

Justification

Needs expanding

Aims

This system tries to meet the following aims:

  • simple enough to remember
  • just enough information, and no more
  • all media for one family name is under one directory (portability for travel)
  • all media for generating reports is under one directory (for portability)

Record types

The record types tell us what the record is about. GRAMPS IDs use the first character to denote the type of item the ID refers to. Sticking to something already thought out, and taking the ones most relevant to stored records, these can be used as the following tags for record types:

  • I-- Individual
  • P-- Place
  • E-- Event
  • S-- Source

(see also source types)

Question

  • Records about repositories?
  • Correspondence with family?
  • What about records covering more than one type?
  • What will happen on old 8+3 file systems?

Record properties

Properties tell us just enough information to make the file name meaningful and recognisable, and split this information up so we can search for parts of it with our file manager. It's the what, where, when, why and how of what's in the record.

By making all properties of each record compulsory we avoid extra tags like GN for given name and so on. We can see what a property is by where it is in the file name.

  • family name is their surname before marriage, but including deed poll changes, MacArthur for example
  • given name is their official first name
  • uid is a unique identity, in this example the (original) GRAMPS ID of the media file
  • source date is the date in ISO 8601 format when the information left the people or organisation responsible for it
  • event date is the date in ISO 8601 format (YYYY-MM-DD) when the event occurred or started, e.g. 2008-12-28
  • event type is a noun describing the event, chosen from a list of event types, e.g. marriage
  • title is the name of a document (book, letter, census) or object (gravestone, heirloom), e.g. williams__arthur_headstone
  • source author name is the name of the person or organisation most responsible for the information. For people always use family name first followed by two underscores (__), e.g. dean__angus; an organisation is written as is, e.g. church_of_lds
  • note is for notes. Names should always be family name first followed by a double underscore

Naming structure

Now we can outline a single schema for all record types in which the following rules apply.

  • File names are typed in directly, not copied from another program.
  • File names start with a single capital letter representing their record type.
  • Record properties are separated by two dashes (--). Two dashes cannot be used for anything else.
  • Missing information is replaced by a single underscore (_).
  • Names in notes should always be family name first, with the given name after two underscores, e.g. doe__john, which can be represented as John Doe or Doe, John.
  • Place names should start with the largest geographical region, followed by a double underscore before the next geographical region, e.g. oz__far_far_away__yellow_brick_road, which can be represented as Oz, Far far away, Yellow brick road.
  • If the family name is unknown it must be replaced by an underscore. This gives three consecutive underscores (___), e.g. ___john, which should always be interpreted as meaning [no record], John.
  • Event types should always be drawn from a list, to avoid different words being used for the same event type. (Maybe use the event list GRAMPS uses?)

Each file name is then built from these parts, in this order:

  1. <record type>-- (I, P, E or S)
  2. <source type/event type>-- (needs expansion)
  3. <1st person's family name[__2nd person's family name]>-- (two names for couples or families, alphabetical)
  4. <1st person's given name(s)[__2nd person's given name(s)]>-- (two names for couples, same order as for family names)
  5. <country code__region__city>-- (use as many divisions as needed)
  6. <date>-- (ISO date, YYYY-MM-DD)
  7. <note>-- (usually not needed)
  8. <uid> (a unique ID, possibly derived from the GRAMPS ID), followed by the file extension
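
The rules and the eight parts above could be put together by a small script. The following is only a sketch, assuming Python; `build_filename` and its helpers are hypothetical names, and the caller is assumed to pass names already written in plain ASCII (soerensen rather than Sørensen).

```python
RECORD_TYPES = {"I", "P", "E", "S"}
SEPARATOR = "--"  # reserved: may not appear inside any property


def _slug(text):
    """Lower-case one value and use single underscores for spaces."""
    slug = "_".join(text.lower().split())
    if SEPARATOR in slug:
        raise ValueError("'--' is reserved as the property separator")
    return slug


def _field(values):
    """Join sub-values with '__'; anything missing becomes a single '_'."""
    return "__".join(_slug(value) or "_" for value in values) or "_"


def build_filename(record_type, kind, people, place, date, note, uid, ext):
    """Build a file name from the eight parts, in their fixed order.

    people is a list of (family name, given names) pairs and place is a
    list of regions, largest first.
    """
    if record_type not in RECORD_TYPES:
        raise ValueError(f"unknown record type: {record_type!r}")
    ordered = sorted(people)  # alphabetical by family name
    fields = [
        record_type,
        _field([kind]),
        _field([family for family, given in ordered]),
        _field([given for family, given in ordered]),
        _field(place),
        date or "_",
        _field([note]),
        uid,
    ]
    return SEPARATOR.join(fields) + "." + ext


print(build_filename("S", "court record", [("soerensen", "anders")],
                     ["uk", "london", "engelfield"], "1820-01-03",
                     "before district court", "00826", "pdf"))
```

Sorting the (family name, given names) pairs together keeps the given names in the same order as the family names, as the list of parts requires for couples.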

Examples

Using the records outlined in the beginning we would get the following file names. (Please help complete this list of examples)

  • Census record: S--census--matthews__williams--anna__john--uk__london__dangerfield__st_johns_rd--1810-_-_--_--00874.pdf
  • Court record: S--court_record--soerensen--anders--uk__london__engelfield--1820-01-03--before_district_court--00826.pdf
  • Marriage certificate: S--marriage_certificate--soerensen__williams--anders__mary_agnes--uk__london--1823-12-02--_--00864.pdf
  • Wedding portrait: I--portrait--williams--mary_agnes--uk__london--1823-12-03--wedding_portrait--000967.jpg
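
Because every property is compulsory and has a fixed position, a file name of this kind can be read back by splitting on the double dash. This is a sketch only, assuming Python; `FIELDS` and `parse_filename` are hypothetical names chosen for illustration.

```python
from pathlib import PurePath

# The eight parts of a file name, in their fixed order.
FIELDS = ("record_type", "kind", "family_names", "given_names",
          "place", "date", "note", "uid")


def parse_filename(filename):
    """Split a positional file name into a dictionary of properties."""
    stem = PurePath(filename).stem  # drop the extension
    parts = stem.split("--")
    if len(parts) != len(FIELDS):
        raise ValueError(
            f"expected {len(FIELDS)} properties, found {len(parts)}")
    record = dict(zip(FIELDS, parts))
    # Properties that may hold several values are split on '__';
    # a lone underscore means the information is missing.
    for key in ("family_names", "given_names", "place"):
        record[key] = [] if record[key] == "_" else record[key].split("__")
    for key in ("kind", "date", "note"):
        if record[key] == "_":
            record[key] = None
    return record


record = parse_filename(
    "S--court_record--soerensen--anders--uk__london__engelfield"
    "--1820-01-03--before_district_court--00826.pdf")
print(record["place"])  # ['uk', 'london', 'engelfield']
```

A name with a missing or extra separator has the wrong number of properties and is rejected, which gives a cheap check for typing mistakes.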
