GEPS 043: Improving GEDCOM support for Places

Revision as of 15:01, 30 September 2017 by Prculley (talk | contribs) (Proposal)

Related Bugs

  • 688 (support GEDCOM 5.5EL) <== The big one
  • 10160 (export of Place with ADDR)
  • 8699 (Blank Place when only ADDR is present)
  • 8322 (some software exports address details in the ADDR field and the rest in PLAC)
  • 8349 (complaint about lack of import/export of hierarchies)

References:

Goal: avoid data loss

GEDCOM Import

Current situation

Place Name and Title fields are always stored the same unless otherwise noted.

Places are matched by title to avoid duplicates, which works ok for places imported, less so for places already present in db.

  • Case: raw PLAC (no FORM or ADDR)
    Imported as a Gramps place of type 'Unknown'; no attempt is made to parse it into city, state, etc.
  • Case: PLAC with FORM
    Imported by force-fitting into a Location object, and then converting the Location object to a Gramps place. The various levels are used to create multiple places with appropriate 'enclosed by' settings. Downside: any place types not recognized by the importer are lost, and some types are poorly converted (for example, Town, Municipality, etc. all become City). Postal code (when recognized) is put into the Place code field, and doesn't participate in the 'enclosed by' hierarchy.
    Place Title is the Place name followed by the ','-separated names of the enclosing places. Place Name is the top level (first) name. Place Type is the top level (first) type.
  • Case: ADDR with no PLAC
    Imported as place, name from ADDR, type 'Address', no enclosure, and ADDR stuff in Alt Location object
  • Case: ADDR then PLAC, no FORM, PLAC not previously encountered (order important)
    Imported as place, name from PLAC, type 'Address', no enclosure, and ADDR stuff in Alt Location object
  • Case: ADDR then PLAC, no FORM, ADDR/PLAC previously encountered (order important)
    Previously encountered place is reused
  • Case: ADDR then PLAC, no FORM, PLAC previously encountered (order important)
    Added to the previously encountered place, with the ADDR stuff in the Alt Location object. Note that if more than one ADDR/PLAC combination is found, the Alt Location object is overwritten each time, even when the ADDR is different, so the earlier ADDR info is lost.
  • Case: PLAC then ADDR, no FORM, neither previously encountered (order important)
    Two places created, one from the ADDR, type 'Detail', the second enclosing the first from the PLAC, type 'Unknown'.
  • Case: PLAC then ADDR, no FORM, PLAC previously encountered (order important). Doesn't matter if ADDR portion of PLAC/ADDR combination was previously encountered.
    A new place is created from the ADDR, type 'Detail', and the previously encountered enclosing place is referenced. (The importer should probably find and reuse a previously encountered place of type 'Detail' that matches, but this does not happen.)

There are probably other cases, but as you can see we have a bit of a mess.

Issues

  • We don't recognize duplicate ADDR, type 'Detail' records.
  • We lose data when multiple records occur with ADDR preceding PLAC and the PLAC matches: only the last ADDR encountered is kept.
  • We are utilizing the deprecated 'Alt Location' of our place structure.

Proposals

  1. Support GEDCOM EL extensions for PLAC._LOC, and a few others related to places.
  2. Use only one mechanism for ADDR records. Do away with Place type 'Detail' and the Location Object (place Alt Location).
    • It has been suggested that Gramps developers want to deprecate the Location (Alt Location) structure of our place object. In this case the proposal is to convert stand-alone event ADDR to places with Type 'Address'. Events that also have the PLAC tag would create a place with Type 'Address' from ADDR data and then create a second (enclosing) place from PLAC data. In either case, the ADDR data would be converted to a single line of comma separated elements (if not already available).
  3. Don't force-fit PLAC.FORM records into Location objects; just make appropriate places with the type taken from the FORM. The 'enclosed by' hierarchy is defined by the GEDCOM 5.5.1 FORM. Postal codes are expected after the 'country' level and don't participate in the 'enclosed by' hierarchy, but should be recognized if at another position. I have another file with {2 FORM Town, Area code, County, Region, Country, Subdivision}, for example, which also needs to work.
    The following example would create a Gramps place named '123 Main', type 'Street', with code '77375', enclosed by
    another Gramps place named 'Houston', type 'City', enclosed by
    another Gramps place named 'Harris Co.', type 'County', etc.
    Example:
        1 RESI
        2 PLAC 123 Main, Houston, Harris Co., Texas, USA, 77375
        3 FORM Street, City, County, State, Country, Postal Code
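
A minimal sketch of how proposal 3 might be realized, written in Python since Gramps is. `PlaceSketch` and `places_from_plac` are hypothetical names for illustration only and are not part of Gramps or libgedcom.py; the set of FORM labels treated as a postal code is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed spellings of the postal-code level in a FORM line (illustrative only).
POSTAL_LABELS = {"postal code", "postcode", "zip", "zip code"}


@dataclass
class PlaceSketch:
    """Hypothetical stand-in for a Gramps place; not the real Place class."""
    name: str
    ptype: str
    code: str = ""
    enclosed_by: Optional["PlaceSketch"] = None


def places_from_plac(plac: str, form: str) -> Optional[PlaceSketch]:
    """Build an 'enclosed by' chain from a PLAC value and its FORM."""
    names = [part.strip() for part in plac.split(",")]
    types = [part.strip() for part in form.split(",")]
    code = ""
    levels = []
    # zip() stops at the shorter list, so unmatched trailing parts are ignored.
    for name, ptype in zip(names, types):
        if ptype.lower() in POSTAL_LABELS:
            code = name  # recognized at any position, kept out of the hierarchy
        elif name:
            levels.append((name, ptype))
    enclosing = None
    # Walk from the outermost level inward so each place can point at its parent.
    for name, ptype in reversed(levels):
        enclosing = PlaceSketch(name, ptype, enclosed_by=enclosing)
    if enclosing is not None:
        enclosing.code = code  # the code is stored on the lowest-level place
    return enclosing


place = places_from_plac(
    "123 Main, Houston, Harris Co., Texas, USA, 77375",
    "Street, City, County, State, Country, Postal Code",
)
```

Here `place` is the 'Street' level, and following `enclosed_by` walks up through 'City', 'County', 'State' and 'Country'. The place type is taken verbatim from the FORM, so unrecognized types are kept rather than lost.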

Issues

  • Location xref (_LOC) structure allows multiple place types, with a date for each. Gramps allows only one, with no date.
  • Location xref (_LOC.TYPE) record for place type is not well defined by the Gedcom L group. In some cases they use an integer followed by text type, in other case only the text. In their examples, these types are in German, leading to questions on how to handle internationalization. At the moment they will be converted to standard PlaceType values, if recognized. If not, they will be imported as PlaceType CUSTOM values (text).
  • Location xref (_LOC.TYPE) record for place type seems likely to allow an enormous list of types. Far more than the current Gramps PlaceType list. We import them as Custom Gramps types, but that has poor results for internationalization. I have contacted the Gedcom L group on this topic, but their mailing list is nearly as infrequently attended as ours is, so I don't expect any resolution soon.
  • Location xref (_LOC) structure allows multiple postal codes, with a date for each. Gramps allows only one, with no date.
  • Gramps has no direct support of ABBR, ABBR.TYPE, _FPOST, _FPOST.DATE, _POST.DATE, _GOV, _FSTAE, _FCTRY, _MAIDENHEAD, _LOC.TYPE (Hierarchical relationship), _DMGD, _DMGD.DATE, _AIDN, _AIDN.DATE, _AIDN.TYPE. These are proposed to be stored in a Note.
  • Gramps does not support Event references from Place objects. These are proposed to be stored as xrefs in a StyledText Note.
  • The potential exists for place names defined in ADDR, PLAC, and _LOC records to be different. If this occurs, then multiple Gramps alt place names will be set for the place. The primary name will be set from the _LOC record.
  • The potential exists for certain items to be defined both in the PLAC and _LOC records (where the GEDCOM L specification says they should be in the _LOC record). If this occurs, both the PLAC.FORM and the _LOC versions of hierarchy will be imported into the Gramps place.

Details

In embedded PLAC records the following is in the GEDCOM L standard. This proposal suggests that items labeled 'new' be imported by Gramps.

n PLAC <PLACE_NAME> {1:1}
+1 FORM <PLACE_HIERARCHY> {0:1}
+1 FONE <PLACE_PHONETIC_VARIATION> {0:M}    New (1,2) Note
+2 TYPE <PHONETIC_TYPE> {1:1}               New (1,2) Note
+1 ROMN <PLACE_ROMANIZED_VARIATION> {0:M}   New (1,2) Note
+2 TYPE <ROMANIZED_TYPE> {1:1}              New (1,2) Note
+1 MAP {0:1}
+2 LATI <PLACE_LATITUDE> {1:1}
+2 LONG <PLACE_LONGITUDE> {1:1}
+1 <<NOTE_STRUCTURE>> {0:M}
+1 _LOC @<XREF:_LOC>@                       New cross ref to _LOC
+1 _GOV <GOV_IDENTIFIER> {0:1}              Stored in Gramps ID for the place
+1 _POST <POSTAL_CODE> {0:M}                New (2) Place Code field
+2 DATE <DATE_VALUE> {0:1}                  New (1,2) Note
+1 _FPOST <FOKO_POSTCODE> {0:M}             New (1,2) Note
+1 _MAIDENHEAD <MAIDENHEAD_LOCATOR> {0:1}   New (1,2) Note
+1 _FSTAE <FOKO_TERRITORY_IDENTIFIER> {0:1} New (1,2) Note
+1 _FCTRY <FOKO_STATE_IDENTIFIER> {0:1}     New (1,2) Note

1) Since Gramps Places don't have Attributes, I propose that these items be placed in a single Gramps note (one per place (or place reference in the Event)). The note will have the various GEDCOM fields and the contents included.

2) If the _LOC is present, these are also expected in the _LOC structure. Between both structures and in some cases multiple copies, a duplicate avoidance strategy is used. If the sub-record is the same as one already present in the note, it is dropped. If different, it will be added.
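
The duplicate avoidance in (2) might look something like the sketch below. The "TAG: value" line format and the `add_to_note` helper are assumptions made for illustration; the proposal does not fix the layout of the note.

```python
from typing import List


def add_to_note(note_lines: List[str], tag: str, value: str) -> bool:
    """Append 'TAG: value' to the note unless the same entry already exists.

    Returns True when the entry was added, False when it was dropped.
    """
    entry = f"{tag}: {value.strip()}"
    if entry in note_lines:
        return False  # same sub-record already present in the note
    note_lines.append(entry)
    return True


note = []
first = add_to_note(note, "_MAIDENHEAD", "JO62qm")   # from the PLAC record
repeat = add_to_note(note, "_MAIDENHEAD", "JO62qm")  # same value from _LOC
other = add_to_note(note, "_MAIDENHEAD", "JO62ql")   # a differing value
```

With this approach a value repeated in both the PLAC and the _LOC structure appears once, while differing values are all kept, so nothing is lost.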

The cross referenced _LOC record is defined below and is entirely new. While these are expected at the end of the GEDCOM file per the EL agreement, Gramps can support them at any position.

0 @<XREF:_LOC>@ _LOC
1 NAME <PLACE_NAME> {1:M}                  Place Name
2 DATE <DATE_VALUE> {0:1}                  Place PlaceName Date
2 _NAMC <PLACE_NAME_ADDITION> {0:1}        Note (1)
2 ABBR <ABBREVIATION_OF_NAME> {0:M}        Note (1)
3 TYPE <TYPE_OF_ABBREVIATION> {0:1}        Note (1)
2 LANG <LANGUAGE_ID> {0:1}                 Place PlaceName Language
2 <<SOURCE_CITATION>> {0:M}                Citation (4)
1 TYPE <TYPE_OF_LOCATION> {0:M}            Place Type (2)
2 DATE <DATE_VALUE> {0:1}                  Note (1, 3)
2 <<SOURCE_CITATION>> {0:M}                Citation (4)
1 _FPOST <FOKO_POSTCODE> {0:M}             Note (1)
2 DATE <DATE_VALUE> {0:1}                  Note (1)
1 _POST <POSTAL_CODE> {0:M}                Place Code field (6)
2 DATE <DATE_VALUE> {0:1}                  Note (1)
2 <<SOURCE_CITATION>> {0:M}                Citation (4)
1 _GOV <GOV_IDENTIFIER> {0:1}              Stored in Gramps ID for the place
1 _FSTAE <FOKO_TERRITORY_IDENTIFIER> {0:1} Note (1)
1 _FCTRY <FOKO_STATE_IDENTIFIER> {0:1}     Note (1)
1 MAP {0:1}
2 LATI <PLACE_LATITUDE> {1:1}              Place Latitude
2 LONG <PLACE_LONGITUDE> {1:1}             Place Longitude
1 _MAIDENHEAD <MAIDENHEAD_LOCATOR> {0:1}   Note (1)
1 EVEN [<EVENT_DESCRIPTOR>|<NULL>] {0:M}   Event (5)
2 <<EVENT_DETAIL>> {0:1}                   Event (5)
1 _LOC @<XREF:_LOC>@ {0:M}                 Place Reference Enclosed
2 TYPE <HIERARCHICAL_RELATIONSHIP> {1:1}   Note (1)
2 DATE <DATE_VALUE> {0:1}                  Place Reference Date
2 <<SOURCE_CITATION>> {0:M}                Citation (4)
1 _DMGD <DEMOGRAPHICAL_DATA> {0:M}         Note (1)
2 DATE <DATE_VALUE> {0:1}                  Note (1)
2 <<SOURCE_CITATION>> {0:M}                Citation (4)
2 TYPE <TYPE_OF_DEMOGRAPICAL_DATA> {1:1}   Note (1)
1 _AIDN <ADMINISTRATIVE_IDENTIFIER> {0:M}  Note (1)
2 DATE <DATE_VALUE> {0:1}                  Note (1)
2 <<SOURCE_CITATION>> {0:M}                Citation (4)
2 TYPE <TYPE_OF_ADMINISTRATIVE_IDENTIFIER> {1:1}      Note (1)
1 <<MULTIMEDIA_LINK>> {0:M}
1 <<NOTE_STRUCTURE>> {0:M}
1 <<SOURCE_CITATION>> {0:M}
1 <<CHANGE_DATE>> {0:1}

1) Since Gramps Places don't have Attributes, I propose that these items be placed in a single Gramps note (one per place). The note will have the various GEDCOM fields and the contents included. When there are multiple possible similar notes (place name notes, enclosed by notes) the associated place Name or enclosed Gramps_ID (as a StyledText xref) will be included in the local heading.

2) Gramps does not support multiple place types. If more than one type is found, the TYPE and TYPE.DATE will be put in the Note.

3) Place Type Date. Gramps does not support a date on the PlaceType, it is put into the Note.

4) The citation will be directly on the Place, however, the note will have a heading and citation xref.

5) The event will be created as normal, and the note will have a StyledText xref to it. This is somewhat dangerous in that these events will appear to be 'unused objects' in a 'Check and Repair' or 'Unused Object' scan, and might be deleted. Also if filters are used, these events might not be included in desired output.

6) If more than one Postal code is encountered only the first is stored in the Gramps place code field. Additional encounters are stored in Note.
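
As a rough illustration of (6), the split between the place code field and the Note could be done as below. `split_postal_codes` is a hypothetical helper, and the "_POST: value" note format is an assumption.

```python
from typing import List, Tuple


def split_postal_codes(codes: List[str]) -> Tuple[str, List[str]]:
    """Return (place_code, note_lines) for the _POST values of one place."""
    unique: List[str] = []
    for code in codes:
        code = code.strip()
        if code and code not in unique:
            unique.append(code)
    if not unique:
        return "", []
    # The first code goes to the place code field, the rest go to the Note.
    return unique[0], [f"_POST: {code}" for code in unique[1:]]


place_code, extra_lines = split_postal_codes(["77375", "77379", "77375"])
```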

GEDCOM Export

Current situation

Addresses attached to submitter, persons, repos, etc. are exported as ADDR records.

Events are exported with BOTH PLAC and ADDR records. The ADDR record uses CITY, STAE, CTRY sub records to build a place hierarchy.

Places export with the PLAC.NAME filled in with the Gramps Title field, the contents depending on the Preferences/Places/Automatic Place Title setting.

Places with Lat/Lon correctly export this info.

Places with 'code' export this in the ADDR structure.

Places with Alt-Locations; the Alt-Locations are NOT exported. These are expected to be deprecated soon.

Issues

  • The exported ADDR structure is not legal GEDCOM: it has no first line.
  • PLAC.NAME etc. should not depend on automatic title generation setting.
  • While exporting ADDR and PLAC is arguably legal, it is redundant.
  • No attempt is made to export place alt name, enclosure, place type, date, language, URL, citations, or media information.

Proposal

  1. Support PLAC.FORM. Use the Gramps enclosure information to create full comma separated PLAC.NAME and PLAC.FORM filled out from the Gramps place type. Postal Code from place code field at end of list when present. PLAC.FORM will be included with each PLAC, NOT in header, since Gramps is not guaranteed to have consistent place types.
  2. Support PLAC._LOC and _LOC xref. This, while not standard GEDCOM, is defined by GEDCOM L and is legal under standard GEDCOM, so it should be safe for other programs to import, even if they don't support it. This can encode several Gramps structures as detailed below.
  3. Export places of Type Address as an ADDR record.
  4. Place names for export no longer depend on preferences.
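
Item 1 of this proposal could be sketched as follows. `plac_lines` is a hypothetical helper, and the enclosure chain is passed in as plain (name, type) pairs rather than real Gramps place objects; the 'Postal Code' FORM label is taken from the import example earlier on this page.

```python
from typing import List, Tuple


def plac_lines(level: int, chain: List[Tuple[str, str]], code: str = "") -> List[str]:
    """Build the PLAC line and its FORM line from an enclosure chain.

    chain lists (name, place type) pairs from the lowest level up to the
    outermost enclosing place.
    """
    names = [name for name, _ in chain]
    types = [ptype for _, ptype in chain]
    if code:
        # Postal code goes at the end of the list when present.
        names.append(code)
        types.append("Postal Code")
    return [
        f"{level} PLAC {', '.join(names)}",
        f"{level + 1} FORM {', '.join(types)}",
    ]


lines = plac_lines(
    2,
    [("123 Main", "Street"), ("Houston", "City"), ("Texas", "State")],
    code="77375",
)
```

Because the FORM is generated per PLAC from the actual place types, it stays correct even when different places in the tree use different hierarchies.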

Details

The following would be exported at each Event. We include PLAC.FORM for compatibility, and also include PLAC._LOC for more complete export.

n PLAC <PLACE_NAME> {1:1}                   Primary place name
+1 FORM <PLACE_HIERARCHY> {0:1}             Place type
+1 FONE <PLACE_PHONETIC_VARIATION> {0:M}    Not used
+2 TYPE <PHONETIC_TYPE> {1:1}               Not used
+1 ROMN <PLACE_ROMANIZED_VARIATION> {0:M}   Not used
+2 TYPE <ROMANIZED_TYPE> {1:1}              Not used
+1 MAP {0:1}                                When LAT/LON are present
+2 LATI <PLACE_LATITUDE> {1:1}              Place Latitude
+2 LONG <PLACE_LONGITUDE> {1:1}             Place Longitude
+1 _LOC @<XREF:_LOC>@                       cross ref (Gramps place ID)
+1 <<NOTE_STRUCTURE>> {0:M}                 When notes are present

This is the new _LOC structure.

0 @<XREF:_LOC>@ _LOC                       Gramps place ID
1 NAME <PLACE_NAME> {1:M}                  Place Name (1)
2 DATE <DATE_VALUE> {0:1}                  Place PlaceName Date
2 _NAMC <PLACE_NAME_ADDITION> {0:1}        Not used
2 ABBR <ABBREVIATION_OF_NAME> {0:M}        Not used
3 TYPE <TYPE_OF_ABBREVIATION> {0:1}        Not used
2 LANG <LANGUAGE_ID> {0:1}                 Place PlaceName Language
2 <<SOURCE_CITATION>> {0:M}                Not used
1 TYPE <TYPE_OF_LOCATION> {0:M}            Place Type
2 DATE <DATE_VALUE> {0:1}                  Place PlaceName Date
2 <<SOURCE_CITATION>> {0:M}                Not used
1 _FPOST <FOKO_POSTCODE> {0:M}             Not used
2 DATE <DATE_VALUE> {0:1}                  Not used
1 _POST <POSTAL_CODE> {0:M}                Place code
2 DATE <DATE_VALUE> {0:1}                  Not used
2 <<SOURCE_CITATION>> {0:M}                Not used
1 _GOV <GOV_IDENTIFIER> {0:1}              Not used
1 _FSTAE <FOKO_TERRITORY_IDENTIFIER> {0:1} Not used
1 _FCTRY <FOKO_STATE_IDENTIFIER> {0:1}     Not used
1 MAP {0:1}                                When LAT/LON are present
2 LATI <PLACE_LATITUDE> {1:1}              Place Latitude
2 LONG <PLACE_LONGITUDE> {1:1}             Place Longitude
1 _MAIDENHEAD <MAIDENHEAD_LOCATOR> {0:1}   Not used
1 EVEN [<EVENT_DESCRIPTOR>|<NULL>] {0:M}   Not used
2 <<EVENT_DETAIL>> {0:1}                   Not used
1 _LOC @<XREF:_LOC>@ {0:M}                 Enclosed Place Reference (2)
2 TYPE <HIERARCHICAL_RELATIONSHIP> {1:1}   Not used
2 DATE <DATE_VALUE> {0:1}                  Place Reference Date
2 <<SOURCE_CITATION>> {0:M}                Not used
1 _DMGD <DEMOGRAPHICAL_DATA> {0:M}         Not used
2 DATE <DATE_VALUE> {0:1}                  Not used
2 <<SOURCE_CITATION>> {0:M}                Not used
2 TYPE <TYPE_OF_DEMOGRAPICAL_DATA> {1:1}   Not used
1 _AIDN <ADMINISTRATIVE_IDENTIFIER> {0:M}  Not used
2 DATE <DATE_VALUE> {0:1}                  Not used
2 <<SOURCE_CITATION>> {0:M}                Not used
2 TYPE <TYPE_OF_ADMINISTRATIVE_IDENTIFIER> {1:1} Not used
1 <<MULTIMEDIA_LINK>> {0:M}                When Media are present
1 <<NOTE_STRUCTURE>> {0:M}                 When Notes are present
1 <<SOURCE_CITATION>> {0:M}                When Citations are present
1 <<CHANGE_DATE>> {0:1}                    Always 

1) When there are Alt names present, the _LOC.NAME and subsidiary records are repeated, one for each Alt name. The Primary name is first.

2) When 'Enclosed by' references are present, this points to the associated _LOC records.
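
A simplified sketch of how the _LOC record might be written out, covering only NAME, LANG, TYPE, _POST and the enclosing _LOC references. `loc_record` and its argument layout are hypothetical, the place IDs in the usage lines are invented, and dates, MAP, media, notes, citations and the change date are left out for brevity.

```python
from typing import List, Tuple


def loc_record(
    place_id: str,
    names: List[Tuple[str, str]],
    ptype: str,
    enclosed_by_ids: List[str],
    code: str = "",
) -> List[str]:
    """Return the GEDCOM lines of one _LOC record.

    names holds (name, language) pairs, primary name first; the language
    may be an empty string.
    """
    lines = [f"0 @{place_id}@ _LOC"]
    for name, lang in names:
        # One NAME block per name: the primary name, then each alt name.
        lines.append(f"1 NAME {name}")
        if lang:
            lines.append(f"2 LANG {lang}")
    lines.append(f"1 TYPE {ptype}")
    if code:
        lines.append(f"1 _POST {code}")
    for ref in enclosed_by_ids:
        # Each 'enclosed by' reference points at the enclosing _LOC record.
        lines.append(f"1 _LOC @{ref}@")
    return lines


record = loc_record(
    "P0001",
    [("Houston", "en"), ("Houstone", "")],  # invented alt name
    "City",
    ["P0002"],
)
```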

Questions

  1. GEDCOM L recommends that when 'user defined tags' (those that are preceded by an '_') are present, that an explanation of these tags is placed in the header of the exported GEDCOM file as a Schema. I was unable to find any example GEDCOM files that included this feature for _LOC tag, although I did find some very old GEDCOM files with the _SCHEMA tag, mostly exported by FTW vers 2.
    Do we want to include it for export?
    Author recommends NOT to include.
    GEDCOM L Schema for the _LOC tag is included below.
        0 HEAD
        ...
        1 _SCHEMA
        2 PLAC
        3 _LOC
        4 _DEFN Location record http://wiki-en.genealogy.net/
        5 CONC GEDCOM/PLAC-Tag#Location_Records
  2. Should Gramps be exporting GEDCOM custom tags? These tags (starting with an '_', for example "_LOC") are defined by GEDCOM specifications as legal ways to extend the GEDCOM specification. Programs that do not recognize these tags are supposed to ignore them except for some sort of warning to the user. Gramps does this for our own import by including these in an error dialog after import, as well as in notes in the database associated with the object when possible. Gramps already exports a few of these tags for other purposes.
  3. How do we want to include this into Gramps?
    The code is developed under gramps50 branch. But for release it could easily be changed to Gramps master, for release at 5.1 timeframe, or (with more work) developed as an Addon which would replace current libgedcom.py for 5.0 branch. Or even as a 'bug' for gramps50 branch.

Comments

Round trip GEDCOM in to GEDCOM out will not be transparent for all the fields in the GEDCOM L _LOC record. Data should not be lost, but some data will get converted to Notes, which would be inconvenient for users.

See also