Difference between revisions of "GEPS 017: Flexible gen.lib Interface"


Revision as of 12:45, 12 January 2010

gen.lib is the Python interface for all of the objects in Gramps. Currently, it is not directly tied to any data storage mechanism, except for the implicit assumption that objects are created through an unserialize method for each object.

This proposal explores the possibility of making the creation of objects more general, and less tied to the particular unserializing process.

Overview

Currently, the main database interface for getting an object looks like:

>>> db.get_person_from_handle(handle)

This uses the only existing manner of creating a person supported by gen.lib:

>>> Person().unserialize(data)

where data is a serialized (non-object) representation of a Person.

This has several issues:

  1. Person() is first initialized as a completely empty object
  2. it may unserialize data that isn't needed
  3. it only allows data to be created in this particular manner
  4. it can be very slow, especially when unserializing primary objects that contain many secondary objects or reference objects
  5. unserialize is tied directly to the bsddb table layout. As a consequence, databases with a different layout pay a heavy penalty: a sweep can never be limited to a single table, because building an object always requires hitting multiple tables
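To make the first two issues concrete, here is a minimal sketch of the create-then-unserialize pattern. The classes below are simplified, hypothetical stand-ins and not the real gen.lib code; the field names and the tuple layout are assumptions chosen for illustration.

```python
# Hypothetical, simplified stand-ins; NOT the real gen.lib classes.
class Name:
    def __init__(self):
        self.first_name = ""
        self.surname = ""

    def unserialize(self, data):
        self.first_name, self.surname = data
        return self


class Person:
    def __init__(self):
        # Issue 1: every field is first set to an empty default...
        self.handle = None
        self.gramps_id = None
        self.primary_name = Name()
        self.alternate_names = []

    def unserialize(self, data):
        # ...and then overwritten. Issue 2: all secondary objects are
        # built eagerly, whether or not the caller ever reads them.
        self.handle, self.gramps_id, name, alternates = data
        self.primary_name = Name().unserialize(name)
        self.alternate_names = [Name().unserialize(n) for n in alternates]
        return self


# Assumed serialized layout: (handle, gramps_id, primary name, alternates)
data = ("h1", "I0001", ("John", "Smith"), [("Jack", "Smith"), ("J.", "Smith")])
person = Person().unserialize(data)
```

Even a caller that only wants person.gramps_id pays for building three Name objects here.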

This proposal would use an alternative gen.lib construction that avoids these problems.

Possible Fixes

In the detailed mailing-list discussion [1], four possible solutions were discussed:

  1. If an alternative is needed, use something outside of gen.lib
  2. Using a lazy() wrapper to only evaluate what is necessary
  3. Explicit delayed unpickling
  4. Use an Engine inside each object to retrieve data when necessary

Replicate

Replicating gen.lib has the benefit of having zero impact on the current gen.lib. However, it would require maintaining two separate code paths, and it does nothing to address unnecessary unpickling in BSDDB.

Lazy Wrapper

The lazy wrapper idea was shown to have some savings in postponing unserializing (see patch in bug report [2]). However, the requirement to wrap all data in lazy() and the unintended side effects were too great a cost.
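The general idea of a lazy wrapper can be sketched as follows. This is not the patch from [2]; the lazy class, its get() method, and the unpack_names helper are hypothetical names used only to show how evaluation is postponed until first use.

```python
class lazy:
    """Hypothetical deferred value: runs func(*args) on first access only."""

    def __init__(self, func, *args):
        self._func = func
        self._args = args
        self._done = False
        self._value = None

    def get(self):
        if not self._done:
            self._value = self._func(*self._args)
            self._done = True
        return self._value


calls = []  # records every real unpacking, to show when work happens


def unpack_names(raw):
    calls.append(raw)
    return [" ".join(pair) for pair in raw]


names = lazy(unpack_names, [("John", "Smith"), ("Jack", "Smith")])
calls_before = len(calls)   # nothing has been unpacked yet
first = names.get()         # unpacking happens here
second = names.get()        # cached result, no second unpacking
```

The sketch also hints at the cost noted above: every piece of data has to be wrapped, and every consumer has to know it is holding a wrapper rather than the value itself.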

Explicit delayed unpickling

Keep the data of each substructure in serialized form until it actually needs to be unserialized. This is still based on pickling, which limits future approaches.
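A minimal sketch of this approach, assuming the substructure arrives as pickled bytes. The Person class, the event_refs property, and the stored values are hypothetical and only illustrate the technique; pickle.dumps and pickle.loads are the standard library calls.

```python
import pickle


class Person:
    """Hypothetical object that keeps one substructure pickled until used."""

    def __init__(self, handle, raw_event_refs):
        self.handle = handle
        self._raw_event_refs = raw_event_refs  # still pickled bytes
        self._event_refs = None                # unpickled on demand

    @property
    def event_refs(self):
        if self._event_refs is None:
            self._event_refs = pickle.loads(self._raw_event_refs)
        return self._event_refs


raw = pickle.dumps(["E0001", "E0002"])
person = Person("h1", raw)
untouched = person._event_refs is None  # True: nothing unpickled so far
events = person.event_refs              # unpickling happens on first access
```

The object itself now depends on pickle, which is the limitation mentioned above.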

Engine

The best choice considered so far is to build an invisible engine into the gen.lib framework.

This proposal would use an alternative gen.lib construction that avoids the problems listed in the introduction. This means it should provide:

  1. Initialization of objects in a single call:
  • Person() provides an empty Person object
  • Person(data) initializes an object, where data is the database's data about the Person, in a form that an Engine object can interpret
  • When using gen.lib on a database, one must set the Engine that gen.lib should use. The engine knows how the data is stored in the database and which object fields it corresponds to
  2. At init, objects only set attributes that have no processing overhead. Other attributes are set only when they are needed, at which point they are unpacked further or fetched from the db via the engine.
  3. unserialize/serialize are removed as methods of an object and moved to the engine
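The three requirements above could be sketched roughly as follows. This is one possible reading, not a settled design: the Engine base class, TupleEngine, the FIELDS mapping, the fetch and serialize methods, and the tuple layout are all assumptions made for illustration. Default values for an empty Person() are left out to keep the sketch short.

```python
class Engine:
    """Hypothetical base: knows how raw database data maps to object fields."""

    FIELDS = {}

    def fetch(self, obj, name):
        raise NotImplementedError

    def serialize(self, obj):
        raise NotImplementedError


class TupleEngine(Engine):
    """Assumed layout: a Person row is the tuple (handle, gramps_id, names)."""

    FIELDS = {"handle": 0, "gramps_id": 1, "names": 2}

    def fetch(self, obj, name):
        return obj._data[self.FIELDS[name]]

    def serialize(self, obj):
        # Serialization lives in the engine, not in the object.
        ordered = sorted(self.FIELDS, key=self.FIELDS.get)
        return tuple(getattr(obj, name) for name in ordered)


class Person:
    engine = None  # must be set before gen.lib is used on a database

    def __init__(self, data=None):
        # Single-call init: only the raw data is stored, nothing is unpacked.
        self._data = data

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails,
        # so it runs on the first access to each engine-backed field.
        if (name.startswith("_") or self._data is None
                or name not in self.engine.FIELDS):
            raise AttributeError(name)
        value = self.engine.fetch(self, name)
        setattr(self, name, value)  # cache: later reads skip the engine
        return value


Person.engine = TupleEngine()
empty = Person()
person = Person(("h1", "I0001", ["John Smith"]))
gid = person.gramps_id  # only this one field is fetched via the engine
```

Switching to a database with a different layout would then mean providing another engine, while code that uses Person stays the same.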

References

  1. Mailing list discussion
  2. Lazy experiment (patch)