GEPS 017: Flexible gen.lib Interface

From Gramps
Revision as of 13:55, 12 January 2010 by Bmcage (talk | contribs) (Introduction)

gen.lib is the Python interface for all of the objects in Gramps. Currently, it is not directly tied to any data storage mechanism, except for the implicit assumption that objects are created through an unserialize method for each object.

This proposal explores the possibility of making the creation of objects more general, and less tied to the particular unserializing process.

Overview

Currently, the main database interface for getting an object looks like:

>>> db.get_person_from_handle(handle)

This uses the only existing manner of creating a person supported by gen.lib:

>>> Person().unserialize(data)

where data is a serialized (non-object) representation of a Person.

This has several issues:

  1. Person() is first initialized as a completely empty object
  2. it may unserialize data that isn't needed
  3. it only allows data to be created in this particular manner
  4. it can be very slow, specifically when unserializing primary objects that contain many secondary objects or reference objects
  5. unserialize is directly linked to the bsddb table layout. As a consequence, database layouts that differ from it suffer a huge penalty: a sweep over a single table is not possible, and multiple tables must always be hit
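
Points 1, 2 and 4 can be illustrated with a toy model. The classes below are hypothetical stand-ins, not the real gen.lib classes, and the two-field data layout is an assumption made for the sketch. It only counts how many secondary objects get built when a single field is wanted.

```python
# Toy stand-ins for gen.lib classes (hypothetical, for illustration only).
class MediaRef(object):
    created = 0  # class-level counter of constructed instances

    def __init__(self):
        MediaRef.created += 1
        self.ref = None

    def unserialize(self, data):
        self.ref = data
        return self


class Person(object):
    def __init__(self):
        # Issue 1: the object starts out completely empty.
        self.private = False
        self.media_list = []

    def unserialize(self, data):
        # Issues 2 and 4: every secondary object is built eagerly,
        # whether or not the caller will ever look at it.
        self.private, raw_media = data
        self.media_list = [MediaRef().unserialize(raw) for raw in raw_media]
        return self


raw = (True, ["m1", "m2", "m3"])   # assumed layout: (private, media refs)
person = Person().unserialize(raw)
wanted = person.private            # the only field the caller needs
built = MediaRef.created           # MediaRef objects built along the way
```

In this sketch the caller reads one boolean, yet one MediaRef is constructed per raw media entry.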

This proposal would use an alternative gen.lib construction, that avoids these problems.

As further evidence that there are problems with the current approach in Gramps, it suffices to look at src/gui/views/treemodels. For example, eventmodel.py contains:

def column_date(self, data):
    # data is the raw tuple from the database, not an Event object
    if data[COLUMN_DATE]:
        event = gen.lib.Event()
        event.unserialize(data)
        return DateHandler.get_date(event)
    return u''

In this code, data was obtained as raw data from the database: data = db.get_raw_event_data(handle). The model needs to know the position of the date within that raw data, COLUMN_DATE, which couples the database table layout to the view implementation. An event object is created only when a date is present, but in that case the overhead of building the event cannot be avoided. This is a very costly operation, because the entire event is initialized, including for example EventType(), NoteBase(), .... All of this is done only to obtain the date contained in the event object.

Possible Fixes

Overview

In the detailed mailing-list discussion [1], four possible solutions were discussed:

  1. If an alternative is needed, use something outside of gen.lib
  2. Using a lazy() wrapper to only evaluate what is necessary
  3. Explicit delayed unpickling
  4. Use an Engine inside each object to retrieve data when necessary

These come down to:

  1. Replicate
    Replicating gen.lib has the benefit of having zero impact on the current gen.lib. However, it would require maintaining two separate code paths, and it does nothing to address unnecessary unpickling in BSDDB. It also means gramps-connect and gramps proper will have no real code to share.
  2. Lazy Wrapper
    The lazy wrapper idea was shown to have some savings in postponing unserializing (see patch in bug report [2]). However, the requirement to wrap all data in lazy(), and the unintended side-effects were too great a cost.
  3. Explicit delayed unpickling
    Simply keep the serialized data of a substructure until it needs to be unserialized. This is still based on pickling, which limits future approaches.
  4. Engine
    The best choice considered so far is to build an invisible engine into the gen.lib framework. It is detailed in the next section.
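
To make option 3 concrete, here is a minimal sketch of explicit delayed unpickling using the standard pickle module. The Person class and its get_media_list method are hypothetical stand-ins, not the gen.lib API; the point is only that the substructure stays pickled until someone asks for it.

```python
import pickle


class Person(object):
    """Hypothetical stand-in; keeps the media list pickled until it is read."""

    def __init__(self, private, media_blob):
        self.private = private
        self._media_blob = media_blob   # still-pickled substructure
        self._media_list = None         # filled in on first access

    def get_media_list(self):
        if self._media_list is None:
            # Unpickle only now, when the data is actually requested.
            self._media_list = pickle.loads(self._media_blob)
            self._media_blob = None
        return self._media_list


blob = pickle.dumps(["m1", "m2"])
person = Person(False, blob)
untouched = person._media_list          # nothing has been unpickled yet
media = person.get_media_list()         # unpickled on demand
```

The drawback noted above remains visible in the sketch: the object has to know that its substructure is a pickle, which ties it to one storage format.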

A gen.lib Engine

Introduction

This proposal would use an alternative gen.lib construction, that avoids the problems listed in the introduction.

The core concept is that when using gen.lib on a database, an Engine object must be created, which will contain the methods needed to map database data to object attributes. All objects will have access to this Engine via a factory method.

Furthermore, all compound gen.lib objects will understand the concept of delayed access. That is, the object is not fully initialized on init. When pieces that are not yet initialized are needed (for example, the media list of a person object), the object first initializes that piece and then returns it.
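
The factory idea could look roughly like the sketch below. The name EngineKeeper and its get_instance() call follow the code shown later on this page; everything else (the singleton implementation, DummyEngine, the pass-through unpack_person) is an assumption made for illustration.

```python
class EngineKeeper(object):
    """Process-wide holder for the active engine (sketch, not Gramps code)."""
    _instance = None

    def __init__(self):
        self.engine = None

    @classmethod
    def get_instance(cls):
        # Create the single keeper on first use, then always reuse it.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance


class DummyEngine(object):
    """Hypothetical engine that maps raw data to attribute values."""

    def unpack_person(self, data):
        return data


# The database layer would register its engine once, when it is opened.
EngineKeeper.get_instance().engine = DummyEngine()

# Any gen.lib object can later look the engine up through the factory.
engine = EngineKeeper.get_instance().engine
same_keeper = EngineKeeper.get_instance() is EngineKeeper.get_instance()
```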

It would provide:

  1. init of objects in one single call.
    • So Person() provides an empty Person object
    • Person(data) initializes an object, where data is the data about Person in the db which can be interpreted by an Engine object to set the attributes
    • Person(source=pers) remains possible, to duplicate an existing object
  2. When using gen.lib on a database, one must set an Engine that gen.lib should use. The engine knows how data is present in the database, and what fields in the objects correspond to this
  3. objects only set attributes that have no processing overhead at init. Other attributes are set only when they are needed, at which time they are further unpacked or fetched from db, via the engine.
  4. unserialize/serialize are removed as methods of an object, and are moved to the engine
  5. get and set methods are removed, and replaced by attribute access and the property method to do the delayed access as needed
  6. gen.lib will obtain two engines to start with. One for bsddb, and one for a django backend.
    1. BsddbEngine will be pure software. The engine will contain all serialize/unserialize methods that are now present in the objects themselves.
    2. DjangoEngine will have a pointer to the django models. When, for example, a person object needs access to its media_list, the DelayAccessObj will call the DjangoEngine to obtain the media list, which will use the sql mediareference table to return the list of all MediaRef data
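
The three construction forms from point 1 could be supported by a single __init__, roughly as sketched here. ToyEngine, the two-field data layout and the class-level engine attribute are assumptions for the example, not part of the proposal.

```python
class ToyEngine(object):
    """Hypothetical engine: knows that raw person data is (handle, private)."""

    def unpack_person(self, data):
        return data


class Person(object):
    engine = ToyEngine()  # assumption: set once for the open database

    def __init__(self, data=None, source=None):
        if source is not None:
            # Person(source=pers): duplicate an existing object.
            self.handle, self.private = source.handle, source.private
        elif data is not None:
            # Person(data): the engine interprets the raw database data.
            self.handle, self.private = self.engine.unpack_person(data)
        else:
            # Person(): an empty object.
            self.handle, self.private = None, False


empty = Person()
loaded = Person(("I0001", True))
copy = Person(source=loaded)
```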

Suggested Implementation

No serialize/unserialize

Objects have no serialize/unserialize anymore. This is present in the engine of a database that needs it, and only there. So in practice, the bsddb engine.

Example usage code on bsddb

def get_person_from_handle(self, handle):
    return Person(self.get_raw_person_data(handle))

The Person class will call the bsddb engine from the factory to unserialize this data. The engine will be stored to avoid calling the factory every time: obj._engine will store the engine, and obj.engine will make it accessible. This is part of the DelayedAccess object API, which all gen.lib objects will inherit. To store data:

def commit_person(self, person, ...):
    ...
    db_data = person.engine.person_serialize()
    ...

This works because engine is a bsddb engine, and hence the person_serialize method exists.
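
A round trip through an engine that owns the serialized layout might look like the following sketch. The three-field tuple layout is invented for the example, and passing the person to person_serialize is an assumption; the fragment above does not show how the engine receives the object.

```python
class BsddbEngine(object):
    """Sketch of an engine that owns the tuple layout (hypothetical layout)."""

    def unpack_person(self, data):
        handle, gramps_id, private = data
        return handle, gramps_id, private

    def person_serialize(self, person):
        # Assumption: the engine is handed the object to serialize.
        return (person.handle, person.gramps_id, person.private)


class Person(object):
    engine = BsddbEngine()  # assumption: shared engine for this sketch

    def __init__(self, data):
        (self.handle, self.gramps_id,
         self.private) = self.engine.unpack_person(data)


raw = ("abc123", "I0001", False)
person = Person(raw)
person.private = True
db_data = person.engine.person_serialize(person)
```

Because the layout lives only in the engine, Person itself contains no knowledge of tuple positions.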

DelayedAccess

All gen.lib objects know the concept of delayed access, using an engine to obtain the not yet initialized data.

class DelayAccessObj(object):
   """
   An object that supports delayed access of the data. 
   gen.lib objects are large constructs. Depending on the storage backend
   one can create objects of which part of the data is not yet retrieved or
   constructed for performance reasons. 
   On access of these parts, the data must be obtained or constructed. 
   
   The DelayAccessObj provides the infrastructure to obtain this. It holds:
   1. an engine which is used to obtain the missing data.
   """
   
   def __init__(self):
       self._engine = EngineKeeper.get_instance().engine

Note that the above should be done with properties, so that _engine is only obtained when it is requested and is still None. Note also that all gen.lib objects should perhaps use __slots__ to reduce their memory footprint.
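
A minimal sketch of both notes, assuming a simple singleton EngineKeeper and a placeholder string standing in for the engine. The lookups counter is added here only to make the laziness observable.

```python
class EngineKeeper(object):
    """Minimal stand-in for the factory (hypothetical implementation)."""
    _instance = None
    lookups = 0  # counts how often the factory is consulted

    def __init__(self):
        self.engine = "bsddb-engine"  # placeholder engine object

    @classmethod
    def get_instance(cls):
        cls.lookups += 1
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance


class DelayAccessObj(object):
    __slots__ = ("_engine",)  # no per-instance __dict__ is allocated

    def __init__(self):
        self._engine = None  # not fetched yet

    @property
    def engine(self):
        if self._engine is None:
            # First request: ask the factory and cache the result.
            self._engine = EngineKeeper.get_instance().engine
        return self._engine


obj = DelayAccessObj()
before = EngineKeeper.lookups      # creating the object did not touch the factory
first = obj.engine
second = obj.engine
after = EngineKeeper.lookups       # only the first access consulted the factory
```

Subclasses would need to declare their own __slots__ as well, otherwise they get a __dict__ again and the memory saving is lost.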

When attributes that are not yet initialized are needed, the engine is asked for the data. Take for example the marker attribute of a person, which is a MarkerType() object, in the code fragment

pers = db.get_person_from_handle(handle)
print pers.marker

This initializes a Person. In the new setup, Person has its simple attributes set, and the rest is handled by delayed access. In essence, this means that pers.private is already set to True or False in the __init__ of Person, but pers.marker is a property. Simplified, we have a setup like:

 def __init__(self, data):
     DelayAccessObj.__init__(self)
     (self.private, self._marker, self._media_list) = self._engine.unpack_person(data)

For bsddb, we will have for example: self.private = False, self._marker = 1, self._media_list the raw tupled mediareference data

For django, with mediaref in another table: self.private = False, self._marker = 1, self._media_list = ('Person', handle)

The aim should be clear: each engine unpacks the data it is passed in a way that allows delayed access to the attribute. The bsddb engine uses only the tuple data passed by the database table. The django engine, however, sets media_list to the value needed to obtain a media_list from the media reference table.
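
The difference between the two engines can be sketched as follows. The field order of the raw data is an assumption chosen for the example; only the shape of the returned media_list placeholder follows the description above.

```python
class BsddbEngine(object):
    """Sketch: the bsddb row already carries the raw media reference tuples."""

    def unpack_person(self, data):
        handle, private, marker, media_refs = data
        return private, marker, media_refs


class DjangoEngine(object):
    """Sketch: media references live in another table, so keep only a key."""

    def unpack_person(self, data):
        handle, private, marker = data
        # The key tells the engine later which rows to fetch.
        return private, marker, ("Person", handle)


bsddb_fields = BsddbEngine().unpack_person(
    ("abc123", False, 1, [("m1",), ("m2",)]))
django_fields = DjangoEngine().unpack_person(("abc123", False, 1))
```

Both engines return the same three slots, so Person.__init__ can stay identical regardless of the backend.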

Next, pers.marker is accessed:

  @property
  def marker(self):
      if not isinstance(self._marker, MarkerType):
          # delayed retrieval of marker from the engine using the key
          self._marker = self._engine.get_markertype(self._marker)
      return self._marker

  @property
  def media_list(self):
      if not isinstance(self._media_list, list):
          # delayed retrieval of media list from the engine using the key
          self._media_list = self._engine.get_medialist(self._media_list)
      return self._media_list

So, as _marker is not initialized, the engine is used to obtain the marker from the data. The same holds for _media_list. Note that media_list returns a list of MediaRef objects, which will themselves use delayed access to unpack further as needed, so only minimal overhead is incurred.
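
A self-contained sketch of the marker property, with a counter that shows the engine is consulted once and only on first access. CountingEngine, the two-field data layout, and passing the engine to the constructor are assumptions made to keep the example short.

```python
class MarkerType(object):
    """Toy stand-in for gen.lib's MarkerType."""

    def __init__(self, value):
        self.value = value


class CountingEngine(object):
    """Hypothetical engine that records how often it is asked for data."""

    def __init__(self):
        self.calls = 0

    def unpack_person(self, data):
        return data

    def get_markertype(self, key):
        self.calls += 1
        return MarkerType(key)


class Person(object):
    def __init__(self, data, engine):
        # Assumption: the engine is passed in instead of using a factory.
        self._engine = engine
        self.private, self._marker = engine.unpack_person(data)

    @property
    def marker(self):
        if not isinstance(self._marker, MarkerType):
            # Delayed retrieval: replace the raw key with the real object.
            self._marker = self._engine.get_markertype(self._marker)
        return self._marker


engine = CountingEngine()
pers = Person((False, 1), engine)
calls_after_init = engine.calls   # no marker has been built yet
first = pers.marker               # built on first access
second = pers.marker              # served from the cached attribute
```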

References

  1. Mailing-list discussion
  2. Lazy experiment (patch)