Difference between revisions of "OCR"

From Gramps
Revision as of 09:51, 5 April 2007

While researching your family tree, you will come across printed publications and administrative documents. Optical character recognition (OCR) can save you the long and tedious work of typing them into GRAMPS by hand.

This page shows how to turn a picture of a document into text.

How does it work?

  • The picture needs good contrast and a good resolution.
  • OCR programs read the picture and, using libraries of shapes, detect the characters by matching each shape against the expected character.
  • Dictionaries are used to minimize errors: the result is compared against existing words.
  • Some programs can handle bold, italic or custom font sizes.
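The shape matching and dictionary steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any of the programs mentioned on this page is implemented: the 3x3 shape library `SHAPES`, the function names and the word list are all made up for the example.

```python
from difflib import get_close_matches

# Hypothetical shape library: each character is a 3x3 bitmap,
# read row by row, where 1 = ink and 0 = background.
SHAPES = {
    "I": (0, 1, 0,
          0, 1, 0,
          0, 1, 0),
    "L": (1, 0, 0,
          1, 0, 0,
          1, 1, 1),
    "O": (1, 1, 1,
          1, 0, 1,
          1, 1, 1),
    "T": (1, 1, 1,
          0, 1, 0,
          0, 1, 0),
}


def match_glyph(glyph):
    """Return the library character that differs from glyph in the fewest pixels."""
    def distance(char):
        return sum(a != b for a, b in zip(SHAPES[char], glyph))
    return min(SHAPES, key=distance)


def correct_word(word, dictionary):
    """Replace word with the closest dictionary entry, or keep it if none is close."""
    matches = get_close_matches(word, dictionary, n=1, cutoff=0.6)
    return matches[0] if matches else word


# A "T" with one stray pixel of noise in the bottom-right corner.
noisy_t = (1, 1, 1,
           0, 1, 0,
           0, 1, 1)
print(match_glyph(noisy_t))

# A typical OCR confusion: "g" read as "q".
print(correct_word("marriaqe", ["baptism", "burial", "marriage"]))
```

Real OCR programs work on much larger bitmaps and use more robust comparisons, but the two stages are the same: pick the closest known shape, then check the resulting word against a dictionary.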

Using with GRAMPS

There are not many open source OCR programs. Intelligent Word Recognition (IWR) and Intelligent Character Recognition (ICR) for handwritten certificates are high-end techniques, used in the financial and historical sectors. Some programs may be used as third-party tools alongside GRAMPS. Run them with -h on the command line to see the output options.

  • Tesseract (http://code.google.com/p/tesseract-ocr/) may be a good solution for English readers, but it currently only recognizes US-ASCII characters ...
  • claraocr (http://www.geocities.com/claraocr/) seems to be able to learn, but I could not find any documentation. It also requires the pgm or pbm file format.
  • GOCR/JOCR is used by xsane and kooka, and can generate a custom character database with:

mkdir db

gocr -p db -m 130 -m 256 certificate.png

This will prompt you for each new letter and will generate a new index (db.list) plus a portable bitmap (pbm) file for each letter. Each entry in db.list maps one of these .pbm files to the value you entered (a, b, c ...).

Not very successful on handwritten text.
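If you want to start the same training run from a Python script, a minimal sketch could look like the following. It only wraps the command shown above. The function names `build_gocr_command` and `run_gocr` are made up for this example, and the script assumes that gocr is installed and on your PATH.

```python
import shutil
import subprocess


def build_gocr_command(image_path, db_dir="db"):
    """Build the same argument list as the command line shown above."""
    return ["gocr", "-p", db_dir, "-m", "130", "-m", "256", image_path]


def run_gocr(image_path, db_dir="db"):
    """Run gocr in the current terminal; return its exit code, or None if gocr is missing."""
    if shutil.which("gocr") is None:
        return None
    # Input and output are not captured, so the questions gocr asks
    # about new letters still appear in the terminal.
    return subprocess.run(build_gocr_command(image_path, db_dir)).returncode


print(" ".join(build_gocr_command("certificate.png")))
```

Passing the arguments as a list avoids quoting problems when the file name of the scanned certificate contains spaces.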

  • With Ocrad, you need to use the pgm file format.

Also, Conjecture is a third-party OCR tool that incorporates the code bases of both open source programs.
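Several of the programs listed above expect pgm or pbm input, and all of them benefit from good contrast. As a rough sketch of both points, the following Python code stretches the grey levels of an image and writes it out as a binary PGM ("P5") file. It assumes the picture is already available as a flat list of grey values between 0 and 255; the function names are invented for this example, and in practice an image tool would normally do the conversion for you.

```python
def stretch_contrast(pixels):
    """Linearly rescale grey levels so the darkest becomes 0 and the lightest 255."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        # A completely flat image has no contrast to stretch.
        return list(pixels)
    return [(p - lo) * 255 // (hi - lo) for p in pixels]


def to_pgm(pixels, width, height):
    """Encode grey values (0-255, row by row) as a binary PGM (P5) image."""
    if len(pixels) != width * height:
        raise ValueError("pixel count does not match width * height")
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    return header + bytes(pixels)


# A washed-out 2x2 image: every value sits between 100 and 200.
faded = [100, 150, 200, 100]
data = to_pgm(stretch_contrast(faded), 2, 2)

with open("certificate.pgm", "wb") as handle:
    handle.write(data)
```

The resulting certificate.pgm file can then be passed to a program that requires the pgm format.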