Libhtml

From Gramps
Revision as of 14:40, 3 April 2009 by Gbritton (talk | contribs) (Build up the page body by divisions)

Html class

This page contains a description of and user's guide for the Html class. This class is useful for the preparation of Html pages and is found in the src/plugins/lib directory.

Standard XHTML Template

A standard XHTML page is constructed using a framework like this (taken from the XHTML standard template):

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">

<head>
<title>An XHTML 1.0 Strict standard template</title>
<meta http-equiv="content-type"
content="text/html;charset=utf-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
</head>

<body>

<p>… Your HTML content here …</p>

</body>
</html>


Most of the detail goes between the <body> and </body> tags, of course. In the body, there may be other sections, possibly nested within each other. For example, there might be paragraphs inside list elements inside tables inside divisions, and so on. Because of this, it is helpful to view an XHTML document as an n-way tree. The overall document is the highest level and has, in the example above, three sections: the <?xml...> declaration, the <!DOCTYPE...> declaration and <html>...</html>. Inside the <html> tags are two nested sections, the <head> and <body>. Normally, inside the <body> would be other (perhaps many other) sections.
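One way to picture this tree in Python is as nested lists, where each inner list is one element together with its children. The sketch below only illustrates that idea and is not code from libhtml; the flatten helper and the sample document are hypothetical names.

```python
def flatten(node):
    """Yield every string in a nested list, depth-first, in document order."""
    for item in node:
        if isinstance(item, list):
            # A nested list is a child element: descend into it.
            for text in flatten(item):
                yield text
        else:
            yield item


# Each inner list holds an opening tag, the element's children and a closing tag.
document = [
    '<html>',
    ['<head>', ['<title>', 'Example', '</title>'], '</head>'],
    ['<body>', ['<p>', 'Hello', '</p>'], '</body>'],
    '</html>',
]

markup = ''.join(flatten(document))
```

Because every element carries its own closing tag, the traversal never has to remember which tags are still open.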

Standard XHTML Template with Html class

When actually writing an XHTML page from Python, you need to keep track of where you are so that you can properly close each element that you open. With a complicated document, this is tricky and fragile. This is what prompted me to write an Html class that can manage the trickiness and hopefully increase the robustness of the result. The idea is to store the page data in a series of nested lists, then provide the means to extract them in proper order (basically an insertion-order tree traversal) for output. Let's start with an example. A similar, standard template can be generated with the Html class like this:

1. from libhtml import Html, _XMLNS

2. _LANG = 'xml:lang="en-CA" lang="en-CA"'
3. _META1 = 'http-equiv="content-type" content="text/html;charset=utf-8"'
4. _META2 = 'http-equiv="Content-Style-Type" content="text/css"'

5. p = Html('html', indent=False, xmlns=_XMLNS, attr=_LANG)
6. p.addXML()
7. p.addDOCTYPE()
8. head = Html('head', indent=False) + (
9.    Html('title','Mytitle', inline=True,indent=True),
10.   Html('meta', attr=_META1, indent=True,inline=True,close=False),
11.   Html('meta', attr=_META2, indent=True,inline=True,close=False)
12.    )
13. p+=head
14. body = Html('body', indent=False)
15. p+=body

Let's walk through this example by numbered line:

1. Import the Html class.

2-4. Set up some constants to be used later.

5. Instantiate a new Html object. The first, positional argument is the tag type, which defaults to 'html'. The xmlns and attr keyword arguments both end up as attributes of the generated tag: xmlns is not specifically recognized by the constructor and is passed through as an attribute, while the attr string is inserted as is. For example, if I just say:

>>> print Html('html', xmlns=_XMLNS, attr=_LANG)
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-CA" lang="en-CA"></html>
>>>

you can see that it produces a standard <html...> tag and closing tag. So far, so good. In lines 6 and 7, XML and DOCTYPE statements are added. I can print it after line 7:

>>> print p
<?xml version="1.0" encoding="UTF-8" standalone="no"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"  
xml:lang="en-CA" lang="en-CA"></html>
>>>

You can see that it looks messy! In fact, all that is happening is the Html class is converting the contents to a single string (including some tabs and line-ends) and printing it on the terminal. The class has a method that makes this look better:

>>> p.write()
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-CA" lang="en-CA">
</html>
>>>

Much neater! Note that the default action of the write() method is to print the results on the terminal. The write() method also takes a function as an argument (which can be a lambda function) which is handy for writing to a file.

Lines 8-12 illustrate the heart of what the class does. In line 8, we begin a <head> tag by instantiating a new object -- Html('head'). To that is added a tuple consisting of three new Html instances for the <title> and two <meta> tags. The Html class overloads the "+" operator and adds its special magic. It places whatever is added between the opening and closing tags of the tag to which the new elements are being added. After these lines, we can see the intermediate output:

>>> head.write()
<head>
    <title>Mytitle</title>
    <meta http-equiv="content-type" content="text/html;charset=utf-8" />
    <meta http-equiv="Content-Style-Type" content="text/css" />
</head>
>>>

Notice how it neatly puts the contents of the header between the <head> and </head> tags? Also, notice how it formats them. This is the consequence of the keyword parameters on the objects "inside" the header:

indent=True - means indent this element (and any elements within it) by adding a tab stop (\t) to the start of the line when write() is called. The default action is to indent (but would still respect any indentation of its parent element, if there is one).

inline=True - means print this element (and any elements within it) on one line. The default is to print the tags and contents on separate lines.

close=False - means that this tag does not need a closing tag. (Note that this is the default for <meta> tags and thus close=False is not strictly needed here. There is a list of tags that fall into that category, including <br />, <link /> and others.)

Now, to include the header in my page I execute line 13:

p += head

This shows that the in-place addition operator ("+=") is also overloaded by the Html class. The result is to add the header to the page in the proper place:

>>> p += head
>>> p.write()
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-CA" lang="en-CA">
<head>
    <title>Mytitle</title>
    <meta http-equiv="content-type" content="text/html;charset=utf-8" />
    <meta http-equiv="Content-Style-Type" content="text/css" />
</head>
</html>
>>>

The Html class took the header and added it inside the page, which means inside the <html> </html> tags. At this point, I could continue to add to the header:

>>> head += Html('link',rel="stylesheet", href="../styles/calendar-screen.css",
...             type="text/css", media="screen",indent=True)

Which gives me:

>>> head.write()
<head>
    <title>Mytitle</title>
    <meta http-equiv="content-type" content="text/html;charset=utf-8" />
    <meta http-equiv="Content-Style-Type" content="text/css" />
    <link media="screen" href="../styles/calendar-screen.css" rel="stylesheet" type="text/css" />
</head>
>>>

See how it tucked the new <link> element inside the <head> element? See how it passed the keyword arguments into the Html element and auto-closed it (since it is in the "special" list)? Also, since the head object is still part of the p object:

>>> p.write()
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-CA" lang="en-CA">
<head>
    <title>Mytitle</title>
    <meta http-equiv="content-type" content="text/html;charset=utf-8" />
    <meta http-equiv="Content-Style-Type" content="text/css" />
    <link media="screen" href="../styles/calendar-screen.css" rel="stylesheet" type="text/css" />
</head>
</html>
>>>

The new link element appears where it should in the output. An additional feature automatically takes positional elements in the constructor and places them between the opening and closing tags. A hyperlink is a good example of this:

>>> print Html('a', "cnn", href="http://cnn.com", inline=True)
<a href="http://cnn.com">cnn</a>
>>>

You could also achieve the same effect this way:

>>> print Html('a',href="http://cnn.com", inline=True)+"cnn"
<a href="http://cnn.com">cnn</a>
>>>

The string "cnn" is inserted into the proper place between the start <a> and </a> tags. Which one you use is a matter of preference, but I would suggest using the first form when the text is short (that is, the whole expression fits comfortably on an 80-character line in your source) and using the second when it is long, so:

Html('a','cnn',href="http://cnn.com", inline=True)

is fine but

Html('a',"CNN.com is among the world's leaders in online news and information delivery. Staffed 24 hours, seven
days a week by a dedicated staff in CNN's world headquarters in Atlanta, Georgia, and in bureaus worldwide,  
CNN.com relies heavily on CNN's global team of almost 4,000 news professionals. CNN.com features the latest 
multimedia technologies, from live video streaming to audio packages to searchable archives of news features and 
background information. The site is updated continuously throughout the day.", href="http://cnn.com", inline=True)

would not be recommended.

Notice that keyword arguments (other than indent, inline, close and attr) are passed into the opening tag and positional arguments are placed between the tags. What if you have a keyword argument that could be confused with a Python keyword? "class" is a good example of this:

>>> print Html('a', "cnn", href="http://cnn.com", inline=True, class="myclass")
  File "<stdin>", line 1
    print Html('a', "cnn", href="http://cnn.com", inline=True, class="myclass")
                                                                     ^
SyntaxError: invalid syntax

The way to get around this is to append an underscore ("_") to the keyword:

>>> print Html('a', "cnn", href="http://cnn.com", inline=True, class_="myclass")
<a class="myclass" href="http://cnn.com">cnn</a>
>>>

The trailing underscore is stripped before adding it to the other attributes. What if your keyword violates Python syntax? For example:

>>> print Html('meta', http-equiv="Content-Style-Type")
  File "<stdin>", line 1
SyntaxError: keyword can't be an expression

Fails. But you can use the special keyword "attr" instead:

>>> print Html('meta', attr='http-equiv="Content-Style-Type"')
<meta http-equiv="Content-Style-Type" />
>>>

You would do the same thing with other attributes, such as "xml:lang". Practically speaking, there are currently only a handful of keywords that conflict with Python keywords or are syntactically invalid in Python.
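The keyword handling described above can be sketched as a small standalone function. This is not the libhtml implementation: open_tag is a hypothetical name, and sorting the keywords is an assumption made here only so that the attribute order is predictable.

```python
def open_tag(tag, attr='', **keywords):
    """Build an opening tag from keyword attributes plus a raw attribute string."""
    parts = [tag]
    # Sorted so that the attribute order is predictable in this sketch.
    for name in sorted(keywords):
        # Trailing underscores are stripped, so class_ becomes class.
        parts.append('%s="%s"' % (name.rstrip('_'), keywords[name]))
    if attr:
        # The raw string covers names Python cannot express, such as xml:lang.
        parts.append(attr)
    return '<%s>' % ' '.join(parts)


link = open_tag('a', href='http://cnn.com', class_='myclass')
root = open_tag('html', attr='xml:lang="en-CA" lang="en-CA"')
```

Here link holds an <a> tag with class and href attributes, and root carries the xml:lang attribute exactly as it was written in the attr string.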

Using the Html class

The Html class extends the built-in "list" class and so any methods you can use on a list will work with Html as well. The methods described below are those that have been specially enhanced to support the generation of HTML documents:

Object instantiation

Html(tag='html', [arg1,...argn,]
    attr=,
    indent=True,
    inline=False,
    close=True,
    [keyword1=arg1, ..., keywordn=argn])

To instantiate a new Html object, simply assign it to some variable:

mypage = Html()

The constructor accepts the following arguments:

tag
The HTML tag to be used. This defaults to 'html'. The constructor does not validate the tag, so anything is accepted.
arg1...argn
Optional positional arguments. These arguments will be copied as is between the opening and closing tags.
attr
Tag attributes that violate Python syntax and so cannot be passed as "keyword=arg" (see below). An example would be the xml:lang attribute, which will cause a syntax error if given as a keyword parameter.
indent
Indent this object with respect to its parent, if parent exists. Default = True. Note that the indentation is cumulative. That is, the indentation of this object will be the sum of its indentation and that of its parent, if the parent is indented. This is useful for producing human-readable output though it should have no effect on browsers rendering the page.
inline
Instruct the write() method to output the tag and all its contents (including any nested elements) as one string. Default is False which means call the output method once for the beginning and ending tags and once for each element contained therein, including sub-elements.
close
This element should be closed normally (e.g. <tag>...</tag>). Default is True.
keyword1=arg1...
Any other keywords are passed into the tag. That is, they are assumed to be tag attributes.
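The cumulative indentation described for the indent argument can be illustrated with a short recursive sketch. It assumes a simplified, hypothetical model in which each node is a pair of an indent flag and a list of children; it is not how libhtml stores its data.

```python
def render(node, depth=0):
    """Return output lines for a node; each indented level adds one tab."""
    indent, children = node
    if indent:
        depth += 1
    lines = []
    for child in children:
        if isinstance(child, tuple):
            # A nested node: its own indentation adds to the parent's.
            lines.extend(render(child, depth))
        else:
            lines.append('\t' * depth + child)
    return lines


paragraph = (True, ['<p>text</p>'])
division = (True, ['<div>', paragraph, '</div>'])
body = (False, ['<body>', division, '</body>'])

lines = render(body)
```

The paragraph ends up two tabs deep because both it and its parent division ask to be indented, while the body itself is not indented at all.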

Note 1: There are several tags that are not normally closed. These include: area, base, br, frame, hr, image, input, link, meta and param. The constructor recognizes these and assumes close=False, which means that the tag is written in self-closing form rather than with a separate closing tag. For example, "meta" is automatically closed like this:

<meta ... />

Note 2: At least one typical tag attribute, "class=" conflicts with a Python keyword. To circumvent this, the constructor will see if a keyword has a trailing underscore ("_"). If so, it will strip the underscore and pass the resulting keyword assignment into the tag.
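The rule in Note 1 can be sketched as follows. The tag names are copied from the note; SELF_CLOSING and render_empty are hypothetical names, and treating close=None as "not specified" is an assumption made for this illustration.

```python
# Tag names copied from Note 1; the name of the set is hypothetical.
SELF_CLOSING = {'area', 'base', 'br', 'frame', 'hr', 'image',
                'input', 'link', 'meta', 'param'}


def render_empty(tag, attr='', close=None):
    """Render an element that has no content, self-closing it where appropriate."""
    if close is None:
        # No explicit choice: consult the list of tags that are not normally closed.
        close = tag not in SELF_CLOSING
    start = '<%s %s' % (tag, attr) if attr else '<' + tag
    if close:
        return '%s></%s>' % (start, tag)
    return start + ' />'


meta = render_empty('meta', 'http-equiv="Content-Style-Type"')
para = render_empty('p')
```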

Extending your Html objects

The Html class overloads the typical list methods append(), extend(), "+" and "+=" to enable you to easily build up your page. All of the following examples are valid uses of Html objects. Try them in an interactive Python session and use <object>.write() to view the results:

foo = Html(indent=False)
bar = Html('head')
foobar = foo + bar
bar += Html('meta')
bar.extend('meta',bob="your uncle")
bar.append('useless text in header')
opa = Html('body')
foobar += opa
opa = opa + Html('div', id="my div", class_="my class")
opa += [
    'text beginning',
    Html('a','imbedded href',href="http://w3c.org"),
    "text following"
    ]
p = Html('p')+ ['the quality of mercy is not strained','it falleth as the gentle dew from heaven']

Note: The Html class does not implement list operations by index (e.g. list[1]) or slice (e.g. list[1:2]). Use these at your own risk as they may cause unpredictable results.
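To see how a list subclass can place added content between its opening and closing tags, consider the minimal sketch below. Element is a hypothetical stand-in, not the Html class itself, and having "+" modify the left operand in place is an assumption of this sketch rather than documented behaviour.

```python
class Element(list):
    """Hypothetical stand-in: a list holding [open tag, ...content, close tag]."""

    def __init__(self, tag):
        list.__init__(self, ['<%s>' % tag, '</%s>' % tag])

    def __iadd__(self, content):
        if isinstance(content, Element) or not isinstance(content, (list, tuple)):
            # A single child: one nested element or one string.
            content = [content]
        # Splice the new items in just before the closing tag.
        self[-1:-1] = content
        return self

    def __add__(self, content):
        # In this sketch "+" behaves like "+=" and returns the same object.
        return self.__iadd__(content)


head = Element('head')
page = Element('html') + head
# Added after head was placed in page, yet visible through page as well.
head += Element('title') + 'Mytitle'
```

Because page stores a reference to head rather than a copy, later additions to head show up when page is traversed, which matches the behaviour shown earlier with the <link> element.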

Of course, if you like you can add preformatted tags:

    page += '<ul><li>one</li><li>two</li></ul>'

since the object derives from list, but then you are responsible for properly closing your tags.

However, if the tag argument begins with a "<", the constructor assumes that it is a preformatted tag:

page += Html('<title>My Title</title>')

The Html class also overloads the list method remove() and adds the "-" and "-=" operators, allowing you to remove elements from an object. For example:

    page = Html()
    head = Html('head')
    title = Html('title','my title')
    head += title
    page += head
    newtitle = Html('title','new title')
    head.remove(title)
    page -= head

The Html class also adds a replace() method, which the built-in list does not have, allowing you to replace elements. For example:

   page = Html()
   head = Html('head')
   title = Html('title','my title')
   head += title
   page += head
   newtitle = Html('title','new title')
   head.replace(title,newtitle)
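On a plain Python list, the effect of such a replacement can be sketched like this. The function below is a hypothetical illustration, not the Html.replace() method, and how the real method behaves when the old element is missing is not covered here; this sketch simply lets list.index() raise ValueError.

```python
def replace(container, old, new):
    """Swap the first occurrence of old for new, keeping its position."""
    container[container.index(old)] = new


head = ['<head>', '<title>my title</title>', '</head>']
replace(head, '<title>my title</title>', '<title>new title</title>')
```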

Writing your output as and where you wish

The Html class adds a new method, write() which is to be used when you are ready to output your finished page. If called with no parameters, it simply prints the contents of the page on your terminal. If called with an argument, that argument must be the name of a function that is to receive the data to be output. For example:

import sys
page.write(lambda data: sys.stdout.write(data + "\n"))
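The callback pattern itself is independent of the Html class. In the sketch below, emit is a hypothetical helper that calls the supplied function once per output line, which is assumed to be roughly how a callback-style writer behaves; any callable that accepts a string will do.

```python
def emit(lines, method):
    """Call method once for each line of output."""
    for line in lines:
        method(line)


# Collect the output in memory instead of printing it.
chunks = []
emit(['<html>', '</html>'], lambda data: chunks.append(data + '\n'))
text = ''.join(chunks)
```

To write to a file instead, pass a function that wraps the write() method of an open file object in the same way.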

Building standard DOCTYPE and XML declarations

Two methods, addXML() and addDOCTYPE(), are available to build standard XML and DOCTYPE declarations and insert them in their proper places. They are invoked as follows:

addXML(version=1.0, encoding="UTF-8", standalone="no")

will build a standard XML declaration like this:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

and add it as the first element in the object. It takes three keyword parameters with the defaults as shown.

addDOCTYPE([arg1, arg2, ..., argn,]name='html', external_id=_XHTML10_STRICT)

will build a standard DOCTYPE statement like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

It takes two keyword parameters, with defaults as shown, and passes any other positional arguments into the declaration. The declaration is added as the first element, if there is no XML declaration, or the second element, if the first element is an XML declaration. This method also makes use of a variable, _XHTML10_STRICT, that is exported along with others for convenience. The full list is:

_XHTML10_STRICT = '"-//W3C//DTD XHTML 1.0 Strict//EN"\n' \
                  '\t"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"'
_XHTML10_TRANS = '"-//W3C//DTD XHTML 1.0 Transitional//EN"\n' \
                 '\t"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"'
_XHTML10_FRAME = ' "-//W3C//DTD XHTML 1.0 Frameset//EN"\n' \
                 '\t"http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd"'
_XHTML11 = '"-//W3C//DTD XHTML 1.1//EN"\n' \
           '\t"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"'
_XHTML10_BASIC = ' "-//W3C//DTD XHTML Basic 1.0//EN"\n' \
                 '\t"http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd"'
_XHTML11_BASIC = ' "-//W3C//DTD XHTML Basic 1.1//EN"\n ' \
                 '\t"http://www.w3.org/TR/xhtml-basic/xhtml-basic11.dtd"'
_XMLNS = "http://www.w3.org/1999/xhtml"

You can use these to build your own DOCTYPE statements. The last one is useful when starting a new document:

page = Html(xmlns=_XMLNS)

Note: You must explicitly import these constants from the libhtml module. For example:

from libhtml import _XMLNS
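The placement rule for the two declarations (XML declaration first, DOCTYPE first or second) can be sketched on a plain list. The function names are hypothetical, and detecting an XML declaration by its "<?xml" prefix is an assumption of this sketch.

```python
def add_xml(page, declaration):
    """Put the XML declaration at the very start of the page."""
    page.insert(0, declaration)


def add_doctype(page, doctype):
    """Put the DOCTYPE first, or second when an XML declaration leads."""
    has_xml = bool(page) and page[0].startswith('<?xml')
    page.insert(1 if has_xml else 0, doctype)


first = ['<html>', '</html>']
add_xml(first, '<?xml version="1.0"?>')
add_doctype(first, '<!DOCTYPE html>')

second = ['<html>', '</html>']
add_doctype(second, '<!DOCTYPE html>')
add_xml(second, '<?xml version="1.0"?>')
```

Both lists end up in the same order, so it does not matter which declaration is added first.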

Recommended approach

The Html class is versatile and there are many ways to use it to produce an HTML page. You can make your life easier, however, if you organize your work as described in this section.

Use Html.page to begin a new page

The Html.page function takes care of the HTML boilerplate that is needed for every valid HTML page. Start off like this:

page, head, body = Html.page(title="My title")

You will then have three Html objects to use for the rest of your work. (Insert your own title text as desired.)

Add additional tags to the header

Many pages require additional tags to be added to the <head> ... </head> section. These can easily be added like this:

head += Html('meta', name="author", content="Bill Shakespeare")
head += Html('link', media="screen", href="../my.css", type="text/css", rel="stylesheet")

Build up the page body by divisions

Try to design the body of your document as a logical arrangement of divisions that refer to CSS styles. Instantiate each division separately:

div1 = Html('div', id="first_division", class_='div1class')
div2 = Html('div', id="second_division", class_='div2class')

When you have designed your divisions, add them to the page body:

body += div1
body += div2

Or, if you have many divisions, you can add them all at once like this:

body += [div1, div2]

Note: by adding them as a list, the divisions are siblings within the body.

Add content to your divisions

If you are adding content that is more or less fixed, just add it directly:

div1 += Html('p','division 1 content')
div2 += Html('p','division 2 content')

If you are adding content that you will need to build up, first instantiate a new object and add it to your division:

table1 = Html('table')
div1 += table1

Write out your completed page

When all content has been added to your page, write it out:

page.write(myoutputmethod)

Static Methods

The Html class contains several static methods which perform simple operations that are needed for most HTML pages.

Html.head

Html.head generates an HTML <head>...</head> object, with the <meta.../> statements required by the XHTML standard. It takes three parameters:

title
Specifies the title that is to appear in the browser titlebar. It is inserted inside the tags <head><title>...</title></head>. Default is "Title".
encoding
Specifies the encoding to be used in the page. Default is "utf-8".
lang
Specifies a language code for the content of the page. Default is "en".

Html.head returns an Html object reference to the newly-built object.

Html.html

Html.html generates a basic HTML <html...></html> object with standard attributes. It takes two parameters:

xmlns
a string containing the URL to the XML namespace definition. Default is "http://www.w3.org/1999/xhtml"
lang
Specifies a language code for the content of the page. Default is "en".

Html.html returns an Html object reference to the newly-built object.

Html.doctype

Html.doctype builds and returns a string containing a standard DOCTYPE statement. It takes three parameters:

name
name of this DOCTYPE. Default is "html" which matches the XHTML standard.
public
characterization of this DOCTYPE. Default is "PUBLIC" which matches the XHTML standard.
external_id
external id of this DOCTYPE. Default is
'"-//W3C//DTD XHTML 1.0 Strict//EN"\n' \
                 '\t"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"'

which matches the XHTML standard.

Note: the external id may be overridden with any of the constants defined by the Html class as described earlier.
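As a rough illustration of how the three parameters combine, the sketch below assembles a DOCTYPE string. It is a standalone function written for this page, not the Html.doctype source; the exact spacing between the parts is an assumption.

```python
XHTML10_STRICT = ('"-//W3C//DTD XHTML 1.0 Strict//EN"\n'
                  '\t"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"')


def doctype(name='html', public='PUBLIC', external_id=XHTML10_STRICT):
    """Assemble a DOCTYPE statement from its three parts."""
    return '<!DOCTYPE %s %s %s>' % (name, public, external_id)


strict = doctype()
```

Passing a different external id string, such as one of the other constants listed earlier, changes only the last part of the statement.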

Html.xmldecl

Html.xmldecl builds a standard XML declaration which should be the first line in an HTML document according to the standard. It takes three parameters:

version
version number. Defaults to "1.0"
encoding
encoding method. Defaults to "UTF-8"
standalone
characterization of this declaration. Defaults to "no"

Html.page

Html.page is useful when starting a new page object. It uses the previously-defined static methods and builds three Html objects representing the whole document (that is, the entire page), the <head> section and the <body> section. Html.page takes three parameters:

title
Specifies the title that is to appear in the browser titlebar. It is inserted inside the tags <head><title>...</title></head>. Default is "Title".
encoding
Specifies the encoding to be used in the page. Default is "utf-8".
lang
Specifies a language code for the content of the page. Default is "en".

Html.page returns three object references:

page
reference to page object, which contains head and body objects
head
reference to head object
body
reference to body object

Properties

The Html class exposes three properties which correspond to the three sections of a standard HTML tagset and can be used to manipulate them:

tag
returns a reference to the tag defined for this object
attr
returns a reference to the attributes inside the opening tag, if any.
inside
returns a reference to the contents of the tagset as a list -- that is, whatever exists between the opening and closing tags.
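A property such as inside can be built on a list subclass with Python's property decorator. The sketch below uses a hypothetical Tagset class and is only meant to show the mechanism of getting, setting and deleting the middle of the list; it is not the Html class.

```python
class Tagset(list):
    """Hypothetical stand-in storing [open tag, ...content, close tag]."""

    def __init__(self, tag, *content):
        list.__init__(self, ['<%s>' % tag] + list(content) + ['</%s>' % tag])

    @property
    def inside(self):
        # Everything between the opening and closing tags.
        return self[1:-1]

    @inside.setter
    def inside(self, value):
        # Accept either a single item or a list of items.
        self[1:-1] = value if isinstance(value, list) else [value]

    @inside.deleter
    def inside(self):
        del self[1:-1]


p = Tagset('p', 'This is a paragraph')
p.inside = 'This is a better paragraph'
p.inside += ['THIS IS THE BEST PARAGRAPH!']
contents = p.inside
```

The "+=" form works because Python reads the property, extends the resulting list and then assigns it back through the setter.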

Examples

  • Retrieve the tag name of an Html object:
>>> html = Html('html')
>>> html
['<html>', '</html>']
>>> html.tag
'html'
>>> 
  • Change the tag name:
>>> html.tag = 'tail'
>>> html
['<tail>', '</tail>']
>>> 
  • Retrieve the tag attributes:
>>> a = Html('a', href='http://cnn.org')
>>> a
['<a href="http://cnn.org">', '</a>']
>>> a.attr
'href="http://cnn.org"'
>>>
  • Change the tag attributes:
>>> a.attr = 'href="http://gramps-project.org"'
>>> a.attr
'href="http://gramps-project.org"'
>>>
  • Extend the tag attributes:
>>> a.attr += ' id="myhref"'
>>> a.attr
'href="http://gramps-project.org" id="myhref"'
>>>
  • Delete the tag attributes:
>>> del a.attr
>>> a.attr

>>>
  • Retrieve whatever is inside a tag:
>>> p = Html('p','This is a paragraph')
>>> p.inside
['This is a paragraph']
>>> 
  • Change whatever is inside a tag:
>>> p.inside = "This is a better paragraph"
>>> p.inside
['This is a better paragraph']
>>> 
  • Extend whatever is inside a tag:
>>> p.inside += ["THIS IS THE BEST PARAGRAPH!"]
>>> p.inside
['This is a better paragraph', 'THIS IS THE BEST PARAGRAPH!']
>>> 
  • Delete whatever is inside a tag:
>>> del p.inside
>>> p.inside
[]
>>> p