What's the most pythonic XHTML/HTML parser/generator/template module that supports DOM-like access?


The first part can for the most part be done by ElementTree, but it takes a few more steps:

>>> import xml.etree.ElementTree as ET
>>> html = ET.XML('<html><head><title>Hello Page</title></head><body/></html>')
>>> head = html.find('head')
>>> head.append(ET.XML('<link type="text/css" href="main.css" rel="stylesheet" />'))
>>> title = head.find('title')
>>> title.text
'Hello Page'

The second part can be completed by creating Element objects, but you'd need to do some of your own work to make it happen the way you really want:

>>> body = html.find('body')
>>> my_h1 = ET.Element('h1', {'class': 'roflol'})
>>> my_h1.text = 'BIG TITLE!12'
>>> body.append(my_h1)
>>> ET.tostring(body)
b'<body><h1 class="roflol">BIG TITLE!12</h1></body>'

You could create a stylesheet function of your own:

>>> def stylesheet(href='', type='text/css', rel='stylesheet', **kwargs):
...     return ET.Element('link', href=href, type=type, rel=rel)
...
>>> html.find('head').append(stylesheet(href='main.css'))
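Generating a whole page from scratch works the same way with `ET.SubElement`; a minimal sketch (the tag names and text here are only illustrative):

```python
import xml.etree.ElementTree as ET

# Build the tree top-down with SubElement, then serialize it.
html = ET.Element('html')
head = ET.SubElement(html, 'head')
ET.SubElement(head, 'title').text = 'Hello'
body = ET.SubElement(html, 'body')
ET.SubElement(body, 'p').text = 'Generated with ElementTree'

print(ET.tostring(html, encoding='unicode'))
# <html><head><title>Hello</title></head><body><p>Generated with ElementTree</p></body></html>
```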

Suggestion : 2

This module defines a class HTMLParser which serves as the basis for parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.

An HTMLParser instance is fed HTML data and calls handler methods when start tags, end tags, text, comments, and other markup elements are encountered. The user should subclass HTMLParser and override its methods to implement the desired behavior.

The HTMLParser class uses the SGML syntactic rules for processing instructions. An XHTML processing instruction using the trailing '?' will cause the '?' to be included in data.

get_starttag_text() returns the text of the most recently opened start tag. This should not normally be needed for structured processing, but may be useful in dealing with HTML "as deployed" or for re-generating input with minimal changes (whitespace between attributes can be preserved, etc.).

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
from html.parser import HTMLParser
from html.entities import name2codepoint

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print("     attr:", attr)

    def handle_endtag(self, tag):
        print("End tag  :", tag)

    def handle_data(self, data):
        print("Data     :", data)

    def handle_comment(self, data):
        print("Comment  :", data)

    def handle_entityref(self, name):
        c = chr(name2codepoint[name])
        print("Named ent:", c)

    def handle_charref(self, name):
        if name.startswith('x'):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        print("Num ent  :", c)

    def handle_decl(self, data):
        print("Decl     :", data)

parser = MyHTMLParser()
>>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
...             '"http://www.w3.org/TR/html4/strict.dtd">')
Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"
>>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
Start tag: img
     attr: ('src', 'python-logo.png')
     attr: ('alt', 'The Python logo')
>>> parser.feed('<h1>Python</h1>')
Start tag: h1
Data     : Python
End tag  : h1
>>> parser.feed('<style type="text/css">#python { color: green }</style>')
Start tag: style
     attr: ('type', 'text/css')
Data     : #python { color: green }
End tag  : style

>>> parser.feed('<script type="text/javascript">'
...             'alert("<strong>hello!</strong>");</script>')
Start tag: script
     attr: ('type', 'text/javascript')
Data     : alert("<strong>hello!</strong>");
End tag  : script

Suggestion : 3

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Tag, NavigableString, and BeautifulSoup cover almost everything you'll see in an HTML or XML file, but there are a few leftover bits. The only one you'll probably ever need to worry about is the comment.

The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string. When serializing, the default is formatter="minimal": strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML.
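The comment case mentioned above is easy to demonstrate; a small sketch with a made-up markup snippet:

```python
from bs4 import BeautifulSoup, Comment

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")

# The b tag's only child is a Comment, a subclass of NavigableString,
# so its text is reachable via .string without the <!-- --> delimiters.
comment = soup.b.string
print(type(comment).__name__)  # Comment
print(comment)                 # Hey, buddy. Want to buy a used parser?
```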

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# u'title'

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
# ...
from bs4 import BeautifulSoup

with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

soup = BeautifulSoup("<html>data</html>", 'html.parser')
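Since the question also asks about generation, Beautiful Soup can build and modify a tree too; a sketch with invented tag names and attribute values:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head></head><body></body></html>", "html.parser")

# new_tag creates a detached element; append attaches it, DOM-appendChild style.
link = soup.new_tag("link", rel="stylesheet", type="text/css", href="main.css")
soup.head.append(link)

h1 = soup.new_tag("h1", attrs={"class": "roflol"})
h1.string = "BIG TITLE!12"
soup.body.append(h1)

print(soup)
```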


Suggestion : 4

Among the many Jodd components available there are Lagarto, an HTML parser, and Jerry, defined as "jQuery in Java". There are even more components that can do other things. For instance, CSSelly, a parser for CSS-selector strings that powers Jerry, and StripHtml, which reduces the size of HTML documents.

There is little more to say about jsoup, because it does everything you need from an HTML parser and even more (e.g., cleaning HTML documents). It can be very concise. There are a few functions to manipulate the document and easily add or remove elements. For instance, there are functions to wrap an element inside a provided one, or to do the inverse operation.

The revival [of HtmlCleaner] has improved the quality of the code and provided accessible documentation. However, it still has an old mindset: it supports XSLT and XPath, but not CSS selectors. The first feature was very useful ten years ago, but the second is necessary for modern HTML parsing.

The documentation of Jerry is good and includes a few examples, such as the following one.

// from the documentation
public class ChangeGooglePage {
    public static void main(String[] args) throws IOException {
        // download the page super-efficiently
        File file = new File(SystemUtil.getTempDir(), "google.html");
        NetUtil.downloadFile("http://google.com", file);

        // create Jerry, i.e. document context
        Jerry doc = Jerry.jerry(FileUtil.readString(file));

        // remove div for toolbar
        doc.$("div#mngb").detach();
        // replace logo with html content
        doc.$("div#lga").html("<b>Google</b>");

        // produce clean html...
        String newHtml = doc.html();
        // ...and save it to file system
        FileUtil.writeString(
            new File(SystemUtil.getTempDir(), "google2.html"),
            newHtml);
    }
}

The documentation offers a few examples and an API reference, but nothing more. The following example comes from it.

HtmlCleaner cleaner = new HtmlCleaner();
final String siteUrl = "http://www.themoscowtimes.com/";

TagNode node = cleaner.clean(new URL(siteUrl));

// traverse whole DOM and update images to absolute URLs
node.traverse(new TagNodeVisitor() {
   public boolean visit(TagNode tagNode, HtmlNode htmlNode) {
      if (htmlNode instanceof TagNode) {
         TagNode tag = (TagNode) htmlNode;
         String tagName = tag.getName();
         if ("img".equals(tagName)) {
            String src = tag.getAttributeByName("src");
            if (src != null) {
               tag.setAttribute("src", Utils.fullUrl(siteUrl, src));
            }
         }
      } else if (htmlNode instanceof CommentNode) {
         CommentNode comment = ((CommentNode) htmlNode);
         comment.getContent().append(" -- By HtmlCleaner");
      }
      // tells visitor to continue traversing the DOM tree
      return true;
   }
});

SimpleHtmlSerializer serializer =
   new SimpleHtmlSerializer(cleaner.getProperties());
serializer.writeToFile(node, "c:/temp/themoscowtimes.html");

In this example jsoup directly fetches an HTML document from a URL and selects a few links. Note also a nice option: the call link.attr("abs:href") returns the absolute URL even if the href attribute references a relative one. This is possible thanks to a setting that is applied implicitly when you fetch the URL with the connect method.

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();

Elements newsHeadlines = doc.select("#mp-itn b a");

print("\nLinks: (%d)", newsHeadlines.size());
for (Element link : newsHeadlines) {
   print(" * a: <%s>  (%s)", link.attr("abs:href"), trim(link.text(), 35));
}
If you need things like XPath, HtmlAgilityPack should be your best choice. In other cases, I do not think it is the best option right now, unless you are already using it.

// Load an HTML document
var url = "http://html-agility-pack.net/";
var web = new HtmlWeb();
var doc = web.Load(url);

// Get value with XPath
var value = doc.DocumentNode
   .SelectNodes("//td/input")
   .First()
   .Attributes["value"].Value;
The standard Python library is quite rich and even implements an HTML parser. The bad news is that the parser works like a simple, traditional parser, so there are no advanced functionalities geared toward handling HTML. The parser essentially makes available a visitor with basic functions for handling the data inside tags, and the beginning and ending of tags.

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')


Suggestion : 5

The XML syntax for HTML was formerly referred to as "XHTML", but this specification does not use that term (among other reasons, because no such term is used for the HTML syntaxes of MathML and SVG).

HTML, its supporting DOM APIs, as well as many of its supporting technologies, have been developed over a period of several decades by a wide array of people with different priorities who, in many cases, did not know of each other's existence.

Implementations that support the XML syntax for HTML must support some version of XML, as well as its corresponding namespaces specification, because that syntax uses an XML serialization with namespaces. [XML] [XMLNS]

For example, suppose a page looked at its URL's query string to determine what to display, and the site then redirected the user to that page to display a message, as in:

   <li><a href="message.cgi?say=Hello">Say Hello</a>
   <li><a href="message.cgi?say=Welcome">Say Welcome</a>
   <li><a href="message.cgi?say=Kittens">Say Kittens</a>

If the message was just displayed to the user without escaping, a hostile attacker could then craft a URL that contained a script element:

https://example.com/message.cgi?say=%3Cscript%3Ealert%28%27Oh%20no%21%27%29%3C/script%3E
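On the Python side, this kind of injection is what the standard library's html.escape guards against; a minimal sketch:

```python
from html import escape

say = "<script>alert('Oh no!')</script>"

# escape() replaces &, <, > (and both quote characters by default),
# so the payload renders as inert text instead of executing.
print(escape(say))
```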

Here, the author uses the onload handler on an img element to catch the load event:

<img src="games.png" alt="Games" onload="gamesLogoHasLoaded(event)">

If the element is being added by script, then so long as the event handlers are added in the same script, the event will still not be missed:

   var img = new Image();
   img.src = 'games.png';
   img.alt = 'Games';
   img.onload = gamesLogoHasLoaded;
   // img.addEventListener('load', gamesLogoHasLoaded, false); // would work also

However, if the author first created the img element and then in a separate script added the event listeners, there's a chance that the load event would be fired in between, leading it to be missed:

<!-- Do not use this style, it has a race condition! -->
<img id="games" src="games.png" alt="Games">
<!-- the 'load' event might fire here while the parser is taking a
      break, in which case you will not see it! -->
   var img = document.getElementById('games');
   img.onload = gamesLogoHasLoaded; // might never fire!

For example, the following markup fragment results in a DOM with an hr element that is an earlier sibling of the corresponding table element:

<table><hr>...
For example, the following markup results in poor performance, since all the unclosed i elements have to be reconstructed in each paragraph, resulting in progressively more elements in each paragraph:

<p><i>She dreamt.
      <p><i>She dreamt that she ate breakfast.
            <p><i>Then lunch.
                  <p><i>And finally dinner.

In this fragment, the attribute's value is "?bill&ted":

<a href="?bill&ted">Bill and Ted</a>

In the following fragment, however, the attribute's value is actually "?art©", not the intended "?art&copy", because even without the final semicolon, "&copy" is handled the same as "&copy;" and thus gets interpreted as "©":

<a href="?art&copy">Art and Copy</a>

Thus, the correct way to express the above cases is as follows:

<a href="?bill&ted">Bill and Ted</a> <!-- &ted is ok, since it's not a named character reference -->
<a href="?art&amp;copy">Art and Copy</a> <!-- the & has to be escaped, since &copy is a named character reference -->
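Python's html.unescape implements these same HTML5 named-reference rules, so both cases can be checked directly:

```python
from html import unescape

# "&ted" is not a named character reference, so it is left alone.
print(unescape("?bill&ted"))   # ?bill&ted

# "&copy" is recognized even without the trailing ";" and becomes "©".
print(unescape("?art&copy"))   # ?art©
```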

For example, it is unclear whether the author intended the following to be an h1 heading or an h2 heading:

<h1>Contact details</h2>