You can specify the encoding when calling tostring():
>>> from lxml.html import fromstring, tostring
>>> s = 'Actress Adamari López And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts™ Website And Resources'
>>> div = fromstring(s)
>>> print(tostring(div, encoding='unicode'))
<p>Actress Adamari López And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts™ Website And Resources</p>
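Note that the encoding argument also controls the return type. A minimal sketch of the difference (plain lxml, no extra assumptions): encoding='unicode' serialises to a Python str, while a byte encoding such as 'utf-8' serialises to bytes.

from lxml.html import fromstring, tostring

frag = fromstring('<p>Myths Or Facts™</p>')

# encoding='unicode' gives back a text string (str)
print(tostring(frag, encoding='unicode'))

# a byte encoding such as 'utf-8' gives back encoded bytes instead
print(tostring(frag, encoding='utf-8'))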
lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML).

Parsers are represented by parser objects. There is support for parsing both XML and (broken) HTML. Note that XHTML is best parsed as XML; parsing it with the HTML parser can lead to unexpected results. iterparse() also supports the tag argument for selective event iteration and several other parameters that control the parser setup. The tag argument can be a single tag or a sequence of tags, and you can also use iterparse() to parse HTML input by passing html=True. Here is a simple example for parsing XML from an in-memory string:
>>> from lxml import etree
>>> from io import StringIO, BytesIO
>>> xml = '<a xmlns="test"><b xmlns="test" /></a>'
>>> root = etree.fromstring(xml)
>>> etree.tostring(root)
b'<a xmlns="test"><b xmlns="test" /></a>'
>>> tree = etree.parse(StringIO(xml))
>>> etree.tostring(tree.getroot())
b'<a xmlns="test"><b xmlns="test" /></a>'
>>> tree = etree.parse("doc/test.xml")
>>> root = etree.fromstring(xml, base_url="http://where.it/is/from.xml")
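As a small, hedged illustration of the iterparse() behaviour described above (the XML string and the tag filter here are made up for the example):

from io import BytesIO
from lxml import etree

data = b'<root><a>one</a><b>two</b><a>three</a></root>'

# only events for <a> elements are reported because of the tag filter
for event, element in etree.iterparse(BytesIO(data), events=('end',), tag='a'):
    print(event, element.tag, element.text)
    element.clear()  # free memory for elements we are done with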
This is a short tutorial for using xml.etree.ElementTree (ET for short). The goal is to demonstrate some of the building blocks and basic concepts of the module.

The module provides limited support for XPath expressions for locating elements in a tree. The goal is to support a small subset of the abbreviated syntax; a full XPath engine is outside the scope of the module. There is also a dump() helper that writes an element tree or element structure to sys.stdout; it should be used for debugging only.

To process the file below, load it as usual and pass the root element to the xml.etree.ElementTree module:
<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E" />
        <neighbor name="Switzerland" direction="W" />
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N" />
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W" />
        <neighbor name="Colombia" direction="E" />
    </country>
</data>
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
root = ET.fromstring(country_data_as_string)
>>> root.tag
'data'
>>> root.attrib
{}
>>> for child in root:
...     print(child.tag, child.attrib)
...
country {'name': 'Liechtenstein'}
country {'name': 'Singapore'}
country {'name': 'Panama'}
>>> root[0][1].text
'2008'
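The limited XPath support mentioned above covers simple path expressions and attribute predicates; a minimal sketch against the root parsed from country_data.xml:

# every <neighbor> element anywhere in the tree
for neighbor in root.iter('neighbor'):
    print(neighbor.get('name'))

# abbreviated XPath subset: child path plus an attribute predicate
for year in root.findall("./country[@name='Singapore']/year"):
    print(year.text)  # 2011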
If you can, I recommend you install and use lxml for speed. If you're using a very old version of Python – earlier than 3.2.2 – it's essential that you install lxml or html5lib, because Python's built-in HTML parser is just not very good in those old versions.

If you don't specify anything, you'll get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as the best, then html5lib's, then Python's built-in parser. You can override this by passing the name of the parser library you want to use; currently supported options are "lxml", "html5lib", and "html.parser" (Python's built-in HTML parser). Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does; depending on your setup, you might install lxml or html5lib with pip or your system package manager.
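For example, the parser can be chosen explicitly when constructing the soup (the "lxml", "html5lib", and "xml" options assume the corresponding packages are installed):

from bs4 import BeautifulSoup

markup = "<a><b />"

BeautifulSoup(markup, "html.parser")  # Python's built-in parser
BeautifulSoup(markup, "lxml")         # lxml's HTML parser
BeautifulSoup(markup, "html5lib")     # pure-Python, browser-like parsing
BeautifulSoup(markup, "xml")          # lxml's XML parser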
html_doc = """<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
soup.title
# <title>The Dormouse's story</title>
soup.title.name
# u'title'
soup.title.string
# u'The Dormouse's story'
soup.title.parent.name
# u'head'
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']
# u'title'
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
from bs4 import BeautifulSoup
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

soup = BeautifulSoup("<html>a web page</html>", 'html.parser')
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree, and it commonly saves programmers hours or days of work. Beautiful Soup supports the HTML parser included in Python's standard library, but it also supports a number of third-party Python parsers, such as lxml and the pure-Python html5lib parser, which parses HTML the way a web browser does; depending on your setup, you might install either of them with pip. Beautiful Soup's main strength is in searching the parse tree, but you can also modify the tree and write your changes as a new HTML or XML document.
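A minimal sketch of that last point, modifying the tree and serialising the changed document (the markup and the class name here are illustrative):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="story">Once upon a time...</p>', 'html.parser')

tag = soup.p
tag['class'] = 'moral'    # change an attribute
tag.string = 'The end.'   # replace the tag's contents

print(soup)  # <p class="moral">The end.</p>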
The real-world snippets below show lxml.html.fromstring() and lxml.html.tostring() in use:
def __init__(self, file_name, user_id):
    with open(file_name, 'r') as self.opened_file:
        # So Instapaper doesn't close <li> tags
        # This was causing infinite recursion when using BS directly
        # Hence why the stuff below is being done, so that the <li> tags get closed
        self.html = html.document_fromstring(self.opened_file.read())
        self.html = html.tostring(self.html)
    self.soup = BeautifulSoup4(self.html)
    self.user = user_id
    self.urls = dict()
    self.check_duplicates = dict()
    self.check_duplicates_query = Bookmark.query.filter(Bookmark.user == self.user,
                                                        Bookmark.deleted == False).all()
    for bmark in self.check_duplicates_query:
        self.check_duplicates[bmark.main_url] = bmark
    self.tags_dict = dict()
    self.tags_set = set()
    self.valid_url = re.compile(
        r'^(?:[a-z0-9\.\-]*)://'
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}(?<!-)\.?)|'
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|'
        r'\[?[A-F0-9]*:[A-F0-9:]+\]?)'
        r'(?::\d+)?'
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)
def content(self):
    """
    :returns: The text body of the message.
    """
    # The code that follows is obviously pretty disgusting.
    # It seems like it might be impossible to completely replicate
    # the text of the original message if it has trailing whitespace
    message = self._content_xpb.one_(self._message_element)
    first_line = message.text
    if message.text[:2] == '  ':  # two leading spaces
        first_line = message.text[2:]
    else:
        log.debug("message did not have expected leading whitespace")
    subsequent_lines = ''.join([
        html.tostring(child, encoding='unicode').replace('<br>', '\n')
        for child in message.iterchildren()
    ])
    message_text = first_line + subsequent_lines
    if len(message_text) > 0 and message_text[-1] == ' ':
        message_text = message_text[:-1]
    else:
        log.debug("message did not have expected trailing whitespace")
    return message_text
def from_text(txt):
    def replace(match):
        txt = match.group()
        if '\n' in txt:
            return '<br>' * txt.count('\n')
        else:
            # non-breaking spaces keep runs of whitespace visible in HTML
            return '&nbsp;' * txt.count(' ')
    tpl = '<p>%s</p>'
    htm = escape(txt)
    htm = fromstring(tpl % htm)
    fix_links(htm)
    htm = tostring(htm, encoding='unicode')
    htm = htm[3:-4]  # strip the wrapping <p> and </p>
    htm = re.sub('(?m)((\r?\n)+| [ ]+|^ )', replace, htm)
    htm = tpl % htm
    return htm
def try_justext(tree, url, target_language):
    '''Second safety net: try with the generic algorithm justext'''
    result_body = etree.Element('body')
    justtextstring = html.tostring(tree, pretty_print=False, encoding='utf-8')
    # determine language
    if target_language is not None and target_language in JUSTEXT_LANGUAGES:
        langsetting = JUSTEXT_LANGUAGES[target_language]
        justext_stoplist = justext.get_stoplist(langsetting)
    else:
        # justext_stoplist = justext.get_stoplist(JUSTEXT_DEFAULT)
        justext_stoplist = JT_STOPLIST
    # extract
    try:
        paragraphs = justext.justext(justtextstring, justext_stoplist, 50, 200, 0.1, 0.2, 0.2, 200, True)
    except ValueError as err:  # not an XML element: HtmlComment
        LOGGER.error('justext %s %s', err, url)
        result_body = None
    else:
        for paragraph in paragraphs:
            if not paragraph.is_boilerplate:
                # if duplicate_test(paragraph) is not True:
                elem = etree.Element('p')
                elem.text = paragraph.text
                result_body.append(elem)
    return result_body
def ingest(self, file_path):
    """Ingestor implementation."""
    file_size = self.result.size or os.path.getsize(file_path)
    if file_size > self.MAX_SIZE:
        raise ProcessingException("XML file is too large.")
    try:
        doc = etree.parse(file_path)
    except (ParserError, ParseError):
        raise ProcessingException("XML could not be parsed.")
    text = self.extract_html_text(doc.getroot())
    transform = etree.XSLT(self.XSLT)
    html_doc = transform(doc)
    html_body = html.tostring(html_doc, encoding=str, pretty_print=True)
    self.result.flag(self.result.FLAG_HTML)
    self.result.emit_html_body(html_body, text)
def clean_html(context, data):
    """Clean an HTML DOM and store the changed version."""
    doc = _get_html_document(context, data)
    if doc is None:
        context.emit(data=data)
        return
    remove_paths = context.params.get('remove_paths')
    for path in ensure_list(remove_paths):
        for el in doc.xpath(path):
            el.drop_tree()
    html_text = html.tostring(doc, pretty_print=True)
    content_hash = context.store_data(html_text)
    data['content_hash'] = content_hash
    context.emit(data=data)
Powerful and Pythonic XML processing library combining libxml2/libxslt with the ElementTree API.
merged_css_file.close()
merged_css_data.close()
# make actions independent from HTML content but required for correct work of scripts
makeCustomActions(input_html_dir, output_html_dir)
# remove comments from html
# iterate through a copy of the list since we need to remove elements from the original
for element in list(doc.getroot().iter(Comment)):
    # print html.tostring(element)
    element.getparent().remove(element)
# create new html file
new_html_path = os.path.join(output_html_dir, os.path.split(html_filename)[1])
html_file = open(new_html_path, 'w')
print >> html_file, doc.docinfo.doctype
print >> html_file, html.tostring(doc, pretty_print=True, include_meta_content_type=True, encoding='utf-8')
html_file.close()
return SUCCESS_CODE
if cid_mapping and message_data.get('body'):
    root = lxml.html.fromstring(tools.ustr(message_data['body']))
    postprocessed = False
    for node in root.iter('img'):
        if node.get('src', '').startswith('cid:'):
            cid = node.get('src').split('cid:')[1]
            attachment = cid_mapping.get(cid)
            if not attachment:
                attachment = fname_mapping.get(node.get('data-filename'), '')
            if attachment:
                attachment.generate_access_token()
                node.set('src', '/web/image/%s?access_token=%s' % (attachment.id, attachment.access_token))
                postprocessed = True
    if postprocessed:
        body = lxml.html.tostring(root, pretty_print=False, encoding='UTF-8')
        message_data['body'] = body
return m2m_attachment_ids
    timeout=600.,))
final_url = urlsplit(out['url'])._replace(query='', fragment='').geturl()
# Ensure there are no scripts to be executed.
out['html'] = w3lib.html.remove_tags_with_content(out['html'], ('script',))
root = html.fromstring(out['html'], parser=html.HTMLParser(), base_url=final_url)
try:
    head = root.xpath('./head')[0]
except IndexError:
    head = html.Element('head')
    root.insert(0, head)
if not head.xpath('./base/@href'):
    head.insert(0, html.Element('base', {'href': final_url}))
if not head.xpath('./meta/@charset'):
    head.insert(0, html.Element('meta', {'charset': 'utf-8'}))
out['html'] = html.tostring(root, encoding='utf-8', doctype='<!DOCTYPE html>')
filename = re.sub(r'[^\w]+', '_', url) + '.html'
with open(os.path.join(sites_dir, filename), 'w') as f:
    f.write(out['html'])
return filename
def load_player(self, member_url, team, char_name=None):
    """Loads player and team membership data, and adds as member to team.

    Return profile, membership
    """
    try:
        member_d = self.visit_url(member_url)
    except IOError:
        profile_name = " ".join(word.capitalize() for word in member_url.strip("/").split("/")[-1].split("-"))
        print("Page not found, constructing from {0} name and {1} charname".format(profile_name, char_name))
        # create profile and membership
        profile, created = Profile(name=profile_name, user=self.master_user), True
        profile.save()
        membership = TeamMembership(team=team, profile=profile, char_name=char_name, active=False)
        membership.save()
        return profile, membership
    if "Player not found in database" in tostring(member_d):
        print("Player not found...skipping", file=self.stdout)
        return
    info_ps = member_d.cssselect('.content-section-1 p')
    info_h3s = member_d.cssselect('.content-section-1 h3')
    profile_name = info_ps[1].text
    if char_name is None:
        char_name = info_ps[4].text
        if "." in char_name:
            char_name = char_name.split(".", 1)[0]
    if Profile.objects.filter(name=profile_name).count():
        profile, created = Profile.objects.get(name=profile_name), False
        membership, membership_created = TeamMembership.objects.get_or_create(
            team=team, profile=profile, defaults={'char_name': char_name})
        membership.char_name = char_name
    else:
        try:
            membership = TeamMembership.objects.get(team=team, char_name=char_name)
    'spanish': 'spa',
}.get(lang, None)
if lang:
    mi.language = lang
if ebook_isbn:
    # print("ebook isbn is " + type('')(ebook_isbn[0]))
    isbn = check_isbn(ebook_isbn[0].strip())
    if isbn:
        self.cache_isbn_to_identifier(isbn, ovrdrv_id)
        mi.isbn = isbn
if subjects:
    mi.tags = [tag.strip() for tag in subjects[0].split(',')]
if desc:
    desc = desc[0]
    desc = html.tostring(desc, method='html', encoding='unicode').strip()
    # remove all attributes from tags
    desc = re.sub(r'<([a-zA-Z0-9]+)\s[^>]+>', r'<\1>', desc)
    # Remove comments
    desc = re.sub(r'(?s)<!--.*?-->', '', desc)
    mi.comments = sanitize_comments_html(desc)
return None
def get_profile_page_html(linkedin_id):
    profile_page_html = None
    while not profile_page_html:
        session = dryscrape.Session(base_url="https://www.linkedin.com/in/")
        session.visit(linkedin_id)
        profile_page_html = lxml.html.tostring(session.document())
        del session
    else:
        return profile_page_html
lxml is an XML parsing library (which also parses HTML) with a pythonic API based on ElementTree; lxml is not part of the Python standard library. parsel is a stand-alone web scraping library which can be used without Scrapy; it uses the lxml library under the hood and implements an easy API on top of it, which means Scrapy selectors are very similar in speed and parsing accuracy to lxml. BeautifulSoup is a very popular web scraping library among Python programmers which constructs a Python object based on the structure of the HTML code and also deals with bad markup reasonably well, but it has one drawback: it's slow. Note that pseudo-elements such as ::text are Scrapy-/Parsel-specific; they will most probably not work with other libraries like lxml or PyQuery.
>>> response.selector.xpath('//span/text()').get()
'good'
>>> response.xpath('//span/text()').get()
'good'
>>> response.css('span::text').get()
'good'
>>> from scrapy.selector import Selector
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').get()
'good'
>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(url='http://example.com', body=body)
>>> Selector(response=response).xpath('//span/text()').get()
'good'
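Because parsel is usable without Scrapy, as noted above, the same query also works standalone (a sketch assuming the parsel package is installed):

from parsel import Selector

body = '<html><body><span>good</span></body></html>'
sel = Selector(text=body)

print(sel.xpath('//span/text()').get())  # 'good'
print(sel.css('span::text').get())       # 'good'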
<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>
scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
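Inside the shell opened by that command, selectors can be tried against the sample page above; a brief sketch (results are what the markup listed earlier should produce):

>>> response.xpath('//title/text()').get()
'Example website'
>>> response.css('a[href*=image]::attr(href)').getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']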