Here is example code that also covers <a href>:
from urllib.parse import urljoin

from lxml import etree, html


def fix_links(content, absolute_prefix):
    """
    Rewrite relative links to be absolute links based on a given URL.

    @param content: HTML snippet as a string
    """
    if isinstance(content, bytes):
        content = content.decode("utf-8")

    content = content.strip()
    tree = html.fragment_fromstring(content, create_parent=True)

    def join(base, url):
        """Join a relative URL onto the base."""
        if not (url.startswith("/") or "://" in url):
            return urljoin(base, url)
        # Root-relative or already absolute: left unchanged
        return url

    for node in tree.xpath('//*[@src]'):
        node.set('src', join(absolute_prefix, node.get('src')))

    for node in tree.xpath('//*[@href]'):
        node.set('href', join(absolute_prefix, node.get('href')))

    return etree.tostring(tree, pretty_print=False, encoding="utf-8")
After testing Mikko Ohtamaa's answer, here are some notes. It works for many tags using lxml, but there are other cases it does not cover, such as background-image:url(xxx), so I just use a regex to substitute. Here is the solution:
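import re
# `url` is the address of the page; path-relative links get the page's directory prepended.
content = re.sub(r'(?P<left>("|\'))\s*(?P<url>(\w|\.)+(/.+?)+)\s*(?P<right>("|\'))',
                 r'\g<left>' + url[:url.rfind('/')] + r'/\g<url>\g<right>', content)
# Root-relative links get the scheme and host prepended (the first "/" at index 8 or later ends the host).
content = re.sub(r'(?P<left>("|\'))\s*(?P<url>(/.+?)+)\s*(?P<right>("|\'))',
                 r'\g<left>' + url[:url.find('/', 8)] + r'\g<url>\g<right>', content)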
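For the background-image:url(xxx) case specifically, a narrower alternative is to match only CSS url(...) tokens and resolve each one with urljoin. This is a minimal sketch, not the answer's method: the names CSS_URL and absolutize_css_urls and the example URLs are made up for illustration.

```python
import re
from urllib.parse import urljoin

# Matches url(...), with optional single or double quotes around the target.
CSS_URL = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""")


def absolutize_css_urls(css, base_url):
    """Resolve every CSS url(...) target in `css` against `base_url`."""
    def repl(match):
        quote, target = match.group(1), match.group(2)
        # urljoin leaves targets that are already absolute unchanged.
        return "url({q}{u}{q})".format(q=quote, u=urljoin(base_url, target))
    return CSS_URL.sub(repl, css)


styled = absolutize_css_urls(
    "background-image:url(img/bg.png)",
    "http://example.com/css/site.css",  # hypothetical stylesheet location
)
print(styled)
```

Because only url(...) tokens are touched, quoted strings elsewhere in the markup are left alone, which the broader quote-based regex above cannot guarantee.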
Using lxml, how do you globally replace all src attributes with an absolute link?

I'm not sure when this was added, but documents created from lxml.html.fromstring() now have a method called make_links_absolute. From the documentation:

This makes all links in the document absolute, assuming that base_href is the URL of the document. So if you pass base_href="http://localhost/foo/bar.html" and there is a link to baz.html, it will be rewritten as http://localhost/foo/baz.html.

If resolve_base_href is true, then any <base href> tag will be taken into account (just calling self.resolve_base_href()).
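As a minimal sketch of that method in use (the HTML snippet here is invented for illustration; the base URL is the one from the documentation excerpt):

```python
import lxml.html

# Hypothetical snippet with one relative href and one root-relative src.
snippet = '<div><a href="baz.html">baz</a><img src="/img/logo.png"></div>'

doc = lxml.html.fromstring(snippet)
# Rewrites the links in place, resolving each against the document URL.
doc.make_links_absolute("http://localhost/foo/bar.html")

hrefs = doc.xpath(".//a/@href")
srcs = doc.xpath(".//img/@src")
print(hrefs)
print(srcs)
```

Both href and src attributes are handled in one call, so there is no need to loop over attributes by hand.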
rewrite_links(link_repl_func, resolve_base_href=True, base_href=None): This rewrites all the links in the document using your given link replacement function. If you give a base_href value, all links will be passed in after they are joined with this URL.

autolink(): This finds anything that looks like a link (e.g., http://example.com) in the text of an HTML document, and turns it into an anchor. It avoids making bad links.

These functions will parse the HTML if it is given as a string, then return the new HTML as a string. If you pass in a document, the document will be copied (except for iterlinks()), the method performed, and the new document returned.
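A short sketch of rewrite_links, both as a method on a parsed document and as a module-level function on a string. The replacement function force_https and the example.com URLs are assumptions chosen for illustration.

```python
import lxml.html


def force_https(link):
    """Hypothetical replacement function: upgrade http:// links to https://."""
    if link.startswith("http://"):
        return "https://" + link[len("http://"):]
    return link


doc = lxml.html.fromstring(
    '<p><a href="page.html">x</a> <a href="http://example.com/a">y</a></p>'
)
# With base_href given, each link is joined with it before force_https sees it.
doc.rewrite_links(force_https, base_href="http://example.com/dir/index.html")
hrefs = doc.xpath(".//a/@href")
print(hrefs)

# Function form: a string goes in and a string comes back.
result = lxml.html.rewrite_links('<a href="http://example.com/a">y</a>', force_https)
print(result)
```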
>>> import lxml.html.usedoctest
>>> import lxml.html
>>> html = lxml.html.fromstring('''\
... <html>
... <body onload="" color="white">
... <p>Hi !</p>
... </body>
... </html>
... ''')
>>> print lxml.html.tostring(html)
<html>
<body onload="" color="white">
<p>Hi !</p>
</body>
</html>
>>> print lxml.html.tostring(html)
<html>
<body color="white" onload="">
<p>Hi !</p>
</body>
</html>
>>> print lxml.html.tostring(html)
<html>
<body color="white" onload="">
<p>Hi !</p>
</body>
</html>
>>> from lxml.html import builder as E
>>> from lxml.html import usedoctest
>>> html = E.HTML(
... E.HEAD(
... E.LINK(rel="stylesheet", href="great.css", type="text/css"),
... E.TITLE("Best Page Ever")
... ),
... E.BODY(
... E.H1(E.CLASS("heading"), "Top News"),
... E.P("World News only on this page", style="font-size: 200%"),
... "Ah, and here's some more text, by the way.",
... lxml.html.fromstring("<p>... and this is a parsed fragment ...</p>")
... )
... )
>>> print lxml.html.tostring(html)
<html>
<head>
<link href="great.css" rel="stylesheet" type="text/css">
<title>Best Page Ever</title>
</head>
<body>
<h1 class="heading">Top News</h1>
<p style="font-size: 200%">World News only on this page</p>
Ah, and here's some more text, by the way.
<p>... and this is a parsed fragment ...</p>
</body>
</html>
>>> from lxml.html import fromstring, tostring
>>> form_page = fromstring('''<html>
... <body>
... <form>
... Your name: <input type="text" name="name"> <br>
... Your phone: <input type="text" name="phone"> <br>
... Your favorite pets: <br>
... Dogs: <input type="checkbox" name="interest" value="dogs"> <br>
... Cats: <input type="checkbox" name="interest" value="cats"> <br>
... Llamas: <input type="checkbox" name="interest" value="llamas"> <br>
... <input type="submit"></form>
... </body>
... </html>''')
>>> form = form_page.forms[0]
>>> form.fields = dict(
... name='John Smith',
... phone='555-555-3949',
... interest=set(['cats', 'llamas']))
>>> print tostring(form)
<html>
<body>
<form>
Your name:
<input name="name" type="text" value="John Smith">
<br>Your phone:
<input name="phone" type="text" value="555-555-3949">
<br>Your favorite pets:
<br>Dogs:
<input name="interest" type="checkbox" value="dogs">
<br>Cats:
<input checked name="interest" type="checkbox" value="cats">
<br>Llamas:
<input checked name="interest" type="checkbox" value="llamas">
<br>
<input type="submit">
</form>
</body>
</html>
>>> from lxml.html import parse, submit_form
>>> page = parse('http://tinyurl.com').getroot()
>>> page.forms[0].fields['url'] = 'http://lxml.de/'
>>> result = parse(submit_form(page.forms[0])).getroot()
>>> [a.attrib['href'] for a in result.xpath("//a[@target='_blank']")]
['http://tinyurl.com/2xae8s', 'http://preview.tinyurl.com/2xae8s']
>>> html = '''\
... <html>
... <head>
... <script type="text/javascript" src="evil-site"></script>
... <link rel="alternate" type="text/rss" src="evil-rss">
... <style>
... body {background-image: url(javascript:do_evil)};
... div {color: expression(evil)};
... </style>
... </head>
... <body onload="evil_function()">
... <!-- I am interpreted for EVIL! -->
... <a href="javascript:evil_function()">a link</a>
... <a href="#" onclick="evil_function()">another link</a>
... <p onclick="evil_function()">a paragraph</p>
... <div style="display: none">secret EVIL!</div>
... <object> of EVIL! </object>
... <iframe src="evil-site"></iframe>
... <form action="evil-site">
... Password: <input type="password" name="password">
... </form>
... <blink>annoying EVIL!</blink>
... <a href="evil-site">spam spam SPAM!</a>
... <image src="evil!">
... </body>
... </html>'''
Post by rmcilvride » Tue Dec 04, 2018 7:54 pm , Post by Radu » Wed Dec 12, 2018 3:34 pm , Post by Radu » Tue Dec 11, 2018 1:06 pm , Post by Radu » Fri Dec 14, 2018 10:13 am
C:/Activity/Cogent/Docs/DocsOx/Source/Images/
C:/Activity/Cogent/Docs/DocsOx/Source/[document_name]/
img.src.path=${cfd}/../Images/
<imagedata fileref="cdh-prop-butlicenses.gif" align="center" />
img.src.path =
<imagedata fileref="../Images/cdh-prop-butlicenses.gif" align="center" />
Creating a two-step scraper to first extract URLs, visit them, and scrape their contents.

We can look at the HTML source code of a page to find how target elements are structured and how to select them.

Only URLs that start with a scheme such as http:// or https:// can be passed directly to requests.get(url). The others are relative URLs and must be made absolute first.
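One way to make scraped links absolute is urljoin from the standard library, which resolves each link against the URL of the page it came from. A minimal sketch: the page URL is the one used in this tutorial, while the list of links is invented for illustration.

```python
from urllib.parse import urljoin

page_url = "http://www.un.org/en/sc/documents/resolutions/2016.shtml"

# Hypothetical href values as they might appear in the page source.
links = [
    "/en/sc/about/",                # root-relative
    "2015.shtml",                   # relative to the current directory
    "http://www.un.org/en/",        # already absolute
    "https://example.org/doc.pdf",  # absolute, different host
]

absolute = [urljoin(page_url, link) for link in links]
for url in absolute:
    print(url)
```

Links that are already absolute pass through unchanged, so the same call can be applied to every extracted link without checking its form first.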
>>> import requests
>>> import lxml
>>> import cssselect
$ python unsc-scraper.py
import requests
response = requests.get('http://www.un.org/en/sc/documents/resolutions/2016.shtml')
print(response.text)
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" dir="ltr">
<head>
<meta http-equiv="X-UA-Compatible" content="IE=8" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Resolutions adopted by the United Nations Security Council since 1946</title>
...
import requests
import lxml.html

response = requests.get('http://www.un.org/en/sc/documents/resolutions/2016.shtml')
tree = lxml.html.fromstring(response.text)

title_elem = tree.xpath('//title')[0]
title_elem = tree.cssselect('title')[0]  # equivalent to previous XPath

print("title tag:", title_elem.tag)
print("title text:", title_elem.text_content())
print("title html:", lxml.html.tostring(title_elem))
print("title tag:", title_elem.tag)
print("title's parent's tag:", title_elem.getparent().tag)
title tag: title
title text: Resolutions adopted by the United Nations Security Council in 2016
title html: b'<title>Resolutions adopted by the United Nations Security Council in 2016</title> \n'
title tag: title
title's parent's tag: head