lxml -- how to change img src to absolute link

  • Last Update :
  • Techknowledgy :

Here is an example code which also covers <a href>:

from lxml
import etree, html
import urlparse

def fix_links(content, absolute_prefix):
   ""
"
Rewrite relative links to be absolute links based on certain URL.

@param content: HTML snippet as a string ""
"

if type(content) == str:
   content = content.decode("utf-8")

parser = etree.HTMLParser()

content = content.strip()

tree = html.fragment_fromstring(content, create_parent = True)

def join(base, url):
   ""
"
Join relative URL
   ""
"
if not(url.startswith("/") or "://" in url):
   return urlparse.urljoin(base, url)
else:
   # Already absolute
return url

for node in tree.xpath('//*[@src]'):
   url = node.get('src')
url = join(absolute_prefix, url)
node.set('src', url)
for node in tree.xpath('//*[@href]'):
   href = node.get('href')
url = join(absolute_prefix, href)
node.set('href', url)

data = etree.tostring(tree, pretty_print = False, encoding = "utf-8")

return data

after testing Mikko Ohtamaa's answer, here are some notes. it works for many tags and using the lxm, there are different situations, such as background-image:url(xxx). so I just use the regex to substitute. here's the solution,

content = re.sub('(?P<left>("|\'))\s*(?P<url>(\w|\.)+(/.+?)+)\s*(?P<right>("|\'))',
         '\g<left>' + url[:url.rfind('/')] + '/\g<url>\g<right>', content)
                  content = re.sub('(?P<left>("|\'))\s*(?P<url>(/.+?)+)\s*(?P<right>("|\'))',
                           '\g<left>' + url[:url.find('/', 8)] + '\g<url>\g<right>', content)

Suggestion : 2

Using lxml, how do you globally replace all anycodings_html src attributes with an absolute link? ,I'm not sure when this was added, but anycodings_html documents created from lxml.fromstring() anycodings_html now have a method called anycodings_html make_links_absolute. From the anycodings_html documentation:,If resolve_base_href is true, then any anycodings_html tag will be taken into account (just anycodings_html calling self.resolve_base_href()).,This makes all links in the document anycodings_html absolute, assuming that base_href is the anycodings_html URL of the document. So if you pass anycodings_html base_href="http://localhost/foo/bar.html" anycodings_html and there is a link to baz.html that anycodings_html will be rewritten as anycodings_html http://localhost/foo/baz.html.

Here is an example code which also anycodings_html covers <a href>:

from lxml
import etree, html
import urlparse

def fix_links(content, absolute_prefix):
   ""
"
Rewrite relative links to be absolute links based on certain URL.

@param content: HTML snippet as a string ""
"

if type(content) == str:
   content = content.decode("utf-8")

parser = etree.HTMLParser()

content = content.strip()

tree = html.fragment_fromstring(content, create_parent = True)

def join(base, url):
   ""
"
Join relative URL
   ""
"
if not(url.startswith("/") or "://" in url):
   return urlparse.urljoin(base, url)
else:
   # Already absolute
return url

for node in tree.xpath('//*[@src]'):
   url = node.get('src')
url = join(absolute_prefix, url)
node.set('src', url)
for node in tree.xpath('//*[@href]'):
   href = node.get('href')
url = join(absolute_prefix, href)
node.set('href', url)

data = etree.tostring(tree, pretty_print = False, encoding = "utf-8")

return data

after testing Mikko Ohtamaa's answer, anycodings_html here are some notes. it works for many anycodings_html tags and using the lxm, there are anycodings_html different situations, such as anycodings_html background-image:url(xxx). so I just use anycodings_html the regex to substitute. here's the anycodings_html solution,

content = re.sub('(?P<left>("|\'))\s*(?P<url>(\w|\.)+(/.+?)+)\s*(?P<right>("|\'))',
         '\g<left>' + url[:url.rfind('/')] + '/\g<url>\g<right>', content)
                  content = re.sub('(?P<left>("|\'))\s*(?P<url>(/.+?)+)\s*(?P<right>("|\'))',
                           '\g<left>' + url[:url.find('/', 8)] + '\g<url>\g<right>', content)

Suggestion : 3

This makes all links in the document absolute, assuming that base_href is the URL of the document. So if you pass base_href="http://localhost/foo/bar.html" and there is a link to baz.html that will be rewritten as http://localhost/foo/baz.html.,This rewrites all the links in the document using your given link replacement function. If you give a base_href value, all links will be passed in after they are joined with this URL.,This finds anything that looks like a link (e.g., http://example.com) in the text of an HTML document, and turns it into an anchor. It avoids making bad links.,These functions will parse html if it is a string, then return the new HTML as a string. If you pass in a document, the document will be copied (except for iterlinks()), the method performed, and the new document returned.

>>>
import lxml.html.usedoctest
>>> import lxml.html
>>> html = lxml.html.fromstring('''\
... <html>

<body onload="" color="white">
   ... <p>Hi !</p>
   ... </body>

</html>
... ''')

>>> print lxml.html.tostring(html)
<html>

<body onload="" color="white">
   <p>Hi !</p>
</body>

</html>

>>> print lxml.html.tostring(html)
<html>

<body color="white" onload="">
   <p>Hi !</p>
</body>

</html>

>>> print lxml.html.tostring(html)
<html>

<body color="white" onload="">
   <p>Hi !</p>
</body>

</html>
>>> from lxml.html import builder as E
>>> from lxml.html import usedoctest
>>> html = E.HTML(
... E.HEAD(
... E.LINK(rel="stylesheet", href="great.css", type="text/css"),
... E.TITLE("Best Page Ever")
... ),
... E.BODY(
... E.H1(E.CLASS("heading"), "Top News"),
... E.P("World News only on this page", style="font-size: 200%"),
... "Ah, and here's some more text, by the way.",
... lxml.html.fromstring("<p>... and this is a parsed fragment ...</p>")
... )
... )

>>> print lxml.html.tostring(html)
<html>

<head>
   <link href="great.css" rel="stylesheet" type="text/css">
   <title>Best Page Ever</title>
</head>

<body>
   <h1 class="heading">Top News</h1>
   <p style="font-size: 200%">World News only on this page</p>
   Ah, and here's some more text, by the way.
   <p>... and this is a parsed fragment ...</p>
</body>

</html>
>>> from lxml.html import fromstring, tostring
>>> form_page = fromstring('''<html>

<body>
   <form>
      ... Your name: <input type="text" name="name"> <br>
      ... Your phone: <input type="text" name="phone"> <br>
      ... Your favorite pets: <br>
      ... Dogs: <input type="checkbox" name="interest" value="dogs"> <br>
      ... Cats: <input type="checkbox" name="interest" value="cats"> <br>
      ... Llamas: <input type="checkbox" name="interest" value="llamas"> <br>
      ... <input type="submit"></form>
</body>

</html>''')
>>> form = form_page.forms[0]
>>> form.fields = dict(
... name='John Smith',
... phone='555-555-3949',
... interest=set(['cats', 'llamas']))
>>> print tostring(form)
<html>

<body>
   <form>
      Your name:
      <input name="name" type="text" value="John Smith">
      <br>Your phone:
      <input name="phone" type="text" value="555-555-3949">
      <br>Your favorite pets:
      <br>Dogs:
      <input name="interest" type="checkbox" value="dogs">
      <br>Cats:
      <input checked name="interest" type="checkbox" value="cats">
      <br>Llamas:
      <input checked name="interest" type="checkbox" value="llamas">
      <br>
      <input type="submit">
   </form>
</body>

</html>
>>> from lxml.html
import parse, submit_form
   >>>
   page = parse('http://tinyurl.com').getroot() >>>
   page.forms[0].fields['url'] = 'http://lxml.de/' >>>
   result = parse(submit_form(page.forms[0])).getroot() >>>
   [a.attrib['href']
      for a in result.xpath("//a[@target='_blank']")
   ]
   ['http://tinyurl.com/2xae8s', 'http://preview.tinyurl.com/2xae8s']
>>> html = '''\
... <html>
...  <head>
...    <script type="text/javascript" src="evil-site"></script>
...    <link rel="alternate" type="text/rss" src="evil-rss">
...    <style>
...      body {background-image: url(javascript:do_evil)};
...      div {color: expression(evil)};
...    </style>
...  </head>
...  <body onload="evil_function()">
...    <!-- I am interpreted for EVIL! -->
...    <a href="javascript:evil_function()">a link</a>
...    <a href="#" onclick="evil_function()">another link</a>
...    <p onclick="evil_function()">a paragraph</p>
...    <div style="display: none">secret EVIL!</div>
...    <object> of EVIL! </object>
...    <iframe src="evil-site"></iframe>
...    <form action="evil-site">
...      Password: <input type="password" name="password">
...    </form>
...    <blink>annoying EVIL!</blink>
...    <a href="evil-site">spam spam SPAM!</a>
...    <image src="evil!">
...  </body>
... </html>'''

Suggestion : 4

Post by rmcilvride » Tue Dec 04, 2018 7:54 pm , Post by Radu » Wed Dec 12, 2018 3:34 pm , Post by Radu » Tue Dec 11, 2018 1:06 pm , Post by Radu » Fri Dec 14, 2018 10:13 am

C: /Activity/Cogent / Docs / DocsOx / Source / Images /
C: /Activity/Cogent / Docs / DocsOx / Source / [document_name] /
img.src.path = $ {
   cfd
}
/../Images /
<imagedata fileref="cdh-prop-butlicenses.gif" align="center" />
img.src.path =
<imagedata fileref="../Images/cdh-prop-butlicenses.gif" align="center" />

Suggestion : 5

Creating a two-step scraper to first extract URLs, visit them, and scrape their contents,Creating a two-step scraper to first extract URLs, visit them, and scrape their contents ,We can look at the HTML source code of a page to find how target elements are structured and how to select them.,Only the URLs starting with http:// can directly be passed into requests.get(url). The others are termed relative URLs and need to be modified to become absolute.

>>>
import requests
   >>>
   import lxml >>>
   import cssselect
$ python unsc - scraper.py
import requests

response = requests.get('http://www.un.org/en/sc/documents/resolutions/2016.shtml')
print(response.text)
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" dir="ltr">

<head>
   <meta http-equiv="X-UA-Compatible" content="IE=8" />
   <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
   <title>Resolutions adopted by the United Nations Security Council since 1946</title>
   ...
import requests
import lxml.html

response = requests.get('http://www.un.org/en/sc/documents/resolutions/2016.shtml')
tree = lxml.html.fromstring(response.text)
title_elem = tree.xpath('//title')[0]
title_elem = tree.cssselect('title')[0] # equivalent to previous XPath
print("title tag:", title_elem.tag)
print("title text:", title_elem.text_content())
print("title html:", lxml.html.tostring(title_elem))
print("title tag:", title_elem.tag)
print("title's parent's tag:", title_elem.getparent().tag)
title tag: title
title text: Resolutions adopted by the United Nations Security Council in 2016
title html: b'<title>Resolutions adopted by the United Nations Security Council in 2016</title>&#13;\n'
title tag: title
title's parent's tag: head