How to get the href from an &lt;a&gt; tag which contains JavaScript, using Python?


All PDFs up to this part have the same base path:

https://www3.colonialfirststate.com.au/content/dam/prospects/

After that, the rest of the path is generated from the fileID:

fs / 2 / 0 / fs2065.pdf ? 3
|    |   |   |          |
|    |   |   |          +--- Not needed (but you can keep it if you want)
|    |   |   +-------------- File name
|    |   +------------------ 4th character in the file name
|    +---------------------- 3rd character in the file name
+--------------------------- First two characters in the file name

We can use this as a workaround to get the exact url.

url = "javascript:GoPDF('FS2311')"  # javascript URL

pdfFileId = url[18:-2].lower()  # extracts the file name from the JavaScript URL

pdfBaseUrl = "https://www3.colonialfirststate.com.au/content/dam/prospects/%s/%s/%s/%s.pdf?3" % (pdfFileId[:2], pdfFileId[2], pdfFileId[3], pdfFileId)

print(pdfBaseUrl)
# prints https://www3.colonialfirststate.com.au/content/dam/prospects/fs/2/3/fs2311.pdf?3

The following code looks more complicated at first sight but delegates the tricky (and potentially brittle) URL stuff to urllib. It also uses more robust and flexible methods to extract the GoPDF fileId and the invariant part of the url.

from urllib.parse import urlparse, urlunparse

def build_pdf_url(model_url, js_href):
    url = urlparse(model_url)
    pdf_fileid = get_fileid_from_js_href(js_href)
    pdf_path = build_pdf_path(model_url, pdf_fileid)
    return urlunparse((url.scheme, url.netloc, pdf_path, url.params,
                       url.query, url.fragment))

def get_fileid_from_js_href(href):
    """extract fileid by extracting text between single quotes"""
    return href.split("'")[1].lower()

def build_pdf_path(url, pdf_fileid):
    prefix = pdf_fileid[:2]
    major_version = pdf_fileid[2]
    minor_version = pdf_fileid[3]
    filename = pdf_fileid + '.pdf'
    return '/'.join([invariant_path(url), prefix, major_version, minor_version, filename])

def invariant_path(url, dropped_components=4):
    """
    return all but the dropped components of the URL 'path'
    NOTE: path components are separated by '/'
    """
    path_components = urlparse(url).path.split('/')
    return '/'.join(path_components[:-dropped_components])

js_href = "javascript:GoPDF('FS1546')"
model_url = "https://www3.colonialfirststate.com.au/content/dam/prospects/fs/2/3/fs2311.pdf?3"
print(build_pdf_url(model_url, js_href))

$ python urlbuild.py
https://www3.colonialfirststate.com.au/content/dam/prospects/fs/1/5/fs1546.pdf?3

Suggestion : 2

The href entries are always present within the anchor (&lt;a&gt;) tag, so the first task is to find all the &lt;a&gt; tags on the webpage. The &lt;a&gt; tags of HTML are used to create anchor texts, and the URL of the webpage to be opened is specified in the href attribute. Soup represents the parsed file, and the method soup.find_all() gives back all the tags and strings that match the criteria. There are two ways to find all the anchor tags or href entries on the webpage, shown below.

To install requests on your system, open your terminal window and enter the below command:

pip install requests

To install Beautiful Soup in your system, open your terminal window and enter the below command:

pip install bs4

After installing, import both libraries into the program:

import requests
from bs4 import BeautifulSoup
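The output below was presumably produced by fetching the Wikipedia home page with requests and printing every tag returned by soup.find_all('a'). A minimal, self-contained sketch of that step — the inline snippet stands in for the live page so the example runs offline, and the URL in the comment is an assumption:

```python
from bs4 import BeautifulSoup
# Live version (network): import requests; html = requests.get("https://www.wikipedia.org/").text

# Inline snippet standing in for the fetched page
html = '''<a class="link-box" href="//en.wikipedia.org/" id="js-link-box-en">
<strong>English</strong></a>
<a href="https://meta.wikimedia.org/wiki/Privacy_policy">Privacy Policy</a>'''

soup = BeautifulSoup(html, 'html.parser')

# find_all('a') returns every anchor tag in document order
for a in soup.find_all('a'):
    print(a)
```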

Output:

<a class="link-box" data-slogan="The Free Encyclopedia" href="//en.wikipedia.org/" id="js-link-box-en" title="English — Wikipedia — The Free Encyclopedia">
   <strong>English</strong>
   <small><bdi dir="ltr">6 383 000+</bdi> <span>articles</span></small>
</a>
...


<a href="https://creativecommons.org/licenses/by-sa/3.0/">Creative Commons Attribution-ShareAlike License</a>
<a href="https://meta.wikimedia.org/wiki/Terms_of_use">Terms of Use</a>
<a href="https://meta.wikimedia.org/wiki/Privacy_policy">Privacy Policy</a>

We can also use the SoupStrainer class. To use it, we have to first import it into the program using the below command.

from bs4 import SoupStrainer
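SoupStrainer tells the parser to build the soup from only the parts of the document you care about, which saves time and memory on large pages. A small sketch, assuming we only want &lt;a&gt; tags — the HTML snippet is made up for illustration:

```python
from bs4 import BeautifulSoup, SoupStrainer

html = '''<p>Some intro text.</p>
<a href="https://example.com/first">first</a>
<span><a href="https://example.com/second">second</a></span>'''

# Parse only <a> tags; everything else is discarded during parsing
only_a_tags = SoupStrainer('a')
soup = BeautifulSoup(html, 'html.parser', parse_only=only_a_tags)

hrefs = [a.get('href') for a in soup.find_all('a')]
print(hrefs)
```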

Suggestion : 3

Last Updated : 07 Jan, 2021

  • bs4: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. This module does not come built-in with Python. To install it, type the below command in the terminal.
pip install bs4
  • requests: Requests allows you to send HTTP/1.1 requests extremely easily. This module also does not come built-in with Python. To install it, type the below command in the terminal.
pip install requests
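With both modules installed, the idea from Suggestion 1 can be combined with Beautiful Soup: collect every href that starts with javascript:GoPDF and rewrite it as a direct PDF URL. A sketch, using an inline HTML snippet in place of requests.get(page_url).text — the snippet and link texts are made up:

```python
from bs4 import BeautifulSoup

# Inline stand-in for the fetched page (in practice: requests.get(page_url).text)
html = '''<a href="javascript:GoPDF('FS2311')">FirstChoice brochure</a>
<a href="javascript:GoPDF('FS1546')">Another brochure</a>
<a href="/contact">Contact</a>'''

soup = BeautifulSoup(html, 'html.parser')

base = "https://www3.colonialfirststate.com.au/content/dam/prospects"
pdf_urls = []
for a in soup.find_all('a', href=True):
    href = a['href']
    if href.startswith("javascript:GoPDF"):
        file_id = href.split("'")[1].lower()  # e.g. 'fs2311'
        # path is fs / 2 / 3 / fs2311.pdf?3, derived from the file id
        pdf_urls.append("%s/%s/%s/%s/%s.pdf?3" % (base, file_id[:2], file_id[2], file_id[3], file_id))

print(pdf_urls)
```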

Suggestion : 4

Post date: April 16, 2022. © 2022 The Web Dev

For instance, we write

from bs4 import BeautifulSoup

html = '''<a href="some_url">next</a>
<span class="class"><a href="another_url">later</a></span>'''

soup = BeautifulSoup(html, 'html.parser')

# href=True keeps only <a> tags that actually have an href attribute
for a in soup.find_all('a', href=True):
    print(a['href'])