All PDFs up to this point share the same base path:
https://www3.colonialfirststate.com.au/content/dam/prospects/
After that, the rest of the path is generated from the file ID:
fs/2/0/fs2065.pdf?3
|  | |  |         |
|  | |  |         +--- Not needed (but you can keep it if you want)
|  | |  +------------- File name
|  | +---------------- 4th character in the file name
|  +------------------ 3rd character in the file name
+--------------------- First two characters in the file name
We can use this as a workaround to get the exact url.
url = "javascript:GoPDF('FS2311')"  # javascript URL
pdfFileId = url[18:-2].lower()  # extracts the file name from the javascript URL
pdfBaseUrl = "https://www3.colonialfirststate.com.au/content/dam/prospects/%s/%s/%s/%s.pdf?3" % (pdfFileId[:2], pdfFileId[2], pdfFileId[3], pdfFileId)
print(pdfBaseUrl)
# prints https://www3.colonialfirststate.com.au/content/dam/prospects/fs/2/3/fs2311.pdf?3
The following code looks more complicated at first sight, but it delegates the tricky (and potentially brittle) URL handling to urllib. It also uses more robust and flexible methods to extract the GoPDF file ID and the invariant part of the URL.
from urllib.parse import urlparse, urlunparse

def build_pdf_url(model_url, js_href):
    url = urlparse(model_url)
    pdf_fileid = get_fileid_from_js_href(js_href)
    pdf_path = build_pdf_path(model_url, pdf_fileid)
    return urlunparse((url.scheme, url.netloc, pdf_path, url.params, url.query, url.fragment))

def get_fileid_from_js_href(href):
    """Extract the fileid by extracting the text between single quotes."""
    return href.split("'")[1].lower()

def build_pdf_path(url, pdf_fileid):
    prefix = pdf_fileid[:2]
    major_version = pdf_fileid[2]
    minor_version = pdf_fileid[3]
    filename = pdf_fileid + '.pdf'
    return '/'.join([invariant_path(url), prefix, major_version, minor_version, filename])

def invariant_path(url, dropped_components=4):
    """Return all but the dropped components of the URL 'path'.

    NOTE: path components are separated by '/'.
    """
    path_components = urlparse(url).path.split('/')
    return '/'.join(path_components[:-dropped_components])

js_href = "javascript:GoPDF('FS1546')"
model_url = "https://www3.colonialfirststate.com.au/content/dam/prospects/fs/2/3/fs2311.pdf?3"
print(build_pdf_url(model_url, js_href))

$ python urlbuild.py
https://www3.colonialfirststate.com.au/content/dam/prospects/fs/1/5/fs1546.pdf?3
The anchor tags, or the <a> tags of HTML, are used to create anchor texts, and the URL of the webpage to be opened is specified in the href attribute. The href entries are always present within the anchor tag, so the first task is to find all the <a> tags within the webpage. There are two ways to do this: with soup.find_all(), where soup represents the parsed file and the method returns all the tags and strings that match the criteria, or with the SoupStrainer class.
To install requests on your system, open your terminal window and enter the below command:
pip install requests
To install Beautiful Soup in your system, open your terminal window and enter the below command:
pip install bs4
import requests
from bs4 import BeautifulSoup
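The scraping step that produces the output below is a minimal sketch along these lines; the Wikipedia portal URL is an assumption inferred from the links in the sample output.

```python
import requests
from bs4 import BeautifulSoup

# Fetch the page; https://www.wikipedia.org/ is assumed from the
# en.wikipedia.org links visible in the sample output below.
page = requests.get('https://www.wikipedia.org/')
soup = BeautifulSoup(page.text, 'html.parser')

# find_all('a') returns every anchor tag in the parsed document
for tag in soup.find_all('a'):
    print(tag)
```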
Output:
<a class="link-box" data-slogan="The Free Encyclopedia" href="//en.wikipedia.org/" id="js-link-box-en" title="English — Wikipedia — The Free Encyclopedia">
<strong>English</strong>
<small><bdi dir="ltr">6 383 000+</bdi> <span>articles</span></small>
</a>
.
.
.
<a href="https://creativecommons.org/licenses/by-sa/3.0/">Creative Commons Attribution-ShareAlike License</a>
<a href="https://meta.wikimedia.org/wiki/Terms_of_use">Terms of Use</a>
<a href="https://meta.wikimedia.org/wiki/Privacy_policy">Privacy Policy</a>
We can also use the SoupStrainer class. To use it, we first have to import it into the program using the below command.
from bs4 import SoupStrainer
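For example, a SoupStrainer can restrict parsing to only the anchor tags, which saves work on large pages. This is a minimal sketch with made-up HTML:

```python
from bs4 import BeautifulSoup, SoupStrainer

html = '<p>intro</p><a href="https://example.com">link</a><p>outro</p>'

# Parse only the <a> tags; all other tags are ignored entirely
only_a_tags = SoupStrainer('a')
soup = BeautifulSoup(html, 'html.parser', parse_only=only_a_tags)

for a in soup.find_all('a', href=True):
    print(a['href'])  # prints https://example.com
```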
Last Updated: 07 Jan, 2021
- bs4: Beautiful Soup(bs4) is a Python library for pulling data out of HTML and XML files. This module does not come built-in with Python. To install this type the below command in the terminal.
pip install bs4
- requests: Requests allows you to send HTTP/1.1 requests extremely easily. This module also does not come built-in with Python. To install it, type the below command in the terminal.
pip install requests
Post date: April 16, 2022. © 2022 The Web Dev
For instance, we write
from bs4 import BeautifulSoup  # Beautiful Soup 4; `from BeautifulSoup import BeautifulSoup` is the old BS3 import
html = '''<a href="some_url">next</a>
<span class="class"><a href="another_url">later</a></span>'''
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a', href=True):
print(a['href'])