Scraping Dynamic Content Quickly with Python


In this chapter, let us learn how to perform web scraping on dynamic websites and the concepts involved in detail.

Web scraping is a complex task, and the complexity multiplies if the website is dynamic. According to the United Nations Global Audit of Web Accessibility, more than 70% of websites are dynamic in nature and rely on JavaScript for their functionality.

We have seen that a scraper cannot extract information from a dynamic website, because the data is loaded dynamically with JavaScript. In such cases, we can use the following two techniques for scraping data from dynamic, JavaScript-dependent websites: reverse engineering the site's AJAX requests, or rendering the JavaScript in a real browser with a tool such as Selenium.

Let us look at an example of a dynamic website and understand why it is difficult to scrape. Here we will take the example of searching on a website named http://example.webscraping.com/places/default/search. But how can we tell that this website is dynamic in nature? It can be judged from the output of the following Python script, which tries to scrape data from the above-mentioned webpage −

import re
import urllib.request

response = urllib.request.urlopen('http://example.webscraping.com/places/default/search')
html = response.read()
text = html.decode()
# The pattern below targets the search-result markup; the exact tag and
# class name are assumptions about the page's HTML.
re.findall('<div class="results">(.*?)</div>', text)

Output

[]

The search results are empty because they are loaded by JavaScript after the page arrives. To see where this data comes from, we open the browser's developer tools (the Inspect option) for the specified URL. Next, we click the NETWORK tab to find all the requests made for that web page, including search.json with a path of /ajax. Instead of accessing the AJAX data from the browser via the NETWORK tab, we can also fetch it with the help of the following Python script −

import requests

response = requests.get('http://example.webscraping.com/ajax/search.json?page=0&page_size=10&search_term=a')
response.json()
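Since the endpoint takes page, page_size, and search_term parameters, the same request can be repeated to walk every page of results. Below is a minimal sketch of that pagination logic; `fetch_all` and `get_page` are hypothetical helper names, and the `'records'` key in the commented fetcher is an assumption about the JSON layout:

```python
def fetch_all(get_page, page_size=10):
    """Collect records from a paginated endpoint, page by page.

    get_page(page, page_size) returns the list of records for one page;
    iteration stops at the first page that comes back short.
    """
    results = []
    page = 0
    while True:
        records = get_page(page, page_size)
        results.extend(records)
        if len(records) < page_size:
            break
        page += 1
    return results

# A page fetcher for the endpoint above might look like this (the
# 'records' key is an assumption about the JSON layout):
# import requests
# def get_page(page, page_size):
#     resp = requests.get('http://example.webscraping.com/ajax/search.json',
#                         params={'page': page, 'page_size': page_size,
#                                 'search_term': 'a'})
#     return resp.json().get('records', [])
```

Keeping the HTTP call behind a small function like this also makes the pagination loop testable without touching the network.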

First, we need to import webdriver from selenium as follows −

from selenium import webdriver

Now, provide the path of web driver which we have downloaded as per our requirement −

path = r'C:\Users\gaurav\Desktop\Chromedriver'
driver = webdriver.Chrome(executable_path=path)  # executable_path is deprecated in Selenium 4; pass a Service object instead

Suggestion : 2

Last Updated : 05 Sep, 2020

1) Selenium bindings in python

pip install selenium

2) Web drivers
Selenium requires a web driver to interface with the chosen browser. A web driver is a package that interacts with the web browser or a remote web server through a wire protocol common to all browsers. You can check out and install the web driver for the browser of your choice:

Chrome: https://sites.google.com/a/chromium.org/chromedriver/downloads
Firefox: https://github.com/mozilla/geckodriver/releases
Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10/

To use Beautiful Soup, we have this handy Python binding for it:
1) BS4 bindings in python

pip install bs4

Suggestion : 3
pip install -U selenium
pip install chromedriver-install
import chromedriver_install as cdi
path = cdi.install(file_directory='c:\\data\\chromedriver\\', verbose=True, chmod=True, overwrite=False, version=None)
print('Installed chromedriver to path: %s' % path)
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome("c:\\data\\chromedriver\\chromedriver.exe")
driver.get("http://www.python.org")
elem = driver.find_element_by_name("q")  # in Selenium 4, use driver.find_element(By.NAME, "q")
elem.clear()
elem.send_keys("pycon")
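From here the search can be submitted and the rendered page read back via driver.page_source. The helper below is a hypothetical illustration that pulls link texts out of rendered HTML with a regex; the Selenium calls are left commented because they need a live browser:

```python
import re

def extract_link_texts(page_source):
    # Crude regex extraction of anchor texts from rendered HTML.
    # For real scraping, parse page_source with Beautiful Soup instead.
    return re.findall(r'<a[^>]*>(.*?)</a>', page_source, re.DOTALL)

# Continuing the session above (requires a live browser):
# elem.send_keys(Keys.RETURN)                 # submit the search
# texts = extract_link_texts(driver.page_source)
# driver.quit()
```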

Suggestion : 4

Beautiful Soup is a popular Python module that parses a downloaded web page into a certain format and then provides a convenient interface to navigate the content. The official documentation of Beautiful Soup can be found here. The latest version of the module can be installed using this command: pip install beautifulsoup4.

Beautiful Soup with the lxml parser can correctly interpret missing attribute quotes and closing tags, as well as add the <html> and <body> tags to form a complete HTML document. As with Beautiful Soup, the first step with lxml itself is parsing the potentially invalid HTML into a consistent format.

We use this static student profile webpage to provide examples for each approach. Suppose that we want to scrape a student name. The data we are interested in is found in the following part of the HTML. The student name is included within a <td> element of class="w2p_fw", which is the child of a <tr> element of ID students_name_row.

<table>
   <tr id="students_name_row">
      <td class="w2p_fl"><label for="students_name" id="students_name_label">Name:</label></td>
      <td class="w2p_fw">Adams</td>
      <td class="w2p_fc"></td>
   </tr>
   <tr id="students_school_row">
      <td class="w2p_fl"><label for="students_school" id="students_school_label">School:</label></td>
      <td class="w2p_fw">IV</td>
      <td class="w2p_fc"></td>
   </tr>
   <tr id="students_level_row">
      <td class="w2p_fl"><label for="students_level" id="students_level_label">Advanced:</label></td>
      <td class="w2p_fw">No</td>
      <td class="w2p_fc"></td>
   </tr>
</table>
import re
import requests

url = 'https://iqssdss2020.pythonanywhere.com/tutorial/static/views/Adams.html'
html = requests.get(url)
mylist = re.findall('<td class="w2p_fw">(.*?)</td>', html.text)
print(mylist)

name = re.findall('<td class="w2p_fw">(.*?)</td>', html.text)[0]
print(name)
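The same value can also be extracted without a regex, by navigating the structure with Beautiful Soup using the row ID and cell class described above. A sketch, where `student_name` is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

def student_name(html_text):
    # Find the row by its ID, then the value cell by its class.
    soup = BeautifulSoup(html_text, 'html.parser')
    row = soup.find('tr', id='students_name_row')
    return row.find('td', class_='w2p_fw').get_text()
```

Unlike the regex approach, this does not depend on attribute order or whitespace in the page.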
# Brittle: any change in whitespace or attribute order in the page
# breaks this literal pattern.
mylist = re.findall('<tr id="students_name_row">'
                    '<td class="w2p_fl"><label for="students_name" '
                    'id="students_name_label">Name:</label></td>'
                    '<td class="w2p_fw">(.*?)</td>', html.text)

The first step with Beautiful Soup is to parse the downloaded HTML into a “soup document”. Beautiful Soup supports several different parsers. Parsers behave differently when parsing web pages that do not contain perfectly valid HTML. For example, consider this HTML syntax of a table entry with missing attribute quotes and closing tags for the table row and table fields:

<tr id=students_school_row>
   <td class=w2p_fl>
      <label for="students_school" id="students_school_label">
         School:
      </label>
   <td class=w2p_fw>IV
from bs4 import BeautifulSoup

broken_html = ('<tr id=students_school_row>'
               '<td class=w2p_fl><label for="students_school" '
               'id="students_school_label">School:</label>'
               '<td class=w2p_fw>IV')
soup = BeautifulSoup(broken_html, 'lxml')
fixed_html = soup.prettify()
print(fixed_html)
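For comparison, the same broken fragment can be parsed with lxml directly (assuming lxml is installed); its HTML parser likewise quotes the attributes and closes the open tags when the tree is serialized back out:

```python
import lxml.html

broken_html = ('<tr id=students_school_row>'
               '<td class=w2p_fl><label for="students_school" '
               'id="students_school_label">School:</label>'
               '<td class=w2p_fw>IV')
tree = lxml.html.fromstring(broken_html)
# Serializing the parsed tree yields corrected, well-formed markup.
fixed = lxml.html.tostring(tree, pretty_print=True).decode()
print(fixed)
```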
A more robust version anchors only on the row ID and skips the intermediate markup with a non-greedy wildcard (re.DOTALL lets .*? cross line breaks):

mylist = re.findall('<tr id="students_name_row">.*?<td\\s*class=["\']w2p_fw["\']>(.*?)</td>',
                    html.text, re.DOTALL)

Alternatively, Beautiful Soup can fall back on Python's built-in parser, which requires no extra installation:

soup = BeautifulSoup(broken_html, 'html.parser')