not able to arrange results of web parsing in proper order


You could zip the two lists together, as they match in length. I use CSS selectors to isolate them: one for the colours, soup.select('p[style="width:9em;padding:5px;margin:auto;"]'), and one for the RGB values, soup.select('p[title]'). I extract the title attribute from each element in the RGB list and then regex out the required string; for the colour names I simply use the .text of the a tag children returned in the colours list.

import requests
from bs4 import BeautifulSoup as bs
import re

r = requests.get('https://en.wikipedia.org/wiki/List_of_colors_(compact)')
soup = bs(r.content, 'lxml')
p = re.compile(r'RGB \((.*)\)')  # capture the numbers inside "RGB (...)" in the title attribute
for rgb, colour in zip(soup.select('p[title]'), soup.select('p[style="width:9em;padding:5px;margin:auto;"]')):
    print(p.findall(rgb['title'])[0], colour.text)

Capture the div that wraps the two p tags, use its text as the color name, then parse the RGB values from the style attribute of the first p tag in each div, and you get the output you're looking for.

divs = soup.find_all('div', style="float:left;display:inline;font-size:90%;margin:1px 5px 1px 5px;width:11em; height:6em;text-align:center;padding:auto;")
for i in divs:
    # the swatch colour lives in the inline style, inside "rgb(...)"
    color_value = i.find('p').get('style').split('rgb(')[1].split(')')[0]
    color_value = color_value.replace(',', ' ').strip()
    print(color_value, i.text.strip())

Another way to do that - no regex!

import requests
from bs4 import BeautifulSoup

r = requests.get('https://en.wikipedia.org/wiki/List_of_colors_(compact)')
soup = BeautifulSoup(r.text, 'html.parser')
dat = []
dat2 = []
for i in soup.find_all('p'):
    if i.get('title') is not None:
        # the RGB values sit on the second line of the title attribute
        title = i.get('title').split('\n')[1].replace('RGB (', '').replace(')', '')
        dat.append(title)
    if len(i.text.strip()) > 0:
        dat2.append(i.text)
del dat2[0]  # drop the first p text, which isn't a colour name
for i, j in zip(dat, dat2):
    print(i, j)

Output:

0 72 186 Absolute Zero
176 191 26 Acid green
124 185 232 Aero

Suggestion : 2

We don't really need to provide a User-agent when scraping, so User-agent: * is what we would follow; a * means that the following rules apply to all bots (that's us). Allow gives us specific URLs we're allowed to request with bots, and vice versa for Disallow. Many times you'll see a * next to Allow or Disallow, which means you are either allowed or not allowed to scrape everything on the site. Essentially, you just want to read the rules in order, where the next rule overrides the previous rule. In the first example below we're allowed to request anything in the /pages/ subfolder (anything that starts with example.com/pages/), but we are disallowed from scraping anything from the /scripts/ subfolder. The second example, with Disallow: * followed by Allow: /pages/, means that you're not allowed to scrape anything except the /pages/ subfolder.

User-agent: *
Crawl-delay: 10
Allow: /pages/
Disallow: /scripts/

# more stuff
Disallow: *
Allow: /pages/

def save_html(html, path):
    with open(path, 'wb') as f:
        f.write(html)

save_html(r.content, 'google_com')

def open_html(path):
    with open(path, 'rb') as f:
        return f.read()

html = open_html('google_com')

import requests

url = 'https://www.allsides.com/media-bias/media-bias-ratings'
r = requests.get(url)

print(r.content[:100])
b'<!DOCTYPE html>\n<!--[if IEMobile 7]><html class="iem7"  lang="en" dir="ltr"><![endif]-->\n<!--[if lte'
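If you'd rather check those rules programmatically than by eye, Python's built-in urllib.robotparser can evaluate them for you. The sketch below feeds it rules mirroring the example robots.txt above; the example.com URLs are just placeholders for illustration.

from urllib.robotparser import RobotFileParser

# rules mirroring the example robots.txt shown above
rules = """User-agent: *
Crawl-delay: 10
Allow: /pages/
Disallow: /scripts/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/pages/report.html"))  # True - /pages/ is allowed
print(rp.can_fetch("*", "https://example.com/scripts/app.js"))     # False - /scripts/ is disallowed
print(rp.crawl_delay("*"))                                         # 10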

Suggestion : 3

Getting through a Captcha: some sites employ Captcha or similar to prevent unwanted robots (which they might consider you). This can put a damper on web scraping and slow it way down.

Websites are meant to change, and they often do. That's why, when writing a scraping script, it's best to keep this in mind. You'll want to think about which methods you'll use to find the data, and which not to use. Consider partial matching techniques rather than trying to match a whole phrase. For example, a website might change a message from "No records found" to "No records located", but if your match is on "No records", you should be okay. Also, consider whether to match on XPath, ID, name, link text, tag or class name, or CSS selector, and which is least likely to change.
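As a tiny illustration of that partial-matching advice, the sketch below treats both hypothetical wordings of the message the same way; the strings are made-up examples, not taken from any real site.

# both variants of the site's "no data" message trip the same check,
# so a small copy change won't break the scraper
for message in ("No records found", "No records located"):
    if "No records" in message:
        print(f"'{message}' -> treated as an empty result set")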

Often, if report data were to be found, it would be accessible by passing either form variables or parameters with the URL. For example:

https://www.myreportdata.com?month=12&year=2004&clientid=24823
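If you'd rather build that query string from a dictionary than hard-code it, requests can assemble it for you. This is just a sketch; the host and parameter names are the placeholders from the example URL above.

import requests

# build the report URL from named parameters (placeholder host from the example above)
params = {"month": 12, "year": 2004, "clientid": 24823}
prepared = requests.Request("GET", "https://www.myreportdata.com", params=params).prepare()
print(prepared.url)  # -> https://www.myreportdata.com/?month=12&year=2004&clientid=24823
# requests.get("https://www.myreportdata.com", params=params) would actually send the request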

I was able to start up Chrome in the script by adding the library components I needed, then issuing a couple of simple commands:

# Load selenium components
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# Establish chrome driver and go to report site URL
url = "https://reportdata.mytestsite.com/transactionSearch.jsp"
driver = webdriver.Chrome()
driver.get(url)

By examining the form in developer tools (F12), I noticed that the form was presented within an iframe. So, before I could start filling in the form, I needed to “switch” to the proper iframe where the form existed. To do this, I invoked Selenium’s switch-to feature, like so:

# Switch to iframe where form is
frame_ref = driver.find_elements_by_tag_name("iframe")[0]
iframe = driver.switch_to.frame(frame_ref)

Then, armed with this information, I found the element on the page, then clicked it.

# Find the 'Find' button, then click it
driver.find_element_by_xpath("/html/body/table/tbody/tr[2]/td[1]/table[3]/tbody/tr[2]/td[2]/input").click()

Thus, it was necessary to find any plus signs on the page, gather the URL next to each one, then loop through each to get all data for every transaction.

# Loop through transactions and count
links = driver.find_elements_by_tag_name('a')
link_urls = [link.get_attribute('href') for link in links]
thisCount = 0
isFirst = 1
for url in link_urls:
    if (url.find("GetXas.do?processId") >= 0):  # URL to link to transactions
        if isFirst == 1:  # already expanded +
            isFirst = 0
        else:
            driver.get(url)  # collapsed +, so expand
        # Find closest element to URL element with correct class to get tran type
        tran_type = driver.find_element_by_xpath("//*[contains(@href,'/retail/transaction/results/GetXas.do?processId=-1')]/following::td[@class='txt_75b_lmnw_T1R10B1']").text
        # Get transaction status
        status = driver.find_element_by_class_name('txt_70b_lmnw_t1r10b1').text
        # Add to count if transaction found
        if (tran_type in ['Move In', 'Move Out', 'Switch']) and (status == "Complete"):
            thisCount += 1

Suggestion : 4

Even if no syntax or runtime errors appear when running our program, there still might be semantic errors. Check whether the data actually gets assigned to the right object and moved into the array correctly.

Once a satisfactory web scraper is running, you no longer need to watch the browser perform its actions. Get headless versions of either Chrome or Firefox browsers and use those to reduce load times (see the sketch after the code below).

Running our program now should display no errors and show the acquired data in the debugger window. While "print" is great for testing purposes, it isn't all that great for parsing and analyzing data.

For the purposes of this tutorial, we will try something slightly different. Since acquiring data from the same class would just mean appending to an additional list, we should attempt to extract data from a different class but, at the same time, maintain the structure of our table.

pip install requests

import requests
response = requests.get("https://oxylabs.io/")
print(response.text)

form_data = {
    'key1': 'value1',
    'key2': 'value2'
}
response = requests.post("https://oxylabs.io/", data=form_data)
print(response.text)

proxies = {
    'http': 'http://user:password@proxy.oxylabs.io'
}
response = requests.get('http://httpbin.org/ip', proxies=proxies)
print(response.text)

import requests
url = 'https://oxylabs.io/blog'
response = requests.get(url)

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title)
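To act on the headless-browser tip above, here is a minimal sketch using Selenium with Chrome in headless mode; it assumes Selenium 4 and a working Chrome driver, and reuses the blog URL from the example above.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
driver.get("https://oxylabs.io/blog")
print(driver.title)  # quick sanity check that the page loaded
driver.quit()

Headless mode skips rendering a visible window, which usually shortens start-up and page-load times once the scraper no longer needs watching.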