Any way to scrape a link that redirects?


requests follows HTTP redirects automatically and keeps every intermediate response in r.history, so both the original and the final URL are available:

import requests

r = requests.get('http://bit.ly/english-4-it')

print(r.history)
print(r.url)

Result:

[<Response [301]>, <Response [301]>]
http://helion.pl/ksiazki/english-4-it-praktyczny-kurs-jezyka-angielskiego-dla-specjalistow-it-i-nie-tylko-beata-blaszczyk,anginf.htm
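If you would rather see each hop yourself, you can ask requests not to follow redirects and read the Location header manually. A minimal sketch using the same bit.ly link; requests.compat.urljoin resolves relative Location values:

import requests

url = 'http://bit.ly/english-4-it'
for _ in range(10):  # cap the number of hops to avoid a redirect loop
    r = requests.get(url, allow_redirects=False)
    if r.status_code not in (301, 302, 303, 307, 308):
        break
    # The Location header may be relative, so resolve it against the current URL
    url = requests.compat.urljoin(url, r.headers['Location'])
    print('redirected to:', url)

print('final url:', url)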

Suggestion : 2

I'm trying to scrape "Find Flights" from https://www.aa.com/homePage.do using the requests module in Python. I tried plugging the initial URL and form data into requests, but it doesn't work, seemingly because the site redirects to a URL with a StateID and some other params. I can't figure out how to capture the redirect or the new URL: r.url, r.headers, and r.history don't have useful information. I'm not presently concerned with parsing the resulting data, just getting it.

One suggested answer: I can recommend Selenium WebDriver. It interacts with a website as if you were a real user, so it can easily fill in form data and follow the pages that come after; a sketch follows the code below.

Here's the code:

import requests

form_data = 'currentCalForm=dep&currentCodeForm=&tripType=oneWay&searchCategory=award&originAirport=JFK&flightParams.flightDateParams.travelMonth=5&flightParams.flightDateParams.travelDay=14&flightParams.flightDateParams.searchTime=040001&destinationAirport=LHR&returnDate.travelMonth=-1000&returnDate.travelDay=-1000&adultPassengerCount=2&adultPassengerCount=1&serviceclass=coach&searchTypeMode=matrix&awardDatesFlexible=true&originAlternateAirportDistance=0&destinationAlternateAirportDistance=0&discountCode=&flightSearch=award&dateChanged=false&fromSearchPage=true&advancedSearchOpened=false&numberOfFlightsToDisplay=10&searchCategory=&aairpassSearchType=false&moreOptionsIndicator=oneWay&seniorPassengerCount=0&youngAdultPassengerCount=0&childPassengerCount=0&infantPassengerCount=0&passengerCount=2'.split('&')

payload = {}
for item in form_data:
    key, value = item.split('=')
    if value:
        payload[key] = value

with requests.Session() as s:
    # Form data belongs in the request body (data=), not the query string (params=)
    r = s.post('https://www.aa.com/homePage.do', data=payload, allow_redirects=True)

print(r.headers)
print(r.history)
print(r.url)
print(r.status_code)

with open('x.htm', 'wb') as f:
    f.write(r.text.encode('utf8'))
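Following the Selenium recommendation above, here is a minimal sketch of the same search with Selenium WebDriver. The field names ('originAirport', 'destinationAirport') are assumptions lifted from the form data above, not verified against the live page, and chromedriver is assumed to be on the PATH:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes chromedriver is on the PATH
driver.get('https://www.aa.com/homePage.do')

# Hypothetical element names, taken from the form data above; the real
# page's fields may differ.
driver.find_element(By.NAME, 'originAirport').send_keys('JFK')
driver.find_element(By.NAME, 'destinationAirport').send_keys('LHR')
driver.find_element(By.NAME, 'originAirport').submit()  # submits the enclosing form

# The browser follows the redirects itself, so this is the final URL,
# StateID and all
print(driver.current_url)

driver.quit()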

Suggestion : 3

Matt Clarke, Wednesday, February 02, 2022

import requests
import pandas as pd

urls = [
    'https://bbc.co.uk/iplayer',
    'https://facebook.com/',
    'http://www.theguardian.co.uk',
    'https://practicaldatascience.co.uk'
]

df_output = pd.DataFrame(columns=['original_url', 'original_status', 'destination_url', 'destination_status'])

for url in urls:
    response = requests.get(url, headers={'User-Agent': 'Google Chrome'})
    row = {}

    if response.history:
        # Each entry in response.history is one redirect hop
        for step in response.history:
            row['original_url'] = step.url
            row['original_status'] = step.status_code
        row['destination_url'] = response.url
        row['destination_status'] = response.status_code
    else:
        row['original_url'] = response.url
        row['original_status'] = response.status_code
        row['destination_url'] = ''
        row['destination_status'] = ''

    print(row)

    df_output = df_output.append(row, ignore_index=True)
Result:

{'original_url': 'https://bbc.co.uk/iplayer', 'original_status': 301, 'destination_url': 'https://www.bbc.co.uk/iplayer', 'destination_status': 200}
{'original_url': 'https://facebook.com/', 'original_status': 301, 'destination_url': 'https://www.facebook.com/', 'destination_status': 200}
{'original_url': 'https://www.theguardian.com/', 'original_status': 302, 'destination_url': 'https://www.theguardian.com/uk', 'destination_status': 200}
{'original_url': 'https://practicaldatascience.co.uk/', 'original_status': 200, 'destination_url': '', 'destination_status': ''}
df_output.head()
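Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a current install the same loop can collect plain dicts and build the frame once at the end. A minimal sketch, reusing the urls list from above:

import requests
import pandas as pd

rows = []
for url in urls:
    response = requests.get(url, headers={'User-Agent': 'Google Chrome'})
    if response.history:
        last_hop = response.history[-1]  # the hop the loop above ends on
        rows.append({'original_url': last_hop.url, 'original_status': last_hop.status_code,
                     'destination_url': response.url, 'destination_status': response.status_code})
    else:
        rows.append({'original_url': response.url, 'original_status': response.status_code,
                     'destination_url': '', 'destination_status': ''})

df_output = pd.DataFrame(rows)  # one construction instead of repeated append()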
import requests
import pandas as pd
from ecommercetools import seo

# 'example.com' is a placeholder; point this at your own sitemap
df = seo.get_sitemap('https://www.example.com/sitemap.xml')
urls = df['loc'].tolist()

df_output = pd.DataFrame(columns=['original_url', 'original_status', 'destination_url', 'destination_status'])

for url in urls:
    response = requests.get(url, headers={'User-Agent': 'Google Chrome'})
    row = {}

    if response.history:
        for step in response.history:
            row['original_url'] = step.url
            row['original_status'] = step.status_code
        row['destination_url'] = response.url
        row['destination_status'] = response.status_code
    else:
        row['original_url'] = response.url
        row['original_status'] = response.status_code
        row['destination_url'] = ''
        row['destination_status'] = ''

    print(row)

    df_output = df_output.append(row, ignore_index=True)

df_output

Suggestion : 4

Published on June 7, 2022

The first few cells of the ipynb file should include the import statements for the libraries required to carry out the tasks. In this article the custom web scraper is built using Beautiful Soup, and the imports for it are shown below.

from bs4 import BeautifulSoup
import requests, re

Once the required libraries are imported, a user-defined function is created that sends a request for the web page and stores the response in a variable; only the text of the granted response is then read from that variable. The function is shown below.

def original_htmldoc(url):
    response = requests.get(url)  # the built-in get function sends an access request to the url
    return response.text  # response.text holds the text of the response
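As a quick check that the helper works, its return value can be fed straight into Beautiful Soup; the URL here is just a placeholder:

html = original_htmldoc('https://www.example.com')  # placeholder URL
soup = BeautifulSoup(html, 'html.parser')
print(soup.title)  # e.g. show the page's <title> tag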

Once the custom Python file is available, an ipynb file is created in the same working directory. First, Google Drive is mounted into the Colab environment so that the path to the directory containing the Python (.py) file can be reached, as shown below.

from google.colab import drive
drive.mount('/content/drive')
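With the drive mounted, one way to make the custom Python file importable is to append its directory to sys.path; the folder and module names below are hypothetical placeholders:

import sys

# Hypothetical path to the Drive folder holding the custom .py file
sys.path.append('/content/drive/MyDrive/scraper_project')

import scraper  # hypothetical module name for the custom Python file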