Scrape websites using Scrapy


Scrapy gives you a complete workflow for web scraping in Python: you create a spider, run it, and save the scraped data, all within one framework. At first it looks quite confusing, but it's for the best.

A scraper built around the BeautifulSoup module downloads and parses pages one at a time. One can work around this by combining multithreading or multiprocessing with BeautifulSoup to build a spider that crawls a website and extracts data, but to save that time and effort one can use Scrapy instead.

With the help of Scrapy one can:

1. Fetch large amounts of data efficiently
2. Run spiders on a server
3. Fetch data asynchronously
4. Run spiders in multiple processes (a minimal sketch follows this list)
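
Here is a minimal sketch of driving a crawl programmatically with scrapy.crawler.CrawlerProcess; the spider class, selector, and URL are illustrative, and note that CrawlerProcess runs its spiders inside a single process, so genuinely separate OS processes are usually achieved by launching separate scrapy commands:

import scrapy
from scrapy.crawler import CrawlerProcess

# Illustrative spider: pulls quote text from the public practice
# site quotes.toscrape.com.
class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}

process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
process.crawl(QuotesSpider)  # several crawl() calls can be queued here
process.start()              # blocks until all queued crawls finish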

It is good to create a virtual environment, as it isolates the program and doesn't affect any other programs present on the machine. To create a virtual environment, first install the venv package:

sudo apt-get install python3-venv

Create a project folder and create a virtual environment inside it:

mkdir scrapy-project && cd scrapy-project
python3 -m venv myvenv

After creating the virtual environment, activate it with:

source myvenv/bin/activate

Install Scrapy with:

pip install scrapy
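
With Scrapy installed, you can scaffold a project and generate a spider skeleton using the built-in commands; the project and spider names below are just examples:

scrapy startproject myproject
cd myproject
scrapy genspider example example.com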
Inside the Scrapy shell (or a spider's parse method), CSS selectors pull elements out of the response. response.css('a') selects every anchor tag on the page, and calling .extract() on the selection returns the matching tags as a list of strings:

links = response.css('a').extract()

For example, links[0] will show something like this:

'<a href="https://www.geeksforgeeks.org/" title="GeeksforGeeks" rel="home">GeeksforGeeks</a>'

To extract only the href attributes rather than the full tags, use the ::attr() pseudo-element:

links = response.css('a::attr(href)').extract()
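
In recent Scrapy versions, .get() and .getall() are the preferred aliases for .extract_first() and .extract(); the following is equivalent:

links = response.css('a::attr(href)').getall()
first_link = response.css('a::attr(href)').get()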

Suggestion : 2

To install Scrapy using conda, run:

conda install -c conda-forge scrapy

Alternatively, if you’re on Linux or macOS, you can install Scrapy directly with pip:

pip install scrapy

I love the Python shell; it helps me try things out before implementing them in detail. Similarly, Scrapy provides a shell of its own that you can use to experiment. To start the Scrapy shell, type the following on your command line:

scrapy shell
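
You can also hand the shell a URL to download immediately, or fetch one from inside a running shell session (the URL below is just an example):

scrapy shell 'https://www.zyte.com/blog/'
# or, from inside an already-running shell:
fetch('https://www.zyte.com/blog/')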

When you crawl something with Scrapy, it returns a “response” object that contains the downloaded information. Let’s see what the crawler has downloaded; view(response) opens the downloaded page in your default browser:

view(response)

Let’s see what the raw content looks like:

print(response.text)
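
From here you can test selectors against the live response before committing them to a spider; the selector below is illustrative:

response.css('title::text').get()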

Suggestion : 3

An open source and collaborative framework for extracting the data you need from websites, in a fast, simple, yet extensible way. Scrapy is extensible by design: plug in new functionality easily without having to touch the core. Write the rules to extract the data and let Scrapy do the rest.

pip install scrapy
cat > myspider.py << EOF
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://www.zyte.com/blog/']

    def parse(self, response):
        for title in response.css('.oxy-post-title'):
            yield {'title': title.css('::text').get()}

        for next_page in response.css('a.next'):
            yield response.follow(next_page, self.parse)
EOF
scrapy runspider myspider.py
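
To save the yielded items to a file instead of only printing them in the log, pass the -o flag; the filename is an example:

scrapy runspider myspider.py -o posts.json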
Once the spider works locally, you can deploy it to Zyte Scrapy Cloud with the shub command-line tool:
 pip install shub
 shub login
Insert your Zyte Scrapy Cloud API Key: <API_KEY>

# Deploy the spider to Zyte Scrapy Cloud
 shub deploy

# Schedule the spider for execution
 shub schedule blogspider 
Spider blogspider scheduled, watch it running here:
https://app.zyte.com/p/26731/job/1/8

# Retrieve the scraped data
 shub items 26731/1/8
{"title": "Improved Frontera: Web Crawling at Scale with Python 3 Support"}
{"title": "How to Crawl the Web Politely with Scrapy"}
...