scrapy - retrieve spider object in dupefilter

If you really want that, a solution is to override the request_seen method signature of the RFPDupeFilter so that it receives two arguments (self, request, spider); then you also need to override the Scrapy Scheduler's enqueue_request method, because request_seen is called inside it. You can create a new scheduler and a new dupefilter like this:

# / scheduler.py

from scrapy.core.scheduler import Scheduler


class MyScheduler(Scheduler):

    def enqueue_request(self, request):
        # Same as the stock Scheduler, except the spider is passed along to the dupefilter
        if not request.dont_filter and self.df.request_seen(request, self.spider):
            self.df.log(request, self.spider)
            return False
        dqok = self._dqpush(request)
        if dqok:
            self.stats.inc_value('scheduler/enqueued/disk', spider=self.spider)
        else:
            self._mqpush(request)
            self.stats.inc_value('scheduler/enqueued/memory', spider=self.spider)
        self.stats.inc_value('scheduler/enqueued', spider=self.spider)
        return True
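
The self.spider attribute used above is populated by the stock Scheduler when the crawl starts, so the subclass needs no extra wiring; for reference, a simplified sketch of the relevant part of the base class (the exact body differs between Scrapy versions):

# scrapy/core/scheduler.py (simplified sketch, not the full implementation)
class Scheduler:

    def open(self, spider):
        self.spider = spider  # the scheduler remembers the spider it was opened for
        ...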


# / dupefilters.py

import os

from scrapy.dupefilters import RFPDupeFilter


class MyRFPDupeFilter(RFPDupeFilter):

    def request_seen(self, request, spider):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)

        # Do things with spider

and set their paths in settings.py:

# / settings.py

DUPEFILTER_CLASS = 'myproject.dupefilters.MyRFPDupeFilter'
SCHEDULER = 'myproject.scheduler.MyScheduler'
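
The "# Do things with spider" placeholder above is where the extra spider argument becomes useful; a minimal sketch, assuming a custom allow_duplicates attribute on the spider (that attribute is not part of Scrapy):

class MyRFPDupeFilter(RFPDupeFilter):

    def request_seen(self, request, spider):
        # Hypothetical: let a spider opt out of duplicate filtering via a custom flag
        if getattr(spider, 'allow_duplicates', False):
            return False
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            # spider.logger is the standard per-spider Scrapy logger
            spider.logger.debug('Filtered duplicate request: %s', request.url)
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
        return False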

Suggestion : 2

The Scrapy settings allow you to customize the behaviour of all Scrapy components, including the core, extensions, pipelines and the spiders themselves. EXTENSIONS_BASE is a dict containing the extensions available by default in Scrapy, and their orders; it contains all stable built-in extensions, and keep in mind that some of them need to be enabled through a setting. ITEM_PIPELINES is a dict containing the item pipelines to use, and their orders; order values are arbitrary, but it is customary to define them in the 0-1000 range, and lower orders process before higher orders. Settings can be accessed through the scrapy.crawler.Crawler.settings attribute of the Crawler that is passed to the from_crawler method in extensions, middlewares and item pipelines:

scrapy crawl myspider -s LOG_FILE=scrapy.log

class MySpider(scrapy.Spider):
    name = 'myspider'

    custom_settings = {
        'SOME_SETTING': 'some value',
    }

from mybot.pipelines.validate import ValidateMyItem

ITEM_PIPELINES = {
    # passing the classname...
    ValidateMyItem: 300,
    # ...equals passing the class path 'mybot.pipelines.validate.ValidateMyItem': 300,
}

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        print(f"Existing settings: {self.settings.attributes.keys()}")

class MyExtension:
    def __init__(self, log_is_enabled=False):
        if log_is_enabled:
            print("log is enabled!")

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(settings.getbool('LOG_ENABLED'))

# default value of the DEFAULT_REQUEST_HEADERS setting
{
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
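
The extension and pipeline dicts described above follow the same path-to-order convention; a minimal settings.py sketch of that convention (the EXTENSIONS setting is the user-facing counterpart of EXTENSIONS_BASE, and setting an order to None disables a built-in extension):

# settings.py -- sketch of the dict-and-order convention
EXTENSIONS = {
    'scrapy.extensions.telnet.TelnetConsole': None,  # None disables this built-in extension
}

ITEM_PIPELINES = {
    'mybot.pipelines.validate.ValidateMyItem': 300,  # lower order values run first
}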

Suggestion : 3

The Scrapy process can be used to extract data from sources such as web pages using spiders. Scrapy uses the Item class to produce the output, whose objects are used to gather the scraped data. The Scrapy shell can be used to scrape data with error-free code, without the use of a spider; its main purpose is to test extraction code, XPath, or CSS expressions, and it also helps specify the web pages from which you are scraping the data. Scrapy crawls websites using Request and Response objects: request objects pass over the system, use the spiders to execute the request, and return when a response object comes back. A Spider is a class that defines the initial URLs to extract data from, how to follow pagination links, and how to extract and parse the fields defined in items.py; Scrapy provides different types of spiders, each of which serves a specific purpose.

>>> from scrapy.loader.processors import TakeFirst
>>> loader.get_value(u'title: demoweb', TakeFirst(), unicode.upper, re='title: (.+)')
'DEMOWEB'

loader.add_value('title', u'DVD')
loader.add_value('colors', [u'black', u'white'])
loader.add_value('length', u'80')
loader.add_value('price', u'2500')

loader.replace_value('title', u'DVD')
loader.replace_value('colors', [u'black', u'white'])
loader.replace_value('length', u'80')
loader.replace_value('price', u'2500')

# HTML code: <div class="item-name">DVD</div>
loader.get_xpath("//div[@class='item-name']")

# HTML code: <div id="length">the length is 45cm</div>
loader.get_xpath("//div[@id='length']", TakeFirst(), re="the length is (.*)")

# HTML code: <div class="item-name">DVD</div>
loader.add_xpath('name', '//div[@class="item-name"]')

# HTML code: <div id="length">the length is 45cm</div>
loader.add_xpath('length', '//div[@id="length"]', re='the length is (.*)')

# HTML code: <div class="item-name">DVD</div>
loader.replace_xpath('name', '//div[@class="item-name"]')

# HTML code: <div id="length">the length is 45cm</div>
loader.replace_xpath('length', '//div[@id="length"]', re='the length is (.*)')
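
The calls above assume an ItemLoader instance named loader already exists; a minimal sketch of building one inside a spider callback (the Product item class and the ProductSpider name are assumptions, not part of Scrapy):

import scrapy
from scrapy.loader import ItemLoader
from myproject.items import Product  # hypothetical Item subclass declaring name/price fields

class ProductSpider(scrapy.Spider):  # hypothetical spider, just to host the callback
    name = 'products'

    def parse(self, response):
        # Bind the loader to an item instance and to the response being parsed
        loader = ItemLoader(item=Product(), response=response)
        loader.add_xpath('name', '//div[@class="item-name"]')
        loader.add_value('price', u'2500')
        yield loader.load_item()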

Suggestion : 4

Distributed crawling/scraping: you can start multiple spider instances that share a single Redis queue, which is best suited for broad multi-domain crawls. The class scrapy_redis.spiders.RedisSpider enables a spider to read URLs from Redis. The URLs in the Redis queue are processed one after another; if the first request yields more requests, the spider processes those requests before fetching another URL from Redis. This example illustrates how to share a spider's requests queue across multiple spider instances, which is highly suitable for broad crawls.

$ cd example-project
$ scrapy crawl dmoz
... [dmoz] ...
^C
$ scrapy crawl dmoz
... [dmoz] DEBUG: Resuming crawl (9019 requests scheduled)
$ scrapy crawl dmoz
... [dmoz] DEBUG: Resuming crawl (8712 requests scheduled)
$ python process_items.py dmoz:items -v
...
Processing: Kilani Giftware (http://www.dmoz.org/Computers/Shopping/Gifts/)
Processing: NinjaGizmos.com (http://www.dmoz.org/Computers/Shopping/Gifts/)
...
from scrapy_redis.spiders import RedisSpider


class MySpider(RedisSpider):
    name = 'myspider'

    def parse(self, response):
        # do stuff
        pass
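
For this to actually distribute work, the project must route scheduling and duplicate filtering through scrapy-redis and point it at a Redis server; a minimal sketch following the scrapy-redis README (the Redis URL and the myspider:start_urls key assume the defaults):

# settings.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = 'redis://localhost:6379'

Then push a start URL onto the spider's Redis list, e.g.:

$ redis-cli lpush myspider:start_urls http://www.example.com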