Use Scrapy to crawl a local XML file - start URL as a local file address


Don't specify the allowed_domains at all and use 3 slashes after the protocol:

start_urls = ["file:///home/sayth/Downloads/20160123RAND0.xml"]

Here's a usage example that builds the file URI from a relative path:

import os
import pathlib

start_urls = [
   pathlib.Path(os.path.abspath('20160123RAND0.xml')).as_uri()
]
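
Put together, a minimal spider could look like the sketch below. It is only a sketch: the meeting/race element names are taken from the question's XPath expressions and would need to match the real structure of the XML file.

import scrapy


class MyxmlSpider(scrapy.Spider):
    name = "myxml"
    # no allowed_domains: a file:// URL has no domain to filter on
    start_urls = ["file:///home/sayth/Downloads/20160123RAND0.xml"]

    def parse(self, response):
        # element names are assumed from the question's XPaths;
        # adjust them to the actual layout of the XML file
        for race in response.xpath("//meeting/race"):
            yield {"name": race.extract()}

Saved as, say, myxml.py (the file name is just an example), it can be run directly with scrapy runspider myxml.py -o races.json.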

Suggestion : 2

I want to crawl a local XML file that I have located in my Downloads folder with Scrapy and use XPath to extract the relevant information.

Using the scrapy intro as a guide, this is the error I get:

2016-01-24 12:38:53 [scrapy] DEBUG: Retrying <GET file://home/sayth/Downloads/20160123RAND0.xml> (failed 2 times): [Errno 2] No such file or directory: '/sayth/Downloads/20160123RAND0.xml'
2016-01-24 12:38:53 [scrapy] DEBUG: Gave up retrying <GET file://home/sayth/Downloads/20160123RAND0.xml> (failed 3 times): [Errno 2] No such file or directory: '/sayth/Downloads/20160123RAND0.xml'
2016-01-24 12:38:53 [scrapy] ERROR: Error downloading <GET file://home/sayth/Downloads/20160123RAND0.xml>

I have tried several versions of the below, however I am not able to get the start URL to accept my file.

# -*- coding: utf-8 -*-
import scrapy


class MyxmlSpider(scrapy.Spider):
    name = "myxml"
    allowed_domains = ["file://home/sayth/Downloads"]
    start_urls = (
        'http://www.file://home/sayth/Downloads/20160123RAND0.xml',
    )

    def parse(self, response):
        for file in response.xpath('//meeting'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_xml(self, response):
        yield {
            'name': response.xpath('//meeting/race').extract()
        }

Just to confirm, I do have the file in that location:

sayth@sayth-HP-EliteBook-2560p:~/Downloads [0]% ls -a
.    Building a Responsive Website with Bootstrap[Video].zip    ..    codemirror.zip
1.1 Situation Of Long Term Gain.xls    Complete-Python-Bootcamp-master.zip
2008 Racedata.xls    Cox Plate 2005.xls
20160123RAND0.xml
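
Two smaller problems in the spider above are worth noting as well: href is never defined inside parse (the loop variable is file), and the request callback points to self.parse_question while the method is actually named parse_xml. Neither causes the error shown, though; the crawl failure itself comes from the start URL.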

Don't specify the allowed_domains at all and use 3 slashes after the protocol:

start_urls = ["file:///home/sayth/Downloads/20160123RAND0.xml"]

