Don't specify allowed_domains at all, and use three slashes after the protocol:
start_urls = ["file:///home/sayth/Downloads/20160123RAND0.xml"]
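A minimal sketch of why the third slash matters, using the standard library's urllib.parse rather than Scrapy itself. That Scrapy resolves file URLs the same way is an assumption, though it is consistent with the '/sayth/Downloads/...' path reported in the error log in the question: with only two slashes, "home" is read as the host part and drops out of the path.

```python
from urllib.parse import urlparse

# Two slashes: the first path segment is parsed as the host (netloc).
two_slashes = urlparse("file://home/sayth/Downloads/20160123RAND0.xml")
print(repr(two_slashes.netloc), two_slashes.path)

# Three slashes: the host is empty and the full absolute path is preserved.
three_slashes = urlparse("file:///home/sayth/Downloads/20160123RAND0.xml")
print(repr(three_slashes.netloc), three_slashes.path)
```

The first call reports a host of "home" and a path that starts at /sayth, while the second keeps /home/sayth/... intact.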
Here's a usage example that builds the URL from a filename relative to the current directory:
import os
import pathlib

start_urls = [
    pathlib.Path(os.path.abspath('20160123RAND0.xml')).as_uri()
]
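The absolute-path step in that example is needed because pathlib refuses to turn a relative path into a URI. The sketch below shows both cases; the helper name to_file_uri is hypothetical, and it uses Path.resolve() as a stand-in for os.path.abspath, which is an assumption about what is acceptable here (resolve() also follows symlinks).

```python
from pathlib import Path


def to_file_uri(filename: str) -> str:
    """Return a file:// URI for filename, resolved against the current directory."""
    # resolve() makes the path absolute; as_uri() then emits the file:/// form.
    return Path(filename).resolve().as_uri()


uri = to_file_uri("20160123RAND0.xml")
print(uri)

# A relative path cannot be expressed as a file URI, so as_uri() raises.
relative_rejected = False
try:
    Path("20160123RAND0.xml").as_uri()
except ValueError as exc:
    relative_rejected = True
    print("relative path rejected:", exc)
```

The printed URI depends on the directory the script runs from, so start the crawl from the folder that holds the file or pass a full path.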
I want to crawl a local XML file located in my Downloads folder with Scrapy and use XPath to extract the relevant information.
Using the Scrapy intro as a guide, I get the following log output:
2016-01-24 12:38:53 [scrapy] DEBUG: Retrying <GET file://home/sayth/Downloads/20160123RAND0.xml> (failed 2 times): [Errno 2] No such file or directory: '/sayth/Downloads/20160123RAND0.xml'
2016-01-24 12:38:53 [scrapy] DEBUG: Gave up retrying <GET file://home/sayth/Downloads/20160123RAND0.xml> (failed 3 times): [Errno 2] No such file or directory: '/sayth/Downloads/20160123RAND0.xml'
2016-01-24 12:38:53 [scrapy] ERROR: Error downloading <GET file://home/sayth/Downloads/20160123RAND0.xml>
I have tried several versions of the code below, but I cannot get the start URL to accept my file.
# -*- coding: utf-8 -*-
import scrapy


class MyxmlSpider(scrapy.Spider):
    name = "myxml"
    allowed_domains = ["file://home/sayth/Downloads"]
    start_urls = (
        'http://www.file://home/sayth/Downloads/20160123RAND0.xml',
    )

    def parse(self, response):
        for file in response.xpath('//meeting'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_xml(self, response):
        yield {
            'name': response.xpath('//meeting/race').extract()
        }
Just to confirm, I do have the file in that location:
sayth@sayth-HP-EliteBook-2560p: ~/Downloads [0] % ls -a
.                                     Building a Responsive Website with Bootstrap[Video].zip
..                                    codemirror.zip
1.1 Situation Of Long Term Gain.xls   Complete-Python-Bootcamp-master.zip
2008 Racedata.xls                     Cox Plate 2005.xls
20160123RAND0.xml