Bolerjack: use scrapy to crawl an xml webpage

Wednesday, 7 August 2013

use scrapy to crawl an xml webpage

I need to crawl an XML page, http://www.10why.net/sitemap.xml, which is just
a table of the URLs that I want. This is my spider:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import re

thename = "sitemap"

class ReviewSpider(BaseSpider):
    name = thename
    allowed_domains = ['10why.net']
    start_urls = ['http://www.10why.net/sitemap.xml']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        content = hxs.select('//table[@cellpadding="5"]/tbody//a')
        print content
        for c in content:
            file = open('%s.txt' % thename, 'a')
            file.write("\n")
            file.write(c)
            file.close()
The content printed is [] (an empty list). I used to be able to crawl
normal HTML pages this way, but it does not work on a sitemap XML page.
Please help me. PS: I write the file myself for other reasons.
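
One possible explanation, sketched here under assumptions: if the file follows the standard sitemaps.org format, the "table" seen in a browser is probably produced by an XSL stylesheet, while the response Scrapy receives is raw XML made of namespaced urlset/url/loc elements, so an XPath looking for table and a elements matches nothing. The snippet below uses only Python's standard library (not Scrapy) to show the idea. The sample document, its post URLs, and the function name extract_sitemap_urls are hypothetical, made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
# Assumption: the real sitemap.xml declares this namespace.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample standing in for the real file; the URLs are made up.
SAMPLE = """<?xml-stylesheet type="text/xsl" href="sitemap.xsl"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.10why.net/example-1.html</loc></url>
  <url><loc>http://www.10why.net/example-2.html</loc></url>
</urlset>"""


def extract_sitemap_urls(xml_text):
    """Return the text of every namespaced <loc> element in a sitemap."""
    root = ET.fromstring(xml_text)
    # The "sm:" prefix is required: a bare ".//loc" does not match
    # elements that live in the sitemap namespace.
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


if __name__ == "__main__":
    for url in extract_sitemap_urls(SAMPLE):
        print(url)
```

Inside the spider, the same kind of query could probably be run with Scrapy's XML selector instead of HtmlXPathSelector, with the sitemap namespace registered or removed first, and Scrapy also ships a SitemapSpider class intended for this kind of file. Which of these fits depends on the Scrapy version in use.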
