Scrapy
Scrapy is an application framework for crawling websites and extracting structured data, which can be used for a wide range of useful applications like data mining, information processing, or historical archival.
Features
- Built-in support for selecting and extracting data from HTML/XML source using extended CSS selectors and XPath expressions, with helper methods to extract using regular expressions.
- An interactive shell console for trying out the CSS and XPath expressions to scrape data, very useful when writing or debugging your spiders.
- Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends (e.g. FTP, S3, local filesystem).
- Robust encoding support and auto-detection, for dealing with foreign, non-standard and broken encoding declarations.
- Strong extensibility support, allowing you to plug in your own functionality using signals and well-defined APIs (middlewares, extensions and pipelines).
- Wide range of built-in extensions and middlewares for handling:
  - cookies and session handling
  - HTTP features like compression, authentication, caching
  - user-agent spoofing
  - robots.txt
  - crawl depth restriction
  - and more
- A Telnet console for hooking into a Python console running inside your Scrapy process, to introspect and debug your crawler.
- Plus other goodies like reusable spiders to crawl sites from Sitemaps and XML/CSV feeds, a media pipeline for automatically downloading images associated with the scraped items, and a caching DNS resolver.
Write your first Scrapy spider
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
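To run this spider, assuming it is saved inside a Scrapy project (e.g. under the spiders/ folder of a project created with startproject), go to the project's top-level directory and run:

$ scrapy crawl quotes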
- name: identifies the Spider. It must be unique within a project.
- start_requests(): must return an iterable of Requests which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests. (A shorter start_urls variant is sketched after this list.)
- parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.
The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.
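As a shorter, equivalent sketch: instead of implementing start_requests(), you can define a start_urls class attribute; Scrapy's default start_requests() implementation builds the initial Requests from it and uses parse() as the default callback.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # Same logic as above: save each downloaded page to a local HTML file.
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)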
Extracting data
scrapy shell "http://quotes.toscrape.com/page/1/"
- Use CSS selectors
>>> response.css('title')
>>> response.css('title::text').extract()
>>> response.css('title::text')[0].extract()  # may raise IndexError when the result is empty
>>> response.css('title::text').extract_first()
>>> response.css('span small::text').extract_first()  # e.g. <span class="small">Something</span>
>>> response.css('div.tags a.tag::text').extract()
Using extract_first() avoids an IndexError and returns None when it doesn't find any element matching the selection.
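Note: in newer Scrapy releases, .get() and .getall() are available as equivalent, more readable aliases for .extract_first() and .extract():

>>> response.css('title::text').get()
>>> response.css('title::text').getall()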
- Use regular expressions
>>> response.css('title::text').re(r'Quotes.*')
>>> response.css('title::text').re(r'Q\w+')
- Use XPath, a brief introduction
>>> response.xpath('//title')
>>> response.xpath('//title/text()').extract_first()
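A couple more illustrative XPath examples against the same page, assuming the div.quote / span.text markup used later in this document (XPath can also match on attributes and text content, which CSS selectors cannot do as easily):

>>> response.xpath('//div[@class="quote"]/span[@class="text"]/text()').extract()
>>> response.xpath('//a[contains(@href, "page")]/@href').extract()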
Storing the scraped data
- JSON storage
$ scrapy crawl quotes -o quotes.json
This will generate a quotes.json file containing all scraped items, serialized in JSON. (Note that Scrapy appends to the file rather than overwriting its contents.)
- JSON Lines
$ scrapy crawl quotes -o quotes.jl
The JSON Lines format is useful because it's stream-like: you can easily append new records to it. Since each record is on a separate line, you can process big files without having to fit everything in memory.
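The same -o option covers the other built-in feed formats mentioned in the Features section; the export format is inferred from the output file extension:

$ scrapy crawl quotes -o quotes.csv
$ scrapy crawl quotes -o quotes.xml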
Following links
def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').extract_first(),
            'author': quote.css('span small::text').extract_first(),
            'tags': quote.css('div.tags a.tag::text').extract(),
        }
    next_page = response.css('li.next a::attr(href)').extract_first()
    if next_page is not None:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)
When you yield a Request in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when the request finishes.
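As a convenience (available since Scrapy 1.4), response.follow() is a shortcut for creating follow-up Requests; unlike scrapy.Request it accepts relative URLs directly, so the urljoin() call is not needed:

next_page = response.css('li.next a::attr(href)').extract_first()
if next_page is not None:
    yield response.follow(next_page, callback=self.parse)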
Command Line Tools
Scrapy will look for configuration parameters in ini-style scrapy.cfg files in standard locations:
- /etc/scrapy.cfg (system wide)
- ~/.config/scrapy.cfg or ~/.scrapy.cfg for global settings
- scrapy.cfg inside a scrapy project's root
scrapy.cfg
myproject/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...
The directory where the scrapy.cfg file resides is known as the project root directory.
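For reference, the scrapy.cfg generated by startproject is a minimal ini file pointing at the project's settings module, roughly:

[settings]
default = myproject.settings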
Common commands
- Global commands: startproject, genspider, settings, runspider, shell, fetch, view, version
- Project-only commands: crawl, check, list, edit, parse, bench
startproject: `scrapy startproject myproject [project_dir]`
genspider: `scrapy genspider [-t template] <name> <domain>`
- Creates a new spider in the current folder or in the current project's spiders folder, if called from inside a project. The <name> parameter is set as the spider's name, while <domain> is used to generate the allowed_domains and start_urls spider attributes.
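For example (the template names below are the ones shipped with a standard Scrapy install):

$ scrapy genspider -l                    # list available templates: basic, crawl, csvfeed, xmlfeed
$ scrapy genspider example example.com   # create spiders/example.py from the default (basic) template
$ scrapy genspider -t crawl myspider example.com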
Passing arguments
1. Adding arguments when calling scrapy crawl
" Spider arguments are passed in the crawl command using the -a option.
$ scrapy crawl myspider -a category=electronics -a domain=system
Spiders can access these arguments in their initializers:
class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category='', domain=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]
        self.domain = domain
        # ...
2. Sending parameters to a callback function
yield scrapy.Request(url, callback=self.detail, meta={'item': item})
In the callback function, you can retrieve the parameter with:
item = response.meta['item']
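For completeness, a sketch of the corresponding detail() callback; the detail name and the div.description selector are illustrative placeholders, not part of the example above:

def detail(self, response):
    # Retrieve the partially-built item passed through the Request's meta dict.
    item = response.meta['item']
    # Fill in an extra field from the detail page (placeholder selector).
    item['description'] = response.css('div.description::text').extract_first()
    yield item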