XPath

XPath is the result of an effort to provide a common syntax and semantics for functionality shared between XSL Transformations and XPointer. (The primary purpose of XPath is to address parts of an XML document.)

Common XPath expressions

  • XPath as filesystem addressing
    • If the path starts with a slash /, it represents an absolute path to the required element.
  • Start with //

    • If the path starts with //, then all elements in the document which fulfill the following criteria are selected.
  • All elements with *

    • The star * selects all elements located by the preceding path.
  • Further condition inside []

    • An expression in square brackets can further specify an element. A number in the brackets gives the position of the element in the selected set. The function last() selects the last element in the selection.

      Example: /AAA/BBB[last()]   /AAA/BBB[1]
      
  • Attributes

    • Attributes are specified with the @ prefix.

      //@id //BBB[@id] //BBB[@name] //BBB[@*] //BBB[not(@*)]

  • Attribute values

    • Values of attributes can be used as selection criteria.

      //BBB[@id='b1'] //BBB[@name='bbb'] //BBB[normalize-space(@name)='bbb']

  • Node counting

    • The function count() counts the number of selected elements.

      //*[count(BBB) = 2] //*[count(*) = 2]

  • Playing with names of selected elements

    • The function name() returns the name of the element, starts-with() returns true if the first argument string starts with the second argument string, and contains() returns true if the first argument string contains the second argument string.

      //*[name()='BBB'] //*[starts-with(name(), 'B')] //*[contains(name(), 'C')]
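Many of the expressions above can be tried with no third-party libraries: Python's standard-library xml.etree.ElementTree implements a useful subset of XPath (paths, the * wildcard, and predicates such as [1], [last()], and [@attr='value'], though not functions like count() or name()). A minimal sketch on an invented AAA/BBB document:

```python
import xml.etree.ElementTree as ET

# A small invented document in the spirit of the examples above
doc = ET.fromstring(
    "<AAA>"
    "<BBB id='b1'/>"
    "<BBB name='bbb'/>"
    "<CCC/>"
    "</AAA>"
)

# The wildcard * selects all child elements of AAA
print(len(doc.findall("*")))                      # 3

# Positional predicates select by position in the matched set
print(doc.findall("BBB[1]")[0].get("id"))         # b1
print(doc.findall("BBB[last()]")[0].get("name"))  # bbb

# Attribute predicates filter on attribute presence or value
print(len(doc.findall(".//BBB[@id='b1']")))       # 1
print(len(doc.findall(".//BBB[@name='bbb']")))    # 1
```

ElementTree paths are evaluated relative to the element they are called on, which is why the descendant searches use the `.//` form.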

Basic Principle

If a condition is appended with /, the matching nodes are selected for the next step. If the appended condition is enclosed in [], processing continues with the original set, but discards those nodes for which there were no matching new nodes.
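The distinction can be demonstrated with the standard library's ElementTree, which supports a subset of XPath (the document below is invented for illustration):

```python
import xml.etree.ElementTree as ET

# Invented document: one link with an image, one without
doc = ET.fromstring(
    "<root><a><img src='1.png'/></a><a>plain link</a></root>"
)

# Appending /img steps INTO the matches: the result set is img elements
imgs = doc.findall(".//a/img")
print([e.tag for e in imgs])     # ['img']

# Appending [img] keeps the original a elements, discarding those
# with no matching img child
links = doc.findall(".//a[img]")
print([e.tag for e in links])    # ['a']

# The unfiltered set, for comparison
print(len(doc.findall(".//a")))  # 2
```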

Selectors

  • BeautifulSoup: constructs a Python object based on the structure of the HTML code and deals with bad markup reasonably well.
  • lxml: an XML parsing library with a pythonic API based on ElementTree.
>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]
>>> response.css('img').xpath('@src').extract()

The .xpath() and .css() methods return a SelectorList instance, which is a list of new selectors. To actually extract the textual data, call the .extract() method of a selector.

>>> response.xpath('//base/@href').extract()
>>> response.css('base::attr(href)').extract()

>>> response.xpath('//a[contains(@href, "image")]').extract()
>>> response.css('a[href*=image]::attr(href)').extract()

>>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()

Nesting selectors

The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the same selection methods on those selectors.

>>> links = response.xpath('//a[contains(@href, "image")]')
>>> for index, link in enumerate(links):
...     args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
...     print('Link number %d points to url %s and image %s' % args)
Select with the class name:

>>> links = response.xpath('//div[@class="main-body"]/p/text()').extract()
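Note that the XPath predicate [@class="main-body"] is an exact string comparison, unlike the CSS selector .main-body, which matches any whitespace-separated class token. A small sketch with the standard library (the markup is invented):

```python
import xml.etree.ElementTree as ET

html = (
    "<root>"
    "<div class='main-body'><p>first</p><p>second</p></div>"
    "<div class='main-body wide'><p>third</p></div>"
    "</root>"
)
doc = ET.fromstring(html)

# Exact attribute match: the 'main-body wide' div is NOT selected
texts = [p.text for p in doc.findall(".//div[@class='main-body']/p")]
print(texts)  # ['first', 'second']
```

With the Scrapy/CSS form `.main-body`, both divs would match; with the exact XPath comparison, only the first does.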
