Set Up Scrapy for Development

In this post, learn how to set up Scrapy for development.

Install Python

Install Scrapy using pip

Create a virtual environment with the command below so that Scrapy runs in its own isolated Python environment.

cmd> python3 -m venv foldername

e.g. cmd> python3 -m venv tutorial-env

Now activate the virtual environment by running one of the commands below on Windows:

.\tutorial-env\Scripts\activate.bat for Command Prompt

.\tutorial-env\Scripts\Activate.ps1 for PowerShell
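
The venv command above can also be driven from Python through the standard-library venv module. The sketch below is only an illustration and not part of the original steps: the helper name create_env and its with_pip parameter are made up for this example, and tutorial-env is just the folder name used above.

```python
import os
import venv
from pathlib import Path


def create_env(target, with_pip=True):
    """Create a virtual environment in `target`, like `python3 -m venv target`.

    Returns the folder that holds the activation scripts
    (Scripts on Windows, bin elsewhere).
    """
    target = Path(target)
    # venv.create builds the environment; with_pip=True also bootstraps pip
    # inside it so that Scrapy can be installed there afterwards.
    venv.create(target, with_pip=with_pip)
    scripts_dir = "Scripts" if os.name == "nt" else "bin"
    return target / scripts_dir
```

Calling create_env('tutorial-env') should leave the same folder layout as the command-line version. Activation is still done with the scripts shown above.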

Start a project

To create a project, run the command below:

cmd> scrapy startproject whiskyscraper

Scrapy Shell

To start the Scrapy shell, run the command below:

cmd> scrapy shell

Fetch

fetch('url') fetches the page and stores it in response

response shows the response object, including its status code

response.css('div.classname') returns all matching div elements

response.css('div.classname').get() returns the first matching div

products = response.css('div.classname') saves all matching elements in products (a list of selectors)

len(products) returns the number of records

products.css('a.classname') gets the a elements with class "classname" inside the matched divs

products.css('a.classname').get() gets the first of those a elements

products.css('a.classname::text').get() gets the inner text of the first a

products.css('a.classname::text').getall() gets the inner text of all a elements

products.css('a.classname::text').get().replace('$','') removes the dollar sign

products.css('a.classname').attrib['href'] gets the href value of the first a

Run the spider with the command below (the argument is the spider's name):

cmd> scrapy crawl whiskyscraper


cmd> scrapy shell 'url'

This opens the shell with the URL already fetched.

response.css('title')

response.css('title').get()

response.css('title::text').get() gets the title text

response.css('h3') selects all h3 elements

response.css('h3').get() first h3

response.css('h3').getall() all h3

response.css('h3').getall()[1] second element (indexes start at 0, so [0] is the first)

response.css('h3').getall()[3] fourth element

response.css('div.classname.classname') if the class attribute contains a space, replace it with a dot

response.css('a::attr(href)').get() returns the href of the first a

response.css('a::text').get() returns the text of the first a

response.text returns the full page as a string

response.css('h1') all h1 elements

response.css('h1').get() first one

response.css('h1::text').get() returns the h1 text


cmd> scrapy genspider drones1 URL creates the spider file (drones1.py)


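# inside the spider class in drones1.py
def parse(self, response):
    products = response.css('div.classname')
    for product in products:
        item = {
            'name': 'something',
            'price': 24,
        }
        yield item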


cmd> scrapy crawl spidername -o one.csv creates a CSV file

cmd> scrapy crawl spidername -o one.json creates a JSON file
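
Once the files exist, they can be read back with the standard library. The helper below is a small sketch, and load_items is a made-up name rather than anything Scrapy provides. It assumes the JSON export is a single array of objects and that the CSV export starts with a header row.

```python
import csv
import json
from pathlib import Path


def load_items(path):
    """Read items back from a .json or .csv export file."""
    path = Path(path)
    if path.suffix == ".json":
        # The JSON export is one array of objects.
        with path.open(encoding="utf-8") as f:
            return json.load(f)
    if path.suffix == ".csv":
        # The first CSV row is the header; every value comes back as a string.
        with path.open(newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    raise ValueError(f"unsupported export format: {path.suffix}")
```

For example, load_items('one.json') would return a list of dicts. The CSV version returns the same keys, but numbers such as the price come back as text.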