In this blog, learn how to set up Scrapy for development.
Install Python
Install Scrapy using pip
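The install itself is a single pip command, for example:
cmd> pip install scrapy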
Create a virtual environment using the below command, so that Scrapy runs in an isolated Python environment.
cmd> python3 -m venv foldername
e.g. cmd> python3 -m venv tutorial-env
Now activate the virtual environment by running the below command on Windows:
.\tutorial-env\Scripts\activate.bat for the command prompt
.\tutorial-env\Scripts\Activate.ps1 for PowerShell
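Putting the setup together, the whole sequence should look roughly like this (tutorial-env is just an example name; once the environment is active, its name appears in the prompt):
cmd> python3 -m venv tutorial-env
cmd> .\tutorial-env\Scripts\activate.bat
(tutorial-env) cmd> pip install scrapy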
To create a project, run the below command:
cmd> scrapy startproject whiskyscraper
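The startproject command generates a project skeleton roughly like the below (these are the Scrapy defaults; your spiders go in the spiders folder):
whiskyscraper/
    scrapy.cfg
    whiskyscraper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py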
To open the Scrapy shell, run the below command:
cmd> scrapy shell
Fetch
fetch('url') to fetch a page
response to see the response status
response.css('div.classname') returns all matching divs
response.css('div.classname').get() returns the first div
products = response.css('div.classname') saves all matching records in the products selector list
len(products) returns the number of records
products.css('a.classname') gets the matching "a.classname" elements within each div
products.css('a.classname').get() gets the first a
products.css('a.classname::text').get() gets the inner text of the first a
products.css('a.classname::text').getall() gets the inner text of all a elements
products.css('a.classname::text').get().replace('$','') removes the dollar sign from the text
products.css('a.classname').attrib['href'] gets the href attribute
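Putting these together, a typical shell session might look like the below; the URL and the div.product, a.name and p.price class names are placeholders for whatever site you are scraping:
cmd> scrapy shell
fetch('https://example.com/shop')
products = response.css('div.product')
len(products)
products.css('a.name::text').get()
products.css('p.price::text').get().replace('$','')
products.css('a.name').attrib['href']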
Run the spider with the below command (the argument is the spider's name):
cmd> scrapy crawl whiskyscraper
You can also pass a URL to the shell, which opens it and fetches the page in one step:
cmd> scrapy shell 'url'
response.css('title') returns the title selector
response.css('title').get() returns the first title match as HTML
response.css('title::text').get() gets the title text
response.css('h3') gets all h3 selectors
response.css('h3').get() gets the first h3
response.css('h3').getall() gets all h3 elements
response.css('h3').getall()[1] gets the second element (indexing is zero-based)
response.css('h3').getall()[3] gets the fourth element
response.css('div.classname.classname') if the class attribute contains a space (multiple classes), replace the space with a dot
response.css('a::attr(href)').get() returns the href value
response.css('a::text').get() returns the link text
response.text returns the full page as a string
response.css('h1') gets all h1 elements
response.css('h1').get() gets the first one
response.css('h1::text').get() returns the h1 text
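For example, against a live page (example.com is a placeholder):
cmd> scrapy shell 'https://example.com'
response.css('title::text').get()
response.css('h3::text').getall()
response.css('a::attr(href)').getall()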
cmd> scrapy genspider drones1 URL - creates a spider file (drones1.py) in the spiders folder
Inside the spider's parse method, loop over the products and yield one item per product (the values here are placeholders):
for product in products:
    item = {
        'name': 'something',
        'price': 24
    }
    yield item
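Pulling the genspider skeleton and the loop together, a minimal working spider might look like the below; the URL and the div.product, a.name and p.price class names are assumptions to swap for the real site:

import scrapy

class Drones1Spider(scrapy.Spider):
    name = 'drones1'
    start_urls = ['https://example.com/drones']  # placeholder URL

    def parse(self, response):
        # grab every product card on the page
        products = response.css('div.product')
        for product in products:
            # yield one item per product; guard against a missing price
            yield {
                'name': product.css('a.name::text').get(),
                'price': (product.css('p.price::text').get() or '').replace('$', ''),
            }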
cmd> scrapy crawl spidername -o one.csv creates a CSV file
cmd> scrapy crawl spidername -o one.json creates a JSON file
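Note that in recent Scrapy versions -o appends to an existing file, while a capital -O overwrites it:
cmd> scrapy crawl spidername -O one.json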