This article is about Scrapy, a Python framework for web scraping. I will walk you through the setup and show you how to scrape with examples.
First, download and install Python on your machine from https://www.python.org/downloads/release/python-3105/
What is Python?
Python is a high-level, interpreted, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation. Python is dynamically typed and garbage-collected.
Second, run Scrapy in a virtual environment (recommendation: install Scrapy inside an isolated Python environment). Skip step 2 if you don't want to run Scrapy in a virtual environment.
What is the virtual environment?
A virtual environment is a Python environment such that the Python interpreter, libraries, and scripts installed into it are isolated from those installed in other virtual environments, and (by default) any libraries installed in a “system” Python, i.e., one which is installed as part of your operating system.
The command below will create the virtual environment.
c:\>python3 -m venv virtual_environment_folder_name
You can activate your virtual environment by running the activate.bat or Activate.ps1 script:
.\virtual_environment_folder_name\Scripts\activate.bat
for Command Prompt
.\virtual_environment_folder_name\Scripts\Activate.ps1
for PowerShell
Deactivate your virtual environment by using the deactivate command.
Now that the environment is ready, install Scrapy with the pip command. (Remember to run this command from inside your virtual environment.)
pip install scrapy
What is scrapy shell?
The Scrapy shell is an interactive shell where you can try and debug your scraping code very quickly, without having to run the spider.
You can start the Scrapy shell with the command below:
scrapy shell
Next is to understand the Scrapy shell's shortcuts and objects. Here I use the fetch function to download the given URL; the result is saved in the response object.
>>> fetch('https://www.google.com')
>>> response.css('title::text').get()
This will return the title of the page.
Scrapy comes with its own mechanism for extracting data from HTML: selectors.
Selectors help to extract certain parts of the HTML document specified either by XPath or CSS expressions.
XPath is a language for selecting nodes in XML documents, which can also be used with HTML.
XPath example:
>>> response.xpath('//title/text()').get()
CSS is a language for applying styles to HTML documents. It defines selectors to associate those styles with specific HTML elements.
CSS example:
>>> response.css('title::text').get()
Now, let's check out the first Scrapy example.
Run the command below to create the first Scrapy project:
c:\> scrapy startproject my_first_scraper
This will create the folder and file structure required by Scrapy.
Now move into the project folder and generate your spider file.
You can create your spider file manually, or generate it with the command below:
c:\> scrapy genspider your_spider_name url_to_scrape
This will create a spider file:
import scrapy


class MyScraperSpider(scrapy.Spider):
    name = 'my_scraper'
    allowed_domains = ['www.google.com']
    start_urls = ['http://www.google.com/']

    def parse(self, response):
        pass
Let's complete the code to fetch the title of the URL:
import scrapy


class MyScraperSpider(scrapy.Spider):
    name = 'my_scraper'
    allowed_domains = ['www.google.com']
    start_urls = ['http://www.google.com/']

    def parse(self, response):
        title = response.css('title::text').get()
        yield {
            'title': title
        }
After completing your code, run your spider:
c:\> scrapy crawl my_scraper
c:\> scrapy crawl my_scraper -o one.csv (writes the output to a CSV file)
c:\> scrapy crawl my_scraper -o one.json (writes the output to a JSON file)
In example 1, we scrape names and links from an IMDb search page.
URL: https://www.imdb.com/search/name/?birth_monthday=07-13
import scrapy


class MyScraperSpider(scrapy.Spider):
    name = 'my_scraper'
    start_urls = ['https://www.imdb.com/search/name/?birth_monthday=07-13']

    def parse(self, response):
        records = response.css('h3 a')
        for rec in records:
            yield {
                'name': rec.css('a::text').get().strip(),
                'link': 'https://www.imdb.com' + rec.css('a::attr(href)').get()
            }
The code above fetches all the names and links.
In example 1, I showed you how to fetch names and links from a single page. In this example, I will show you how to follow each name's link and fetch data from its detail page.
import scrapy


class MyScraperSpider(scrapy.Spider):
    name = 'my_scraper'
    start_urls = ['https://www.imdb.com/search/name/?birth_monthday=07-13']

    def parse(self, response):
        records = response.css('h3 a')
        for rec in records:
            link = rec.css('a').attrib['href']
            # response.follow resolves the relative href against the page URL
            yield response.follow(link, callback=self.page_parse)

    def page_parse(self, response):
        name = response.css('h1 span.itemprop::text').get()
        description = response.css('div.name-trivia-bio-text div.inline::text').getall()
        description_str = ' '.join(description)
        description_str = description_str.replace('\n', '')
        description_str = description_str.replace('...', '')
        description_str = description_str.strip()
        spouse = response.css('div[id=details-spouses] a::text').get()
        yield {
            'name': name,
            'description': description_str,
            'spouse': spouse
        }
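The string clean-up in page_parse is worth seeing in isolation: getall() returns a list of text fragments, which are joined and then scrubbed of newlines and ellipses. A minimal sketch with a made-up description list:

```python
# Made-up fragments standing in for the ::text pieces scraped from a bio
description = ['\n  Legendary actor.', 'Known for many roles...', '\n']

# Same clean-up chain as page_parse: join, drop newlines, drop ellipses, trim
description_str = ' '.join(description)
description_str = description_str.replace('\n', '')
description_str = description_str.replace('...', '')
description_str = description_str.strip()
print(description_str)  # Legendary actor. Known for many roles
```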