Web scraping using Scrapy - Part 1

This article introduces Scrapy, a Python framework for web scraping. It walks through setting up the environment and scraping pages, with examples.

Step 1: Python Installation

First, download and install Python on your machine from this link: https://www.python.org/downloads/release/python-3105/

What is python?

Python is a high-level, interpreted, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation. Python is dynamically typed and garbage-collected.

Second, run Python in a virtual environment. Installing Scrapy inside an isolated Python environment is recommended; skip Step 2 if you do not want to run Scrapy in a virtual environment.

Step 2: Virtual Environment

What is the virtual environment?

A virtual environment is a Python environment such that the Python interpreter, libraries, and scripts installed into it are isolated from those installed in other virtual environments, and (by default) any libraries installed in a “system” Python, i.e., one which is installed as part of your operating system.

Step 2. a) creating a virtual environment

The command below creates the virtual environment. (If python3 is not recognized on Windows, try python or py instead.)

c:\> python3 -m venv virtual_environment_folder_name

Step 2. b) Activate the Virtual environment

You can activate the virtual environment by running activate.bat or Activate.ps1, depending on your shell.

For Command Prompt:

.\virtual_environment_folder_name\Scripts\activate.bat

For PowerShell:

.\virtual_environment_folder_name\Scripts\Activate.ps1
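To confirm that activation worked, one option is to ask the interpreter itself. The snippet below is a minimal sketch (the function name in_virtual_environment is made up for illustration) that relies on the fact that, inside a venv, sys.prefix and sys.base_prefix differ.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment folder while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtual_environment())
    print("interpreter prefix:", sys.prefix)
```

Run it with the same python command you plan to use for Scrapy; if it reports False, the environment is probably not activated in that terminal.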

Step 2. c) Deactivate the virtual environment

Deactivate the virtual environment with the deactivate command.

Step 3. Scrapy Installation

Now that the environment is ready, install Scrapy with pip. (Remember to run this command from inside your virtual environment.)

pip install scrapy
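A quick way to check that the installation landed in the current environment is to look up the package metadata. This is only a sketch using the standard library's importlib.metadata; the helper name installed_version is an assumption made for this example.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("scrapy")
    if version is None:
        print("Scrapy is not installed in this environment")
    else:
        print("Scrapy version:", version)
```

If the script reports that Scrapy is missing even though pip install succeeded, pip and python most likely belong to different environments.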

Step 4. Understanding Scrapy Shell

What is scrapy shell?

An interactive shell where you can try and debug your scraping code very quickly, without having to run the spider.

You can start the Scrapy shell with the command below:

scrapy shell

Next, get familiar with the shell's shortcuts and objects. Here I use the fetch function to download the given URL; the result is stored in the response object.

>>> fetch('https://www.google.com')

>>> response.css('title::text').get()

This returns the title of the page.

Step 5. Scrapy Selectors

Scrapy comes with its own mechanism for extracting data from HTML: selectors.

Selectors pick out certain parts of the HTML document, specified by either XPath or CSS expressions.

XPath

XPath is a language for selecting nodes in XML documents, which can also be used with HTML.

XPath example

>>> response.xpath('//title/text()').get()

CSS

CSS is a language for applying styles to HTML documents. It defines selectors to associate those styles with specific HTML elements.

CSS example

>>> response.css('title::text').get()

Step 6. First Project

Now, let's check out a first Scrapy example.

Step 6. a) create your scrapy project

Run the command below to create your first Scrapy project:

c:\> scrapy startproject my_first_scraper

This creates the folder and file structure that Scrapy requires.

Now change into the project folder and generate your spider file.

Step 6. b) Generate spider file

You can create the spider file manually or generate it with the command below:

c:\> scrapy genspider your_spider_name url_to_scrape

This creates a spider file like the following (here the spider is named my_scraper and the URL is www.google.com):


import scrapy


class MyScraperSpider(scrapy.Spider):
    name = 'my_scraper'
    allowed_domains = ['www.google.com']
    start_urls = ['http://www.google.com/']

    def parse(self, response):
        pass

Let's complete the code to fetch the title of the page:


import scrapy


class MyScraperSpider(scrapy.Spider):
    name = 'my_scraper'
    allowed_domains = ['www.google.com']
    start_urls = ['http://www.google.com/']

    def parse(self, response):
        # Extract the text of the <title> element and emit it as one item
        title = response.css('title::text').get()
        yield {
            'title': title
        }

After completing the code, run your spider:

c:\> scrapy crawl my_scraper

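Step 7. Scrapy output in CSV or JSON

Add the -o option to the crawl command to save the scraped items to a file; the file extension selects the format.

To create a CSV file:

c:\> scrapy crawl my_scraper -o one.csv

To create a JSON file:

c:\> scrapy crawl my_scraper -o one.json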
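Once the crawl has finished, the exported file can be read back with the standard library. Scrapy's JSON feed is written as a single JSON array of items, so a sketch like the following should be enough to inspect it; the helper name load_items is hypothetical, and one.json is the file name used in the command above.

```python
import json
from pathlib import Path


def load_items(path):
    """Load a Scrapy JSON export (a single JSON array of item objects)."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    export = Path("one.json")  # file name taken from the crawl command
    if export.exists():
        for item in load_items(export):
            print(item)
    else:
        print("one.json not found; run the crawl first")
```

Note that -o appends to an existing file, which can leave a .json file that no longer parses; deleting the old file before re-running the crawl is one simple way around that.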

Scrapy Examples

Example 1: Get names from the page

URL: https://www.imdb.com/search/name/?birth_monthday=07-13


import scrapy


class MyScraperSpider(scrapy.Spider):
    name = 'my_scraper'
    start_urls = ['https://www.imdb.com/search/name/?birth_monthday=07-13']

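    def parse(self, response):
        # Each <a> inside an <h3> holds one person's name and profile link
        records = response.css('h3 a')
        for rec in records:
            yield {
                # default='' avoids calling strip() on None when the link has no text
                'name': rec.css('a::text').get(default='').strip(),
                # urljoin resolves the relative href against the page URL
                'link': response.urljoin(rec.css('a::attr(href)').get())
            }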

The code above fetches all the names and links.
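Building absolute links by string concatenation only works while every href has the same shape. The sketch below uses urljoin from the standard library to show the difference; the href values are hypothetical samples. Scrapy's response.urljoin and response.follow perform the same kind of join against the URL of the response.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.imdb.com/search/name/?birth_monthday=07-13"

# Hypothetical href values of the kinds a page might contain.
root_relative = "/name/nm0000001/"
already_absolute = "https://www.imdb.com/name/nm0000002/"

# Concatenation with a trailing slash on the prefix doubles the slash.
concatenated = "https://www.imdb.com/" + root_relative

# urljoin resolves the href against the page URL instead.
joined_relative = urljoin(BASE_URL, root_relative)
joined_absolute = urljoin(BASE_URL, already_absolute)

print(concatenated)     # https://www.imdb.com//name/nm0000001/
print(joined_relative)  # https://www.imdb.com/name/nm0000001/
print(joined_absolute)  # https://www.imdb.com/name/nm0000002/
```

Joining against the page URL keeps the spider working if the site starts emitting absolute or differently shaped links.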

Example 2: Multipage Scraping

Example 1 showed how to fetch names and links from a single page. This example shows how to follow the link for each name and fetch data from the linked page.


import scrapy


class MyScraperSpider(scrapy.Spider):
    name = 'my_scraper'
    start_urls = ['https://www.imdb.com/search/name/?birth_monthday=07-13']

    def parse(self, response):
        records = response.css('h3 a')
        for rec in records:
            link = rec.attrib['href']
            # response.follow resolves relative links against the current page URL,
            # so there is no need to prepend the domain by hand
            yield response.follow(link, callback=self.page_parse)

    def page_parse(self, response):
        name = response.css('h1 span.itemprop::text').get()
        description = response.css('div.name-trivia-bio-text div.inline::text').getall()
        # Join the text fragments, then drop newlines and the trailing ellipsis
        description_str = ' '.join(description)
        description_str = description_str.replace('\n', '')
        description_str = description_str.replace('...', '')
        description_str = description_str.strip()
        # get() returns None when the page has no spouse section
        spouse = response.css('div[id=details-spouses] a::text').get()
        yield {
            'name' : name,
            'description' : description_str,
            'spouse' : spouse
        }
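The description clean-up in page_parse can be pulled into a small helper, which makes it easy to try on sample data without crawling. This is only a sketch: the function name clean_description is made up, and collapsing repeated spaces goes slightly beyond what the spider above does.

```python
def clean_description(parts):
    """Join text fragments and tidy them, mirroring the steps in page_parse."""
    text = " ".join(parts)
    # Same removals as in the spider: newlines and the trailing ellipsis.
    text = text.replace("\n", "").replace("...", "")
    # Extra step (an addition in this sketch): collapse runs of whitespace
    # left behind by the join and the removals, and trim both ends.
    return " ".join(text.split())


if __name__ == "__main__":
    sample = ["\n  Born in ", "London", "... \n"]  # made-up fragments
    print(clean_description(sample))
```

Inside the spider, the helper would be called as clean_description(description) in place of the four description_str lines.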