Web Scraping in Python with Scrapy
Web scraping is the automated process of quickly extracting data from the internet. This article covers the general web scraping process, the architecture of the Scrapy framework, and a web scraping example implemented in Python.
The world as we see it today is exploding with data, and modern industries are swamped with it. In Data Science, a typical project starts with data acquisition and collection, followed by data cleaning to derive usable information. One of the most accessible sources of data is the web, so data scientists must be able to extract, or scrape, relevant data from the internet. In this article, we will discuss how to perform data acquisition through web scraping using a framework called Scrapy.
We will be covering the following sections:
- Introduction to Web Scraping
- Python Tools for Web Scraping
- What is Scrapy?
- Scrapy Architecture
- Scraping Book Titles Using Scrapy
Introduction to Web Scraping
When working on a data science problem, there are instances when data needs to be accessed via a web page. Doing this manually would be quite cumbersome, especially when you work with dynamic data that needs to be accessed frequently, such as stock prices.
Web scraping is the process of extracting large amounts of data from the Internet quickly and efficiently. It automates the process of copying or downloading data from a website. Moreover, it converts and stores the data in the desired format, such as a CSV file, JSON file, or even an API. You can then retrieve your file and analyze its contents based on your requirements.
So, web scraping is like your best friend if you’re a data scientist! But you must remember that web scraping must only be performed on publicly available data; otherwise, it would be illegal.
In some cases, web scraping is also known as web harvesting, web data extraction, or even web data mining.
Web Scraping General Process Flow
- Send an HTTP request to the webpage URL.
- The server will respond with the HTML contents of the webpage.
- Collect the relevant data from the response content.
- Organize the data into a structured format.
- Eliminate unnecessary, redundant, or missing data.
- Store the data in a form useful to the user.
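For illustration, here is a minimal sketch of this flow. It assumes the third-party requests library is installed (it is not covered further in this article) and uses books.toscrape.com, the demo site scraped later on, as the target URL; Scrapy automates all of these steps for you.

# A minimal, illustrative version of the general scraping flow
import csv
import requests  # assumed to be installed: pip install requests

# 1. Send an HTTP request to the webpage URL
response = requests.get('http://books.toscrape.com/')

# 2. The server responds with the HTML contents of the page
html = response.text

# 3. Collect the relevant data (here, simply the page title as a stand-in)
start = html.find('<title>') + len('<title>')
end = html.find('</title>')
page_title = html[start:end].strip()

# 4-6. Organize, clean, and store the data in a useful form (a CSV file)
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['page_title'])
    writer.writerow([page_title])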
Python Tools for Web Scraping
Python offers an easy way to parse the HTML content fetched from a web URL using the Beautiful Soup package, which we will discuss in a separate article.
Another popular tool for extracting data from the web is a framework called Scrapy, which we will focus on in this article.
What is Scrapy?
Scrapy is an open-source Python framework used to automate the process of extracting large-scale data through crawling websites and processing & storing it in your preferred format.
Using Scrapy, you can create your own ‘spiders’ or crawlers to perform scraping. It also offers a base structure for building crawlers with inbuilt support for recursive web scraping while going through extracted URLs.
Scrapy is a very versatile tool for web scraping. It efficiently handles the different types and formats of data you feed it, and it can cope with incomplete or inconsistent data as well.
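For context, Scrapy also ships with a command-line tool that scaffolds this base structure for you. The commands below are a sketch with example project and spider names; later in this article, we take a different route and run a spider directly from a notebook.

scrapy startproject bookscraper              # creates the project skeleton
cd bookscraper
scrapy genspider books books.toscrape.com    # generates a spider template
scrapy crawl books -o books.json             # runs the spider and exports the results to JSON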
Now, let’s talk about the components of Scrapy and how they interact with each other:
Scrapy Architecture
Scrapy's architecture defines how data flows inside the system. It involves seven main components, as explained below:
- Scrapy Engine: The engine is responsible for maintaining the data flow across all system components and triggering events when certain actions occur.
- Scheduler: It accepts requests from the Scrapy engine and enqueues them to feed it back to the engine whenever requested.
- Downloader: It is responsible for fetching the web pages and delivering them to the Scrapy engine, which, in turn, returns them to the spiders.
- Spiders: Custom classes written by Scrapy users that define how a website will be scraped, that is, how the crawling and parsing will happen for a particular website (or a group of websites).
- Item pipeline: Once the spiders have parsed a website's items, the item pipeline processes them and stores them in databases.
- Downloader middlewares: This component sits between the Scrapy engine and the Downloader. It processes requests from the engine before forwarding them to the Downloader and does the same for responses travelling from the Downloader back to the engine (a minimal sketch follows this list).
- Spider middlewares: This component sits between the Scrapy engine and the Spiders. It processes the spider inputs (responses passed in from the engine) and the spider outputs (items and new requests sent back to the engine).
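To make the middleware idea concrete, here is a minimal sketch of a custom downloader middleware; the class name and header value are purely illustrative, and spider middlewares expose analogous hooks on the spider side. In a real project, it would be enabled through the DOWNLOADER_MIDDLEWARES setting.

# A minimal, illustrative downloader middleware (class name is an example)
class CustomHeaderMiddleware:
    def process_request(self, request, spider):
        # Called for every request the engine sends towards the Downloader
        request.headers['X-Example-Header'] = 'demo'  # illustrative header
        return None  # returning None lets the request continue as usual

    def process_response(self, request, response, spider):
        # Called for every response travelling from the Downloader back to the engine
        spider.logger.info('%s received for %s', response.status, response.url)
        return response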
Scraping Book Titles Using Scrapy
Let’s see how we can use Scrapy to fetch book titles from a website. For our demonstration, we will be using books.toscrape.com, a demo website built specifically for practicing web scraping. The prices and ratings listed there are randomly generated and hold no real meaning.
Step 1 – Install the prerequisites.
To create a Scrapy project and write spiders, we will first install the required packages in our working environment.
We are using a Jupyter Notebook to execute our code.
pip install scrapy
pip install crochet
Remember that Scrapy is built on top of the Twisted asynchronous networking library, so it needs to run inside the Twisted reactor. However, the Twisted reactor can only be started once and throws a ReactorNotRestartable error when the code block is run a second time. To mitigate this, we use the Crochet package, which makes it easy to run and test our spider from the notebook.
Step 2 – Import the crawler
CrawlerRunner is a Scrapy utility that provides more control over the crawling process. If your application is already using Twisted and you want to run Scrapy in the same reactor, it's recommended that you use this over CrawlerProcess.
import scrapy
from scrapy.crawler import CrawlerRunner  # to run our spider
Step 3 – Set up Crochet
import crochet
crochet.setup()
Step 4 – Inspect the webpage
- Go to books.toscrape.com
- Right-click the first book title and choose Inspect to view the underlying HTML. Each title sits in the title attribute of the <a> tag inside an <h3>, within an <article class="product_pod"> element; the selector used below can be verified interactively, as shown next.
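Before writing the spider, you can test a selector interactively with Scrapy's shell. The session below is a sketch of what that looks like; the selector is the same one used in the spider in the next step.

scrapy shell "http://books.toscrape.com/catalogue/page-1.html"

# Inside the shell, query the title attribute of each book link:
>>> response.css('article.product_pod > h3 > a::attr(title)').getall()
# This returns the list of title strings on the page.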
Step 5 – Build the spider
class BookSpider(scrapy.Spider):
    name = 'BookSpider'  # used to invoke the spider

    # Used to start the requests
    start_urls = ['http://books.toscrape.com/catalogue/page-1.html',
                  'http://books.toscrape.com/catalogue/page-2.html',
                  'http://books.toscrape.com/catalogue/page-3.html']

    '''
    Invoked by the Scrapy engine for every URL.
    Here we will use selectors to scrape the website.
    '''
    def parse(self, response):
        book_list = response.css('article.product_pod > h3 > a::attr(title)').getall()
        for title in book_list:
            print(title)
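Printing is fine for a quick check, but the more idiomatic Scrapy pattern is to yield items so that item pipelines and feed exports can process them. The variant below is a sketch of the same parse method rewritten that way; the 'title' field name is just an example.

    # A variant of parse() that yields items instead of printing them
    def parse(self, response):
        for title in response.css('article.product_pod > h3 > a::attr(title)').getall():
            yield {'title': title}  # 'title' is an example field name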
Step 6 – Run the spider
Now, let’s make our spider crawl, shall we?
# crochet.wait_for runs the crawl inside the Twisted reactor thread and blocks until it finishes
@crochet.wait_for(timeout=60.0)
def run_spider():
    crawler = CrawlerRunner({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    })
    d = crawler.crawl(BookSpider)
    return d

run_spider()
Voila! We have successfully scraped the book titles from the website using our Scrapy Spider.
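If you also want the scraped items written to a file, as mentioned in the introduction, one option is Scrapy's feed exports. The snippet below is a sketch that assumes the yielding variant of the spider shown earlier and a Scrapy version that supports the FEEDS setting (2.1 or later).

# Illustrative: export yielded items to a CSV file via Scrapy's feed exports
crawler = CrawlerRunner({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'FEEDS': {'books.csv': {'format': 'csv'}},  # requires Scrapy >= 2.1
})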
Endnotes
Web scraping is an essential technique in Natural Language Processing, data mining, and machine learning. This article discussed how Scrapy is a potent tool for large-scale scraping and taught us how to build a Scrapy web crawler to fetch data from the Internet. I hope this article proves helpful for you.