Web Scraping in Python with Scrapy
Web scraping is the automated process of quickly extracting data from the internet. This article covers the general web scraping process, the architecture of the Scrapy framework, and a web scraping example implemented in Python.
The world as we see it today is exploding with data, and modern industries are swamped with it. In Data Science, a typical project starts with data acquisition and collection, followed by data cleaning to derive usable information. One of the most accessible sources of data is the web, so data scientists must be able to extract, or scrape, relevant data from the internet. In this article, we will discuss how to perform data acquisition through web scraping using a framework called Scrapy.
We will be covering the following sections:
- Introduction to Web Scraping
- Python Tools for Web Scraping
- What is Scrapy?
- Scrapy Architecture
- Scraping Book Titles Using Scrapy
Introduction to Web Scraping
When working on a data science problem, there are instances when data needs to be accessed via a web page. Doing this manually would be quite cumbersome, especially when you work with dynamic data that needs to be accessed frequently, such as stock prices.
Web scraping is the process of extracting large amounts of data from the Internet quickly and efficiently. It automates the process of copying or downloading data from a website. Moreover, it converts and stores the data in the desired format, such as a CSV file, JSON file, or even an API. You can then retrieve your file and analyze its contents based on your requirements.
So, web scraping is like your best friend if you’re a data scientist! But you must remember that web scraping must only be performed on publicly available data; otherwise, it would be illegal.
In some cases, web scraping is also known as web harvesting, web data extraction, or even web data mining.
Web Scraping General Process Flow
- Send an HTTP request to the webpage URL.
- The server will respond with the HTML contents of the webpage.
- Collect the relevant data from the response content.
- Organize the data into a structured format.
- Eliminate unnecessary, redundant, or missing data.
- Store the data in a form useful to the user.
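For illustration, here is a minimal sketch of this flow. It assumes the third-party requests library is installed (it is not covered further in this article) and uses books.toscrape.com, the demo site scraped later on, as the target URL; Scrapy automates all of these steps for you.

# A minimal, illustrative version of the general scraping flow
import csv
import requests  # assumed to be installed: pip install requests

# 1. Send an HTTP request to the webpage URL
response = requests.get('http://books.toscrape.com/')

# 2. The server responds with the HTML contents of the page
html = response.text

# 3. Collect the relevant data (here, simply the page title as a stand-in)
start = html.find('<title>') + len('<title>')
end = html.find('</title>')
page_title = html[start:end].strip()

# 4-6. Organize, clean, and store the data in a useful form (a CSV file)
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['page_title'])
    writer.writerow([page_title])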
Python Tools for Web Scraping
Python offers an easy way to parse the HTML content fetched from a web URL using the Beautiful Soup package, which we will discuss in a separate article.
Another popular tool for extracting data from the web is a framework called Scrapy, which we will focus on in this article.
What is Scrapy?
Scrapy is an open-source Python framework used to automate the process of extracting large-scale data through crawling websites and processing & storing it in your preferred format.
Using Scrapy, you can create your own ‘spiders’ or crawlers to perform scraping. It also offers a base structure for building crawlers with inbuilt support for recursive web scraping while going through extracted URLs.
Scrapy is a very versatile tool for web scraping. It efficiently handles the different types and formats of data you feed it, and it can cope with incomplete or inconsistent data as well.
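For context, Scrapy also ships with a command-line tool that scaffolds this base structure for you. The commands below are a sketch with example project and spider names; later in this article, we take a different route and run a spider directly from a notebook.

scrapy startproject bookscraper              # creates the project skeleton
cd bookscraper
scrapy genspider books books.toscrape.com    # generates a spider template
scrapy crawl books -o books.json             # runs the spider and exports the results to JSON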
Now, let’s talk about the components of Scrapy and how they interact with each other:
Scrapy Architecture
Scrapy's architecture defines how data flows inside the system. It involves seven main components, as explained below:
- Scrapy Engine: The engine is responsible for maintaining the data flow across all system components and triggering events when certain actions occur.
- Scheduler: It accepts requests from the Scrapy engine and enqueues them to feed it back to the engine whenever requested.
- Downloader: It is responsible for fetching the web pages and delivering them to the Scrapy engine, which, in turn, returns them to the spiders.
- Spiders: Custom classes written by Scrapy users that define how a website will be scraped, that is, how the crawling and parsing will happen for a particular website (or a group of websites).
- Item pipeline: Once the spiders have parsed a website's items, the item pipeline processes them and stores them in databases.
- Downloader middlewares: This component sits between the Scrapy engine and the Downloader. It processes requests from the engine before forwarding them to the Downloader and does the same for responses travelling from the Downloader back to the engine (a minimal sketch follows this list).
- Spider middlewares: This component sits between the Scrapy engine and the Spiders. It processes the spider inputs (responses passed in from the engine) and the spider outputs (items and new requests sent back to the engine).
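To make the middleware idea concrete, here is a minimal sketch of a custom downloader middleware; the class name and header value are purely illustrative, and spider middlewares expose analogous hooks on the spider side. In a real project, it would be enabled through the DOWNLOADER_MIDDLEWARES setting.

# A minimal, illustrative downloader middleware (class name is an example)
class CustomHeaderMiddleware:
    def process_request(self, request, spider):
        # Called for every request the engine sends towards the Downloader
        request.headers['X-Example-Header'] = 'demo'  # illustrative header
        return None  # returning None lets the request continue as usual

    def process_response(self, request, response, spider):
        # Called for every response travelling from the Downloader back to the engine
        spider.logger.info('%s received for %s', response.status, response.url)
        return response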
Scraping Book Titles Using Scrapy
Let’s see how we can use Scrapy to fetch book titles from a website. For our demonstration, we will be using books.toscrape.com, a demo website built specifically for practicing web scraping. The prices and ratings listed there are randomly generated and hold no real meaning.
Step 1 – Install the prerequisites.
To create a Scrapy project and write spiders, we will first install the required packages in our working environment.
We are using a Jupyter Notebook to execute our code.
pip install scrapy
pip install crochet
Remember that Scrapy is built on top of the Twisted asynchronous networking library, so it needs to run inside the Twisted reactor. However, the Twisted reactor can only be started once and throws a ReactorNotRestartable error when the code block is run a second time. To mitigate this, we use the Crochet package, which makes it easy to run and test our spider from the notebook.
Step 2 – Import the crawler
CrawlerRunner is a Scrapy utility that provides more control over the crawling process. If your application is already using Twisted and you want to run Scrapy in the same reactor, it's recommended that you use this over CrawlerProcess.
import scrapy
from scrapy.crawler import CrawlerRunner  # to run our spider
Step 3 – Set up Crochet
import crochet
crochet.setup()
Step 4 – Inspect the webpage
- Go to books.toscrape.com
- Right-click the first book title and choose Inspect to view the underlying HTML. Each title sits in the title attribute of the <a> tag inside an <h3>, within an <article class="product_pod"> element; the selector used below can be verified interactively, as shown next.
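Before writing the spider, you can test a selector interactively with Scrapy's shell. The session below is a sketch of what that looks like; the selector is the same one used in the spider in the next step.

scrapy shell "http://books.toscrape.com/catalogue/page-1.html"

# Inside the shell, query the title attribute of each book link:
>>> response.css('article.product_pod > h3 > a::attr(title)').getall()
# This returns the list of title strings on the page.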
Step 5 – Build the spider
class BookSpider(scrapy.Spider):
    name = 'BookSpider'  # used to invoke the spider

    # Used to start the requests
    start_urls = ['http://books.toscrape.com/catalogue/page-1.html',
                  'http://books.toscrape.com/catalogue/page-2.html',
                  'http://books.toscrape.com/catalogue/page-3.html']

    '''
    Invoked by the Scrapy engine for every URL.
    Here we will use selectors to scrape the website.
    '''
    def parse(self, response):
        book_list = response.css('article.product_pod > h3 > a::attr(title)').getall()
        for title in book_list:
            print(title)
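Printing is fine for a quick check, but the more idiomatic Scrapy pattern is to yield items so that item pipelines and feed exports can process them. The variant below is a sketch of the same parse method rewritten that way; the 'title' field name is just an example.

    # A variant of parse() that yields items instead of printing them
    def parse(self, response):
        for title in response.css('article.product_pod > h3 > a::attr(title)').getall():
            yield {'title': title}  # 'title' is an example field name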
Step 6 – Run the spider
Now, let’s make our spider crawl, shall we?
# crochet.wait_for runs the crawl inside the Twisted reactor thread and blocks until it finishes
@crochet.wait_for(timeout=60.0)
def run_spider():
    crawler = CrawlerRunner({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    })
    d = crawler.crawl(BookSpider)
    return d

run_spider()
Voila! We have successfully scraped the book titles from the website using our Scrapy Spider.
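If you also want the scraped items written to a file, as mentioned in the introduction, one option is Scrapy's feed exports. The snippet below is a sketch that assumes the yielding variant of the spider shown earlier and a Scrapy version that supports the FEEDS setting (2.1 or later).

# Illustrative: export yielded items to a CSV file via Scrapy's feed exports
crawler = CrawlerRunner({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'FEEDS': {'books.csv': {'format': 'csv'}},  # requires Scrapy >= 2.1
})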
Endnotes
Web scraping is an essential technique in Natural Language Processing, data mining, and machine learning. This article discussed how Scrapy is a potent tool for large-scale scraping and taught us how to build a Scrapy web crawler to fetch data from the Internet. I hope this article proves helpful for you.