Using Excel Spreadsheets in Python
Excel spreadsheets are a popular tool for data analysis, but what if you want to manipulate and analyze that data using Python? Luckily, there’s a library called openpyxl that allows you to work with Excel spreadsheets in Python.
In this article, we’re going to explore how to extract data from a website using web scraping and store it in an Excel file.
What is Web Scraping?
Web scraping refers to a technique used for extracting data from websites. It is a way to automate the process of gathering information that you might otherwise have to collect manually by visiting each website individually.
The process of web scraping involves using software programs called web scrapers or crawlers to collect data from websites. These programs can be customized to scrape specific types of data from specific websites, depending on your needs.
Web scraping is a useful technique for a variety of purposes. For example, it can be used to collect information on competitors, monitor prices on e-commerce sites, or gather data for academic research. It’s also commonly used in industries such as finance and marketing to gather data for analysis.
However, it’s important to note that web scraping is a legally gray area. Make sure you’re not violating any laws or terms of service agreements when scraping data from websites: some sites prohibit scraping, and in some cases scraped content may be protected by copyright. Web scraping also goes by other names, such as web harvesting, web data extraction, or web data mining.
Openpyxl is a powerful library that allows you to read and write Excel files, as well as manipulate the data within them. With openpyxl, you can easily add, delete, and modify data in Excel spreadsheets using Python.
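To make this concrete, here is a minimal sketch of the basic openpyxl workflow; the file name and cell values below are illustrative placeholders, not part of any real dataset:

# A minimal sketch of common openpyxl operations;
# 'report.xlsx' and the cell values are placeholders
from openpyxl import Workbook

wb = Workbook()                    # create a new workbook in memory
ws = wb.active                     # grab the default worksheet
ws.title = 'Sales'                 # rename the sheet

ws.append(['Product', 'Units'])    # add a header row
ws.append(['Widget', 42])          # add a data row

ws['B2'] = 45                      # modify an existing cell
ws.delete_rows(2)                  # delete a row (rows are 1-indexed)

wb.save('report.xlsx')             # write the workbook to disk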
BeautifulSoup is a Python library that makes web scraping easy and efficient. It allows you to navigate HTML and XML documents, search for specific elements, and extract data from them.
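As a quick illustration, here is a small example run on an inline HTML snippet rather than a live website:

# Parse a small HTML fragment and extract the text of matching elements
from bs4 import BeautifulSoup

html = '<ul><li class="item">Apples</li><li class="item">Oranges</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

for li in soup.find_all('li', class_='item'):
    print(li.text)   # prints: Apples, then Oranges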
Web Scraping General Process Flow
- Send an HTTP request to the webpage URL.
- The server responds with the HTML contents of the webpage.
- Collect the relevant data from the response content.
- Organize the data into a structured format.
- Eliminate unnecessary, redundant, or missing data.
- Store the data in a format useful to the user, as sketched below.
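Here is a compact sketch of that flow end to end; the URL and the CSS selector are hypothetical placeholders, not a real target site:

# End-to-end sketch of the scraping flow; URL and selector are hypothetical
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Steps 1-2: send the request and receive the HTML response
response = requests.get('https://example.com/listings')
response.raise_for_status()

# Step 3: parse the response and collect the relevant data
soup = BeautifulSoup(response.text, 'html.parser')
titles = [tag.get_text(strip=True) for tag in soup.select('.listing-title')]

# Steps 4-5: organize into a structured format, drop missing/duplicate rows
df = pd.DataFrame({'title': titles}).dropna().drop_duplicates()

# Step 6: store the data in a useful format (to_excel requires openpyxl)
df.to_excel('listings.xlsx', index=False)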
What is BeautifulSoup?
It is a Python library used for web scraping purposes. It provides a way to parse and extract data from HTML and XML files. The library was created by Leonard Richardson and is maintained by a community of contributors. It’s free and open-source, which means that anyone can use it and contribute to its development.
One of the biggest advantages of using BeautifulSoup is its ease of use. The library is designed to be intuitive and easy to learn, even for beginners. BeautifulSoup provides a variety of functions and methods for extracting data from HTML and XML files, without requiring extensive programming knowledge.
Another advantage of BeautifulSoup is its flexibility. It can be used with a variety of different parsers, including lxml, html5lib, and Python’s built-in html.parser. This means that you can choose the parser that best suits your needs and the requirements of the website you’re scraping.
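For instance, the same document can be parsed with any installed parser just by changing one argument (lxml and html5lib are separate installs; html.parser ships with Python):

# Same document, three parsers; run "pip install lxml html5lib" first
from bs4 import BeautifulSoup

doc = '<p>Hello <b>world</b>'   # note the unclosed tags

print(BeautifulSoup(doc, 'html.parser').b.text)   # built-in, no extra install
print(BeautifulSoup(doc, 'lxml').b.text)          # fast and lenient
print(BeautifulSoup(doc, 'html5lib').b.text)      # parses like a browser

For simple pages the built-in html.parser is usually enough; the alternatives matter mostly for speed or for badly malformed HTML.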
BeautifulSoup is widely used in the web scraping community because of its versatility and ease of use. It can be used for a variety of applications, including data mining, web content extraction, and information retrieval.
Scraping IMDb Top 250 Movies Using BeautifulSoup
Let’s see how we can use BeautifulSoup to fetch the top-rated movies from IMDb’s Top 250 chart. For each movie, we will collect its rank, name, year of release, and IMDb rating, and store them in an Excel file.
We are using Google Colaboratory to execute our code.
Step 1 – Install required libraries
Now, let’s install the required packages in our working environment.
pip install beautifulsoup4
pip install requests
pip install openpyxl
Step 2 – Import the libraries
# Importing required libraries
from bs4 import BeautifulSoup
import requests, openpyxl
Step 3 – Prepare the Excel sheet
# Creating the Excel file
excel = openpyxl.Workbook()
sheet = excel.active

# Changing the sheet name
sheet.title = 'IMDb Top 250 Movies'

# Adding the header row
sheet.append(['Rank', 'Movie Name', 'Year of Release', 'IMDb Rating'])
Step 4 – Create the Web Scraper
# Scraping required data
try:
    # IMDb may reject requests without a browser-like User-Agent header,
    # so we send one explicitly
    headers = {'User-Agent': 'Mozilla/5.0'}
    html = requests.get('https://www.imdb.com/chart/top/', headers=headers)
    html.raise_for_status()

    soup = BeautifulSoup(html.text, 'html.parser')

    # These selectors match IMDb's chart markup at the time of writing
    # and may need updating if the page layout changes
    movies = soup.find('tbody', class_='lister-list').find_all('tr')

    for movie in movies:
        name = movie.find('td', class_='titleColumn').a.text
        rank = movie.find('td', class_='titleColumn').get_text(strip=True).split('.')[0]
        year = movie.find('td', class_='titleColumn').span.text.strip('()')
        rating = movie.find('td', class_='ratingColumn imdbRating').strong.text

        print(rank, name, year, rating)
        sheet.append([rank, name, year, rating])

except Exception as e:
    print(e)
Voila! We have successfully scraped the movie details from the IMDb chart using BeautifulSoup.
Step 5 – Save the Excel file
# Saving the Excel file
excel.save('IMDb Top 250 Movies.xlsx')
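As a quick sanity check, you can reload the saved workbook and print the first few rows:

# Optional check: reload the file we just saved and inspect the first rows
from openpyxl import load_workbook

wb = load_workbook('IMDb Top 250 Movies.xlsx')
sheet = wb.active
for row in sheet.iter_rows(min_row=1, max_row=4, values_only=True):
    print(row)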
Endnotes
By combining openpyxl and BeautifulSoup, you can create a data analysis pipeline that extracts data from websites and stores it in an Excel spreadsheet. This is especially useful for automating data collection and analysis tasks. With these two libraries, you can build repeatable data pipelines for your projects using nothing but Python.