Spark, Hadoop, and Snowflake for Data Engineering
- Offered by Coursera
Spark, Hadoop, and Snowflake for Data Engineering at Coursera Overview
Duration | 29 hours |
Start from | Start Now |
Total fee | Free |
Mode of learning | Online |
Difficulty level | Advanced |
Official Website | Explore Free Course |
Credential | Certificate |
Spark, Hadoop, and Snowflake for Data Engineering at Coursera Highlights
- Earn a certificate from Duke University
- Add to your LinkedIn profile
- 21 quizzes
Spark, Hadoop, and Snowflake for Data Engineering at Coursera Course details
- What you'll learn
- Create scalable data pipelines (Hadoop, Spark, Snowflake, Databricks) for efficient data handling.
- Optimize data engineering with clustering and scaling to boost performance and resource use.
- Build ML solutions (PySpark, MLFlow) on Databricks for seamless model development and deployment.
- Implement DataOps and DevOps practices for continuous integration and deployment (CI/CD) of data-driven applications, including automating processes.
- This is primarily aimed at first- and second-year undergraduates interested in engineering or science, along with high school students and professionals with an interest in programming. Gain the skills for building efficient and scalable data pipelines. Explore essential data engineering platforms (Hadoop, Spark, and Snowflake) and learn how to optimize and manage them. Delve into Databricks, a powerful platform for executing data analytics and machine learning tasks, while honing your Python data science skills with PySpark. Finally, discover the key concepts of MLflow, an open-source platform for managing the end-to-end machine learning lifecycle, and learn how to integrate it with Databricks.
- This course is designed for learners who want to pursue or advance a career in data science or data engineering, and for software developers or engineers who want to grow their data management skill set. In addition to the technologies, you will learn methodologies that help you hone your project management and workflow skills for data engineering, including Kaizen, DevOps, and DataOps best practices.
- With quizzes to test your knowledge throughout, this comprehensive course will help guide your learning journey to become a proficient data engineer, ready to tackle the challenges of today's data-driven world.
Spark, Hadoop, and Snowflake for Data Engineering at Coursera Curriculum
Overview and Introduction to PySpark
Meet your Co-Instructor: Kennedy Behrman
Meet your Co-Instructor: Noah Gift
Overview of Big Data Platforms
Getting Started with Hadoop
Getting Started with Spark
Introduction to Resilient Distributed Datasets (RDD)
Resilient Distributed Datasets (RDD) Demo
Introduction to Spark SQL
PySpark Dataframe Demo: Part 1
PySpark Dataframe Demo: Part 2
Welcome to Data Engineering Platforms with Python!
What is Apache Hadoop?
What is Apache Spark?
Use Apache Spark in Azure Databricks (optional)
Choosing between Hadoop and Spark
What are RDDs?
Getting Started: Creating RDD's with PySpark
Spark SQL, Dataframes and Datasets
PySpark and Spark SQL
Big Data Platforms
Apache Hadoop Concepts
Apache Spark Concepts
RDD Concepts
Spark SQL Concepts
PySpark Dataframe Concepts
PySpark
Meet and Greet (optional)
Let Us Know if Something's Not Working
Practice: Creating RDD's with PySpark
Practice: Reading Data into Dataframes
Snowflake
What is Snowflake?
Snowflake Layers
Snowflake Web UI
Navigating Snowflake
Creating a Table in Snowflake
Snowflake Warehouses
Writing to Snowflake
Reading from Snowflake
Accessing Snowflake
Detailed View Inside Snowflake
Snowsight: The Snowflake Web Interface
Working with Warehouses
Python Connector Documentation
Snowflake Architecture
Snowflake Layers
Navigating Snowflake
Creating a Table
Writing to Snowflake
Snowflake
Azure Databricks and MLFLow
Accessing Databricks
Spark Notebooks with Databricks
Using Data with Databricks
Working with Workspaces in Databricks
Advanced Capabilities of Databricks
PySpark Introduction on Databricks
Exploring Databricks Azure Features
Using the DBFS to AutoML Workflow
Load, Register and Deploy ML Models
Databricks Model Registry
Model Serving on Databricks
What is MLOps?
Exploring Open-Source MLFlow Frameworks
Running MLFlow with Databricks
End to End Databricks MLFlow
Databricks Autologging with MLFlow
What is Azure Databricks?
Introduction to Databricks Machine Learning
What is the Databricks File System (DBFS)?
Serverless Compute with Databricks
MLOps Workflow on Azure Databricks
Run MLFlow Projects on Azure Databricks
Databricks Autologging
PySpark SQL
PySpark DataFrames
MLFlow with Databricks
DataBricks
ETL-Part-1: Keyword Extractor Tool to HashTag Tool
DataOps and Operations Methodologies
Kaizen Methodology for Data
Introducing GitHub CodeSpaces
Compiling Python in GitHub Codespaces
Walking through Sagemaker Studio Lab
Pytest Master Class (Optional)
What is DevOps?
DevOps Key Concepts
Continuous Integration Overview
Build an NLP in Cloud9 with Python
Build a Continuously Deployed Containerized FastAPI Microservice
Hugo Continuous Deploy on AWS
Container Based Continuous Delivery
What is DataOps?
DataOps and MLOps with Snowflake
Building Cloud Pipelines with Step Functions and Lambda
What is a Data Lake?
Data Warehouse vs. Feature Store
Big Data Challenges
Types of Big Data Processing
Real-World Data Engineering Pipeline
Data Feedback Loop
GitHub Codespaces Overview
Getting Started with Amazon SageMaker Studio Lab
Teaching MLOps at Scale with GitHub (Optional)
Getting Started with DevOps and Cloud Computing
Benefits of Serverless ETL Technologies
Kaizen Methodology
DevOps
DataOps
DataOps and Operations Methodologies
ETL-Part2: SQLite ETL Destination