University of California, Davis - Distributed Computing with Spark SQL
- Offered by Coursera
Distributed Computing with Spark SQL at Coursera Overview
Duration | 13 hours |
Start from | Start Now |
Total fee | Free |
Mode of learning | Online |
Difficulty level | Intermediate |
Credential | Certificate |
Distributed Computing with Spark SQL at Coursera Highlights
- Shareable Certificate: Earn a certificate upon completion.
- 100% online: Start instantly and learn on your own schedule.
- Course 3 of 4 in the Learn SQL Basics for Data Science Specialization
- Flexible deadlines: Reset deadlines according to your schedule.
- Intermediate Level
- Approx. 13 hours to complete
- Language: English; Subtitles: Arabic, French, Portuguese (European), Italian, Vietnamese, German, Russian, English, Spanish
Distributed Computing with Spark SQL at Coursera Course details
- This course is for students with SQL experience who want to take the next step and gain familiarity with distributed computing using Spark. Students will gain an understanding of when to use Spark and how Spark, as an engine, uniquely combines data and AI technologies at scale. The four modules build on one another, and by the end of the course the student will understand Spark architecture, Spark DataFrames, optimizing the reading and writing of data, and how to build a machine learning model. The first module introduces Spark, including how Spark works with distributed computing and what Spark DataFrames are. Module 2 covers the core concepts of Spark, such as storage vs. compute, caching, partitions, and the Spark UI. The third module looks at engineering data pipelines, covering connecting to databases, schemas and types, file formats, and writing good data. The final module looks at the application of Spark to machine learning through a business use case, a short introduction to what machine learning is, building and applying models, and a course conclusion. By understanding when to use Spark, whether to scale out because the model or data is too large to process on a single machine or simply to get results faster, students will hone their SQL skills and become more adept data scientists.
Distributed Computing with Spark SQL at Coursera Curriculum
Introduction to Spark
Course Introduction
Why Distributed Computing?
Spark DataFrames
The Databricks Environment
SQL in Notebooks
Import Data
A Note From UC Davis
Readings and Resources
Assignment #1 - Queries in Spark SQL
Assignment #1 Quiz - Queries in Spark SQL
Module 1 Quiz
Spark Core Concepts
Introduction to Spark Core Concepts
Spark Terminology
Caching
Shuffle Partitions
Spark UI
Broadcast Joins
Readings
Assignment #2 - Spark Internals
Assignment #2 Quiz - Spark Internals
Module 2 Quiz
Engineering Data Pipelines
Engineering Data Pipelines
Spark as a Connector
Accessing Data
File Formats
Schemas and Types
Writing Data
Managed and Unmanaged Tables
Readings
Assignment #3 - Engineering Data Pipelines
Assignment #3 Quiz - Engineering Data Pipelines
Module 3 Quiz
Machine Learning Applications of Spark
Machine Learning Applications of Spark
Applications of Machine Learning
Machine Learning Fundamentals
Linear Regression
Training Linear Regression Model
Applying Machine Learning with UDFs
Course Summary
Readings
Assignment #4 - Logistic Regression Classifier
Assignment #4 Quiz - Logistic Regression Classifier
Module 4 Quiz