Coursera

University of California, Davis - Distributed Computing with Spark SQL

Offered by Coursera

Distributed Computing with Spark SQL at Coursera
Overview

Duration	13 hours
Total fee	Free
Mode of learning	Online
Difficulty level	Intermediate
Credential	Certificate

Highlights

Shareable Certificate: Earn a certificate upon completion.
100% online: Start instantly and learn on your own schedule.
Course 3 of 4 in the Learn SQL Basics for Data Science Specialization
Flexible deadlines: Reset deadlines in accordance with your schedule.
Intermediate level
Approx. 13 hours to complete
English. Subtitles: Arabic, French, Portuguese (European), Italian, Vietnamese, German, Russian, English, Spanish

Course details

Skills you will learn

Spark Database Administration

More about this course

This course is for students who have SQL experience and want to take the next step by gaining familiarity with distributed computing using Spark. Students will gain an understanding of when to use Spark and how Spark, as an engine, uniquely combines data and AI technologies at scale. The four modules build on one another, and by the end of the course the student will understand Spark architecture, Spark DataFrames, optimizing the reading and writing of data, and how to build a machine learning model.

The first module introduces Spark, including how Spark works with distributed computing and what Spark DataFrames are. Module 2 covers the core concepts of Spark, such as storage vs. computing, caching, partitions, and the Spark UI. The third module looks at engineering data pipelines, covering connecting to databases, schemas and types, file formats, and writing good data. The final module looks at applying Spark to machine learning through a business use case, a short introduction to what machine learning is, building and applying models, and a final course conclusion.

By understanding when to use Spark, either to scale out when the model or data is too large to process on a single machine or simply to get results faster, students will hone their SQL skills and become more adept data scientists.

Curriculum

Introduction to Spark

Course Introduction

Why Distributed Computing?

Spark DataFrames

The Databricks Environment

SQL in Notebooks

Import Data

A Note From UC Davis

Readings and Resources

Assignment #1 - Queries in Spark SQL

Assignment #1 Quiz - Queries in Spark SQL

Module 1 Quiz

Spark Core Concepts

Introduction to Spark Core Concepts

Spark Terminology

Caching

Shuffle Partitions

Spark UI

Broadcast Joins

Readings

Assignment #2 - Spark Internals

Assignment #2 Quiz - Spark Internals

Module 2 Quiz

Engineering Data Pipelines

Engineering Data Pipelines

Spark as a Connector

Accessing Data

File Formats

Schemas and Types

Writing Data

Managed and Unmanaged Tables

Readings

Assignment #3 - Engineering Data Pipelines

Assignment #3 Quiz - Engineering Data Pipelines

Module 3 Quiz

Machine Learning Applications of Spark

Machine Learning Applications of Spark

Applications of Machine Learning

Machine Learning Fundamentals

Linear Regression

Training Linear Regression Model

Applying Machine Learning with UDFs

Course Summary

Readings

Assignment #4 - Logistic Regression Classifier

Assignment #4 Quiz - Logistic Regression Classifier

Module 4 Quiz

Other courses offered by Coursera

Databases and SQL for Data Science with Python

IBM - Institute of Business Management

Certificate

Total Fees

– / –

Duration

3 months

Difficulty level

Beginner

Databases and SQL for Data Science with Python

IBM - Institute of Business Management

Certificate

Total Fees

– / –

Duration

20 hours

Difficulty level

Beginner

Learn SQL Basics for Data Science Specialization

University of California, Davis

Certificate

Total Fees

– / –

Duration

2 months

Difficulty level

Beginner

Skills

Data analysis MySQL Apache

Machine Learning for Marketing Specialization

Coursera

Certificate

Total Fees

– / –

Duration

3 months

Difficulty level

Beginner


