Module 6: Batch Processing

6.1 Introduction

  • 🎥 6.1.1 Introduction to Batch Processing

  • 🎥 6.1.2 Introduction to Spark

6.2 Installation

Follow these instructions to install Spark:

🎥 6.2.1 (Optional) Installing Spark (Linux)

Alternatively, if the setup above doesn't work, you can run Spark in Google Colab.

Note

It's advisable to invest some time in setting things up locally rather than immediately jumping to this solution.

6.3 Spark SQL and DataFrames

  • 🎥 6.3.1 First Look at Spark/PySpark

  • 🎥 6.3.2 Spark Dataframes

  • 🎥 6.3.3 (Optional) Preparing Yellow and Green Taxi Data

Script to prepare the dataset: download_data.sh

Note

Another way to infer the schema for the CSV files (apart from using pandas) is to set the inferSchema option to true when reading the files in Spark.

  • 🎥 6.3.4 SQL with Spark

6.4 Spark Internals

  • 🎥 6.4.1 Anatomy of a Spark Cluster

  • 🎥 6.4.2 GroupBy in Spark

  • 🎥 6.4.3 Joins in Spark

6.5 (Optional) Resilient Distributed Datasets

  • 🎥 6.5.1 Operations on Spark RDDs

  • 🎥 6.5.2 Spark RDD mapPartition

6.6 Running Spark in the Cloud

  • 🎥 6.6.1 Connecting to Google Cloud Storage

  • 🎥 6.6.2 Creating a Local Spark Cluster

  • 🎥 6.6.3 Setting up a Dataproc Cluster

  • 🎥 6.6.4 Connecting Spark to BigQuery
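As a rough orientation for the cloud videos, the main change from local work is how the session is configured. The sketch below is a config fragment, not runnable as-is: the master URL, connector jar path, and bucket name are all placeholders, and the exact connector versions depend on your Hadoop/Spark setup.

```python
from pyspark.sql import SparkSession

# Placeholders throughout -- substitute your own cluster URL, jar path and bucket.
spark = (
    SparkSession.builder
    # Standalone cluster started with ./sbin/start-master.sh (default port 7077):
    .master("spark://<master-host>:7077")
    .appName("cloud-demo")
    # GCS connector jar so Spark can read gs:// paths:
    # .config("spark.jars", "/path/to/gcs-connector-hadoop3-latest.jar")
    .getOrCreate()
)

# Reading Parquet files directly from a Cloud Storage bucket:
# df = spark.read.parquet("gs://<bucket>/pq/green/*/")
```

On Dataproc the connectors for Cloud Storage (and, with the right package, BigQuery) come preconfigured, which is why the Dataproc video needs less of this setup than the local-cluster one.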

Homework

Community notes

Did you take notes? You can share them here