Day 5 of 5
⏱ ~60 minutes
Scala in 5 Days — Day 5

Apache Spark & Production Scala

Apache Spark is the dominant distributed data processing framework, and it is written in Scala. Today you write Spark jobs, understand Spark's execution model, and look at Scala's role in the big data ecosystem.

Spark Dataset API

Spark distributes computation across a cluster. Dataset[T] is a typed, immutable distributed collection. SparkSession is the entry point. Transformations (map, filter, groupBy) are lazy — they build an execution plan. Actions (collect, count, write) trigger execution. The Catalyst optimizer rewrites the plan for efficiency before execution. DataFrame (Dataset[Row]) adds SQL-like operations and schema inference.
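Spark's transformation/action split behaves much like lazy views on ordinary Scala collections, which you can try without a cluster. This local analogy (not Spark itself) shows that "transformations" record work and a terminal "action" triggers it:

```scala
object LazyPipeline extends App {
  var evaluations = 0

  // Like Spark transformations: map and filter on a view are recorded, not run
  val pipeline = (1 to 10).view
    .map { n => evaluations += 1; n * 2 }
    .filter(_ > 10)

  println(evaluations) // 0 -- no work has happened yet

  // Like a Spark action: toList forces the whole pipeline
  val result = pipeline.toList
  println(result)      // List(12, 14, 16, 18, 20)
  println(evaluations) // 10 -- each element was mapped exactly once
}
```

In Spark the payoff of this laziness is bigger: because the full plan is known before anything runs, Catalyst can reorder and fuse steps across the cluster.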

Spark SQL and DataFrames

DataFrames support SQL queries: spark.sql("SELECT country, AVG(gdp) FROM nations GROUP BY country"). The DataFrame API offers the same operations programmatically: df.select("col1", "col2"), df.filter($"age" > 18), df.groupBy("country").agg(avg("gdp"), count("*")). DataFrames are untyped at compile time but optimized by Catalyst. For compile-time type safety, convert to a Dataset[T]: df.as[CaseClass].
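The typed Dataset[T] combinators mirror the ordinary Scala collection API, so the shape of a Dataset pipeline can be previewed on a plain List. This sketch uses made-up data; the comment shows the (assumed) Spark equivalent with groupByKey and a typed aggregation:

```scala
case class Nation(country: String, gdp: Double)

object TypedAggDemo extends App {
  val nations = List(
    Nation("Freedonia", 120.0),
    Nation("Freedonia", 80.0),
    Nation("Sylvania", 200.0)
  )

  // Average GDP per country on a local collection -- roughly the typed
  // analogue of: ds.groupByKey(_.country).agg(avg($"gdp").as[Double])
  val avgGdp: Map[String, Double] =
    nations
      .groupBy(_.country)
      .map { case (country, rows) =>
        country -> rows.map(_.gdp).sum / rows.size
      }

  println(avgGdp("Freedonia")) // 100.0
  println(avgGdp("Sylvania"))  // 200.0
}
```

Unlike the DataFrame version, a typo like `_.gpd` here fails at compile time instead of at job runtime -- the main argument for df.as[CaseClass] in production code.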

Scala in the Data Engineering Ecosystem

Scala powers much of the big data ecosystem: Apache Kafka (core written in Scala and Java), Apache Flink (streaming, with a Scala API), Databricks (Spark as a managed service), and LinkedIn's Scalding (a Scala DSL over Hadoop MapReduce). sbt, the standard Scala build tool, manages dependencies. The Typelevel ecosystem provides purely functional Scala: Cats (type classes and functional abstractions), Cats Effect (the IO monad, async runtime), fs2 (streaming), Doobie (JDBC database access), and Http4s (HTTP servers and clients). These libraries are favored for production services where correctness matters.

scala
import org.apache.spark.sql.{SparkSession, functions => F}

val spark = SparkSession.builder()
  .appName("PrecisionAI Analytics")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// Load data
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("students.csv")

// Transformation pipeline (lazy -- nothing runs yet)
val result = df
  .filter($"score".isNotNull)   // drop rows with missing scores
  .withColumn("letter_grade",
    F.when($"score" >= 90, "A")
     .when($"score" >= 80, "B")
     .when($"score" >= 70, "C")
     .otherwise("F"))
  .groupBy("letter_grade")
  .agg(
    F.count("*").as("count"),
    F.avg("score").as("avg_score")
  )
  .orderBy("letter_grade")

// Action triggers execution
result.show()

// Write to Parquet
result.write.mode("overwrite").parquet("output/grades")

spark.stop()
💡
Add .cache() to a DataFrame that feeds multiple actions in your Spark job. Without caching, Spark re-reads and re-processes the data for every action; with caching, the computed result is stored in memory (spilling to disk if needed) after the first action and reused by the rest.
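A minimal sketch of the pattern, assuming the `df` and SparkSession from the example above (it will not run outside a Spark job):

```scala
// Mark df for storage; nothing is computed until the first action
val cached = df.cache()

val passing = cached.filter($"score" >= 70).count() // computes and caches
val failing = cached.filter($"score" < 70).count()  // served from the cache

cached.unpersist() // release the cached data when done
```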
📝 Day 5 Exercise
Write Your First Spark Job
  1. Add Spark to build.sbt: libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.0"
  2. Create a SparkSession with master("local[*]") for local development
  3. Generate a CSV file with 10,000 rows of student data
  4. Write a Spark job that computes grade distribution and writes results to Parquet
  5. Run with: sbt run — check the Spark UI at http://localhost:4040 while it runs
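Step 3 can be done in plain Scala before the Spark job runs. This sketch uses made-up names and an arbitrary score range; the column names match the lesson's pipeline:

```scala
import java.io.PrintWriter
import scala.util.Random

object GenerateStudents extends App {
  val rng = new Random(42) // fixed seed for reproducible data
  val out = new PrintWriter("students.csv")
  out.println("id,name,score")
  for (i <- 1 to 10000) {
    val score = 50 + rng.nextInt(51) // scores in [50, 100]
    out.println(s"$i,student_$i,$score")
  }
  out.close()
}
```

Because the file has a header row and numeric scores, the `header` and `inferSchema` read options from the example above pick up the schema automatically.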

Day 5 Summary

  • Spark Dataset/DataFrame distributes computation across a cluster transparently
  • Transformations are lazy; actions (collect, show, write) trigger execution
  • Catalyst optimizer rewrites the execution plan before running
  • Add .cache() to DataFrames used multiple times to avoid recomputation
  • Typelevel ecosystem (Cats, Cats Effect, fs2) provides functional Scala for production services
Challenge

Write a Spark job that processes a 10GB log file: parse each line, extract timestamp/level/message, compute error rates per hour, identify the top 10 most frequent error messages, and write the results as a partitioned Parquet dataset.
