Day 5 of 5
⏱ ~60 minutes
Scala in 5 Days — Day 5

Apache Spark & Production Scala

Apache Spark is the dominant distributed data processing framework, and it is written in Scala. Today you write Spark jobs, understand Spark's execution model, and look at Scala's role in the big data ecosystem.

Spark Dataset API

Spark distributes computation across a cluster. Dataset[T] is a typed, immutable distributed collection. SparkSession is the entry point. Transformations (map, filter, groupBy) are lazy — they build an execution plan. Actions (collect, count, write) trigger execution. The Catalyst optimizer rewrites the plan for efficiency before execution. DataFrame (Dataset[Row]) adds SQL-like operations and schema inference.
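Spark's transformation/action split behaves much like lazy views on ordinary Scala collections, which you can try without a cluster. This local analogy (not Spark itself) shows that "transformations" record work and a terminal "action" triggers it:

```scala
object LazyPipeline extends App {
  var evaluations = 0

  // Like Spark transformations: map and filter on a view are recorded, not run
  val pipeline = (1 to 10).view
    .map { n => evaluations += 1; n * 2 }
    .filter(_ > 10)

  println(evaluations) // 0 -- no work has happened yet

  // Like a Spark action: toList forces the whole pipeline
  val result = pipeline.toList
  println(result)      // List(12, 14, 16, 18, 20)
  println(evaluations) // 10 -- each element was mapped exactly once
}
```

In Spark the payoff of this laziness is bigger: because the full plan is known before anything runs, Catalyst can reorder and fuse steps across the cluster.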

Spark SQL and DataFrames

DataFrames support SQL queries: spark.sql("SELECT country, AVG(gdp) FROM nations GROUP BY country"). The DataFrame API offers the same operations programmatically: df.select("col1", "col2"), df.filter($"age" > 18), df.groupBy("country").agg(avg("gdp"), count("*")). DataFrames are untyped at compile time but optimized by Catalyst. For compile-time type safety, convert to a Dataset[T]: df.as[CaseClass].
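The typed Dataset[T] combinators mirror the ordinary Scala collection API, so the shape of a Dataset pipeline can be previewed on a plain List. This sketch uses made-up data; the comment shows the (assumed) Spark equivalent with groupByKey and a typed aggregation:

```scala
case class Nation(country: String, gdp: Double)

object TypedAggDemo extends App {
  val nations = List(
    Nation("Freedonia", 120.0),
    Nation("Freedonia", 80.0),
    Nation("Sylvania", 200.0)
  )

  // Average GDP per country on a local collection -- roughly the typed
  // analogue of: ds.groupByKey(_.country).agg(avg($"gdp").as[Double])
  val avgGdp: Map[String, Double] =
    nations
      .groupBy(_.country)
      .map { case (country, rows) =>
        country -> rows.map(_.gdp).sum / rows.size
      }

  println(avgGdp("Freedonia")) // 100.0
  println(avgGdp("Sylvania"))  // 200.0
}
```

Unlike the DataFrame version, a typo like `_.gpd` here fails at compile time instead of at job runtime -- the main argument for df.as[CaseClass] in production code.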

Scala in the Data Engineering Ecosystem

Scala powers much of the big data ecosystem: Apache Kafka (core written in Scala and Java), Apache Flink (streaming, with a Scala API), Databricks (Spark as a managed service), and LinkedIn's Scalding (a Scala DSL over Hadoop MapReduce). sbt, the standard Scala build tool, manages dependencies. The Typelevel ecosystem provides purely functional Scala: Cats (type classes and functional abstractions), Cats Effect (the IO monad, async runtime), fs2 (streaming), Doobie (JDBC database access), and Http4s (HTTP servers and clients). These libraries are favored for production services where correctness matters.

scala
import org.apache.spark.sql.{SparkSession, functions => F}

val spark = SparkSession.builder()
  .appName("PrecisionAI Analytics")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// Load data
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("students.csv")

// Transformation pipeline (lazy -- nothing runs yet)
val result = df
  .filter($"score".isNotNull)   // drop rows with missing scores
  .withColumn("letter_grade",
    F.when($"score" >= 90, "A")
     .when($"score" >= 80, "B")
     .when($"score" >= 70, "C")
     .otherwise("F"))
  .groupBy("letter_grade")
  .agg(
    F.count("*").as("count"),
    F.avg("score").as("avg_score")
  )
  .orderBy("letter_grade")

// Action triggers execution
result.show()

// Write to Parquet
result.write.mode("overwrite").parquet("output/grades")

spark.stop()
💡
Add .cache() to a DataFrame that feeds multiple actions in your Spark job. Without caching, Spark re-reads and re-processes the data for every action; with caching, the computed result is stored in memory (spilling to disk if needed) after the first action and reused by the rest.
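A minimal sketch of the pattern, assuming the `df` and SparkSession from the example above (it will not run outside a Spark job):

```scala
// Mark df for storage; nothing is computed until the first action
val cached = df.cache()

val passing = cached.filter($"score" >= 70).count() // computes and caches
val failing = cached.filter($"score" < 70).count()  // served from the cache

cached.unpersist() // release the cached data when done
```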
📝 Day 5 Exercise
Write Your First Spark Job
  1. Add Spark to build.sbt: libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.0"
  2. Create a SparkSession with master("local[*]") for local development
  3. Generate a CSV file with 10,000 rows of student data
  4. Write a Spark job that computes grade distribution and writes results to Parquet
  5. Run with: sbt run — check the Spark UI at http://localhost:4040 while it runs
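Step 3 can be done in plain Scala before the Spark job runs. This sketch uses made-up names and an arbitrary score range; the column names match the lesson's pipeline:

```scala
import java.io.PrintWriter
import scala.util.Random

object GenerateStudents extends App {
  val rng = new Random(42) // fixed seed for reproducible data
  val out = new PrintWriter("students.csv")
  out.println("id,name,score")
  for (i <- 1 to 10000) {
    val score = 50 + rng.nextInt(51) // scores in [50, 100]
    out.println(s"$i,student_$i,$score")
  }
  out.close()
}
```

Because the file has a header row and numeric scores, the `header` and `inferSchema` read options from the example above pick up the schema automatically.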

Day 5 Summary

  • Spark Dataset/DataFrame distributes computation across a cluster transparently
  • Transformations are lazy; actions (collect, show, write) trigger execution
  • Catalyst optimizer rewrites the execution plan before running
  • Add .cache() to DataFrames used multiple times to avoid recomputation
  • Typelevel ecosystem (Cats, Cats Effect, fs2) provides functional Scala for production services
Challenge

Write a Spark job that processes a 10GB log file: parse each line, extract timestamp/level/message, compute error rates per hour, identify the top 10 most frequent error messages, and write the results as a partitioned Parquet dataset.
