Apache Spark is the dominant distributed data processing framework, and it is written in Scala. In this lesson you'll write Spark jobs, understand Spark's execution model, and look at Scala's role in the big data ecosystem.
Spark distributes computation across a cluster. Dataset[T] is a typed, immutable distributed collection. SparkSession is the entry point. Transformations (map, filter, groupBy) are lazy — they build an execution plan. Actions (collect, count, write) trigger execution. The Catalyst optimizer rewrites the plan for efficiency before execution. DataFrame (Dataset[Row]) adds SQL-like operations and schema inference.
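The lazy-transformations-plus-actions model has a close analogue in Scala's own lazy collections, which you can run without a cluster. A minimal pure-Scala sketch (no Spark involved): chaining `map` and `filter` on a `LazyList` records a pipeline but does no work, and a terminal operation like `sum` plays the role of an action.

```scala
// Lazy pipeline: nothing is computed when these lines run
var evaluated = 0
val pipeline = LazyList.range(1, 101)
  .map { n => evaluated += 1; n * 2 } // "transformation": recorded, not executed
  .filter(_ % 3 == 0)                 // another deferred step

println(evaluated) // 0 -- the pipeline is only a description so far

// "Action": forcing a result triggers evaluation of the whole chain
val total = pipeline.sum
println(evaluated) // 100 -- the map function ran once per element
println(total)     // 3366
```

Spark takes the same idea further: because the plan is just data until an action runs, Catalyst can rewrite it (pushing filters down, pruning columns) before any work happens.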
DataFrames support SQL queries: spark.sql("SELECT country, AVG(gdp) FROM nations GROUP BY country"). DataFrame API: df.select("col1", "col2"), df.filter($"age" > 18), df.groupBy("country").agg(avg("gdp"), count("*")). DataFrames are untyped at compile time but optimized by Catalyst; for compile-time type safety, convert to a Dataset[T] with df.as[CaseClass].
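The df.as[CaseClass] conversion is what moves you from stringly-typed column lookups to compiler-checked fields. A sketch of the difference (assumes a SparkSession named spark with spark.implicits._ imported, and a DataFrame df with name and score columns; this won't run outside a Spark project):

```scala
import org.apache.spark.sql.Dataset

case class Student(name: String, score: Double)

// DataFrame: column names are strings, so a typo fails only at runtime
// df.select("scor")                 // compiles fine; throws AnalysisException when run

// Dataset[Student]: fields are checked by the compiler
val ds: Dataset[Student] = df.as[Student]
val passing = ds.filter(_.score >= 70) // plain Scala lambda, fully type-checked
// ds.filter(_.scor >= 70)           // does not compile: no field `scor`
```

The trade-off: typed lambdas are opaque to Catalyst, so column-based expressions can sometimes be optimized more aggressively than the equivalent typed code.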
Scala powers: Apache Kafka (broker originally written in Scala), Apache Flink (streaming), Databricks (Spark as a service), Twitter's Scalding (Hadoop MapReduce). sbt (Scala Build Tool) manages dependencies. The Typelevel ecosystem provides functional Scala: Cats (type classes), Cats Effect (IO monad, async), fs2 (streaming), Doobie (database), Http4s (web). These libraries are favored for high-correctness production services.
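Cats is built on the type class pattern. To make that concrete without pulling in the library, here is a minimal hand-rolled Monoid; the names mirror cats.Monoid, but this sketch has no dependencies:

```scala
// A type class: behavior defined outside the types that support it
trait Monoid[A] {
  def empty: A
  def combine(x: A, y: A): A
}

object Monoid {
  // Instances for specific types, found via implicit resolution
  implicit val intAdd: Monoid[Int] = new Monoid[Int] {
    def empty = 0
    def combine(x: Int, y: Int) = x + y
  }
  implicit val stringConcat: Monoid[String] = new Monoid[String] {
    def empty = ""
    def combine(x: String, y: String) = x + y
  }
}

// One generic function works for any type that has an instance
def combineAll[A](xs: List[A])(implicit m: Monoid[A]): A =
  xs.foldLeft(m.empty)(m.combine)

val sum = combineAll(List(1, 2, 3))        // 6
val text = combineAll(List("a", "b", "c")) // "abc"
```

Cats ships exactly this kind of abstraction (plus laws and syntax) for dozens of structures, which is why it underpins the rest of the Typelevel stack.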
import org.apache.spark.sql.{SparkSession, functions => F}

val spark = SparkSession.builder()
  .appName("PrecisionAI Analytics")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Load CSV data with a header row and inferred column types
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("students.csv")

// Transformation pipeline (lazy -- builds a plan, runs nothing yet)
val result = df
  .filter($"score" >= 70)
  .withColumn("letter_grade",
    F.when($"score" >= 90, "A")
      .when($"score" >= 80, "B")
      .when($"score" >= 70, "C")
      .otherwise("F"))
  .groupBy("letter_grade")
  .agg(
    F.count("*").as("count"),
    F.avg("score").as("avg_score")
  )
  .orderBy("letter_grade")

// Action: triggers execution of the whole plan
result.show()

// Write results to Parquet, replacing any previous output
result.write.mode("overwrite").parquet("output/grades")
spark.stop()
Write a Spark job that processes a 10GB log file: parse each line, extract timestamp/level/message, compute error rates per hour, identify the top 10 most frequent error messages, and write the results as a partitioned Parquet dataset.
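One possible shape for this exercise, as a sketch rather than a full solution (the log-line format, regex, column names, and output paths are all assumptions; it needs a Spark project to run):

```scala
import org.apache.spark.sql.{SparkSession, functions => F}

val spark = SparkSession.builder().appName("LogAnalysis").getOrCreate()
import spark.implicits._

// Assumed line format: "2024-01-15T10:23:45 ERROR Connection refused"
val logPattern = """(\S+)\s+(\S+)\s+(.*)""".r

val logs = spark.read.textFile("logs/app.log") // Dataset[String], read lazily
  .flatMap {
    case logPattern(ts, level, msg) => Some((ts, level, msg))
    case _                          => None // drop unparseable lines
  }
  .toDF("timestamp", "level", "message")
  .withColumn("hour", F.date_trunc("hour", F.to_timestamp($"timestamp")))

// Error rate per hour: ERROR lines divided by total lines in that hour
val errorRates = logs.groupBy("hour").agg(
  (F.sum(F.when($"level" === "ERROR", 1).otherwise(0)) / F.count("*"))
    .as("error_rate"))

// Top 10 most frequent error messages
val topErrors = logs.filter($"level" === "ERROR")
  .groupBy("message").count()
  .orderBy(F.desc("count")).limit(10)

// Partitioned Parquet output: one directory per hour
errorRates.write.mode("overwrite").partitionBy("hour").parquet("output/error_rates")
topErrors.write.mode("overwrite").parquet("output/top_errors")
spark.stop()
```

At 10GB the file will be split across many partitions automatically; the main tuning concerns are shuffle partition count for the groupBy stages and caching `logs` if both aggregations run in the same job.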