At the beginning of every research effort, researchers in empirical software engineering have to extract data from raw data sources and transform it into the inputs their tools expect. This step is time-consuming and error-prone, while the produced artifacts (code, intermediate datasets) are usually not of scientific value. In recent years, Apache Spark has emerged as a solid foundation for data science and has taken the big data analytics domain by storm. With Spark, researchers can map their data sources into immutable lists or data frames and transform them using a declarative API based on functional programming primitives. The primitives exposed by the Apache Spark API can help software engineering researchers create and share reproducible, high-performance data analysis pipelines that automatically scale processing to clusters of machines.
This technical briefing will cover the following topics:
- Functional programming basics: what is map? What is fold? What do group by and join do?
- Apache Spark in a nutshell: what are RDDs and what are DataFrames? How can we query any dataset with SQL?
- A live demo of applying Apache Spark to a software engineering task.
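As a rough illustration of the functional primitives named in the topics, the following minimal sketch shows map, fold, group by, and join on plain Python lists. It is not taken from the briefing: the commit and team records, and the helper functions `group_by` and `join`, are hypothetical and defined here only for illustration. Spark's RDD API offers operations with the same names (`map`, `fold`, `groupBy`, `join`) that run on distributed collections instead of local lists.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sample data: (author, lines_changed) for each commit.
commits = [("alice", 10), ("bob", 3), ("alice", 5), ("carol", 7)]
# Hypothetical lookup table: (author, team).
teams = [("alice", "core"), ("bob", "docs")]

# map: apply a function to every element, producing a new list.
sizes = list(map(lambda c: c[1], commits))

# fold: combine all elements into a single value, starting from an initial value.
total = reduce(lambda acc, n: acc + n, sizes, 0)


def group_by(key, items):
    """group by: collect the elements that share the same key."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return dict(groups)


def join(left, right):
    """Inner join: pair values from two (key, value) lists on equal keys."""
    index = group_by(lambda kv: kv[0], right)
    return [(k, (v, w)) for k, v in left for _, w in index.get(k, [])]


by_author = group_by(lambda c: c[0], commits)
joined = join(commits, teams)
```

Because this is an inner join, commits by an author with no entry in `teams` do not appear in `joined`.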
The speaker has extensive experience in applying big data technologies to software engineering data and has been teaching Apache Spark to BSc and MSc students.
Tue 29 May
09:00 - 10:30
Georgios Gousios, TU Delft
11:00 - 12:30