Cross-Language Optimizations in Big Data Systems: A Case Study of SCOPE
Building scalable big data programs currently requires programmers to combine relational (SQL) with non-relational code (Java, C#, Scala). Relational code is declarative — a program describes what the computation is and the compiler decides how to distribute the program. SQL query optimization has enjoyed a rich and fruitful history, however, most research and commercial optimization engines treat non-relational code as a black-box and thus are unable to optimize it. This paper empirically studies over 3 million SCOPE programs across five data centers within Microsoft and finds programs with non-relational code take between 45-70% of data center CPU time. We further explore the potential for SCOPE optimization by generating more native code from the non-relational part. Finally, we present 6 case studies showing that triggering more generation of native code in these jobs yields significant performance improvement: optimizing just one portion resulted in as much as 25% improvement for an entire program.