Last week the Gathr team hosted a webinar on Structured Streaming, “The Structured Streaming Upgrade to Apache Spark and How Enterprises Can Benefit”, and received overwhelming participation from the industry, including many of you reading this. Amit Assudani (Sr. Technical Architect – Spark, Gathr) and I took a deep dive into Structured Streaming and shared our views on how it enables the real-time enterprise and simplifies building stream processing applications on Spark. Here is a summary of our current view on Structured Streaming:
Need for Structured Streaming
Open source engines such as Apache Storm, Apache Spark, and Apache Flink have made it possible to build apps for fault-tolerant processing of real-time data streams. However, building robust stream processing applications is still hard and involves many complexities. The biggest complexity is the data itself: it comes in various formats, needs to be cleansed, arrives at different speeds, can become corrupt, and needs seamless integration with other data in external storage. Moreover, streaming applications don’t work in isolation; they usually involve other workloads such as interactive queries, batch jobs, and machine learning on top of streaming.
Apache Spark has evolved precisely to address these complexities. Since its release, Spark Streaming has become one of the most widely used distributed streaming engines. According to a November 2016 survey by Taneja Group, nearly 54% of the 7,000 enterprise respondents said they were actively using Spark, and 65% planned to increase their use of Spark over the next 12 months. Top use cases include ETL, real-time processing, data science, and machine learning. This aligned with the results of the poll we conducted during the webinar, where 53% of roughly 200 attendees said they were already using Spark and 32% planned to use it in the near future.
Structured Streaming, introduced experimentally in Spark 2.0, is designed to further simplify building stream processing applications with Spark. It is a fast, scalable, and fault-tolerant stream processing engine built on top of Spark SQL, and it provides unified high-level APIs for dealing with complex data, workloads, and systems. It comes with a growing ecosystem of data sources that let streaming applications integrate with evolving storage systems.
What does Structured Streaming bring?
Structured Streaming starts from the premise that building stream processing applications normally requires strong ‘reasoning’ (i.e. worrying about and designing mechanisms) for end-to-end guarantees, intermediate aggregates, and data consistency. The philosophy it follows is that to perform stream processing effectively, the developer should not have to reason about streaming at all. As an end user, you shouldn’t have to reason about what happens when data arrives late or the system fails. Structured Streaming provides strong consistency guarantees with respect to batch jobs: it takes care to process data exactly once and to update output sinks incrementally.
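To make the exactly-once idea concrete, here is a minimal plain-Python sketch of one common mechanism: pairing each micro-batch with an identifier that the sink records, so that a batch replayed after a failure is skipped instead of written twice. The `Sink` class and its fields are purely illustrative, not Spark APIs.

```python
# Illustrative sketch: exactly-once output via batch-id tracking in the sink.
class Sink:
    def __init__(self):
        self.rows = []
        self.committed_batch = -1  # id of the last batch successfully written

    def write(self, batch_id, rows):
        if batch_id <= self.committed_batch:
            return  # replay after a failure: already written, skip it
        self.rows.extend(rows)
        self.committed_batch = batch_id

sink = Sink()
sink.write(0, ["a", "b"])
sink.write(1, ["c"])
sink.write(1, ["c"])  # simulated retry of batch 1 is ignored
```

Even though batch 1 is delivered twice, the output contains each row exactly once, which is the guarantee the engine is designed to give without the developer reasoning about failures.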
New concepts like ‘late data handling’ and ‘watermarking’ enable these guarantees. Structured Streaming handles delayed data by maintaining intermediate state for partial aggregates, allowing late data to correctly update the aggregates of old windows. A watermark defines how late data may arrive and still be allowed to update those aggregates. Another feature is ‘event time’: earlier, Spark only considered the time when data entered the system, not the time when the event actually occurred. Now, Structured Streaming allows aggregates and windows to be updated based on event time, and these aggregates are maintained by Spark itself.
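The interaction of event-time windows, late data, and the watermark can be simulated in a few lines of plain Python. This is a sketch of the semantics only, with illustrative names and hand-picked thresholds, not Spark code: events later than the watermark are dropped, while late-but-allowed events still update an old window’s state.

```python
from collections import defaultdict

WINDOW = 10            # seconds per event-time window
ALLOWED_LATENESS = 15  # watermark trails the max event time seen by 15 s

windows = defaultdict(int)  # window start -> event count (intermediate state)
max_event_time = 0

def process(event_time):
    """Update windowed counts by event time, dropping data beyond the watermark."""
    global max_event_time
    max_event_time = max(max_event_time, event_time)
    watermark = max_event_time - ALLOWED_LATENESS
    if event_time < watermark:
        return  # too late: beyond what the watermark allows, ignored
    windows[(event_time // WINDOW) * WINDOW] += 1

for t in [5, 12, 25, 15, 3]:  # 15 and 3 arrive after 25, i.e. out of order
    process(t)
```

After event time 25 the watermark sits at 10, so the late event at 15 still updates the old [10, 20) window, while the event at 3 is discarded, exactly the trade-off a watermark expresses.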
One of the biggest capabilities Structured Streaming brings is simplified event-time stream processing that works on both batch data and streams. This was not possible with DStreams, the real-time processing API in earlier versions of Spark. To achieve it, Spark has a new model, a new way to treat streams: tables. Except this table is an append-only, unbounded table. Streams are treated as conceptual tables, unbounded and continuously growing. In actual execution, the query over the unbounded table is incrementalized, allowing a single Dataset/DataFrame API to deal with both a static table and an unbounded one. As new data arrives on the stream, new rows are appended to this table, unifying batch and streaming data under the single concept of tables.
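The unbounded-table model can be sketched in plain Python (all names here are illustrative, not Spark APIs): each micro-batch appends rows to a conceptually unbounded table, while the aggregate is updated incrementally from only the new rows, yet always matches what a full batch query over the whole table would return.

```python
from collections import Counter

table = []          # the "unbounded, append-only table"
running_count = {}  # incrementally maintained aggregate: word -> count

def append_batch(rows):
    """Append new rows to the table and incrementally update the aggregate."""
    table.extend(rows)           # new stream data becomes new rows
    for word in rows:            # the query only touches the new rows
        running_count[word] = running_count.get(word, 0) + 1

append_batch(["spark", "stream"])
append_batch(["spark"])

# The incremental result equals a batch query over the entire table.
assert running_count == dict(Counter(table))
```

The final assertion is the whole point of the model: the incrementalized query and the batch query over the full table are interchangeable.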
To put it another way, Structured Streaming allows the developer to write the business logic once and conveniently apply it to batch data or streaming data, changing only a line or two of code. For instance, if you write a batch query using DataFrames, converting it to a streaming query only requires changing ‘read’ to ‘readStream’ and ‘write’ to ‘writeStream’; the actual query, the business logic, does not change. Essentially, Structured Streaming turns periodic batch jobs into a real-time data pipeline: it takes your batch-like query and automatically incrementalizes it so that it operates on batches of new data.
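The “write the logic once” idea can be illustrated with a small plain-Python sketch (the `query` function and record shape are hypothetical, not Spark code): the same business logic runs unchanged over a complete batch and over the same data arriving in micro-batches, and produces the same answer.

```python
def query(rows):
    """The business logic: count ERROR records. Identical for batch or stream."""
    return sum(1 for r in rows if r["level"] == "ERROR")

data = [{"level": "ERROR"}, {"level": "INFO"}, {"level": "ERROR"}]

# Batch: the query sees all the data at once.
batch_result = query(data)

# Streaming: the same query, applied incrementally to each micro-batch.
stream_result = 0
for micro_batch in ([data[0]], [data[1], data[2]]):
    stream_result += query(micro_batch)

assert batch_result == stream_result
```

Only the plumbing around `query` differs between the two modes; the query itself is untouched, which is the single-line `read`/`readStream` swap in spirit.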
With Spark 2.2, Structured Streaming moves out of the experimentation phase. This version marks Structured Streaming as production ready, dropping the experimental tag, and its now-stable API will be supported across future versions of Spark.
But let’s also look at what Structured Streaming is not. Structured Streaming is not a radical change to Spark itself. It is a collection of additions to Spark Streaming that retains the fundamental concept of micro-batching at the core of Spark’s streaming architecture. It enables users to continually and incrementally update their view of the world as new data arrives, while still using the same familiar Spark SQL abstractions. It maintains tight integration with the rest of Spark, supports serving interactive queries on streaming state via Spark SQL, and integrates with MLlib.
Though a big leap towards true streaming, Structured Streaming is not there yet for use cases requiring 10–20 ms turnaround times. You still need to consider true event-streaming engines like Storm and Flink for those. But there is a promise to move in that direction: the same code running on a dramatically different engine, with plans to support latencies as low as 1 ms. This move will further expand the use of Spark for building true event-time processing applications.
You can access our webinar for a deeper dive into the functionality of Structured Streaming, its features and highlights, the mid- to long-term outlook, and the challenges that still persist. Look out for some very interesting questions from the audience that we answered live. We also received very encouraging feedback and look forward to bringing you more such content in the future.
Gathr is an end-to-end, unified data platform that handles ingestion, integration/ETL (extract, transform, load), streaming analytics, and machine learning. It offers strengths in usability, data connectors, tools, and extensibility.