Spark Streaming Contest: Real-time Anomaly Detection Apps

At Impetus, we take data analytics innovation seriously. Very seriously. And one of the ways we continue to improve our big data software products and services, and maintain our industry leadership, is through community programs that empower users to explore innovative uses for analytics technologies with our real-time streaming software, Gathr.

One of our big data programs was the inaugural Spark Streaming Innovation Contest, an international data hackathon that drew roughly 600 participants from around the world and offered a grand prize of $10,000 for the best submission. The contest ran from February through April and was open to the general community; we called on business analysts and engineers to solve real-world anomaly detection problems.

Because hackathon participants vary in skill level and experience, we outfitted them with two tools: Apache Spark and Gathr. We wanted them to be able to access their data quickly while eliminating the need to build complicated models to gain insights.

Apache® Spark™ is one of the most popular stream processing engines, thanks to its open source framework, powerful programming model, and advanced analytics capabilities. However, Spark typically requires a lot of setup, coding and modeling; therefore, we equipped users with Gathr, a development platform that enables users to create real-time stream processing and machine learning applications.

Gathr makes anomaly detection on Apache Spark extremely easy, allowing developers to leverage their data quickly and spend their time gaining insights instead of programming. With these tools in hand, hackathon participants could build anomaly detection applications quickly, even without prior experience with Gathr.

A panel of experts evaluated and scored each submission. The panel included the Gathr product team, architects and engineers, as well as Alex Woodie, managing editor of Datanami, and Mike Matchett, senior analyst and consultant at Taneja Group.

Perhaps one of the most shocking discoveries we made is that this year’s winners weren’t even veteran data scientists. “I wouldn’t call myself a data science expert,” said Venu Kanaparthy of Redlands, California. Kanaparthy won the grand prize of $10,000 with his machine learning application for anomaly detection using Spark. Despite his limited experience, he says that he “was able to build a fully functional anomaly detection application on Spark working part-time evenings over about 4 weeks.”

A total of $18,000 was awarded in prize money, including awards for two runners-up. The first runner-up (awarded $5,000) was Anindya Saha from Foster City, California. The second runner-up (awarded $3,000) was Kalyan Janaki from Denver, Colorado. We congratulate our winners and are already looking forward to next year’s competition.

Powerful User Experiences

Having a powerful tool at your fingertips is one thing, but having a tool that is easy and exciting to use leads to better big data analysis. And we know how important UX is. So, when the hackathon ended, we surveyed the participants about their experiences.

One of the biggest complaints we’ve come across from data analysts and data engineers in the field is that there are many tools in their toolkit, but they lack good user experiences. Good UX is innovation and is at the core of everything we do, so we listened carefully.

We were delighted to hear that they found the platform “very intuitive.” In fact, most commented on how easy it is to use, saying that “even a business analyst who has some familiarity with Spark can easily create and run Spark-based machine learning pipelines.” The main advantage of the platform, said one user, is the ability to “develop and deploy pipelines with no or less coding.” Overall, contestants agreed that Gathr is as valuable for business analysts as it is for data scientists and developers.

We are inspired. And we will create more programs like these. What about you? Ready to build a real-time streaming application?
