Deserialize Avro Kafka message in pyspark

Sep 23 2020 Process>Spark

Recently, I worked on a project to consume Kafka messages and ingest them into Hive using Spark Structured Streaming. I use Python for most of my data pipeline work, and this project was no exception.

Everything went smoothly at first, when I launched my first Spark Structured Streaming job to read simple raw-text messages from the Kafka cluster. The problem arose when I tried to parse the real Kafka messages, which were serialized in Avro format.

Spark & Kafka docker-compose

Sep 20 2020 Process>Spark

To quickly launch Spark and Kafka for local development, Docker is the top choice because of its flexibility and isolated environment. This saves a lot of time otherwise spent manually installing a bunch of packages and resolving conflicts between them.

Kafka fundamental

Mar 20 2020 Connect>Kafka

Apache Kafka is a distributed streaming platform used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production at thousands of companies.

#streaming

