Sparkify is a music streaming service similar to Spotify. Every user's activity on the Sparkify application is logged and sent to a Kafka cluster. To improve the business, the data team collects this data into a Big Data platform for further processing, analysis, and extraction of insights that drive action. One of the main focus topics is churn prediction.

Read More

Have you ever wondered whether you are underpaid at work?
Or are you unsure what a well-deserved deal on a job offer would look like?
In this post, we will walk through the 2020 Stack Overflow survey of developers around the world to clear up these questions.

Read More

Hadoop can handle very large files, but it runs into performance issues when there are too many small files. The reason is explained in detail here. In short, every single file on a data node takes about 150 bytes of RAM on the name node. The more files there are, the more memory is required, which ultimately degrades the performance of the whole Hadoop cluster.
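
As a rough back-of-envelope illustration of that rule of thumb (assuming the ~150-bytes-per-file figure, which is an approximation rather than an exact measurement):

```python
# Rough estimate of the NameNode memory consumed by file metadata,
# assuming ~150 bytes per file object (a rule of thumb, not an exact figure).
BYTES_PER_FILE = 150

def namenode_memory_mb(num_files: int) -> float:
    """Approximate NameNode RAM (in MB) used just to track `num_files` files."""
    return num_files * BYTES_PER_FILE / (1024 ** 2)

# 100 million small files already need roughly 14 GB of NameNode heap
# for metadata alone, no matter how tiny each file is.
print(f"{namenode_memory_mb(100_000_000):,.0f} MB")  # ~14,305 MB
```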

Read More

According to First Alliances, demand for IT talent in Vietnam is rising rapidly while the talent pool remains small. Working in the Data Science segment, I focus mostly on data-related positions. From my experience sharing knowledge within my network, the survey reflects fairly accurately the wages people can earn per month. I emphasize this because I have seen other recent surveys that were greatly exaggerated or understated, based on unreliable sources.

Read More

Spark is powerful not only for batch data processing but also for streaming. Since version 2.x, Spark provides a new stream processing paradigm called Structured Streaming, built on the Spark SQL library. It makes stream processing easier for developers compared to the DStream API of earlier versions. This post walks through the basics to get started with Spark Structured Streaming and covers the setup for working with the most common streaming technology, Kafka.
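
To give a feel for the API before diving in, here is a minimal sketch of a Structured Streaming query that reads plain-text messages from Kafka and prints them to the console; the broker address and topic name are placeholders, and the spark-sql-kafka-0-10 package is assumed to be on the classpath:

```python
# Minimal Spark Structured Streaming read from Kafka.
# Broker address and topic name below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-stream-demo").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
    .option("subscribe", "events")                         # assumed topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast to string for plain-text messages.
query = (
    events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .writeStream
    .format("console")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```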

Read More

Recently, I worked on a project to consume Kafka messages and ingest them into Hive using Spark Structured Streaming. I mainly use Python for building data pipelines, and this project was no exception.

Everything went smoothly at first when I launched the first Spark Structured Streaming job to read simple messages in raw text format from the Kafka cluster. The problem arose when I tried to parse the real Kafka messages, which were serialized in Avro format.
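
For context, here is a rough sketch of the Avro path using PySpark's from_avro (available from Spark 3.0 with the spark-avro package); the broker, topic, and schema below are illustrative, not the real ones from the project:

```python
# Sketch: parsing Avro-encoded Kafka values with from_avro.
# Topic, broker, and schema are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.avro.functions import from_avro

spark = SparkSession.builder.appName("kafka-avro-demo").getOrCreate()

# Illustrative Avro schema for the message value (assumed, not the real one).
value_schema = """
{
  "type": "record",
  "name": "UserEvent",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "action",  "type": "string"},
    {"name": "ts",      "type": "long"}
  ]
}
"""

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
    .option("subscribe", "user-events")                    # assumed topic
    .load()
)

# from_avro expects a plain Avro payload; messages produced with the Confluent
# serializer carry a 5-byte header (magic byte + schema id) that must be
# stripped first, e.g. with expr("substring(value, 6, length(value) - 5)").
parsed = raw.select(from_avro("value", value_schema).alias("event")).select("event.*")

query = parsed.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```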

Read More

To quickly launch Spark and Kafka for local development, Docker is the top choice thanks to its flexibility and isolated environment. It saves a lot of time otherwise spent manually installing packages and resolving conflicts.

Read More

This post provides a general setup to start Spark development on a local computer. It is helpful for getting started, experimenting with Spark functionality, or even running a small project.
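
Once the setup is in place, a quick sanity check is to spin up a local session and run a tiny job (the app name and sample data here are arbitrary):

```python
# Minimal local Spark session for experimenting on a single machine.
# local[*] uses all available CPU cores; the app name is arbitrary.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("local-dev")
    .getOrCreate()
)

# Quick smoke test: create a small DataFrame and show it.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.show()

spark.stop()
```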

Read More

Understanding the Spark hierarchy in terms of hardware and software design will help you develop better-optimized Spark applications.
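
As a rough illustration, that hierarchy maps directly onto the per-application resource settings you tune; the values below are placeholders, not tuning recommendations:

```python
# How the cluster hierarchy maps to per-application resource settings.
# Values are illustrative placeholders, not tuning advice. On a real cluster
# these (and spark.driver.memory) are typically passed via spark-submit.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sizing-sketch")
    .config("spark.executor.instances", "4")  # number of executor JVMs
    .config("spark.executor.cores", "2")      # concurrent tasks per executor
    .config("spark.executor.memory", "4g")    # heap available to each executor
    .getOrCreate()
)
```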

Read More

Apart from the core APIs, which require external systems to install a Kafka client for integration, Kafka also supports the Kafka Connect API with a REST interface for more flexibility in communicating with different systems. This note collects a list of the basic, most frequently used API calls.
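
As a small taste before the full list, a few of those calls can be exercised from Python; the worker URL and connector configuration below are assumed examples:

```python
# A few frequently used Kafka Connect REST calls via Python's requests.
# The Connect worker URL is assumed (the default REST port is 8083).
import requests

CONNECT_URL = "http://localhost:8083"  # assumed Connect worker address

# List all connectors currently registered on the cluster.
print(requests.get(f"{CONNECT_URL}/connectors").json())

# Create a connector from a JSON config (illustrative file sink connector).
config = {
    "name": "demo-file-sink",
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
        "tasks.max": "1",
        "topics": "demo-topic",
        "file": "/tmp/demo-sink.txt",
    },
}
print(requests.post(f"{CONNECT_URL}/connectors", json=config).json())

# Check the connector's status, then delete it.
print(requests.get(f"{CONNECT_URL}/connectors/demo-file-sink/status").json())
requests.delete(f"{CONNECT_URL}/connectors/demo-file-sink")
```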

Read More
