Data

Airflow is a powerful platform to automate and manage workflows. There are several options for deploying Airflow on AWS, including MWAA, ECS, or EKS:

  1. Deploying Airflow on Amazon Managed Workflows for Apache Airflow (MWAA): This option provides a fully managed service for Apache Airflow and is a good choice for those who want a quick and easy way to get started with Airflow on AWS.

Read More

In our last post, we explored the topic of the Data Platform on AWS. This post continues the discussion by offering an in-depth look into the central component of the data platform, the data lake, which serves as the single source of truth.

A data lake is a centralized repository for storing structured and unstructured data at any scale. It helps organizations effectively store, manage, and analyze growing amounts of data. Building a data lake on AWS offers cost-effective, secure storage and real-time analysis using scalable infrastructure, robust security, and analytical tools for making data-driven decisions and improving business value.

The proposed architecture is presented below with five main components: Ingestion, Storage, Processing, Metadata & Governance, and Orchestration.

Read More

As AI continues to impact the world, the importance of data in business decision making has become increasingly apparent. Data also offers the potential to deliver greater value with less effort. To fully realize these benefits, it is essential to prioritize the development of a robust data platform architecture.

This series begins with the goal of constructing a comprehensive data platform on AWS, designed to meet the diverse needs of companies from startups to enterprises. Our objective is to create a platform that is scalable, reliable, secure, flexible, and cost-effective.

Read More

Hadoop can handle very large files, but it runs into performance issues with too many small files. The reason is explained in detail here. In short, every single file stored on a data node needs about 150 bytes of RAM on the name node. The higher the file count, the more memory is required, which consequently degrades the performance of the whole Hadoop cluster.
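
To put that in perspective, here is a back-of-the-envelope sketch in plain Python, assuming the commonly cited ~150 bytes per metadata object and that each file contributes one file entry plus one entry per block:

```python
# Rough estimate of NameNode heap consumed by HDFS metadata.
# Assumption: ~150 bytes per metadata object (file, directory, or block).
BYTES_PER_OBJECT = 150

def namenode_memory_gb(num_files: int, blocks_per_file: int = 1) -> float:
    """Each file contributes one file object plus one object per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT / 1e9

# Roughly the same amount of data, packed differently:
print(namenode_memory_gb(100_000_000, blocks_per_file=1))   # 100M small files -> ~30 GB
print(namenode_memory_gb(1_000_000, blocks_per_file=100))   # 1M large files   -> ~15 GB
```

Even though both layouts hold roughly the same number of blocks, the extra 100 million file objects double the NameNode's memory footprint.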

Read More

Spark is powerful not only for batch data processing but also for streaming. From version 2.x, Spark provides a new stream processing paradigm called Structured Streaming, built on top of the Spark SQL library. This makes stream processing easier for developers compared to the DStream API of earlier versions. This post walks through the basics needed to get started with Spark Structured Streaming, and covers the setup to work with the most common streaming technology, Kafka.
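
As a taste of what the post covers, here is a minimal sketch of a Structured Streaming job reading plain-text messages from Kafka (the broker address and the topic name "events" are placeholders, and the spark-sql-kafka connector package must be available on the classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

# Subscribe to a Kafka topic; Kafka delivers key/value as binary columns.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
      .option("subscribe", "events")                         # placeholder topic
      .load())

# Cast the binary value to a string for plain-text messages.
messages = df.selectExpr("CAST(value AS STRING) AS message")

# Print each micro-batch to the console.
query = (messages.writeStream
         .format("console")
         .outputMode("append")
         .start())
query.awaitTermination()
```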

Read More

Recently, I worked on a project to consume Kafka messages and ingest them into Hive using Spark Structured Streaming. I mainly use Python for data pipeline construction, and this project was no exception.

Everything went smoothly at the beginning, when the first Spark Structured Streaming job was launched to read simple messages in raw text format from the Kafka cluster. The problem arose when I tried to parse the real Kafka messages serialized in Avro format.
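
For illustration, here is a minimal sketch of the Avro decoding path, assuming plain Avro payloads and a hypothetical User schema; the from_avro helper requires the spark-avro package, and messages produced through a Confluent Schema Registry carry an extra 5-byte header that must be stripped before decoding:

```python
from pyspark.sql import SparkSession
from pyspark.sql.avro.functions import from_avro  # requires the spark-avro package

spark = SparkSession.builder.appName("kafka-avro").getOrCreate()

# Hypothetical Avro schema describing the message payload.
avro_schema = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id",   "type": "long"},
    {"name": "name", "type": "string"}
  ]
}
"""

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
      .option("subscribe", "users")                          # placeholder topic
      .load())

# Decode the binary Kafka value into a struct column, then flatten its fields.
parsed = df.select(from_avro(df.value, avro_schema).alias("user")).select("user.*")

query = parsed.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```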

Read More

To quickly launch Spark and Kafka for local development, Docker is the top choice thanks to its flexibility and isolated environment. It saves a lot of time otherwise spent manually installing packages and resolving version conflicts.

Read More

This post provides a general setup to start Spark development on a local computer. This is helpful for getting started, experimenting with Spark functionality, or even running a small project.
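
As a preview, the smallest possible local setup looks roughly like this (assuming PySpark is installed, e.g. via pip; local[*] runs Spark inside the current process using all available cores):

```python
from pyspark.sql import SparkSession

# Run Spark inside this process, using all local cores.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("local-dev")
         .getOrCreate())

# A tiny DataFrame just to confirm the session works.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.show()

spark.stop()
```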

Read More
