The goal of this project is to collect and visualize the stock price of all tickers in Vietnam. There is quite limited access to API for a single business user, this project aim at scrap data from website, clean, extract and load into data warehouse. The final product is a maintainable/reliable data pipeline with exposed analytic dashboard hosted on cloud, and end authorized users can access to 24/7 with daily updated data.
Data are collected for almost 10 years from 2010 to support the analytic (such as back-test or trading strategy for a particular ticker) steps afterward. For the scope of this project, the stock price will be visualized in time series domain for the monitor purpose only.
To prove the working concept with reasonable cost, below is the proposed architecture solution.
The whole picture derives from 4 main components:
This project can be accomplished from 2 data sources: ticker URL & price URL hosted by VNDirect. This is where all Vietnam tickers officially registered.
Airflow is chosen to play the orchestration role to run ETL and scheduling jobs. Airflow is dockerized and deployed on Digital Ocean cloud service. Airflow docker is selected to facilitate CI/CD tasks later on.
Refer here: Airflow Docker Repository
Up to now, there are 1710 tickers are registered and ~2.4M records stored on Postgres database maintained by Digital Ocean service. Since there are not so many users access at the moment and the data size is not quite significant big, this selection is cost effective while not affect the performance.
Because of time constraint, only visualization is implemented for this project to visualize the colleted data from Postgres database. Future feature will be Data Analytics with Spark by leveraging available data for 10 years to conduct the backtest as an example.
The superset visualization tool is also dockerized and hosted on EC2 t2.micro which is free up to 1 year service. This option is very optimal since visualization task is not a heavy hardware consumption.Refer here: Superset Docker Repository
As mentioned before, there are 2 main data sources including ticker and price. The data are in html format which required the ELT job to include some special techniques to extract needed information.
Data model is simply described as below.
Although the data model looks simple, the heavy task of ETL job is to extract the needed data from html format. In this project, Selenium and BeautifulSoup libraries are used to automate the process of crawling stock data.
The pipeline includes 2 main steps: ticker ingestion and price ingestion. At price ingestion step, it is devided into different flows by exchange to enable mutithread processing for better performance. Within each thread, the data quality check has been includeded before every insert command.
Full source code can be found here: Github