Data, Drives, and Donuts

A personal blog about some hobbies of mine.

Vacation Project: Self-Healing Data Stream

June 16, 2020

Yesterday I started a two-week vacation from Amazon Web Services. While I am excited to take a break, I want to use some of this time to play with AWS technologies I have not yet used. I wanted to go beyond tutorials, so I came up with a small project: a self-healing data stream. I will create routine data streams that are processed continuously into downstream facts and aggregations. The challenge is that the data capture mechanism has an intentional flaw that may result in incomplete data sets being used by downstream clients. That requires building an “auditor” to step in, analyze the data sets, and take corrective action if data quality is impaired.

Planned Data Operations

I will eventually draw a proper diagram, but here is the hand-drawn one that has helped me shape up this project:

[Hand-drawn diagram of the basic data flow]

High-level operations include:

  • Data Acquisition: Pull some financial stock data on a regular cadence (hour? minute? exact timing TBD), write the payloads into a data stream, store the data at rest in S3, then use data lake technologies to enable analytics and downstream processing.
  • Data Transformation: Continuously read from the data lake to transform granular data into data sets representing new facts and aggregations. This involves repartitioning the data into partitions that better align with downstream queries (e.g., partition by stock symbol as opposed to ingestion time).
  • Data Validation: Start an audit process to learn the current state of data quality (rough sketch after this list). If data quality is impaired, the audit process should kick off:
    • Data Repair: Reuse the Data Acquisition steps to obtain any missing data chunks.
    • Data Resynthesis: Reuse the Data Transformation steps to rematerialize transformed data sets impacted by the repair operation.
    • Alarm: Fire an alarm if repair and/or resynthesis attempts fail or exhaust their retry attempt count.
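
To make the audit idea concrete, here is a rough sketch of the kind of gap check I have in mind. The bucket name, key layout, symbol list, and hourly cadence are all placeholder assumptions at this point; the only real piece is the S3 listing call.

```python
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")

BUCKET = "stock-data-lake"          # placeholder bucket name
SYMBOLS = ["AAPL", "AMZN", "MSFT"]  # placeholder watch list


def expected_keys(hours_back: int = 24) -> list[str]:
    """Build the raw-data object keys the auditor expects to find,
    assuming one object per symbol per hour under raw/<symbol>/<YYYY/MM/DD/HH>/."""
    now = datetime.now(timezone.utc).replace(minute=0, second=0, microsecond=0)
    keys = []
    for h in range(1, hours_back + 1):
        ts = now - timedelta(hours=h)
        for symbol in SYMBOLS:
            keys.append(f"raw/{symbol}/{ts:%Y/%m/%d/%H}/quotes.json")
    return keys


def find_gaps(hours_back: int = 24) -> list[str]:
    """Return expected keys missing from S3 -- the candidates for repair."""
    missing = []
    for key in expected_keys(hours_back):
        listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=key, MaxKeys=1)
        if listing.get("KeyCount", 0) == 0:
            missing.append(key)
    return missing


if __name__ == "__main__":
    gaps = find_gaps()
    if gaps:
        print(f"Data quality impaired: {len(gaps)} missing chunk(s)")
        # here the auditor would re-run acquisition for each gap, then
        # re-run transformation, and fire an alarm if retries are exhausted
    else:
        print("All expected chunks present")
```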

Planned Technologies

Starting Point

I have been using Finnhub’s API to obtain stock data. Per the API documentation, I can get live stock quotes or pull “candle” data for historical quotes.
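
To get a feel for the response shapes, here is a minimal sketch of the kind of local client I am poking at. It uses the /quote and /stock/candle endpoints from Finnhub’s documentation; the FINNHUB_TOKEN environment variable is just my own convention for keeping the API key out of the code.

```python
import os
import time

import requests

FINNHUB_BASE = "https://finnhub.io/api/v1"
TOKEN = os.environ["FINNHUB_TOKEN"]  # free API key from finnhub.io


def get_quote(symbol: str) -> dict:
    """Fetch a live quote (current, open, high, low, previous close)."""
    resp = requests.get(
        f"{FINNHUB_BASE}/quote",
        params={"symbol": symbol, "token": TOKEN},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()


def get_candles(symbol: str, resolution: str, start: int, end: int) -> dict:
    """Fetch historical OHLCV "candle" data between two Unix timestamps."""
    resp = requests.get(
        f"{FINNHUB_BASE}/stock/candle",
        params={
            "symbol": symbol,
            "resolution": resolution,  # e.g. "1", "5", "60", "D"
            "from": start,
            "to": end,
            "token": TOKEN,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    now = int(time.time())
    print(get_quote("AAPL"))
    print(get_candles("AAPL", "60", now - 24 * 3600, now))
```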

Right now I am experimenting with a local client to get an understanding of how the API responds. Once I have something basic written that moves data from the API to a data stream, I will move the code to GitHub and blog my progress.
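
For that “API to data stream” hop, I expect something along these lines. This sketch assumes Kinesis Data Streams (nothing is decided yet) and a stream named stock-quotes, which is a hypothetical name.

```python
import json

import boto3

kinesis = boto3.client("kinesis")


def publish_quote(symbol: str, quote: dict) -> None:
    """Write one quote payload to the stream, keyed by symbol so all
    records for a given ticker land on the same shard."""
    kinesis.put_record(
        StreamName="stock-quotes",  # hypothetical stream name
        Data=json.dumps({"symbol": symbol, **quote}).encode("utf-8"),
        PartitionKey=symbol,
    )
```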

I suspect much of this will change along the way… so here we go!