Data – Data, Drives, and Donuts

By The Numbers: Thru 2021-08-28

August 28, 2021

Departed Chicago and arrived near Boston. To date that is 3.8k miles driven with 49.7k records captured. And yes, there were donuts.

By The Numbers: Thru 2021-08-24

August 24, 2021

During my road trips, I like to write “by the numbers” themed posts where I share a little bit about how the data is shaping up during the trip. Here’s the first update for the 2021 Epic Road Trip!

Road trip delay, so let's build a data pipeline!

August 18, 2021

Today was supposed to mark the beginning of my road trip. Oops! I am grateful for the delay as I got to see some great people and build some data goodness instead.

New blog, who dis?

August 16, 2021

Ok so “new blog” is a stretch. This is the same blog that’s hosted via GitHub Pages but with a new backend and purpose. Check out this post to learn about recent updates to my blog and the new stuff I will be posting here.

Vacation Project: Final Code Update

June 27, 2020

Final vacation project update! Decided to post it as a video instead of a blog post. Now I’m going to let it run for a week and then measure how well my process did at repairing a data stream.

Vacation Project: Do a little step function dance

June 24, 2020

Today’s accomplishment was crafting the first cut of a Step Function deployed via SAM & CloudFormation. I went head first into writing code… and quickly realized my previous drawing needed some more love. I redrew my previous step function so I could track the input parameters and detail the decision points. Here is that new drawing along with how AWS visualized it via the Step Functions console:

Vacation Project: Parsing results from Athena

June 23, 2020

Part of today was spent at Mount Rainier (Link: Photos in iCloud), so I did not put in a full day’s effort. Today’s updates (Link: Commits in GitHub) involve parsing the response from Athena’s get_query_results() method. It is not pretty but it does the job:

Vacation Project: Weekend Update

June 22, 2020

I said I was not going to work on this project over the weekend. That was a lie: I pushed a few commits into the repo through the weekend. Last I wrote, I was a bit frustrated by the response structure provided by Athena’s get_query_results() method. This response is a row-based dictionary where each element lists the related columns’ values. It is probably the simplest way to share tabular results that do not have a primary key, but it flies in the face of the column-oriented data types that have become a modern standard.
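To make the row-versus-column point concrete, here is a minimal sketch of pivoting a row-based response into columns. It is not the code from the repo. It assumes the documented response shape (ResultSet, Rows, Data, VarCharValue, ResultSetMetadata, ColumnInfo); the function name rows_to_columns and the sample symbol/price values are made up for illustration.

```python
def rows_to_columns(response):
    """Pivot a row-based get_query_results() response into {column: [values]}."""
    result_set = response["ResultSet"]
    names = [col["Name"] for col in result_set["ResultSetMetadata"]["ColumnInfo"]]
    rows = result_set["Rows"]

    # For SELECT queries the first row usually repeats the column names; drop it.
    if rows and [cell.get("VarCharValue") for cell in rows[0]["Data"]] == names:
        rows = rows[1:]

    columns = {name: [] for name in names}
    for row in rows:
        for name, cell in zip(names, row["Data"]):
            # A cell without VarCharValue is treated as a missing value (None).
            columns[name].append(cell.get("VarCharValue"))
    return columns


# Hypothetical sample shaped like a single page of results.
sample = {
    "ResultSet": {
        "Rows": [
            {"Data": [{"VarCharValue": "symbol"}, {"VarCharValue": "price"}]},
            {"Data": [{"VarCharValue": "AMZN"}, {"VarCharValue": "2675.01"}]},
            {"Data": [{"VarCharValue": "MSFT"}, {}]},
        ],
        "ResultSetMetadata": {
            "ColumnInfo": [
                {"Name": "symbol", "Type": "varchar"},
                {"Name": "price", "Type": "double"},
            ]
        },
    }
}

columns = rows_to_columns(sample)
print(columns)
```

Every value comes back as a string, so any type conversion would still have to happen afterwards, and a real query may span several pages that need to be stitched together via NextToken.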

Vacation Project: Day Three

June 19, 2020

You see this image below? It scares me. This is how a query result looks when you ask Athena to grab it for you. It mimics rows and columns the way a person would think of a query result. The raw JSON file is available at the end of this post. Let’s take a step back and talk about how I got here.

Vacation Project: Day Two

June 18, 2020

At this point, all of the infrastructure work is complete and I am pulling stock data every minute for 11 stocks. The biggest addition since yesterday is a time-triggered Lambda that queues up stock data requests for another Lambda to go get. The results get moved into a data stream and then stored in a data lake. Now I have a data catalog available that lets me query the data with Amazon Athena (a serverless query service) to do some basic analytics. The latest hand-drawn monstrosity of an architecture diagram looks like this:

Vacation Project: Day One Learning

June 17, 2020

The basic infrastructure is complete: code grabs stock data from Finnhub and pushes the result into a data stream where it eventually gets stored in a data lake. The API secret token is stored in AWS Secrets Manager and never exposed. Everything is deployed via CloudFormation; you can start looking at my code on GitHub.

Vacation Project: Self-Healing Data Stream

June 16, 2020

Yesterday I started a two-week vacation from Amazon Web Services. While I am excited to take a break, I want to use some of this time to play with AWS technologies I have not yet used. I wanted to go beyond tutorials so I came up with a small project: a self-healing data stream. I will create routine data streams that are processed continuously to create downstream facts and aggregations. The challenge is that the data capture mechanism has an intentional flaw that may result in incomplete data sets being used by downstream clients. This requires building an “auditor” to step in, analyze the data sets, and take corrective actions if data quality is impaired.
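As a rough illustration of what the “analyze the data sets” step of an auditor could look like, here is a small sketch that flags missing minutes in a window. It is an assumption-heavy example, not the project's implementation: the one-record-per-minute cadence is borrowed from the Day Two post, and the name find_missing_minutes and the sample timestamps are invented.

```python
from datetime import datetime, timedelta


def find_missing_minutes(timestamps, start, end):
    """Return each minute in [start, end] that has no captured record."""
    # Truncate every captured timestamp to the minute it falls in.
    seen = {ts.replace(second=0, microsecond=0) for ts in timestamps}

    missing = []
    current = start.replace(second=0, microsecond=0)
    while current <= end:
        if current not in seen:
            missing.append(current)
        current += timedelta(minutes=1)
    return missing


# Hypothetical capture times for one stock; 09:32 and 09:33 were never captured.
captured = [
    datetime(2020, 6, 16, 9, 30, 2),
    datetime(2020, 6, 16, 9, 31, 5),
    datetime(2020, 6, 16, 9, 34, 1),
]
gaps = find_missing_minutes(
    captured,
    start=datetime(2020, 6, 16, 9, 30),
    end=datetime(2020, 6, 16, 9, 34),
)
print(gaps)
```

The returned list is what a corrective step would act on, for example by requesting the data again for just those minutes.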