A personal blog about some hobbies of mine.
June 22, 2020
I said I was not going to work on this project over the weekend. That was a lie: I pushed a few commits to the repo anyway. Last I wrote, I was a bit frustrated by the response structure returned by Athena’s get_query_results() method. The response is a row-based dictionary: a list of rows, where each row lists its values in column order. It is probably the simplest way to share tabular results that do not have a primary key, but it flies in the face of the column-oriented data structures that have become a modern standard.
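To make the row-versus-column complaint concrete, here is a minimal sketch of pivoting one page of a get_query_results() response into a column-oriented dictionary. The function name rows_to_columns() and the sample values are made up for illustration; only the ResultSet / Rows / Data / VarCharValue nesting comes from the Athena response format.

```python
def rows_to_columns(response):
    """Pivot a row-based Athena result page into a dict of column lists."""
    rows = response["ResultSet"]["Rows"]
    if not rows:
        return {}
    # For a SELECT query, the first row of the first page holds the column names.
    header = [cell.get("VarCharValue") for cell in rows[0]["Data"]]
    columns = {name: [] for name in header}
    for row in rows[1:]:
        for name, cell in zip(header, row["Data"]):
            # A NULL cell arrives as an empty dict, so fall back to None.
            columns[name].append(cell.get("VarCharValue"))
    return columns


# Hypothetical sample shaped like an Athena response (values are invented).
sample_response = {
    "ResultSet": {
        "Rows": [
            {"Data": [{"VarCharValue": "symbol"}, {"VarCharValue": "missing_minute"}]},
            {"Data": [{"VarCharValue": "ABC"}, {"VarCharValue": "2020-06-19 14:31:00"}]},
            {"Data": [{"VarCharValue": "XYZ"}, {}]},
        ]
    }
}

columns = rows_to_columns(sample_response)
```

A dictionary of equal-length lists like this can be handed straight to pandas.DataFrame(columns). The sketch only covers a single page of results and leaves every value as a string.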
Frustrations aside, I am still eager to figure out some type of data orchestration solution to routinely repair the data stream. I am going to scope the effort down to focus on using Step Functions to perform the following data processing steps:

To make it easier, I went ahead and took the dive into AWS Lambda Layers. Contrary to my initial expectations, Layers are surprisingly easy to incorporate into a serverless app. I can load Pandas as a dependency into a Layer and then let my Lambda Functions rely on that Layer. That makes Pandas available without bloating each Lambda Function’s own deployment package. Now that Layers are in play, I can also treat the Layer as a ‘Helper Layer’ that stores common methods in one file instead of copy/pasting them into each individual Lambda Function.
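As a rough sketch of the ‘Helper Layer’ idea, a shared module might look like the file below. The file name helper.py, the body of get_environ_variable(), and the QUEUE_URL variable are assumptions for illustration, not the actual code in my repo. What I can rely on is that, for Python runtimes, Lambda unpacks a Layer under /opt and puts /opt/python on the import path.

```python
# helper.py: packaged inside the Layer as python/helper.py.
# Lambda exposes /opt/python on the import path for Python runtimes,
# so any function with the Layer attached can simply `import helper`.
import os


def get_environ_variable(name, default=None):
    """Hypothetical helper: read a setting from the environment or fail loudly."""
    value = os.environ.get(name, default)
    if value is None:
        raise KeyError(f"Environment variable {name!r} is not set")
    return value


# In a Lambda Function that has the Layer attached, the handler file would do:
#
#   from helper import get_environ_variable
#
#   def handler(event, context):
#       queue_url = get_environ_variable("QUEUE_URL")  # hypothetical variable name
#       ...
```

The point is that the helper lives in one place, and each function only carries its own handler code.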
I made the following changes/updates between now and my prior post:
- Common methods in the Helper Layer: get_interested_stocks(), get_environ_variable(), send_to_sqs(), and so on.
- generate_data_check_query() to generate a query that asks Athena to report the missing minutes of data between a start and end range.
- submit_athena_query() to accept a query and return a query execution ID.
- wait_for_athena_results() to accept a query execution ID and poll until the query run is complete.
- get_athena_query_results() to retrieve query results in the row-based dictionary structure.
- A method that takes the response from get_athena_query_results() and forms it into a Pandas data frame.
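The generate, submit, and wait steps above could be sketched roughly as follows. This is not the code from my repo: the function signatures, the table layout (a timestamp column called ts), and the polling defaults are all assumptions. The client argument is expected to behave like a boto3 Athena client, whose start_query_execution() and get_query_execution() calls are used here.

```python
import time


def generate_data_check_query(table, start, end):
    """Build SQL listing each minute in [start, end] that has no row in `table`.

    Assumes a hypothetical timestamp column named `ts`. Values are interpolated
    directly, so only pass trusted inputs.
    """
    return (
        "SELECT expected.ts AS missing_minute "
        f"FROM UNNEST(sequence(timestamp '{start}', timestamp '{end}', "
        "interval '1' minute)) AS expected(ts) "
        f"LEFT JOIN {table} AS actual ON actual.ts = expected.ts "
        "WHERE actual.ts IS NULL "
        "ORDER BY expected.ts"
    )


def submit_athena_query(client, query, database, output_location):
    """Start the query and return its execution ID."""
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


def wait_for_athena_results(client, execution_id, poll_seconds=1.0, max_polls=60):
    """Poll until the query succeeds; raise if it fails, is cancelled, or times out."""
    for _ in range(max_polls):
        status = client.get_query_execution(QueryExecutionId=execution_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return state
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query {execution_id} ended as {state}")
        time.sleep(poll_seconds)  # still QUEUED or RUNNING
    raise TimeoutError(f"Athena query {execution_id} did not finish in time")
```

In a real run, the client would come from boto3.client("athena"), and the returned execution ID would then be passed to get_athena_query_results(). Passing the client in as an argument also makes it easy to swap in a stub when testing without AWS access.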