Why and How We Moved to the Modern Data Technology Stack
Why Talk About This:
At the heart of Haackalytics, I want to figure out how to build the best algorithms to help soccer teams win and players improve. Any time spent on engineering is time not spent on soccer algorithms, so it's important to get the tech right. Getting the tech right means that data and analysis are consistent, and that when I serve a result, it is correct. The modern stack I use makes all of my experiments repeatable and all data available to anyone on my team. A good data tech stack lets you do analysis at a fraction of the development cost of running data locally or in a standard database.
What Does Our Stack Do:
Our stack enables us to rapidly prototype new experiments in soccer analytics. It makes experiments repeatable and versioned, and data accessible to anyone with the right access. It lets apps be built on top of it, because data can be moved directly into the app. It keeps our data current through scheduled jobs that ingest and load new data. It also keeps version histories of all of our data, so if corrupted data enters our system, reverting is simple. We also have tests to make sure we are serving the most accurate data possible.
What Are the Key Components of an Analytics Stack:
This is a biased subset of what I think is really important for building any sort of analytics stack. We really believe in:
- Standardized loaders (scrapers, S3 dumps, programmatic uploads, Lambda functions, event streams, etc.)
- A data warehouse that enables exploration (Snowflake, Redshift)
- Standard ELT jobs through a scheduler (DBT)
- Dashboard and charting functionality
- A machine learning library (Metaflow) that is system independent and keeps track of the following (a minimal sketch follows this list):
  - Repeatable experiments
  - A copy of the data at each step of the pipeline
- Version control for code and data (Git and clever use of Snowflake/DBT)
  - All code needs to be versioned and to trigger production runs on merge to the main branch
- Serverless technology to deploy results, serve them to clients, and power different use cases (Lambda/SageMaker)
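To make the machine learning bullet concrete, here is a minimal Metaflow sketch, a toy rather than our production flow: every value assigned to `self` is snapshotted and versioned per run, which is what gives you repeatable experiments and a copy of the data at each step. The inline DataFrame is a stub; in a real flow that step would query the warehouse.

```python
from metaflow import FlowSpec, step
import pandas as pd
from sklearn.linear_model import LinearRegression

class MinutesByAgeFlow(FlowSpec):
    """Toy flow: everything assigned to `self` is persisted and
    versioned by Metaflow, so each run is fully repeatable."""

    @step
    def start(self):
        # Stub data; in the real flow this would be a warehouse query.
        self.df = pd.DataFrame({
            "age": [21, 24, 27, 30, 33],
            "minutes": [1800, 2400, 2700, 2300, 1500],
        })
        self.next(self.train)

    @step
    def train(self):
        # The fitted model is snapshotted alongside the data.
        self.model = LinearRegression().fit(self.df[["age"]], self.df["minutes"])
        self.next(self.end)

    @step
    def end(self):
        print("r^2:", self.model.score(self.df[["age"]], self.df["minutes"]))

if __name__ == "__main__":
    MinutesByAgeFlow()
```

Running `python minutes_by_age_flow.py run` records the run, and past runs (including their data snapshots) stay inspectable through the Metaflow client.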
How Does It Work:
Data starts by being loaded by one of our loaders and imported into Snowflake (at a monthly cadence, though it could be faster if necessary). Once new data is loaded into the warehouse, triggers run the ELT (extract, load, transform) jobs that create derivative views of the original raw data. Once a table is done, it is exported into a use case: a dashboard, a report to serve to a client, a machine learning model, or an app powered by Elasticsearch.
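As a minimal sketch of that trigger glue (a placeholder, not our production orchestration): a loader run can simply be followed by the DBT commands that rebuild and test the derived tables. The `run_loader` callback here is hypothetical.

```python
import subprocess

def load_then_transform(run_loader):
    """Placeholder glue: run one of our loaders, then rebuild the
    derivative views and test them before anything is served."""
    run_loader()                                # e.g. scrape + copy into the raw table
    subprocess.run(["dbt", "run"], check=True)  # rebuild derived tables/views
    subprocess.run(["dbt", "test"], check=True) # guard what we serve
```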
How the Data Flows: Transfermarkt Data Example
For this example, we will describe how data moves from raw player match appearance data on Transfermarkt to a report that shows the peak age of soccer players.
1. Data is scraped from Transfermarkt and stored in S3.
2. Data is staged in Snowflake from S3 and then copied into a table (see the staging sketch after this list).
   - This data includes metadata on when it was scraped and uploaded.
   - This table just keeps getting appended to whenever new data is ingested.
   - Each row has a unique ID, and each upload has a unique ID.
   - Fields are staged as VARIANT fields.
3. Data in Snowflake is transformed from raw to queryable data with DBT whenever new data is scraped.
4. Data is now in a queryable DBT table with the following columns, among others:
   - game_date, is_injured, injury_type, birthdate, player_id, name, row_id, minutes_played, season, age, position
5. Minutes are summed for each player at each age, in DBT.
6. The average minutes per age is computed in DBT to see at what age players log the most minutes (see the aggregation sketch after this list).
7. A chart of the data from step 6 is surfaced in a Snowflake dashboard, and a PNG is uploaded to the article.
8. Bonus: the data from step 5 is used to build a machine learning model for each player to predict how many games they will play next season (Metaflow).
9. Bonus: results are saved in a database and served back to the user through an API built on Lambda (see the handler sketch after this list).
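Here is a minimal sketch of step 2, assuming a hypothetical S3-backed stage named `transfermarkt_stage` and an append-only table `appearances_raw` with a VARIANT payload column; the credentials and object names are placeholders, not our production setup.

```python
import snowflake.connector

# All names and credentials below are placeholders.
conn = snowflake.connector.connect(
    user="LOADER", password="***", account="my_account",
    warehouse="LOAD_WH", database="RAW", schema="TRANSFERMARKT",
)
try:
    # Copy newly scraped JSON files from the S3-backed stage into an
    # append-only table; METADATA$FILENAME records where each row came from.
    conn.cursor().execute("""
        COPY INTO appearances_raw (payload, source_file)
        FROM (SELECT t.$1, METADATA$FILENAME FROM @transfermarkt_stage t)
        FILE_FORMAT = (TYPE = 'JSON')
    """)
finally:
    conn.close()
```

Because COPY INTO tracks which staged files have already been loaded, rerunning the job after a new scrape only appends the new files, which is what keeps the raw table append-only.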
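Steps 5 and 6 are just aggregations; the query below is a sketch of their shape (table and column names follow step 4), run here through the Python connector for illustration rather than as the actual DBT models.

```python
import snowflake.connector

# The shape of the DBT models in steps 5 and 6, run ad hoc for illustration.
PEAK_AGE_SQL = """
WITH minutes_by_player_age AS (                 -- step 5
    SELECT player_id, age, SUM(minutes_played) AS total_minutes
    FROM appearances
    GROUP BY player_id, age
)
SELECT age, AVG(total_minutes) AS avg_minutes   -- step 6
FROM minutes_by_player_age
GROUP BY age
ORDER BY avg_minutes DESC
"""

conn = snowflake.connector.connect(
    user="ANALYST", password="***", account="my_account",
    warehouse="QUERY_WH", database="ANALYTICS", schema="TRANSFERMARKT",
)
try:
    rows = conn.cursor().execute(PEAK_AGE_SQL).fetchall()
    print(rows[0])  # (age, avg_minutes) for the peak age
finally:
    conn.close()
```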
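And a sketch of the bonus serving path: a hypothetical Lambda handler behind API Gateway that returns a player's precomputed prediction from step 8. The storage layer is stubbed with a dict; in practice it would be a database read.

```python
import json

# Stub for the prediction store written in step 8; in practice this
# would be a database lookup keyed by player_id.
PREDICTIONS = {"12345": {"player_id": "12345", "predicted_games": 29}}

def handler(event, context):
    # API Gateway passes URL parameters in `pathParameters`.
    player_id = (event.get("pathParameters") or {}).get("player_id")
    prediction = PREDICTIONS.get(player_id)
    if prediction is None:
        return {"statusCode": 404, "body": json.dumps({"error": "unknown player"})}
    return {"statusCode": 200, "body": json.dumps(prediction)}
```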
Tooling Philosophy:
Good tools in a system are replaceable components, meaning you can't get too attached to, or too in love with, any specific tool. Additional tools are integrated based on the needs of clients and the solutions we are providing. It's also important to adjust your tooling to the scale of your data.
Credits:
A lot of credit goes to Jacopo Tagliabue and his talks on MLOps. You can see one here that revolutionized my thinking on the topic: https://www.youtube.com/watch?v=Ndxpo4PeEms.
Topics Not Fully Discussed but Crucial to Consider:
- Version control and how triggers execute (GitHub Actions)
- Development and production schemas for data
- Web app building and how to surface data from your warehouse into a web app
- Elasticsearch and how to surface data back to users
- Airflow and job scheduling
- Testing through DBT and Great Expectations
- Security and permission controls
- Environments and containers
If you want advice on any of the above feel free to drop me a line! I can be reached at chris@haackalytics.com
How Expensive Is It:
Cheap (under $100 a month). Obviously this scales with the size of your data, but most of these components are fairly cheap unless your data scales astronomically (it may get tricky for tracking data in soccer) or you need GPUs.