Building a fit-for-purpose modern data stack, part 2.
Author: Moss Pauly
Published:Feb 16, 2023
This is the second piece of a multi-part blog post by Sydney-based Zipster Moss Pauly, Senior Manager, Data Products, detailing the creation of our ANZ data platform and the team’s journey over the last 18 months.
Part one set the scene for how the team established a framework to guide decision making on the various components that now make up Zip’s stack.
In this final, longer instalment, Moss dives deep into the key decisions the team tackled when modernising Zip's data stack and, excitingly, what comes next. So grab a coffee or drink of your choice, get comfortable and dig in…
Thanks for stopping by again. As I mentioned in the first part, we didn't want to make decisions on the components in this stack lightly. Here I share an in-depth reflection on six key decisions we made regarding the stack and its component parts; hopefully you'll find the takeaways and tips useful on your own journey.
Decision #1 — How are we going to transform our data?
We started here as we wanted to nail how we store business logic and transforms to build data models in a scalable, future-proof way. We didn’t want this decision bounded by constraints on our data warehouse, ingress, or egress compatibility.
We had the following core requirements:
- The tooling needed to support DAG-style dependencies n layers deep.
- It needed to be SQL-based.
- We needed ways of tying documentation (e.g. metadata) directly to models in a way that we could check that they stayed in sync as part of our PR process.
- Capitalising on existing models for third-party data sources would be nice to have.
Dependency management and documentation were both significant pain points of our existing transformation stack, which was essentially Airflow-orchestrated SQL running through Athena on an S3 data lake. While we looked at a few options in this space, dbt was the clear winner here. The sentiment around it is fantastic, and this became clear very quickly in conversations we had with other dbt customers.
With the framework we mentioned above, we went through a POC before documenting the outcome and our recommendation to proceed with dbt. We then formalised the decision with our stakeholders.
After 18 months and with 1000+ models in production, we're pretty happy with how dbt is going. We're currently on dbt Cloud, and the new hosted IDE is a game changer. We initially hit scaling issues pretty badly with our project size, but these are mostly resolved now.
Top tips for setting yourself up for success with dbt:
- Enforce documentation and test requirements from the start with dbt_meta_testing.
- Read GitLab's dbt guidebook; it's an incredible resource.
- Spend the time to go deep on understanding macros and incremental models. They're both insanely powerful; a minimal incremental model sketch follows this list.
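To make the incremental model tip concrete, here's a minimal dbt sketch. The source, model, and column names (snowplow.events, event_id, collector_tstamp) are hypothetical placeholders rather than our actual models; the pattern of merging on a unique key and filtering with is_incremental() is the bit that matters.

```sql
-- Minimal incremental model sketch (source and column names are hypothetical).
-- The first run builds the table in full; subsequent runs only process rows
-- newer than what's already in the warehouse, merged on event_id.
{{
    config(
        materialized='incremental',
        unique_key='event_id'
    )
}}

select
    event_id,
    app_id,
    event_name,
    collector_tstamp
from {{ source('snowplow', 'events') }}

{% if is_incremental() %}
  -- {{ this }} resolves to the already-built table for this model
  where collector_tstamp > (select max(collector_tstamp) from {{ this }})
{% endif %}
```

Once you have a handful of these, pulling that timestamp filter into a macro means the logic lives in one place.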
Decision #2 — Data storage and compute
The second component that we looked at was the data warehouse itself — the compute and storage that dbt runs on. Given that we have a data science community operating in Databricks, we were keen on evaluating Delta Lake as an architecture alongside Snowflake as a data warehouse. We did extensive POCs on both, as we knew that this was not a decision we’d reverse easily and that it’d have an enormous impact on the quality of life of all data practitioners at Zip over the coming years.
There have been some entertaining fights in the data space of late about the performance of these solutions. As general advice, I'd say performance should be at the bottom of your list when it comes to making a decision. The reality is they're all fast; who really cares about basis points of speed compared to the quality of life of working with the tool day to day?
The main components we evaluated here were:
- If you wanted to run a quick query, what's the time to result? (Opening the tooling, navigating it if required, waiting for a cluster to spin up, etc.)
- What granularity of cost visibility can we have easily?
- How well written is the documentation and how easy is it to find answers to questions?
- How well does it integrate with dbt?
- What would the management impost be on the small team responsible for operating and maintaining the platform?
- How easy will it be to hire people with previous experience on this or bring people up to speed from scratch on this?
- How easily can we manage PII redaction in this stack to protect our customers’ privacy?
After evaluating all the above, we settled on Snowflake being the right choice for us. One of the bits of advice I got from talking with someone at a company that switched about a year before us was: "There's so much that really just works in it." Looking back it was close at the time, but we're really glad we made the decision that we did.
After using Snowflake for 18 months, I really couldn’t agree more with this. This is echoed throughout the data community at Zip.
See below for our high-level data warehouse architecture:
Top tips for setting yourself up for success with Snowflake:
- Spend the time to really consider your security model and work through how it can be applied with role inheritance.
- If you went with dbt as well as Snowflake, look at and leverage dbt_snow_mask to dynamically mask all PII from the start.
- Zero-copy cloning is amazing for enabling you to test model changes on a production clone. Use and abuse it (a sketch of role inheritance, masking and cloning follows these tips).
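To tie the three tips above together, here's a hedged Snowflake sketch. The role, database, schema, and column names are invented for illustration and aren't our actual security model; dbt_snow_mask automates the masking-policy step for dbt-managed models.

```sql
-- Role inheritance: grant object access to an access role once, then let
-- functional roles inherit it (all names below are illustrative only).
create role if not exists analytics_read;
grant usage on database analytics to role analytics_read;
grant usage on schema analytics.marts to role analytics_read;
grant select on all tables in schema analytics.marts to role analytics_read;

create role if not exists analyst;
grant role analytics_read to role analyst;   -- analyst inherits read access

-- Dynamic data masking: the kind of policy dbt_snow_mask applies for you.
create masking policy mask_pii as (val string) returns string ->
    case when current_role() in ('PII_READER') then val else sha2(val) end;
alter table analytics.marts.customers modify column email
    set masking policy mask_pii;

-- Zero-copy clone: a metadata-only copy of production, ideal for testing
-- dbt model changes against production-shaped data at no extra storage cost.
create database analytics_dev clone analytics;
```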
Decision #3 — Event Collection
Our previous stack's inability to deal with the volume of raw events coming in was one of the catalysts for kicking off our journey in the first place. We decided to look in detail at Segment, Rudderstack, and Snowplow. My partner in crime Adam Schmidt (Principal Engineer at Zip) and I had very strong opinions about what we needed in an event collection platform, having both worked in this space fairly extensively in the past.
- We did not want a solution where we’d constantly need discussions about what we can and can’t track due to cost.
- We must be able to enforce schema compliance on events.
- The solution must be scalable to billions of events a month and must be able to handle sudden spikes in volume gracefully.
- We wanted to have first-party ownership of our events.
We decided to do a POC of Snowplow due to a strong preference for open-source solutions in this space, and Adam deployed it to production. After extensive evaluation, we decided we were happy to take on the overhead of managing this infrastructure ourselves in exchange for not having a SaaS cost tied to events per month. This is the first decision I'll outright say is not for everyone:
- Snowplow is incredibly flexible, but that’s a blessing and a curse. There’s a bit of a learning curve in understanding how to configure and optimise the pipeline, and you need to have well-supported streaming infrastructure (in our case Kafka) at your disposal.
- Our platform and tooling are mature, and we have a reasonably sized engineering team. If you’re a small-ish organisation, start with the cloud offering.
- We've had to find some creative ways to provide tooling for developing, testing, and troubleshooting event implementations.
- In addition to deploying the pipeline, there are a few somewhat special things about the way we integrated Snowplow which I’d strongly recommend implementing, regardless of how you collect events.
Top tips for instrumenting any event collection framework:
- Traceability of every event collected to a specific system and team. We do this in Snowplow by assigning an app_id to each service. Any events that come through with an app_id not registered in our system get rejected. Additionally, every event is enriched with a service_name and team_owner based on a lookup on the app_id. This means we can trace every single event in our platform to a team owner. Even though we have 20+ services sending us page_view events, we know exactly which team owns each and every one that comes through (see the sketch after this list).
- Abstract the integration to enable centralised standardisation. We built out zip_analytics.js which is an opinionated abstraction library on the Snowplow SDK as well as a number of other analytics libraries. This makes our life so much easier than trying to coordinate 20+ teams to make a config change in their tracking implementation, and forces a predictable consistency in how event data arrives on the platform. This has already paid dividends with us rolling out a number of tweaks to how we track events over a very significant number of integrations with almost no coordination work needed.
- Ensure that the data exiting the pipeline can be loaded into your warehouse with minimal fuss, especially in the early stages where you’re iterating and extending your event models frequently. The loader should be as dynamic as possible to reduce engineering effort in adding and modifying events/contexts.
- Spend a bit of time up-front ironing out your principles for event namespaces, event names, and schema conventions. Use your most popular customer feature as a real-world test bed for how you intend to work with events going forward.
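As a rough illustration of the traceability tip, here's what the app_id enrichment could look like expressed as a dbt model. The source and model names (snowplow.events, app_id_registry) and columns are hypothetical stand-ins; in our case the enrichment happens inside the Snowplow pipeline itself, but the join logic is the same.

```sql
-- Hypothetical model illustrating app_id traceability: join raw events to a
-- registry of known app_ids so every event carries its owning service and
-- team, and unregistered app_ids are easy to spot (and reject upstream).
with events as (
    select * from {{ source('snowplow', 'events') }}
),

app_registry as (
    select app_id, service_name, team_owner
    from {{ ref('app_id_registry') }}
)

select
    e.event_id,
    e.event_name,
    e.app_id,
    r.service_name,
    r.team_owner,
    r.app_id is null as is_unregistered_app_id
from events as e
left join app_registry as r
    on e.app_id = r.app_id
```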
Decision #4 — Third-Party Data Ingestion
We deliberately evaluated solutions for first-party and third-party data separately due to the following reasons:
- Our appetite for an acceptable cost per data point is different between first-party and third-party data due to the massive differences in volume between these.
- Data modelling can theoretically be a solved problem for a lot of third-party data sources; this is not the case for first-party data.
- If you've been in the data space for a while, you've likely had to reverse-engineer the data model of a third party. Maybe this wasn't too bad (shoutout to Twilio's data structures ❤) or maybe you spent months of your life trying to model a Salesforce data feed (you have my sympathy). We were keen to capitalise on standard dbt models for third-party data sets where possible.
Fivetran was active in this space, so we decided to do a POC with them. We evaluated a few other options on paper, but Fivetran came out ahead. This decision focussed on:
- Coverage of our core third-party data feeds.
- The shape of landed data structures.
- Documentation of data structures.
- Prepackaged dbt transform availability.
In all the above situations we were happy with how Fivetran performed during our POC.
Recently, our CIO wanted to query Twilio data and pinged me about the Twilio table structure while I was getting coffee. I was able to send back a link to the Twilio ERD in Fivetran about five seconds later, and it fully explained everything. Well-documented third-party integrations are really valuable.
Top tips for Fivetran:
- We ended up pulling out some of Fivetran's prepackaged dbt transforms that we initially had in, for IDE performance reasons. I suspect this isn't as much of a problem anymore with dbt's new IDE, though. Moving forward, our sensible default is "add them when you need them, not when you can".
- Use Fivetran to send its own usage logs to your warehouse. It gives you a nice set of tables for monitoring consumption and usage centrally in your BI tool of choice; a rough query sketch follows this list.
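As a rough sketch of what that central monitoring can look like, the query below aggregates monthly active rows per connector. The schema, table, and column names (fivetran_log.usage, connector_name, measured_at, monthly_active_rows) are assumptions rather than Fivetran's exact log schema, so map them to the tables the log connector actually lands in your warehouse.

```sql
-- Hedged sketch: monthly active rows per connector from Fivetran's usage logs.
-- Schema, table, and column names are assumptions; adjust them to the tables
-- the Fivetran log connector creates in your own warehouse.
select
    connector_name,
    date_trunc('month', measured_at) as usage_month,
    sum(monthly_active_rows)         as total_active_rows
from fivetran_log.usage
group by 1, 2
order by usage_month desc, total_active_rows desc;
```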
Decision #5 — First-Party Data Ingestion
At Zip, we have a pretty significant amount of first-party data. It is highly distributed across a core monolithic SQL Server database, Kafka event streams, Postgres databases, and DynamoDB tables that back our microservice architecture. We needed a scalable way of ingesting all of these on an hourly cadence, and the cost of pushing this volume of data through Fivetran wasn't feasible for us, so we started looking elsewhere.
Similar to paying for event ingestion, we really wanted to avoid needing to have conversations about what we do and don’t ingest based on cost. We looked again at the open-source community here and discovered Airbyte, which was a relatively young open-source project that looked promising.
Our requirements were as follows:
- Support for SQL Server
- Support for CDC on SQL Server
- Support for Postgres
- Support for S3 ingestion
- Snowflake as a destination
- Cost efficiency
- Sub one-hour integration of a new data source and/or tables
- Connector-based, with a rich collection of pre-canned connectors and the ability to build our own if we wanted to
Airbyte fitted these requirements pretty well, so we started our POC, again deploying to our production environment. This journey was a little bumpier than some of our others, with one of our data engineers submitting over 10 PRs to the project to remediate issues with the MSSQL connector. The Airbyte community and core maintainers are really active and supportive, so getting the changes merged was not a problem.
That aside, we’re really happy with where we landed. Ingestion is quick, it’s stable, and it meets our needs. We’re doing hourly ingestion from a number of databases big and small and we haven’t needed to touch this much after the initial setup.
Decision #6 — Data Egress
We started a conversation with Census in this space back in the early days when we were looking at Snowflake and dbt, although this is one of the most recent additions to the data stack.
We ended up waiting until we had a number of use cases we could solve with this kind of tool to make the business case and get it across the line. We then did a full POC on both Grouparoo and Census. Grouparoo is an open-source project recently acquired by Airbyte; given our existing experience with that project and our preference for open-source solutions, it made sense to see if it would work for us. While Grouparoo initially looked promising, we decided we weren't comfortable with its maturity. The issues were primarily connection instability during scaling and the methodology for existing-record checks in the Braze connector.
I can’t wait to see what Airbyte does with Grouparoo in the near future, but it didn’t fit our needs at the time of evaluation. A lot of our use cases for data egress revolve around shipping data into Braze, our CRM platform.
The key aspects we were evaluating are below:
- Sub-3-hour updating of 4 million records in Braze
- Robust updating of only records that have changed (Braze charges per data point updated so this is a must for cost efficiency)
- Friendly UI to empower product and marketing teams to control what data ships where.
We were delighted with Census from the start and decided to move forward with them — a decision we’re really happy with. Since then, we have started working on some creative aspects with their segment builder empowering our product and marketing teams, which I’m sure we’ll share in a future blog post.
Top tips here:
- Triggering Census syncs on completion of your dbt jobs is your friend for seamless integration.
- Entity models were released recently. They're game-changing for enabling non-technical people to create segments that would normally require complex joins across a large number of data sources.
What’s next?
The above forms the core of the data platform at Zip.
However, I view building our data platform as really just the entry ticket to being a modern, data-led company. We're currently well underway with a host of next-phase activities built on this strong foundation:
- Enabling our data science teams with Feast as a feature store
- Visualising model performance using Streamlit applications
- Evaluating data catalogue options to move data trust and data definitions from being predominantly tribal knowledge to explicitly documented and easily accessible
- Monitoring and alerting for data SLA breaches, data profiling and core funnel performance across 40k merchants.
I’m proud to have been part of the journey of modernising the data stack at Zip, and couldn’t have done it without the outstanding effort of the data engineering team (special shout out to Adam Schmidt, Kiran Reddy, Nick Heron, Thomas Gao, Alex Crawford, Frank Fernando and Sivakumar Ramaswamy), data analytics team, and data science teams. Additionally, this would not have been possible without the data leadership at Zip (Tal Bergman — Director of Data and Analytics and Adam Finger — CIO) recognising the opportunity in this space and backing our journey into it.