Cloud Data Warehouses

A guide to choosing a modern cloud data warehouse

Nate Robinson | March 18, 2025

Executive Summary

At Snowpack we recommend a narrow set of cloud data warehouses to our clients, each of which we are intimately familiar with implementing and managing. We stand behind each of these databases for different reasons, and while all of them are very powerful, selecting the right data warehouse for your needs is important to the long-term scalability and financial viability of your data organization.

In this article we will discuss three data warehouses in depth: Google BigQuery, Snowflake, and MotherDuck. We will also note alternative offerings worth exploring, with which we have some experience but not the depth of expertise to offer a truly informed recommendation.

  • Google BigQuery: Most expensive option with a powerful ecosystem
  • Snowflake: Best choice for most mid-market customers
  • MotherDuck/DuckDB: Cost-friendly option for small organizations and fast-moving DS teams

Introduction To Cloud Data Warehouses

We should first explain what cloud data warehouse technologies offer and why they are paramount to analytical success. Data warehouses are best understood by how sharply they differ from the traditional databases you are likely familiar with, referred to as OLTP (online transaction processing) or transactional databases. Transactional databases include traditional SQL databases like MySQL and PostgreSQL, as well as modern NoSQL databases (MongoDB, Redis, Cassandra, etc.). These databases are optimized for atomic read/write operations that follow the ACID properties, which give us confidence in the deterministic behavior of those operations. OLTP databases power most modern applications: when you create a "new user" or view a payment, those items are stored in and returned from a transactional database, usually running on a remote server.

Analytics databases, conversely, are used by organizations that need to analyze and report on large amounts of data with summary-type operations. We refer to these as OLAP (online analytical processing) databases, or analytical databases for short. Where an OLTP database is optimized for retrieving or updating single rows at a time, an OLAP database is optimized for performing mathematical operations on thousands, even millions, of rows at a time, an operation that would grind a transactional database to a halt. These operations might look like "calculate the sum of invoices, less expenses, on all invoices generated in the last two years." While analytical databases can handle complex operations, they cannot handle the high volume of input and output operations that a transactional database can, nor will they support the ACID guarantees needed for a production application.

The simplest way to think about this is that traditional transactional databases operate row by row, with transactions generally inserting, retrieving, or editing one row at a time. Analytics databases operate on columns, performing calculations on individual columns while touching millions of underlying rows.
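
A minimal sketch of the two query shapes, using an in-memory DuckDB database; the invoices table, its columns, and its rows are made up purely for illustration.

```python
import duckdb

con = duckdb.connect()  # in-memory analytical database

con.execute("""
    CREATE TABLE invoices (
        id INTEGER, customer TEXT, amount DOUBLE,
        expenses DOUBLE, created_at DATE
    )
""")
con.execute("""
    INSERT INTO invoices VALUES
        (1, 'acme',   1200.0, 150.0, DATE '2024-06-01'),
        (2, 'globex',  830.0,  90.0, DATE '2023-11-15')
""")

# OLTP-shaped operation: fetch one row by its key.
print(con.sql("SELECT * FROM invoices WHERE id = 1"))

# OLAP-shaped operation: aggregate a column across many rows,
# e.g. "sum of invoices, less expenses, over the last two years".
print(con.sql("""
    SELECT SUM(amount - expenses) AS net_invoiced
    FROM invoices
    WHERE created_at >= CURRENT_DATE - INTERVAL 2 YEAR
"""))
```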

Now, what separates an analytics database from a cloud data warehouse? Generally, networking and cloud computing. An analytics database can be embedded in a device or run locally without network connectivity, as you might do with a DuckDB database, but cloud data warehouses live on remote servers, giving them access to near-infinite "horizontal" scaling and enabling access for many users at the same time. Most modern options are cloud data warehouses; it is rare for a modern analytics organization to run a purely offline analytical database, and those are typically reserved for specific high-sensitivity datasets. From here on you may see us refer to "cloud data warehouses," "analytics databases," and "OLAP databases" interchangeably; these terms are mostly synonymous, with only minor details separating their meanings at the margin.

Features of Cloud Data Warehouses

There is a base set of features we would expect from any cloud data warehouse:

  1. Ability to import/load data
  2. Integrated Storage
  3. Ability to query data with a SQL type language
  4. A built-in SQL IDE
  5. Ability to communicate with external connectors such as BI tools

Data Warehouses

We discuss three major data warehouses in depth: BigQuery, Snowflake, and MotherDuck. Any of the three will provide high utility; however, some may be more cost-effective for a given organization, while the features and scale of others may be a better fit elsewhere. We will also provide a quick blurb on other cloud data warehouses that are viable options, though we do not generally recommend them, for a variety of reasons.

Google BigQuery

Overview

Google's BigQuery was the original petabyte-scale cloud data warehouse. It grew out of Google's pioneering work on distributed data processing (MapReduce and the Dremel query engine), in which work is split across many isolated units before the results are eventually combined. This allows horizontal scaling for queries run via BigQuery. BigQuery's terminology for how queries are executed takes a bit of time to understand. When you execute a query, BigQuery registers a job; that job is assigned slots, which are units of "compute space" on Google's servers used to execute the work. The data is then retrieved, subdivided across slots, calculated, combined, and returned to the user. This means compute on BigQuery is nearly infinitely scalable (Google refers to this load sharing as Massively Parallel Processing, or MPP); however, you are competing for slots, or virtual compute units, with both your own queries and other users' queries on Google's servers.
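
To make the job-and-slot terminology concrete, here is a minimal sketch using the official google-cloud-bigquery Python client; the project, dataset, and table names are placeholders and the query is illustrative only.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # hypothetical project

# client.query() registers a job; BigQuery assigns slots behind the scenes,
# fans the work out across them, and combines the results.
job = client.query("""
    SELECT customer, SUM(amount) AS total
    FROM `my-gcp-project.sales.invoices`
    GROUP BY customer
""")

for row in job.result():  # blocks until the job finishes
    print(row.customer, row.total)

# How much slot time the job consumed across Google's servers.
print("slot milliseconds used:", job.slot_millis)
```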

Cost Structure

While the architecture under the hood is complicated, BigQuery's cost model is quite simple, with fees generated by compute and storage. Google charges $6.25 per TiB of data scanned for compute, and $0.02 per GiB per month for data stored. Data storage is quite affordable: 1 TiB of data stored would cost roughly $20 per month to keep in BigQuery, and we find most moderate- to large-scale businesses are not storing much more than this. Compute is significantly more expensive and has the potential to scale out of control if poorly managed. For example, if we had a 1 TiB table, each query against that table would cost $6.25. Now imagine that table linked up to a dashboard: each subsequent refresh of the dashboard costs another $6.25, a problem we've seen in many organizations with ballooning BigQuery costs.
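
As a practical guardrail, a dry run can estimate the bytes a query would scan before you pay for it. This sketch uses the google-cloud-bigquery client and the $6.25/TiB on-demand rate quoted above; the table name is hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()
config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

# A dry-run job costs nothing and reports how many bytes would be scanned.
job = client.query(
    "SELECT * FROM `my-gcp-project.sales.invoices`",  # hypothetical table
    job_config=config,
)

tib_scanned = job.total_bytes_processed / 2**40
print(f"~{tib_scanned:.3f} TiB scanned, ~${tib_scanned * 6.25:.2f} on demand")
```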

On the one hand, scanning large amounts of data can become expensive quickly; on the other hand, we do not need to plan how much compute to allocate, nor worry about the implications of user scale. We could have 200 analysts all executing queries against BigQuery without ever having to worry about whether there is enough compute to handle them. There are also levers to pull for controlling how much data queries scan. BigQuery gives us a nifty feature called partitions, which lets queries explicitly operate against limited windows of data, reducing the data scanned and thus the cost. In a sufficiently mature data organization, transformation layers will also do the "heavy lifting" of data prep on the large tables and only expose users to relatively small tables, which significantly reduces costs. Finally, if an organization is pushing massive amounts of compute to BigQuery, it can reduce costs and improve query performance by using slot reservations. Slot reservations work by purchasing a commitment to a certain number of slots in a pool called a reservation. Compute can then be run out of that reservation, which is billed at a flat rate rather than per byte scanned. This allows you to control costs on large, predictable compute operations, though the cost improvements are quite opaque.
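
Here is a rough sketch of what partition pruning looks like in practice, assuming a hypothetical events table partitioned by an event_date column; only the partitions covered by the WHERE clause are scanned and billed.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Assumes `events` was created with PARTITION BY event_date.
# Without the date filter BigQuery scans the whole table; with it,
# only the last 30 days of partitions are read (and billed).
job = client.query("""
    SELECT COUNT(*) AS recent_events
    FROM `my-gcp-project.analytics.events`
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
""")

print(list(job.result()))
```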

Ecosystem

Beyond the BigQuery engine itself, Google Cloud's ecosystem supports many additional features that make BigQuery a compelling choice for larger organizations. If you are already on Google Cloud Platform, you can handle data permissions and governance through IAM, and natively integrate with bucket storage in Google Cloud Storage for storing and ingesting data. BigQuery also supports exporting data to Google Sheets and Looker, which can easily solve reporting problems within the Google ecosystem.
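
As an illustration of that native integration, here is a minimal sketch of loading a Parquet export from Google Cloud Storage into BigQuery with the Python client; the bucket, dataset, and table names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()
config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)

# Ingest a Parquet file sitting in a GCS bucket into a BigQuery table.
load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/invoices.parquet",   # hypothetical bucket path
    "my-gcp-project.sales.invoices",             # hypothetical destination table
    job_config=config,
)
load_job.result()  # wait for the load to complete
```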

Snowflake

Overview

Snowflake is a proprietary cloud database technology that has exploded in the industry over the past five years. It is more of a hybrid than BigQuery: it requires some configuration of compute and won't scale quite as seamlessly with large user bases, but we find it is often much more cost-effective for midsized organizations and has a more analytics-focused ecosystem. The other benefit of Snowflake is that it is cloud agnostic, so you are not shackled to vendor lock-in with Google Cloud or AWS. Snowflake operates on a "hybrid" architecture in which you pay for access to different sized "warehouses" (compute clusters), while storage and client services run as separate layers behind the scenes to reduce blocking between operations. This also means different warehouses can be applied to different kinds of problems: it's possible to have multiple warehouses, one more powerful, and thus more expensive, for large bulk operations, and another, smaller warehouse for daily user interaction.

Cost Structure

Snowflake's pricing model is less transparent than other cloud data warehouse providers', which often leads to trepidation for decision makers. The costs are still split between storage and compute. Storage is similar to BigQuery, costing $23/TiB per month; however, this is calculated post-compression, which Snowflake applies automatically, so the billed storage is often significantly less than the amount of raw data uploaded. Compute is tracked via "credits." The price of a credit ranges from $2 to $4 depending on the level of service your organization needs, and the size of your warehouse determines the number of credits billed per hour of usage. We've included a small table below showing the cost per hour of usage for different warehouse sizes on different Snowflake tiers.

Snowflake Hourly Cost by Service Tier and Warehouse Size¹
¹ Reference: https://www.snowflake.com/legal-files/CreditConsumptionTable.pdf

Service Tier                  | XS Warehouse (1 cr/hour) | S Warehouse (2 cr/hour) | M Warehouse (4 cr/hour) | L Warehouse (8 cr/hour)
Standard ($2/credit)          | $2/hour                  | $4/hour                 | $8/hour                 | $16/hour
Enterprise ($3/credit)        | $3/hour                  | $6/hour                 | $12/hour                | $24/hour
Business Critical ($4/credit) | $4/hour                  | $8/hour                 | $16/hour                | $32/hour

The upside of Snowflake's credit model is that consumption is metered per second of warehouse uptime, with a 60-second minimum each time a warehouse resumes. This means we are only charged for the uptime a warehouse needs to execute an operation. Warehouses come online almost instantly when requested and suspend after completion, with the caveat that any isolated operation under a minute still incurs a full minute of uptime.
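
To make the billing mechanics concrete, here is a back-of-the-envelope sketch in plain Python using the rates from the table above and the 60-second minimum; real invoices will differ slightly.

```python
# Credits per hour by warehouse size, and dollars per credit by service tier,
# taken from the table above.
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8}
PRICE_PER_CREDIT = {"Standard": 2.0, "Enterprise": 3.0, "Business Critical": 4.0}

def warehouse_cost(size: str, tier: str, seconds_running: int) -> float:
    """Approximate cost of one stretch of warehouse uptime."""
    billable_seconds = max(seconds_running, 60)   # 60-second minimum per resume
    hours = billable_seconds / 3600
    return CREDITS_PER_HOUR[size] * PRICE_PER_CREDIT[tier] * hours

# A 90-second query burst on an XS Standard warehouse: ~$0.05
print(round(warehouse_cost("XS", "Standard", 90), 3))
# Two hours of nightly transformations on an M Enterprise warehouse: $24.00
print(warehouse_cost("M", "Enterprise", 2 * 3600))
```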

This compute time is also shared across users. Imagine I had an XS warehouse and only allowed my users to access it for a single hour a day: they could issue as many queries as they wanted in that hour, and while the warehouse's total compute might limit how many operations complete, it would only be billed for the single hour, a total of $2. This can be economical if used correctly! In theory you shouldn't have users running queries overnight, so your warehouse might only be live during business hours, but even this can get complicated. A warehouse's ability to process both complicated queries and large numbers of queries is limited by its compute, so if we send difficult operations, or simply too many queries, we can overload the warehouse and leave it churning through an endless queue 24/7. This is why larger warehouses are available: switching to a larger warehouse for certain types of operations can speed up execution and potentially make things more economical.

We typically recommend that our clients run a larger warehouse (Medium to Large) to handle complicated transformational queries on a schedule; this keeps that work from ballooning and typically requires only an hour or two of daily uptime. A smaller warehouse is then used to serve user-generated queries for things like dashboards and operational work on pre-calculated datasets, as sketched below.
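
A minimal sketch of that two-warehouse pattern using the Snowflake Python connector; the account, credentials, and warehouse names are placeholders, and the sizes and auto-suspend settings are just reasonable starting points.

```python
import snowflake.connector

con = snowflake.connector.connect(
    account="my_account", user="my_user", password="...", role="SYSADMIN"
)
cur = con.cursor()

# Larger warehouse reserved for scheduled transformation jobs.
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS TRANSFORM_WH
    WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE
""")

# Small warehouse for dashboards and ad hoc user queries.
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS REPORTING_WH
    WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE
""")
```

With AUTO_SUSPEND and AUTO_RESUME set, each warehouse only accrues uptime while it actually has work to do.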

Ecosystem

Like BigQuery, Snowflake has a robust ecosystem of additional features included with the data warehouse, and Snowflake's features are more specifically curated for data analytics organizations. Its web interface, Snowsight, provides "worksheets" (an in-browser IDE) for querying and viewing data, much like BigQuery's console, but with much stronger support for data visualization, charting, and graphing built directly into the UI, along with dashboards and notebooks that can run Python natively in the browser. While these kinds of features are available through Google, they are not as tightly coupled to the BigQuery product. All of these features come at no additional license cost, though using them requires running a warehouse to access the data, and thus incurs uptime costs. Finally, Snowflake has leaned heavily into the Apache Iceberg hype, providing an interface for leveraging the open-source table format to store data externally and access it natively in Snowflake.

MotherDuck

Overview

MotherDuck is a low-cost alternative to BigQuery and Snowflake that is relatively new to the market, having opened to the public in August 2023. That does not mean it lacks support. MotherDuck extends DuckDB, the well-loved local analytical database, to the cloud, scaling it for use across an organization. DuckDB is an OLAP database often used by data scientists and analysts to perform complicated queries on large datasets, so long as the data fits on a single machine (read: a person's laptop). By hosting DuckDB on a cloud platform, MotherDuck lets us run an analytics organization much like the larger competitors do, while taking advantage of cost-saving measures. It is a great tool for early, cost-conscious analytics organizations that are not yet ready to scale their data or team size but need to unlock analytical and data science capabilities for more than one person. If you are currently executing BI queries against a production read replica, or you have a data science team that needs to share data and begin doing exploratory experimentation, then this is a great next step.
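
A minimal sketch of what "DuckDB in the cloud" looks like from an analyst's laptop, assuming you already have a MotherDuck account and token configured; the database and table names are placeholders.

```python
import duckdb

# The same DuckDB client analysts already use locally, pointed at a cloud
# database via an "md:" connection string (authentication via your
# MotherDuck token is assumed to be configured in the environment).
con = duckdb.connect("md:analytics")

# Query a table that lives in the cloud and is shared across the team.
print(con.sql("SELECT COUNT(*) FROM events").fetchall())  # hypothetical table
```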

Cost Structure

MotherDuck has a relatively simple pricing model that again splits costs between storage and compute. The difference is that MotherDuck offers a flat cost structure for growing organizations: for $25/month, the standard plan allows unlimited users and includes 100 GB of storage and 100 compute hours. Unlike Snowflake, and somewhat like BigQuery, there is only a single compute tier against which all of your queries execute. Compute is measured down to the second, so short, infrequent operations will not chip away at this budget as fast as they would on Snowflake with its one-minute minimum. Overages beyond the plan are billed at $80/TiB per month for storage and $0.25 per hour of compute time; notably, this compute rate is one eighth that of the smallest Snowflake warehouse. The $25/month plan works out so that your first 100 GB of storage is effectively free and you are paying upfront, at cost, for the first 100 compute hours.
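
The plan arithmetic is simple enough to sketch directly; the rates below are the ones quoted above, and actual bills will of course depend on MotherDuck's exact metering.

```python
def motherduck_monthly_cost(storage_gb: float, compute_hours: float) -> float:
    """Rough monthly cost on the $25 standard plan described above."""
    base = 25.0                                         # includes 100 GB + 100 hours
    extra_storage_tib = max(storage_gb - 100, 0) / 1024  # overage at $80/TiB-month
    extra_hours = max(compute_hours - 100, 0)            # overage at $0.25/hour
    return base + extra_storage_tib * 80.0 + extra_hours * 0.25

print(motherduck_monthly_cost(80, 60))     # within plan: $25.00
print(motherduck_monthly_cost(1124, 220))  # ~1 TiB extra + 120 extra hours: $135.00
```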

On the surface, this doesn't seem like a much better deal than running a very small Snowflake instance with limited usage. However, MotherDuck lets us store compressed data (like Parquet) to reduce storage costs, and offers powerful local scaling for users running DuckDB connections from their own machines. The local scaling does not offer guaranteed improvements, but in our experience it significantly reduces the computational load that MotherDuck requires and therefore bills. What enables this is that users can run a local DuckDB instance on their machine and connect it to the MotherDuck cloud instance. When queries are issued through the local database to the cloud instance, MotherDuck automatically splits the workload across both machines, limiting cloud compute by doing costly operations locally on the user's machine. What's even more exciting is that this has recently been extended to browsers: when executing queries within the MotherDuck UI, a portion of the compute is automatically pushed to an in-browser WASM DuckDB instance, again reducing the computational burden on the cloud database and keeping compute-hour costs low. The exact savings are not explicitly promised, but we typically see a 25-40% reduction in the computation time billed to the cloud host. This will not, however, work with most BI tools or dashboarding connections unless they explicitly run on a DuckDB connection, such as Rill or Hex.
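
A rough sketch of that hybrid pattern: one query that reads a Parquet file sitting on the analyst's laptop and joins it to a table stored in MotherDuck. The file path, database, and table names are placeholders.

```python
import duckdb

# Local DuckDB session attached to a MotherDuck cloud database.
con = duckdb.connect("md:analytics")

# The local file is scanned on the laptop, the cloud table lives in
# MotherDuck, and the workload is split between the two as described above.
con.sql("""
    SELECT c.region, SUM(o.amount) AS total
    FROM 'local_exports/new_orders.parquet' AS o   -- file on the laptop
    JOIN customers AS c ON c.id = o.customer_id    -- table stored in MotherDuck
    GROUP BY c.region
""").show()
```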

Ecosystem

The MotherDuck ecosystem is relatively new but growing quickly. MotherDuck supports connectors to most standard BI tools, and its online IDE provides an accessible entry point for data analysts and data engineers to manage the database. The IDE is notebook oriented, an analyst-friendly approach to data exploration, and includes some minor visualization features, but it is not a dashboarding tool like Snowflake's Snowsight, nor does it include a pairable BI tool like Google's Looker. MotherDuck's strength lies in its third-party connections: you do not need to tie yourself to vendor lock-in the way you might with BigQuery, nor do you need to centralize all of your data in the warehouse as you might with Snowflake. Data can stay remote, in AWS S3, Google Cloud Storage, or external structured files (CSV, Parquet, etc.). It would be disingenuous to say your setup overhead will be higher than with Google or BigQuery; it might involve more browser tabs, but you will ultimately be doing the same operations, and your data is more portable, with DuckDB instances able to run anywhere from cloud servers to local desktops.
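
A small sketch of querying data where it already lives, in this case Parquet files in S3, using DuckDB's httpfs extension; the bucket path is a placeholder and the snippet assumes S3 credentials are already configured in the environment.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")  # remote object-store support
con.execute("LOAD httpfs")

# Query Parquet files directly in the bucket, without loading them first.
con.sql("""
    SELECT event_type, COUNT(*) AS n
    FROM read_parquet('s3://my-bucket/events/*.parquet')
    GROUP BY event_type
""").show()
```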

Limitations

MotherDuck and DuckDB are pretty clear about their limitations: scalability across an organization and a relatively young ecosystem of support. DuckDB's Achilles heel is also its greatest strength: its architecture. By optimizing compute on a single node, DuckDB is able to offer blazing speed at low cost, but it does not scale horizontally. Where BigQuery can spread compute across a massive number of machines, DuckDB's compute is constrained to a single machine, or in the case of shared (dual) execution, two. Small to medium-sized jobs that fit on a single machine are very efficient, but DuckDB simply cannot handle the scale of massive datasets. It is ultimately a question of available memory: if the machine running DuckDB cannot hold the data required for a computation, it won't be able to handle data sizes that may require Snowflake or BigQuery. We do caution that most of our clients, and most businesses, never butt up against these limits; queries touching terabytes of data in a single pass are often the result of ETL mismanagement rather than a true technical requirement. But if you genuinely have that amount of data and are willing to pay the cost to process it, then a technology like BigQuery or Snowflake is required.

The second key limitation is multi-user access and multitenancy. DuckDB will not function in the same capacity as a data lake to which you connect a transformation-layer tool like dbt or SQLMesh and allow many users to submit data transformations. While databases can be shared, they can only receive writes, creates, and appends from a single owner, which means one person (or service) needs to handle data updates and management while other users read the data from that primary node.

Alternative Data Warehouses

There is a long list of additional cloud data warehouses that we do not specialize in at Snowpack, but have some experience working with and have certainly seen work smoothly. Certain features may push firms toward these specific warehouses, and in some cases they can be very effective at solving a specific problem.

Redshift

One of the first cloud-native data warehouses, Redshift is tightly integrated with the AWS ecosystem and optimized for large-scale analytics. It offers a cost-effective solution for enterprises but can require significant tuning for performance and cost optimization. It may make sense for your organization if you are locked into the AWS ecosystem, but it is often better to choose a more user-friendly third party like Snowflake.

Databricks

Built on Apache Spark, Databricks is more of a unified analytics platform than a traditional data warehouse. It excels at large-scale data processing, AI/ML workloads, and data lakehouse architectures, combining elements of both data lakes and warehouses. This vertical integration also comes with a cost, both in raw dollars and in headcount. It is a specialized product that can be very powerful at large enterprise organizations, but it requires total commitment across your stack to Databricks as a single vendor, which for many can be a risky endeavor.

ClickHouse

An open-source, columnar OLAP database designed for real-time analytics. It's highly efficient for analytical workloads with extremely fast query speeds but requires careful tuning and infrastructure management, often making it better suited for engineering-heavy teams. This database excels in offering "real time" analytics for engineering teams and is more suited for a very specific use case rather than a do-it-all analytics warehouse.

Microsoft Fabric/Synapse Analytics

Microsoft's cloud data platform, previously known as Azure Synapse, provides a hybrid of traditional data warehousing and big data analytics. It integrates well with Power BI and Azure services but can be complex to optimize. Fabric really only makes sense if your organization is tightly coupled to the Microsoft stack and is deeply invested in Power BI.