Analyzing VPC Flow Logs to Reduce NAT Gateway Costs

With cloud infrastructure costs increasingly a focus, many organizations are scrutinizing AWS bills for potential savings. NAT Gateway usage can be a significant line item, yet its intricacies can make cost-saving opportunities less obvious. As the standard solution for routing traffic from private subnets to the internet, NAT Gateways are a critical component of many VPC architectures – but their impact on the bottom line can be substantial and often overlooked.

In this post, I’ll share how I analyzed my organization’s VPC flow logs, and later Route 53 resolver query logs, to pinpoint the services and requests driving our NAT Gateway costs. This deep dive allowed us to identify key contributors and take targeted action, ultimately dropping our monthly NAT Gateway expenses by ~75%.

Understanding NAT Gateway Costs

Before diving into the analysis, it’s important to understand how NAT Gateway is billed and why a suboptimal configuration can dramatically inflate your costs. From AWS:

If you choose to create a NAT gateway in your VPC, you are charged for each “NAT Gateway-hour” that your gateway is provisioned and available. Data processing charges apply for each gigabyte processed through the NAT gateway regardless of the traffic’s source or destination. Each partial NAT Gateway-hour consumed is billed as a full hour. You also incur standard AWS data transfer charges for all data transferred via the NAT gateway. If you no longer wish to be charged for a NAT gateway, simply delete your NAT gateway using the AWS Management Console, command line interface, or API.

In summary:

  1. NAT Gateway-hour: Charged for each hour the gateway is provisioned
  2. Data processing: Per-gigabyte charge for all traffic through the gateway
  3. AWS data transfer: Standard charges for data transferred via the gateway
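
To make the pricing concrete, here’s a rough back-of-the-envelope estimate for a single gateway processing ~2 TB over a 730-hour month, assuming illustrative rates of about $0.045 per gateway-hour and $0.045 per GB processed (check current pricing for your region):

Bash
# Illustrative only: rates and volume are assumptions, not your bill
$ awk 'BEGIN { hours=730; gb=2048; rate=0.045; printf "hourly: $%.2f  processing: $%.2f  total: $%.2f\n", hours*rate, gb*rate, hours*rate + gb*rate }'
hourly: $32.85  processing: $92.16  total: $125.01

At those rates the data processing charge quickly outgrows the hourly charge, which is why chasing down unnecessary traffic pays off.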

As your traffic grows, so does your NAT Gateway charge. However, certain misconfigurations can lead to unnecessary and costly NAT Gateway processing. From my experience, two common types of issues stand out:

  1. Resources accessing external services that could be reached via VPC endpoints (either interface/PrivateLink or gateway types).
  2. Resources hitting “external” services that should be internal. This includes API Gateway misuse, incorrect ALB schemes (internet-facing instead of internal), Lambda Function URLs, or other internal services unnecessarily exposed to the public internet. These force client resources to egress over NAT and then back in, increasing costs, not to mention the added external security footprint and latency (see the quick check below).
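
For the second category, a quick way to spot candidates is to list any internet-facing load balancers and confirm they genuinely need to be public. A minimal sketch with the AWS CLI:

Bash
# List internet-facing ALBs/NLBs to review whether they should be internal
$ aws elbv2 describe-load-balancers \
    --query 'LoadBalancers[?Scheme==`internet-facing`].[LoadBalancerName,DNSName]' \
    --output table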

With that in mind, make sure to take a look at your current NAT Gateway spend in the AWS Cost Explorer. You’ll want to understand your baseline and the extent of your problem before performing a deep dive and implementing changes.

Optimization Process

Collecting Log Data

The first step in our optimization journey is to collect VPC flow log data. AWS has extensive docs on this with various options – my recommendation is to publish the logs to S3 in parquet format. The Quick Start version:

  • Create a bucket with the required bucket policy
  • Browse to your VPC in the console and select Actions, Create flow log.
  • Make sure to select:
    • Filter: All
    • Destination: Send to an Amazon S3 bucket
    • Log record format: Custom format, Select all
    • Log file format: Parquet
    • Partition logs by time: 24 hours

After saving the configuration, you can expect to start seeing logs in the bucket within 10-20 minutes. If you prefer Terraform, this should get you started:

HCL
variable "vpc_id" {
  default = "YOUR-VPC-ID"
}

variable "env" {
  default = "dev"
}

data "aws_vpc" "selected" {
  id = var.vpc_id
}

data "aws_s3_bucket" "bucket" {
  bucket = "YOUR-BUCKET-WITH-POLICY"
}

resource "aws_flow_log" "flow_log" {
  log_destination      = "${data.aws_s3_bucket.bucket.arn}/vpc-flow-logs-${var.env}"
  log_destination_type = "s3"
  traffic_type         = "ALL"
  vpc_id               = data.aws_vpc.selected.id
  log_format = "$${account-id} $${action} $${az-id} $${bytes} $${dstaddr} $${dstport} $${end} $${flow-direction} $${instance-id} $${interface-id} $${log-status} $${packets} $${pkt-dst-aws-service} $${pkt-dstaddr} $${pkt-src-aws-service} $${pkt-srcaddr} $${protocol} $${region} $${srcaddr} $${srcport} $${start} $${sublocation-id} $${sublocation-type} $${subnet-id} $${tcp-flags} $${traffic-path} $${type} $${version} $${vpc-id}"
  destination_options {
    file_format        = "parquet"
    per_hour_partition = false
  }
  tags = {
    "Name" = "vpc-flow-logs-${var.env}"
  }
}
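
Whichever route you take, you can confirm files are landing in the bucket with a quick listing (the prefix below matches the Terraform example above):

Bash
# List delivered flow log objects under the configured prefix
$ aws s3 ls s3://YOUR-BUCKET-WITH-POLICY/vpc-flow-logs-dev/ --recursive | head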

Analyzing the Data

With our logs collected, it’s time to dive into the analysis. I recommend starting with at least a full day’s worth of logs; this helps capture traffic that only spikes periodically, for example when batch jobs run.

Start by pulling the files to a local folder:

Bash
$ aws s3 sync s3://YOUR-BUCKET/vpc-flow-logs-prd/AWSLogs/YOUR-ACCOUNT-ID/vpcflowlogs/us-east-1/2024/10/14/ .

Install the DuckDB CLI (if you haven’t already) and import the flow logs into a persistent database file:

Bash
$ duckdb data.db -c "CREATE TABLE flowlogs as SELECT * FROM './*.parquet';"

DuckDB is fast and powerful – now you can work with the logs using familiar SQL syntax and any supported DB client.
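
A quick way to see the available columns is DuckDB’s DESCRIBE; note that the hyphenated field names from the log format (e.g. pkt-dst-aws-service) show up with underscores in the Parquet schema:

Bash
# Inspect the imported schema and column types
$ duckdb data.db -c "DESCRIBE flowlogs;"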

Let’s start with a couple of simple queries that show some general traffic patterns. Note that I’ll assume your VPC uses the 10.0.0.0/8 private range.
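
These queries filter on the NAT gateway’s private IP. If you don’t have it handy, you can look it up with something like this (YOUR-VPC-ID is a placeholder):

Bash
# Print the private IP(s) of the NAT gateway(s) in a given VPC
$ aws ec2 describe-nat-gateways \
    --filter Name=vpc-id,Values=YOUR-VPC-ID \
    --query 'NatGateways[].NatGatewayAddresses[].PrivateIp' \
    --output text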

Top external uploads / egress traffic:

Bash
$ duckdb data.db 
v1.1.2 f680b7d08f
Enter ".help" for usage hints.
D .maxrows 100
D SELECT l.dstaddr, sum(l.bytes) AS total_bytes, l.dstport
  FROM  main.flowlogs l
  WHERE l.log_status = 'OK' AND l.srcaddr IN ('YOUR-PRIVATE-NAT-GW-IP')
      AND l.dstaddr NOT LIKE '10.%'
  GROUP BY l.dstaddr, l.dstport
  ORDER BY total_bytes DESC
  LIMIT 100;
┌────────────────┬─────────────┬─────────┐
│    dstaddr     │ total_bytes │ dstport │
│    varchar     │   int128    │  int32  │
├────────────────┼─────────────┼─────────┤
│ **REMOVED**    │    17216379 │     443 │
│ **REMOVED**    │     9771108 │     443 │
│ **REMOVED**    │     4865039 │     443 │
│ **REMOVED**    │     3732846 │     443 │

Top downloads:

SQL
SELECT l.srcaddr, sum(l.bytes) AS total_bytes, l.srcport
FROM  main.flowlogs l
WHERE l.log_status = 'OK' AND l.dstaddr IN ('YOUR-PRIVATE-NAT-GW-IP')
    AND l.srcaddr NOT LIKE '10.%'
GROUP BY l.srcaddr, l.srcport
ORDER BY total_bytes DESC
LIMIT 100;

This is a good starting point and will give you insights into some of your traffic patterns, but it presents only a partial view of the overall situation.
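
To put those byte counts in dollar terms, you can also total everything crossing the gateway and multiply by your region’s per-GB data processing rate. A rough sketch, again assuming ~$0.045/GB; scale the result up if you only imported a single day of logs:

Bash
$ duckdb data.db -c "
  -- Total GB through the NAT gateway and an estimated processing charge
  SELECT sum(bytes) / (1024 * 1024 * 1024)         AS total_gb,
         sum(bytes) / (1024 * 1024 * 1024) * 0.045 AS est_processing_usd
  FROM flowlogs
  WHERE log_status = 'OK'
    AND ((srcaddr IN ('YOUR-PRIVATE-NAT-GW-IP') AND dstaddr NOT LIKE '10.%')
      OR (dstaddr IN ('YOUR-PRIVATE-NAT-GW-IP') AND srcaddr NOT LIKE '10.%'));"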

Limitations

Because we are working with Layer 3 / IP-address-level data, the query results above will not represent traffic in aggregate when services use DNS load balancing across multiple IPs. As a result, the true impact of individual external services on NAT Gateway usage can be wildly underrepresented or even completely hidden, as their traffic is dispersed across numerous IP addresses.

Let’s explore a real-world example. Since Datadog offers a PrivateLink integration, it makes a good case study for this scenario. Focusing on just their Profiling service’s upload endpoint, intake.profile.datadoghq.com, we can see numerous IPs in rotation:

Bash
$ dig +short intake.profile.datadoghq.com
alb-logs-http-prof-shard0-1245244872.us-east-1.elb.amazonaws.com.
3.233.154.55
3.233.154.61
3.233.149.128
3.233.149.157
3.233.149.138
3.233.154.63
3.233.149.142
3.233.154.51

In fact, if you ran this multiple times in a row, you would see quite a large mix of IPs. With so many individual IPs involved, it’s easy to miss the overall traffic volume and the potential savings of a PrivateLink integration.
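
One way to capture a fuller set of IPs is to resolve the hostname repeatedly over a few minutes and de-duplicate the results, for example:

Bash
# Resolve every 15 seconds, drop the CNAME line (contains letters), keep unique IPs
$ for i in $(seq 1 20); do dig +short intake.profile.datadoghq.com | grep -v '[a-z]'; sleep 15; done | sort -u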

Let’s filter egress just by these IPs. Below are some example results – note the individual data transfer of each IP, and then the aggregate usage of the IP group overall:

Bash
$ duckdb data.db 
v1.1.2 f680b7d08f
Enter ".help" for usage hints.
D SELECT l.dstaddr, sum(l.bytes)/(1024 * 1024 * 1024) AS total_gb, l.dstport
  FROM  main.flowlogs l
  WHERE l.log_status = 'OK' AND l.srcaddr IN ('YOUR-PRIVATE-NAT-GW-IP')
      AND l.dstaddr NOT LIKE '10.%'
      AND l.dstaddr IN (
          '3.233.154.55',        
          '3.233.154.61',
          '3.233.149.128',
          '3.233.149.157',
          '3.233.149.138',
          '3.233.154.63',
          '3.233.149.142',
          '3.233.154.51'        
          )    
  GROUP BY l.dstaddr, l.dstport
  ORDER BY total_gb DESC 
  LIMIT 100;

100% ▕████████████████████████████████████████████████████████████▏ 
┌───────────────┬────────────────────┬─────────┐
│    dstaddr    │      total_gb      │ dstport │
│    varchar    │       double       │  int32  │
├───────────────┼────────────────────┼─────────┤
│ 3.233.149.157 │ 2.8821019837632775 │     443 │
│ 3.233.154.51  │ 2.4727102955803275 │     443 │
│ 3.233.149.142 │ 1.8827753942459822 │     443 │
│ 3.233.154.63  │ 1.7361681573092937 │     443 │
│ 3.233.154.61  │ 1.5937512768432498 │     443 │
│ 3.233.149.138 │ 1.1878074323758483 │     443 │
│ 3.233.154.55  │ 0.8651017183437943 │     443 │
│ 3.233.149.128 │ 0.5049652773886919 │     443 │
└───────────────┴────────────────────┴─────────┘
D SELECT sum(l.bytes)/(1024 * 1024 * 1024) AS total_gb, l.dstport
  FROM  main.flowlogs l
  WHERE l.log_status = 'OK' AND l.srcaddr IN ('YOUR-PRIVATE-NAT-GW-IP')
      AND l.dstaddr NOT LIKE '10.%'
      AND l.dstaddr IN (
          '3.233.154.55',        
          '3.233.154.61',
          '3.233.149.128',
          '3.233.149.157',
          '3.233.149.138',
          '3.233.154.63',
          '3.233.149.142',
          '3.233.154.51'        
          )    
  GROUP BY l.dstport;
┌────────────────────┬─────────┐
│      total_gb      │ dstport │
│       double       │  int32  │
├────────────────────┼─────────┤
│ 13.125381535850465 │     443 │
└────────────────────┴─────────┘

As you can see, at least ~13 GB is being uploaded through the NAT Gateway for the Profiler feature alone (not counting other potential IPs), making the PrivateLink integration a prime opportunity for considerable cost savings.
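
The PrivateLink setup itself is provider-specific (the provider’s docs list the actual endpoint service names), but at its core it is an interface VPC endpoint with private DNS enabled. A rough CLI sketch, with the service name and resource IDs as placeholders:

Bash
# Create an interface endpoint for the provider's PrivateLink service (placeholders throughout)
$ aws ec2 create-vpc-endpoint \
    --vpc-id YOUR-VPC-ID \
    --vpc-endpoint-type Interface \
    --service-name com.amazonaws.vpce.us-east-1.vpce-svc-EXAMPLE \
    --subnet-ids subnet-EXAMPLE1 subnet-EXAMPLE2 \
    --security-group-ids sg-EXAMPLE \
    --private-dns-enabled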

After PrivateLink is set up and configured, private DNS will point these hostnames to your VPC endpoints rather than resolving them externally. You can tell it’s working if the hostnames now resolve to private IPs within your VPC:

Bash
$ dig +short intake.profile.datadoghq.com
10.0.100.131
10.0.100.171

This traffic will no longer route over your NAT Gateway. After implementation, be sure to revisit AWS Cost Explorer to measure the impact of your changes. In our case, we saw a measurable reduction within a day or so.

AWS Service Labels

You might notice fields in the flow logs named pkt_dst_aws_service and pkt_src_aws_service – use them with caution, or not at all. They can be inaccurate for the type of analysis above. Here is the explanation I received from AWS Support when asking about unexpected values for known API Gateways:

The behavior you are experiencing is due to fields such as pkt-src-aws-service and pkt-dst-aws-service being metadata fields, as opposed to packet header fields. “Metadata fields that do not come directly from the packet header are best effort approximations, and their values might be missing or inaccurate”.
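
If you still want to glance at these fields as a rough hint (and nothing more), a simple breakdown looks like this:

Bash
$ duckdb data.db -c "
  -- Approximate egress by the (best-effort) AWS service label
  SELECT pkt_dst_aws_service, sum(bytes) / (1024 * 1024 * 1024) AS total_gb
  FROM flowlogs
  WHERE log_status = 'OK' AND srcaddr IN ('YOUR-PRIVATE-NAT-GW-IP')
  GROUP BY pkt_dst_aws_service
  ORDER BY total_gb DESC;"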

External Services vs Internal Clients

I recommend first focusing on external service patterns rather than immediately trying to track down the internal resources generating the traffic in your VPC. If the use case is legitimate (for example, Datadog), you are better off making the communication more efficient for all VPC resources. Internal source IPs may not help much anyway, since they could point to a Kubernetes node hosting a multitude of applications/pods, for example.

Of course, do track down the sources of traffic when there is a single outsized contributor or the destination service is unrecognized; a query like the one below can help.
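
When you do need to attribute a suspicious destination to internal sources, grouping by internal source address for that destination is a decent first pass (SOME-EXTERNAL-IP is a placeholder):

Bash
$ duckdb data.db -c "
  -- Which internal addresses are talking to a specific external IP?
  SELECT srcaddr, sum(bytes) / (1024 * 1024 * 1024) AS total_gb
  FROM flowlogs
  WHERE log_status = 'OK'
    AND srcaddr LIKE '10.%'
    AND dstaddr = 'SOME-EXTERNAL-IP'
  GROUP BY srcaddr
  ORDER BY total_gb DESC
  LIMIT 20;"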

Ongoing Monitoring

Keep in mind that NAT Gateway usage patterns will change as your infrastructure evolves and grows, so it’s worth revisiting this analysis regularly to maintain cost efficiency.

Introducing NAT Gateway Analyzer

To address the challenges encountered above and simplify the optimization process, I am excited to share the launch of NAT Gateway Analyzer:

https://natgatewayanalyzer.com/

What began as a tool to explore and support my own cost-cutting initiative has evolved into a solution I believe can bring substantial value to the broader AWS community. By automating and streamlining the process we’ve discussed, NAT Gateway Analyzer aims to help others quickly uncover and act on cost-saving opportunities.

Two of the key problems I am attempting to solve with this tool are:

  1. Automation – this exercise can yield significant savings, but the manual process of importing flow logs, querying, categorizing, and ordering services by data transfer can be time-consuming and complex. Moreover, you’ll likely need to perform this process multiple times to confirm your results and review for any gaps or additional savings.
  2. Estimating Hostnames – this feature addresses the challenge of gaining full visibility into data transfer for individual services. The tool tries to identify collections of IPs associated with specific hostnames, enabling you to view aggregate traffic across a service and target your optimization efforts more precisely. See the Coming Soon section below for a note on forthcoming additional improvements to accuracy.

You can drill into individual hostname groups to find the data transfer by IP, direction, and port.

Coming Soon

Stay tuned for Part 2 of this post, where I’ll expand on improving the accuracy of tracing service usage through resolver query logs. I also plan to integrate this feature into NAT Gateway Analyzer to further enhance its capabilities.
