Amazon S3 is a popular object storage service from Amazon Web Services (AWS) that can store vast amounts of data. In a real-world project, I needed to list millions of objects in an S3 bucket. This process can be slow because the S3 API returns a maximum of 1,000 object entries per request. That means an application must make multiple requests to retrieve all the objects.
Issue ¶
To demonstrate the problem, I wrote a simple Go program that creates 25,000 empty objects in an S3 bucket. The name of each object is a random UUIDv4.
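A minimal sketch of that setup, assuming the github.com/google/uuid package for the random names:
// Create 25,000 empty objects; each key is a random UUIDv4 string.
for i := 0; i < 25_000; i++ {
	key := uuid.NewString()
	_, err := s3Client.PutObject(context.Background(), &s3.PutObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		log.Fatalf("failed to create object %s: %v", key, err)
	}
}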
To list objects in a bucket, the S3 API provides the ListObjectsV2 endpoint. In the AWS SDK for Go v2, the ListObjectsV2 operation calls this endpoint.
As mentioned, this endpoint returns at most 1,000 object entries per request. If there are more objects to list, the response contains a continuation token that can be used to retrieve the next batch. An application must keep sending requests until the response no longer contains a continuation token. Pagination is a common pattern across AWS APIs; many services use it to handle large result sets efficiently.
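Handled by hand, the loop looks roughly like this (a minimal sketch; s3Client and bucket are set up as in the examples below):
// Keep requesting pages until no continuation token is returned.
input := &s3.ListObjectsV2Input{Bucket: aws.String(bucket)}
var objectCount int
for {
	output, err := s3Client.ListObjectsV2(context.Background(), input)
	if err != nil {
		log.Fatalf("failed to list objects: %v", err)
	}
	objectCount += len(output.Contents)
	// NextContinuationToken is nil once the last page has been returned.
	if output.NextContinuationToken == nil {
		break
	}
	input.ContinuationToken = output.NextContinuationToken
}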
The SDK for Go provides a convenient way to handle pagination using the ListObjectsV2Paginator. This paginator automatically handles the continuation token; an application iterates over the results with the HasMorePages and NextPage methods. Under the hood, the paginator keeps sending requests until all objects are listed.
// cfg is an aws.Config, loaded with config.LoadDefaultConfig (see below).
s3Client := s3.NewFromConfig(cfg)
bucket := "rasc-test-list"
input := &s3.ListObjectsV2Input{
	Bucket: aws.String(bucket),
}
paginator := s3.NewListObjectsV2Paginator(s3Client, input)
startTime := time.Now()
var objectCount int
for paginator.HasMorePages() {
	page, err := paginator.NextPage(context.Background())
	if err != nil {
		log.Fatalf("failed to get page: %v", err)
	}
	// Each page holds up to 1,000 object entries.
	for range page.Contents {
		objectCount++
	}
}
elapsed := time.Since(startTime)
fmt.Printf("Total objects found: %d\n", objectCount)
fmt.Printf("List operation completed in: %v\n", elapsed)
This code takes about 11 seconds to list all objects on my machine. The requests are strictly sequential: with 25,000 objects, the paginator must make 25 round trips, one after the other. If you have millions of objects, listing them all can take a long time, even with a fast internet connection.
Faster ¶
How can we make this faster? As mentioned, the object names are random UUIDv4s, so each name starts with one of 16 hexadecimal characters (0-9, a-f). The ListObjectsV2 endpoint supports a Prefix parameter that filters objects by name: if you specify a prefix, the list operation returns only objects whose names start with that prefix.
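For example, a request that returns only the objects whose names start with "a" looks like this:
input := &s3.ListObjectsV2Input{
	Bucket: aws.String(bucket),
	Prefix: aws.String("a"), // only keys that begin with "a"
}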
We can use this to our advantage: start 16 goroutines, one per prefix, and run them in parallel. Each goroutine lists the objects that start with its prefix and sends its result to a channel. Once all goroutines finish, the application processes the results. This example program is only interested in the number of objects found, so it counts them; you can modify it to return the object entries.
package main

import (
	"context"
	"fmt"
	"log"
	"sync"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// result carries the outcome of listing one prefix.
type result struct {
	prefix string
	count  int
	error  error
}

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO(), config.WithSharedConfigProfile("home"))
	if err != nil {
		log.Fatalf("failed to load config: %v", err)
	}
	s3Client := s3.NewFromConfig(cfg)
	bucket := "rasc-test-list"
	prefixes := []string{"0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "a", "b", "c", "d", "e", "f"}

	startTime := time.Now()
	resultChan := make(chan result, len(prefixes))

	// Start one goroutine per prefix; each lists its share of the bucket.
	var wg sync.WaitGroup
	for _, prefix := range prefixes {
		wg.Add(1)
		go func(prefix string) {
			defer wg.Done()
			count, err := listObjectsWithPrefix(context.Background(), s3Client, bucket, prefix)
			resultChan <- result{prefix: prefix, count: count, error: err}
		}(prefix)
	}
	wg.Wait()
	close(resultChan)

	// Collect the per-prefix counts.
	totalObjects := 0
	for result := range resultChan {
		if result.error == nil {
			totalObjects += result.count
		} else {
			fmt.Printf("prefix %s: ERROR - %v\n", result.prefix, result.error)
		}
	}
	elapsed := time.Since(startTime)
	fmt.Printf("Total objects found: %d\n", totalObjects)
	fmt.Printf("List operation completed in: %v\n", elapsed)
}
// listObjectsWithPrefix counts all objects whose names start with prefix.
func listObjectsWithPrefix(ctx context.Context, s3Client *s3.Client, bucket, prefix string) (int, error) {
	input := &s3.ListObjectsV2Input{
		Bucket: aws.String(bucket),
		Prefix: aws.String(prefix),
	}
	paginator := s3.NewListObjectsV2Paginator(s3Client, input)
	var objectCount int
	for paginator.HasMorePages() {
		page, err := paginator.NextPage(ctx)
		if err != nil {
			return 0, err
		}
		for range page.Contents {
			objectCount++
		}
	}
	return objectCount, nil
}
This code is significantly faster. On my machine, it takes about 3 to 4 seconds to list all the objects.
Splitting the list requests with a prefix works only if the prefixes are known in advance and the object names are not clustered around a small set of prefixes. If 90% of the objects start with the same prefix, this approach will not help much. You could also try longer prefixes: a prefix does not have to be a single character; it can be a string of any length, as the sketch below shows.
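For example, a hypothetical helper (not part of the original program) could expand the fan-out to all 256 two-character hexadecimal prefixes:
// hexPrefixes returns the 256 two-character prefixes "00" through "ff".
func hexPrefixes() []string {
	const hexChars = "0123456789abcdef"
	prefixes := make([]string, 0, 256)
	for _, a := range hexChars {
		for _, b := range hexChars {
			prefixes = append(prefixes, string(a)+string(b))
		}
	}
	return prefixes
}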
Fortunately, in my real-world project and this example, the names are random UUIDs, so we can assume the prefixes are evenly distributed.
Generic ¶
The following code shows an attempt to make this solution more generic.
The fastListS3Objects function takes a list of prefixes and the number of goroutines to use. It starts at most numGoroutines workers, never more than the number of prefixes, and distributes the prefixes to the workers through a channel. Like the previous example, this code is only interested in the number of objects.
type result struct {
	prefix string
	count  int
	error  error
}

func fastListS3Objects(ctx context.Context, s3Client *s3.Client, bucket string, prefixes []string, numGoroutines int) (int, time.Duration, error) {
	startTime := time.Now()
	maxWorkers := min(len(prefixes), numGoroutines)
	resultChan := make(chan result, len(prefixes))
	workChan := make(chan string, len(prefixes))

	// Queue all prefixes up front; the channel is buffered, so this never blocks.
	for _, prefix := range prefixes {
		workChan <- prefix
	}
	close(workChan)

	// Start the worker pool. Each worker pulls prefixes until the channel is drained.
	var wg sync.WaitGroup
	for range maxWorkers {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for prefix := range workChan {
				count, err := listObjectsWithPrefix(ctx, s3Client, bucket, prefix)
				resultChan <- result{prefix: prefix, count: count, error: err}
			}
		}()
	}
	wg.Wait()
	close(resultChan)

	// Sum the per-prefix counts and remember the first error, if any.
	totalObjects := 0
	var firstError error
	for result := range resultChan {
		if result.error == nil {
			totalObjects += result.count
		} else {
			fmt.Printf("prefix %s: ERROR - %v\n", result.prefix, result.error)
			if firstError == nil {
				firstError = result.error
			}
		}
	}
	elapsed := time.Since(startTime)
	return totalObjects, elapsed, firstError
}
The listObjectsWithPrefix helper is identical to the one in the previous example.
The function can be used like this. The example uses the same 16 prefixes as before but starts only 8 goroutines.
s3Client := s3.NewFromConfig(cfg)
bucket := "rasc-test-list"
prefixes := []string{"0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "a", "b", "c", "d", "e", "f"}
numGoroutines := 8
totalObjects, elapsed, err := fastListS3Objects(context.Background(), s3Client, bucket, prefixes, numGoroutines)
if err != nil {
	log.Fatalf("failed to list objects: %v", err)
}
fmt.Printf("found %d objects in %v\n", totalObjects, elapsed)
In my tests, this is surprisingly almost as fast as the previous example with 16 goroutines. The exact reason is unclear; my internet connection is not fast, and 16 concurrent requests may already be more than it can handle efficiently. It is worth experimenting with the number of goroutines to find the optimal value for your use case.
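A simple way to experiment is to run the same listing with different worker counts and compare the timings (a sketch; the worker counts are arbitrary):
for _, n := range []int{2, 4, 8, 16} {
	totalObjects, elapsed, err := fastListS3Objects(context.Background(), s3Client, bucket, prefixes, n)
	if err != nil {
		log.Printf("run with %d goroutines failed: %v", n, err)
		continue
	}
	fmt.Printf("%2d goroutines: %d objects in %v\n", n, totalObjects, elapsed)
}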
Conclusion ¶
This article shows how to quickly list a large number of objects in an S3 bucket. The Prefix parameter of the ListObjectsV2 request allows an application to start multiple goroutines and call the endpoint in parallel, which can significantly reduce the time it takes to list all objects in a bucket.
As mentioned, this approach only works if the prefixes are known in advance and the object names can be split into multiple prefixes. If most object names start with the same prefix, this approach will not help much. However, if the object name prefixes are evenly distributed, this approach is effective.