10-23-2023, 09:50 AM
Using S3's multi-part uploads for small files introduces some overhead that you might not encounter with traditional file systems, and I’ll explain how that works. S3 is designed for robustness, scalability, and handling large files efficiently, which is great in many scenarios. However, those strengths can turn into bottlenecks or inefficiencies when dealing with small files.
In a standard file system, writing a file is a straightforward process: you call a method that writes the entire file in one go. The file system takes care of the rest, handling the I/O directly with minimal latency, and you get confirmation almost immediately. That simplicity pays off with small files because the overhead is minimal, and the file system itself is often optimized for small writes, handling the data and its metadata very efficiently.
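For contrast, here's roughly what that local path looks like in Python; the path and payload are just placeholders:

```python
# Writing a small file locally: a single call, no setup or teardown steps.
from pathlib import Path

payload = b"x" * (2 * 1024 * 1024)            # a 2 MB blob, purely illustrative
Path("/tmp/report.bin").write_bytes(payload)  # one write, confirmation is immediate
```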
On S3, the multi-part upload process is more elaborate, which can mean extra latency and unnecessary complexity for small files. You first have to initiate the upload with a CreateMultipartUpload request, which tells S3 to set up the upload and hand back an upload ID. That's a full round trip to the S3 API, and it adds latency before any file data has moved. You're essentially declaring your intent to upload, a step that simply doesn't exist in a file system.
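In boto3 terms (bucket and key names here are placeholders), that declaration of intent looks roughly like this:

```python
import boto3

s3 = boto3.client("s3")

# Round trip #1: ask S3 to open a multi-part upload and return an UploadId.
# No file data has been transferred yet.
resp = s3.create_multipart_upload(Bucket="my-bucket", Key="uploads/report.bin")
upload_id = resp["UploadId"]
```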
Once that step is complete, you move on to actually uploading the individual parts of the file. To take advantage of multi-part uploads you break the file into chunks, which means reading it, splitting it into, say, four or five parts, and uploading each part with a separate UploadPart request. Each of those requests is its own API call, and if you're handling many small files, the call count multiplies quickly.
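Continuing the sketch above (the `s3` client and `upload_id` come from the initiation step), the chunk-and-upload loop looks something like this; keep in mind every part except the last has to be at least 5 MB:

```python
# Round trips #2..N: one UploadPart call per chunk.
part_size = 5 * 1024 * 1024   # 5 MB, the minimum for every part except the last
parts = []

with open("report.bin", "rb") as f:
    part_number = 1
    while chunk := f.read(part_size):
        resp = s3.upload_part(
            Bucket="my-bucket",
            Key="uploads/report.bin",
            PartNumber=part_number,
            UploadId=upload_id,
            Body=chunk,
        )
        # S3 returns an ETag for each part; you have to keep it to finish the upload.
        parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
        part_number += 1
```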
Now consider a file that sits below the size where multi-part uploads pay off. S3 requires every part except the last to be at least 5 MB, so a small file can't usefully be split at all. If you push a 2 MB file through the multi-part path anyway, you still make the initial request to start the upload, upload a single part, and then complete the upload with yet another API call. Instead of one seamless operation, you're looking at a minimum of three API calls plus the associated overhead for each one.
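Finishing the sketch for that 2 MB file: there's exactly one part, yet the CompleteMultipartUpload call is still required, which is what brings the tally to three requests:

```python
# Round trip #3: tell S3 to assemble the (single) part into the final object.
s3.complete_multipart_upload(
    Bucket="my-bucket",
    Key="uploads/report.bin",
    UploadId=upload_id,
    MultipartUpload={"Parts": parts},  # for a 2 MB file, this list has one entry
)
```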
Looking at the real implications here: the number of requests can dwarf the amount of data you're actually uploading. If you upload a 1 MB file, you may be sending three to five requests just to finish it, and the network latency of each request accumulates into significant delays when you're working with thousands of small files. A file system, by contrast, would write that same file in a single operation with far less end-to-end latency.
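A quick back-of-the-envelope illustration, treating the uploads as sequential and using a placeholder 50 ms round-trip time (your numbers will differ):

```python
# Placeholder latency math: 10,000 small files, uploaded one after another.
round_trip_ms = 50                 # illustrative round-trip time per API call
files = 10_000

multipart_ms = files * 3 * round_trip_ms   # initiate + one part + complete per file
put_object_ms = files * 1 * round_trip_ms  # a single PutObject per file

print(multipart_ms / 1000, put_object_ms / 1000)  # 1500.0 vs 500.0 seconds
```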
There's also the matter of bandwidth usage. For small files, multi-part uploads can simply waste bandwidth: I've seen situations where the metadata and API interactions accounted for a surprisingly large share of the bytes on the wire relative to the actual data transferred. Remember that every part upload carries request headers, a body, and a response back from S3. All of that counts toward your bandwidth, so for a small file you're sending a lot of “overhead” bytes relative to the actual content.
You might say, “Well, can’t I just skip multi-part uploads for small files?” The answer is a bit nuanced. You absolutely can use the single-object upload API that S3 provides (a plain PutObject), but even that has its own limitations, and mixing and matching approaches across your uploads can complicate things. The point of settling on one consistent strategy is to keep your application's code simple and maintainable.
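One common compromise is a size-based switch. The cutoff below is arbitrary and the names are placeholders, but the shape is the important part; note that boto3's managed `upload_file` already does this kind of threshold switching internally via `TransferConfig`:

```python
import os
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")
CUTOFF = 8 * 1024 * 1024  # arbitrary: below this, skip multi-part entirely

def upload(path: str, bucket: str, key: str) -> None:
    if os.path.getsize(path) < CUTOFF:
        # Small file: one PutObject request, no initiate/complete bookkeeping.
        with open(path, "rb") as f:
            s3.put_object(Bucket=bucket, Key=key, Body=f)
    else:
        # Larger file: let boto3's managed transfer split it into parts.
        s3.upload_file(path, bucket, key,
                       Config=TransferConfig(multipart_threshold=CUTOFF))
```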
There's also the matter of retries and fault tolerance. S3 is designed to be resilient: if a single part upload fails, you can simply retry that part. The catch with small files is that if you're uploading them in multiple parts and one part fails, you now have to track each part's state individually. You end up writing extra logic to record which parts succeeded and which didn't, and that overhead isn't just complexity; it's also development time and future maintainability.
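That bookkeeping tends to look something like this hypothetical helper (the attempt count and backoff policy are arbitrary):

```python
import time

def upload_part_with_retry(s3, bucket, key, upload_id, part_number, chunk,
                           max_attempts=3):
    """Upload one part, retrying with backoff; returns the entry S3 needs at completion."""
    for attempt in range(1, max_attempts + 1):
        try:
            resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=part_number,
                                  UploadId=upload_id, Body=chunk)
            return {"PartNumber": part_number, "ETag": resp["ETag"]}
        except Exception:
            if attempt == max_attempts:
                raise  # the caller now decides whether to abort the whole upload
            time.sleep(2 ** attempt)  # crude exponential backoff before retrying this part
```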
Another aspect that plays into the perceived inefficiency is that S3 is optimized for larger datasets. When you start using it for smaller files, you’re not fully engaging with its strengths. The architecture of S3 is built around distributing large amounts of data across many servers, and when you’re using it for small file uploads, you can feel like you’re running counter to its design intent.
On the billing side, S3's pricing is driven by both the data you store and the requests you make. Uploading many small files through the multi-part upload API can therefore cost more than the equivalent process on a file system, where you usually just pay for disk space. With a ton of little uploads, S3 charges you not only for the data but for every single request, so the upload pattern S3 is optimized for doesn't necessarily translate into cost efficiency for your workload.
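A rough illustration of the request-count effect; the per-1,000-request price below is a placeholder, so plug in the actual rate for your region and storage class:

```python
# Back-of-the-envelope request costs for one million small files.
price_per_1000_requests = 0.005   # placeholder rate; check your region's pricing
files = 1_000_000

multipart_requests = files * 3    # initiate + one part + complete per file
put_object_requests = files * 1   # a single PutObject per file

print(multipart_requests / 1000 * price_per_1000_requests)   # 15.0 at this rate
print(put_object_requests / 1000 * price_per_1000_requests)  # 5.0 at this rate
```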
Finally, there's the issue of metadata handling. You might think that once you've uploaded your file you're done, but in S3 each part of a multi-part upload has to be tracked, and that tracking can become cumbersome. If an upload is never completed, the parts you already sent sit in the bucket, and are billed, until you explicitly abort the upload or a lifecycle rule cleans them up. You also have to decide how to manage that metadata over time: how to version your files, track changes, or clean up obsolete ones. In a traditional file system the overhead is minimal; you just delete or replace a file without worrying about parts, versions, and related complexities.
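A cleanup sketch for those orphaned uploads (the bucket name is a placeholder); S3 can also automate this with a lifecycle rule that aborts incomplete multi-part uploads after a set number of days:

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"  # placeholder

# Find multi-part uploads that were started but never completed, and abort them
# so their orphaned parts stop accruing storage.
pending = s3.list_multipart_uploads(Bucket=bucket).get("Uploads", [])
for upload in pending:
    s3.abort_multipart_upload(Bucket=bucket,
                              Key=upload["Key"],
                              UploadId=upload["UploadId"])
```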
If you do have to work with small files in S3 regularly, consider alternatives like batching them into a larger object before upload, where applicable, or using edge functions to compress and minimize what you send. You save on both per-request overhead and latency, and you get files onto S3 in a way that better fits its architecture.
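For example, a batching approach might look like this sketch, which bundles a directory of small files into one compressed archive and ships it as a single object (the paths and key are placeholders):

```python
import io
import tarfile
from pathlib import Path

import boto3

s3 = boto3.client("s3")

# Pack a directory of small files into one in-memory tar.gz, then upload it as a
# single object: one PutObject instead of several requests per file.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    for path in Path("small_files").glob("*.json"):
        archive.add(str(path), arcname=path.name)

buffer.seek(0)
s3.put_object(Bucket="my-bucket", Key="batches/small-files.tar.gz", Body=buffer)
```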
All of these considerations should lead you to think carefully about the strategy you adopt when using S3 for small file uploads. While S3 offers fantastic scalability and reliability, those same features can introduce complexity and overhead that make handling smaller files harder than it would be on a traditional file system. Weighing your options before committing to one approach can save you a lot of headaches down the road!