What is the S3 Multipart Upload API and how is it used?

#1
12-16-2023, 02:12 AM
The S3 Multipart Upload API lets you upload large files to Amazon S3 in several smaller parts. This matters because pushing a large file in a single request is slow and fragile. By breaking the file into smaller chunks, you make the process faster and more reliable. I find it particularly effective because you can upload the parts in parallel, which reduces total upload time. Moreover, if you hit a problem mid-upload, you can retry just the parts that failed instead of starting the entire upload again. That capability is a game changer when you're working with large datasets or media files.

To get started, you initiate the multipart upload by calling the CreateMultipartUpload operation. You provide the target bucket name and the object key (essentially the name you want to assign to the uploaded file). S3 responds with an Upload ID, which you'll need for every subsequent request. This ID acts like a session token for the multipart upload, ensuring that all your parts are associated with the same upload.
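Here's a minimal sketch of that first step using Python's boto3; the bucket name and object key are placeholders I've made up:

```python
import boto3

s3 = boto3.client("s3")

# Start the multipart upload; "my-bucket" and the key are placeholders.
response = s3.create_multipart_upload(
    Bucket="my-bucket",
    Key="videos/big-file.mp4",
)

# Every subsequent part upload must reference this Upload ID.
upload_id = response["UploadId"]
print("Upload ID:", upload_id)
```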

I usually save that Upload ID because I'll need it when I upload each part. A single upload can contain up to 10,000 parts, which is more than enough for most use cases. Each part can be between 5 MB and 5 GB (only the final part is allowed to be smaller than 5 MB), which gives you a lot of flexibility. You can upload parts of 1 GB each, or you could have parts that are only 5 MB. It really depends on what you're trying to achieve and how you want to structure your uploads.
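Those two limits interact: the part size you pick determines how many parts a file needs. Here's a hypothetical helper (the function name and the 100 MB default are my own choices) that picks a part size respecting both limits:

```python
import math
import os

MIN_PART_SIZE = 5 * 1024 * 1024  # 5 MiB floor for every part except the last
MAX_PARTS = 10_000               # hard cap on parts per multipart upload

def choose_part_size(path, preferred=100 * 1024 * 1024):
    """Return a part size that keeps the file under 10,000 parts."""
    file_size = os.path.getsize(path)
    # Smallest part size that still fits within the 10,000-part cap.
    required = math.ceil(file_size / MAX_PARTS)
    return max(MIN_PART_SIZE, preferred, required)
```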

Uploading the parts is pretty straightforward. For each part you perform an UploadPart request (an HTTP PUT against the object's S3 endpoint) that includes your Upload ID and a part number. A part number is just a sequential integer from 1 to 10,000 that you assign to each part. For example, if you're uploading three parts, you would use part numbers 1, 2, and 3. Make sure the part numbers are unique within that multipart upload; if you upload two parts with the same number, the later one silently overwrites the earlier one.

After each part uploads successfully, S3 returns an ETag for it. The ETag is a unique identifier for that version of the part, and it's crucial when you complete the upload. I recommend storing each ETag together with its part number, because you'll need to provide both when you call the complete multipart upload operation.
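Continuing the boto3 sketch from above (same placeholder bucket and key; the local file name and 100 MiB part size are also assumptions), the upload loop might look like this:

```python
parts = []                      # collects {"PartNumber", "ETag"} for the complete call
part_size = 100 * 1024 * 1024   # 100 MiB per part

with open("big-file.mp4", "rb") as f:
    part_number = 1
    while True:
        data = f.read(part_size)
        if not data:
            break  # end of file
        response = s3.upload_part(
            Bucket="my-bucket",
            Key="videos/big-file.mp4",
            PartNumber=part_number,
            UploadId=upload_id,
            Body=data,
        )
        # Store each part's ETag alongside its part number for later.
        parts.append({"PartNumber": part_number, "ETag": response["ETag"]})
        part_number += 1
```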

At some point in your process, you might face scenarios where things don’t go as planned. Maybe your connection drops, or perhaps the network becomes unstable. This is where multipart upload shines because you won’t have to restart your entire upload. You can retry by simply re-uploading the parts that didn’t make it. This feature allows you to continue where you left off without losing the already uploaded parts. You have the granular control necessary, and that makes a significant difference, especially in a production environment.
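A sketch of what that per-part retry can look like, reusing the s3 client and upload_id from the earlier sketch (the three-attempt limit and backoff are arbitrary choices of mine):

```python
import time
import botocore.exceptions

def upload_part_with_retry(data, part_number, max_attempts=3):
    """Retry one part; a failure here never forces re-uploading the others."""
    for attempt in range(1, max_attempts + 1):
        try:
            response = s3.upload_part(
                Bucket="my-bucket",
                Key="videos/big-file.mp4",
                PartNumber=part_number,
                UploadId=upload_id,
                Body=data,
            )
            return {"PartNumber": part_number, "ETag": response["ETag"]}
        except (botocore.exceptions.ClientError,
                botocore.exceptions.BotoCoreError):
            if attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff
```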

Once all parts are uploaded, you compile them into the final object through the complete multipart upload call, sending the same Upload ID along with the list of part numbers and their associated ETags. One detail worth knowing: S3 assembles the object in ascending part-number order, and the parts in your completion request must be listed in that order, so sorting the list by part number before sending it saves a headache later on.
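Finishing the running boto3 sketch:

```python
# Parts must be listed in ascending part-number order.
parts.sort(key=lambda p: p["PartNumber"])

s3.complete_multipart_upload(
    Bucket="my-bucket",
    Key="videos/big-file.mp4",
    UploadId=upload_id,
    MultipartUpload={"Parts": parts},
)
```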

There could be times when you want to abandon an upload in progress, maybe because project requirements changed or the upload was a mistake. In that case, you send an abort multipart upload request with the Upload ID. Once aborted, S3 deletes all parts associated with that upload, freeing the space they consumed. Understand that this is irreversible: you cannot recover the parts once the upload is aborted. It's also worth knowing that you can configure a bucket lifecycle rule to abort incomplete multipart uploads automatically after a set number of days, so forgotten uploads don't linger.
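The abort call itself is a one-liner in the same sketch:

```python
# Irreversible: S3 deletes every part uploaded under this Upload ID.
s3.abort_multipart_upload(
    Bucket="my-bucket",
    Key="videos/big-file.mp4",
    UploadId=upload_id,
)
```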

There are numerous practical applications I’ve seen in real-world scenarios. For instance, let’s say you’re dealing with video processing for a streaming service. Video files can easily exceed several gigabytes, and using multipart upload for this makes a lot of sense. You can have the backend handle multiple uploads simultaneously, getting your content to S3 faster without hitting timeouts or connection limits.
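As a rough illustration of that parallelism, here's one way to fan out part uploads with a thread pool (boto3 clients are thread-safe; the worker count, part count, and file layout are assumptions, and part_size and upload_id come from the earlier sketch):

```python
from concurrent.futures import ThreadPoolExecutor

def upload_one(part_number):
    # Read this part's bytes from its offset in the file.
    with open("big-file.mp4", "rb") as f:
        f.seek((part_number - 1) * part_size)
        data = f.read(part_size)
    response = s3.upload_part(
        Bucket="my-bucket",
        Key="videos/big-file.mp4",
        PartNumber=part_number,
        UploadId=upload_id,
        Body=data,
    )
    return {"PartNumber": part_number, "ETag": response["ETag"]}

num_parts = 12  # however many parts the file splits into
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(upload_one, range(1, num_parts + 1)))
```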

Another scenario is backups. If you're backing up large datasets from a database or file system, multipart uploads significantly reduce the risk of corruption or timeouts. When a long-running upload hiccups, individual parts can fail, and being able to retry just those parts means you don't lose the entire backup.

You should be aware of the costs associated with multipart uploads. Part storage is billed at normal S3 rates, but you are also charged per request, and every UploadPart call counts as one request, as do the initiate and complete calls. Note as well that parts of an incomplete multipart upload keep accruing storage charges until you complete or abort it. It's essential to monitor your usage so you're not caught off guard by unexpected charges. AWS does have a free tier for S3, but multipart uploads can still lead to costs depending on the frequency and size of your uploads.
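A quick back-of-the-envelope for the request count (the 50 GiB file size and 100 MiB part size are just example numbers):

```python
file_size = 50 * 1024**3        # 50 GiB upload
part_size = 100 * 1024**2       # 100 MiB parts

part_requests = -(-file_size // part_size)  # ceiling division -> 512 parts
total_requests = part_requests + 2          # plus initiate and complete calls
print(total_requests)                       # 514 billable requests
```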

It's interesting to note that multipart uploads aren’t just for large files. There are cases where you might have many smaller files, and you need to ensure those are uploaded efficiently as well. Slightly changing your approach to partitioning even smaller files could yield faster upload times compared to handling them as single requests. The flexibility of defining your part size plays a significant role in this decision-making process.

As you’re working with the Multipart Upload API, I highly recommend you experiment with the SDKs provided by AWS. Whether you’re using the Java SDK, Python’s Boto3, or any other language, these libraries simplify the process significantly. They’ll handle a lot of the complexity for you, enabling you to focus on the core logic of your application.
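For example, boto3's managed transfer layer can drive the whole multipart dance for you; the threshold, chunk size, and concurrency values below are just illustrative:

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# upload_file switches to multipart automatically above the threshold
# and handles part sizing, parallelism, and retries internally.
config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,    # use multipart above 8 MiB
    multipart_chunksize=64 * 1024 * 1024,   # 64 MiB parts
    max_concurrency=8,                      # parallel part uploads
)

s3.upload_file("big-file.mp4", "my-bucket", "videos/big-file.mp4", Config=config)
```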

All in all, the S3 Multipart Upload API provides real advantages for large file uploads where reliability and efficiency are paramount. I've built applications around multipart uploads, and there's something satisfying about the speed and resilience they provide. Much of S3's power comes from how efficiently it handles files of varying sizes and types. You can spend more time crafting your application and less time fussing over transfers getting bogged down.

So, the next time you need to upload large files or assets, weigh the drawbacks of a single monolithic upload. The Multipart Upload API can significantly improve your workflow, making it a worthwhile tool in your belt for a wide range of projects. That way, your uploads stop being a bottleneck and become a well-oiled part of your pipeline.


savas
Joined: Jun 2018