How do you handle large files with S3 Multipart Upload?

Handling large files with S3 Multipart Upload is something I find really efficient once you get the hang of it. A single PUT request to S3 tops out at 5GB, so anything bigger simply can't go up in one shot. That's where multipart upload comes into play. It lets you split the upload into several parts, which you upload individually. Not only does this streamline the process, it also gives you much more resilience against network interruptions.

Let's say you have a massive file, like a high-res video or a large database dump. Instead of uploading it in one go, which could easily fail or time out, I break it down into smaller chunks. Parts can be anywhere from 5MB to 5GB (only the final part is allowed to be under 5MB, and a single upload is capped at 10,000 parts); I often find around 100MB to be a sweet spot. You start the multipart upload with a CreateMultipartUpload request, and the response includes an upload ID, which is essential because you'll pass it with every subsequent request for that upload.
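
Just to make that concrete, here is a minimal boto3 sketch of the initiation step; the bucket and key names are made up for the example:

import boto3

s3 = boto3.client("s3")
bucket, key = "my-backups", "dumps/huge-dump.sql"   # placeholder bucket and key

# CreateMultipartUpload hands back the upload ID used by every later call
resp = s3.create_multipart_upload(Bucket=bucket, Key=key)
upload_id = resp["UploadId"]
print("Upload ID:", upload_id)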

After you have your upload ID, you can start uploading the parts. For each one, I call the UploadPart API with the upload ID and a part number, and I keep track of every part I've successfully uploaded along with the ETag S3 returns for it. You might run into network instability, so it's worth verifying each part after it goes up. If a part fails, you can simply re-upload that specific chunk without restarting the whole upload.
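
Continuing that sketch (same client, bucket, key, and upload ID as above), a simple sequential loop looks something like this; in a real script I'd wrap upload_part in retry logic:

part_size = 100 * 1024 * 1024   # ~100MB chunks, my usual sweet spot
parts = []                      # remember PartNumber + ETag for the completion step

with open("huge-dump.sql", "rb") as f:
    part_number = 1
    while True:
        data = f.read(part_size)
        if not data:
            break
        # UploadPart ties this chunk to the session via UploadId and PartNumber
        resp = s3.upload_part(
            Bucket=bucket,
            Key=key,
            PartNumber=part_number,
            UploadId=upload_id,
            Body=data,
        )
        parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
        part_number += 1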

By doing this, you can also take advantage of parallel uploads. Let’s say you have 8 parts to upload; I often spin up multiple threads or processes to upload them concurrently. This drastically reduces the total time taken for the upload. If you’re using a language like Python, you can utilize libraries like "boto3" that support threading.
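
If you'd rather not juggle threads yourself, boto3's transfer manager can do the chunking and concurrency for you. This is just a rough sketch with placeholder names and my usual 100MB / 8-thread settings:

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Anything over 100MB gets split into ~100MB parts uploaded on 8 threads
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,
    multipart_chunksize=100 * 1024 * 1024,
    max_concurrency=8,
)

s3.upload_file("huge-dump.sql", "my-backups", "dumps/huge-dump.sql", Config=config)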

Once you’re confident that all the parts are uploaded, you can call the complete multipart upload API. Here, you’ll need to specify the upload ID and a list of the parts you uploaded, along with their corresponding ETags. The ETag is a unique identifier that S3 generates for each part you upload, and it’s used to verify that the parts were uploaded correctly.
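
The completion step then just hands back the parts list collected earlier; same placeholder names as the sketch above:

# Parts must be listed in ascending PartNumber order, each with its ETag
s3.complete_multipart_upload(
    Bucket=bucket,
    Key=key,
    UploadId=upload_id,
    MultipartUpload={"Parts": sorted(parts, key=lambda p: p["PartNumber"])},
)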

It’s also worth mentioning that while this process is ongoing, you can check the status of your upload using the ListParts API. This is especially useful if you have multiple files being uploaded simultaneously. You can query which parts have been uploaded, and whether any retries are necessary. There's something comforting about being able to monitor this process and manage parts without running into situations where you have to guess.
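
Checking on an in-progress upload is a single call, again assuming the same bucket, key, and upload ID as before:

# Returns up to 1,000 parts per call; pass PartNumberMarker to page through more
resp = s3.list_parts(Bucket=bucket, Key=key, UploadId=upload_id)
for part in resp.get("Parts", []):
    print(part["PartNumber"], part["Size"], part["ETag"])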

One specific situation that made multipart upload shine for me was during a project where I needed to upload log files that outgrew typical limits during peak seasons. I had logs that were several hundred gigabytes large. By implementing multipart uploads, I could analyze and upload smaller sections of these logs individually during the quieter hours, and it saved me from losing progress if a connection dropped.

If you ever import data into S3 regularly, leveraging lifecycle rules can also be a good way to manage those large files. For example, you can set rules to transition files into cheaper storage classes after a specific period. It's invaluable to optimize costs while managing these file sizes. You don’t want to blindly leave everything in the standard tier if they aren’t accessed frequently.
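
As an illustration, a lifecycle rule that tiers older uploads down to cheaper storage could look like this; the bucket name, prefix, and day counts are just assumptions you'd tune for your own data:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-backups",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-old-uploads",
                "Filter": {"Prefix": "dumps/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)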

Lastly, monitor your uploads closely through AWS CloudTrail or S3 server access logs. If you're handling sensitive data, those logs tell you who accessed what and when. Knowing where your data is at any point offers peace of mind, and having the logs on hand makes it much easier to troubleshoot any snafus that pop up during the upload process.
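
Turning on server access logging is one call as well; the target bucket and prefix below are placeholders, and the target bucket needs to grant S3 log delivery permissions:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_logging(
    Bucket="my-backups",
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "my-backups-logs",   # made-up log bucket
            "TargetPrefix": "s3-access/",
        }
    },
)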

If you run into any issues along the way, like partial uploads, the AbortMultipartUpload operation comes in handy. You can use it to free up storage and avoid incurring costs for parts of an upload that will never be completed. I once had a session go wrong with multiple parts already uploaded, and I couldn't complete it because of an application error. Aborting the upload cleaned out the orphaned parts and kept things tidy.
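
You can script that cleanup too. This sketch finds whatever multipart sessions are still open on a (made-up) bucket and aborts them, so be careful where you point it:

import boto3

s3 = boto3.client("s3")

resp = s3.list_multipart_uploads(Bucket="my-backups")
for upload in resp.get("Uploads", []):
    # Aborting deletes the already-uploaded parts and stops further storage charges
    s3.abort_multipart_upload(
        Bucket="my-backups",
        Key=upload["Key"],
        UploadId=upload["UploadId"],
    )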

To sum it all up, using multipart upload for large files in S3 simplifies your workflow considerably. The flexibility it provides not only saves time and effort but also significantly improves reliability. Large-file projects get much easier when you can retry small sections rather than redoing an entire upload if something goes awry.

I'm always keen to share more experiences or dive deeper into code examples if you want. You can create your upload, manage parts efficiently, and automate the whole process without constant worry. Running uploads in parallel, tracking each part, and using lifecycle management for cost control only takes some initial setup effort. I'm sure you'll see plenty of benefit from building these practices into your projects. Once you get a handle on it, you'll wonder how you ever managed large uploads without it!


savas