You really can amp up your S3 upload performance by using parallel requests, and there are a few key aspects to focus on. I often see users relying on sequential uploads without realizing that this can bottleneck their throughput significantly, especially when dealing with large files or numerous smaller files. I want to help you grasp how to leverage parallel uploads effectively, which involves threading, chunking, and some smart use of the SDKs, depending on the programming language you're working with.
First things first, if you’re uploading a single file, consider breaking that file down into smaller chunks. Amazon S3 supports multipart uploads, which let you upload large files efficiently by splitting them into parts. Multipart is mandatory once a file exceeds 5 GB (the limit for a single PUT), but I’ve found it useful for much smaller files too, since it allows for better error handling and can speed up the process. When I use multipart uploads, I typically reach for the SDK that fits my programming environment, like boto3 if I’m in Python or the AWS SDK for JavaScript, and invoke the multipart upload methods.
For example, if I’m working with boto3 in Python, I can initialize a multipart upload like this:
import boto3

s3_client = boto3.client('s3')

# Start the multipart upload; the returned UploadId ties all the part uploads together
response = s3_client.create_multipart_upload(Bucket='mybucket', Key='mylargefile.dat')
upload_id = response['UploadId']
At this stage, I would split my file into parts, making sure each part is at least 5 MB (the last part can be smaller). For instance, if I’m uploading a 100 MB file, I could break it up into ten 10 MB chunks. I would then hand each part to a separate worker thread so the parts upload concurrently. This is where the parallel aspect shines: every thread can be uploading a different part at the same time, drastically increasing the speed of the entire upload.
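As a rough sketch of the chunking step, here is how I might read the file part by part; the part size is just an example, and the get_part_data and num_parts names are placeholders that I reuse in the threading example below:

import os

PART_SIZE = 10 * 1024 * 1024  # 10 MB per part; every part except the last must be at least 5 MB
FILE_PATH = 'mylargefile.dat'

file_size = os.path.getsize(FILE_PATH)
num_parts = -(-file_size // PART_SIZE)  # ceiling division: how many parts we'll end up with

def get_part_data(part_number):
    # Part numbers start at 1; read only the byte range belonging to this part
    offset = (part_number - 1) * PART_SIZE
    with open(FILE_PATH, 'rb') as f:
        f.seek(offset)
        return f.read(PART_SIZE)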
I usually have to manage these threads properly, so I’m not overwhelming my system or violating AWS's request rate limits. AWS has limits on the number of requests you can send per second, but by controlling the concurrency, I can maximize the upload throughput without running into throttling issues. I write my threading logic with a basic thread pool, which tends to keep things clean and manageable.
Here’s a quick example using Python’s "concurrent.futures" module to manage that threading efficiently:
from concurrent.futures import ThreadPoolExecutor

def upload_part(part_number, part_data):
    # Upload a single part and return its ETag, which S3 needs to assemble the final object
    response = s3_client.upload_part(
        Bucket='mybucket',
        Key='mylargefile.dat',
        PartNumber=part_number,
        UploadId=upload_id,
        Body=part_data
    )
    return response['ETag']

with ThreadPoolExecutor(max_workers=5) as executor:
    # Map each future to its part number; get_part_data and num_parts come from the chunking sketch above
    futures = {executor.submit(upload_part, i, get_part_data(i)): i for i in range(1, num_parts + 1)}
In this example, the pool works through the parts with up to five uploads in flight at once. I typically set the "max_workers" parameter based on the throughput I've observed in my past uploads, keeping in mind the request rates AWS will tolerate. This gives me a decent blend of speed and resource usage.
After all parts are uploaded, I assemble them using the "complete_multipart_upload" method, passing in all the ETags returned by the uploads. Here’s how that looks:
# futures maps each future to its part number, so unpack (future, part_number) when collecting results
parts = [{'ETag': future.result(), 'PartNumber': part_number} for future, part_number in futures.items()]
parts.sort(key=lambda p: p['PartNumber'])  # S3 expects the parts listed in ascending PartNumber order
s3_client.complete_multipart_upload(Bucket='mybucket', Key='mylargefile.dat', UploadId=upload_id, MultipartUpload={'Parts': parts})
You should ensure that every part uploaded successfully before completing the upload; otherwise you can end up with an incomplete or corrupted object. If the upload of any part fails, you can abort the multipart upload with "abort_multipart_upload", which also discards the parts already stored so they don't keep accruing storage charges.
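A minimal sketch of that guard, wrapping the same completion logic from above in a try/except (the structure here is just one way to arrange it):

try:
    # future.result() re-raises any exception from a failed part upload
    parts = [{'ETag': future.result(), 'PartNumber': part_number}
             for future, part_number in futures.items()]
    parts.sort(key=lambda p: p['PartNumber'])
    s3_client.complete_multipart_upload(
        Bucket='mybucket',
        Key='mylargefile.dat',
        UploadId=upload_id,
        MultipartUpload={'Parts': parts}
    )
except Exception:
    # Abort so the already-uploaded parts are removed instead of lingering in the bucket
    s3_client.abort_multipart_upload(Bucket='mybucket', Key='mylargefile.dat', UploadId=upload_id)
    raise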
For smaller files or a large number of small files, I have a separate strategy. I still use parallel requests, but I don’t need to worry about chunking the files as much. Instead, I’d leverage multi-threading or asynchronous uploads. If I have a directory with hundreds of files I want to push to S3, from my experience, using a similar thread pool concept works wonders.
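Here's a quick sketch of that pattern for a local directory; the directory name and worker count are just examples:

import os
from concurrent.futures import ThreadPoolExecutor, as_completed

def upload_one(local_path, key):
    # upload_file handles retries and switches to multipart automatically for larger files
    s3_client.upload_file(local_path, 'mybucket', key)
    return key

local_dir = 'data_to_upload'  # example directory full of small files
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [
        executor.submit(upload_one, os.path.join(local_dir, name), name)
        for name in os.listdir(local_dir)
        if os.path.isfile(os.path.join(local_dir, name))
    ]
    for future in as_completed(futures):
        print('uploaded', future.result())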
You can also lean on asyncio in Python. Since boto3 itself is synchronous, I either wrap its calls in executor threads (as in the example below) or switch to an async wrapper such as aioboto3. Either way, I can queue multiple uploads and handle completion as they finish, which keeps my network channel busy without waiting on each request to finish before starting the next one.
Here's a basic example:
import asyncio

async def async_upload(file_path):
    # boto3 is blocking, so hand the upload off to a worker thread from the event loop
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, lambda: s3_client.upload_file(file_path, 'mybucket', file_path))

async def main(file_paths):
    tasks = [async_upload(path) for path in file_paths]
    await asyncio.gather(*tasks)

file_paths = ['file1.txt', 'file2.txt', 'file3.txt']  # Paths to your files
asyncio.run(main(file_paths))
In this snippet, I’m queuing all the uploads up front and letting the event loop’s default thread pool work through them, so I’m never stuck waiting on any individual request before starting the next.
Whatever approach I take, I make sure to monitor the upload performance. Tools like AWS CloudWatch can provide insights into the actual upload speeds, error rates, and latency. Having visibility into these metrics lets me tune my process based on real-world performance. For instance, if I see that the upload speed decreases significantly when I have more than a certain number of concurrent threads, I would adjust my "max_workers" down until the performance stabilizes.
It’s essential to account for network bandwidth as well. There’s no point in trying to run 50 parallel uploads if your bandwidth can only handle five realistically. I often perform tests with varied thread counts and file sizes to find that sweet spot based on the network environment I’m working in.
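Here is roughly how I structure that kind of test, re-timing the same batch of files at a few different worker counts (the file list and worker counts are placeholders):

import time
from concurrent.futures import ThreadPoolExecutor

def timed_batch_upload(file_paths, workers):
    # Upload the whole batch at a given concurrency and report the wall-clock time
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as executor:
        list(executor.map(lambda p: s3_client.upload_file(p, 'mybucket', p), file_paths))
    return time.monotonic() - start

for workers in (2, 5, 10, 20):
    elapsed = timed_batch_upload(file_paths, workers)
    print(f'{workers} workers: {elapsed:.1f} s')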
Implementing these strategies has made a noticeable difference in my S3 upload tasks. By parallelizing uploads, chunking large files, utilizing the right threading models, and keeping an eye on performance metrics, I’m able to optimize uploads significantly. I can't stress enough how essential it is to tailor your approach based on your scenario—whether you’re uploading one massive file or dozens of small ones. It’s all about finding that balance and adapting your implementation according to the specific conditions you encounter.