04-05-2022, 08:35 AM
You might find that data transfer to Amazon S3 can be surprisingly slow compared to what you would get with traditional file storage systems, and there are several technical reasons for this. I want to break this down for you, because I've seen it happen across various projects I've worked on, and it's not always obvious at first glance.
A big factor is how S3 operates as an object storage system. Traditional file storage systems, like NAS or SAN, use a file system that's optimized for local access. You have directories and files that you can stream data to and from in a fairly direct way. Think about the performance that comes from having local disks or network-attached storage directly accessible on your local area network: you can expect high IOPS, low latency, and generally better throughput with local file systems, especially when dealing with small files.
On the other hand, S3 is designed for scale and durability, not necessarily for high-speed data access. When you upload files to S3, you're talking to a REST API over HTTP(S) rather than writing to a block device. Every request carries more overhead than the raw stream or block I/O traditional systems use: establishing and securing a connection, signing and validating the request, and the multiple round trips required for operations like listing objects or checking metadata. Each of those actions adds its own milliseconds to your total data transfer time.
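To make that round-trip overhead concrete, here's a rough Python sketch using boto3; the bucket and prefix names are placeholders, and a real workflow would skip the per-object metadata call wherever it could:

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "my-example-bucket"   # placeholder bucket name
PREFIX = "reports/2022/"       # placeholder prefix

# Round trip 1: list the objects under the prefix
listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)

for obj in listing.get("Contents", []):
    # One more round trip per object just to read its metadata
    meta = s3.head_object(Bucket=BUCKET, Key=obj["Key"])

    # And another round trip per object to pull the actual bytes
    body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
```

On a local filesystem that whole loop would be a directory listing plus a handful of reads served by the kernel; against S3, every iteration pays HTTPS latency twice.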
I've noticed that the way S3 handles object storage, where each object is an immutable unit once uploaded, is quite different from the block storage systems you might be used to. If you need to update a file, you're not just writing the changed bytes; you have to re-upload the entire object. That's particularly painful for highly transactional workloads or very large files that need frequent modification. In a traditional file system, appending to or tweaking a file happens in place without resending the whole thing, which saves time and bandwidth.
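Here's a minimal sketch of what "appending" to an S3 object actually looks like with boto3; the bucket, key, and file paths are placeholders, and the point is that the entire object travels in both directions:

```python
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-example-bucket", "logs/app.log"  # placeholders

# Download the entire existing object...
existing = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()

# ...tack the new bytes on locally...
updated = existing + b"one more log line\n"

# ...and re-upload the whole thing, not just the change
s3.put_object(Bucket=BUCKET, Key=KEY, Body=updated)

# Contrast with a local filesystem, where an append touches only the new bytes
with open("/var/log/app.log", "ab") as f:
    f.write(b"one more log line\n")
```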
Another point of interest is the network itself. S3 might offer rapid transfer speeds under ideal conditions, but your actual connection plays a massive role in performance. Whether you're moving large datasets or lots of small files, an internet link rarely compares favorably with a well-configured LAN. Data transiting the public internet is subject to latency, jitter, and packet loss in a way that localized traditional systems usually aren't.
You also have to think about S3's consistency model. Historically, S3 was only eventually consistent for overwrites and deletes, so a freshly updated object wasn't always immediately visible to readers; since December 2020 S3 provides strong read-after-write consistency, though caching or replication layers sitting in front of the bucket can still delay visibility. In a local filesystem scenario, the file is accessible the moment it's written, which is super beneficial for highly interactive applications. If you're building something where immediate visibility is critical, this is a difference worth accounting for up front.
Consider multipart uploads for large files as well. S3 lets you break a large upload into smaller chunks, which you'd think would speed things up since you're sending parts in parallel, and for big objects it usually does. But each multipart upload is a sequence of calls: initiating the upload, uploading the parts, and finally completing the upload. If you're managing a huge number of small files, that per-upload API overhead adds latency compared to simply writing a file over a single connection on a local system.
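To show where those calls come from, here's a stripped-down sketch of the low-level multipart flow in boto3; bucket, key, and file names are placeholders, and in practice boto3's upload_file handles all of this for you:

```python
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-example-bucket", "backups/big-archive.bin"  # placeholders
PART_SIZE = 8 * 1024 * 1024  # 8 MiB parts (S3 requires at least 5 MiB except for the last part)

# Call 1: initiate the upload
mpu = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)

parts = []
with open("big-archive.bin", "rb") as f:
    part_number = 1
    while True:
        chunk = f.read(PART_SIZE)
        if not chunk:
            break
        # Calls 2..N: one request per part
        resp = s3.upload_part(
            Bucket=BUCKET, Key=KEY, UploadId=mpu["UploadId"],
            PartNumber=part_number, Body=chunk,
        )
        parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
        part_number += 1

# Final call: stitch the parts together into one object
s3.complete_multipart_upload(
    Bucket=BUCKET, Key=KEY, UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts},
)
```

For a multi-gigabyte archive that's a good trade; for thousands of tiny files, the bookkeeping requests dominate the actual data.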
Caching is another crucial factor. With traditional file systems, the operating system caches file data in memory at both the application level and the filesystem level (the page cache), so frequently accessed files are served from RAM, drastically improving effective transfer speeds. With S3, you're often pulling data fresh from the cloud because nothing is cached on your local endpoints. Even if you adopt a local caching strategy, it typically requires additional setup and architectural complexity, such as putting a service like AWS CloudFront in front of the bucket.
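As a rough illustration of what a local caching layer buys you, here's a naive sketch with placeholder paths and bucket names; a real setup would need cache invalidation, size limits, and probably a CDN in front:

```python
import os
import boto3

s3 = boto3.client("s3")
CACHE_DIR = "/tmp/s3-cache"          # placeholder cache location
BUCKET = "my-example-bucket"         # placeholder bucket name

def cached_get(key: str) -> bytes:
    # Naive flattening of the key into a local filename
    local_path = os.path.join(CACHE_DIR, key.replace("/", "_"))
    if os.path.exists(local_path):
        # Cache hit: no network round trip at all
        with open(local_path, "rb") as f:
            return f.read()
    # Cache miss: pull from S3 and remember the bytes locally
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(local_path, "wb") as f:
        f.write(body)
    return body
```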
Data formats and serialization also influence transfer performance. Compact binary formats are decidedly more space-efficient than text files or JSON documents, which are larger and take more processing to parse. And when you're transferring lots of small objects, the fixed per-request overhead starts to become a significant bottleneck, because every request adds bytes and round trips on top of the payload itself.
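A quick, contrived comparison shows the kind of format overhead I mean; the record layout here is made up purely for illustration:

```python
import json
import struct

# 1000 fake (id, value) records
records = [(i, i * 0.5) for i in range(1000)]

# Human-readable JSON: field names repeated in every record
as_json = json.dumps(
    [{"id": i, "value": v} for i, v in records]
).encode("utf-8")

# Fixed-width binary: 4-byte int + 8-byte double per record, no padding
as_binary = b"".join(struct.pack("<id", i, v) for i, v in records)

print(len(as_json), "bytes as JSON")     # roughly 25-30 KB
print(len(as_binary), "bytes packed")    # exactly 12 KB
```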
In practice, I've found that compression can alleviate some of these slow transfer issues with S3, but it introduces processing overhead on both the upload and download side. Compressing large files before they hit S3 saves bandwidth and reduces load times, yet you have to weigh that against the time taken to compress them in the first place. If you're constantly shuttling files back and forth, the time spent compressing and decompressing adds up, which can make for a frustrating experience.
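Here's a sketch of the compress-then-upload pattern with gzip and boto3, again with placeholder bucket, key, and file names; whether it pays off depends on how compressible your data is and how often it moves:

```python
import gzip
import shutil
import boto3

s3 = boto3.client("s3")
BUCKET = "my-example-bucket"  # placeholder bucket name

# Compress locally first, trading CPU time for bandwidth
with open("dataset.csv", "rb") as src, gzip.open("dataset.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

# Upload the smaller artifact; ContentEncoding hints that downloads
# should be treated as gzip-compressed CSV
s3.upload_file(
    "dataset.csv.gz", BUCKET, "data/dataset.csv.gz",
    ExtraArgs={"ContentEncoding": "gzip", "ContentType": "text/csv"},
)
```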
Security practices also play a role in how fast your transfers can be. Sending data to S3 over HTTPS is the recommended way to encrypt data in transit, and while that's essential for security, the extra encryption and decryption work adds overhead. A traditional file system sitting behind a secure private network might skip that cost, but data going to S3 usually has to be encrypted for policy or compliance reasons, so some of your raw transfer speed gets spent on TLS.
I've worked with various application scenarios, and one thing that keeps coming up is matching where you store data to your access patterns. If you're dealing with big analytic loads or your application has unpredictable access patterns, an architecture that balances S3 objects with databases or block storage can deliver the performance you want more reliably. You might park bulk or infrequently accessed files on S3 while keeping hot, frequently accessed datasets locally or on a more performant service like EBS.
Database technologies also factor in here. If you're transitioning from traditional systems that rely on databases, remember how directly database I/O can be optimized against traditional storage: stored procedures, cached results, and various relational optimizations all increase throughput. With S3, fewer of those layers exist to speed up immediate data retrieval, which can feel like a performance trade-off.
The bottom line is that while S3 offers scalability and durability, the architectural differences between it and traditional file storage create a performance contrast you should definitely think through when designing any data-intensive application. It's all about understanding the trade-offs, recognizing your specific requirements, and adapting your strategy so you can avoid those common pitfalls around speed.