Why is S3 not ideal for applications requiring fine-grained low-latency file operations?

#1
08-29-2020, 03:08 PM
I see your question about S3 and its limitations for applications needing fine-grained, low-latency file operations. It's an interesting point because many folks overlook how S3's architecture can become a bottleneck depending on what you're trying to accomplish. When I think about using S3, the first thing that comes to mind is its object storage nature. I get that it's brilliant for massive storage, but that design changes how you have to approach file operations.

You have to realize that S3 is designed to handle large quantities of unstructured data, and it excels at managing and retrieving those large blobs efficiently across its distributed architecture. But that means it's not really built for the quick, frequently changing read and write operations you might need. For instance, when you're dealing with applications that require rapid updates to files, every millisecond counts. With S3, you often wind up waiting longer than necessary, especially when you're working with large datasets or issuing frequent object writes.

That brings us to data access patterns. With S3, you’re typically dealing with the entire object every time you read or write. Let's say you’re trying to update a small piece of information within a large file stored on S3. You can’t just change that snippet. Instead, you need to download the entire object, make the change locally, and then write it back to S3. I imagine that sounds cumbersome, right? You could be looking at significant data transfer times, latencies, and increased costs all because you have to manage the whole object instead of just its components.
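
Here's a minimal sketch of that whole-object read-modify-write cycle in Python with boto3; the bucket and key names are placeholders I made up for illustration. One nuance worth knowing: byte-range GETs can spare you the full download on the read side, but there is no partial write, so the upload always replaces the entire object.

```python
# Minimal sketch of the whole-object read-modify-write cycle on S3.
# "my-bucket" and "big-config.json" are hypothetical placeholders.
import json
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "big-config.json"

# 1. Download the ENTIRE object, even though we only need one field.
body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
doc = json.loads(body)

# 2. Make a tiny local edit.
doc["last_updated"] = "2020-08-29"

# 3. Upload the ENTIRE object again; S3 has no partial-write operation.
s3.put_object(Bucket=BUCKET, Key=KEY, Body=json.dumps(doc).encode("utf-8"))
```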

Consider a collaborative document editing application where users frequently update small snippets of text. If you're using S3 as your backing store, you'll end up with inefficiencies because every minor change requires a complete download and upload, introducing latency that can frustrate users. You want a setup where you can update the document at a granular level, and S3 simply doesn't provide that ability. That's why many turn to file systems like EFS or even block storage solutions like EBS for those specific workloads.

I also want to touch on consistency models. S3 offers eventual consistency for overwrite PUTs and DELETEs, and while that doesn't usually pose a problem for data analysis or large-scale backup applications, it can be a pain if you need immediately consistent reads after a write. Imagine you're developing software where users submit data rapidly and you need immediate feedback for validation. You find yourself in a pickle when you write to S3, the system tells you the data is there, but it hasn't fully propagated yet. That's where you stand to lose trust in the data immediately available to you.
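
To make that concrete, here's a hedged sketch of the kind of polling loop you end up writing just to confirm an overwrite is actually visible. The bucket, key, and retry parameters are all made-up placeholders, not anything prescribed by AWS.

```python
# Sketch of read-after-overwrite polling; pure overhead from the
# application's point of view. Names below are hypothetical.
import time
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "state.json"

new_body = b'{"status": "validated"}'
put_resp = s3.put_object(Bucket=BUCKET, Key=KEY, Body=new_body)
expected_etag = put_resp["ETag"]  # ETag of the bytes we just wrote

# Poll until a read actually reflects our write.
for attempt in range(10):
    head = s3.head_object(Bucket=BUCKET, Key=KEY)
    if head["ETag"] == expected_etag:
        break  # our write is now visible
    time.sleep(0.2 * (attempt + 1))  # back off and retry
else:
    raise RuntimeError("write still not visible after polling")
```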

Another element worth noting is the API call overhead, which can add even more latency. S3 uses a RESTful API, which works well for its intended use case, but if you're making numerous quick calls, like in a microservices architecture where you're constantly accessing or modifying small files, you might run into lag from the overhead of initiating HTTP requests. Every API call has an overhead, and the more fragmented or granular your operations become, the more that delay is amplified. If you're looking to batch operations together, that's another place where S3 falls short: apart from multipart uploads and bulk management jobs, each data-plane read or write is its own request, unlike file systems optimized for batched I/O.
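
A quick back-of-the-envelope way to see that per-request overhead is to time many small GETs against one consolidated GET. The bucket and key layout below are assumptions for illustration only.

```python
# Rough timing sketch: N small GETs vs one larger GET, to make the
# per-request overhead visible. Bucket/keys are hypothetical.
import time
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"

start = time.perf_counter()
for i in range(100):
    # Each call pays connection/HTTP/auth overhead regardless of size.
    s3.get_object(Bucket=BUCKET, Key=f"fragments/part-{i:04d}")["Body"].read()
many_small = time.perf_counter() - start

start = time.perf_counter()
# One request for roughly the same total number of bytes.
s3.get_object(Bucket=BUCKET, Key="fragments/combined")["Body"].read()
one_large = time.perf_counter() - start

print(f"100 small GETs: {many_small:.2f}s, 1 combined GET: {one_large:.2f}s")
```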

Let's talk about throughput. While S3 has high throughput capabilities for large objects, if you need to perform fast sequential reads and writes, you can hit a wall. Most block storage options or even local file systems provide predictable performance with lower latency because they interact with files directly, sidestepping the networking and HTTP overhead. For many applications that need to churn through data rapidly—think video processing or high-frequency trading strategies—you’ll generally want something that can work closer to the hardware and provide low-latency access.

You might think about environments with dynamic file needs, like a gaming application that constantly reads and writes asset data. If you store all those assets in S3, you could face latency that would definitely degrade the user experience. Game assets frequently change during runtime, so accessing them directly over an object storage API would slow down load times. A better approach would be to maintain them in a file system with in-memory caching so you can pull them exactly when you need them without the typical overhead.
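
As a rough illustration of that caching idea, here's a minimal in-process cache in front of S3. The bucket and the function are hypothetical, and a real implementation would need invalidation and size limits.

```python
# Minimal in-process cache so each asset hits S3 at most once.
# "game-assets" and get_asset() are made-up names for this sketch.
import boto3

s3 = boto3.client("s3")
BUCKET = "game-assets"
_cache = {}  # asset key -> bytes

def get_asset(key):
    """Return asset bytes, going to S3 only on a cache miss."""
    if key not in _cache:
        _cache[key] = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    return _cache[key]
```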

If you're a data scientist, you might also encounter S3's shortcomings when preparing large datasets for machine learning. Many of those workflows involve not just reading large amounts of data, but various transformations and iterations that require quick access to different parts of the dataset. Using something like S3 for a transformation pipeline can become a severe bottleneck when you're iterating frequently on model training, where you want to access, modify, and quickly write back small parts of your dataset.

Performance-wise, I feel it's also essential for you to be aware that S3 has different performance characteristics based on how you're accessing your objects. For example, while the throughput can be strong for larger objects, trying to hit S3 with lots of small files can become a nightmare, not to mention that LIST requests return at most 1,000 keys per page, so large prefixes force pagination and add complexity in managing those files. If your application breaks tasks down into many small files for quick manipulation, you'll quickly find yourself falling into performance traps as you queue operations against S3.
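
For what that looks like in practice, here's a short sketch using boto3's paginator; each page is a separate HTTP round trip capped at 1,000 keys. The bucket and prefix are placeholders.

```python
# Listing many small keys means paginating: <= 1,000 keys per round trip.
# "my-bucket" and "chunks/" are hypothetical.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

count = 0
for page in paginator.paginate(Bucket="my-bucket", Prefix="chunks/"):
    # Each page is its own request against the LIST API.
    count += len(page.get("Contents", []))
print(f"{count} objects under prefix")
```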

I want to highlight how S3 likely won't integrate well with systems that require synchronous data access. Have you ever dealt with databases that need real-time updates or shared state among multiple clients? In such cases, the lag from S3 would introduce inconsistencies that could lead to corrupted states in your application. I think this is particularly critical in applications like real-time analytics dashboards where you need to reflect updates almost instantaneously.

In summary, what you run into when you examine S3 closely for specific file operations is a series of mismatches between its capabilities and the needs of applications that demand quick, consistent, and granular file manipulations. I understand S3 is an excellent option for certain use cases, particularly around archival and large data storage, but it becomes less suitable for low-latency operations you might expect in various modern applications. That’s crucial to keep in mind as you're designing your systems.


savas
Joined: Jun 2018