What are the challenges of using S3 for file locking and concurrent access scenarios?

#1
08-11-2024, 12:56 PM
Understanding the challenges of using S3 for file locking and concurrent access is crucial if you’re considering it for your applications. S3 is built around a model that’s fundamentally about scalability and durability, but that design can create some specific headaches when it comes to managing concurrent file access.

One of the biggest issues with S3 is that it’s an object store, not a traditional file system. With typical file systems, you have mechanisms like file locks or advisory locks that prevent multiple processes from writing to the same file at the same time. With S3, you don’t have that inherent locking behavior. If you have two processes trying to write to the same object, you could end up with one overwriting the other, resulting in data loss or corruption. This is particularly troubling if you have critical data that multiple systems need to work with concurrently.
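There is one relatively new escape hatch worth knowing about: in 2024 S3 added conditional writes, so a PUT can be made to fail if an object already exists at that key. Here's a minimal sketch of a crude lock object built on that, assuming a boto3 recent enough to expose the IfNoneMatch parameter; bucket and key names are hypothetical, and note there's no TTL, so a crashed holder leaves the lock stuck:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "my-app-bucket"              # hypothetical bucket name
LOCK_KEY = "locks/report.pdf.lock"    # sentinel object acting as the lock

def try_acquire_lock() -> bool:
    """Create the lock object only if no object exists at that key."""
    try:
        # IfNoneMatch="*" makes the PUT fail with HTTP 412 if the key
        # already exists, so only one concurrent caller can create it.
        s3.put_object(Bucket=BUCKET, Key=LOCK_KEY, Body=b"", IfNoneMatch="*")
        return True
    except ClientError as e:
        code = e.response["Error"]["Code"]
        if code in ("PreconditionFailed", "ConditionalRequestConflict"):
            return False  # another process holds the lock
        raise

def release_lock() -> None:
    s3.delete_object(Bucket=BUCKET, Key=LOCK_KEY)
```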

Let’s say you have an application that allows users to upload files to S3. If two users happen to upload a file with the same key, say both upload a version called "report.pdf", both uploads will succeed, but the second will silently replace the first. What you end up with is a situation where users might think they're accessing their own unique versions of a file when, in reality, what's stored in S3 is determined by the timing of those upload requests.

Now, you can get around this by implementing some sort of naming convention that incorporates user IDs, timestamps, or some unique identifiers, but that can complicate how you manage and organize your files. You need to ensure that every user interaction produces a unique key, which can become cumbersome, especially as the scale of your application grows.
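For the naming-convention approach, something as simple as a per-user prefix plus a timestamp and random suffix avoids collisions; the layout below is just one hypothetical scheme:

```python
import time
import uuid

def unique_key(user_id: str, filename: str) -> str:
    """Build a collision-resistant object key: per-user prefix,
    millisecond timestamp, and a random suffix."""
    return f"uploads/{user_id}/{int(time.time() * 1000)}-{uuid.uuid4().hex}/{filename}"

# e.g. uploads/alice/1715600000000-3f9c.../report.pdf
```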

Another significant point that gets tricky is managing read and write consistency. For years, S3 was only eventually consistent for overwrite PUTs and DELETEs: after a successful write, a read could still return the old version until the change propagated. Since December 2020, S3 provides strong read-after-write consistency for all GET, PUT, and LIST operations in every region, so that particular failure mode is gone. What strong consistency does not give you, though, is coordination between writers: if two clients write to the same key at nearly the same time, the last write wins, and readers during that window can observe either version. For applications that depend on coordinated, real-time updates, this is still a serious problem.

Think about a scenario in which I have a multi-step workflow that processes uploaded data. Even with strong read-after-write consistency, a later step can't assume the object is still the one an earlier step saw, because another writer may have replaced it in between. You might mitigate this by pinning each step to a specific object version, or by incorporating additional checks and coordination systems, but that just adds complexity on top of your architecture.
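If versioning is enabled on the bucket, one low-tech mitigation is to capture the VersionId once and have every later step read that pinned version. A minimal sketch with boto3, using hypothetical bucket and key names:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-workflow-bucket"    # hypothetical
KEY = "uploads/report.pdf"

# Step 1: capture the exact version the workflow should operate on.
# head_object only returns a VersionId if bucket versioning is enabled.
head = s3.head_object(Bucket=BUCKET, Key=KEY)
pinned_version = head["VersionId"]

# Later steps read that pinned version, not "whatever is latest now",
# so a concurrent overwrite cannot change the data mid-workflow.
obj = s3.get_object(Bucket=BUCKET, Key=KEY, VersionId=pinned_version)
data = obj["Body"].read()
```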

You also need to consider how you handle conflicts. Imagine a scenario where two processes are making updates to a file in S3 almost simultaneously. Both might read an earlier version, make their changes, and save them back. The second process, unaware of the changes made by the first, inadvertently overwrites crucial data. This is the classic lost-update problem. To handle it, I might need to enforce optimistic version checks in my application layer, where I verify the object hasn't changed before I write. But with S3, managing that becomes overhead, since S3 itself doesn't track object versions unless you explicitly enable versioning for the bucket. Even with versioning enabled, reconciling the versions and deciding what to do when conflicts arise complicates my application logic significantly.
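Here's a check-then-write sketch using ETags, again with hypothetical names. Note this only narrows the race window; it is not atomic, since another writer can still slip in between the check and the PUT. Newer S3 releases also accept an If-Match precondition on PUT, which would close the window server-side, but I'm treating that as optional here:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-app-bucket"     # hypothetical
KEY = "shared/config.json"

# Read the object and remember the ETag our changes are based on.
obj = s3.get_object(Bucket=BUCKET, Key=KEY)
base_etag = obj["ETag"]
data = obj["Body"].read()

new_data = data + b"\n# edited"   # ... apply our changes ...

# Re-check before writing. This narrows the lost-update window but is
# NOT atomic without a server-side precondition.
current = s3.head_object(Bucket=BUCKET, Key=KEY)
if current["ETag"] != base_etag:
    raise RuntimeError("Object changed since we read it; re-read and retry")
s3.put_object(Bucket=BUCKET, Key=KEY, Body=new_data)
```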

There are also implications for performance that you might not consider at first glance. Every S3 access is an HTTP request, so first-byte latency is measured in tens of milliseconds rather than the microseconds of a local file system, and that can affect the speed and responsiveness of your application. If I'm working with a web application that requires constant access to frequently updated files, I might experience delays in retrieving the necessary objects, which can impact user experience. This isn't just a minor headache; it can lead to significant performance bottlenecks, especially under high concurrency.

Another layer of complexity arises when you think about visibility and notification. If you're trying to build an application where users receive real-time updates when a file is modified, you can't rely solely on polling S3. S3 does provide Event Notifications, but you have to wire them to a target like SQS, SNS, Lambda, or EventBridge, and setting all that up requires additional work that can introduce its own issues. Delivery is at least once and can lag, so you might notify a user about an update that isn't yet reflected in the (possibly cached) data they're seeing.
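Configuring notifications is itself a concrete step. A sketch that routes object-created and object-removed events to an SQS queue; the bucket name and queue ARN are hypothetical, and the queue's access policy must separately allow S3 to send to it:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="my-app-bucket",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": "arn:aws:sqs:us-east-1:123456789012:object-events",
                "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
            }
        ]
    },
)
```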

Think about logging as well. In a concurrent access scenario, if multiple processes are logging their actions, how do I consolidate those logs into a single timeline? If entries aren't ordered correctly, it can be hard to diagnose why a particular inconsistency occurred. Collecting and merging those logs requires extra orchestration, which chips away at the simplicity that drew you to S3 in the first place.
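If each process at least emits timestamped records, merging the per-process streams into one timeline is straightforward, with the caveat that it trusts the workers' clocks to be reasonably synchronized. A small sketch:

```python
import heapq

# Each worker emits records like (timestamp, process_id, message), already
# ordered within its own stream. Merging the streams yields one timeline.
def merge_logs(*streams):
    yield from heapq.merge(*streams, key=lambda record: record[0])

timeline = list(merge_logs(
    [(1.0, "w1", "read report.pdf"), (3.0, "w1", "wrote report.pdf")],
    [(2.0, "w2", "read report.pdf"), (4.0, "w2", "wrote report.pdf")],
))
# The merged view makes the lost update visible: w2 read before w1's
# write landed, then w2's write at t=4.0 clobbered w1's at t=3.0.
```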

If I decide to implement caching in front of S3 to mitigate some latency and improve performance, it introduces even more variables into the equation. Data consistency between my cache and the S3 objects becomes important. If I cache an object to improve performance but neglect to invalidate that cache when an object gets updated, I might serve stale data to users. Creating a cache invalidation strategy can turn into a headache, especially if updates happen frequently.
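One simple shape for this is to validate cache entries against the object's ETag, trading a HEAD round trip per read for freshness. A sketch with hypothetical names:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-app-bucket"  # hypothetical
_cache = {}               # key -> (etag, body)

def get_cached(key: str) -> bytes:
    """Serve from cache only while the object's ETag is unchanged."""
    cached = _cache.get(key)
    if cached:
        etag, body = cached
        head = s3.head_object(Bucket=BUCKET, Key=key)
        if head["ETag"] == etag:
            return body  # still fresh
    # Miss or stale: refetch and update the cache.
    obj = s3.get_object(Bucket=BUCKET, Key=key)
    body = obj["Body"].read()
    _cache[key] = (obj["ETag"], body)
    return body
```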

Another major challenge is handling multipart uploads. S3 lets you upload a large object in several parts, which is great for performance, but coordinating those parts invites its own issues. The parts can be uploaded in parallel and in any order, but every part must succeed and be listed correctly when you complete the upload. If one part fails, I have to handle retries, and if the whole upload is abandoned, the already-uploaded parts linger (and cost storage) until they're aborted or cleaned up by a lifecycle rule. Meanwhile, other processes writing to the same key simply race me: whoever completes last wins. This can leave me with surprising results if my application isn't robust enough.
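A sketch of the standard create/upload/complete flow with simple per-part retries and an abort on failure so parts don't linger; bucket and key names are hypothetical:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET, KEY = "my-app-bucket", "big/dataset.bin"  # hypothetical
PART_SIZE = 8 * 1024 * 1024  # parts must be >= 5 MiB (except the last)

def multipart_upload(data: bytes) -> None:
    upload_id = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)["UploadId"]
    try:
        parts = []
        for i in range(0, len(data), PART_SIZE):
            part_number = i // PART_SIZE + 1
            for attempt in range(3):  # simple per-part retry
                try:
                    resp = s3.upload_part(
                        Bucket=BUCKET, Key=KEY, UploadId=upload_id,
                        PartNumber=part_number, Body=data[i:i + PART_SIZE],
                    )
                    parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
                    break
                except ClientError:
                    if attempt == 2:
                        raise
        # The object only becomes visible once the upload is completed.
        s3.complete_multipart_upload(
            Bucket=BUCKET, Key=KEY, UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Abort so incomplete parts don't linger and accrue storage costs.
        s3.abort_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=upload_id)
        raise
```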

We also have to talk about access management. When multiple users or processes are hitting S3 in parallel, you need to ensure that permissions are managed correctly. IAM policies can help with this, but they can also become a tangle of rules that are hard to manage across teams. If you misconfigure permissions, you risk unauthorized access or, conversely, lock out legitimate users at crucial times.
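One way to keep the rules legible is to scope each principal to its own prefix rather than granting bucket-wide write access. A sketch applying such a policy with boto3; the account ID, role, and bucket names are hypothetical:

```python
import json
import boto3

s3 = boto3.client("s3")

# Scope the uploader role to its own prefix instead of the whole bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "UploaderWritesOwnPrefixOnly",
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::123456789012:role/uploader"},
        "Action": ["s3:PutObject", "s3:GetObject"],
        "Resource": "arn:aws:s3:::my-app-bucket/uploads/uploader/*",
    }],
}
s3.put_bucket_policy(Bucket="my-app-bucket", Policy=json.dumps(policy))
```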

You might consider leaning on AWS Lambda to handle some pre- or post-processing when dealing with files, but that introduces a whole new layer of complexity around invocation, timeouts, and error handling. If a trigger doesn't fire, fires twice, or a function times out partway through, my process could fail halfway, leaving a mess behind that's tough to clean up.
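A defensive shape for such a handler processes records one at a time and tolerates objects that have already vanished. The process() step is a hypothetical placeholder, and whatever it does needs to be idempotent, because S3 events can be delivered more than once:

```python
import urllib.parse
import boto3

s3 = boto3.client("s3")

def process(data: bytes) -> None:
    """Hypothetical, idempotent processing step."""
    pass

def lambda_handler(event, context):
    # S3 events are delivered at least once, so a retried invocation
    # may hand us the same record twice.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in event payloads.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        try:
            obj = s3.get_object(Bucket=bucket, Key=key)
        except s3.exceptions.NoSuchKey:
            continue  # the object may already be gone by the time we run
        process(obj["Body"].read())
```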

Eventually, as your needs evolve, you might reach a point where S3 starts to feel like a misfit for your file locking requirements and concurrent access scenarios, forcing you either to build extensive infrastructure around it or to rethink your approach entirely and consider alternatives better suited for collaborative or concurrent workloads, such as EFS, FSx, or a database. Often this isn't just about technology but about understanding what's best aligned with what you're trying to achieve as a developer.


savas