What are the limitations when using S3 for storing large datasets?

#1
10-31-2021, 06:52 AM
You might think that using S3 for large datasets is a no-brainer due to its scalability and durability, but there are specific limitations that can impact how efficiently you can manage your data. I’ve worked with S3 a lot, and there are a few things I’ve learned the hard way that I think you should be aware of.

The first limitation you’ll run into is performance, specifically latency and throughput. S3 operates over HTTP, which means every single request you make, whether it’s a PUT, GET, or LIST, incurs network overhead. If you’re dealing with a massive dataset, the time spent waiting on data retrieval adds up fast. For example, if I’m working with a dataset that contains thousands of small files, each GET request adds latency, and firing off simultaneous fetches doesn’t necessarily speed everything up; I often find they just saturate my network connection instead. When you need to access files quickly, it’s worth reconsidering how you structure your data: instead of many tiny files, there’s often a case for batching your data into larger objects to reduce the number of requests.
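As a rough illustration, here’s a minimal sketch, assuming boto3 and a hypothetical bucket name, of bundling a batch of small files into one compressed archive so the whole batch becomes a single PUT (and later a single GET) instead of thousands:

```python
import io
import tarfile
import boto3

s3 = boto3.client("s3")
BUCKET = "my-dataset-bucket"  # hypothetical bucket name


def upload_batched(local_paths, key):
    """Bundle many small files into one tar.gz archive and upload it as a
    single object, trading thousands of tiny PUTs for one larger one."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for path in local_paths:
            tar.add(path)
    buffer.seek(0)
    s3.put_object(Bucket=BUCKET, Key=key, Body=buffer.getvalue())


# e.g. upload_batched(["logs/a.json", "logs/b.json"], "batches/2021-10-31.tar.gz")
```

The trade-off is that you lose random access to individual files inside a batch, so pick a batch boundary (per hour, per partition) that matches how you actually read the data.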

Another thing to keep in mind is consistency. S3 was historically eventually consistent for overwrite PUTs and DELETEs, but since December 2020 it provides strong read-after-write consistency within a region, so the classic stale-read-right-after-upload problem is largely gone. What can still bite you with large datasets is replication: cross-region replication is asynchronous, so if you write in one region and read the replica in another, the data you’re pulling might not reflect recent uploads for a short amount of time. For example, if I store logs in one region and a downstream job immediately reads the replicated copy, I sometimes have to deal with discrepancies because the latest writes haven’t propagated yet. Understanding how these propagation delays affect your data access patterns is crucial, especially when you depend on having accurate, up-to-date information.

Cost is another area where limitations might bite you in the rear. S3’s pricing is tiered based on storage, requests, and data transfer. Over time, I’ve seen costs stack up, especially when you’re pulling data frequently or have high replication needs across multiple regions. If you're querying large datasets frequently, consider the cost implications of constantly retrieving data since every GET and LIST can start to add up. You might want to think about how to minimize unexpected expenses, perhaps by optimizing your data retrieval patterns.
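If you want a feel for where the money goes, a back-of-the-envelope estimate like the sketch below helps. The rates are illustrative approximations of us-east-1 list prices and will drift over time, so treat them as assumptions and check the current S3 pricing page before budgeting anything real:

```python
# Illustrative, assumed rates -- roughly us-east-1 list prices; verify before use.
STORAGE_PER_GB_MONTH = 0.023   # S3 Standard, first tier
GET_PER_1000 = 0.0004          # GET requests
PUT_LIST_PER_1000 = 0.005      # PUT/COPY/POST/LIST requests
TRANSFER_OUT_PER_GB = 0.09     # data transfer out to the internet


def monthly_estimate(stored_gb, gets, puts, egress_gb):
    """Very rough monthly S3 bill for a given usage profile."""
    return (stored_gb * STORAGE_PER_GB_MONTH
            + gets / 1000 * GET_PER_1000
            + puts / 1000 * PUT_LIST_PER_1000
            + egress_gb * TRANSFER_OUT_PER_GB)


# 10 TB stored, 50M GETs, 1M PUTs, 2 TB egress in a month:
print(f"${monthly_estimate(10_000, 50_000_000, 1_000_000, 2_000):,.2f}")
```

Even a crude model like this makes it obvious when request volume or egress, rather than storage itself, is driving the bill.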

Another thing you need to consider is the size of the individual objects you’re dealing with. While S3 allows you to store objects up to 5 TB, a single PUT tops out at 5 GB, so large files have to go through multipart upload, which you’d want anyway for resilience and performance. If a network interruption occurs during the upload, you can resume from the last completed part instead of resending the whole file, but tracking and retrying parts yourself can be quite the hassle when you’re uploading massive datasets. I’ve personally lost considerable time in these scenarios, and it’s something you need to account for in your workflow.
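The SDKs can take most of that pain away by managing parts and retries for you. Here’s a minimal boto3 sketch, with hypothetical file and bucket names, that tells the transfer manager to split anything over 100 MB into parts and upload them concurrently:

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Split files larger than 100 MB into 100 MB parts and upload 8 parts at a time.
# A failed part is retried on its own instead of resending the whole file.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,
    multipart_chunksize=100 * 1024 * 1024,
    max_concurrency=8,
    use_threads=True,
)

s3.upload_file(
    "big-dataset.parquet",   # hypothetical local file
    "my-dataset-bucket",     # hypothetical bucket
    "raw/big-dataset.parquet",
    Config=config,
)
```

One housekeeping note: aborted multipart uploads leave orphaned parts behind that you keep paying for until they’re aborted, so it’s worth having a lifecycle rule or periodic cleanup for stale uploads.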

Think about regional limitations when you’re storing and processing your datasets. S3 lets you choose from various geographic regions for your storage, but data stays in the region you put it in, and clients far from that region pay a noticeable latency penalty. If your application demands rapid access to data stored in multiple locations, replicating your datasets across regions incurs additional storage and transfer costs. At one point, I had a project that required low-latency access across the U.S. and Europe and ended up needing to synchronize data across regions manually, which added complexity and management overhead to the whole process.
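S3 does have built-in cross-region replication these days, which beats synchronizing by hand. A hedged boto3 sketch is below; the bucket names, account ID, and role ARN are hypothetical placeholders, and it assumes versioning is already enabled on both buckets and that the role grants S3 permission to replicate on your behalf:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_replication(
    Bucket="datasets-us-east-1",  # hypothetical source bucket
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication-role",  # placeholder
        "Rules": [
            {
                "ID": "replicate-to-eu",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {"Prefix": ""},  # empty prefix = replicate the whole bucket
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": "arn:aws:s3:::datasets-eu-west-1"},
            }
        ],
    },
)
```

Keep in mind replication is asynchronous and you pay for the duplicated storage plus inter-region transfer, so it only makes sense when the latency win is worth it.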

Retrieving data with specific query capabilities can also feel limited in S3. It’s primarily an object store and lacks traditional database query capabilities. If you’re dealing with structured datasets, you might find yourself wishing for more robust query functionality like that of SQL databases. What I found myself doing at one point was staging my data in S3 and then pointing something like Athena at it for querying; Athena reads the objects in place, but it bills by the amount of data scanned per query, which adds another layer to managing the overall budget of your project.
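For reference, kicking off an Athena query from code looks roughly like the sketch below; the database, table, and results bucket are hypothetical, and because Athena bills per data scanned, partitioning and columnar formats matter a lot here:

```python
import boto3

athena = boto3.client("athena")

# Athena reads the objects in place in S3 and writes query results
# to the output location you give it.
response = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS hits FROM access_logs GROUP BY status",
    QueryExecutionContext={"Database": "analytics"},  # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical
)
print(response["QueryExecutionId"])
```

You then poll get_query_execution until the query finishes and read the result set from the output location (or via get_query_results).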

Versioning sounds great, but like everything else, it has its downsides. While S3 offers object versioning, it can lead to unwanted challenges if I don’t manage it properly. Every time you update an object, a new version is created. Over time, if you forget to clean up old versions, you can end up with a surprising amount of duplicated data, which can bloat your storage costs. I’ve seen budgets get wildly off-course just because someone wasn’t keen on maintaining their versioning hygiene.
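One habit that helps is periodically auditing how much space noncurrent versions are quietly eating. A minimal sketch with a hypothetical bucket name:

```python
import boto3
from collections import defaultdict

s3 = boto3.client("s3")
totals = defaultdict(int)

# Walk every version in the bucket and tally current vs. noncurrent bytes.
paginator = s3.get_paginator("list_object_versions")
for page in paginator.paginate(Bucket="my-dataset-bucket"):
    for version in page.get("Versions", []):
        label = "current" if version["IsLatest"] else "noncurrent"
        totals[label] += version["Size"]

print(f"current:    {totals['current'] / 1e9:.1f} GB")
print(f"noncurrent: {totals['noncurrent'] / 1e9:.1f} GB")
```

If the noncurrent number surprises you, a lifecycle rule that expires old versions (there’s an example further down) is the cheapest fix.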

Another point I’ve encountered is integration with other AWS services. While S3 is designed to work seamlessly with the AWS ecosystem, depending on how large your datasets are, you may find yourself facing bottlenecks with services such as Lambda or Glue when trying to process data in S3. For example, if a single Lambda invocation tries to chew through too much data, it runs into the 15-minute execution limit, forcing you to find alternative methods. I’ve had situations where I needed to use batching to overcome those limits, which can add complexity to my workflows.
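One simple batching pattern is to page through the keys yourself and hand them out in fixed-size chunks, so each invocation gets a bounded amount of work. A sketch; the bucket, prefix, and process() handler are hypothetical placeholders:

```python
import boto3

s3 = boto3.client("s3")


def iter_key_batches(bucket, prefix, batch_size=500):
    """Yield object keys in fixed-size batches so each downstream worker
    (a Lambda invocation, a Glue task, etc.) stays under its limits."""
    batch = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            batch.append(obj["Key"])
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch


def process(keys):
    # Placeholder for whatever you actually dispatch to (Lambda, Glue, a worker pool).
    print(f"processing {len(keys)} objects")


for keys in iter_key_batches("my-dataset-bucket", "raw/2021/"):  # hypothetical names
    process(keys)
```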

I also think about security concerns when storing large datasets in S3. You need to manage access controls meticulously. If you have thousands of objects, maintaining the right IAM roles and bucket policies can become burdensome. When you’re dealing with large datasets that require numerous permissions, accidentally misconfiguring something can lead to unintended data exposure or, conversely, difficulties accessing your own data. I’ve been on projects where I had to meticulously document and keep track of access permissions for multiple users, which is tedious but necessary.
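Putting at least some of the guardrails in code keeps this manageable and reviewable. As one small, hypothetical example, here’s a bucket policy applied with boto3 that denies any request not made over TLS; real policies for a team would usually go further and scope prefixes per role:

```python
import json
import boto3

s3 = boto3.client("s3")

# Deny any request to the bucket or its objects that isn't made over HTTPS.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::my-dataset-bucket",   # hypothetical bucket
                "arn:aws:s3:::my-dataset-bucket/*",
            ],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}

s3.put_bucket_policy(Bucket="my-dataset-bucket", Policy=json.dumps(policy))
```

Keeping policies in version control alongside scripts like this also gives you the documentation trail I mentioned without maintaining it by hand.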

Data lifecycle management can also be a headache. S3 does offer lifecycle policies for transitioning objects to cheaper storage classes, but crafting those policies takes a pretty nuanced understanding of your data access patterns. If you’re not careful, you could end up transitioning often-accessed data to a storage class with higher access costs or worse performance. Having lived through some of that trial and error, I know firsthand how important it is to analyze your datasets thoroughly before implementing lifecycle rules.
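Once you have actually analyzed the access patterns, the rules themselves are easy to apply. A hypothetical example that tiers down a rarely-read prefix and, tying back to the versioning point above, also expires old noncurrent versions:

```python
import boto3

s3 = boto3.client("s3")

# Assumed policy: objects under raw/ move to Infrequent Access after 30 days
# and to Glacier after 90; noncurrent versions are deleted after 60 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-dataset-bucket",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-raw-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "NoncurrentVersionExpiration": {"NoncurrentDays": 60},
            }
        ]
    },
)
```

Remember that STANDARD_IA charges a per-GB retrieval fee and Glacier adds restore latency, which is exactly why you want the access analysis before the rule, not after.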

Overall, while S3 is a phenomenal tool for many use cases, it’s not without its limitations. Anytime you’re working with large datasets, you’ll need to be aware of these factors and how they can interact with your workflows. It requires you to approach your design and architecture with a mindful eye on how your data is structured and accessed, what performance needs you have, and what budget constraints you’re working under. I think keeping these limitations in mind will save you a lot of headaches while working with S3.


savas