03-22-2020, 04:00 PM
Retrieving objects from S3 Glacier isn’t as straightforward as pulling data from other S3 storage classes. You need to think through the process carefully because it's built for long-term storage, which means retrieval is more complex and can take a while. I’ll guide you through the steps, the considerations, and what to expect when you’re working on it.
First, you need to understand that S3 Glacier is intended for data that you rarely access. It’s a cost-effective solution for long-term data storage, but that comes with its own trade-offs. When you want to retrieve your objects, you need to choose the retrieval option that meets your urgency and budget since there are three main retrieval types: Expedited, Standard, and Bulk. I find it’s worth weighing these options carefully for each object you want to access.
Let's say you need a critical file that you accidentally archived in Glacier. If timing is a crucial factor, you would opt for the Expedited retrieval. This option allows you to access your data within about 1-5 minutes. That said, this rapid access comes at a premium cost. If you’re retrieving a single small file, this may be a reasonable expense, but for larger datasets or multiple files, you might want to think twice.
When you go the Expedited route, you submit a retrieval request that names the exact object you want. Make sure you get the bucket and key exactly right; otherwise the request will fail or restore the wrong thing. The retrieval request can be submitted through the AWS Management Console, the AWS CLI, or one of the SDKs, depending on your comfort level with those tools. I personally lean towards the CLI for scripting repetitive tasks, but I totally get why someone might prefer the console for its visual interface.
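To give you a concrete picture, an Expedited restore from the CLI looks like this. Treat the bucket and key as placeholders for your own; it's the same restore-object call I show again further down, just with a different tier:

# placeholder bucket and key; Expedited tier, restored copy kept for 1 day
aws s3api restore-object --bucket your-bucket-name --key your-object-key --restore-request '{"Days":1, "GlacierJobParameters": {"Tier": "Expedited"}}'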
Next, if you’re okay with waiting a few hours, the Standard retrieval is your best friend. It’s way cheaper than Expedited and usually returns your files in 3-5 hours. In a lot of scenarios, I find that the Standard option satisfies the retrieval needs without breaking the bank. To initiate this, you’d again send a retrieval request similar to the way you would for an Expedited request. You provide the object’s key, and I strongly recommend that you confirm the bucket name as well.
The real catch comes with the Bulk retrieval, which is intended for large datasets. If you need to retrieve tons of files, Bulk can be a game changer. The cost is significantly lower, but you're looking at a retrieval time of about 5-12 hours or even more, depending on the amount of data you're pulling back. I've worked on projects where quick access wasn't necessary, and in those cases, I've used Bulk retrieval to reap the cost benefits. However, you have to factor in the time it takes; it's definitely not for time-sensitive data.
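When I go the Bulk route for a whole prefix, I script it rather than click through hundreds of objects. Here's a rough sketch; the bucket name and prefix are placeholders, the keys are assumed to contain no spaces, and for truly huge jobs S3 Batch Operations is the better-suited tool:

# placeholder bucket/prefix; issues a Bulk restore for every key under the prefix
for key in $(aws s3api list-objects-v2 --bucket your-bucket-name --prefix archive/2019/ --query 'Contents[].Key' --output text); do
  aws s3api restore-object --bucket your-bucket-name --key "$key" --restore-request '{"Days":3, "GlacierJobParameters": {"Tier": "Bulk"}}'
done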
When submitting a request, especially for Bulk, you'd typically specify the restore parameters, and AWS doesn't just start pulling your data back right away; the restore runs as an asynchronous job in the background. If you're using the native Glacier vault API you get a job ID to poll, whereas with restore-object against an S3 bucket you check the object's restore status instead (I'll show that command further down). Either way, this is where you need to be a bit patient. I usually check back in a few hours to see if the data is ready for downloading.
I can’t stress enough how you need to pay attention to which bucket and prefix you’re working in. Every object in the Glacier storage class lives in a bucket just like any other S3 object, but you want to be sure you’re restoring from the right place. Mistakes happen if you’re not careful, especially when handling similar filenames or versions. Unless you keep a careful log of what you’ve moved to Glacier, it can easily turn into a guessing game of what’s where.
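One way I take the guesswork out of it is to ask S3 directly which objects are sitting in the Glacier storage class. A quick sketch with a placeholder bucket name (objects in Deep Archive show up as DEEP_ARCHIVE instead):

# placeholder bucket name; lists every key whose storage class is GLACIER
aws s3api list-objects-v2 --bucket your-bucket-name --query "Contents[?StorageClass=='GLACIER'].Key" --output text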
You also need to remember that restoring doesn’t actually change the object’s storage class; S3 just makes a temporary copy available alongside the archived original, for however many days you specify in the restore request (that’s the 'Days' parameter you’ll see below). During this window, you can access the file as you would with any standard S3 object, and I’ve found this part particularly useful. If I need to extract several files quickly, I often restore them in one go and then batch process them afterward.
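Once the restore has completed, pulling the temporary copies down is just a normal copy operation. Placeholder names again, and the restore has to have finished first or you'll get an InvalidObjectState error:

# placeholder bucket and paths; works only after the restore completes
aws s3 cp s3://your-bucket-name/archive/report-2019.csv ./report-2019.csv
aws s3 cp s3://your-bucket-name/archive/ ./archive/ --recursive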
One of the quirks of S3 Glacier retrieval is that you’re charged for retrieval requests and the data transferred out to your specified destination. You’ve got to plan your data retrievals a bit like you would for a project plan. Only request what's necessary to avoid unnecessary costs.
In terms of the actual mechanics, you can address the API in a couple of different ways. If you’re comfortable with the AWS SDKs, you can initiate retrieval requests directly from your application using the RestoreObject operation. Each request identifies the object by bucket and key and carries the parameters that outline your retrieval tier and how long the restored copy should stay available. That’s extremely important because if you miss even a small detail, AWS will throw an error, and then it’s back to square one.
For those who prefer the command line, here’s a quick example using the AWS Command Line Interface. You would run something like:
aws s3api restore-object --bucket your-bucket-name --key your-object-key --restore-request '{"Days":1, "GlacierJobParameters": {"Tier": "Standard"}}'
With the 'Days' parameter, you set how long the restored copy stays available before S3 removes it and you’re back to archive-only access. I usually set it for 1 day because that lines up with how I like to work, but it can be adjusted as per your needs.
Once the restore completes, you can have S3 notify you if you’ve set up an event notification for it (I’ll show that in a second), or you can check the object’s restore status with the command:
aws s3api head-object --bucket your-bucket-name --key your-object-key
The response includes a Restore field; while the job is still running it reads ongoing-request="true", and once the copy is available it flips to ongoing-request="false" along with an expiry date.
Diligently working through this part means you can stay on top of your operations without having to manually check every five minutes.
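If you do want the hands-off notification route, S3 can publish an s3:ObjectRestore:Completed event to an SNS topic, and SNS can email you from there. A sketch under a couple of assumptions: the topic ARN is a placeholder, and its access policy already allows S3 to publish to it:

# placeholder bucket and topic ARN; the topic policy must let S3 publish
aws s3api put-bucket-notification-configuration --bucket your-bucket-name --notification-configuration '{"TopicConfigurations":[{"TopicArn":"arn:aws:sns:us-east-1:123456789012:restore-alerts","Events":["s3:ObjectRestore:Completed"]}]}'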
Remember that handling multiple jobs can quickly get cumbersome if you don’t keep tabs on them. I recommend setting up a system to log or track retrieval jobs, especially if you often retrieve data from Glacier. Not only does it help with organization, but it also makes it easier to predict and manage costs.
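Nothing fancy is required; even a plain loop over a key list does the trick. A minimal sketch assuming keys.txt holds one object key per line with no spaces, and the bucket name is a placeholder:

# prints the Restore field for each key; "None" means no restore has been requested
while read -r key; do
  status=$(aws s3api head-object --bucket your-bucket-name --key "$key" --query 'Restore' --output text)
  echo "$key: $status"
done < keys.txt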
Lastly, you need to factor in how often you actually touch this data and how Glacier fits into your broader data strategy. If you find yourself frequently retrieving from Glacier, it might be beneficial to reconsider that data classification. Sometimes, moving files around and keeping them accessible in a more readily available S3 class can save you time and money in the long run.
Once you’ve retrieved and used your data, pay attention to your lifecycle policies. If you find you're making occasional retrievals but also frequently archiving new data, automating the whole transition and retrieval process with lifecycle management can ease future workloads. I’ve had good experiences setting lifecycle rules that automatically transition data based on its age or usage patterns. This can be incredibly handy if you're constantly adding to your data collection and need to clear out older, less critical data.
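As an illustration, a rule like this one transitions everything under a prefix to the Glacier storage class after 90 days and expires it after five years; the bucket name, prefix, and day counts are all placeholders to tune for your own retention needs. Save it as lifecycle.json:

{
  "Rules": [
    {
      "ID": "archive-old-reports",
      "Filter": {"Prefix": "reports/"},
      "Status": "Enabled",
      "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
      "Expiration": {"Days": 1825}
    }
  ]
}

Then apply it to the bucket with:

aws s3api put-bucket-lifecycle-configuration --bucket your-bucket-name --lifecycle-configuration file://lifecycle.json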
In summary, this whole process is certainly not intuitive if you’re used to accessing files from traditional storage systems. With knowledge and experience, you can become quite proficient at retrieving objects from S3 Glacier, turning a potentially time-consuming operation into a manageable task.