Need to create an incremental backup of an R2 bucket

For Workers & Pages, what is the name of the domain?

n/a

What is the error number?

n/a

What is the error message?

none

What is the issue or error you’re encountering?

Trying to take a local backup of an R2 bucket with approx. 3 million images / 400 GB.

What steps have you taken to resolve the issue?

We are concerned that if an attacker gets hold of our R2 API credentials, they could delete files from our bucket. So we’d like to keep a local copy of the bucket as an incremental backup: if an image is deleted, we can retrieve it.
Our first attempt was to mount the bucket in a directory using s3fs and then use restic to back that up to a local repo. However, performance seems terrible. On a small subdirectory used as a test (6,718 files, 1.759 GiB), the sync stalled after the scan stage and has remained inactive for the last 20 minutes.
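For reference, the setup is roughly the following; the bucket name, account ID, mount point and repository path here are placeholders rather than our real values:

```bash
# Mount the R2 bucket over its S3-compatible endpoint with s3fs
s3fs my-bucket /mnt/r2 \
  -o passwd_file=${HOME}/.passwd-s3fs \
  -o url=https://<accountid>.r2.cloudflarestorage.com \
  -o use_path_request_style

# Initialise the restic repo once, then run the backup that stalls after scanning
restic init --repo /backups/r2-repo
restic --repo /backups/r2-repo backup /mnt/r2/test-subdirectory
```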

Q1: Are there any optimizations I can apply so this sync carries on? Has anyone got this method to work?

Q2: What other approaches are people using to take local, incremental copies of their S3 data? I’m sure I’m not the only one. We do have an rclone script running to mirror the bucket to another bucket, but the same problem applies: if an attacker can get onto the machine running that script, they can delete the entire contents of both buckets. So how do we get a local copy?
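That mirror is just a scheduled rclone sync, roughly like this (the remote and bucket names are placeholders):

```bash
# Mirror the production bucket to a second R2 bucket.
# "r2" is an rclone remote configured against the R2 S3 endpoint.
rclone sync r2:prod-images r2:prod-images-mirror --fast-list
```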

What are the steps to reproduce the issue?

n/a

The command to back up the test directory of under 2 GB was still thinking after an hour. I figured it wasn’t going to do anything, so I stopped it.

Using rclone you can clone bucket to bucket if needed, or otherwise download from the bucket to your local device or to some other S3-compatible service.

It might be that you need to check your internet connection speed, or adjust your rclone parameters so you can transfer assets in parallel batches.
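For example, something along these lines; the remote name, bucket and local path are placeholders, and the numbers are just a starting point to experiment with:

```bash
# Pull the bucket to a local directory with more parallelism than the defaults
rclone sync r2:my-bucket /data/r2-copy \
  --transfers 32 \
  --checkers 64 \
  --fast-list \
  --progress
```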

I’d suggest measures to protect against this, such as locking down the machine, using key-based SSH access, closing unneeded ports, creating a new R2 API token with limited or restricted privileges, or creating one which expires after e.g. 24 hours, etc.

Appreciate the reply. I have tried rclone bucket to bucket, and indeed we use that for copying to a staging bucket. But the requirement is to get a local, non-S3 copy, and also incremental backups, as it’s a directory of images that grows by ~1 GB a week and (in theory) should never have any deletions. rclone will make a synced copy locally, but can’t do the incremental part, which is where restic is good.

Restic can use S3 as a target, I gather, but not as a source. I could, I suppose, use rclone to make a local copy and then run restic on the local copy.
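Something like this is what I have in mind; the paths, remote name and repository location are placeholders:

```bash
# Stage 1: incremental pull of the bucket into a plain local directory
rclone sync r2:prod-images /backups/r2-staging --fast-list --transfers 32

# Stage 2: snapshot the local copy with restic, so an object deleted
# upstream is still present in earlier snapshots
restic --repo /backups/restic-repo backup /backups/r2-staging
```

The obvious downside is needing ~400 GB of local staging space on top of the restic repository itself.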

My initial test on 1 GB of data was from Cloudflare’s S3 bucket (which through their CDN appears in France) to a server in a datacentre in Germany. I can get 1 Gb/s download speed to that server, so I don’t think the connection is the problem.

As for the security measures, of course the server is as locked down as possible. Expiring the key is not an option, as it’s in constant use. R2 storage doesn’t have the same granularity of security as Amazon S3, so it’s either read-only or read-write.

> their CDN appears in France to a server in a datacentre in Germany

I think you’ll find that “FRA” in connection logs doesn’t mean FRAnce, but stands for FRAnkfurt - which is probably much closer to you.


I have this kind of transfer running on a daily basis (up to 1–2k images of WordPress media) via rclone to an R2 bucket, with a copy to another bucket, while also keeping the uploaded media on the origin server.
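Roughly like this, run from cron; the paths, remote name and bucket names are placeholders:

```bash
# Daily: push new WordPress media to R2, then copy that bucket to a second one.
# rclone copy (unlike rclone move) leaves the source in place and never
# deletes anything on the destination.
0 3 * * *  rclone copy /var/www/html/wp-content/uploads r2:media-bucket --fast-list
30 3 * * * rclone copy r2:media-bucket r2:media-bucket-copy --fast-list
```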