Accessing R2 from Databricks

Hi all,

I’m trying to read a dataset in R2 from Databricks, but I’m running into an error (full log below). It looks like an incompatibility between R2 and the S3 API. Has anyone encountered this and/or found a solution? Thanks

AWSBadRequestException: listStatus on s3a://indexed-xyz/ethereum/decoded/logs/v1.2.0/partition_key=ff/dt=2023: com.amazonaws.services.s3.model.AmazonS3Exception: MaxKeys params must be positive integer <= 1000.; request: GET https://indexed-xyz.ed5d915e0259fcddb2ab1ce5592040c3.r2.cloudflarestorage.com  {key=[ethereum/decoded/logs/v1.2.0/partition_key=ff/dt=2023/], key=[false], key=[5000], key=[2], key=[/]} Hadoop 3.3.4, aws-sdk-java/1.12.189 Linux/5.10.147+ OpenJDK_64-Bit_Server_VM/25.345-b01 java/1.8.0_345 scala/2.12.14 vendor/Azul_Systems,_Inc. cfg/retry-mode/legacy com.amazonaws.services.s3.model.ListObjectsV2Request; Request ID: null, Extended Request ID: null, Cloud Provider: GCP, Instance ID: unknown (Service: Amazon S3; Status Code: 400; Error Code: InvalidMaxKeys; Request ID: null; S3 Extended Request ID: null; Proxy: null), S3 Extended Request ID: null:InvalidMaxKeys: MaxKeys params must be positive integer <= 1000.; request: GET https://indexed-xyz.ed5d915e0259fcddb2ab1ce5592040c3.r2.cloudflarestorage.com  {key=[ethereum/decoded/logs/v1.2.0/partition_key=ff/dt=2023/], key=[false], key=[5000], key=[2], key=[/]} Hadoop 3.3.4, aws-sdk-java/1.12.189 Linux/5.10.147+ OpenJDK_64-Bit_Server_VM/25.345-b01 java/1.8.0_345 scala/2.12.14 vendor/Azul_Systems,_Inc. cfg/retry-mode/legacy com.amazonaws.services.s3.model.ListObjectsV2Request; Request ID: null, Extended Request ID: null, Cloud Provider: GCP, Instance ID: unknown (Service: Amazon S3; Status Code: 400; Error Code: InvalidMaxKeys; Request ID: null; S3 Extended Request ID: null; Proxy: null)
Caused by: AmazonS3Exception: MaxKeys params must be positive integer <= 1000.; request: GET https://indexed-xyz.ed5d915e0259fcddb2ab1ce5592040c3.r2.cloudflarestorage.com  {key=[ethereum/decoded/logs/v1.2.0/partition_key=ff/dt=2023/], key=[false], key=[5000], key=[2], key=[/]} Hadoop 3.3.4, aws-sdk-java/1.12.189 Linux/5.10.147+ OpenJDK_64-Bit_Server_VM/25.345-b01 java/1.8.0_345 scala/2.12.14 vendor/Azul_Systems,_Inc. cfg/retry-mode/legacy com.amazonaws.services.s3.model.ListObjectsV2Request; Request ID: null, Extended Request ID: null, Cloud Provider: GCP, Instance ID: unknown (Service: Amazon S3; Status Code: 400; Error Code: InvalidMaxKeys; Request ID: null; S3 Extended Request ID: null; Proxy: null)

I had a similar issue when using hadoop-aws. It appears to be caused by Hadoop setting the MaxKeys parameter to 5000 by default, while R2 only accepts values up to 1000. Overriding fs.s3a.paging.maximum in the configuration (core-site.xml) fixed it for me; a rough sketch of how to do the same on Databricks is below.
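
Not sure of your exact setup, but here are the two places I'd try setting it on a Databricks cluster. This is just a sketch: the property name comes from the hadoop-aws S3A options, the value 1000 matches R2's limit from your error, and the calls shown (spark.sparkContext._jsc.hadoopConfiguration() in a PySpark notebook, or a spark.hadoop.* line in the cluster Spark config) are the usual ways to push Hadoop settings into a running cluster, not something specific to R2.

    # In a notebook: set the property on the active Hadoop configuration
    # before reading the data. _jsc is PySpark's handle to the underlying
    # JavaSparkContext, whose hadoopConfiguration() is what s3a reads.
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3a.paging.maximum", "1000")  # R2 rejects MaxKeys > 1000

    # Or at cluster level: add this line to the cluster's Spark config so every
    # job picks it up (Spark forwards spark.hadoop.* keys into the Hadoop conf):
    # spark.hadoop.fs.s3a.paging.maximum 1000

After that, listStatus on the s3a:// path should issue ListObjectsV2 requests with MaxKeys=1000 and the 400 InvalidMaxKeys error should go away.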