Accessing Parquet Files via IAM Role
This page is in alpha status. The content may change without notice.
Use this guide if you need programmatic or automated access from your own AWS account (e.g., scheduled pipelines, Kubernetes workloads, or custom applications). If you access data interactively using AWS SSO, see AWS SSO Setup and Download Parquet (AWS CLI) instead.
How It Works
Haltian stores data files in a private S3 bucket in Haltian’s AWS account. To let you read those files, you create an IAM role in your AWS account. You then share the role’s ARN (Amazon Resource Name) with Haltian. Haltian adds that ARN to the bucket’s access policy, and from that point your code can assume the role to list and download files.
Prerequisites
Tools:
- AWS CLI — install from the AWS CLI install guide and configure it with aws configure or your SSO profile.
- Python 3 — install from python.org (only needed for the Python examples).
- Python packages — install in a virtual environment:
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install boto3 pandas pyarrow
Values provided by Haltian:
| Value | Description |
|---|---|
| <BUCKET_NAME> | The name of the S3 bucket |
| <AWS_REGION> | The AWS region the bucket is in (e.g. eu-west-1) |
| {organizationId} | Your organisation UUID (e.g. 00000000-0000-0000-0000-000000000000) |
Your AWS account details:
- AWS account ID — a 12-digit number. Find it in the top-right corner of the AWS Console, or run:
aws sts get-caller-identity --query Account --output text
- IAM username — find it under Security credentials in the console, or run:
aws sts get-caller-identity --query Arn --output text
The username is the last segment of the ARN, e.g. arn:aws:iam::123456789012:user/jane.doe.
File Path Structure
Your files are organised in the bucket under two layouts.
Legacy unversioned layout
This layout is being phased out in favor of the versioned layout below.
parquet/{organizationId}/<query_name>/<year>/<month>/<start>_<query_name>.parquet
Where <start> is a UTC timestamp at hour resolution (YYYY_MM_DD_HH). Example:
parquet/00000000-0000-0000-0000-000000000000/measurementBatteryVoltage/2025/02/2025_02_25_14_measurementBatteryVoltage.parquet
Versioned layout
parquet/v1/{organizationId}/<query_name>/<year>/<month>/<start>_<end>_<query_name>.parquet
Where <start> and <end> are UTC timestamps at minute resolution (YYYY_MM_DD_HHMM). Example:
parquet/v1/00000000-0000-0000-0000-000000000000/measurementBatteryVoltage/2025/02/2025_02_25_1400_2025_02_25_1500_measurementBatteryVoltage.parquet
| Segment | Description |
|---|---|
| v1 | Schema version |
| {organizationId} | Your organisation UUID, provided by Haltian |
| <query_name> | The type of measurement or metadata (e.g. measurementBatteryVoltage) |
| <year> | 4-digit UTC year of the export window |
| <month> | 2-digit UTC month of the export window |
| <start> | UTC start timestamp of the export window |
| <end> | UTC end timestamp of the export window |
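If you need to work with these keys programmatically, both layouts can be recognised with a single pattern. A minimal sketch — the regex and the parse_key name are illustrative, not part of any Haltian tooling:

```python
import re

# Matches both the versioned and legacy layouts; the named groups
# mirror the segment table above.
KEY_RE = re.compile(
    r"^parquet/(?:(?P<version>v\d+)/)?"
    r"(?P<org>[0-9a-f-]{36})/"
    r"(?P<query>[^/]+)/"
    r"(?P<year>\d{4})/(?P<month>\d{2})/"
    r"(?P<stamps>.+)_(?P=query)\.parquet$"
)

def parse_key(key):
    """Return the key's segments as a dict, or None if it doesn't match."""
    m = KEY_RE.match(key)
    return m.groupdict() if m else None
```

For a legacy key, the version group is None; for a versioned key it is "v1".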
Step 1: Create an IAM Policy
Create a policy that grants read-only access to your organisation’s files. This covers both the versioned and legacy layouts.
Replace <BUCKET_NAME> and {organizationId} with the values provided by Haltian.
In the AWS Console:
- Go to IAM → Policies → Create policy.
- Switch to the JSON tab and paste the policy below.
- Click Next, name the policy (e.g. my-org-parquet-reader-policy), and click Create policy.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "ListParquetBucket",
"Effect": "Allow",
"Action": [
"s3:ListBucket"
],
"Resource": "arn:aws:s3:::<BUCKET_NAME>",
"Condition": {
"StringLike": {
"s3:prefix": [
"parquet/{organizationId}",
"parquet/{organizationId}/*",
"parquet/*/{organizationId}",
"parquet/*/{organizationId}/*"
]
}
}
},
{
"Sid": "GetParquetFiles",
"Effect": "Allow",
"Action": [
"s3:GetObject"
],
"Resource": [
"arn:aws:s3:::<BUCKET_NAME>/parquet/{organizationId}/*",
"arn:aws:s3:::<BUCKET_NAME>/parquet/*/{organizationId}/*"
]
}
]
}
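To sanity-check which list prefixes the ListParquetBucket condition admits, Python's fnmatch module gives a rough approximation of IAM's StringLike wildcards (in both, * matches any characters, including slashes). A sketch using the placeholder organisation UUID — this only approximates IAM's evaluation and is not a substitute for testing real requests:

```python
from fnmatch import fnmatchcase

ORG = "00000000-0000-0000-0000-000000000000"  # placeholder organisation UUID

# The four s3:prefix patterns from the ListParquetBucket condition.
PATTERNS = [
    f"parquet/{ORG}",
    f"parquet/{ORG}/*",
    f"parquet/*/{ORG}",
    f"parquet/*/{ORG}/*",
]

def prefix_allowed(prefix):
    """Rough check: does any StringLike pattern match this list prefix?"""
    return any(fnmatchcase(prefix, p) for p in PATTERNS)
```

Both the versioned prefix (parquet/v1/{organizationId}/) and the legacy prefix (parquet/{organizationId}/) match; a prefix for a different organisation UUID does not.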
Terraform
resource "aws_iam_policy" "parquet_reader" {
name = "my-org-parquet-reader-policy"
description = "Grants read-only access to Parquet files in S3"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "ListParquetBucket"
Effect = "Allow"
Action = ["s3:ListBucket"]
Resource = "arn:aws:s3:::<BUCKET_NAME>"
Condition = {
StringLike = {
"s3:prefix" = [
"parquet/{organizationId}",
"parquet/{organizationId}/*",
"parquet/*/{organizationId}",
"parquet/*/{organizationId}/*",
]
}
}
},
{
Sid = "GetParquetFiles"
Effect = "Allow"
Action = ["s3:GetObject"]
Resource = [
"arn:aws:s3:::<BUCKET_NAME>/parquet/{organizationId}/*",
"arn:aws:s3:::<BUCKET_NAME>/parquet/*/{organizationId}/*",
]
}
]
})
}
Step 2: Create an IAM Role
Create a role, attach the policy from Step 1, and configure a trust policy that controls who can assume the role.
AWS Console
- Go to IAM → Roles → Create role.
- Under Trusted entity type, select AWS account.
- Select This account (or enter another account ID if applicable).
- Click Next.
- Find and select the policy you created in Step 1. Click Next.
- Name the role (e.g. my-org-parquet-reader) and click Create role.
- Open the role, click Trust relationships → Edit trust policy, and paste:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789012:user/jane.doe"
},
"Action": "sts:AssumeRole"
}
]
}
Replace 123456789012 with your AWS account ID and jane.doe with your IAM username.
Avoid using arn:aws:iam::123456789012:root as the principal — it grants every user and role in your AWS account the ability to assume this role. Scope it to a specific IAM user.
Terraform
data "aws_iam_policy_document" "parquet_reader_trust" {
statement {
effect = "Allow"
actions = ["sts:AssumeRole"]
principals {
type = "AWS"
identifiers = ["arn:aws:iam::123456789012:user/jane.doe"]
}
}
}
resource "aws_iam_role" "parquet_reader" {
name = "my-org-parquet-reader"
assume_role_policy = data.aws_iam_policy_document.parquet_reader_trust.json
}
resource "aws_iam_role_policy_attachment" "parquet_reader" {
role = aws_iam_role.parquet_reader.name
policy_arn = aws_iam_policy.parquet_reader.arn
}
Advanced: Kubernetes Workloads (EKS / IRSA)
This section only applies if you are setting up access for a workload running inside an EKS cluster. For personal or script-based access, skip to Step 3.
IRSA (IAM Roles for Service Accounts) lets a Kubernetes workload assume an IAM role via OIDC federation, without storing static credentials. Use a trust policy like the following in place of the one from Step 2, replacing <YOUR_EKS_OIDC_PROVIDER>, <NAMESPACE>, and <SERVICE_ACCOUNT_NAME> with your cluster's values:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::123456789012:oidc-provider/<YOUR_EKS_OIDC_PROVIDER>"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"<YOUR_EKS_OIDC_PROVIDER>:sub": "system:serviceaccount:<NAMESPACE>:<SERVICE_ACCOUNT_NAME>"
}
}
}
]
}
Terraform (IRSA)
data "aws_eks_cluster" "this" {
name = "<YOUR_EKS_CLUSTER_NAME>"
}
locals {
oidc_provider = replace(data.aws_eks_cluster.this.identity[0].oidc[0].issuer, "https://", "")
}
data "aws_iam_policy_document" "parquet_reader_trust_irsa" {
statement {
effect = "Allow"
actions = ["sts:AssumeRoleWithWebIdentity"]
principals {
type = "Federated"
identifiers = ["arn:aws:iam::123456789012:oidc-provider/${local.oidc_provider}"]
}
condition {
test = "StringEquals"
variable = "${local.oidc_provider}:sub"
values = ["system:serviceaccount:<NAMESPACE>:<SERVICE_ACCOUNT_NAME>"]
}
}
}
resource "aws_iam_role" "parquet_reader" {
name = "my-org-parquet-reader"
assume_role_policy = data.aws_iam_policy_document.parquet_reader_trust_irsa.json
}
resource "aws_iam_role_policy_attachment" "parquet_reader" {
role = aws_iam_role.parquet_reader.name
policy_arn = aws_iam_policy.parquet_reader.arn
}
Step 3: Share the Role ARN with Haltian
Even though your role has the right permissions on your side, Haltian must also add your role’s ARN to the bucket policy before access works.
Share your role ARN with your Haltian contact:
arn:aws:iam::123456789012:role/my-org-parquet-reader
Finding the ARN in the console:
- Go to IAM → Roles and click on the role name.
- The ARN is shown at the top of the summary page — click the copy icon.
Terraform output
output "parquet_reader_role_arn" {
description = "IAM Role ARN to share with Haltian"
value = aws_iam_role.parquet_reader.arn
}
Step 4: Access Your Files
Once Haltian confirms that your role ARN has been added to the bucket policy, you can start accessing files.
Using the AWS CLI
# Assume the role and extract credentials into shell variables
read AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN <<< $(aws sts assume-role \
--role-arn "arn:aws:iam::123456789012:role/my-org-parquet-reader" \
--role-session-name "parquet-access-session" \
--query "Credentials.[AccessKeyId,SecretAccessKey,SessionToken]" \
--output text)
export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN
# List your files (versioned layout)
aws s3 ls s3://<BUCKET_NAME>/parquet/v1/{organizationId}/
# List your files (legacy unversioned layout)
aws s3 ls s3://<BUCKET_NAME>/parquet/{organizationId}/
# Download a specific file (versioned layout)
aws s3 cp \
s3://<BUCKET_NAME>/parquet/v1/{organizationId}/measurementBatteryVoltage/2025/02/2025_02_25_1400_2025_02_25_1500_measurementBatteryVoltage.parquet \
./
# Download a specific file (legacy unversioned layout)
aws s3 cp \
s3://<BUCKET_NAME>/parquet/{organizationId}/measurementBatteryVoltage/2025/02/2025_02_25_14_measurementBatteryVoltage.parquet \
./
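Going the other way, the object key for a given export window can be assembled from its parts, which is handy in scheduled pipelines. A small helper, assuming the versioned layout and minute-resolution UTC timestamps described above (the function name is illustrative):

```python
from datetime import datetime, timezone

def versioned_key(org_id, query_name, start, end):
    """Build a versioned-layout object key from a UTC export window."""
    fmt = "%Y_%m_%d_%H%M"  # minute resolution, as in the versioned layout
    return (
        f"parquet/v1/{org_id}/{query_name}/{start:%Y}/{start:%m}/"
        f"{start.strftime(fmt)}_{end.strftime(fmt)}_{query_name}.parquet"
    )

# Reproduces the example key from the "Versioned layout" section.
key = versioned_key(
    "00000000-0000-0000-0000-000000000000",
    "measurementBatteryVoltage",
    datetime(2025, 2, 25, 14, 0, tzinfo=timezone.utc),
    datetime(2025, 2, 25, 15, 0, tzinfo=timezone.utc),
)
```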
Using Python (boto3) — SSO Profile
Make sure you are logged in first (see AWS SSO Setup):
aws sso login --profile <YOUR_SSO_PROFILE>
import boto3
session = boto3.Session(profile_name="<YOUR_SSO_PROFILE>", region_name="<AWS_REGION>")
sts_client = session.client("sts")
assumed_role = sts_client.assume_role(
RoleArn="arn:aws:iam::123456789012:role/my-org-parquet-reader",
RoleSessionName="parquet-access-session",
)
credentials = assumed_role["Credentials"]
s3_client = session.client(
"s3",
aws_access_key_id=credentials["AccessKeyId"],
aws_secret_access_key=credentials["SecretAccessKey"],
aws_session_token=credentials["SessionToken"],
)
# List files (versioned layout)
response = s3_client.list_objects_v2(
Bucket="<BUCKET_NAME>",
Prefix="parquet/v1/{organizationId}/",
)
for obj in response.get("Contents", []):
print(obj["Key"])
If the SSO session has expired you will get an SSOTokenLoadError or UnauthorizedSSOTokenError. See AWS SSO Setup — Troubleshooting for details.
Using Python (boto3) — Without SSO
Use this approach when credentials are available from the environment, an EC2/ECS/Lambda instance role, or ~/.aws/credentials.
import boto3
sts_client = boto3.client("sts", region_name="<AWS_REGION>")
assumed_role = sts_client.assume_role(
RoleArn="arn:aws:iam::123456789012:role/my-org-parquet-reader",
RoleSessionName="parquet-access-session",
)
credentials = assumed_role["Credentials"]
s3_client = boto3.client(
"s3",
aws_access_key_id=credentials["AccessKeyId"],
aws_secret_access_key=credentials["SecretAccessKey"],
aws_session_token=credentials["SessionToken"],
region_name="<AWS_REGION>",
)
response = s3_client.list_objects_v2(
Bucket="<BUCKET_NAME>",
Prefix="parquet/v1/{organizationId}/",
)
for obj in response.get("Contents", []):
print(obj["Key"])
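Note that list_objects_v2 returns at most 1,000 keys per call. If your organisation has more files under a prefix, use a paginator; a sketch (the helper name is illustrative):

```python
def list_parquet_keys(s3_client, bucket, prefix):
    """Yield every object key under prefix, following S3's 1,000-key pages."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

Call it with the s3_client created above, e.g. list_parquet_keys(s3_client, "<BUCKET_NAME>", "parquet/v1/{organizationId}/").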
Using pandas / PyArrow
Load a Parquet file directly into a DataFrame:
import boto3
import pandas as pd
# Log in first (see AWS SSO Setup): aws sso login --profile <YOUR_SSO_PROFILE>
session = boto3.Session(profile_name="<YOUR_SSO_PROFILE>", region_name="<AWS_REGION>")
sts_client = session.client("sts")
assumed_role = sts_client.assume_role(
RoleArn="arn:aws:iam::123456789012:role/my-org-parquet-reader",
RoleSessionName="parquet-access-session",
)
credentials = assumed_role["Credentials"]
storage_options = {
"key": credentials["AccessKeyId"],
"secret": credentials["SecretAccessKey"],
"token": credentials["SessionToken"],
}
# Versioned layout
s3_path = "s3://<BUCKET_NAME>/parquet/v1/{organizationId}/measurementBatteryVoltage/2025/02/2025_02_25_1400_2025_02_25_1500_measurementBatteryVoltage.parquet"
df = pd.read_parquet(s3_path, storage_options=storage_options)
print(df.head())
Security Best Practices
- Use short-lived credentials. Role assumption gives you temporary credentials that expire automatically (typically after one hour), which is safer than long-lived access keys.
- Scope the trust policy to a specific user. Restrict the trust policy to a named IAM user rather than :root.
- Rotate long-lived access keys regularly. If you use static access keys in ~/.aws/credentials, rotate them in IAM → Users → Security credentials.
- Enable audit logging. AWS CloudTrail records every API call, including role assumptions and file downloads. Enable it under CloudTrail → Create trail.