Accessing Parquet Files via IAM Role

Set up cross-account IAM role access to download Haltian IoT Parquet files from S3

How It Works

Haltian stores data files in a private S3 bucket in Haltian’s AWS account. To let you read those files, you create an IAM role in your AWS account. You then share the role’s ARN (Amazon Resource Name) with Haltian. Haltian adds that ARN to the bucket’s access policy, and from that point your code can assume the role to list and download files.

Prerequisites

Tools:

  • AWS CLI — install from the AWS CLI install guide and configure it with aws configure or your SSO profile.

  • Python 3 — install from python.org (only needed for the Python examples).

  • Python packages — install in a virtual environment:

    python -m venv .venv
    source .venv/bin/activate        # On Windows: .venv\Scripts\activate
    pip install boto3 pandas pyarrow
    

Values provided by Haltian:

Value               Description
<BUCKET_NAME>       The name of the S3 bucket
<AWS_REGION>        The AWS region the bucket is in (e.g. eu-west-1)
{organizationId}    Your organisation UUID (e.g. 00000000-0000-0000-0000-000000000000)

Your AWS account details:

  • AWS account ID — a 12-digit number. Find it in the top-right corner of the AWS Console, or run:
    aws sts get-caller-identity --query Account --output text
    
  • IAM username — find it under Security credentials in the console, or run:
    aws sts get-caller-identity --query Arn --output text
    
    The username is the last segment of the ARN, e.g. arn:aws:iam::123456789012:user/jane.doe.
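
If you are scripting this, both values can be pulled out of the caller-identity ARN string with plain string operations; a quick sketch (the ARN below is the example from above):

```python
# Example ARN as returned by: aws sts get-caller-identity --query Arn --output text
arn = "arn:aws:iam::123456789012:user/jane.doe"

account_id = arn.split(":")[4]   # the 12-digit account ID is the 5th colon-separated field
username = arn.split("/")[-1]    # the username is the last segment after the slash

print(account_id, username)
```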

File Path Structure

Your files are organised in the bucket under two layouts.

Legacy unversioned layout

parquet/{organizationId}/<query_name>/<year>/<month>/<start>_<query_name>.parquet

Where <start> is a UTC timestamp at hour resolution (YYYY_MM_DD_HH). Example:

parquet/00000000-0000-0000-0000-000000000000/measurementBatteryVoltage/2025/02/2025_02_25_14_measurementBatteryVoltage.parquet

Versioned layout

parquet/v1/{organizationId}/<query_name>/<year>/<month>/<start>_<end>_<query_name>.parquet

Where <start> and <end> are UTC timestamps at minute resolution (YYYY_MM_DD_HHMM). Example:

parquet/v1/00000000-0000-0000-0000-000000000000/measurementBatteryVoltage/2025/02/2025_02_25_1400_2025_02_25_1500_measurementBatteryVoltage.parquet

Segment             Description
v1                  Schema version
{organizationId}    Your organisation UUID, provided by Haltian
<query_name>        The type of measurement or metadata (e.g. measurementBatteryVoltage)
<year>              4-digit UTC year of the export window
<month>             2-digit UTC month of the export window
<start>             UTC start timestamp of the export window
<end>               UTC end timestamp of the export window
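
For programmatic processing it can be useful to split a key back into these segments. A minimal sketch that handles both layouts (the function name is illustrative, and it assumes keys follow exactly the structures above):

```python
def parse_parquet_key(key):
    """Split an S3 key in either layout into its named segments."""
    parts = key.split("/")
    if parts[1] == "v1":
        # Versioned layout: parquet/v1/{org}/{query}/{year}/{month}/{file}
        _, version, org_id, query_name, year, month, filename = parts
    else:
        # Legacy layout: parquet/{org}/{query}/{year}/{month}/{file}
        version = None
        _, org_id, query_name, year, month, filename = parts
    return {
        "version": version,
        "organizationId": org_id,
        "query_name": query_name,
        "year": year,
        "month": month,
        "filename": filename,
    }

info = parse_parquet_key(
    "parquet/v1/00000000-0000-0000-0000-000000000000/measurementBatteryVoltage/2025/02/"
    "2025_02_25_1400_2025_02_25_1500_measurementBatteryVoltage.parquet"
)
print(info["query_name"], info["year"], info["month"])
```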

Step 1: Create an IAM Policy

Create a policy that grants read-only access to your organisation’s files. This covers both the versioned and legacy layouts.

Replace <BUCKET_NAME> and {organizationId} with the values provided by Haltian.

In the AWS Console:

  1. Go to IAM → Policies → Create policy.
  2. Switch to the JSON tab and paste the policy below.
  3. Click Next, name the policy (e.g. my-org-parquet-reader-policy), and click Create policy.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListParquetBucket",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": "arn:aws:s3:::<BUCKET_NAME>",
      "Condition": {
        "StringLike": {
          "s3:prefix": [
            "parquet/{organizationId}",
            "parquet/{organizationId}/*",
            "parquet/*/{organizationId}",
            "parquet/*/{organizationId}/*"
          ]
        }
      }
    },
    {
      "Sid": "GetParquetFiles",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::<BUCKET_NAME>/parquet/{organizationId}/*",
        "arn:aws:s3:::<BUCKET_NAME>/parquet/*/{organizationId}/*"
      ]
    }
  ]
}
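
If you automate the setup instead of using the console, the <BUCKET_NAME> and {organizationId} placeholders can be filled in programmatically before passing the document to the IAM API. A minimal sketch using plain string substitution on an abbreviated version of the policy (the bucket name and UUID are illustrative):

```python
import json

# Abbreviated policy template; the placeholders match the full policy above.
policy_template = """{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GetParquetFiles",
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::<BUCKET_NAME>/parquet/{organizationId}/*",
        "arn:aws:s3:::<BUCKET_NAME>/parquet/*/{organizationId}/*"
      ]
    }
  ]
}"""

policy = (policy_template
          .replace("<BUCKET_NAME>", "example-bucket")
          .replace("{organizationId}", "00000000-0000-0000-0000-000000000000"))

json.loads(policy)  # sanity-check that the substituted result is still valid JSON
print(policy)
```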

Terraform

resource "aws_iam_policy" "parquet_reader" {
  name        = "my-org-parquet-reader-policy"
  description = "Grants read-only access to Parquet files in S3"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "ListParquetBucket"
        Effect = "Allow"
        Action = ["s3:ListBucket"]
        Resource = "arn:aws:s3:::<BUCKET_NAME>"
        Condition = {
          StringLike = {
            "s3:prefix" = [
              "parquet/{organizationId}",
              "parquet/{organizationId}/*",
              "parquet/*/{organizationId}",
              "parquet/*/{organizationId}/*",
            ]
          }
        }
      },
      {
        Sid      = "GetParquetFiles"
        Effect   = "Allow"
        Action   = ["s3:GetObject"]
        Resource = [
          "arn:aws:s3:::<BUCKET_NAME>/parquet/{organizationId}/*",
          "arn:aws:s3:::<BUCKET_NAME>/parquet/*/{organizationId}/*",
        ]
      }
    ]
  })
}

Step 2: Create an IAM Role

Create a role, attach the policy from Step 1, and configure a trust policy that controls who can assume the role.

AWS Console

  1. Go to IAM → Roles → Create role.
  2. Under Trusted entity type, select AWS account.
  3. Select This account (or enter another account ID if applicable).
  4. Click Next.
  5. Find and select the policy you created in Step 1. Click Next.
  6. Name the role (e.g. my-org-parquet-reader) and click Create role.
  7. Open the role, click Trust relationships → Edit trust policy, and paste:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:user/jane.doe"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Replace 123456789012 with your AWS account ID and jane.doe with your IAM username.

Terraform

data "aws_iam_policy_document" "parquet_reader_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::123456789012:user/jane.doe"]
    }
  }
}

resource "aws_iam_role" "parquet_reader" {
  name               = "my-org-parquet-reader"
  assume_role_policy = data.aws_iam_policy_document.parquet_reader_trust.json
}

resource "aws_iam_role_policy_attachment" "parquet_reader" {
  role       = aws_iam_role.parquet_reader.name
  policy_arn = aws_iam_policy.parquet_reader.arn
}

Advanced: Kubernetes Workloads (EKS / IRSA)

IRSA (IAM Roles for Service Accounts) lets a Kubernetes workload on EKS assume an IAM role through OIDC federation, without static credentials. If your workloads run on EKS, give the role the trust policy below instead of the account-based trust policy from Step 2 (the Terraform resources in this section replace their Step 2 counterparts).

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/<YOUR_EKS_OIDC_PROVIDER>"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "<YOUR_EKS_OIDC_PROVIDER>:sub": "system:serviceaccount:<NAMESPACE>:<SERVICE_ACCOUNT_NAME>"
        }
      }
    }
  ]
}

Terraform (IRSA)

data "aws_eks_cluster" "this" {
  name = "<YOUR_EKS_CLUSTER_NAME>"
}

locals {
  oidc_provider = replace(data.aws_eks_cluster.this.identity[0].oidc[0].issuer, "https://", "")
}

data "aws_iam_policy_document" "parquet_reader_trust_irsa" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = ["arn:aws:iam::123456789012:oidc-provider/${local.oidc_provider}"]
    }

    condition {
      test     = "StringEquals"
      variable = "${local.oidc_provider}:sub"
      values   = ["system:serviceaccount:<NAMESPACE>:<SERVICE_ACCOUNT_NAME>"]
    }
  }
}

resource "aws_iam_role" "parquet_reader" {
  name               = "my-org-parquet-reader"
  assume_role_policy = data.aws_iam_policy_document.parquet_reader_trust_irsa.json
}

resource "aws_iam_role_policy_attachment" "parquet_reader" {
  role       = aws_iam_role.parquet_reader.name
  policy_arn = aws_iam_policy.parquet_reader.arn
}

Step 3: Share the Role ARN with Haltian

Even though your role has the right permissions on your side, Haltian must also add your role’s ARN to the bucket policy before access works.

Share your role ARN with your Haltian contact:

arn:aws:iam::123456789012:role/my-org-parquet-reader

Finding the ARN in the console:

  1. Go to IAM → Roles and click on the role name.
  2. The ARN is shown at the top of the summary page — click the copy icon.

Terraform output

output "parquet_reader_role_arn" {
  description = "IAM Role ARN to share with Haltian"
  value       = aws_iam_role.parquet_reader.arn
}

Step 4: Access Your Files

Once Haltian confirms that your role ARN has been added to the bucket policy, you can start accessing files.

Using the AWS CLI

# Assume the role and extract credentials into shell variables
read -r AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN <<< "$(aws sts assume-role \
  --role-arn "arn:aws:iam::123456789012:role/my-org-parquet-reader" \
  --role-session-name "parquet-access-session" \
  --query "Credentials.[AccessKeyId,SecretAccessKey,SessionToken]" \
  --output text)"

export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN

# List your files (versioned layout)
aws s3 ls s3://<BUCKET_NAME>/parquet/v1/{organizationId}/

# List your files (legacy unversioned layout)
aws s3 ls s3://<BUCKET_NAME>/parquet/{organizationId}/

# Download a specific file (versioned layout)
aws s3 cp \
  s3://<BUCKET_NAME>/parquet/v1/{organizationId}/measurementBatteryVoltage/2025/02/2025_02_25_1400_2025_02_25_1500_measurementBatteryVoltage.parquet \
  ./

# Download a specific file (legacy unversioned layout)
aws s3 cp \
  s3://<BUCKET_NAME>/parquet/{organizationId}/measurementBatteryVoltage/2025/02/2025_02_25_14_measurementBatteryVoltage.parquet \
  ./

Using Python (boto3) — SSO Profile

Make sure you are logged in first (see AWS SSO Setup):

aws sso login --profile <YOUR_SSO_PROFILE>

import boto3

session = boto3.Session(profile_name="<YOUR_SSO_PROFILE>", region_name="<AWS_REGION>")

sts_client = session.client("sts")
assumed_role = sts_client.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/my-org-parquet-reader",
    RoleSessionName="parquet-access-session",
)
credentials = assumed_role["Credentials"]

s3_client = session.client(
    "s3",
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)

# List files (versioned layout).
# Note: list_objects_v2 returns at most 1,000 keys per call; for larger
# listings, use s3_client.get_paginator("list_objects_v2").
response = s3_client.list_objects_v2(
    Bucket="<BUCKET_NAME>",
    Prefix="parquet/v1/{organizationId}/",
)

for obj in response.get("Contents", []):
    print(obj["Key"])

Using Python (boto3) — Without SSO

Use this approach when credentials are available from the environment, from an EC2 instance profile, ECS task role, or Lambda execution role, or from ~/.aws/credentials.

import boto3

sts_client = boto3.client("sts", region_name="<AWS_REGION>")
assumed_role = sts_client.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/my-org-parquet-reader",
    RoleSessionName="parquet-access-session",
)
credentials = assumed_role["Credentials"]

s3_client = boto3.client(
    "s3",
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
    region_name="<AWS_REGION>",
)

response = s3_client.list_objects_v2(
    Bucket="<BUCKET_NAME>",
    Prefix="parquet/v1/{organizationId}/",
)

for obj in response.get("Contents", []):
    print(obj["Key"])
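
Because the object keys encode the export window, you can filter a listing by time range without downloading anything. A minimal sketch for the versioned layout (the helper name and sample keys are illustrative):

```python
from datetime import datetime

TS_FMT = "%Y_%m_%d_%H%M"  # minute-resolution timestamps used in versioned filenames

def key_in_window(key, window_start, window_end):
    """Return True if a versioned-layout key's export window overlaps [window_start, window_end)."""
    filename = key.rsplit("/", 1)[-1]
    parts = filename.split("_")
    # Versioned filenames look like: YYYY_MM_DD_HHMM_YYYY_MM_DD_HHMM_<query_name>.parquet
    start = datetime.strptime("_".join(parts[0:4]), TS_FMT)
    end = datetime.strptime("_".join(parts[4:8]), TS_FMT)
    return start < window_end and end > window_start

keys = [
    "parquet/v1/00000000-0000-0000-0000-000000000000/measurementBatteryVoltage/2025/02/"
    "2025_02_25_1400_2025_02_25_1500_measurementBatteryVoltage.parquet",
    "parquet/v1/00000000-0000-0000-0000-000000000000/measurementBatteryVoltage/2025/02/"
    "2025_02_25_1500_2025_02_25_1600_measurementBatteryVoltage.parquet",
]

selected = [k for k in keys
            if key_in_window(k, datetime(2025, 2, 25, 14, 0), datetime(2025, 2, 25, 15, 0))]
print(selected)
```

In practice you would apply the filter to the keys collected from list_objects_v2 rather than a hard-coded list.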

Using pandas / PyArrow

Load a Parquet file directly into a DataFrame:

import boto3
import pandas as pd

# Log in first (see AWS SSO Setup): aws sso login --profile <YOUR_SSO_PROFILE>
session = boto3.Session(profile_name="<YOUR_SSO_PROFILE>", region_name="<AWS_REGION>")

sts_client = session.client("sts")
assumed_role = sts_client.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/my-org-parquet-reader",
    RoleSessionName="parquet-access-session",
)
credentials = assumed_role["Credentials"]

storage_options = {
    "key": credentials["AccessKeyId"],
    "secret": credentials["SecretAccessKey"],
    "token": credentials["SessionToken"],
}

# Versioned layout
s3_path = "s3://<BUCKET_NAME>/parquet/v1/{organizationId}/measurementBatteryVoltage/2025/02/2025_02_25_1400_2025_02_25_1500_measurementBatteryVoltage.parquet"

df = pd.read_parquet(s3_path, storage_options=storage_options)
print(df.head())

Security Best Practices

  • Use short-lived credentials. Role assumption gives you temporary credentials that expire automatically (typically after one hour). This is safer than long-lived access keys.
  • Scope the trust policy to a specific user. Restrict the trust policy to a named IAM user rather than the account root principal (:root), which would let any identity in your account assume the role.
  • Rotate long-lived access keys regularly. If using static access keys in ~/.aws/credentials, rotate them regularly in IAM → Users → Security credentials.
  • Enable audit logging. AWS CloudTrail records every API call including role assumptions and file downloads. Enable it under CloudTrail → Create trail.