Accessing Parquet Files via IAM Role

Set up cross-account IAM role access to download Haltian IoT Parquet files from S3

How It Works

Haltian stores data files in a private S3 bucket in Haltian’s AWS account. To let you read those files, you create an IAM role in your AWS account. You then share the role’s ARN (Amazon Resource Name) with Haltian. Haltian adds that ARN to the bucket’s access policy, and from that point your code can assume the role to list and download files.

Prerequisites

Tools:

  • AWS CLI — install from the AWS CLI install guide and configure it with aws configure or your SSO profile.

  • Python 3 — install from python.org (only needed for the Python examples).

  • Python packages — install in a virtual environment:

    python -m venv .venv
    source .venv/bin/activate        # On Windows: .venv\Scripts\activate
    pip install boto3 pandas pyarrow
    

Values provided by Haltian:

Value               Description
<BUCKET_NAME>       The name of the S3 bucket
<AWS_REGION>        The AWS region the bucket is in (e.g. eu-west-1)
{organizationId}    Your organisation UUID (e.g. 00000000-0000-0000-0000-000000000000)

Your AWS account details:

  • AWS account ID — a 12-digit number. Find it in the top-right corner of the AWS Console, or run:
    aws sts get-caller-identity --query Account --output text
    
  • IAM username — find it under Security credentials in the console, or run:
    aws sts get-caller-identity --query Arn --output text
    
    The username is the last segment of the ARN, e.g. arn:aws:iam::123456789012:user/jane.doe.
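
If you are scripting this, both values can be pulled out of the caller-identity ARN string with plain string operations; a quick sketch (the ARN below is the example from above):

```python
# Example ARN as returned by: aws sts get-caller-identity --query Arn --output text
arn = "arn:aws:iam::123456789012:user/jane.doe"

account_id = arn.split(":")[4]   # the 12-digit account ID is the 5th colon-separated field
username = arn.split("/")[-1]    # the username is the last segment after the slash

print(account_id, username)
```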

File Path Structure

Your files are organised in the bucket under two layouts.

Legacy unversioned layout

parquet/{organizationId}/<query_name>/<year>/<month>/<start>_<query_name>.parquet

Where <start> is a UTC timestamp at hour resolution (YYYY_MM_DD_HH). Example:

parquet/00000000-0000-0000-0000-000000000000/measurementBatteryVoltage/2025/02/2025_02_25_14_measurementBatteryVoltage.parquet

Versioned layout

parquet/v1/{organizationId}/<query_name>/<year>/<month>/<start>_<end>_<query_name>.parquet

Where <start> and <end> are UTC timestamps at minute resolution (YYYY_MM_DD_HHMM). Example:

parquet/v1/00000000-0000-0000-0000-000000000000/measurementBatteryVoltage/2025/02/2025_02_25_1400_2025_02_25_1500_measurementBatteryVoltage.parquet

Segment             Description
v1                  Schema version
{organizationId}    Your organisation UUID, provided by Haltian
<query_name>        The type of measurement or metadata (e.g. measurementBatteryVoltage)
<year>              4-digit UTC year of the export window
<month>             2-digit UTC month of the export window
<start>             UTC start timestamp of the export window
<end>               UTC end timestamp of the export window
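
For programmatic processing it can be useful to split a key back into these segments. A minimal sketch that handles both layouts (the function name is illustrative, and it assumes keys follow exactly the structures above):

```python
def parse_parquet_key(key):
    """Split an S3 key in either layout into its named segments."""
    parts = key.split("/")
    if parts[1] == "v1":
        # Versioned layout: parquet/v1/{org}/{query}/{year}/{month}/{file}
        _, version, org_id, query_name, year, month, filename = parts
    else:
        # Legacy layout: parquet/{org}/{query}/{year}/{month}/{file}
        version = None
        _, org_id, query_name, year, month, filename = parts
    return {
        "version": version,
        "organizationId": org_id,
        "query_name": query_name,
        "year": year,
        "month": month,
        "filename": filename,
    }

info = parse_parquet_key(
    "parquet/v1/00000000-0000-0000-0000-000000000000/measurementBatteryVoltage/2025/02/"
    "2025_02_25_1400_2025_02_25_1500_measurementBatteryVoltage.parquet"
)
print(info["query_name"], info["year"], info["month"])
```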

Step 1: Create an IAM Policy

Create a policy that grants read-only access to your organisation’s files. This covers both the versioned and legacy layouts.

Replace <BUCKET_NAME> and {organizationId} with the values provided by Haltian.

In the AWS Console:

  1. Go to IAM → Policies → Create policy.
  2. Switch to the JSON tab and paste the policy below.
  3. Click Next, name the policy (e.g. my-org-parquet-reader-policy), and click Create policy.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListParquetBucket",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": "arn:aws:s3:::<BUCKET_NAME>",
      "Condition": {
        "StringLike": {
          "s3:prefix": [
            "parquet/{organizationId}",
            "parquet/{organizationId}/*",
            "parquet/*/{organizationId}",
            "parquet/*/{organizationId}/*"
          ]
        }
      }
    },
    {
      "Sid": "GetParquetFiles",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::<BUCKET_NAME>/parquet/{organizationId}/*",
        "arn:aws:s3:::<BUCKET_NAME>/parquet/*/{organizationId}/*"
      ]
    }
  ]
}
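
If you automate the setup instead of using the console, the <BUCKET_NAME> and {organizationId} placeholders can be filled in programmatically before passing the document to the IAM API. A minimal sketch using plain string substitution on an abbreviated version of the policy (the bucket name and UUID are illustrative):

```python
import json

# Abbreviated policy template; the placeholders match the full policy above.
policy_template = """{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GetParquetFiles",
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::<BUCKET_NAME>/parquet/{organizationId}/*",
        "arn:aws:s3:::<BUCKET_NAME>/parquet/*/{organizationId}/*"
      ]
    }
  ]
}"""

policy = (policy_template
          .replace("<BUCKET_NAME>", "example-bucket")
          .replace("{organizationId}", "00000000-0000-0000-0000-000000000000"))

json.loads(policy)  # sanity-check that the substituted result is still valid JSON
print(policy)
```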

Terraform

resource "aws_iam_policy" "parquet_reader" {
  name        = "my-org-parquet-reader-policy"
  description = "Grants read-only access to Parquet files in S3"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "ListParquetBucket"
        Effect = "Allow"
        Action = ["s3:ListBucket"]
        Resource = "arn:aws:s3:::<BUCKET_NAME>"
        Condition = {
          StringLike = {
            "s3:prefix" = [
              "parquet/{organizationId}",
              "parquet/{organizationId}/*",
              "parquet/*/{organizationId}",
              "parquet/*/{organizationId}/*",
            ]
          }
        }
      },
      {
        Sid      = "GetParquetFiles"
        Effect   = "Allow"
        Action   = ["s3:GetObject"]
        Resource = [
          "arn:aws:s3:::<BUCKET_NAME>/parquet/{organizationId}/*",
          "arn:aws:s3:::<BUCKET_NAME>/parquet/*/{organizationId}/*",
        ]
      }
    ]
  })
}

Step 2: Create an IAM Role

Create a role, attach the policy from Step 1, and configure a trust policy that controls who can assume the role.

AWS Console

  1. Go to IAM → Roles → Create role.
  2. Under Trusted entity type, select AWS account.
  3. Select This account (or enter another account ID if applicable).
  4. Click Next.
  5. Find and select the policy you created in Step 1. Click Next.
  6. Name the role (e.g. my-org-parquet-reader) and click Create role.
  7. Open the role, click Trust relationships → Edit trust policy, and paste:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:user/jane.doe"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Replace 123456789012 with your AWS account ID and jane.doe with your IAM username.

Terraform

data "aws_iam_policy_document" "parquet_reader_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::123456789012:user/jane.doe"]
    }
  }
}

resource "aws_iam_role" "parquet_reader" {
  name               = "my-org-parquet-reader"
  assume_role_policy = data.aws_iam_policy_document.parquet_reader_trust.json
}

resource "aws_iam_role_policy_attachment" "parquet_reader" {
  role       = aws_iam_role.parquet_reader.name
  policy_arn = aws_iam_policy.parquet_reader.arn
}

Advanced: Kubernetes Workloads (EKS / IRSA)

IRSA (IAM Roles for Service Accounts) lets a Kubernetes workload on EKS assume an IAM role through OIDC federation, without static credentials. If your workloads run on EKS, give the role the trust policy below instead of the account-based trust policy from Step 2 (the Terraform resources in this section replace their Step 2 counterparts).

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/<YOUR_EKS_OIDC_PROVIDER>"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "<YOUR_EKS_OIDC_PROVIDER>:sub": "system:serviceaccount:<NAMESPACE>:<SERVICE_ACCOUNT_NAME>"
        }
      }
    }
  ]
}

Terraform (IRSA)

data "aws_eks_cluster" "this" {
  name = "<YOUR_EKS_CLUSTER_NAME>"
}

locals {
  oidc_provider = replace(data.aws_eks_cluster.this.identity[0].oidc[0].issuer, "https://", "")
}

data "aws_iam_policy_document" "parquet_reader_trust_irsa" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = ["arn:aws:iam::123456789012:oidc-provider/${local.oidc_provider}"]
    }

    condition {
      test     = "StringEquals"
      variable = "${local.oidc_provider}:sub"
      values   = ["system:serviceaccount:<NAMESPACE>:<SERVICE_ACCOUNT_NAME>"]
    }
  }
}

resource "aws_iam_role" "parquet_reader" {
  name               = "my-org-parquet-reader"
  assume_role_policy = data.aws_iam_policy_document.parquet_reader_trust_irsa.json
}

resource "aws_iam_role_policy_attachment" "parquet_reader" {
  role       = aws_iam_role.parquet_reader.name
  policy_arn = aws_iam_policy.parquet_reader.arn
}

Step 3: Share the Role ARN with Haltian

Even though your role has the right permissions on your side, Haltian must also add your role’s ARN to the bucket policy before access works.

Share your role ARN with your Haltian contact:

arn:aws:iam::123456789012:role/my-org-parquet-reader

Finding the ARN in the console:

  1. Go to IAM → Roles and click on the role name.
  2. The ARN is shown at the top of the summary page — click the copy icon.

Terraform output

output "parquet_reader_role_arn" {
  description = "IAM Role ARN to share with Haltian"
  value       = aws_iam_role.parquet_reader.arn
}

Step 4: Access Your Files

Once Haltian confirms that your role ARN has been added to the bucket policy, you can start accessing files.

Using the AWS CLI

# Assume the role and extract credentials into shell variables
read -r AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN <<< "$(aws sts assume-role \
  --role-arn "arn:aws:iam::123456789012:role/my-org-parquet-reader" \
  --role-session-name "parquet-access-session" \
  --query "Credentials.[AccessKeyId,SecretAccessKey,SessionToken]" \
  --output text)"

export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN

# List your files (versioned layout)
aws s3 ls s3://<BUCKET_NAME>/parquet/v1/{organizationId}/

# List your files (legacy unversioned layout)
aws s3 ls s3://<BUCKET_NAME>/parquet/{organizationId}/

# Download a specific file (versioned layout)
aws s3 cp \
  s3://<BUCKET_NAME>/parquet/v1/{organizationId}/measurementBatteryVoltage/2025/02/2025_02_25_1400_2025_02_25_1500_measurementBatteryVoltage.parquet \
  ./

# Download a specific file (legacy unversioned layout)
aws s3 cp \
  s3://<BUCKET_NAME>/parquet/{organizationId}/measurementBatteryVoltage/2025/02/2025_02_25_14_measurementBatteryVoltage.parquet \
  ./

Using Python (boto3) — SSO Profile

Make sure you are logged in first (see AWS SSO Setup):

aws sso login --profile <YOUR_SSO_PROFILE>

import boto3

session = boto3.Session(profile_name="<YOUR_SSO_PROFILE>", region_name="<AWS_REGION>")

sts_client = session.client("sts")
assumed_role = sts_client.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/my-org-parquet-reader",
    RoleSessionName="parquet-access-session",
)
credentials = assumed_role["Credentials"]

s3_client = session.client(
    "s3",
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)

# List files (versioned layout).
# Note: list_objects_v2 returns at most 1,000 keys per call; for larger
# listings, use s3_client.get_paginator("list_objects_v2").
response = s3_client.list_objects_v2(
    Bucket="<BUCKET_NAME>",
    Prefix="parquet/v1/{organizationId}/",
)

for obj in response.get("Contents", []):
    print(obj["Key"])

Using Python (boto3) — Without SSO

Use this approach when credentials are available from the environment, from an EC2 instance profile, ECS task role, or Lambda execution role, or from ~/.aws/credentials.

import boto3

sts_client = boto3.client("sts", region_name="<AWS_REGION>")
assumed_role = sts_client.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/my-org-parquet-reader",
    RoleSessionName="parquet-access-session",
)
credentials = assumed_role["Credentials"]

s3_client = boto3.client(
    "s3",
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
    region_name="<AWS_REGION>",
)

response = s3_client.list_objects_v2(
    Bucket="<BUCKET_NAME>",
    Prefix="parquet/v1/{organizationId}/",
)

for obj in response.get("Contents", []):
    print(obj["Key"])
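
Because the object keys encode the export window, you can filter a listing by time range without downloading anything. A minimal sketch for the versioned layout (the helper name and sample keys are illustrative):

```python
from datetime import datetime

TS_FMT = "%Y_%m_%d_%H%M"  # minute-resolution timestamps used in versioned filenames

def key_in_window(key, window_start, window_end):
    """Return True if a versioned-layout key's export window overlaps [window_start, window_end)."""
    filename = key.rsplit("/", 1)[-1]
    parts = filename.split("_")
    # Versioned filenames look like: YYYY_MM_DD_HHMM_YYYY_MM_DD_HHMM_<query_name>.parquet
    start = datetime.strptime("_".join(parts[0:4]), TS_FMT)
    end = datetime.strptime("_".join(parts[4:8]), TS_FMT)
    return start < window_end and end > window_start

keys = [
    "parquet/v1/00000000-0000-0000-0000-000000000000/measurementBatteryVoltage/2025/02/"
    "2025_02_25_1400_2025_02_25_1500_measurementBatteryVoltage.parquet",
    "parquet/v1/00000000-0000-0000-0000-000000000000/measurementBatteryVoltage/2025/02/"
    "2025_02_25_1500_2025_02_25_1600_measurementBatteryVoltage.parquet",
]

selected = [k for k in keys
            if key_in_window(k, datetime(2025, 2, 25, 14, 0), datetime(2025, 2, 25, 15, 0))]
print(selected)
```

In practice you would apply the filter to the keys collected from list_objects_v2 rather than a hard-coded list.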

Using pandas / PyArrow

Load a Parquet file directly into a DataFrame:

import boto3
import pandas as pd

# Log in first (see AWS SSO Setup): aws sso login --profile <YOUR_SSO_PROFILE>
session = boto3.Session(profile_name="<YOUR_SSO_PROFILE>", region_name="<AWS_REGION>")

sts_client = session.client("sts")
assumed_role = sts_client.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/my-org-parquet-reader",
    RoleSessionName="parquet-access-session",
)
credentials = assumed_role["Credentials"]

storage_options = {
    "key": credentials["AccessKeyId"],
    "secret": credentials["SecretAccessKey"],
    "token": credentials["SessionToken"],
}

# Versioned layout
s3_path = "s3://<BUCKET_NAME>/parquet/v1/{organizationId}/measurementBatteryVoltage/2025/02/2025_02_25_1400_2025_02_25_1500_measurementBatteryVoltage.parquet"

df = pd.read_parquet(s3_path, storage_options=storage_options)
print(df.head())

Security Best Practices

  • Use short-lived credentials. Role assumption gives you temporary credentials that expire automatically (typically after one hour). This is safer than long-lived access keys.
  • Scope the trust policy to a specific user. Restrict the trust policy to a named IAM user rather than the account root principal (:root), which would let any identity in your account assume the role.
  • Rotate long-lived access keys regularly. If using static access keys in ~/.aws/credentials, rotate them regularly in IAM → Users → Security credentials.
  • Enable audit logging. AWS CloudTrail records every API call including role assumptions and file downloads. Enable it under CloudTrail → Create trail.