Downloading Parquet files with AWS CLI

Browse and download Haltian IoT Parquet files from S3 using the AWS CLI

ALPHA

This page is in alpha status. The content may change without notice.

Works on Windows, macOS, and Linux

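The aws commands themselves are the same on all platforms. Install the CLI from the AWS CLI install guide. The examples on this page use Bash syntax (variable assignment, \ line continuation, for loops). On Windows, run them in a Bash environment such as WSL or Git Bash, or adapt the variable and line-continuation syntax for PowerShell or Command Prompt.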

What You Need from Haltian

Before you start, contact Haltian to receive the following three values. Haltian provides all of them; you cannot look them up yourself.

Value	Example	Description
Bucket name	`haltian-parquet-exports-prod`	The S3 bucket where your data is stored
Organisation ID	`29e95a47-c992-497a-b78e-072d70aa67a7`	Your organisation UUID
Access method	Access Key, IAM Role, or SSO	How you authenticate (see below)

Access Methods

Haltian provides one of the following access methods depending on your setup:

Method	What Haltian Provides	Setup
Static Access Key	AWS Access Key ID + Secret Access Key	Configure with `aws configure` (see below)
IAM Role	Bucket policy update for your role ARN	IAM Role Access guide — access from your own AWS account
AWS SSO	SSO start URL + role assignment	AWS SSO Setup guide — browser-based login via Microsoft Entra ID

Option A: Static Access Key

If Haltian provided you with an Access Key ID and Secret Access Key, configure them with:

aws configure --profile haltian-parquet

When prompted, enter:

Prompt	Value
AWS Access Key ID	The access key provided by Haltian
AWS Secret Access Key	The secret key provided by Haltian
Default region name	`eu-west-1`
Default output format	`json`

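Then use this profile in all commands below: either set PROFILE="haltian-parquet" when you define the session variables in section 2, or pass --profile haltian-parquet directly instead of --profile parquet-access-orgname.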
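
Before moving on, it can help to confirm that the new profile actually authenticates. The following is a minimal sketch, not part of Haltian's instructions: `check_profile` is a hypothetical helper name, and it relies on `aws sts get-caller-identity`, which normally succeeds for any valid credentials regardless of S3 permissions.

```shell
# Sketch: confirm that a named profile resolves to valid credentials.
# check_profile is a hypothetical helper name, not part of the AWS CLI.
check_profile() {
  profile="$1"
  if aws sts get-caller-identity --profile "$profile" >/dev/null 2>&1; then
    echo "Profile ${profile}: credentials OK"
  else
    echo "Profile ${profile}: could not authenticate" >&2
    return 1
  fi
}

# Example call (requires network access and configured credentials):
#   check_profile haltian-parquet
```

A successful check only shows that the keys are valid; access to the bucket itself is verified by the listing commands in section 2.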

Option B: IAM Role

If you have your own AWS account, you can create an IAM role and share its ARN with Haltian. See the IAM Role Access guide for setup instructions.

Option C: AWS SSO

Complete the AWS SSO Setup first — you need a working SSO profile and an active session before running the commands on this page.

1. Understand the S3 data layout

Legacy unversioned layout

Note

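This layout is being phased out in favor of the versioned layout below. The browse and download commands on this page use legacy paths; for the versioned layout, insert v1/ after parquet/ and expect the <start>_<end>_<table>.parquet file names described below.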

s3://<BUCKET_NAME>/parquet/{organizationId}/<table>/<year>/<month>/<start>_<table>.parquet

Where <start> is a UTC timestamp at hour resolution (YYYY_MM_DD_HH). Example:

s3://<BUCKET_NAME>/parquet/{organizationId}/measurementOccupancyStatus/2026/02/2026_02_24_08_measurementOccupancyStatus.parquet

Versioned layout

s3://<BUCKET_NAME>/parquet/v1/{organizationId}/<table>/<year>/<month>/<start>_<end>_<table>.parquet

Where <start> and <end> are UTC timestamps at minute resolution (YYYY_MM_DD_HHMM). The version segment (v1) allows schema changes to be introduced in a new version while keeping the old one available during a transition period. Example:

s3://<BUCKET_NAME>/parquet/v1/{organizationId}/measurementOccupancyStatus/2026/02/2026_02_24_0800_2026_02_24_0900_measurementOccupancyStatus.parquet
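
The two path templates can be expressed as small string-building helpers. This is a sketch under one assumption taken from the examples above: the year and month folders match the file's start timestamp. `legacy_uri` and `versioned_uri` are hypothetical names.

```shell
# Build the S3 URI of one export file in each layout.
# Assumption: the year/month folders match the start timestamp.

# legacy_uri BUCKET ORG_ID TABLE START        (START = YYYY_MM_DD_HH, UTC)
legacy_uri() {
  year=$(echo "$4" | cut -d_ -f1)
  month=$(echo "$4" | cut -d_ -f2)
  echo "s3://${1}/parquet/${2}/${3}/${year}/${month}/${4}_${3}.parquet"
}

# versioned_uri BUCKET ORG_ID TABLE START END (START/END = YYYY_MM_DD_HHMM, UTC)
versioned_uri() {
  year=$(echo "$4" | cut -d_ -f1)
  month=$(echo "$4" | cut -d_ -f2)
  echo "s3://${1}/parquet/v1/${2}/${3}/${year}/${month}/${4}_${5}_${3}.parquet"
}

# Example call:
#   versioned_uri "$BUCKET" "$ORG_ID" measurementOccupancyStatus 2026_02_24_0800 2026_02_24_0900
```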

Available tables

Category	Table names
Spatial	`space`, `zone`
Devices	`device`, `deviceGroup`, `deviceGroupDevices`, `deviceKeyword`
Occupancy	`measurementOccupancyStatus`, `measurementOccupantsCount`, `measurementOccupancySeconds`
Movement	`measurementDirectionalMovement`, `measurementPosition`, `measurementPositionZone`
Environmental	`measurementAmbientTemperature`, `measurementCO2`, `measurementTVOC`
Device health	`measurementBatteryPercentage`, `measurementBatteryVoltage`, `measurementBootCount`, `measurementDistance`
Other	`organization`, `deviceNote`

2. Browse available data

Tip

Set these variables once at the start of your session so you can copy-paste the commands below directly:

BUCKET="haltian-parquet-exports-prod"    # Replace with your bucket name
ORG_ID="29e95a47-c992-497a-b78e-072d70aa67a7"  # Replace with your org ID
PROFILE="parquet-access-orgname"         # Your AWS CLI profile name
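
Every command below expands these variables, so an empty one produces a wrong path (for example `s3://<bucket>/parquet//`) rather than a clear error. A small guard can catch that early. This is an optional sketch; `require_vars` is a hypothetical helper name.

```shell
# Sketch: stop early if a session variable is missing or empty.
# require_vars is a hypothetical helper name.
require_vars() {
  for name in BUCKET ORG_ID PROFILE; do
    eval "value=\${$name:-}"        # read the variable whose name is in $name
    if [ -z "$value" ]; then
      echo "Missing variable: $name" >&2
      return 1
    fi
  done
}

# Example call:
#   require_vars && echo "Session variables are set"
```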

Step 1: List all tables

This shows the table folders available — not individual files:

aws s3 ls "s3://${BUCKET}/parquet/${ORG_ID}/" \
  --profile ${PROFILE} \
  --region eu-west-1

Expected output — a list of table folders:

                           PRE device/
                           PRE deviceGroup/
                           PRE deviceGroupDevices/
                           PRE deviceKeyword/
                           PRE measurementAmbientTemperature/
                           PRE measurementBatteryPercentage/
                           PRE measurementBatteryVoltage/
                           PRE measurementBootCount/
                           PRE measurementCO2/
                           PRE measurementOccupancySeconds/
                           PRE measurementOccupancyStatus/
                           PRE measurementOccupantsCount/
                           PRE organization/
                           PRE space/
                           PRE zone/

PRE means “prefix” — these are folders, not files.
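
To reuse the table names in a script, the `PRE` lines can be reduced to bare names. A minimal sketch, assuming the listing format shown above; `prefix_names` is a hypothetical helper that reads the listing on standard input.

```shell
# Sketch: turn "PRE name/" lines from `aws s3 ls` into bare names.
# prefix_names is a hypothetical helper; it reads the listing on stdin
# and ignores lines that describe files.
prefix_names() {
  awk '$1 == "PRE" { sub(/\/$/, "", $2); print $2 }'
}

# Example call (needs credentials):
#   aws s3 ls "s3://${BUCKET}/parquet/${ORG_ID}/" \
#     --profile "${PROFILE}" --region eu-west-1 | prefix_names
```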

Step 2: List available months for a specific table

Pick a table from the list above and drill into it. Inside each table you will find year folders, and inside those, month folders:

aws s3 ls "s3://${BUCKET}/parquet/${ORG_ID}/measurementOccupancyStatus/" \
  --profile ${PROFILE} \
  --region eu-west-1

Expected output:

                           PRE 2025/
                           PRE 2026/

Drill into a year to see months:

aws s3 ls "s3://${BUCKET}/parquet/${ORG_ID}/measurementOccupancyStatus/2026/" \
  --profile ${PROFILE} \
  --region eu-west-1

Expected output:

                           PRE 01/
                           PRE 02/

Step 3: List individual files in a month

Inside each month folder you will find the actual Parquet files — typically one file per hour:

aws s3 ls "s3://${BUCKET}/parquet/${ORG_ID}/measurementOccupancyStatus/2026/02/" \
  --profile ${PROFILE} \
  --region eu-west-1

Expected output — one file per hour (default export frequency):

2026-02-01 01:15:00      12458 2026_02_01_00_measurementOccupancyStatus.parquet
2026-02-01 02:15:00      11830 2026_02_01_01_measurementOccupancyStatus.parquet
2026-02-01 03:15:00      13102 2026_02_01_02_measurementOccupancyStatus.parquet
...
2026-02-24 09:15:00      14567 2026_02_24_08_measurementOccupancyStatus.parquet

The file name format is YYYY_MM_DD_HH_<table>.parquet, where HH is the hour in UTC.
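
Each file line in the listing carries the last-modified date, time, size in bytes, and file name. The sketch below totals a month's listing to estimate the download size before syncing; `listing_summary` is a hypothetical helper that reads the listing on standard input.

```shell
# Sketch: count the Parquet files in an `aws s3 ls` listing and sum their sizes.
# Columns: date, time, size in bytes, file name.
# listing_summary is a hypothetical helper; it reads the listing on stdin.
listing_summary() {
  awk '$4 ~ /\.parquet$/ { n++; bytes += $3 }
       END { printf "%d files, %d bytes\n", n, bytes }'
}

# Example call (needs credentials):
#   aws s3 ls "s3://${BUCKET}/parquet/${ORG_ID}/measurementOccupancyStatus/2026/02/" \
#     --profile "${PROFILE}" --region eu-west-1 | listing_summary
```

Applied to the four file lines shown in the expected output above, this prints `4 files, 51957 bytes`.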

Summary: Folder hierarchy

s3://<BUCKET>/parquet/{organizationId}/
  └── measurementOccupancyStatus/       ← table
        └── 2026/                        ← year
              └── 02/                    ← month
                    ├── 2026_02_01_00_measurementOccupancyStatus.parquet
                    ├── 2026_02_01_01_measurementOccupancyStatus.parquet
                    ├── ...              ← one file per hour
                    └── 2026_02_24_08_measurementOccupancyStatus.parquet

3. Download data

Download everything (all tables, all dates)

aws s3 sync \
  "s3://${BUCKET}/parquet/${ORG_ID}/" \
  "data/parquet/${ORG_ID}/" \
  --profile ${PROFILE} \
  --region eu-west-1

Note

This downloads all tables and all history. Depending on how long exports have been running, this could be a large amount of data. Consider starting with a single table or month instead.

Download a single table

aws s3 sync \
  "s3://${BUCKET}/parquet/${ORG_ID}/measurementOccupancyStatus/" \
  "data/parquet/${ORG_ID}/measurementOccupancyStatus/" \
  --profile ${PROFILE} \
  --region eu-west-1

Download a specific month

aws s3 sync \
  "s3://${BUCKET}/parquet/${ORG_ID}/measurementOccupancyStatus/2026/02/" \
  "data/parquet/${ORG_ID}/measurementOccupancyStatus/2026/02/" \
  --profile ${PROFILE} \
  --region eu-west-1
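
Because file names start with the date, the month sync can be narrowed to a single UTC day with the `--exclude` and `--include` filters of `aws s3 sync`. This is a sketch for the legacy layout, not a command from Haltian; `sync_day` is a hypothetical helper name.

```shell
# sync_day TABLE YYYY MM DD: download one UTC day of one table (legacy layout).
# sync_day is a hypothetical helper; it uses BUCKET, ORG_ID and PROFILE.
sync_day() {
  table="$1"; year="$2"; month="$3"; day="$4"
  # Filters are applied in order: exclude everything, then re-include the day.
  aws s3 sync \
    "s3://${BUCKET}/parquet/${ORG_ID}/${table}/${year}/${month}/" \
    "data/parquet/${ORG_ID}/${table}/${year}/${month}/" \
    --exclude "*" \
    --include "${year}_${month}_${day}_*" \
    --profile "${PROFILE}" \
    --region eu-west-1
}

# Example call (needs credentials):
#   sync_day measurementOccupancyStatus 2026 02 24
```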

Download a single file

Use aws s3 cp with the full path to the file (including year, month, and the exact filename you found in Step 3 above):

aws s3 cp \
  "s3://${BUCKET}/parquet/${ORG_ID}/measurementOccupancyStatus/2026/02/2026_02_24_08_measurementOccupancyStatus.parquet" \
  "data/parquet/${ORG_ID}/measurementOccupancyStatus/2026/02/" \
  --profile ${PROFILE} \
  --region eu-west-1

The full S3 path breaks down as:

s3://haltian-parquet-exports-prod/parquet/29e95a47-c992-497a-b78e-072d70aa67a7/measurementOccupancyStatus/2026/02/2026_02_24_08_measurementOccupancyStatus.parquet
     └── bucket ──────────────────┘        └── org ID ────────────────────────┘ └── table ──────────────┘ └──┘ └┘ └── filename ───────────────────────────────────────────┘
                                                                                                         year  month
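
The same breakdown can be done in a script by splitting the URI on `/`. A minimal sketch for legacy-layout paths only; `describe_uri` is a hypothetical helper name.

```shell
# describe_uri URI: print the parts of a legacy-layout export URI.
# Splitting "s3://bucket/parquet/org/table/year/month/file" on "/" gives:
#   $1="s3:"  $2=""  $3=bucket  $4="parquet"  $5=org  $6=table  $7=year  $8=month  $9=file
describe_uri() {
  echo "$1" | awk -F/ '{
    printf "bucket=%s\norg_id=%s\ntable=%s\nyear=%s\nmonth=%s\nfile=%s\n", $3, $5, $6, $7, $8, $9
  }'
}

# Example call:
#   describe_uri "s3://${BUCKET}/parquet/${ORG_ID}/measurementOccupancyStatus/2026/02/2026_02_24_08_measurementOccupancyStatus.parquet"
```

A versioned-layout path has one extra segment (v1), so the field numbers after `parquet` would each shift by one.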

Download only new/changed files (incremental sync)

aws s3 sync skips files that already exist locally with the same size and a modification time at least as recent as the S3 object, so re-running the same command fetches only new or changed files:

aws s3 sync \
  "s3://${BUCKET}/parquet/${ORG_ID}/" \
  "data/parquet/${ORG_ID}/" \
  --profile ${PROFILE} \
  --region eu-west-1
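
For repeated or scheduled runs, the command can be wrapped in a function that passes extra flags through, so the same call serves both as a preview (with `--dryrun`) and as the real sync. A sketch only; `sync_all` is a hypothetical helper name.

```shell
# sync_all [EXTRA_FLAGS...]: incremental sync of all tables.
# Extra arguments are passed through unchanged to `aws s3 sync`.
# sync_all is a hypothetical helper; it uses BUCKET, ORG_ID and PROFILE.
sync_all() {
  aws s3 sync \
    "s3://${BUCKET}/parquet/${ORG_ID}/" \
    "data/parquet/${ORG_ID}/" \
    --profile "${PROFILE}" \
    --region eu-west-1 \
    "$@"
}

# Example calls (need credentials):
#   sync_all --dryrun     # preview what would be downloaded
#   sync_all              # download new or changed files
```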

Download multiple specific tables

for TABLE in space zone device measurementOccupancyStatus; do
  echo "Downloading ${TABLE}..."
  aws s3 sync \
    "s3://${BUCKET}/parquet/${ORG_ID}/${TABLE}/" \
    "data/parquet/${ORG_ID}/${TABLE}/" \
    --profile ${PROFILE} \
    --region eu-west-1
done
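
In the loop above, a failure for one table is easy to miss in the output. The variant below keeps going after a failed table and reports the failures at the end. It is a sketch, not a Haltian-provided script; `sync_tables` is a hypothetical helper name.

```shell
# sync_tables TABLE...: sync each table, continue on failure, report at the end.
# sync_tables is a hypothetical helper; it uses BUCKET, ORG_ID and PROFILE.
sync_tables() {
  failed=""
  for table in "$@"; do
    echo "Downloading ${table}..."
    if ! aws s3 sync \
        "s3://${BUCKET}/parquet/${ORG_ID}/${table}/" \
        "data/parquet/${ORG_ID}/${table}/" \
        --profile "${PROFILE}" \
        --region eu-west-1; then
      failed="${failed} ${table}"   # remember the table and move on
    fi
  done
  if [ -n "$failed" ]; then
    echo "Failed tables:${failed}" >&2
    return 1
  fi
}

# Example call (needs credentials):
#   sync_tables space zone device measurementOccupancyStatus
```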

Troubleshooting

Issue	Solution
`Access Denied`	Verify your credentials are correct. For SSO: re-run `aws sso login`. For access keys: check key/secret with `aws configure list --profile ${PROFILE}`
`NoSuchBucket`	Double-check the bucket name provided by Haltian
`NoSuchKey`	Verify the full file path — use `aws s3 ls` to browse and find the exact filename first
Empty listing	Your org may not have data for the requested table/period yet. Try listing the root `parquet/${ORG_ID}/` first
SSO session expired	Run `aws sso login --profile ${PROFILE}` or see AWS SSO Setup — Troubleshooting
Slow downloads	Add `--only-show-errors` to suppress per-file output for faster syncs
Need to see what would be downloaded	Add `--dryrun` flag to preview without downloading
Windows: `command not found`	Ensure AWS CLI is in your PATH. Restart PowerShell after installation