Every building in the United States: 77 million footprints with business names and details extracted from OpenStreetMap. The source data comes from Protomaps PMTiles, which is derived from OSM's global building dataset.
Binary format for offline GPS → building lookup with full polygon footprints.
PTiles v6 compresses the ~130GB US buildings PMTile from protomaps.com into a single ~1.14GB file (99.1% reduction) while preserving full polygon geometry for all 77M+ buildings.
Key techniques enabling this compression:
- Zstd dictionary compression (level 22): Shared dictionary trained on building data
- Delta coordinate encoding: Zigzag + varint for vertex deltas (2-4 bytes/vertex vs 16 bytes raw)
- Delta OSM ID encoding: Sequential IDs within H3 cells compress to 1-2 bytes each
- H3 spatial clustering: Buildings grouped by geographic cell for better compression locality
- Indexed building types: 20 common types as 1-byte indices instead of strings
Single file containing 77M+ US building footprints with names where available. File size: ~1.14 GB (~15 bytes/building average).
| Metric | Value |
|---|---|
| Total buildings | 77,068,235 |
| H3 cells | 380,425 |
| File size | ~1.14 GB |
| Bytes per building | ~15 |
| Compression | zstd level 22 |
| Dictionary size | 512 KB |
| Coordinate precision | 1.1m (10 microdegrees) |
    ┌────────────────────────────────────────────────────────────────┐
    │ Header (256 bytes)                                             │
    ├────────────────────────────────────────────────────────────────┤
    │ Zstd Dictionary (512 KB typical)                               │
    ├────────────────────────────────────────────────────────────────┤
    │ Spatial Index (H3 cell → block offset/length)                  │
    ├────────────────────────────────────────────────────────────────┤
    │ Data Blocks (zstd compressed, one per H3 cell)                 │
    └────────────────────────────────────────────────────────────────┘
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| 0 | 8 | bytes | magic | PTILESF\x00 (F = footprints) |
| 8 | 1 | uint8 | version | 6 |
| 9 | 3 | - | reserved | Padding for alignment |
| 12 | 4 | float | min_lat | Bounding box south |
| 16 | 4 | float | min_lon | Bounding box west |
| 20 | 4 | float | max_lat | Bounding box north |
| 24 | 4 | float | max_lon | Bounding box east |
| 28 | 8 | uint64 | poi_count | Total building count |
| 36 | 4 | uint32 | block_count | Number of H3 cell blocks |
| 40 | 8 | uint64 | dict_offset | Byte offset to dictionary |
| 48 | 4 | uint32 | dict_length | Dictionary size in bytes |
| 52 | 8 | uint64 | index_offset | Byte offset to spatial index |
| 60 | 4 | uint32 | index_length | Index size in bytes |
| 64 | 8 | uint64 | blocks_offset | Byte offset to first data block |
| 72 | 184 | - | reserved | Future use (zeroed) |
Byte order: Little-endian throughout.
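As a rough illustration of the table above, the 72 meaningful header bytes can be unpacked with one `struct` format string. This is a minimal sketch, not the project's reader: the `parse_header` name, the `FIELDS` tuple, and the returned dictionary shape are assumptions made here for illustration.

```python
import struct

# Little-endian, no implicit padding: magic, version, 3 reserved bytes,
# 4 bounding-box floats, then the count/offset/length fields in table order.
HEADER_FORMAT = "<8sB3x4fQIQIQIQ"  # 72 bytes of fields
HEADER_SIZE = 256                  # the remaining 184 bytes are reserved

FIELDS = (
    "magic", "version", "min_lat", "min_lon", "max_lat", "max_lon",
    "poi_count", "block_count", "dict_offset", "dict_length",
    "index_offset", "index_length", "blocks_offset",
)


def parse_header(raw: bytes) -> dict:
    """Unpack the fixed header fields (hypothetical helper)."""
    if len(raw) < HEADER_SIZE:
        raise ValueError("header must be at least 256 bytes")
    values = struct.unpack_from(HEADER_FORMAT, raw, 0)
    header = dict(zip(FIELDS, values))
    if header["magic"] != b"PTILESF\x00":
        raise ValueError("not a PTiles footprints file")
    return header
```

Using the `<` prefix matters here: without it, `struct` would insert native alignment padding before the 8-byte fields, and offsets such as 28 for `poi_count` would no longer line up with the table.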
H3 resolution 7 cells (~5.16 km² average). Sorted by H3 cell ID for binary search.

    ┌────────────────────────────────────────────────────────────────┐
    │ entry_count (4 bytes, uint32)                                  │
    ├────────────────────────────────────────────────────────────────┤
    │ Entry 0                                                        │
    │   h3_cell (8 bytes, uint64)    - H3 index as integer           │
    │   block_offset (6 bytes)       - Absolute byte offset to block │
    │   block_length (3 bytes)       - Compressed block size         │
    │   poi_count (2 bytes, uint16)  - Buildings in this cell        │
    ├────────────────────────────────────────────────────────────────┤
    │ Entry 1...N (19 bytes each)                                    │
    └────────────────────────────────────────────────────────────────┘
Entry size: 19 bytes (8 + 6 + 3 + 2)
Why 6-byte offset: Supports files up to 281 TB (2^48 bytes). 4 bytes would limit to 4 GB.
Why 3-byte length: Max compressed block size 16 MB (2^24 bytes). Typical blocks are 2-50 KB.
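The 6-byte and 3-byte fields have no direct `struct` format code, so a reader has to assemble them by hand. The sketch below is one way to do that with `int.from_bytes`; `parse_index` and the dictionary keys are hypothetical names, and the cell is returned as a hex string only so that it matches the form `find_block_for_cell` expects.

```python
import struct

ENTRY_SIZE = 19  # 8 (h3_cell) + 6 (block_offset) + 3 (block_length) + 2 (poi_count)


def parse_index(raw: bytes) -> list[dict]:
    """Decode the spatial index into a list of entries (hypothetical helper)."""
    (entry_count,) = struct.unpack_from("<I", raw, 0)
    entries = []
    pos = 4
    for _ in range(entry_count):
        h3_cell = int.from_bytes(raw[pos:pos + 8], "little")
        # Odd-width integers: slice the bytes and convert explicitly.
        block_offset = int.from_bytes(raw[pos + 8:pos + 14], "little")
        block_length = int.from_bytes(raw[pos + 14:pos + 17], "little")
        poi_count = int.from_bytes(raw[pos + 17:pos + 19], "little")
        entries.append({
            "h3_cell": format(h3_cell, "x"),  # hex string form of the H3 index
            "block_offset": block_offset,
            "block_length": block_length,
            "poi_count": poi_count,
        })
        pos += ENTRY_SIZE
    return entries
```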
    def find_block_for_cell(index: list[dict], cell: str) -> dict | None:
        """Binary-search the sorted index for an H3 cell given as a hex string."""
        cell_int = int(cell, 16)
        left, right = 0, len(index) - 1
        while left <= right:
            mid = (left + right) // 2
            entry_int = int(index[mid]["h3_cell"], 16)
            if entry_int == cell_int:
                return index[mid]
            elif entry_int < cell_int:
                left = mid + 1
            else:
                right = mid - 1
        return None

Each block is zstd compressed (level 22) with the shared dictionary. It contains all buildings whose centroid falls within the H3 cell.
Decompressed format:
    ┌────────────────────────────────────────────────────────────────┐
    │ Record 0                                                       │
    │   record_length (4 bytes, uint32) - Size of record data        │
    │   record_data (variable)          - Building record            │
    ├────────────────────────────────────────────────────────────────┤
    │ Record 1...N                                                   │
    └────────────────────────────────────────────────────────────────┘
Buildings within a block are sorted by OSM ID for delta encoding efficiency.
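Walking a decompressed block amounts to reading a length prefix and slicing out that many bytes, repeatedly. A minimal sketch, assuming the block has already been decompressed and that `record_length` counts only the record data (not the 4-byte prefix itself); `iter_records` is a hypothetical name:

```python
import struct


def iter_records(block: bytes):
    """Yield each record's raw bytes from a decompressed block."""
    pos = 0
    while pos < len(block):
        (length,) = struct.unpack_from("<I", block, pos)
        pos += 4
        if pos + length > len(block):
            raise ValueError("truncated record")
        yield block[pos:pos + length]
        pos += length
```

Because OSM IDs are delta-encoded, records have to be consumed in order: a reader cannot skip ahead and still recover the absolute ID of a later building.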
The shared dictionary is trained on a representative sample of building data:
- Sample ~10,000 buildings across diverse geographic regions
- Train with `zstd --train` at 512 KB dictionary size
- Dictionary captures common coordinate delta patterns and string prefixes
| Field | Encoding | Description |
|---|---|---|
| osm_id | varint (delta) | Delta from previous OSM ID in block |
| vertex_count | uint8 | Polygon vertex count (max 255) |
| first_lon | int32 | First longitude × 100,000 |
| first_lat | int32 | First latitude × 100,000 |
| deltas | varint pairs | Zigzag-encoded delta lon/lat |
| flags | uint8 | Bit flags for optional fields |
| btype_idx | uint8 | Building type (see table) |
| [btype_str] | uint8 len + UTF-8 | Only if btype_idx = 255 |
| [name] | uint16 len + UTF-8 | Only if flags & 0x01 |
| [category] | uint8 len + UTF-8 | Only if flags & 0x02 |
| [name_src] | uint8 len + UTF-8 | Only if flags & 0x04 |
| [poi_osm_id] | uint64 | Only if flags & 0x08 |
| [height] | uint8 | Only if flags & 0x10 (0.5 m steps) |
Each subsequent vertex (after the first) is encoded as a pair of zigzag varints.
Coordinates are stored as integers in units of 10 microdegrees (degrees × 100,000):
- 1 unit (10 microdegrees) ≈ 1.1 m at the equator
- int32 range: ±21,474° (covers entire Earth with room to spare)
- Precision: ~1.1 m at the equator; east-west spacing shrinks to ~0.7 m at 50° latitude
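The precision figures follow from the scale factor and the length of a degree. The sketch below reproduces them; the function names are hypothetical, and the 111,320 m per degree constant is a common approximation for the equator rather than a value taken from this format.

```python
import math

SCALE = 100_000              # stored integer = degrees × 100,000
METERS_PER_DEGREE = 111_320  # approximate length of one degree at the equator


def quantize(degrees: float) -> int:
    """Convert degrees to the stored integer unit."""
    return round(degrees * SCALE)


def dequantize(units: int) -> float:
    """Convert a stored integer back to degrees."""
    return units / SCALE


def lon_unit_meters(latitude_deg: float) -> float:
    """Approximate east-west ground distance covered by one stored unit."""
    return METERS_PER_DEGREE / SCALE * math.cos(math.radians(latitude_deg))
```

Rounding to the nearest unit means the worst-case error per axis is half a unit, so roughly 0.55 m at the equator.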
- Calculate the delta from the previous vertex (in stored units of 10 microdegrees):

      delta_lon = current_lon - previous_lon
      delta_lat = current_lat - previous_lat

  Typical building wall deltas: about -50 to +50 units (-55 m to +55 m).

- Zigzag encoding converts signed integers to unsigned (small magnitudes → small values), using an arithmetic right shift on the 32-bit value:

      zigzag(n) = (n << 1) ^ (n >> 31)

      Examples:
         0 → 0
        -1 → 1
         1 → 2
        -2 → 3
         2 → 4

  This maps small negative numbers to small positive numbers, improving varint efficiency.

- Varint encoding (protobuf-style, 7 bits per byte, MSB = continuation):

      while value >= 0x80:
          emit(0x80 | (value & 0x7F))
          value >>= 7
      emit(value)

      Byte costs:
        0-127:          1 byte
        128-16383:      2 bytes
        16384-2097151:  3 bytes

| Delta magnitude | Zigzag value | Varint bytes | Typical usage |
|---|---|---|---|
| 0 | 0 | 1 | Repeated coordinate |
| ±1 to ±63 | 1-127 | 1 | Walls up to ~70 m (most buildings) |
| ±64 to ±8191 | 128-16383 | 2 | Longer walls (~70 m to ~9 km) |
| ±8192+ | 16384+ | 3+ | Very large structures |
Typical building: 5-8 vertices, 2-4 bytes per delta pair = 10-32 bytes for geometry.
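For the writing side, the same three steps (delta, zigzag, varint) can be sketched as below. This is an illustrative encoder under the assumption that vertices are already quantized integers; `zigzag_encode`, `encode_varint`, and `encode_deltas` are names chosen here, not necessarily those used by the project's writer.

```python
def zigzag_encode(n: int) -> int:
    """Map a signed 32-bit delta to an unsigned value (0, -1, 1, -2 -> 0, 1, 2, 3)."""
    return (n << 1) ^ (n >> 31)


def encode_varint(value: int) -> bytes:
    """Encode an unsigned integer, 7 bits per byte, MSB set on all but the last byte."""
    out = bytearray()
    while value >= 0x80:
        out.append(0x80 | (value & 0x7F))
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_deltas(vertices: list[tuple[int, int]]) -> bytes:
    """Encode every vertex after the first as zigzag-varint (delta_lon, delta_lat) pairs."""
    out = bytearray()
    prev_lon, prev_lat = vertices[0]
    for lon, lat in vertices[1:]:
        out += encode_varint(zigzag_encode(lon - prev_lon))
        out += encode_varint(zigzag_encode(lat - prev_lat))
        prev_lon, prev_lat = lon, lat
    return bytes(out)
```

For example, a footprint with vertices `(0, 0)`, `(10, 0)`, `(10, -5)` produces the four bytes `14 00 00 09` (hex) for its two delta pairs.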
    def decode_varint(data: bytes, pos: int) -> tuple[int, int]:
        """Decode unsigned varint. Returns (value, bytes_consumed)."""
        result = shift = 0
        start = pos
        while True:
            b = data[pos]
            result |= (b & 0x7F) << shift
            pos += 1
            if not (b & 0x80):
                break
            shift += 7
        return result, pos - start

    def zigzag_decode(n: int) -> int:
        """Decode zigzag unsigned to signed integer."""
        return (n >> 1) ^ -(n & 1)

    def decode_coordinates(data: bytes, pos: int, first_lon: int, first_lat: int,
                           vertex_count: int) -> tuple[list, int]:
        """Decode all coordinates from delta-encoded data."""
        coords = [[first_lon / 100000, first_lat / 100000]]
        prev_lon, prev_lat = first_lon, first_lat
        start_pos = pos
        for _ in range(vertex_count - 1):
            delta_lon_raw, consumed = decode_varint(data, pos)
            pos += consumed
            delta_lat_raw, consumed = decode_varint(data, pos)
            pos += consumed
            delta_lon = zigzag_decode(delta_lon_raw)
            delta_lat = zigzag_decode(delta_lat_raw)
            prev_lon += delta_lon
            prev_lat += delta_lat
            coords.append([prev_lon / 100000, prev_lat / 100000])
        return coords, pos - start_pos

Buildings are sorted by OSM ID within each block. The first building stores its full ID as a varint; each subsequent building stores the delta from the previous ID as a varint.
    Building 1: OSM ID 130905906 → varint(130905906) = 4 bytes
    Building 2: OSM ID 130905912 → varint(6) = 1 byte
    Building 3: OSM ID 130905915 → varint(3) = 1 byte
    Building 4: OSM ID 130905920 → varint(5) = 1 byte
Typical delta: 1-100 (buildings created in sequence) = 1 byte each.
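The byte costs in this example can be checked without building the encoded bytes, since a varint needs one byte per 7 bits of payload. A small sketch with hypothetical helper names, assuming (as the decoder's default `prev_osm_id = 0` suggests) that the first delta is taken from zero:

```python
def varint_size(value: int) -> int:
    """Bytes needed for an unsigned varint: one per 7 bits, minimum one."""
    return max(1, (value.bit_length() + 6) // 7)


def osm_id_costs(sorted_ids: list[int]) -> list[int]:
    """Per-building byte cost when each ID is stored as a delta from the previous one."""
    costs = []
    prev = 0  # the first delta is taken from zero, so it carries the full ID
    for osm_id in sorted_ids:
        costs.append(varint_size(osm_id - prev))
        prev = osm_id
    return costs
```

Applied to the four IDs above, `osm_id_costs` gives 4, 1, 1, 1 bytes: 7 bytes in total versus 32 bytes for four raw uint64 values.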
| Bit | Mask | Field Present | Encoding if present |
|---|---|---|---|
| 0 | 0x01 | name | uint16 length + UTF-8 |
| 1 | 0x02 | category | uint8 length + UTF-8 |
| 2 | 0x04 | name_source | uint8 length + UTF-8 |
| 3 | 0x08 | poi_osm_id | uint64 (8 bytes) |
| 4 | 0x10 | height | uint8 (0.5m steps, 0-127.5m) |
| 5-7 | - | reserved | - |
20 most common OSM building=* values encoded as 1-byte index:
| Index | Type | Index | Type |
|---|---|---|---|
| 0 | yes | 10 | shed |
| 1 | house | 11 | detached |
| 2 | residential | 12 | terrace |
| 3 | commercial | 13 | school |
| 4 | industrial | 14 | church |
| 5 | retail | 15 | hospital |
| 6 | garage | 16 | hotel |
| 7 | apartments | 17 | roof |
| 8 | office | 18 | construction |
| 9 | warehouse | 19 | barn |
| 255 | (variable) | - | uint8 len + UTF-8 follows |
Index 255 indicates a custom string follows (rare building types).
- Convert query lat/lng to H3 cell (resolution 7)
- Binary search index for matching H3 cell
- Fetch block at offset (HTTP range request or file seek)
- Decompress with shared dictionary
- Iterate building records, accumulating OSM ID deltas
- Reconstruct polygon from delta coordinates
- Point-in-polygon test against query point
- Return first containing building (or nearest within 50m)
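For hosted files, cache the header, dictionary, and index on the client (~7.8 MB: a 256-byte header, a 512 KB dictionary, and a ~7.2 MB index of 380,425 entries × 19 bytes). Each query then requires one range request for the data block (~2-50 KB compressed).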
    GET /US.ptiles
    Range: bytes=0-<blocks_offset - 1>   # Header + dict + index (once, cached); end comes from the header

    GET /US.ptiles
    Range: bytes=12345678-12348000       # Single block per query
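HTTP byte ranges are inclusive at both ends, so the last byte of a block is `block_offset + block_length - 1`. The sketch below builds such a request with the standard library without sending it; `block_request` is a hypothetical helper and the URL in the usage note is a placeholder.

```python
from urllib.request import Request


def block_request(url: str, block_offset: int, block_length: int) -> Request:
    """Build (but do not send) a GET request for one compressed block."""
    # The Range header names the first and last byte, both inclusive.
    last_byte = block_offset + block_length - 1
    return Request(url, headers={"Range": f"bytes={block_offset}-{last_byte}"})
```

For instance, `block_request("https://example.com/US.ptiles", 12345678, 2323)` carries the header `Range: bytes=12345678-12348000`, matching the per-query request shown above. A server that honors the range answers with status 206 and exactly `block_length` bytes.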
    import struct

    BTYPE_REVERSE = {
        0: "yes", 1: "house", 2: "residential", 3: "commercial", 4: "industrial",
        5: "retail", 6: "garage", 7: "apartments", 8: "office", 9: "warehouse",
        10: "shed", 11: "detached", 12: "terrace", 13: "school", 14: "church",
        15: "hospital", 16: "hotel", 17: "roof", 18: "construction", 19: "barn",
    }

    def decode_building_v6(data: bytes, offset: int, prev_osm_id: int = 0):
        """Decode v6 binary building record. Returns (building_dict, bytes_consumed)."""
        pos = offset

        # OSM ID (delta varint)
        osm_id_delta, consumed = decode_varint(data, pos)
        pos += consumed
        osm_id = prev_osm_id + osm_id_delta

        # Vertex count (1 byte)
        vertex_count = data[pos]
        pos += 1

        # First coordinate (8 bytes: int32 lon, int32 lat)
        first_lon, first_lat = struct.unpack_from("<ii", data, pos)
        pos += 8

        # Delta coordinates (varint zigzag pairs)
        coords, consumed = decode_coordinates(data, pos, first_lon, first_lat, vertex_count)
        pos += consumed

        # Flags (1 byte)
        flags = data[pos]
        pos += 1
        has_name = flags & 0x01
        has_category = flags & 0x02
        has_name_source = flags & 0x04
        has_poi_osm_id = flags & 0x08
        has_height = flags & 0x10

        # Building type (1 byte index or 255 + variable string)
        btype_idx = data[pos]
        pos += 1
        if btype_idx == 255:
            btype_len = data[pos]
            pos += 1
            btype = data[pos:pos + btype_len].decode("utf-8")
            pos += btype_len
        else:
            btype = BTYPE_REVERSE.get(btype_idx, "yes")

        # Calculate centroid
        lats = [c[1] for c in coords]
        lons = [c[0] for c in coords]
        building = {
            "osm_id": osm_id,
            "geometry": {"type": "Polygon", "coordinates": [coords]},
            "centroid_lat": round(sum(lats) / len(lats), 6),
            "centroid_lon": round(sum(lons) / len(lons), 6),
            "building_type": btype,
        }

        # Optional fields
        if has_name:
            name_len = struct.unpack_from("<H", data, pos)[0]
            pos += 2
            building["name"] = data[pos:pos + name_len].decode("utf-8")
            pos += name_len
        if has_category:
            cat_len = data[pos]
            pos += 1
            building["category"] = data[pos:pos + cat_len].decode("utf-8")
            pos += cat_len
        if has_name_source:
            src_len = data[pos]
            pos += 1
            building["name_source"] = data[pos:pos + src_len].decode("utf-8")
            pos += src_len
        if has_poi_osm_id:
            building["poi_osm_id"] = struct.unpack_from("<Q", data, pos)[0]
            pos += 8
        if has_height:
            height_byte = data[pos]
            pos += 1
            building["height_m"] = height_byte * 0.5

        return building, pos - offset

| Version | Changes |
|---|---|
| 6 | Delta OSM IDs + zigzag varint coords (this) |
| 5 | Varint coords, full OSM IDs |
| 4 | Binary footprints, fixed-size coords (4 bytes/delta) |
| 3 | JSON minimal format (gzip) |
| 1-2 | POI points only (no polygons) |
| Language | File | Notes |
|---|---|---|
| Python | scripts/read_ptiles_footprints.py | Reader (lines 241-254) |
| Python | scripts/build_ptiles_footprints.py | Writer (encoder) |
- h3: Hexagonal spatial indexing (Uber H3 library)
- zstandard: Compression with trained dictionary
- shapely: Point-in-polygon tests (reader only)
| Format | Size | Bytes/building |
|---|---|---|
| Protomaps PMTile (source) | ~130 GB | ~1,700 |
| PTiles v4 (fixed coords) | ~2.1 GB | ~28 |
| PTiles v5 (varint coords) | ~1.5 GB | ~20 |
| PTiles v6 (delta IDs) | ~1.14 GB | ~15 |
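The per-building figures and the overall reduction follow directly from the file sizes and the building count. A quick check, assuming the table's sizes are decimal gigabytes (the helper names are illustrative):

```python
BUILDINGS = 77_068_235
GB = 1_000_000_000  # decimal gigabytes; an assumption about the table's units


def bytes_per_building(size_gb: float) -> float:
    """Average bytes spent per building for a file of the given size."""
    return size_gb * GB / BUILDINGS


def reduction_percent(original_gb: float, compressed_gb: float) -> float:
    """Size reduction of the compressed file relative to the original."""
    return (1 - compressed_gb / original_gb) * 100
```

With these inputs, 1.14 GB works out to just under 15 bytes per building, and going from 130 GB to 1.14 GB is a reduction of about 99.1%.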
