Reading part of a massive point cloud file: classification, octrees, and COPC for LAS/LAZ/PTS data
A LiDAR scan of a factory site or a road corridor, stored as LAS/LAZ, can run to tens of GB and contain billions of points, each an X, Y, Z coordinate with attributes such as reflectance intensity, return number, and a classification code. Most real tasks ("get only the ground points to build a terrain model", "get only the points within 50 m of a utility pole", "load only a low-resolution preview") don't need to read the whole file. This article breaks down how the LAS, LAZ, and PTS formats organize data by classification, bounding box, and spatial structure (octrees), and how tools and formats like PDAL, Entwine, and COPC exploit that organization to load only the points you need, reducing system load and speeding up large-scale 3D data processing.
Inside a LAS/LAZ/PTS file: header, point records, and classification
Unlike raster images or tabular data, a point cloud file is a discrete list of points in 3D space — but it still has clear structure that an engine can exploit:
- Header: holds the global bounding box (min/max X, Y, Z), the scale and offset applied to coordinates, the point data record format, and the total point count. The coordinate reference system (CRS) is stored in a Variable Length Record (VLR) that follows the header. The header itself is 227 to 375 bytes depending on the LAS version, so reading the first few hundred bytes is enough to know the entire spatial extent of the data
- Point record: each point carries coordinates, intensity (laser reflectance), return number (which echo of its pulse this point is; a pulse passing through tree canopy can produce several returns, and the total is stored separately as the number of returns), and, in most point formats, GPS time
- Classification: an ASPRS-standard code assigned to each point: 2 = ground, 3-5 = low/medium/high vegetation, 6 = building, 7 = low point (noise), 9 = water, and so on. This is the "class" that each point carries as its own label
- PTS: a simpler, text-based format in which each line is X Y Z intensity [R G B]. There's no standardized classification field in the file itself, so it's often used as input or output for external classification steps before the data is converted to LAS
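The header and point-record layout described above can be sketched with nothing but the standard library. This is a minimal illustration, assuming the byte offsets of the LAS public header block and of point data record format 0; the function names and returned dictionary keys are made up for this example, and a real pipeline would normally use PDAL or laspy instead.

```python
import struct

def parse_las_header(buf: bytes) -> dict:
    """Parse the fixed-size LAS public header block (little-endian)."""
    if buf[:4] != b"LASF":
        raise ValueError("missing LASF signature")
    major, minor = struct.unpack_from("<BB", buf, 24)
    header_size, point_offset, vlr_count = struct.unpack_from("<HII", buf, 94)
    fmt, record_len, count = struct.unpack_from("<BHI", buf, 104)
    scale = struct.unpack_from("<3d", buf, 131)
    offset = struct.unpack_from("<3d", buf, 155)
    # Bounds are stored as max/min pairs per axis, in this order.
    max_x, min_x, max_y, min_y, max_z, min_z = struct.unpack_from("<6d", buf, 179)
    if (major, minor) >= (1, 4):
        # LAS 1.4 adds a 64-bit point count at byte 247.
        (count,) = struct.unpack_from("<Q", buf, 247)
    return {
        "version": (major, minor),
        "header_size": header_size,
        "point_offset": point_offset,
        "vlr_count": vlr_count,
        # LAZ writers set the high bits of this byte to flag compression.
        "point_format": fmt & 0x3F,
        "record_length": record_len,
        "point_count": count,
        "scale": scale,
        "offset": offset,
        "bounds_min": (min_x, min_y, min_z),
        "bounds_max": (max_x, max_y, max_z),
    }

def decode_point0(rec: bytes, scale, offset) -> dict:
    """Decode the leading fields of a point data record format 0 entry."""
    xi, yi, zi, intensity, flags, cls = struct.unpack_from("<3iHBB", rec, 0)
    return {
        # Stored integers become real coordinates via scale and offset.
        "x": xi * scale[0] + offset[0],
        "y": yi * scale[1] + offset[1],
        "z": zi * scale[2] + offset[2],
        "intensity": intensity,
        "return_number": flags & 0b111,             # which echo this is
        "number_of_returns": (flags >> 3) & 0b111,  # echoes in the pulse
        "classification": cls & 0b11111,            # ASPRS code, formats 0-5
    }
```

Reading the first 375 bytes of a file and passing them to parse_las_header is enough to decide whether the file overlaps an area of interest, before any point record is touched.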
Filtering by classification — reading only the "type" of point you need
PDAL (Point Data Abstraction Library) lets you build pipelines with filters.range — for example, Classification[2:2] to keep only ground points when building a terrain model, or Classification[6:6] to keep only building points when reconstructing a 3D model of a facility. The catch: if the file is stored sequentially, filtering by classification still forces the engine to read through every point record to check each one — reducing downstream processing load, but not reducing file I/O. To actually reduce I/O, you need an additional spatial index layer.
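As a rough sketch of the classification filter just described, the snippet below builds a PDAL pipeline definition as JSON. The filenames scan.laz and ground.laz and the helper name are placeholders for this example; filters.range and its limits option are the PDAL stage and option referred to in the text.

```python
import json

def class_filter_pipeline(src: str, dst: str, classes: list[int]) -> dict:
    """Build a PDAL pipeline that keeps only the given ASPRS classes."""
    # Several ranges on the same dimension are OR-ed together by filters.range.
    limits = ",".join(f"Classification[{c}:{c}]" for c in classes)
    return {
        "pipeline": [
            src,  # reader is inferred from the file extension
            {"type": "filters.range", "limits": limits},
            dst,  # writer is inferred from the file extension
        ]
    }

if __name__ == "__main__":
    # Hypothetical filenames; keep ground points only.
    pipeline = class_filter_pipeline("scan.laz", "ground.laz", [2])
    print(json.dumps(pipeline, indent=2))
```

Saved to a file such as ground.json, the definition can be run with `pdal pipeline ground.json`. Note that the reader still decodes every point record of scan.laz before the filter discards the ones that don't match.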
Octrees and Entwine Point Tile (EPT) — spatial indexing for point clouds
Entwine builds an octree from the source point cloud: 3D space is recursively divided into 8 cells (octants), and each octree node holds a subset of points along with its own bounding box and level of detail (LOD). The Entwine Point Tile (EPT) structure stores these nodes as separate files keyed by depth and grid position (D-X-Y-Z), so a query by bounding box or resolution only needs to load the nodes that intersect the area of interest and can skip the rest of the octree entirely. This is the hierarchical "layout" point clouds use in place of row groups in tabular data. Potree relies on the same kind of octree to stream billions of points to a browser, loading only the nodes within the viewport and at the LOD appropriate for the camera distance.
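The node selection can be illustrated with a small traversal. This is a simplified sketch, not Entwine's implementation: it assumes a cubic root volume, derives each node's bounds purely from its depth and grid position, and ignores the hierarchy metadata a real EPT reader would consult to learn which nodes actually exist. All function names are invented for the example.

```python
from itertools import product

def node_bounds(root_min, root_size, d, x, y, z):
    """Bounds of node (d, x, y, z) inside a cubic root volume."""
    step = root_size / (1 << d)  # edge length halves at each depth
    lo = tuple(root_min[i] + k * step for i, k in enumerate((x, y, z)))
    hi = tuple(v + step for v in lo)
    return lo, hi

def boxes_intersect(lo, hi, qlo, qhi):
    """Axis-aligned box overlap test (touching faces count as overlap)."""
    return all(lo[i] <= qhi[i] and hi[i] >= qlo[i] for i in range(3))

def nodes_for_query(root_min, root_size, qlo, qhi, max_depth):
    """Return the D-X-Y-Z keys of all nodes that intersect the query box."""
    keys, stack = [], [(0, 0, 0, 0)]
    while stack:
        d, x, y, z = stack.pop()
        lo, hi = node_bounds(root_min, root_size, d, x, y, z)
        if not boxes_intersect(lo, hi, qlo, qhi):
            continue  # prune: no descendant of this node can match either
        keys.append(f"{d}-{x}-{y}-{z}")
        if d < max_depth:
            for dx, dy, dz in product((0, 1), repeat=3):
                stack.append((d + 1, 2 * x + dx, 2 * y + dy, 2 * z + dz))
    return sorted(keys)
```

For a root cube of edge 8 and a small query box near the origin, nodes_for_query((0, 0, 0), 8.0, (0.5, 0.5, 0.5), (1.5, 1.5, 1.5), 2) selects 3 of the 73 possible nodes down to depth 2. Lowering max_depth is how a low-resolution preview is obtained: shallower nodes hold a coarser sample of the same region.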
COPC — embedding an octree inside a single LAZ file, read via HTTP range requests
Cloud Optimized Point Cloud (COPC) goes a step further: instead of splitting the data into thousands of separate EPT files, COPC embeds the entire octree inside a single .laz file. An info VLR describes the root of the octree, and a hierarchy EVLR records the offset, byte size, and point count of each node. Each octree node is an independently decodable LAZ chunk. When a COPC file sits on S3 or R2, PDAL or lidR reads the header and the hierarchy first (typically just a few dozen KB), then issues HTTP Range requests for exactly the chunks of the nodes it needs, without converting to EPT or downloading the whole file.
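To make the offset-and-size lookup concrete, here is a minimal sketch that decodes COPC hierarchy entries and turns them into HTTP Range header values. It assumes the 32-byte entry layout from the COPC specification (four int32 key fields, a uint64 offset, an int32 byte size, an int32 point count); the helper names are hypothetical, and fetching the bytes and decompressing the LAZ chunk are left out.

```python
import struct
from typing import NamedTuple

# One hierarchy entry: key (level, x, y, z), offset, byte size, point count.
ENTRY = struct.Struct("<4iQ2i")  # 16 + 8 + 8 = 32 bytes

class Entry(NamedTuple):
    level: int
    x: int
    y: int
    z: int
    offset: int
    byte_size: int
    point_count: int

def parse_hierarchy_page(page: bytes) -> list[Entry]:
    """Decode a hierarchy page into its entries."""
    return [Entry(*fields) for fields in ENTRY.iter_unpack(page)]

def range_header(entry: Entry) -> str:
    """HTTP Range value covering exactly this entry's bytes (inclusive end)."""
    return f"bytes={entry.offset}-{entry.offset + entry.byte_size - 1}"

def data_requests(entries, wanted_keys) -> dict:
    """Map each wanted node key to the Range value for its point chunk."""
    return {
        (e.level, e.x, e.y, e.z): range_header(e)
        for e in entries
        # point_count of -1 marks a pointer to a child hierarchy page,
        # and 0 marks an empty node, so neither has point data to fetch.
        if e.point_count > 0 and (e.level, e.x, e.y, e.z) in wanted_keys
    }
```

A reader would combine this with a spatial test such as an octree traversal: pick the node keys that intersect the query, then send one ranged GET per selected entry.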
Principle: the real cost of a point cloud query isn't the size of the .laz file, but the number of octree nodes (chunks) that actually need to be loaded and decoded to answer the spatial question — the rest of the file is never read from disk or downloaded over the network.
Extracting an "object" — a bounding volume for a specific entity
A point cloud scan of an entire factory might contain billions of points, but a structural engineer may only need the data for one specific tank or truss frame to compare its shape over time (deformation monitoring). Combining all three layers — classification to keep only points belonging to the structure class, a bounding box for that specific object, and an octree to determine which nodes intersect that bounding box — the pipeline only needs to load and decode a tiny fraction of the original file, often under 1% of the total data, to reconstruct exactly the object being monitored.
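One way to combine the three layers is a single PDAL pipeline: readers.copc with its bounds option restricts the read to the object's box, and filters.range keeps one class. The sketch below only builds the JSON definition; the URL, output filename, coordinates, and the choice of class 6 are illustrative assumptions, not values from a real project.

```python
import json

def object_extract_pipeline(copc_url, dst, bbox, class_code):
    """Build a PDAL pipeline that reads one object's box from a COPC file."""
    (xmin, ymin, zmin), (xmax, ymax, zmax) = bbox
    # PDAL bounds syntax: ([xmin, xmax], [ymin, ymax], [zmin, zmax])
    bounds = f"([{xmin}, {xmax}], [{ymin}, {ymax}], [{zmin}, {zmax}])"
    return {
        "pipeline": [
            {"type": "readers.copc", "filename": copc_url, "bounds": bounds},
            {
                "type": "filters.range",
                "limits": f"Classification[{class_code}:{class_code}]",
            },
            {"type": "writers.las", "filename": dst},
        ]
    }

if __name__ == "__main__":
    # Hypothetical tank location and file names.
    pipeline = object_extract_pipeline(
        "https://example.com/plant.copc.laz",
        "tank_07.laz",
        ((10, 20, 0), (15, 26, 12)),
        6,
    )
    print(json.dumps(pipeline, indent=2))
```

Because the spatial restriction sits in the reader rather than in a later filter stage, the nodes outside the box are never requested in the first place.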
Applications at KonexForge
In pilots within the IoT & Sensors layer, LiDAR data from drones or ground scanners is one of the largest physical data sources we process — a single periodic scan of a plant can generate dozens of GB per pass. For the structural monitoring system at 4 steel plants, loading only the octree nodes that intersect the structural frame being monitored — instead of reloading the entire point cloud every scan cycle — lets the deformation-comparison pipeline run in minutes instead of hours, and allows scan history to be stored on cheap object storage while still being directly queryable. The filtered point data, after classification and bounding-box filtering, is typically fed onward into the Data Analytics layer to compute deviations and generate alerts.
Limitations — when preprocessing is needed first
Octrees and COPC only pay off once the index structure has already been built — a freshly scanned raw LAS file has no hierarchy yet, so the first pass still requires a one-time index-building step (entwine build, or untwine for COPC), which costs time and compute. Similarly, if classification hasn't been assigned yet (a raw PTS file, or LAS that hasn't been classified), filtering by class won't be available until a classification step runs — typically a ground-filtering algorithm (CSF, PMF) or a machine learning model. This preprocessing cost is one-time, but it needs to be accounted for in the pipeline from the start, not treated as an "optional" step added later.
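The one-time preprocessing pass could look roughly like the pipeline below, which runs a ground filter and writes the result as COPC in a single step. It is a sketch under assumptions: the filenames are placeholders, filters.pmf is used with its default parameters (which usually need tuning per site), and PDAL's writers.copc stands in for the untwine route mentioned above.

```python
import json

def preprocess_pipeline(raw_las: str, copc_out: str) -> dict:
    """Classify ground points, then write an indexed COPC file."""
    return {
        "pipeline": [
            {"type": "readers.las", "filename": raw_las},
            # Progressive Morphological Filter: labels ground points as class 2.
            {"type": "filters.pmf"},
            # Builds the octree and writes a single COPC .laz file.
            {"type": "writers.copc", "filename": copc_out},
        ]
    }

if __name__ == "__main__":
    # Hypothetical filenames for a freshly scanned, unclassified file.
    print(json.dumps(preprocess_pipeline("raw_scan.las", "scan.copc.laz"), indent=2))
```

Every later query, whether by class, by box, or by resolution, then runs against the indexed output rather than the raw scan.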
Conclusion
Just as with tabular data, most of the cost of processing point clouds doesn't come from "needing a bigger machine" — it comes from reading more than necessary. Classification tells you which "type" of point you need, the bounding box tells you which "region" you need, and octrees/COPC tell you exactly which bytes in the file contain those points. Designing a pipeline to exploit these three layers — starting from how data is stored — is the most effective way to keep a large-scale 3D data processing system fast and affordable over time.