How LiDAR works
LiDAR stands for Light Detection And Ranging. The sensor emits very short laser pulses and measures how long each one takes to return after bouncing off a surface. This is called time of flight. The speed of light is known (c ≈ 3×10⁸ m/s), so the round-trip time Δt gives a distance directly:

r = c · Δt / 2
The division by 2 is because the pulse has to travel to the surface and back. Modern automotive LiDARs fire these pulses across a set of beams arranged in vertical layers — typically 16 to 128 of them. A rotating mirror or prism sweeps each beam horizontally through a full 360°. At each moment, the sensor knows its beam's azimuth angle θ (horizontal) and elevation angle φ (vertical). Combined with the measured distance r, this gives a 3D point in Cartesian coordinates:

x = r · cos φ · cos θ
y = r · cos φ · sin θ
z = r · sin φ
A sensor spinning at 10 Hz with 32 beam layers and ~1800 horizontal steps per rotation produces roughly 32 × 1800 × 10 = 576,000 points per second, or about 57k per frame. Higher-end sensors with 128 layers reach 2+ million points per second. Each point carries only its coordinates — no color, no surface normal, no connectivity. The result is a raw, unordered scatter of 3D measurements called a point cloud.
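The conversion from one pulse to one point can be written in a few lines. This is a minimal sketch; the function name and argument conventions are illustrative, not from a particular driver API:

```python
import numpy as np

C = 3.0e8  # speed of light, m/s

def pulse_to_point(dt, theta, phi):
    """Convert one time-of-flight measurement plus beam angles to a 3D point.

    dt    : round-trip time of the laser pulse, seconds
    theta : azimuth (horizontal) angle, radians
    phi   : elevation (vertical) angle, radians
    """
    r = C * dt / 2.0                      # divide by 2: out-and-back trip
    x = r * np.cos(phi) * np.cos(theta)
    y = r * np.cos(phi) * np.sin(theta)
    z = r * np.sin(phi)
    return np.array([x, y, z])
```

A full frame is just this conversion applied to every (beam, azimuth step) pair in one rotation.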
One important property: point density falls off with distance. A surface at 5 m returns roughly 16× more points per square meter than the same surface at 20 m, simply because the beams have spread further apart. Distant cones may return only 1–3 points total per scan. This is a key limitation we have to design around.
Step 1 — Voxel grid downsampling
With 50k–200k points per frame, running any geometry algorithm directly on the raw cloud is expensive. Worse, nearby points often measure the same surface patch from slightly different angles — redundant information that slows everything down. The solution is to discretize space into a regular 3D grid and keep only one representative point per occupied cell.
A voxel is a cube of fixed side length l — the 3D analogue of a pixel. Think of it like rounding all coordinates to the nearest grid position. For any point (x, y, z), its voxel index is:

(i, j, k) = (⌊x / l⌋, ⌊y / l⌋, ⌊z / l⌋)
All points that share the same triple (i, j, k) belong to the same voxel. We replace them with their centroid — the coordinate-wise average of the n points in the cell:

c = (p₁ + p₂ + … + pₙ) / n
The centroid is a better representative than just picking one point, because it averages out sensor noise. A LiDAR return isn't perfectly precise — each measurement has a few centimeters of noise. Averaging n points reduces that noise by a factor of √n.
Why l = 0.1 m? A Formula Student cone is 285 mm wide at the base. With 10 cm voxels it spans roughly 2–3 cells across, which is enough for the shape tests later. Going smaller gives a less aggressive reduction with little benefit. Going larger risks merging a cone with its neighboring ground points — the cone disappears into the floor. The animation below shows one full cycle: raw scatter, grid overlay, then each cell collapsing to a single centroid.
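The whole downsampling step fits in a few lines. This sketch groups points with a plain Python dict keyed by voxel index; a production version would vectorize the grouping, but the logic is identical:

```python
import numpy as np

def voxel_downsample(points, l=0.1):
    """Replace all points falling in the same l-sized voxel by their centroid.

    points : (n, 3) array of x, y, z coordinates
    l      : voxel side length in meters (0.1 m as discussed above)
    """
    idx = np.floor(points / l).astype(np.int64)     # (i, j, k) per point
    voxels = {}
    for key, p in zip(map(tuple, idx), points):
        voxels.setdefault(key, []).append(p)
    # One centroid per occupied voxel
    return np.array([np.mean(ps, axis=0) for ps in voxels.values()])
```

Note the use of floor rather than rounding: it guarantees each point maps to exactly one cell, including points with negative coordinates.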
Step 2 — Ground removal with RANSAC
After downsampling, roughly 60–70% of the remaining points are still ground. The naive approach — discard everything where z < threshold — fails as soon as the road has any slope, the sensor has a few degrees of pitch, or the ground is uneven. We need to fit the ground plane rather than assume it.
RANSAC (Random Sample Consensus) is a general technique for fitting models to data that contains a large fraction of outliers. The core idea is counterintuitive: instead of using all the data, repeatedly try small random subsets and find the one that explains the most data. Here's the full procedure for plane fitting:
- Pick 3 points at random from the downsampled cloud.
- Fit a plane through them. Two edge vectors from the first point are v₁ = p₂ − p₁ and v₂ = p₃ − p₁. Their cross product gives the plane normal:
  n = v₁ × v₂ = (v₁.y·v₂.z − v₁.z·v₂.y, v₁.z·v₂.x − v₁.x·v₂.z, v₁.x·v₂.y − v₁.y·v₂.x)
  With d = −(n · p₁), the plane equation is n·p + d = 0.
- Count every point within perpendicular distance τ = 0.15 m of this plane as an inlier: dist(p) = |n · p + d| / ‖n‖
- If this plane has more inliers than the current best, save it.
- Repeat for N iterations. Discard all inliers of the best plane.
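The loop above can be sketched directly in Python. The function name and return convention are illustrative:

```python
import numpy as np

def ransac_plane(points, tau=0.15, iters=20, rng=None):
    """Fit the dominant plane by RANSAC; return (normal, d, inlier_mask)."""
    rng = np.random.default_rng(rng)
    best_n = best_d = best_mask = None
    best_count = -1
    for _ in range(iters):
        p1, p2, p3 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p2 - p1, p3 - p1)           # plane normal
        norm = np.linalg.norm(n)
        if norm < 1e-9:                          # degenerate (collinear) sample
            continue
        d = -np.dot(n, p1)                       # plane equation: n·p + d = 0
        dist = np.abs(points @ n + d) / norm     # perpendicular distances
        mask = dist < tau
        count = mask.sum()
        if count > best_count:                   # keep the plane with most inliers
            best_count, best_n, best_d, best_mask = count, n, d, mask
    return best_n, best_d, best_mask
```

Ground removal is then `points[~best_mask]` — everything that is not an inlier of the best plane.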
How many iterations do we need? If the fraction of ground points is ε ≈ 0.7, the probability that a single random draw of 3 points is all-ground is ε³ ≈ 0.34. The probability of never drawing such a triple in N tries is (1 − ε³)^N. We want this failure probability to be small:

(1 − ε³)^N ≤ p_fail  ⟹  N ≥ log(p_fail) / log(1 − ε³)

With ε³ ≈ 0.34, N = 20 already drives the failure probability down to (1 − 0.34)²⁰ ≈ 2×10⁻⁴.
We run 20 iterations. Each takes a handful of microseconds (just a cross product and ~10k distance computations over the downsampled cloud), so the total cost is negligible. After removal the cloud shrinks from ~10k to ~1–2k points, almost entirely vertical objects: cones, barriers, the car itself.
The threshold τ = 0.15 m is generous on purpose. LiDAR returns have 2–5 cm of noise, but the ground also isn't perfectly flat — small bumps, painted lines, and water all add variation. Setting τ too tight leaves ground points that then appear as phantom objects in the next stage.
Step 3 — Euclidean clustering
After ground removal, the remaining points belong to distinct objects. Points that are part of the same physical object are close together; points from different objects are separated by empty space. We exploit this by building a graph where two points pᵢ and pⱼ are connected if and only if their Euclidean distance is at most dc = 0.5 m:

‖pᵢ − pⱼ‖ = √((xᵢ − xⱼ)² + (yᵢ − yⱼ)² + (zᵢ − zⱼ)²) ≤ dc
The clusters are then the connected components of this graph. Finding them is a classic BFS (Breadth-First Search): start from any unvisited seed point, find all its neighbors within dc, add them to the cluster and the queue, then process each neighbor the same way until the queue is empty. Mark the whole cluster as visited and pick a new seed.
The catch is efficiency. For each point we need to find all neighbors within a radius — the naive approach computes the distance to every other point, which is O(n²). With 2k points that's 4 million distance computations per frame. We use a k-d tree instead.
A k-d tree is a binary tree that recursively partitions the point set by alternating splitting axes. At each node, points are split at the median coordinate of the current axis. A radius query then prunes entire subtrees that can't possibly contain neighbors — the search becomes O(log n) per point. The tree is built once per frame in O(n log n) and then queried up to n times, giving total clustering cost O(n log n) instead of O(n²).
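Putting the BFS and the k-d tree together, the clustering step might look like this. The sketch assumes SciPy is available; its `cKDTree.query_ball_point` performs exactly the pruned radius query described above:

```python
from collections import deque

import numpy as np
from scipy.spatial import cKDTree

def euclidean_cluster(points, dc=0.5):
    """Group points into connected components of the radius-dc neighbor graph.

    Returns a list of index arrays, one per cluster.
    """
    tree = cKDTree(points)                 # built once per frame: O(n log n)
    visited = np.zeros(len(points), dtype=bool)
    clusters = []
    for seed in range(len(points)):
        if visited[seed]:
            continue
        queue, cluster = deque([seed]), []
        visited[seed] = True
        while queue:                       # BFS over the implicit graph
            i = queue.popleft()
            cluster.append(i)
            for j in tree.query_ball_point(points[i], dc):
                if not visited[j]:
                    visited[j] = True
                    queue.append(j)
        clusters.append(np.array(cluster))
    return clusters
```

The size filter from the next paragraph is then a one-line comprehension over the returned list.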
After clustering, we filter by size. Clusters with fewer than nmin = 3 points are sensor noise. Clusters with more than nmax = 200 points are large obstacles — barriers, other cars, walls. What remains: typically 5–20 candidates that are the right size to be a cone.
Step 4 — Geometric classification
Each candidate cluster is tested against the known physical dimensions of a Formula Student cone: height H = 325 mm, base diameter D = 285 mm. We first compute the cluster's axis-aligned bounding box — the smallest box with sides parallel to the coordinate axes that contains all the cluster points:

Δx = xmax − xmin,  Δy = ymax − ymin,  Δz = zmax − zmin
Three filters run in sequence. If any fails, the cluster is rejected immediately without running the rest: the box height Δz must be close to the cone height H, the footprint extents Δx and Δy must be no wider than the base diameter D allows, and the footprint must be roughly symmetric — the normalized difference |Δx − Δy| / max(Δx, Δy) must not exceed 0.4.
The symmetry check is the subtlest one. A cone is a rotationally symmetric object, so its footprint from above should be roughly square regardless of which direction the sensor is viewing it from. A thin wall or barrier may have a small footprint in one direction and a large one in the other — the normalized difference catches this. The threshold 0.4 allows for some asymmetry from partial occlusion without being so loose it passes elongated objects.
A fourth optional check: compare the number of points in the upper and lower half of the cluster's height range. Because a cone tapers, the bottom half has more surface area and therefore more LiDAR returns than the top half. If nlower < nupper the shape is upside-down or not conical — reject.
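All four checks together are a few comparisons on the bounding box. The tolerance factors below (±50% on height and width) are illustrative assumptions, not the pipeline's exact values; only the cone dimensions and the 0.4 symmetry threshold come from the text above:

```python
import numpy as np

# Cone dimensions from the rules: height 325 mm, base diameter 285 mm.
H, D = 0.325, 0.285

def looks_like_cone(cluster):
    """Apply the bounding-box, symmetry, and taper tests to one candidate cluster."""
    lo, hi = cluster.min(axis=0), cluster.max(axis=0)
    dx, dy, dz = hi - lo
    if not (0.5 * H <= dz <= 1.5 * H):           # height check (assumed tolerance)
        return False
    if max(dx, dy) > 1.5 * D:                    # width check (assumed tolerance)
        return False
    if abs(dx - dy) / max(dx, dy, 1e-9) > 0.4:   # footprint symmetry check
        return False
    # Optional taper check: the lower half should return at least as many points.
    mid_z = lo[2] + dz / 2.0
    n_lower = np.sum(cluster[:, 2] < mid_z)
    return n_lower >= len(cluster) - n_lower
```

Because the checks are ordered cheapest-to-reject first, most non-cone clusters exit after a single comparison.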
The main failure mode is range. Beyond about 15 m, a cone returns only 1–3 points per frame. The height and width measurements become unreliable with so few samples. The symmetry check may pass or fail arbitrarily. At that range you need either a temporal map (accumulate observations across frames), a learning-based detector that was trained on sparse returns, or a higher-resolution sensor.
Why this pipeline works
Each stage does only as much work as necessary and hands off a much smaller problem to the next. Downsampling turns 200k points into 10k. Ground removal turns 10k into 2k. Clustering turns 2k points into ~15 candidate clusters. Classification turns 15 candidates into the handful that are cones. By the time we're doing geometry checks, we're operating on individual bounding boxes — not point clouds.
This matters for latency. The entire pipeline runs in under 10 ms on a laptop CPU, well within the 100 ms budget of a 10 Hz LiDAR. No GPU required. The k-d tree construction is the heaviest single step; the classifier is just arithmetic. The simplicity of the geometric model — three threshold comparisons — is also what makes it robust. There are no learned weights to overfit, no hyperparameters that degrade in new environments, no dependency on lighting or color.
The trade-off is that purely geometric methods have a hard ceiling on what they can detect. This pipeline works reliably up to about 12–15 m. Beyond that, the sparse returns require either denser sensors, multi-frame fusion, or a classifier that can reason about incomplete evidence — the territory of learning-based methods.