6D Pose Estimation with YOLO + Realsense on Jetson

Objective

In this lesson we move from 3D pose estimation (only position) to 6D pose estimation, meaning:

  • Position → X, Y, Z
  • Orientation → quaternion (rotation in space)

The goal is to estimate not only where the capsule is, but also how it is oriented in space.

We do this by combining:

  • YOLO → object detection (2D bounding box)
  • RealSense → depth
  • PCL (Point Cloud Library) → plane segmentation
  • RANSAC → robust plane fitting

At the end of the lesson you will:

  • understand the architecture of the pipeline
  • understand how plane segmentation gives orientation
  • know how to publish 6D poses in ROS2
  • be able to adapt this pipeline to your own application

This is a critical step if you want to go from detection to manipulation.

Code Explanation — What / Why / How

WHAT — System architecture

We now have two ROS2 packages inside the same workspace:

  • inference_running → YOLO inference
  • sixd_pose_pcl → 6D pose estimation using point cloud

You must clone:

cd ~/ros2_ws/src
git clone https://github.com/LearnRoboticsWROS/sixd_pose_pcl.git

Pipeline:

RGB Image
   ↓
YOLO (inference_running)
   ↓
2D Bounding Boxes (/detected_objects/detections_2d)
   ↓
+ Depth Image + Camera Info
   ↓
sixd_pose_pcl (this node)
   ↓
Plane segmentation (RANSAC)
   ↓
6D Pose (PoseArray)   
   ↓
Visualization (MarkerArray)


Key idea:

We use the 2D bounding box to extract a 3D region, then we estimate a plane inside that region, and the plane normal gives us orientation.

WHY — Why this approach works

The capsule has a flat surface.

If you can:

  1. extract 3D points from that surface
  2. fit a plane
  3. compute the normal

Then:

  • centroid → gives position
  • normal → gives orientation

This is much more robust than trying to estimate orientation directly from RGB.

This approach is widely used in industrial robotics.

HOW — Node structure

The node:

class SixDPoseEstimationPublisher : public rclcpp::Node

It subscribes to:

/detected_objects/detections_2d   (YOLO bounding boxes)
/camera/aligned_depth_to_color/image_raw
/camera/aligned_depth_to_color/camera_info

It publishes:

/detected_objects/poses_6d        (PoseArray)
/detected_objects/axes_markers    (RViz visualization)

Core Blocks Explained

1. Bounding box → ROI

BBoxXYXY bbox_xyxy;
bbox_xyxy.x1 = bx - 0.5 * bw;
bbox_xyxy.y1 = by - 0.5 * bh;
bbox_xyxy.x2 = bx + 0.5 * bw;
bbox_xyxy.y2 = by + 0.5 * bh;

We convert the YOLO output (box center, width, and height) into corner pixel coordinates (x1, y1, x2, y2).

This defines the Region Of Interest (ROI) where we will extract 3D points.

2. Build point cloud from depth

auto cloud = buildCloudFromROI(...)

Inside:

const double Z = 0.001 * dmm;
const double X = (u - cx) * Z / fx;
const double Y = (v - cy) * Z / fy;

This converts:

pixel + depth → 3D point

So we build a point cloud only inside the bounding box.

Important:

  • we filter depth range
  • we subsample using grid_step
  • we reject invalid depth

This step is equivalent to creating a local point cloud of the object.
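
For reference, here is a minimal sketch of what such a buildCloudFromROI function can look like. The parameter names and default values (grid_step, z_min, z_max) are illustrative, not copied from the repository.

#include <cstdint>
#include <opencv2/core.hpp>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>

// Sketch: back-project every grid_step-th pixel of the ROI into 3D.
// Assumes the ROI is already clamped to the image bounds.
pcl::PointCloud<pcl::PointXYZ>::Ptr buildCloudFromROISketch(
    const cv::Mat& depth,                       // 16UC1 depth image in millimetres
    int x1, int y1, int x2, int y2,             // ROI corners in pixels
    double fx, double fy, double cx, double cy, // camera intrinsics
    int grid_step = 4, double z_min = 0.15, double z_max = 2.0)
{
  pcl::PointCloud<pcl::PointXYZ>::Ptr cloud(new pcl::PointCloud<pcl::PointXYZ>);
  for (int v = y1; v < y2; v += grid_step) {       // subsample rows
    for (int u = x1; u < x2; u += grid_step) {     // subsample columns
      const uint16_t dmm = depth.at<uint16_t>(v, u);
      if (dmm == 0) continue;                      // reject invalid depth
      const double Z = 0.001 * dmm;                // mm -> m
      if (Z < z_min || Z > z_max) continue;        // depth range filter
      const double X = (u - cx) * Z / fx;          // pinhole back-projection
      const double Y = (v - cy) * Z / fy;
      cloud->points.emplace_back(X, Y, Z);
    }
  }
  return cloud;
}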

3. Adaptive sampling

if (cloud->size() < min_pts_adaptive) {
    int denser = grid_step / 2;
    cloud = buildCloudFromROI(...);
}

If we don’t have enough points:

→ we sample more densely

Why?

Because RANSAC needs enough points to fit a plane reliably.

4. Plane segmentation (RANSAC)

bool ok = fitPlaneRansac(...)

RANSAC tries to fit a plane:

ax + by + cz + d = 0

Output:

coeff->values[0..3] → plane coefficients

If RANSAC fails:

estimateNormalPCA(...)

Fallback using PCA.

Why PCA?

Because even if RANSAC fails, PCA can estimate the dominant direction.
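
For context, a plane fit like this follows the standard PCL SACSegmentation pattern. The sketch below is illustrative; the distance threshold and iteration count are assumptions, not the values used in sixd_pose_pcl.

#include <pcl/ModelCoefficients.h>
#include <pcl/PointIndices.h>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/sample_consensus/method_types.h>
#include <pcl/sample_consensus/model_types.h>
#include <pcl/segmentation/sac_segmentation.h>

// Sketch: fit ax + by + cz + d = 0 with RANSAC, return true if inliers were found.
bool fitPlaneRansacSketch(const pcl::PointCloud<pcl::PointXYZ>::Ptr& cloud,
                          pcl::ModelCoefficients::Ptr& coeff,
                          pcl::PointIndices::Ptr& inliers)
{
  pcl::SACSegmentation<pcl::PointXYZ> seg;
  seg.setOptimizeCoefficients(true);
  seg.setModelType(pcl::SACMODEL_PLANE);
  seg.setMethodType(pcl::SAC_RANSAC);
  seg.setDistanceThreshold(0.005);   // 5 mm inlier threshold (illustrative)
  seg.setMaxIterations(200);         // illustrative
  seg.setInputCloud(cloud);
  seg.segment(*inliers, *coeff);     // coeff->values = {a, b, c, d}
  return !inliers->indices.empty();
}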

5. Normal direction

if (n.z() > 0.0) n = -n;

This is critical.

We enforce that the normal points toward the camera.

Why?

Because plane normals have two possible directions:

n or -n

Without this constraint, orientation would flip randomly.

6. Build reference frame

Z = normal
X = projection of camera X on plane
Y = Z × X

R.col(0) = x_axis;
R.col(1) = y_axis;
R.col(2) = z_axis;

This creates a full 3D coordinate frame.

Then:

Eigen::Quaterniond q_new(R);

This gives orientation.
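
Putting the last two steps together, here is a sketch of how the (already flipped) unit normal can be turned into a quaternion with Eigen. The function name and the degenerate-case fallback are assumptions:

#include <Eigen/Dense>

Eigen::Quaterniond orientationFromNormal(const Eigen::Vector3d& n)
{
  const Eigen::Vector3d z_axis = n.normalized();           // Z = plane normal
  const Eigen::Vector3d x_cam = Eigen::Vector3d::UnitX();  // camera X axis
  // Project camera X onto the plane: remove its component along the normal.
  Eigen::Vector3d x_axis = x_cam - x_cam.dot(z_axis) * z_axis;
  if (x_axis.norm() < 1e-6)
    x_axis = z_axis.unitOrthogonal();                      // fallback if normal ~ camera X
  x_axis.normalize();
  const Eigen::Vector3d y_axis = z_axis.cross(x_axis);     // Y = Z x X (right-handed)
  Eigen::Matrix3d R;
  R.col(0) = x_axis;
  R.col(1) = y_axis;
  R.col(2) = z_axis;
  return Eigen::Quaterniond(R);                            // orientation as quaternion
}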

7. Pose creation

pose_raw.position = centroid
pose_raw.orientation = quaternion

So now we have a full 6D pose.

8. Stabilization

Same philosophy as the previous lesson, but extended to rotation. The relevant parameters are:

  • pos_alpha_
  • rot_alpha_
  • z_jump_thresh_
  • hold_last_secs_

Position smoothing:

out.position = alpha * new + (1-alpha) * old

Rotation smoothing:

q_out = q_prev.slerp(rot_alpha_, q_meas)

SLERP = smooth interpolation between quaternions.

This is essential to avoid jitter in orientation.
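
A minimal sketch of this smoothing step with Eigen, assuming a simple struct that holds the filtered state (names are illustrative):

#include <Eigen/Dense>

struct FilteredPose {
  Eigen::Vector3d position;
  Eigen::Quaterniond orientation;
};

FilteredPose smoothPose(const FilteredPose& prev, const FilteredPose& meas,
                        double pos_alpha, double rot_alpha)
{
  FilteredPose out;
  // Exponential moving average on position.
  out.position = pos_alpha * meas.position + (1.0 - pos_alpha) * prev.position;
  // SLERP between previous and measured orientation.
  out.orientation = prev.orientation.slerp(rot_alpha, meas.orientation);
  return out;
}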

9. Tracking

match_or_create_track_unique(...)

We use IoU (Intersection over Union) between bounding boxes to track objects across frames.

Each object gets a stable ID.

Why?

Because stabilization must be applied per object.
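
The IoU itself is just a ratio of areas. A sketch, mirroring the BBoxXYXY struct used earlier (the actual matching logic in match_or_create_track_unique is more involved):

#include <algorithm>

struct BBoxXYXY { double x1, y1, x2, y2; };

double iou(const BBoxXYXY& a, const BBoxXYXY& b)
{
  const double ix1 = std::max(a.x1, b.x1);
  const double iy1 = std::max(a.y1, b.y1);
  const double ix2 = std::min(a.x2, b.x2);
  const double iy2 = std::min(a.y2, b.y2);
  const double inter = std::max(0.0, ix2 - ix1) * std::max(0.0, iy2 - iy1);
  const double area_a = (a.x2 - a.x1) * (a.y2 - a.y1);
  const double area_b = (b.x2 - b.x1) * (b.y2 - b.y1);
  const double uni = area_a + area_b - inter;
  return uni > 0.0 ? inter / uni : 0.0;   // intersection over union in [0, 1]
}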

10. Publishing PoseArray

pub_pose6d_->publish(pose_arr);

Each detection → one Pose

So:

PoseArray = multiple objects

Each pose contains:

  • position (X, Y, Z)
  • orientation (quaternion)
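
Here is a sketch of how such a PoseArray can be assembled, assuming a hypothetical Detection6D container for the stabilized results:

#include <string>
#include <vector>
#include <Eigen/Dense>
#include <geometry_msgs/msg/pose_array.hpp>
#include <rclcpp/rclcpp.hpp>

// Hypothetical container for one stabilized detection.
struct Detection6D {
  Eigen::Vector3d centroid;
  Eigen::Quaterniond q;
};

geometry_msgs::msg::PoseArray makePoseArray(const std::vector<Detection6D>& dets,
                                            const rclcpp::Time& stamp,
                                            const std::string& frame_id)
{
  geometry_msgs::msg::PoseArray arr;
  arr.header.stamp = stamp;
  arr.header.frame_id = frame_id;            // e.g. the camera optical frame
  for (const auto& d : dets) {               // one Pose per tracked object
    geometry_msgs::msg::Pose p;
    p.position.x = d.centroid.x();
    p.position.y = d.centroid.y();
    p.position.z = d.centroid.z();
    p.orientation.x = d.q.x();
    p.orientation.y = d.q.y();
    p.orientation.z = d.q.z();
    p.orientation.w = d.q.w();
    arr.poses.push_back(p);
  }
  return arr;
}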

11. Visualization markers

appendOrientedAxesMarkers(...)

We publish 3 arrows:

  • red → X
  • green → Y
  • blue → Z (normal)

This is extremely useful to visually verify orientation.
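
As a sketch, each axis can be published as an ARROW marker. An ARROW marker with a pose points along its local X axis, so the green (Y) and blue (Z) arrows are obtained by rotating the pose by 90° before calling a helper like this. Names and scales are illustrative:

#include <string>
#include <geometry_msgs/msg/pose.hpp>
#include <visualization_msgs/msg/marker.hpp>

visualization_msgs::msg::Marker makeAxisArrow(const geometry_msgs::msg::Pose& pose,
                                              int id, float r, float g, float b,
                                              const std::string& frame_id)
{
  visualization_msgs::msg::Marker m;
  m.header.frame_id = frame_id;
  m.ns = "axes";
  m.id = id;
  m.type = visualization_msgs::msg::Marker::ARROW;
  m.action = visualization_msgs::msg::Marker::ADD;
  m.pose = pose;        // arrow points along the pose's local X axis
  m.scale.x = 0.05;     // arrow length (m)
  m.scale.y = 0.005;    // shaft diameter (m)
  m.scale.z = 0.005;    // head diameter (m)
  m.color.r = r; m.color.g = g; m.color.b = b;
  m.color.a = 1.0;
  return m;
}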

Demo 1 — Full pipeline execution

Step 1 — Run container with PCL support

Use your custom Docker image:

docker run -it --rm \
  --runtime=nvidia \
  --net=host \
  --ipc=host \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e ROS_DOMAIN_ID=10 \
  -v ~/models:/models \
  -v ~/runs:/runs \
  -v ~/ros2_ws:/workspace/ros2_ws \
  learnroboticswros/ros2-ultralytics-trt:pcl-snapshot

This image includes:

  • ROS2
  • Ultralytics
  • OpenCV
  • PCL

Without PCL, this node cannot run.

Step 2 — Run YOLO node

ros2 run inference_running three_pose_estimation_bbox_publisher --ros-args \
  -p output_topic:=/camera/color/pose_estimation_stable \
  -p detection_topic:=/detected_objects/detections_2d \
  -p model_path:=/models/capsules/best.pt

This publishes:

/detected_objects/detections_2d

Step 3 — Run 6D node

ros2 run sixd_pose_pcl sixd_pose_estimation_publisher --ros-args \
  -p detections_topic:=/detected_objects/detections_2d \
  -p depth_topic:=/camera/aligned_depth_to_color/image_raw \
  -p cam_info_topic:=/camera/aligned_depth_to_color/camera_info \
  -p pose6d_topic:=/detected_objects/poses_6d \
  -p markers_topic:=/detected_objects/axes_markers

Step 4 — Visualize

Use RViz or:

ros2 topic echo /detected_objects/poses_6d

You will see:

position + quaternion


In RViz you will see axes:

  • orientation changes as capsule rotates
  • Z axis aligns with plane normal

Key observation:

The normal always points toward the camera → this confirms the orientation is consistent

Demo 2 — Real setup and constraints

Now we move to the real setup.

Key requirements:

1. Camera must be fixed

If the camera moves:

→ pose estimation becomes inconsistent

So:

camera must be rigidly mounted relative to base_link
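
One common way to make that mounting explicit is a static transform from base_link to the camera frame, which downstream nodes can use to convert the poses into the robot frame. A minimal sketch with tf2_ros (the translation and rotation values are placeholders; replace them with your measured calibration):

#include <memory>
#include <geometry_msgs/msg/transform_stamped.hpp>
#include <rclcpp/rclcpp.hpp>
#include <tf2_ros/static_transform_broadcaster.h>

int main(int argc, char** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("camera_static_tf");
  tf2_ros::StaticTransformBroadcaster broadcaster(node);

  geometry_msgs::msg::TransformStamped tf;
  tf.header.stamp = node->now();
  tf.header.frame_id = "base_link";
  tf.child_frame_id = "camera_link";
  tf.transform.translation.x = 0.40;   // placeholder: measured offset (m)
  tf.transform.translation.y = 0.0;
  tf.transform.translation.z = 0.60;
  tf.transform.rotation.w = 1.0;       // placeholder: identity rotation
  broadcaster.sendTransform(tf);       // published once, available to all nodes

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}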

2. Correct working distance

If you move too close:

  • depth becomes invalid
  • plane segmentation fails

If too far:

  • resolution decreases

So you must find a working range.

3. Lighting matters

Industrial environments introduce:

  • reflections
  • shadows
  • changing illumination

These affect:

  • YOLO detection
  • depth quality

4. Orientation behavior

When you rotate the capsule:

  • the plane normal changes
  • the axes rotate accordingly

This is the key validation:

if axes rotate correctly → 6D pose estimation is working

Key Takeaways

You extended your perception pipeline from 3D to 6D.

You learned how to:

  • use YOLO for 2D detection
  • extract ROI from bounding box
  • build a point cloud from depth
  • use RANSAC to fit a plane
  • extract normal and centroid
  • build a full 3D orientation
  • stabilize position and rotation
  • publish PoseArray in ROS2
  • visualize orientation with markers

Most important insight:

6D pose estimation is not magic. It is geometry + filtering + good assumptions about the object.

You also learned how the architecture is modular:

  • detection node
  • 6D estimation node
  • downstream robot node

This separation is fundamental if you want to scale your system.

Now you have a perception pipeline that can be directly connected to:

  • inverse kinematics
  • motion planning
  • pick-and-place execution

In the next step, you can transform these poses into the robot frame and close the loop with manipulation.
