6D Pose Estimation with YOLO + Realsense on Jetson
Objective
In this lesson we move from 3D pose estimation (only position) to 6D pose estimation, meaning:
- Position → X, Y, Z
- Orientation → quaternion (rotation in space)
The goal is to estimate not only where the capsule is, but also how it is oriented in space.
We do this by combining:
- YOLO → object detection (2D bounding box)
- RealSense → depth
- PCL (Point Cloud Library) → plane segmentation
- RANSAC → robust plane fitting
At the end of the lesson you will:
- understand the architecture of the pipeline
- understand how plane segmentation gives orientation
- know how to publish 6D poses in ROS2
- be able to adapt this pipeline to your own application
This is a critical step if you want to go from detection to manipulation.
Code Explanation — What / Why / How
WHAT — System architecture
We now have two ROS2 packages inside the same workspace:
- `inference_running` → YOLO inference
- `sixd_pose_pcl` → 6D pose estimation using the point cloud
You must clone:
```bash
cd ~/ros2_ws/src
git clone https://github.com/LearnRoboticsWROS/sixd_pose_pcl.git
```
Pipeline:
```
RGB Image
   ↓
YOLO (inference_running)
   ↓
2D Bounding Boxes (/detected_objects/detections_2d)
   ↓  + Depth Image + Camera Info
sixd_pose_pcl (this node)
   ↓
Plane segmentation (RANSAC)
   ↓
6D Pose (PoseArray)
   ↓
Visualization (MarkerArray)
```
Key idea:
We use the 2D bounding box to extract a 3D region, then we estimate a plane inside that region, and the plane normal gives us orientation.
WHY — Why this approach works
The capsule has a flat surface.
If you can:
- extract 3D points from that surface
- fit a plane
- compute the normal
Then:
- centroid → gives position
- normal → gives orientation
This is much more robust than trying to estimate orientation directly from RGB.
This approach is widely used in industrial robotics.
HOW — Node structure
The node:
class SixDPoseEstimationPublisher : public rclcpp::Node
It subscribes to:
- /detected_objects/detections_2d (YOLO bounding boxes)
- /camera/aligned_depth_to_color/image_raw
- /camera/aligned_depth_to_color/camera_info
It publishes:
- /detected_objects/poses_6d (PoseArray)
- /detected_objects/axes_markers (RViz visualization)
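A minimal sketch of how such a node can wire up its I/O, using the topic names above. This is illustrative, not the package's actual code: the `Detection2DArray` message type from `vision_msgs` and the empty callback bodies are assumptions.

```cpp
#include <rclcpp/rclcpp.hpp>
#include <sensor_msgs/msg/image.hpp>
#include <sensor_msgs/msg/camera_info.hpp>
#include <vision_msgs/msg/detection2_d_array.hpp>
#include <geometry_msgs/msg/pose_array.hpp>
#include <visualization_msgs/msg/marker_array.hpp>

class SixDPoseEstimationPublisher : public rclcpp::Node {
public:
  SixDPoseEstimationPublisher() : Node("sixd_pose_estimation_publisher") {
    sub_det_ = create_subscription<vision_msgs::msg::Detection2DArray>(
        "/detected_objects/detections_2d", 10,
        [this](const vision_msgs::msg::Detection2DArray::SharedPtr msg) {
          (void)msg;  // per-detection processing happens here
        });
    sub_depth_ = create_subscription<sensor_msgs::msg::Image>(
        "/camera/aligned_depth_to_color/image_raw", 10,
        [this](const sensor_msgs::msg::Image::SharedPtr msg) {
          (void)msg;  // cache the latest depth image
        });
    sub_info_ = create_subscription<sensor_msgs::msg::CameraInfo>(
        "/camera/aligned_depth_to_color/camera_info", 10,
        [this](const sensor_msgs::msg::CameraInfo::SharedPtr msg) {
          (void)msg;  // cache the intrinsics fx, fy, cx, cy
        });
    pub_pose6d_ = create_publisher<geometry_msgs::msg::PoseArray>(
        "/detected_objects/poses_6d", 10);
    pub_markers_ = create_publisher<visualization_msgs::msg::MarkerArray>(
        "/detected_objects/axes_markers", 10);
  }

private:
  rclcpp::Subscription<vision_msgs::msg::Detection2DArray>::SharedPtr sub_det_;
  rclcpp::Subscription<sensor_msgs::msg::Image>::SharedPtr sub_depth_;
  rclcpp::Subscription<sensor_msgs::msg::CameraInfo>::SharedPtr sub_info_;
  rclcpp::Publisher<geometry_msgs::msg::PoseArray>::SharedPtr pub_pose6d_;
  rclcpp::Publisher<visualization_msgs::msg::MarkerArray>::SharedPtr pub_markers_;
};
```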
Core Blocks Explained
1. Bounding box → ROI
```cpp
BBoxXYXY bbox_xyxy;
bbox_xyxy.x1 = bx - 0.5 * bw;
bbox_xyxy.y1 = by - 0.5 * bh;
bbox_xyxy.x2 = bx + 0.5 * bw;
bbox_xyxy.y2 = by + 0.5 * bh;
```
We convert YOLO's center + width/height output into corner pixel coordinates (x1, y1, x2, y2).
This defines the Region Of Interest (ROI) where we will extract 3D points.
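In practice the ROI also has to be clamped to the image bounds, otherwise boxes touching the image border produce invalid pixel indices. A hypothetical helper (not from the package):

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical: corner-form box as above, and the result in integer pixels.
struct RoiPx { int x1, y1, x2, y2; };

RoiPx clampRoi(double x1, double y1, double x2, double y2,
               int width, int height) {
  RoiPx r;
  r.x1 = std::clamp(static_cast<int>(std::floor(x1)), 0, width - 1);
  r.y1 = std::clamp(static_cast<int>(std::floor(y1)), 0, height - 1);
  r.x2 = std::clamp(static_cast<int>(std::ceil(x2)), 0, width - 1);
  r.y2 = std::clamp(static_cast<int>(std::ceil(y2)), 0, height - 1);
  return r;
}
```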
2. Build point cloud from depth
auto cloud = buildCloudFromROI(...)
Inside:
```cpp
const double Z = 0.001 * dmm;
const double X = (u - cx) * Z / fx;
const double Y = (v - cy) * Z / fy;
```
This converts:
pixel + depth → 3D point
So we build a point cloud only inside the bounding box.
Important:
- we filter the depth range
- we subsample using `grid_step`
- we reject invalid depth
This step is equivalent to creating a local point cloud of the object.
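To make this concrete, here is a hedged sketch of what a `buildCloudFromROI`-style function can look like. The signature and parameter names are illustrative, not the package's actual API; the depth image is assumed to be 16-bit millimeters, as the RealSense aligned depth topic publishes.

```cpp
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <opencv2/core.hpp>

// Illustrative sketch: back-project the depth pixels inside an ROI.
pcl::PointCloud<pcl::PointXYZ>::Ptr buildCloudFromROI(
    const cv::Mat& depth_mm,            // CV_16UC1, depth in millimeters
    int x1, int y1, int x2, int y2,     // ROI in pixel coordinates
    double fx, double fy, double cx, double cy,
    int grid_step, double z_min, double z_max) {
  pcl::PointCloud<pcl::PointXYZ>::Ptr cloud(new pcl::PointCloud<pcl::PointXYZ>);
  for (int v = y1; v <= y2; v += grid_step) {
    for (int u = x1; u <= x2; u += grid_step) {
      const uint16_t dmm = depth_mm.at<uint16_t>(v, u);
      if (dmm == 0) continue;                    // reject invalid depth
      const double Z = 0.001 * dmm;              // mm -> m
      if (Z < z_min || Z > z_max) continue;      // filter depth range
      pcl::PointXYZ p;                           // pinhole back-projection
      p.x = static_cast<float>((u - cx) * Z / fx);
      p.y = static_cast<float>((v - cy) * Z / fy);
      p.z = static_cast<float>(Z);
      cloud->points.push_back(p);
    }
  }
  cloud->width = cloud->points.size();
  cloud->height = 1;
  cloud->is_dense = false;
  return cloud;
}
```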
3. Adaptive sampling
```cpp
if (cloud->size() < min_pts_adaptive) {
  int denser = grid_step / 2;
  cloud = buildCloudFromROI(...);
}
```
If we don’t have enough points:
→ we sample more densely
Why?
Because RANSAC needs enough points to fit a plane reliably.
4. Plane segmentation (RANSAC)
bool ok = fitPlaneRansac(...)
RANSAC tries to fit a plane:
ax + by + cz + d = 0
Output:
coeff->values[0..3] → plane coefficients
If RANSAC fails:
estimateNormalPCA(...)
Fallback using PCA.
Why PCA?
Because even if RANSAC fails, PCA can estimate the dominant direction.
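A sketch of both steps using standard PCL APIs. Parameter values such as the 5 mm inlier threshold are illustrative, not the package's actual settings:

```cpp
#include <pcl/ModelCoefficients.h>
#include <pcl/PointIndices.h>
#include <pcl/common/pca.h>
#include <pcl/point_types.h>
#include <pcl/segmentation/sac_segmentation.h>

// Illustrative RANSAC plane fit. Caller allocates coeff and inliers, e.g.:
//   pcl::ModelCoefficients::Ptr coeff(new pcl::ModelCoefficients);
//   pcl::PointIndices::Ptr inliers(new pcl::PointIndices);
bool fitPlaneRansac(const pcl::PointCloud<pcl::PointXYZ>::Ptr& cloud,
                    pcl::ModelCoefficients::Ptr& coeff,
                    pcl::PointIndices::Ptr& inliers) {
  pcl::SACSegmentation<pcl::PointXYZ> seg;
  seg.setOptimizeCoefficients(true);
  seg.setModelType(pcl::SACMODEL_PLANE);
  seg.setMethodType(pcl::SAC_RANSAC);
  seg.setDistanceThreshold(0.005);   // 5 mm inlier band, tune for your sensor
  seg.setMaxIterations(200);
  seg.setInputCloud(cloud);
  seg.segment(*inliers, *coeff);     // coeff->values = {a, b, c, d}
  return !inliers->indices.empty();
}

// PCA fallback: the eigenvector with the smallest eigenvalue spans the
// direction of least variance, i.e. the plane normal.
Eigen::Vector3f estimateNormalPCA(
    const pcl::PointCloud<pcl::PointXYZ>::Ptr& cloud) {
  pcl::PCA<pcl::PointXYZ> pca;
  pca.setInputCloud(cloud);
  return pca.getEigenVectors().col(2);  // columns sorted by decreasing eigenvalue
}
```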
5. Normal direction
if (n.z() > 0.0) n = -n;
This is critical.
We enforce that the normal points toward the camera.
Why?
Because plane normals have two possible directions:
n or -n
Without this constraint, orientation would flip randomly.
6. Build reference frame
```
Z = normal
X = projection of camera X on plane
Y = Z × X
```
```cpp
R.col(0) = x_axis;
R.col(1) = y_axis;
R.col(2) = z_axis;
```
This creates a full 3D coordinate frame.
Then:
Eigen::Quaterniond q_new(R);
This gives orientation.
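Putting steps 5 and 6 together, a hedged sketch of the frame construction with Eigen (the function name is hypothetical):

```cpp
#include <Eigen/Dense>

// Illustrative: build a right-handed frame whose Z axis is the plane
// normal, using Gram-Schmidt against the camera X axis.
Eigen::Quaterniond frameFromNormal(const Eigen::Vector3d& n_in) {
  Eigen::Vector3d z_axis = n_in.normalized();
  if (z_axis.z() > 0.0) z_axis = -z_axis;        // point toward the camera

  // Project the camera X axis onto the plane to get a stable X.
  Eigen::Vector3d x_ref(1.0, 0.0, 0.0);
  Eigen::Vector3d x_axis = x_ref - z_axis * z_axis.dot(x_ref);
  if (x_axis.norm() < 1e-6)                      // normal parallel to camera X
    x_axis = Eigen::Vector3d(0.0, 1.0, 0.0);
  x_axis.normalize();

  Eigen::Vector3d y_axis = z_axis.cross(x_axis); // completes the frame (Z × X = Y)

  Eigen::Matrix3d R;
  R.col(0) = x_axis;
  R.col(1) = y_axis;
  R.col(2) = z_axis;
  return Eigen::Quaterniond(R);
}
```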
7. Pose creation
```
pose_raw.position = centroid
pose_raw.orientation = quaternion
```
So now we have a full 6D pose.
8. Stabilization
Same philosophy as the previous lesson, but extended to rotation. The relevant parameters:
- pos_alpha_
- rot_alpha_
- z_jump_thresh_
- hold_last_secs_
Position smoothing:
out.position = alpha * new + (1-alpha) * old
Rotation smoothing:
q_out = q_prev.slerp(rot_alpha_, q_meas)
SLERP = smooth interpolation between quaternions.
This is essential to avoid jitter in orientation.
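A sketch of one smoothing step combining both filters. The function name and signature are illustrative; `pos_alpha` and `rot_alpha` are assumed to live in [0, 1], where higher means trusting the new measurement more.

```cpp
#include <Eigen/Geometry>
#include <geometry_msgs/msg/pose.hpp>

geometry_msgs::msg::Pose smoothPose(const geometry_msgs::msg::Pose& prev,
                                    const geometry_msgs::msg::Pose& meas,
                                    double pos_alpha, double rot_alpha) {
  geometry_msgs::msg::Pose out;
  // Exponential moving average on position.
  out.position.x = pos_alpha * meas.position.x + (1.0 - pos_alpha) * prev.position.x;
  out.position.y = pos_alpha * meas.position.y + (1.0 - pos_alpha) * prev.position.y;
  out.position.z = pos_alpha * meas.position.z + (1.0 - pos_alpha) * prev.position.z;

  // SLERP on orientation.
  Eigen::Quaterniond q_prev(prev.orientation.w, prev.orientation.x,
                            prev.orientation.y, prev.orientation.z);
  Eigen::Quaterniond q_meas(meas.orientation.w, meas.orientation.x,
                            meas.orientation.y, meas.orientation.z);
  if (q_prev.dot(q_meas) < 0.0)                 // take the shortest arc
    q_meas.coeffs() = -q_meas.coeffs();
  const Eigen::Quaterniond q_out = q_prev.slerp(rot_alpha, q_meas);
  out.orientation.w = q_out.w();
  out.orientation.x = q_out.x();
  out.orientation.y = q_out.y();
  out.orientation.z = q_out.z();
  return out;
}
```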
9. Tracking
match_or_create_track_unique(...)
We use IoU to track objects across frames.
Each object gets a stable ID.
Why?
Because stabilization must be applied per object.
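IoU (Intersection over Union) measures how much two boxes overlap. A minimal sketch, reusing the corner-form `BBoxXYXY` layout from above (the double fields are an assumption):

```cpp
#include <algorithm>

// Corner-form box as used above: (x1, y1) top-left, (x2, y2) bottom-right.
struct BBoxXYXY { double x1, y1, x2, y2; };

double iou(const BBoxXYXY& a, const BBoxXYXY& b) {
  const double iw = std::max(0.0, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
  const double ih = std::max(0.0, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
  const double inter = iw * ih;                       // overlap area
  const double area_a = (a.x2 - a.x1) * (a.y2 - a.y1);
  const double area_b = (b.x2 - b.x1) * (b.y2 - b.y1);
  return inter / (area_a + area_b - inter + 1e-9);    // in [0, 1]
}
```

A detection is assigned to the existing track with the highest IoU above a threshold; otherwise a new track (and ID) is created.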
10. Publishing PoseArray
pub_pose6d_->publish(pose_arr);
Each detection → one Pose
So:
PoseArray = multiple objects
Each pose contains:
- position (X, Y, Z)
- orientation (quaternion)
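A hedged sketch of the publishing step (the helper signature and the frame_id are assumptions, not the package's actual values):

```cpp
#include <vector>
#include <rclcpp/rclcpp.hpp>
#include <geometry_msgs/msg/pose_array.hpp>

// Illustrative: collect one stabilized Pose per tracked object and publish.
void publishPoses(
    const rclcpp::Publisher<geometry_msgs::msg::PoseArray>::SharedPtr& pub,
    const std::vector<geometry_msgs::msg::Pose>& poses,
    const rclcpp::Time& stamp) {
  geometry_msgs::msg::PoseArray pose_arr;
  pose_arr.header.stamp = stamp;
  pose_arr.header.frame_id = "camera_color_optical_frame";  // assumed frame
  pose_arr.poses = poses;  // one Pose (position + quaternion) per object
  pub->publish(pose_arr);
}
```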
11. Visualization markers
appendOrientedAxesMarkers(...)
We publish 3 arrows:
- red → X
- green → Y
- blue → Z (normal)
This is extremely useful to visually verify orientation.
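RViz ARROW markers point along their local +X axis, so a helper in the spirit of `appendOrientedAxesMarkers` draws each axis by rotating the object pose accordingly. An illustrative sketch for a single arrow (frame_id and scales are assumptions):

```cpp
#include <geometry_msgs/msg/pose.hpp>
#include <visualization_msgs/msg/marker.hpp>

visualization_msgs::msg::Marker makeAxisArrow(
    const geometry_msgs::msg::Pose& pose, int id,
    float r, float g, float b) {
  visualization_msgs::msg::Marker m;
  m.header.frame_id = "camera_color_optical_frame";  // assumed frame
  m.ns = "axes";
  m.id = id;
  m.type = visualization_msgs::msg::Marker::ARROW;
  m.action = visualization_msgs::msg::Marker::ADD;
  m.pose = pose;        // arrow extends along this pose's local +X axis
  m.scale.x = 0.08;     // arrow length (m)
  m.scale.y = 0.008;    // arrow width
  m.scale.z = 0.008;    // arrow height
  m.color.r = r; m.color.g = g; m.color.b = b; m.color.a = 1.0f;
  return m;
}
```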
Demo 1 — Full pipeline execution
Step 1 — Run container with PCL support
Use your custom Docker image:
```bash
docker run -it --rm \
  --runtime=nvidia \
  --net=host \
  --ipc=host \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e ROS_DOMAIN_ID=10 \
  -v ~/models:/models \
  -v ~/runs:/runs \
  -v ~/ros2_ws:/workspace/ros2_ws \
  learnroboticswros/ros2-ultralytics-trt:pcl-snapshot
```
This image includes:
- ROS2
- Ultralytics
- OpenCV
- PCL
Without PCL, this node cannot run.
Step 2 — Run YOLO node
```bash
ros2 run inference_running three_pose_estimation_bbox_publisher --ros-args \
  -p output_topic:=/camera/color/pose_estimation_stable \
  -p detection_topic:=/detected_objects/detections_2d \
  -p model_path:=/models/capsules/best.pt
```
This publishes:
/detected_objects/detections_2d
Step 3 — Run 6D node
```bash
ros2 run sixd_pose_pcl sixd_pose_estimation_publisher --ros-args \
  -p detections_topic:=/detected_objects/detections_2d \
  -p depth_topic:=/camera/aligned_depth_to_color/image_raw \
  -p cam_info_topic:=/camera/aligned_depth_to_color/camera_info \
  -p pose6d_topic:=/detected_objects/poses_6d \
  -p markers_topic:=/detected_objects/axes_markers
```
Step 4 — Visualize
Use RViz or:
ros2 topic echo /detected_objects/poses_6d
You will see:
position + quaternion
In RViz you will see axes:
- orientation changes as capsule rotates
- Z axis aligns with plane normal
Key observation:
The normal always points toward the camera, which confirms that the orientation is consistent.
Demo 2 — Real setup and constraints
Now we move to the real setup.
Key requirements:
1. Camera must be fixed
If the camera moves:
→ pose estimation becomes inconsistent
So the camera must be rigidly mounted relative to base_link.
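One common way to express that fixed mounting in ROS2 is a static transform from base_link to the camera frame; for example (all numbers are placeholders for your actual calibration):

```bash
ros2 run tf2_ros static_transform_publisher \
  --x 0.5 --y 0.0 --z 0.8 \
  --roll 0.0 --pitch 1.57 --yaw 0.0 \
  --frame-id base_link --child-frame-id camera_link
```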
2. Correct working distance
If you move too close:
- depth becomes invalid
- plane segmentation fails
If too far:
- resolution decreases
So you must find a working range.
3. Lighting matters
Industrial environments introduce:
- reflections
- shadows
- changing illumination
These affect:
- YOLO detection
- depth quality
4. Orientation behavior
When you rotate the capsule:
- the plane normal changes
- the axes rotate accordingly
This is the key validation:
if axes rotate correctly → 6D pose estimation is working
Key Takeaways
You extended your perception pipeline from 3D to 6D.
You learned how to:
- use YOLO for 2D detection
- extract ROI from bounding box
- build a point cloud from depth
- use RANSAC to fit a plane
- extract normal and centroid
- build a full 3D orientation
- stabilize position and rotation
- publish PoseArray in ROS2
- visualize orientation with markers
Most important insight:
6D pose estimation is not magic. It is geometry + filtering + good assumptions about the object.
You also learned how the architecture is modular:
- detection node
- 6D estimation node
- downstream robot node
This separation is fundamental if you want to scale your system.
Now you have a perception pipeline that can be directly connected to:
- inverse kinematics
- motion planning
- pick-and-place execution
In the next step, you can transform these poses into the robot frame and close the loop with manipulation.