Anovate.Ai · Swap Robotics · 2025 – Present

Robotic Perception System for Solar Panel Manipulation

Designed and deployed a full perception system for Swap Robotics’ autonomous platform, enabling reliable solar panel picking and placement in harsh outdoor environments.

The system operates under real-world constraints such as strong sunlight reflections, weather variability, and multi-camera geometry inconsistencies, where perception accuracy directly impacts manipulation success.

Core Stack

ROS2 · ZED Stereo · C++ · CUDA · TensorRT · Jetson · 3D Perception
[Video: Pick / Place Runtime]

System Goals

Robustly detect solar panel structures and clamp endpoints under variable lighting conditions.

Estimate precise, stable 3D positions at the gripper level to enable reliable pick-and-place execution.

Run efficiently on Jetson-class edge hardware without sacrificing manipulation reliability.

Key Decisions

Selected Mask-RT-DETR after evaluating multiple instance segmentation models, balancing accuracy, inference latency, and commercial usability constraints through open-source licensing.

Started with a Python pipeline, then re-implemented the runtime entirely in C++ once edge latency constraints made the initial path infeasible for real-time deployment.

Used AWS EC2 (A10/A100) for training while deployment and debugging were performed directly on Jetson via SSH.

Real-World Constraints

Strong sunlight reflections degraded stereo depth on solar panel surfaces.

Point clouds became unstable or incomplete in outdoor failure cases.

The manipulation task required custom clamp-aware geometry rather than generic 3D detection.

System Architecture

The perception system is built as a modular ROS2 pipeline designed for flexibility and real-time execution. It uses six ZED stereo cameras mounted on a dynamic gripper, hardware-synchronized capture for consistent multi-view fusion, and dedicated ROS2 nodes for camera configuration, task switching, perception processing, and custom message handling. The system supports dynamic switching between tasks and camera configurations while the robot is in motion.
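
As a simplified illustration of the task-switching idea, here is a minimal Python sketch (the deployed runtime is C++ and ROS2-based; all names, tasks, and camera parameters below are hypothetical stand-ins, not the production configuration):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CameraConfig:
    camera_id: str
    resolution: tuple   # (width, height)
    depth_mode: str     # e.g. "NEURAL", "ULTRA"

# Hypothetical mapping from manipulation task to the camera subset it uses.
TASK_CONFIGS = {
    "pick":  [CameraConfig("zed_front_left", (1280, 720), "NEURAL"),
              CameraConfig("zed_front_right", (1280, 720), "NEURAL")],
    "place": [CameraConfig("zed_down_left", (1920, 1080), "ULTRA"),
              CameraConfig("zed_down_right", (1920, 1080), "ULTRA")],
}

class TaskSwitcher:
    """Tracks the active task and returns the camera set to (re)configure."""
    def __init__(self):
        self.active_task = None

    def switch(self, task: str):
        if task not in TASK_CONFIGS:
            raise ValueError(f"unknown task: {task}")
        self.active_task = task
        return TASK_CONFIGS[task]

switcher = TaskSwitcher()
cams = switcher.switch("pick")
```

In the real system this role is played by dedicated ROS2 nodes, which lets camera settings and the active task change while the robot is in motion.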

[Video: 3D Placement]

2D Perception (Instance Segmentation)

The pipeline begins with instance segmentation to isolate structures such as clamps and panel regions. Multiple instance segmentation models were evaluated before selecting Mask-RT-DETR for its accuracy, inference behavior, and commercially usable open-source license. The model was trained on a custom dataset through roughly eight iterative cycles of collection, augmentation, and fine-tuning. Latency constraints on edge hardware made the initial Python pipeline infeasible for real-time deployment, leading to a full reimplementation in C++.
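
A typical post-processing step after segmentation is reducing each instance mask to usable 2D coordinates. The sketch below is illustrative Python (not the C++ runtime): it takes a binary mask, here a plain nested list standing in for a model output tensor, and computes the 2D centroid of the detected region for downstream 3D lookup.

```python
def mask_centroid(mask):
    """Return the (u, v) centroid of all nonzero pixels, or None if empty."""
    xs, ys = [], []
    for v, row in enumerate(mask):
        for u, val in enumerate(row):
            if val:
                xs.append(u)
                ys.append(v)
    if not xs:
        return None
    return (sum(xs) / len(xs), sum(ys) / len(ys))

# Toy 4x4 mask with a 2x2 clamp-like blob in the lower-right corner.
mask = [[0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 1]]
print(mask_centroid(mask))  # → (2.5, 2.5)
```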

[Videos: 2D Pick / Place · Snow Pick]

Edge Deployment & Optimization

The full system was deployed on an NVIDIA Jetson Orin Nano, which required aggressive optimization. The model was converted to TensorRT, custom CUDA kernels were implemented for preprocessing and postprocessing, and CPU usage was reduced through parallel execution where possible, yielding roughly a 15x inference speedup over the baseline path. That optimization was critical to the real-time performance required for manipulation tasks. Training ran on AWS EC2 with A10 and A100 GPUs, while deployment and runtime debugging were handled directly on the Jetson over SSH.
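
To make the preprocessing step concrete, here is the kind of math such a kernel computes, sketched in plain Python: per-channel mean/std normalization of pixels before inference. On the Jetson this runs fused on the GPU; the constants below are the common ImageNet values and are an assumption, not the deployed configuration.

```python
def normalize(pixels, mean, std):
    """pixels: list of (r, g, b) values in [0, 255]; returns normalized floats."""
    out = []
    for px in pixels:
        # Scale to [0, 1], then shift and scale per channel.
        out.append(tuple((px[c] / 255.0 - mean[c]) / std[c] for c in range(3)))
    return out

# ImageNet-style normalization constants (assumed for illustration).
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)

norm = normalize([(255, 0, 128)], MEAN, STD)
```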

3D Perception Pipeline

A custom 3D reasoning pipeline was designed to bridge the gap between noisy stereo depth and precise manipulation requirements. It fuses point clouds from multiple cameras before processing, extracts relevant regions through segmentation masks, filters noise and outliers, and applies classical geometric processing in Open3D to estimate oriented bounding boxes. Unlike generic 3D detection approaches, the pipeline was task-specific, focusing on clamp localization, midpoint estimation between clamps, and alignment constraints required for manipulation.
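
The oriented-bounding-box idea can be shown in a simplified 2D form (the real pipeline uses Open3D in 3D): PCA on the points gives the box orientation, and projecting onto the principal axes gives its center and extent. This is a self-contained Python sketch, not the production code.

```python
import math

def oriented_bbox_2d(points):
    """PCA-based oriented bounding box: returns (center, angle, extent)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the 2x2 covariance matrix.
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Principal-axis angle of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
    ax = (math.cos(theta), math.sin(theta))
    ay = (-math.sin(theta), math.cos(theta))
    # Project deviations onto the principal axes to get the box extent.
    proj_x = [(p[0] - mx) * ax[0] + (p[1] - my) * ax[1] for p in points]
    proj_y = [(p[0] - mx) * ay[0] + (p[1] - my) * ay[1] for p in points]
    extent = (max(proj_x) - min(proj_x), max(proj_y) - min(proj_y))
    return (mx, my), theta, extent

# A rectangle rotated 45 degrees: the recovered angle should be pi/4.
pts = [(0, 0), (4, 4), (3, 5), (-1, 1)]
center, theta, extent = oriented_bbox_2d(pts)
```

The task-specific parts of the real pipeline, such as midpoint estimation between two clamps, reduce to similarly simple geometry once each clamp has a reliable oriented box.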

[Videos: 3D Placement · Snow Place · Placement Variant]

Handling Real-World Failures

A major challenge was severe degradation of stereo depth caused by sunlight reflections on solar panels. That produced missing or corrupted depth regions, unstable point clouds, and unreliable localization. To recover robustness, the system used RANSAC-based filtering and reconstruction, then reprojected and rebuilt missing regions while combining geometric constraints with segmentation masks. This significantly improved robustness and reduced failure cases in high-reflection outdoor scenarios.
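
A minimal RANSAC plane fit illustrates the filtering idea (the production pipeline uses a more elaborate scheme with reconstruction on top): three random points define a candidate plane, the plane keeping the most inliers within a tolerance wins, and everything else is treated as reflection noise.

```python
import random

def fit_plane_ransac(points, tol=0.05, iters=200, seed=0):
    """Return the inlier set of the best plane found over `iters` samples."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iters):
        p1, p2, p3 = rng.sample(points, 3)
        # Plane normal = cross product of two edge vectors.
        u = [p2[i] - p1[i] for i in range(3)]
        v = [p3[i] - p1[i] for i in range(3)]
        n = [u[1] * v[2] - u[2] * v[1],
             u[2] * v[0] - u[0] * v[2],
             u[0] * v[1] - u[1] * v[0]]
        norm = sum(c * c for c in n) ** 0.5
        if norm < 1e-9:
            continue  # degenerate (collinear) sample
        n = [c / norm for c in n]
        d = -sum(n[i] * p1[i] for i in range(3))
        inliers = [p for p in points
                   if abs(sum(n[i] * p[i] for i in range(3)) + d) < tol]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return best_inliers

# A flat panel surface (z = 0) plus two reflection-style outliers.
cloud = [(x * 0.1, y * 0.1, 0.0) for x in range(5) for y in range(5)]
cloud += [(0.2, 0.2, 1.5), (0.3, 0.1, -2.0)]
kept = fit_plane_ransac(cloud)
```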

Multi-Camera Fusion & Synchronization

Accurate 3D perception required strict synchronization across cameras. Without proper synchronization, multi-view fusion introduced geometric inconsistencies that directly degraded manipulation accuracy. The system therefore implemented hardware-level synchronization, fused point clouds from multiple cameras before downstream processing, and maintained spatial consistency for stable 3D estimation.
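
The fusion step itself is conceptually simple once synchronization holds: each camera's points live in its own frame, and applying the camera-to-robot extrinsic merges them into one consistent cloud. The Python sketch below simplifies the extrinsic to translation only for clarity (the real transform includes rotation), and the baselines are hypothetical.

```python
def fuse_clouds(clouds, extrinsics):
    """clouds: {cam_id: [(x, y, z), ...]}; extrinsics: {cam_id: (tx, ty, tz)}."""
    fused = []
    for cam_id, pts in clouds.items():
        tx, ty, tz = extrinsics[cam_id]
        # Transform each point into the shared robot frame, then merge.
        fused.extend((x + tx, y + ty, z + tz) for x, y, z in pts)
    return fused

# Both cameras see a point 1 m ahead in their own frames.
clouds = {"left": [(0.0, 0.0, 1.0)], "right": [(0.0, 0.0, 1.0)]}
# Hypothetical baselines: cameras offset 0.1 m left/right of the gripper center.
extrinsics = {"left": (-0.1, 0.0, 0.0), "right": (0.1, 0.0, 0.0)}
fused = fuse_clouds(clouds, extrinsics)
```

If the captures are not simultaneous, the extrinsics no longer describe the true relative geometry at capture time, which is exactly the inconsistency that hardware-level synchronization removes.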

System Impact

The final system enabled reliable solar panel grasping in outdoor conditions with strong lighting variability. It achieved real-time performance on Jetson Orin Nano through TensorRT and CUDA optimization with roughly a 15x speedup, improved robustness against stereo depth failures through geometric reconstruction techniques, and delivered a production-ready perception pipeline for real robotic manipulation tasks.