Autonomous Drone Inventory Counting System
Real-time Warehouse Management with Drone Swarms and Computer Vision
Project Introduction
Large-scale storage warehouses face a critical challenge in the modern e-commerce era: maintaining real-time inventory accuracy. With the explosive growth of e-commerce and of services like Amazon FBA (Fulfillment by Amazon), through which small-scale vendors outsource storage and logistics, warehouse inventory management has become increasingly complex and demanding.
The Problem: Traditional inventory management relies on human workers—often forklift drivers—to manually count and update inventory as items move in and out of the warehouse. This approach has several critical flaws:
- • Human errors are extremely common, leading to inventory discrepancies
- • High frequency of items moving in/out makes real-time updates nearly impossible
- • Shortage of human resources: forklift drivers are expensive and in short supply
- • Manual counting is slow, creating bottlenecks in warehouse operations
- • Safety concerns: humans working at heights or in busy warehouse environments
The Solution: Deploy a swarm of autonomous drones that fly through the warehouse environment, automatically mapping the space, identifying inventory locations, counting items in real-time using computer vision, and updating the inventory management system via API—all without human intervention.
System Overview
This project delivers a complete end-to-end autonomous drone inventory system encompassing robotics, computer vision, deep learning, real-time control systems, and MLOps infrastructure:
Phase 1: Warehouse Mapping
Autonomous navigation and SLAM-based 3D mapping of warehouse environment using ROS2, Nav2, Rtabmap, and Octomap to create detailed spatial understanding.
Phase 2: Feature Identification
3D point cloud segmentation using Open3D and DBSCAN to identify shelving racks, zones, columns, and spatial features. Automated labeling system for inventory locations.
Phase 3: Object Detection & Counting
Live video streaming with custom YOLO models for real-time object detection. Intelligent counting algorithms for stacked cartons, pallets, tools, motors, and drums.
Phase 4: System Integration
Next.js control panel, live location tracking, drone fleet management, API integration for inventory updates, and MLOps/GitOps pipeline for continuous model improvements.
Phase 1: Warehouse Mapping with SLAM
ROS2 and Nav2 Navigation Stack
Modern robotics framework for autonomous navigation
Why ROS2?
ROS2 (Robot Operating System 2) was chosen over ROS1 for several critical advantages in production environments:
Real-time Performance
DDS middleware provides deterministic real-time communication essential for drone control
Multi-Robot Support
Native support for multiple drones (swarm) without complex workarounds
Security & Production-Ready
Built-in security features and enterprise-grade reliability
Nav2 Navigation Stack
Nav2 is the next-generation navigation framework for ROS2, providing sophisticated autonomous navigation capabilities:
- ✓ Behavior Trees: Flexible mission planning and execution
- ✓ Multiple navigation algorithms: DWB, TEB, Regulated Pure Pursuit controllers
- ✓ Recovery behaviors: Automatic handling of stuck situations
- ✓ Waypoint following: Sequential navigation through inventory locations
- ✓ Dynamic obstacle avoidance: Real-time path replanning
- ✓ Costmap layers: Static map, inflation, obstacle, and voxel layers
SLAM with Rtabmap (RGB-D Graph-Based SLAM)
Real-Time Appearance-Based Mapping for 3D environment reconstruction
What is Rtabmap and How Does It Work?
Rtabmap (Real-Time Appearance-Based Mapping) is an RGB-D Graph-Based SLAM algorithm designed for large-scale and long-term online operation. It is particularly well-suited for warehouse environments with repetitive structures.
Core Concept: Graph-Based SLAM
Rtabmap represents the environment as a graph where:
- • Nodes: Keyframes containing RGB images, depth images, and camera poses
- • Edges: Spatial constraints between nodes (odometry, loop closures)
- • Goal: Optimize the graph to find the most consistent map and trajectory
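To make the graph idea concrete, here is a toy pose-graph optimization using the Python bindings of GTSAM, one of the optimization backends Rtabmap supports. The square trajectory, noise values, and deliberately drifted initial guesses are made up for illustration, and the exact Python API can vary slightly between GTSAM versions:

```python
import numpy as np
import gtsam

# Toy 2D pose graph: four odometry edges around a 5 m square plus one loop closure.
graph = gtsam.NonlinearFactorGraph()
odom_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.05, 0.05, 0.02]))
loop_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.02, 0.02, 0.01]))

# Anchor the first keyframe at the origin
graph.add(gtsam.PriorFactorPose2(0, gtsam.Pose2(0, 0, 0),
          gtsam.noiseModel.Diagonal.Sigmas(np.array([1e-3, 1e-3, 1e-3]))))

# Odometry edges between consecutive keyframes: go 5 m forward, then turn left 90°
for i in range(4):
    graph.add(gtsam.BetweenFactorPose2(i, i + 1,
              gtsam.Pose2(5.0, 0.0, np.pi / 2), odom_noise))

# Loop closure: keyframe 4 re-observes the place where keyframe 0 was captured
graph.add(gtsam.BetweenFactorPose2(4, 0, gtsam.Pose2(0, 0, 0), loop_noise))

# Initial guesses are drifted, as raw odometry would be
initial = gtsam.Values()
for i, (x, y, th) in enumerate([(0, 0, 0), (5.2, 0.1, 1.6), (5.4, 5.3, 3.1),
                                (0.3, 5.6, -1.5), (0.4, 0.5, 0.1)]):
    initial.insert(i, gtsam.Pose2(x, y, th))

# Optimizing the graph pulls the trajectory back into a consistent square
result = gtsam.LevenbergMarquardtOptimizer(graph, initial).optimize()
print(result.atPose2(4))
```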
Rtabmap Pipeline:
1. Sensor Input (RGB-D Camera: Intel RealSense)
2. Feature Extraction (SURF/ORB/SIFT keypoints from RGB)
3. Odometry Estimation (Visual Odometry + IMU fusion)
4. Loop Closure Detection (Bag-of-Words for place recognition)
5. Graph Optimization (g2o/GTSAM backend)
6. 3D Point Cloud Generation (from depth + optimized poses)
7. Occupancy Grid / Octomap Output
Key Rtabmap Features Used:
Loop Closure Detection
Recognizes previously visited locations to correct drift and improve map consistency
Memory Management
Transfers old data to long-term memory to maintain real-time performance in large warehouses
Multi-Session Mapping
Can resume mapping from previous sessions, building incrementally over time
RGB-D + Lidar Fusion
Combines RealSense depth and RP Lidar 2D scans for robust 3D mapping
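To give a rough idea of how this is wired up in ROS2, the minimal launch sketch below starts the rtabmap SLAM node subscribing to RealSense RGB-D topics and the 2D laser scan. The package, executable, parameter, and topic names follow common rtabmap_ros and realsense2_camera conventions and are assumptions that may need adjusting for a specific release:

```python
from launch import LaunchDescription
from launch_ros.actions import Node

def generate_launch_description():
    rtabmap = Node(
        package="rtabmap_slam",          # "rtabmap_ros" on older ROS2 distros
        executable="rtabmap",
        parameters=[{
            "frame_id": "base_link",
            "subscribe_depth": True,
            "subscribe_scan": True,      # fuse the 2D RP Lidar scan as well
            "approx_sync": True,
        }],
        remappings=[
            ("rgb/image", "/camera/color/image_raw"),
            ("depth/image", "/camera/aligned_depth_to_color/image_raw"),
            ("rgb/camera_info", "/camera/color/camera_info"),
            ("scan", "/scan"),
        ],
        arguments=["-d"],                # start with a fresh map database
    )
    return LaunchDescription([rtabmap])
```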
Odometry and Triangulation Methods
Accurate odometry is crucial for drone localization during mapping:
Visual Odometry (VO):
- • Tracks feature points across consecutive RGB-D frames
- • Uses triangulation to estimate 3D coordinates of features
- • Calculates camera motion (rotation + translation) via PnP (Perspective-n-Point)
- • Rtabmap uses F2M (Frame-to-Map) and F2F (Frame-to-Frame) VO
Triangulation Process:
Triangulation estimates 3D point positions from 2D image observations:
- • Detect same feature in two or more camera views (stereo/temporal)
- • Knowing camera poses and intrinsic parameters, project rays from cameras through feature pixels
- • 3D point is where rays intersect (with noise, use least-squares optimization)
- • Depth from RealSense validates and improves triangulation accuracy
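A minimal two-view triangulation sketch with OpenCV, assuming the camera intrinsics and the relative pose between the two frames are already known from calibration and odometry; all numbers below are illustrative:

```python
import numpy as np
import cv2

# Camera intrinsics (illustrative RealSense-like values)
K = np.array([[615.0, 0.0, 320.0],
              [0.0, 615.0, 240.0],
              [0.0, 0.0, 1.0]])

# Two camera poses from odometry: identity, and a 10 cm sideways translation
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.10], [0.0], [0.0]])])

# The same two features observed in both frames (2xN arrays: row 0 = u, row 1 = v)
pts1 = np.array([[422.5, 227.75],
                 [281.0, 209.25]])
pts2 = np.array([[402.0, 197.0],
                 [281.0, 209.25]])

# Linear (DLT) least-squares triangulation; result is homogeneous 4xN
X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)
X = (X_h[:3] / X_h[3]).T
print(X)   # ~[[0.5, 0.2, 3.0], [-0.3, -0.1, 2.0]] metres in the first camera frame
```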
Sensor Fusion: The system fuses Visual Odometry, IMU (Inertial Measurement Unit), and flight controller data using an Extended Kalman Filter (EKF) to provide robust pose estimation even when visual features are temporarily lost.
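As a simplified illustration of the fusion step, the sketch below runs a linear Kalman filter on a single axis (altitude), predicting with IMU acceleration and correcting with a visual-odometry position fix. The real system estimates the full 6-DoF pose with an EKF; the rates and noise values here are assumptions:

```python
import numpy as np

dt = 0.01                                   # 100 Hz IMU rate (assumed)
F = np.array([[1.0, dt], [0.0, 1.0]])       # state = [position, velocity]
B = np.array([[0.5 * dt**2], [dt]])         # effect of measured acceleration
H = np.array([[1.0, 0.0]])                  # VO observes position only
Q = np.diag([1e-4, 1e-3])                   # process noise (tuned per vehicle)
R = np.array([[4e-2]])                      # VO measurement noise

def predict(x, P, accel):
    """Propagate the state with the latest IMU acceleration."""
    x = F @ x + B * accel
    P = F @ P @ F.T + Q
    return x, P

def update(x, P, z):
    """Correct the state with a visual-odometry position fix."""
    y = np.array([[z]]) - H @ x             # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
    return x + K @ y, (np.eye(2) - K @ H) @ P

x, P = np.zeros((2, 1)), np.eye(2)
for _ in range(100):                        # 1 s of IMU-only prediction at 0.2 m/s^2
    x, P = predict(x, P, accel=0.2)
x, P = update(x, P, z=0.08)                 # VO fix corrects the accumulated drift
print(x.ravel())
```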
Octomap: 3D Occupancy Mapping
Efficient 3D representation for navigation and obstacle avoidance
What is Octomap and How Does It Work?
Octomap is a probabilistic 3D occupancy mapping framework based on octrees—a hierarchical tree data structure that efficiently represents 3D space by recursively subdividing it into octants (8 child nodes per parent).
Octree Data Structure:
- • Root Node: Represents entire mapped space
- • Subdivision: Each node can be divided into 8 children (octants)
- • Leaf Nodes: Store occupancy probability (free, occupied, unknown)
- • Pruning: Homogeneous regions collapsed to single nodes, saving memory
- • Resolution: Configurable voxel size (e.g., 10cm cubes)
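A bare-bones octree sketch showing the recursive subdivision down to a 10 cm leaf; this is illustration only, as the real Octomap library additionally prunes homogeneous nodes, casts rays, and serializes the tree:

```python
import numpy as np

class OctreeNode:
    """Minimal octree: recursively subdivide a cube down to leaf resolution."""
    def __init__(self, center, size):
        self.center = np.asarray(center, dtype=float)
        self.size = size                 # edge length of this cube (metres)
        self.children = None             # list of 8 children once subdivided
        self.log_odds = 0.0              # occupancy estimate stored at leaves

    def insert(self, point, leaf_size=0.10):
        """Return the leaf voxel containing `point`, subdividing on the way down."""
        if self.size <= leaf_size:       # reached leaf resolution (e.g. 10 cm)
            return self
        if self.children is None:
            offsets = np.array([[x, y, z] for x in (-1, 1)
                                          for y in (-1, 1)
                                          for z in (-1, 1)]) * (self.size / 4)
            self.children = [OctreeNode(self.center + o, self.size / 2)
                             for o in offsets]
        dx, dy, dz = (np.asarray(point) >= self.center).astype(int)
        return self.children[(dx << 2) | (dy << 1) | dz].insert(point, leaf_size)

root = OctreeNode(center=[0.0, 0.0, 0.0], size=40.0)   # 40 m warehouse bounding cube
leaf = root.insert([3.2, -7.9, 1.4])                   # a single occupied observation
print(leaf.center, leaf.size)
```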
Occupancy Probability Updates:
Octomap uses a Bayesian update scheme to handle sensor uncertainty:
- • Each voxel has a log-odds occupancy value
- • When a sensor (Lidar/Depth camera) observes a voxel as occupied, increase probability
- • When a ray passes through a voxel (free space), decrease probability
- • Multiple observations integrated over time → confident occupancy map
- • Handles dynamic environments by allowing probability updates
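The update itself reduces to adding a sensor-model term in log-odds space and clamping, roughly as sketched below; the hit/miss probabilities and clamping bounds are typical tuning values rather than verified library defaults:

```python
import math

P_HIT, P_MISS = 0.7, 0.4                     # assumed sensor model
L_HIT = math.log(P_HIT / (1 - P_HIT))        # ≈ +0.85 added on an occupied observation
L_MISS = math.log(P_MISS / (1 - P_MISS))     # ≈ -0.41 added when a ray passes through
L_MIN, L_MAX = -2.0, 3.5                     # clamping keeps voxels updatable later

def update_voxel(log_odds, hit):
    """Bayesian update in log-odds form: add the sensor-model term, then clamp."""
    log_odds += L_HIT if hit else L_MISS
    return max(L_MIN, min(L_MAX, log_odds))

def probability(log_odds):
    """Convert log-odds back to an occupancy probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

l = 0.0                                      # unknown voxel starts at p = 0.5
for hit in [True, True, False, True]:        # three hits, one pass-through ray
    l = update_voxel(l, hit)
print(round(probability(l), 3))              # confidently occupied after repeated hits
```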
Memory Efficiency
Octree compression reduces memory usage by 10-100x compared to dense 3D grids
Fast Queries
O(log n) lookup time for collision checking and path planning
Integration: Rtabmap generates point clouds → Octomap converts them into 3D occupancy grid → Nav2 uses Octomap for collision avoidance and path planning in 3D warehouse space.
Phase 2: 3D Point Cloud Segmentation & Feature Identification
Point Cloud Processing with Open3D
Extracting meaningful structure from 3D warehouse scans
From Point Cloud to Warehouse Structure
After mapping, we have a massive 3D point cloud representing the entire warehouse. The challenge: automatically identify shelving racks, zones, aisles, and inventory locations.
Open3D Processing Pipeline:
- ▸ Point cloud downsampling with voxel grid filter (reduce density while preserving structure)
- ▸ Statistical outlier removal to clean noise from sensor data
- ▸ Normal estimation for each point (surface orientation)
- ▸ Plane segmentation using RANSAC to identify floors, walls, shelves
- ▸ Clustering to group points into distinct objects/structures
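A condensed version of the pipeline above in Open3D might look like the following; the input file name and the thresholds (voxel size, RANSAC distance, DBSCAN eps/min_points) are placeholder values to be tuned per warehouse:

```python
import open3d as o3d
import numpy as np

# Load the point cloud exported from the mapping session (path is illustrative)
pcd = o3d.io.read_point_cloud("warehouse_map.ply")

# 1. Downsample with a voxel grid to a manageable density
pcd = pcd.voxel_down_sample(voxel_size=0.05)

# 2. Remove statistical outliers left by sensor noise
pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)

# 3. Estimate per-point normals (surface orientation)
pcd.estimate_normals(
    search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.2, max_nn=30))

# 4. RANSAC plane segmentation: the dominant plane is the warehouse floor
plane_model, floor_idx = pcd.segment_plane(distance_threshold=0.03,
                                           ransac_n=3, num_iterations=1000)
structures = pcd.select_by_index(floor_idx, invert=True)   # everything but the floor

# 5. DBSCAN clustering groups the remaining points into racks and other structures
labels = np.array(structures.cluster_dbscan(eps=0.3, min_points=50))
print(f"{labels.max() + 1} clusters found")
```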
DBSCAN Clustering for Feature Identification
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm perfect for identifying shelving structures in point clouds.
Why DBSCAN for Warehouses?
- • No predefined cluster count: Automatically discovers number of shelves/racks
- • Handles arbitrary shapes: Shelves can be L-shaped, curved, or irregular
- • Noise robust: Ignores scattered points (forklift, pallets, people)
- • Density-based: Groups dense point regions (shelves) while rejecting sparse areas (aisles)
DBSCAN Process:
- Define ε (epsilon): neighborhood radius around each point
- Define MinPts: minimum points to form dense region
- Classify points: Core (≥MinPts neighbors), Border, Noise
- Connect core points within ε distance → forms clusters
- Each cluster = one rack/shelf structure
Result: Each identified cluster represents a distinct warehouse feature (rack, pallet position, forklift area, etc.). Extracted cluster centroids and bounding boxes provide x,y,z coordinates for navigation.
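Continuing from the Open3D sketch above (reusing its structures point cloud and labels array), the per-cluster centroids and axis-aligned bounding boxes can be extracted as shown below; the field names of the returned records are illustrative:

```python
import numpy as np

def cluster_features(structures, labels):
    """For every DBSCAN cluster, compute the centroid and axis-aligned bounding
    box that Phase 2 stores as a navigable warehouse feature."""
    features = []
    for cluster_id in range(labels.max() + 1):
        idx = np.where(labels == cluster_id)[0].tolist()
        cluster = structures.select_by_index(idx)
        box = cluster.get_axis_aligned_bounding_box()
        features.append({
            "cluster_id": cluster_id,
            "centroid": np.asarray(cluster.points).mean(axis=0),
            "extent": box.get_extent(),        # width, depth, height in metres
            "min_bound": box.min_bound,
            "max_bound": box.max_bound,
        })
    return features
```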
Automated Labeling System
Once features are identified, the system automatically assigns hierarchical labels:
Spatial Hierarchy:
- • Zones: Large areas (Zone 1, Zone 2, Zone 3, etc.)
- • Aisles: Pathways between racks (Aisle A, B, C...)
- • Racks: Individual shelving units (Rack R1, R2, R3...)
- • Columns: Vertical divisions (Column 1, 2, 3...)
- • Shelves: Height levels (Shelf A, B, C, D...)
Location Format:
Example: Zone2-Aisle-B-R5-Col3-ShelfC
This location code uniquely identifies every inventory position in the warehouse, stored with (x, y, z) coordinates for drone navigation.
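A small helper showing how such a code could be assembled and stored alongside its coordinates; the function name and record layout are hypothetical:

```python
def location_code(zone, aisle, rack, column, shelf):
    """Build the hierarchical location code for an inventory position,
    e.g. Zone2-Aisle-B-R5-Col3-ShelfC."""
    return f"Zone{zone}-Aisle-{aisle}-R{rack}-Col{column}-Shelf{shelf}"

# Hypothetical record tying the code to the coordinates extracted in Phase 2
location = {
    "code": location_code(2, "B", 5, 3, "C"),
    "xyz": (14.2, 6.8, 3.1),     # metres, warehouse map frame
}
print(location["code"])          # Zone2-Aisle-B-R5-Col3-ShelfC
```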
Phase 3: Live Object Detection & Intelligent Counting
Custom YOLO Object Detection
Real-time inventory item detection with live video streaming
Live Video Streaming Architecture
As drones navigate through the warehouse, they stream live video to a ground station for real-time object detection:
Video Pipeline:
- • Intel RealSense RGB stream (1920x1080 @ 30fps)
- • H.264 video encoding on drone
- • Low-latency streaming (GStreamer/WebRTC)
- • GPU-accelerated decoding on ground station
- • Frame buffer for YOLO inference
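One possible on-drone sender uses OpenCV's GStreamer backend to encode H.264 and push RTP over UDP to the ground station. The device index, host address, and encoder settings below are placeholders, and OpenCV must be built with GStreamer support:

```python
import cv2

# appsrc -> H.264 (zero-latency) -> RTP -> UDP to the ground station
GST_OUT = (
    "appsrc ! videoconvert ! x264enc tune=zerolatency bitrate=4000 speed-preset=ultrafast "
    "! rtph264pay config-interval=1 pt=96 ! udpsink host=192.168.1.50 port=5600"
)

cap = cv2.VideoCapture(0)                      # RealSense RGB as a V4L2 device (index assumed)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1920)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 1080)
writer = cv2.VideoWriter(GST_OUT, cv2.CAP_GSTREAMER, 0, 30.0, (1920, 1080), True)

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    writer.write(frame)                        # encoded RTP packets leave the drone here
```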
Custom YOLO Training:
- • YOLOv5/v8 architecture optimized for warehouse items
- • Training dataset: cartons, pallets, tools, motors, drums, etc.
- • Data augmentation for varying lighting/angles
- • TensorRT optimization for real-time inference
- • Class confidence thresholding (>0.6)
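With the Ultralytics library, the training, export, and thresholded inference steps reduce to a few calls; the dataset file warehouse.yaml, the model size, and the hyperparameters below are assumptions for illustration:

```python
from ultralytics import YOLO

# Fine-tune a pretrained YOLOv8 checkpoint on the warehouse dataset.
# "warehouse.yaml" (classes: carton, pallet, tool, motor, drum) is assumed to exist.
model = YOLO("yolov8s.pt")
model.train(data="warehouse.yaml", epochs=100, imgsz=640, batch=16,
            hsv_v=0.4, degrees=10.0)          # augment for lighting/angle variation

# Export a TensorRT engine for real-time inference on the Jetson
model.export(format="engine", half=True)

# Inference with the confidence threshold used at counting time
results = model.predict("shelf_frame.jpg", conf=0.6)
```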
Intelligent Counting Mechanisms
Different inventory items require different counting strategies. The system adapts based on object type:
Scenario 1: Stacked Cartons on Pallets
Most challenging due to occlusion and stacking patterns.
- • YOLO detects visible cartons in current frame
- • Multiple viewing angles: drone circles pallet
- • 3D spatial tracking: associate detections across frames using depth + pose
- • Counting algorithm: track unique carton positions in 3D space
- • Stacking estimation: measure pallet height with depth camera, estimate layers
- • Confidence scoring: multiple observations → higher confidence
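One way to realize the 3D spatial tracking and unique-position counting is to back-project each detection into the map frame using its depth and the drone pose, then merge detections that land close together across frames. The sketch below is illustrative; the helper names, the 25 cm merge radius, and the coordinates are assumptions:

```python
import numpy as np

def pixel_to_world(u, v, depth, K, T_world_cam):
    """Back-project a detection's pixel centre with its measured depth
    into warehouse map coordinates using the drone's current pose."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    p_cam = np.append(ray * depth, 1.0)            # homogeneous point in camera frame
    return (T_world_cam @ p_cam)[:3]

def merge_detections(cartons, new_points, min_separation=0.25):
    """Count a detection as a new carton only if it is farther than
    min_separation (m) from every carton tracked so far; otherwise treat it
    as a repeat observation and refine that carton's position."""
    for p in map(np.asarray, new_points):
        if cartons:
            dists = np.linalg.norm(np.asarray(cartons) - p, axis=1)
            i = int(dists.argmin())
            if dists[i] < min_separation:
                cartons[i] = (cartons[i] + p) / 2  # average repeat observations
                continue
        cartons.append(p)
    return cartons

cartons = []
# Two frames viewing the same pallet from different angles (coordinates illustrative)
cartons = merge_detections(cartons, [(4.0, 2.0, 1.2), (4.0, 2.4, 1.2)])
cartons = merge_detections(cartons, [(4.02, 2.01, 1.21), (4.0, 2.8, 1.2)])
print(len(cartons))   # 3 unique cartons despite 4 detections
```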
Scenario 2: Individual Items (Tools, Motors, Drums)
Simpler counting for discrete, non-stacked items.
- • YOLO directly detects and counts individual items
- • Bounding box filtering to avoid double-counting
- • Non-maximum suppression (NMS) for overlapping detections
- • Simple summation across frames with deduplication
- • Fast and reliable for clearly visible objects
Scenario 3: Partial Occlusion Handling
When items are partially hidden behind others.
- • Instance segmentation (YOLOv8 segmentation) for precise boundaries
- • Depth-based occlusion reasoning: closer items occlude farther ones
- • Multi-view fusion: combine counts from different angles
- • Probabilistic counting: assign confidence to partially visible items
Real-time Inventory Updates via API
As drones count items, the system immediately updates the inventory management system:
- ✓ RESTful API integration with the warehouse management system (WMS)
- ✓ Structured data payload: location_code, item_type, quantity, confidence, timestamp (sketched below)
- ✓ Batch updates to minimize API calls (aggregate multiple counts)
- ✓ Conflict resolution: if a human-entered count differs, flag the location for verification
- ✓ Audit trail: all counts logged with drone ID and video frame reference
- ✓ Real-time dashboard updates visible to warehouse managers
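A sketch of such a batched update call; the endpoint URL and field values are placeholders that would come from the actual WMS integration:

```python
import requests

# Hypothetical WMS endpoint; payload fields mirror those listed above
WMS_URL = "https://wms.example.com/api/v1/inventory/counts"

payload = {
    "counts": [
        {
            "location_code": "Zone2-Aisle-B-R5-Col3-ShelfC",
            "item_type": "carton",
            "quantity": 48,
            "confidence": 0.92,
            "timestamp": "2024-05-14T09:32:11Z",
            "drone_id": "drone-03",
            "frame_ref": "mission-117/frame-002451",
        }
    ]
}

# Batched POST so several shelf counts go out in a single API call
response = requests.post(WMS_URL, json=payload, timeout=5)
response.raise_for_status()
```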
Phase 4: Control Panel & MLOps Infrastructure
Next.js Drone Control Panel
Real-time fleet management and monitoring dashboard
Control Panel Features
Live Location Tracking
Real-time 3D visualization of all drone positions overlaid on warehouse map. Shows current mission, battery level, and status.
Mission Planning
Drag-and-drop waypoint editor for creating inventory counting routes. Optimize coverage paths automatically.
Fleet Management
Monitor battery levels, flight hours, maintenance schedules. Assign missions to available drones intelligently.
Video Streaming
Live video feed from selected drone with real-time YOLO detection overlays showing bounding boxes and counts.
Inventory Dashboard
Real-time inventory counts updated as drones complete missions. Historical trends and discrepancy alerts.
Alert System
Notifications for low battery, obstacle detection failures, counting discrepancies, or mission completion.
MLOps & GitOps Pipeline
Continuous improvement of object detection models
Continuous Model Improvement
As drones collect more data, the YOLO models continuously improve through an automated MLOps pipeline:
Data Collection Loop
- • Drones capture images of inventory items during missions
- • Low-confidence detections flagged for human review
- • Warehouse staff label uncertain images via web interface
- • New labeled data added to training dataset automatically
Automated Training Pipeline
- • GitLab CI/CD triggers training when dataset reaches threshold
- • GPU cluster trains new YOLO model version
- • Automated validation on held-out test set
- • If accuracy improves, promote to staging environment
- • A/B testing: compare new model vs. old on live drone fleet
- • If metrics improve, blue-green deployment to production
GitOps Deployment
- • Model artifacts stored in Git LFS (Large File Storage)
- • Kubernetes deployment manifests define model serving config
- • ArgoCD/Flux monitors Git repo for changes
- • Automated rollout to drone fleet with health checks
- • Rollback capability if new model degrades performance
Hardware Components
Intel RealSense Depth Camera
RGB-D camera providing synchronized color and depth streams. Used for visual odometry, 3D mapping, and object detection. D435/D455 models with up to 90 FPS.
RP Lidar 2D Laser Scanner
360° laser rangefinder for obstacle detection and 2D mapping. Provides long-range obstacle detection beyond RealSense depth range. Essential for safe navigation.
High-End Flight Controller
Pixhawk 4 or similar flight controller running PX4/ArduPilot firmware. Handles low-level flight stabilization, receives waypoints from ROS2 Nav2.
Onboard Computer
NVIDIA Jetson Xavier NX or similar edge AI computer. Runs ROS2, Rtabmap, YOLO inference, video streaming. Low power consumption for extended flight time.
Dynamic Obstacle Avoidance
The system handles dynamic warehouse environments with moving forklifts, people, and changing inventory:
- • Costmap layers: Static map (walls/racks) + dynamic obstacles (Lidar/depth) + inflation layer (safety margin)
- • Real-time replanning: Nav2 DWB controller replans path every 100ms if obstacles detected
- • Sensor fusion: Combines Lidar (long-range, 2D) and RealSense (short-range, 3D) for comprehensive awareness
- • Emergency stop: If obstacle too close, hover in place until path clears
Complete Technology Stack
- Robotics Framework: ROS2, Nav2
- SLAM & Mapping: Rtabmap, Octomap
- Computer Vision: custom YOLOv5/v8, TensorRT, GStreamer/WebRTC streaming
- Point Cloud: Open3D, DBSCAN
- Frontend: Next.js
- MLOps/DevOps: GitLab CI/CD, Git LFS, Kubernetes, ArgoCD/Flux
- Hardware: Intel RealSense D435/D455, RP Lidar, Pixhawk 4 (PX4/ArduPilot), NVIDIA Jetson Xavier NX
- Languages
Project Impact & Achievements
Key Achievements
This autonomous drone inventory system addresses the critical challenge facing modern warehouses: maintaining real-time inventory accuracy in high-velocity e-commerce environments. By combining the ROS2 robotics framework, SLAM-based 3D mapping, computer vision with custom YOLO models, intelligent point cloud processing, and an MLOps infrastructure, it delivers fully autonomous inventory counting that eliminates human error, reduces labor costs, and provides instant inventory visibility, reshaping warehouse operations for the Amazon FBA era.