Zum Inhalt springen

CVIWM: A Tightly Coupled State Estimation Method for Poultry House Inspection Robots in Structurally Degraded Environments

Prometheus Redaktion

Open AccessArticle CVIWM: A Tightly Coupled State Estimation Method for Poultry House Inspection Robots in Structurally Degraded Environments 1 College of Engineering, South China Agricultural University, Guangzhou 510642, China 2 State Key Laboratory of Swine and Poultry Breeding Industry, South China Agricultural University, Guangzhou 510642, China 3 National Engineering Research Center for Breeding Swine Industry, Guangzhou 510642, China * Author to whom correspondence should be addressed. Animals 2026, 16(12), 1780; https://doi.org/10.3390/ani16121780 (registering DOI) Submission received: 7 May 2026 / Revised: 29 May 2026 / Accepted: 8 June 2026 / Published: 9 June 2026 Simple Summary Caged chicken houses have long, narrow corridors with repetitive features, making it difficult for robots to navigate accurately. In this study, we developed a new positioning method called CVIWM that helps a robot know its exact location in such structurally degraded environments. The method combines data from a camera, a motion sensor, wheel measurements, and small visual tags placed along the corridor. Tested in an 80 m long commercial poultry house, our method achieved positioning errors of less than 3.3 cm. This accuracy allowed the robot to take well-centered cage images, enabling the detection of 95.7% of dead hens and 98.9% of eggs. This low-cost, easy-to-deploy solution offers a practical step toward automated health monitoring and more efficient poultry farming. Abstract Accurate positioning is essential for inspection robots in caged chicken houses, where long straight corridors, sparse textures, and repetitive structures challenge conventional methods. This paper proposes CVIWM (Coupled Visual-Inertial-Wheel Odometry with Markers), a tightly coupled state estimation method that fuses visual, inertial measurement unit (IMU), wheel odometry (WO), and fiducial marker observations within a factor graph optimization framework. Wheel odometry preintegration suppresses IMU horizontal drift and provides absolute scale, while sparse AprilTag markers (10 m spacing) periodically reset accumulated errors. Experiments in an 80 m corridor of a commercial caged chicken house at 0.116 m/s and 0.232 m/s showed that CVIWM achieves average positioning errors of 2.402 cm and 3.253 cm. This high precision ensured reliable image acquisition (image shift 60 m) and narrow (≈1 m width), flanked by mesh cages that create sparse and repetitive textures. The environment also suffers from uneven lighting, ground stains, and a semi-enclosed structure with a manure pit at the corridor end. These factors collectively degrade the performance of conventional positioning methods: visual odometry (VO) suffers from feature-matching ambiguity, inertial navigation suffers from unbounded drift, and light detection and ranging (LiDAR)-based methods face point cloud sparsity. High-precision positioning is not only a prerequisite for autonomous robot navigation but also a critical factor for effective poultry health monitoring. Tasks such as dead hen detection rely heavily on the spatial consistency of captured images. When a robot is equipped with a camera to monitor caged hens, even small positioning errors can lead to image misalignment, causing the target cage to shift out of the effective field of view. This would seriously compromise the accuracy and reliability of subsequent health assessments. Therefore, developing a positioning method that is robust to structurally degraded environments is essential for ensuring that inspection robots can provide consistent and accurate spatial information for poultry monitoring. 1.2. Related Work According to different measurement principles, robot positioning technologies can be classified into absolute positioning, relative positioning, and multi-sensor fusion positioning. In parallel to geometric approaches, learning-based methods have gained traction. Some works use convolutional neural networks (CNNs) to estimate camera pose directly from a single image [ 14, 15]. Others employ recurrent neural networks (RNNs) to model sequential sensor data for end-to-end visual-inertial odometry (VIO) [ 16, 17]. While these methods show promise in rich, textured environments, their performance in texture-less, repetitive, and structurally degraded environments (e.g., long corridors in caged chicken houses) remains a significant challenge [ 18, 19, 20]. These learning-based approaches often require massive, diverse datasets for training and may lack the generalization capability and physical constraints (e.g., scale consistency) that traditional geometric fusion provides. Our method, in contrast, relies on a tightly coupled geometric optimization with explicit physical constraints from wheel odometry and markers, ensuring robustness in such challenging, domain-specific scenarios. Multi-sensor fusion has become a necessary pathway to achieve robust, high-precision positioning. Loosely coupled methods process observations independently [ 21], while tightly coupled methods jointly optimize raw measurements within a single framework [ 22, 23]. Tightly coupled visual-inertial odometry (VIO), such as visual-inertial navigation system (VINS)-Mono [ 24], has demonstrated robust performance in texture-rich environments. To further enhance localization accuracy in challenging scenes, researchers have incorporated additional sensor modalities. For instance, a visual-marker-inertial fusion system using sliding window optimization integrates AprilTag observations to provide absolute pose corrections [ 25], effectively reducing drift in weakly textured environments. However, this approach still relies solely on visual-inertial-marker fusion and lacks direct motion constraints (e.g., wheel odometry), making it susceptible to residual drift between markers, especially in long straight corridors. 1.3. Research Gap and Proposed Method In summary, single positioning technologies face severe challenges in caged chicken house scenarios. Absolute positioning methods are constrained by deployment costs and electromagnetic interference. Environment-based absolute methods (LiDAR) are prone to degradation in long, repetitive corridors. Relative methods (inertial navigation) suffer from unbounded cumulative error. Multi-sensor fusion methods lack effective solutions for structurally degraded environments. To address these problems, this paper proposes a tightly coupled state estimation method that integrates visual, IMU, wheel odometry, and marker observations within a factor graph optimization framework. The method introduces two key constraints: (1) wheel odometry preintegration provides drift-free planar motion constraints, suppressing IMU horizontal drift and providing absolute scale for monocular vision; (2) sparse fiducial markers serve as absolute pose anchors to periodically reset accumulated errors. All observations are tightly fused within a factor graph optimization framework. 2. Materials and Methods 2.1. System Overview This paper proposes a tightly coupled state estimation method named CVIWM (Coupled Visual-Inertial-Wheel Odometry with Markers). The overall framework is illustrated in Figure 2. The system integrates four types of sensor observations: (1) visual feature points for relative motion estimation; (2) fiducial markers (AprilTags) for absolute pose constraints; (3) IMU preintegration for high-frequency motion prediction; (4) wheel odometry preintegration for drift-free planar displacement. All observations are jointly incorporated into a tightly coupled global objective function, and the robot’s global pose is computed in real time through nonlinear optimization. 2.2. Visual Observation Model Visual observations provide two complementary sources of constraints: natural feature tracking for relative motion estimation between consecutive frames (visual odometry), and AprilTag fiducial markers for absolute pose corrections in the global coordinate system. The coordinate relationship in the positioning system is shown in Figure 3. 2.2.1. Camera Projection Model According to the pinhole camera model, a 3D point with world coordinates P w = [ X w , Y w , Z w ] T is transformed to the camera frame as: P c = R w c P w + t w c (1) where P c = [ X c , Y c , Z c ] T represents the coordinates of the point in the camera coordinate system (with Z c being the depth), and [ R w c | t w c ] is the extrinsic parameter matrix that transforms points from the world frame to the camera frame. The transformation from the camera frame to the image plane is then given by: p = π ( P c ) = 1 Z c K P c (2) where π ( · ) is the projection function, p = [ u , v ] T are the pixel coordinates, and K is the intrinsic matrix: K = f x 0 c x 0 f y c y 0 0 1 (3) The parameters f x , f y are the focal lengths, and c x , c y are the principal point coordinates, all of which are obtained through checkerboard calibration. 2.2.2. Visual Odometry from Natural Features For consecutive keyframes i and j with matched natural features, the relative camera motion can be recovered by solving epipolar geometry constraints. Let the relative rotation and translation be R i j c and t i j c . The reprojection relation is: p ^ j = π ( R i j c ( π − 1 ( p i , d i ) ) + t i j c ) (4) where p ^ j is the predicted pixel coordinates in frame j, p i is the measured pixel coordinates in frame i, d i is the inverse depth of the feature in frame i, and π − 1 back-projects a pixel to a normalized ray. By solving the essential or fundamental matrix, R i j c and t i j c are obtained, forming the relative motion measurement z i j V O = ( R i j c , t i j c ) of the visual odometry. 2.2.3. Uncertainty and Reprojection Residual for Visual Reprojection Visual uncertainty originates from pixel localization errors. Assuming independent Gaussian noise in the u and v directions with standard deviation σ p = 0.5 pixel [ 26], the pixel covariance matrix is: Σ p x = σ p 2 I 2 (5) where I 2 denotes the 2 ୍ଠ 2 identity matrix. This pixel error propagates to the 3D point in the camera frame via linearization of the projection model. The covariance of P c is: Σ P c = Z c 2 · J Σ p n J T (6) where J is the Jacobian mapping the 2D error to the 3D point, and Σ p n is the covariance in the normalized plane. The Jacobian of the projection function π with respect to P c is [ 27]: J π = ∂ π ( P c ) ∂ P c = f x / Z c 0 − f x X c / Z c 2 0 f y / Z c − f y Y c / Z c 2 (7) The visual reprojection residual is defined as the difference between the observed and reprojected pixel coordinates: r V O ( z i j V O , X ) = p j − π ( R i j c ( π − 1 ( p i , d i ) ) + t i j c ) (8) where X represents the state vector to be optimized. The covariance matrix of the visual reprojection residual r V O is obtained by error propagation [ 28]: Σ i j V O = Σ p x + J p r o j Σ P c J p r o j T (9) where J p r o j is the Jacobian of the reprojection residual with respect to the 3D point P c (equivalent to J π ). The information matrix used in optimization is the inverse of the covariance: Ω V O = ( Σ i j V O ) − 1 (10) 2.2.4. Fiducial Marker Definition and Detection Each AprilTag defines four corners in its own coordinate frame t a g . Let the tag side length be L, with the origin at the tag center and the X Y -plane coincident with the tag plane. The 3D coordinates of the four corners in t a g are: Q h t a g ∈ ବ୍ଦ L 2 , ବ୍ଦ L 2 , 0 h = 1 4 (11) where h denotes the index of the h-th tag corner. When the camera observes a tag with known pose, the detection algorithm extracts the pixel coordinates p h h = 1 4 of the four corners. Since all corners lie on the Z t a g = 0 plane, the camera-to-tag pose R tag c ∣ t tag c is obtained by solving a perspective-n-point (PnP) problem. We employ the efficient perspective-n-point (EPnP) algorithm, which uses four virtual control points to build a linear system via weighted summation. The optimal rigid transformation is recovered by singular value decomposition (SVD). 2.2.5. Global Camera Pose Computation from Fiducial Markers Given the prior pose of the tag in the world frame R tag w ∣ t tag w (obtained by prior surveying), the global camera pose is computed as: R c w | t c w = R t a g w | t t a g w · R t a g c | t t a g c − 1 (12) Expanding the above, the camera rotation and translation in the world frame are: R c w = R t a g w ( R t a g c ) T t c w = t t a g w − R t a g w ( R t a g c ) T t t a g c (13) where R c w and t c w denote the camera’s rotation and translation (position) in the world frame. 2.2.6. Uncertainty and Reprojection Residual for Fiducial Marker The uncertainty of AprilTag observations also originates from corner localization errors in the image plane, with standard deviation σ t a g = 0.3 pixel. Assuming independent corner localization errors, the joint covariance matrix for all four corners is: Σ T a g = σ t a g 2 I 8 (14) where I 8 denotes the 8 ୍ଠ 8 identity matrix. For an AprilTag detected in keyframe j with known global pose, the reprojection residual is defined as the stacked vector of projection errors over all four corners: r T a g ( z j T a g , X ) = p h j − π ( T c j w T w t a g P h , t a g ) h = 1 4 (15) where z j T a g represents the AprilTag measurement data in keyframe j, p h j are the measured pixel coordinates of the h-th corner in keyframe j, P h , t a g is the homogeneous coordinate of the h-th corner in the tag frame t a g , T w t a g is the known prior pose of the tag in the world frame, T c j w is the camera pose in the world frame at keyframe j (to be optimized). Due to the high detection accuracy of fiducial markers, the information matrix is assigned a higher weight: Ω T a g = ( Σ T a g ) − 1 (16) In the optimization framework, the AprilTag factor serves as a strong prior constraint, effectively correcting cumulative drift from visual odometry and inertial navigation, and providing a globally consistent absolute pose reference. Natural features provide dense relative motion constraints between consecutive frames, while AprilTag markers provide sparse but globally accurate absolute pose corrections. Their complementary characteristics—high-frequency relative constraints versus drift-free absolute anchors—enable robust and accurate state estimation in the optimization framework. 2.3. IMU Preintegration 2.3.1. IMU Measurement Model and Kinematics IMU provides high-frequency measurements of angular velocity and linear acceleration. Let ω ˜ k I M U and a ˜ k I M U denote the angular velocity and linear acceleration of IMU measurements at time step k. The measurement models are [ 29]: ω ˜ k I M U = ω k + b k g + η g a ˜ k I M U = R b k w ( a k − g w ) + b k a + η a (17) where ω k and a k are the true angular velocity and linear acceleration at time step k, b k g and b k a are the gyroscope and accelerometer biases at time step k, η g and η a are white Gaussian noise, g w is the gravity vector in the world frame, and R b k w represents the rotation matrix from the robot body frame to the world frame at time step k. The continuous-time kinematics of the robot are given by [ 30]: R ˙ b t w = R b t w ω t ୍ଠ v ˙ t w = R b t w a t + g w p ˙ t w = v t w (18) where ω t ୍ଠ denotes the skew-symmetric matrix of the angular velocity vector at time t, ω t and a t are the true angular velocity and linear acceleration at time t, R b t w , v t w , and p t w are the rotation, velocity, and position in the world frame at time t, respectively. 2.3.2. IMU Preintegration Definition Direct integration of IMU measurements between keyframes i and j depends on the initial state R b i w , v i w , p i w , requiring re-integration whenever these states change, which leads to high computational cost [ 24]. To avoid this, we adopt the preintegration technique, which expresses relative motion increments in the local frame of keyframe i. Define the IMU preintegrated measurements z i j I M U between keyframes i and j as: Δ R ˜ i j I M U = ∏ k = i j − 1 e x p ( ( ω ˜ k I M U − b k g ) Δ t ) Δ v ˜ i j I M U = Σ k = i j − 1 Δ R ˜ i k I M U ( a ˜ k I M U − b k a ) Δ t Δ p ˜ i j I M U = Σ k = i j − 1 Δ v ˜ i k I M U Δ t + 1 2 Δ R ˜ i k I M U ( a ˜ k I M U − b k a ) Δ t 2 (19) with initial conditions Δ R ˜ i i I M U = I , Δ v ˜ i i I M U = 0 , Δ p ˜ i i I M U = 0 , where Δ R ˜ i j I M U , Δ v ˜ i j I M U , and Δ p ˜ i j I M U are the preintegrated rotation, velocity, and position measurements from keyframe i to j, Δ t is the sampling time interval, Δ R ˜ i k I M U , Δ v ˜ i k I M U , and Δ p ˜ i k I M U are the preintegrated rotation, velocity, and position increments from keyframe i to k, and e x p ( · ) is the exponential mapping from the Lie algebra to the Lie group. In practice, we use the midpoint integration method for better accuracy: Δ R ˜ k + 1 I M U = Δ R ˜ k I M U e x p ( ( ω ˜ k I M U − b k g ) + ( ω ˜ k + 1 I M U − b k + 1 g ) 2 Δ t ) Δ v ˜ k + 1 I M U = Δ v ˜ k I M U + Δ R ˜ k I M U ( a ˜ k I M U − b k a ) + ( a ˜ k + 1 I M U − b k + 1 a ) 2 Δ t Δ p ˜ k + 1 I M U = Δ p ˜ k I M U + Δ v ˜ k I M U Δ t + 1 2 Δ R ˜ k I M U ( a ˜ k I M U − b k a ) + ( a ˜ k + 1 I M U − b k + 1 a ) 2 Δ t 2 (20) 2.3.3. Uncertainty Propagation and Jacobians The uncertainty of the preintegrated measurements is propagated through an error-state Kalman filter. The linearized error-state dynamics are given by: δ z k + 1 I M U = F k δ z k I M U + G k n k (21) where δ z k I M U and n k are the error-state vector and measurement noise vector of IMU at time step k, and F k and G k are the Jacobian matrices derived from the motion model at time step k. The IMU error-state covariance is propagated as follows: P k + 1 I M U = F k P k I M U F k T + G k Q I M U G k T (22) starting from P i I M U = 0 , where Q I M U is the IMU noise covariance matrix, and P k I M U is the IMU error-state covariance matrix at time step k. The final covariance Σ i j I M U is the 9 ୍ଠ 9 submatrix of P j I M U corresponding to rotation error δ θ , velocity error δ v , and position error δ p . The IMU information matrix used in optimization is the inverse of the covariance: Ω I M U = ( Σ i j I M U ) − 1 (23) The Jacobians of the preintegrated measurements with respect to bias changes are approximated by first-order expansion: Δ R ˜ i j I M U ( b i g ) ≈ Δ R ˜ i j I M U exp ( ∂ Δ R ˜ i j I M U ∂ b g δ b i g ) Δ v ˜ i j I M U ( b i g , b i a ) ≈ Δ v ˜ i j I M U + ∂ Δ v ˜ i j I M U ∂ b g δ b i g + ∂ Δ v ˜ i j I M U ∂ b a δ b i a Δ p ˜ i j I M U ( b i g , b i a ) ≈ Δ p ˜ i j I M U + ∂ Δ p ˜ i j I M U ∂ b g δ b i g + ∂ Δ p ˜ i j I M U ∂ b a δ b i a (24) where b i g and b i a are the gyroscope and accelerometer biases at keyframes i; δ b i g and δ b i a are their corresponding errors. These Jacobians are computed recursively during preintegration and allow efficient bias updates without re-integration. 2.3.4. IMU Preintegration Residual The IMU preintegration residual is then defined as: r I M U ( z i j I M U , X ) = L o g ( ( Δ R ˜ i j I M U ( b g ) ) T ( R i w ) T R j w ) ( R i w ) T ( v j w − v i w − g w Δ t i j ) − Δ v ˜ i j I M U ( b g , b a ) ( R i w ) T ( p j w − p i w − v i w Δ t i j − 1 2 g w Δ t i j 2 ) − Δ p ˜ i j I M U ( b g , b a ) (25) where l o g ( · ) maps a rotation matrix to its corresponding rotation vector. The IMU preintegration technique decouples the relative motion from the absolute initial state, allowing efficient bias updates and avoiding repeated integration. The preintegrated measurements R ˜ b i w , v ˜ i w , p ˜ i w , along with their covariance Σ i j I M U and Jacobians with respect to biases, provide a compact and efficient representation for IMU constraints in the factor graph optimization framework. 2.4. Wheel Odometry Preintegration 2.4.1. Wheel Odometry Measurement Model Wheel odometry provides incremental displacement measurements in the robot body frame. Let Δ x ˜ k b , Δ y ˜ k b , Δ θ ˜ k b denote the raw wheel odometry measurements at time step k. The measurement model is: Δ x ˜ k b = Δ x k b + η x Δ y ˜ k b = 0 + η y Δ θ ˜ k b = Δ θ k b + η θ (26) where Δ x k b and Δ θ k b are the true displacement and heading change at time step k, and η x , η y , η θ are zero-mean Gaussian noise. The assumption Δ y ˜ k b ≈ 0 holds for planar motion on flat ground. 2.4.2. Wheel Odometry Preintegration Definition To avoid re-integration due to state changes and the associated computational cost, similar to IMU preintegration, we define the wheel odometry preintegrated measurements between keyframes i and j as: Δ x ˜ i j b = ∑ k = i j − 1 Δ x ˜ k b Δ y ˜ i j b = ∑ k = i j − 1 Δ y ˜ k b Δ θ ˜ i j b = ∑ k = i j − 1 Δ θ ˜ k b (27) where Δ x ˜ i j b and Δ y ˜ i j b denote the preintegrated forward and lateral displacements, respectively, and Δ θ ˜ i j b denotes the preintegrated heading change, all expressed in the robot body frame b . For notational convenience, we group them into a vector: Δ p ˜ i j b = Δ x ˜ i j b Δ y ˜ i j b Δ θ ˜ i j b (28) where Δ p ˜ i j b is the preintegrated relative position measurement from keyframe i to j in the robot body frame. 2.4.3. Uncertainty Propagation for Wheel Odometry Preintegration The uncertainty of the wheel odometry preintegrated measurements is propagated similarly to IMU. Define the wheel odometry error-state vector at time step k: δ z k W O = δ Δ x k b δ Δ y k b δ Δ θ k b T (29) where δ Δ x k b , δ Δ y k b , and δ Δ θ k b are the errors of the raw wheel odometry measurements at time step k. The wheel odometry error-state propagation is given by: δ z k + 1 W O = F k W O δ z k W O + G k W O n k W O (30) where F k W O and G k W O are the state transition and noise input Jacobian matrices at time step k, and n k W O is the measurement noise vector at time step k. The covariance matrix Σ i j W O is propagated by iteratively applying: P k + 1 WO = F k WO P k WO ( F k WO ) T + G k WO Q WO ( G k WO ) T (31) starting from P i W O = 0 (the covariance matrix of the wheel odometer error state at keyframes i), where Q W O is the noise covariance matrix of the wheel odometry. The wheel odometry information matrix used in optimization is the inverse of the covariance: Ω W O = ( Σ i j W O ) − 1 (32) 2.4.4. Wheel Odometry Preintegration Residual The wheel odometry preintegration residual is then defined as: r W O ( z i j W O , X ) = Δ x ˜ i j b − ( cos ( θ i ) ( x j w − x i w ) + sin ( θ i ) ( y j w − y i w ) ) Δ y ˜ i j b − ( − sin ( θ i ) ( x j w − x i w ) + cos ( θ i ) ( y j w − y i w ) ) Δ θ ˜ i j b − ( θ j w − θ i w ) (33) where ( x i w , y i w , θ i w ) and ( x j w , y j w , θ j w ) are the robot poses in the world frame at keyframes i and j, with c o s θ i and s i n θ i being the elements of the rotation matrix that transforms the global displacement into the body frame. This residual constraint ensures the consistency of the robot’s planar motion. By leveraging the drift-free nature of the wheeled odometer, it effectively suppresses horizontal integration drift in IMU integration and provides an absolute metric scale for the monocular vision system. 2.5. Tightly Coupled Nonlinear Optimization The estimated state vector X includes the position, orientation, velocity, and IMU biases for a series of keyframes. The tightly coupled global objective function is: J ( X ) = ∑ V O ρ ( r V O T Ω V O r V O ) + ∑ T a g ρ ( r T a g T Ω T a g r T a g ) + ∑ I M U ρ ( r I M U T Ω I M U r I M U ) + ∑ W O ρ ( r W O T Ω W O r W O ) (34) where ρ ( · ) is the Huber robust kernel applied to the squared Mahalanobis distance e = r T Ω r , defined as: ρ ( e ) = 1 2 e 2 , e ≤ δ δ ( e − 1 2 δ ) , e > δ (35) and δ is a tuning parameter (set to 1.0 in our experiments). The optimal state X ∗ is obtained by minimizing J ( X ) using the Gauss–Newton method. All observations are tightly fused within a factor graph optimization framework, as illustrated in Figure 4. In this factor graph, circular nodes denote the state variables to be estimated, and square factors represent the various observational constraints. Edges connect state nodes to observed factors, with their strengths determined by the corresponding uncertainty models. Through this optimization framework, the system outputs high-precision robot poses in real time. The proposed framework achieves real-time performance with the following computational complexity: sensor preintegration O ( N ) , where N is the number of sensor measurements between keyframes; visual processing O ( N p i x + N f e a t l o g N f e a t ) , where N p i x is the number of image pixels and N f e a t is the number of Oriented FAST and Rotated BRIEF (ORB) features per frame; and factor graph optimization O ( S 3 ) with bounded S (constant per iteration), where S is the dimension of the sliding window state vector. 2.6. Experimental Setup 2.6.1. Experimental Site and Equipment Experiments were conducted in a commercial caged chicken house in Raoping County, Guangdong Province, China. The house features a typical stacked cage structure containing multiple corridors, each approximately 80 m long and 1 m wide, with metal cages on both sides and a concrete floor. A manure pit is located at the end of each corridor. The experimental platform was built on a Guoxing Intelligent Safari-600T mini tracked chassis. It was equipped with two cameras (C925E, Logitech, Newark, CA, USA), an IMU (A100, Glonavin, Changsha, China), two wheel encoders (BS38H, Bode, Wuxi, China), a microcontroller (STM32F103ZET6, Alientek, Guangzhou, China), and two infrared distance sensors (WT53R, Wit Motion, Shenzhen, China). The upper-level algorithms ran on an HP RMN TPN-C122 computer (Intel Core i7-5500U, 4 GB RAM; HP, Palo Alto, CA, USA). The camera was processed at 10 Hz to balance computational load and real-time performance, which is justified by the robot’s operating speed. The camera was pre-calibrated and time-synchronized with the IMU and wheel encoder at the system level. The sensor configuration of the experimental platform is illustrated in Figure 5. 2.6.2. Data Acquisition According to the operational requirements of poultry monitoring systems, the optimal inspection speed is 0.1–0.2 m/s [ 31, 32]. Therefore, two speeds were selected: 0.116 m/s and 0.232 m/s, covering the typical speed range. The robot traveled from the corridor entrance to the end and back (one round trip). Each speed was repeated 3 times to eliminate random errors. To provide global pose constraints, AprilTags (Tag36h11, 10 cm side length) were attached to the feeder surface at 10 m intervals along the corridor (8 tags in total). The depth axis of each tag corresponds to the corridor width direction, and its lateral axis corresponds to the corridor length direction. The tag plane was kept perpendicular to the ground to ensure detection stability. The ground truth trajectory was obtained by fusing two complementary measurements. Two high-precision infrared distance sensors mounted on the left and right sides of the robot measured the distances to the feeder wall, from which the robot’s lateral position (x-coordinate) was computed via triangulation. The longitudinal position (y-coordinate) was obtained through the detection of AprilTag: when the robot passed by on its side, these tags provided an absolute reference. Fusing these two sources yielded the full planar pose of the robot as the ground truth. Data were collected using ROS Noetic (Ubuntu 20.04) and processed with Python 3.8.12. The data acquisition process is illustrated in Figure 6. 2.6.3. Evaluation Metrics The following metrics were used to evaluate the performance of the proposed CVIWM method. The lateral deviation is defined as the perpendicular distance from the robot to the corridor centerline, which directly reflects the robot’s centering accuracy and is critical for avoiding collisions with cages in the narrow corridor. The longitudinal deviation is defined as the distance error along the corridor direction, which reflects the absolute positioning accuracy and is essential for ensuring the robot stops precisely at the designated cages for reliable image acquisition. The root mean square error (RMSE) of the lateral and longitudinal deviations is also computed to provide a comprehensive measure of positioning accuracy. 3. Results 3.1. Fiducial Marker Positioning Accuracy To verify the reliability of fiducial markers as absolute pose constraints, positioning accuracy was evaluated in the depth direction (corridor width direction, distance 0.2–1.2 m). Results are shown in Figure 7. The average positioning error was 0.257 cm, with a standard deviation of 0.179 cm and RMSE of 0.313 cm. The maximum effective recognition distance reached 6.6 m. Positioning accuracy in the lateral direction (corridor length direction) was comparable, with an average error < 0.31 cm. These results demonstrate that fiducial markers provide sub-centimeter-level, high-confidence absolute pose constraints for the CVIWM method. 3.2. State Estimation Performance in Structurally Degraded Environments 3.2.1. Qualitative Trajectory Analysis The trajectory of the inertial navigation method exhibits clear error accumulation characteristics. This phenomenon primarily stems from the inherent limitations of inertial navigation systems, where heading angle errors accumulate over time. In long straight corridors, even a small heading angle deviation is progressively amplified by integration, ultimately causing severe lateral and longitudinal deviation from the true path. Since the test path involved a round trip, the forward and backward runs had opposite error directions. When switching direction at the end of the corridor, some lateral errors canceled each other out; therefore, the error growth during the round trip was mitigated compared to that of a one-way run. The trajectory of the VINS-Mono method shows severe drift and non-physical jumps (e.g., the longitudinal coordinate jumps from ∼40 m to ∼−30 m). This is attributed to the sparse and repetitive textures on both sides of the caged chicken house corridor, which cause ambiguity in visual feature matching. As a result, the system cannot obtain reliable visual motion estimates and degenerates into pure inertial integration, causing a sharp increase in drift. By incorporating wheel odometry, the VINS-Mono-WO method improves drift to some extent, but its trajectory still deviates considerably from the true path. Moreover, the positioning accuracy of VINS-Mono-WO is even lower than that of the simpler inertial navigation system. This finding indicates that in environments with degraded structural features, poorly qualified visual measurements, if not effectively screened and corrected, introduce errors that far outweigh the benefits they provide. Both of these methods produce trajectories that have completely deviated from the true path, demonstrating that in structurally degraded environments such as caged chicken houses, feature-matching-based localization methods are highly prone to failure when the environment lacks rich and stable features. By contrast, CVIWM overcomes the above limitations by incorporating the drift-free planar constraints of wheel odometry and absolute pose corrections from fiducial markers. Its estimated trajectory closely follows the actual robot path, with both lateral and longitudinal deviations remaining within small bounds, validating the effectiveness and robustness of the method in structurally degraded environments. 3.2.2. Quantitative Error Analysis Quantitative results are presented in Table 1 (0.116 m/s) and Table 2 (0.232 m/s). As speed increased, positioning errors of all methods increased, with CVIWM showing the smallest degradation. All values are averaged over three repeated trials. For the inertial navigation method, the maximum mean lateral and longitudinal deviations reached 155.214 cm and 178.798 cm, respectively (at 0.232 m/s). Such substantial errors far exceed the accuracy requirements for caged chicken house inspection. The VINS-Mono method exhibited even larger positioning errors, with maximum mean lateral and longitudinal deviations of 253.939 cm and 3782.568 cm (at 0.232 m/s), accompanied by significantly high standard deviations, indicating severe fluctuations and a lack of consistency. This effectively renders it unusable in this environment. Notably, the longitudinal deviation far exceeded the lateral deviation, further confirming the trajectory analysis conclusion: in environments with sparse and repetitive textures, visual feature matching fails, causing the system to degrade into pure inertial integration, where longitudinal errors grow rapidly. The VINS-Mono-WO method showed reduced errors (maximum mean lateral deviation of 195.106 cm and maximum mean longitudinal deviation of 1991.345 cm at 0.232 m/s) compared to VINS-Mono, but its positioning accuracy remained far inferior to that of the much simpler inertial navigation system, falling well short of practical requirements. This result further corroborates the earlier conclusion: in structurally degraded environments, poor-quality visual observations introduce errors that outweigh any corrective benefits they bring to the fusion system. The proposed CVIWM method significantly outperformed the comparison methods in terms of positioning accuracy. At 0.116 m/s, the mean lateral and longitudinal deviations were 1.863 cm and 1.516 cm, with standard deviations of 1.403 cm and 1.406 cm, respectively. At 0.232 m/s, the mean lateral and longitudinal deviations were 2.893 cm and 1.487 cm, with standard deviations of 1.814 cm and 1.506 cm, respectively. Based on these mean deviations, the overall mean positioning error at 0.116 m/s and 0.232 m/s was approximately 2.402 cm and 3.253 cm, respectively, calculated as the root mean square of the mean lateral and mean longitudinal deviations. The positioning deviations recorded each time the robot passed a sampling point are shown in Figure 10 and Figure 11. As the traveling speed increases, the lateral deviation shows a marked increase, while the longitudinal deviation remains relatively stable. This is because heading errors accumulate rapidly at higher speeds, exacerbating lateral drift. In contrast, the longitudinal deviation is primarily governed by the wheel odometry scale factor, resulting in relatively linear error growth, and periodic corrections from the fiducial markers effectively suppress longitudinal drift. In terms of RMSE, CVIWM achieves lateral/longitudinal values of 2.332/2.068 cm at 0.116 m/s and 3.414/2.116 cm at 0.232 m/s, demonstrating centimeter-level accuracy and excellent stability. The lateral RMSE increases by 46% (from 2.332 cm to 3.414 cm) when the speed doubles, which is expected as higher speeds amplify residual heading errors. Importantly, even at the higher speed of 0.232 m/s (the upper limit of the practical operating range for poultry inspection [ 31, 32]), the positioning error remains within the centimeter level. Given a typical cage width of 60–70 cm and the camera’s horizontal field of view, this accuracy ensures that the target cage stays within the central field of view, enabling reliable health monitoring. Regarding computational efficiency, the proposed CVIWM system achieved an average processing latency of 88 ms per frame (≈11 Hz), which exceeds the 10 Hz camera input rate, confirming its real-time capability for poultry house inspection tasks. 3.3. Impact of Positioning Accuracy on Poultry Health Monitoring 3.3.1. Effect on Image Acquisition Based on the camera’s parameters (70.5 cm field of view at 50 cm distance, 1280 ୍ଠ 720 resolution), the relationship between lateral deviation and pixel shift is approximately linear: a 3 cm deviation results in a 55-pixel shift, a 5 cm deviation results in a 91-pixel shift, and a 10 cm deviation results in a 182-pixel shift. When the lateral deviation exceeds 10 cm, the corresponding pixel shift surpasses 180 pixels, which is likely to cause the target cage to deviate significantly from the image center, thereby compromising the accuracy of subsequent health detection algorithms. 3.3.2. Actual Detection Performance of CVIWM After the robot stopped at the designated cages guided by the CVIWM-estimated poses, images were captured for health monitoring. Using CVIWM, the average stopping position errors were 2.8 cm (0.116 m/s) and 3.5 cm (0.232 m/s), with a maximum error of <4.5 cm. This high precision corresponds to an image shift of <83 pixels, ensuring the target cage remains well-centered. 4. Discussion 4.1. Comparison with Existing Methods Compared with existing methods, CVIWM demonstrates clear advantages in both accuracy and practicality ( Table 3). For instance, while RFID [ 37] and UWB [ 38] offer accuracies of 6 cm and 17 cm, respectively, they suffer from short reading ranges or electromagnetic interference [ 39]. Vision-based methods, such as Zhang et al.’s [ 7] approach, rely on densely distributed markers (1–5 m spacing) to achieve accuracies of 5.6–7.6 cm. The visual-marker-inertial fusion system using sliding window optimization integrates AprilTag observations to provide absolute pose corrections [ 25], effectively reducing drift in weakly textured environments. However, without wheel odometry constraints, it remains susceptible to residual drift between markers, particularly in long straight corridors. The LIO-SAM+AprilTag method [ 6] can achieve relatively high accuracy (4.1–7.0 cm). However, its performance largely depends on the geometric features extracted from the LiDAR point cloud. In geometrically degraded environments, such as the repetitive corridors of a caged chicken house, relying solely on LiDAR and IMU still makes the system prone to drift. Even with the assistance of AprilTags, point cloud degeneracy can lead to significant lateral drift. Furthermore, the high cost of LiDAR sensors limits its feasibility for large-scale deployment. In contrast, CVIWM integrates vision, IMU, wheel odometry, and AprilTags into a factor graph optimization framework. By tightly coupling these complementary sensors, it uses wheel odometry to suppress IMU drift and vision-based markers to provide absolute corrections, achieving high accuracy (2.4–3.3 cm). Moreover, it requires only sparse AprilTags (10 m spacing) and low-cost vision sensors. Even in weak-texture regions between tags (e.g., chicken coop grids), wheel odometry maintains short-term precision, offering a practical, cost-effective solution for large poultry houses. 4.2. Method Advantages and Innovations CVIWM has three main advantages: (1) wheel odometry preintegration provides drift-free planar constraints, suppressing IMU horizontal drift and providing absolute scale; (2) fiducial markers provide periodic absolute pose corrections, resetting accumulated errors; (3) structural degradation adaptability enables stable positioning in long straight corridors with sparse textures. 4.3. Engineering Value for Smart Farming The engineering value of CVIWM is threefold. First, by ensuring target cages remain in the central region of captured images (image shift < 83 pixels), it directly supports downstream poultry health monitoring tasks, achieving high detection accuracy (as detailed in Section 3.3.2). Second, it offers significant economic and operational advantages, including low-cost visual sensors (instead of LiDAR) and easy deployment with sparse markers (10 m spacing) and no pre-built map requirement. Third, it demonstrates high reliability, with no significant drift over 80 m of continuous operation. These combined attributes make CVIWM a practical and scalable solution for automated inspection in commercial poultry houses. 4.4. Limitations and Future Work While CVIWM demonstrated high positioning accuracy over an 80 m corridor, several limitations remain to be addressed. First, although the current experiments accounted for some dust and lighting variations, different lighting may still degrade performance. Thus, further investigation under more challenging conditions—wet floors, wheel slippage, heavy dust, poor lighting, occluded or damaged AprilTags, varying tag densities, and complex motions—is needed to evaluate system robustness, motivating future work on slip compensation, online calibration [ 40], and multi-modal sensing using IR-reflective or thermal markers [ 41, 42]. Second, dynamic obstacles were not considered. Integrating dynamic obstacle detection and tracking would be necessary for safe operation in human–robot shared environments [ 43]. Third, although several hours of experiments were conducted in commercial poultry houses, the long-term deployment feasibility and reliability of the system require further validation through extended field testing. In addition, the energy consumption of the overall sensor fusion and navigation framework needs to be further evaluated to achieve optimal power management. 5. Conclusions This paper proposed CVIWM, a tightly coupled state estimation method for poultry house inspection robots operating in structurally degraded environments. The method integrates visual, IMU, wheel odometry, and fiducial marker observations within a factor graph optimization framework. Specifically, wheel odometry preintegration provides drift-free planar constraints to suppress IMU horizontal drift and provide an absolute scale for monocular vision, while sparse fiducial markers (AprilTags) serve as absolute pose anchors to periodically reset accumulated errors. Experimental results in an 80 m long poultry house corridor demonstrated that CVIWM achieves centimeter-level positioning accuracy, with average positioning errors of 2.402 cm at 0.116 m/s and 3.253 cm at 0.232 m/s. This high-precision positioning ensured reliable image acquisition (image shift < 83 pixels), enabling subsequent poultry detection models to achieve 95.7% dead hen detection accuracy and 98.9% egg detection accuracy. By providing a low-cost, easy-to-deploy, and high-accuracy state estimation solution, CVIWM serves as a key technological enabler for automated inspection and smart livestock farming. Author Contributions Conceptualization, H.D.; methodology, J.J.; software, C.L.; validation, H.D.; formal analysis, H.D.; investigation, H.D.; resources, H.D.; data curation, H.D.; writing—original draft preparation, H.D.; writing—review and editing, H.D. and C.F.; visualization, H.D.; supervision, T.Z.; project administration, T.Z.; funding acquisition, T.Z. All authors have read and agreed to the published version of the manuscript. Funding This work is supported by National Key Research and Development Plan [grant No. 2021YFD1300101], State Key Laboratory of Swine and Poultry Breeding Industry (PI) research project [grant No. 2025ZQQZ-G47], the second phase of project of the Wenshi Science and Technology Innovation Center at South China Agricultural University, China. Institutional Review Board Statement The experiments were conducted in a commercial caged chicken house under standard farming conditions. The inspection robot performed autonomous navigation and image acquisition without any physical contact with or disturbance to the animals. No invasive procedures were performed, and no animals were handled or sacrificed for this study. Therefore, no specific ethical approval was required for this research. Informed Consent Statement Not applicable. Data Availability Statement The data presented in this study are available on request from the corresponding author. Acknowledgments The authors thank the Lixing Agricultural Development Co., Ltd. for providing the experimental site. Conflicts of Interest The authors declare no conflicts of interest. References Figure 1. Structurally degraded environment of a caged chicken house. ( a) Narrow and long straight corridor; ( b) mesh chicken cage with repeated textures; ( c) manure pit at the corridor end. Figure 1. Structurally degraded environment of a caged chicken house. ( a) Narrow and long straight corridor; ( b) mesh chicken cage with repeated textures; ( c) manure pit at the corridor end. Figure 2. Schematic diagram of the multi-sensor tightly coupled positioning framework of CVIWM. Figure 2. Schematic diagram of the multi-sensor tightly coupled positioning framework of CVIWM. Figure 3. The coordinate relationship in the positioning system. Figure 3. The coordinate relationship in the positioning system. Figure 4. Factor Graph Optimization Framework. Figure 4. Factor Graph Optimization Framework. Figure 5. Experimental platform and sensor configuration. Figure 5. Experimental platform and sensor configuration. Figure 6. Data acquisition process. Figure 6. Data acquisition process. Figure 7. Positioning accuracy of the fiducial marker in the depth direction. Figure 7. Positioning accuracy of the fiducial marker in the depth direction. Figure 8. Trajectory comparison of different positioning methods at 0.116 m/s. Figure 8. Trajectory comparison of different positioning methods at 0.116 m/s. Figure 9. Trajectory comparison of different positioning methods at 0.232 m/s. Figure 9. Trajectory comparison of different positioning methods at 0.232 m/s. Figure 10. Deviation at the speed of 0.116 m/s. Figure 10. Deviation at the speed of 0.116 m/s. Figure 11. Deviation at the speed of 0.232 m/s. Figure 11. Deviation at the speed of 0.232 m/s. Figure 12. Examples of poultry health monitoring results using images acquired by the CVIWM-guided robot. Figure 12. Examples of poultry health monitoring results using images acquired by the CVIWM-guided robot. Table 1. Positioning deviation statistics at 0.116 m/s. Table 1. Positioning deviation statistics at 0.116 m/s. Method Mean Lateral Deviation (cm) Lateral Deviation Std (cm) Mean Longitudinal Deviation (cm) Longitudinal Deviation Std (cm) IMU+WO 60.049 31.373 169.724 123.234 VINS-Mono 184.634 206.940 3724.088 2570.585 VINS-Mono-WO 106.807 104.005 1913.330 1233.156 CVIWM (Ours) 1.863 1.403 1.516 1.406 Table 2. Positioning deviation statistics at 0.232 m/s. Table 2. Positioning deviation statistics at 0.232 m/s. Method Mean Lateral Deviation (cm) Lateral Deviation Std (cm) Mean Longitudinal Deviation (cm) Longitudinal Deviation Std (cm) IMU+WO 155.214 109.714 178.798 127.704 VINS-Mono 253.939 202.898 3782.568 2564.384 VINS-Mono-WO 195.106 125.945 1991.345 1376.138 CVIWM (Ours) 2.893 1.814 1.487 1.506 Table 3. Performance comparison of different positioning methods. Table 3. Performance comparison of different positioning methods. Method Positioning Accuracy Remarks Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. © 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed unde

www.mdpi.com

Zum Originalartikel