Details
HD-Fusion: Hierarchical Dynamic Fusion of LiDAR-Camera for Robust 3-D Object Detection (indexed in SCI-EXPANDED and EI)
Document type: Journal article
English title: HD-Fusion: Hierarchical Dynamic Fusion of LiDAR-Camera for Robust 3-D Object Detection
Authors: Jing, Weiming; Chen, Xiyuan; Nie, Shuhan; Jiao, Zhiyuan; Ma, Jianghui
First author: Jing, Weiming
Corresponding author: Chen, XY [1]
Affiliations: [1] Southeast Univ, State Key Lab Comprehens PNT Network & Equipment T, Nanjing 210096, Peoples R China; [2] Southeast Univ, Key Lab Microinertial Instrument & Adv Nav Technol, Minist Educ, Nanjing 210096, Peoples R China; [3] Guizhou Inst Technol, Sch Aerosp Engn, Guiyang 550025, Peoples R China
First affiliation: Southeast Univ, State Key Lab Comprehens PNT Network & Equipment T, Nanjing 210096, Peoples R China
Corresponding affiliation: Chen, XY (corresponding author), Southeast Univ, State Key Lab Comprehens PNT Network & Equipment T, Nanjing 210096, Peoples R China.
Year: 2026
Journal: IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS
Indexed in: EI (accession no. 20261720593059); Scopus (accession no. 2-s2.0-105036686157); WOS: SCI-EXPANDED (accession no. WOS:001737644000001)
Funding: This work was supported in part by the National Natural Science Foundation of China under Grant 61873064, and in part by the Guizhou Provincial Key Technology R&D Program under Grant XKBF [2025] 032. Paper no. TII-26-1408.
Language: English
Keywords: Aerospace engineering; Feeds; Antennas; Filtering; Filters; Circuits; Circuits and systems; LoRa; High frequency; Location awareness; Autonomous driving; LiDAR-camera fusion; three-dimensional (3-D) object detection
Abstract: In autonomous driving, bird's-eye view (BEV) representations have emerged as the dominant approach for 3-D object detection. However, projecting 3-D objects into BEV space can lead to distant and nearby objects appearing similar in size, making it challenging to discern depth relationships between foreground and background objects. Furthermore, inadequate modeling of intermodal discrepancies and correlations hampers effective contextual integration in cross-modal fusion. To address these limitations, we propose hierarchical dynamic fusion (HD-Fusion), a novel end-to-end multimodal fusion framework consisting of a scene-level fusion (SLF) module and a contextual-level fusion (CLF) module. The SLF module fuses depth details from point cloud pillars with image features, generating BEV image representations enhanced with depth cues. The CLF module further enhances the features of LiDAR and cameras with a bidirectional cross-modal attention (BCMA) block and a discrete wavelet transform (DWT) encoder. The BCMA captures long-range interactions between LiDAR and image tokens, while the DWT separates multiscale frequency components to suppress noise and artifacts. Extensive experiments on the nuScenes benchmark show that HD-Fusion achieves 70.5% mAP and 72.9% NDS, improving over the baseline by 12.1 and 6.6 points, respectively. Additional evaluations on rainy/night subsets, simulated camera/LiDAR failures, and cross-dataset transfer from nuScenes to Lyft further demonstrate that HD-Fusion maintains superior performance on small and distant objects and exhibits strong robustness and generalization in challenging autonomous-driving scenarios.
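Illustrative note: the abstract says the DWT separates frequency components to suppress noise, but this record does not describe the paper's encoder. The sketch below is therefore not the authors' method. It assumes a single-level 1-D Haar wavelet and hard thresholding of the detail (high-frequency) coefficients, and the function names `haar_dwt_1d`, `haar_idwt_1d`, and `suppress_noise` are hypothetical.

```python
import math

_S = 1.0 / math.sqrt(2.0)  # orthonormal Haar scaling factor


def haar_dwt_1d(signal):
    """Single-level Haar DWT: split into low- and high-frequency halves."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) * _S for a, b in pairs]  # low-frequency (averages)
    detail = [(a - b) * _S for a, b in pairs]  # high-frequency (differences)
    return approx, detail


def haar_idwt_1d(approx, detail):
    """Inverse of haar_dwt_1d: rebuild the signal from both bands."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) * _S)
        out.append((a - d) * _S)
    return out


def suppress_noise(signal, threshold):
    """Zero small detail coefficients (assumed hard threshold), then invert."""
    approx, detail = haar_dwt_1d(signal)
    kept = [d if abs(d) >= threshold else 0.0 for d in detail]
    return haar_idwt_1d(approx, kept)


if __name__ == "__main__":
    noisy = [4.0, 4.2, 8.0, 7.9]
    print(suppress_noise(noisy, threshold=0.5))
```

With the sample input, both detail coefficients fall below the threshold, so each pair of samples is replaced by its mean while the large step between the two pairs is preserved. A feature-map encoder would apply the same idea in two dimensions and typically over several decomposition levels.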
