-
Lintang Matahari Hasani, Kasiyah Junus, Lia Sadita, Ayano Ohsaki, Tsuk ...
Article type: LETTER
Article ID: 2024EDL8025
Published: 2025
[Advance publication] Released: 2025/06/02
Journal
Free access
Advance publication
Learners need to progress through certain inquiry stages to experience a productive online discussion. This study analyzes the discussions of two classes that received different preparation: kit-build concept mapping (KBCM) and summary writing. Epistemic network analysis showed that the KBCM class achieved close-to-ideal connectivity between the inquiry stages.
-
Ying Liu, Yong Li, Ming Wen, Xiangwei Xu
Article type: PAPER
Article ID: 2024EDP7299
Published: 2025
[Advance publication] Released: 2025/06/02
Federated learning (FL) enables multiple organizations to collaboratively train machine learning models without revealing raw data. As a new learning paradigm, FL suffers from statistical challenges on cross-organizational non-IID data, which limit the global model's ability to perform well on each client's task. In this paper, we propose a personalized federated meta-learning algorithm (EPer-FedMeta) for heterogeneous clients. It uses q-FedAvg as the model aggregation strategy, which helps the global model fairly optimize a reasonable shared representation across the clients' personalized models, and it introduces a contrastive loss in local training to pull the representations of meta-learners closer together. Also noteworthy is the potential cold-start problem for new tasks in personalized federated learning (PFL); EPer-FedMeta simply uses CondConv to make lightweight modifications to the CNN backbone for more robust personalized model migration. Our extensive empirical evaluation on the LEAF benchmark and a real production dataset shows that EPer-FedMeta further mitigates the impact of non-IID data on FL communication costs and model accuracy. In terms of performance and optimization, EPer-FedMeta achieves the best model performance with faster convergence and lower communication overhead than leading FL optimization algorithms.
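As a rough illustration of the q-FedAvg aggregation rule the abstract builds on, the sketch below shows one server-side step; the function name, the pseudo-gradient form, and the constant L are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def q_fedavg_step(global_w, client_ws, client_losses, q=1.0, L=1.0):
    """One q-FedAvg aggregation step: clients with higher loss receive
    proportionally more weight (q tunes the fairness/accuracy trade-off;
    q = 0 recovers plain averaging of the client models)."""
    deltas = [L * (global_w - w) for w in client_ws]        # pseudo-gradients
    weights = [l ** q for l in client_losses]               # F_k^q scaling
    hs = [q * l ** (q - 1) * float(d @ d) + L * l ** q      # step-size estimate
          for l, d in zip(client_losses, deltas)]
    num = sum(wk * d for wk, d in zip(weights, deltas))
    return global_w - num / sum(hs)
```

With q set above zero, clients whose local loss is larger pull the global model more strongly, which is the fairness mechanism the abstract refers to.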
-
Makoto NAKATSUJI, Yasuhiro FUJIWARA
Article type: PAPER
Article ID: 2024OFP0009
Published: 2025
[Advance publication] Released: 2025/06/02
Developing personalized chatbots is crucial in the field of AI, particularly when aiming for dynamic adaptability similar to that of human communication. Traditional methods often overlook the importance of both the speaker's and the responder's personalities and their interaction histories, resulting in lower predictive accuracy. Our solution, INTPChat (Interactive Persona Chat), addresses this limitation. INTPChat builds implicit profiles from extensive utterance histories of both speakers and responders and updates these profiles dynamically to reflect current conversational contexts. By employing a co-attention encoding mechanism, INTPChat aligns current contexts with responses while considering historical interactions. This approach effectively mitigates data sparsity issues by iteratively shifting each context backward in time, allowing for a more granular analysis of long-term interactions. Evaluations on long-term Reddit datasets demonstrate that INTPChat significantly enhances response accuracy and surpasses the performance of state-of-the-art persona chat models.
-
Qian Zewen, HAN Zhezhe, Jiang Haoran, Zhang Ziyi, Zhang Mohan, Ma Hao, ...
Article type: LETTER
Article ID: 2025EDL8003
Published: 2025
[Advance publication] Released: 2025/06/02
Identifying the combustion conditions in power-plant furnaces is crucial for optimizing combustion efficiency and reducing pollutant emissions. Traditional image-processing methods rely heavily on prior empirical knowledge, limiting their ability to comprehensively extract features from flame images. To address these deficiencies, this study proposes a novel approach to combustion condition identification through flame imaging and a convolutional autoencoder (CAE). In this approach, the flame images are first preprocessed, the CAE is then trained to extract deep features from the flame images, and finally a softmax classifier determines the combustion condition. Experiments were carried out on a 600 MW opposed-wall boiler, and the effectiveness of the proposed method was evaluated using captured flame images. Results demonstrate that the proposed CAE-Softmax model achieves an identification accuracy of 98.2% under the investigated combustion conditions, significantly outperforming traditional models. These findings demonstrate the method's feasibility, offering an intelligent and efficient solution for enhancing the operational performance of power-plant boilers.
-
Jialong LI, Shogo MORITA, Wei WANG, Yan ZHANG, Takuto YAMAUCHI, Kenji ...
Article type: LETTER
Article ID: 2025EDL8017
Published: 2025
[Advance publication] Released: 2025/06/02
Human-robot collaboration has become increasingly complex and dynamic, highlighting the need for effective and intuitive communication. Two communication strategies for robots have been explored: (i) a global-perspective strategy that shares an overview of task progress, aimed at achieving consensus on completed and upcoming tasks; and (ii) a local-perspective strategy that shares the robot's intent, aimed at conveying the robot's immediate intentions and next actions. However, existing studies rely merely on this difference in focus to distinguish the strategies, without deeper exploration of how each affects user perceptions and responses in practice. For example, a natural question is which strategy better encourages human effort in collaboration. To this end, this paper conducts a user experiment (N=15) in a collaborative cooking scenario and provides design insights into the strengths and weaknesses of each strategy along three dimensions, to inform the design of human-sensitive communication.
-
Ziyue WANG, Yanchao LIU, Xina CHENG, Takeshi IKENAGA
Article type: PAPER
Article ID: 2025PCP0002
Published: 2025
[Advance publication] Released: 2025/06/02
Automatically reconstructing structured 3D models of real-world indoor scenes is an essential and challenging task for indoor navigation, evacuation planning, wireless signal simulation, etc. Despite the increasing demand for up-to-date indoor models, indoor reconstruction from monocular videos remains at an early stage compared with the reconstruction of outdoor scenes. Specific challenges are the complex building layouts, which require long-term video recording, and the high presence of elements such as furniture that cause clutter and occlusion. To accurately reconstruct large-scale indoor scenes with multiple rooms, this paper designs a large-scale indoor multiple-room 3D reconstruction pipeline that explores the topological relations between rooms from long-term monocular videos. First, semantic-door-detection-based video segmentation is proposed to split the video into individual rooms for separate reconstruction, avoiding global mismatching noise, and a 3D temporal trajectory is proposed to connect the rooms in the spatial domain. Second, the 3D Hough transform and principal component analysis are used to refine room boundaries from the reconstructed point clouds, which contributes to the accuracy improvement. Furthermore, an original long-term video dataset for large-scale indoor multiple-room reconstruction is constructed, containing 12 real-world videos and 4 virtual videos covering 30 rooms. Extensive experiments demonstrate that the proposed method reaches the highest performance, with a 3D IoU of 0.70, room distance accuracy of 0.87, and connectivity accuracy of 0.67, around 39% better on average than various state-of-the-art models.
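The PCA step in the boundary refinement can be pictured with a minimal sketch: the eigenvectors of the point-cloud covariance give the dominant wall directions, against which boundary planes can be aligned. The function name and the 2D toy data are illustrative assumptions, not the paper's code.

```python
import numpy as np

def principal_axes(points):
    """Return principal directions of a point cloud, sorted by decreasing
    variance; for wall points these align with the wall directions."""
    centered = points - points.mean(axis=0)
    cov = centered.T @ centered / len(points)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    return eigvecs[:, ::-1]                  # columns by descending variance

# noisy points along the x-axis: first principal axis should be ~(±1, 0)
pts = np.array([[0.0, 0.0], [1.0, 0.01], [2.0, -0.01], [3.0, 0.0]])
axis0 = principal_axes(pts)[:, 0]
```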
-
Kosuke KURIHARA, Yoshihiro MAEDA, Daisuke SUGIMURA, Takayuki HAMAMOTO
Article type: PAPER
Article ID: 2025PCP0004
Published: 2025
[Advance publication] Released: 2025/06/02
We propose a non-contact heart rate (HR) estimation method that models weak physiological blood volume pulse (BVP) signals and strong noise signals caused by background illumination. Our method integrates BVP signal extraction based on a physiological model with a flexible RGB/NIR integration scheme based on an illumination model in a unified manner. This unified framework enables accurate extraction of the BVP signal while suppressing noise derived from ambient light, and thus improves HR estimation performance. We demonstrate the effectiveness of our method through experiments on several datasets covering various illumination scenes. Our code will be available at https://github.com/kosuke-kurihara/PhysIllumHR.
-
Zhiyao SUN, Peng WANG
Article type: PAPER
Article ID: 2024EDP7289
Published: 2025
[Advance publication] Released: 2025/05/28
Mobile edge computing (MEC) faces severe challenges in achieving efficient and timely task offloading in heterogeneous network environments. While existing contract-based approaches address incentive compatibility and resource coordination, many either ignore age-of-information (AoI) constraints or suffer from high computational complexity. This paper presents an AoI-guaranteed Optimal Contract (AOC) mechanism that jointly considers information freshness and asymmetric information in MEC systems. We design a three-tier heterogeneous network architecture with non-orthogonal multiple access to enable cooperative task offloading across multiple cells and enhance spectral efficiency. Instead of a model that requires extensive training and is difficult to analyze, our proposed AOC framework uses a lightweight block coordinate descent (BCD) algorithm to derive closed-form contract solutions while ensuring incentive compatibility and individual rationality. Simulation results show that the AOC mechanism significantly improves the utility and AoI performance of the MEC server compared with existing incentive-based methods. In addition, the analysis confirms the robustness and practical deployability of the proposed framework under different system conditions.
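To see why block coordinate descent admits cheap closed-form updates, consider a toy quadratic objective; this is purely illustrative and is not the paper's contract model. Each block is minimized exactly while the other is held fixed, and the alternation contracts to the joint optimum.

```python
def bcd_toy(iters=100):
    """Minimize f(x, y) = (x-1)^2 + (y+2)^2 + 0.5*x*y by alternating exact
    (closed-form) minimization over each coordinate block:
      df/dx = 0  ->  x = 1 - 0.25*y
      df/dy = 0  ->  y = -2 - 0.25*x
    The fixed point of the alternation is x = 1.6, y = -2.4."""
    x, y = 0.0, 0.0
    for _ in range(iters):
        x = 1.0 - 0.25 * y      # exact minimizer over x with y fixed
        y = -2.0 - 0.25 * x     # exact minimizer over y with x fixed
    return x, y
```

Each sweep costs two scalar formulas rather than a full gradient solve, which is the "lightweight" property the abstract attributes to BCD.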
-
Qingxia YANG, Deng PAN, Wanlin HUANG, Erkang CHEN, Bin HUANG, Sentao W ...
Article type: PAPER
Article ID: 2024EDP7316
Published: 2025
[Advance publication] Released: 2025/05/23
Ship detection in maritime monitoring is crucial for ensuring public safety in marine environments. However, maritime surveillance faces significant challenges from weak targets (small, low-contrast objects) caused by complex environments and long distances. To address these challenges, we propose YOLO-MSD, a maritime surveillance detection model based on YOLOv8. In YOLO-MSD, Receptive-Field Attention Convolution (RFAConv) replaces standard convolution, learning attention maps via receptive-field interaction to enhance detail extraction and reduce information loss. The C2f module in the neck integrates Omni-Dimensional Dynamic Convolution (ODConv), which dynamically adjusts convolution kernel parameters to effectively capture contextual information, thereby achieving superior multi-scale feature fusion. We introduce a detection head dedicated to small objects to enhance detection accuracy. Furthermore, to address detection-box quality imbalance, we employ Wise-IoU for the bounding-box regression loss, enhancing multi-scale target localization and accelerating convergence. The model achieves precision, recall, and mean average precision (mAP50) of 93.0%, 90.05%, and 95.0%, respectively, on the self-constructed Maritime Vessel Surveillance Dataset (MVSD), effectively meeting the requirements of maritime target detection. We further conduct comparative experiments on the public McShips dataset, demonstrating YOLO-MSD's broad applicability to ship detection.
-
Mitsuhiro WATANABE, Go HASEGAWA
Article type: PAPER
Article ID: 2025EDP7014
Published: 2025
[Advance publication] Released: 2025/05/23
As the Internet grows in scale and diversity, traditional end-to-end (E2E) congestion control faces various problems, such as low throughput on long-delay networks and unfairness among flows in different network situations. In this paper, we propose a novel congestion control architecture called in-network congestion control (NCC). Specifically, by introducing one or more nodes (NCC nodes) on an E2E network path, we divide the path into multiple sub-paths and maintain a congestion-control feedback loop on each sub-path. On each sub-path, a specialized congestion control algorithm can be applied according to its network characteristics. This architecture provides various advantages over traditional E2E congestion control, such as higher data transmission throughput, better per-flow fairness, and incremental deployability. In this paper, we describe NCC's advantages and challenges, and clarify its potential performance through evaluations. We reveal that E2E throughput improves by as much as 159% just by introducing NCC nodes. Furthermore, increasing the number of NCC nodes improves E2E throughput and fairness among flows by up to 258% and 151%, respectively.
-
Guanghui CAI, Junguo ZHU
Article type: PAPER
Article ID: 2024EDP7292
Published: 2025
[Advance publication] Released: 2025/05/15
Deep learning has transformed Neural Machine Translation (NMT), but the complexity of these models makes them hard to interpret, thereby limiting improvements in translation quality. This study explores the widely used Transformer model, utilizing linguistic features to clarify its inner workings. By incorporating three linguistic features—part-of-speech, dependency relations, and syntax trees—we demonstrate how the model's attention mechanism interacts with these features during translation. Additionally, we improved translation quality by masking nodes that were identified to have negative effects. Our approach bridges the complex nature of NMT with clear linguistic knowledge, offering a more intuitive understanding of the model's translation process.
-
Shuhei YAMAMOTO, Yasunori AKAGI, Tomu TOMINAGA, Takeshi KURASHIMA
Article type: PAPER
Article ID: 2024EDP7248
Published: 2025
[Advance publication] Released: 2025/05/14
Present bias, the cognitive bias that prioritizes immediate rewards over future ones, is considered one of the factors that can hinder goal achievement. Estimating present bias enables the development of effective intervention strategies for behavioral change. This paper proposes a novel method for estimating present bias from behavior history captured by wearable devices. We employ a Transformer due to its proficiency in learning relationships within sequential data such as behavioral history, which includes continuous data (e.g., heart rate) and event data (e.g., sleep onset). To allow the Transformer to capture behavior patterns affected by present bias, we introduce two novel architectures for effectively processing the timestamp information of continuous and event data in behavioral history: the temporal encoder (TE) and the event encoder (EE). TE discerns the periodic characteristics of continuous data, while EE examines temporal interdependencies in the event data. These encoders enable our proposed model to capture temporally (ir)regular behavioral patterns associated with present bias. Our experiments using the behavior history logs of 257 subjects collected over 28 days demonstrate that our method estimates the subjects' present bias accurately.
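One common way an encoder can expose periodic structure in timestamped wearable data is via sin/cos features of the timestamp at several periods. The sketch below is a generic stand-in for that idea, not the paper's TE architecture; the period choices are illustrative.

```python
import numpy as np

def periodic_time_features(t_seconds, periods=(3600, 86400)):
    """Map timestamps to sin/cos pairs at each period (hourly, daily here),
    so a downstream model can learn phase-dependent behavior patterns."""
    t = np.asarray(t_seconds, dtype=float)
    feats = []
    for p in periods:
        angle = 2.0 * np.pi * (t % p) / p
        feats.extend([np.sin(angle), np.cos(angle)])
    return np.stack(feats, axis=-1)   # shape (..., 2 * len(periods))
```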
-
Shrey SINGH, Prateek KESERWANI, Katsufumi INOUE, Masakazu IWAMURA, Par ...
Article type: PAPER
Article ID: 2024EDP7297
Published: 2025
[Advance publication] Released: 2025/05/14
Sign language recognition (SLR) from video is a challenging problem. For SLR, the I3D network, originally proposed for action recognition, is the best-performing model. However, action recognition and SLR are inherently different problems; there is therefore room to adapt the network to the task-specific features of SLR to achieve better performance. In this work, we revisit the I3D model and extend it in three essential design aspects: a better inception module, named the dilated inception module (DIM), and an attention-based temporal attention module (TAM), to identify the essential features of signs. In addition, we propose eliminating a loss function that deteriorates performance. The proposed method has been extensively validated on the public WLASL and MS-ASL datasets. It outperforms state-of-the-art approaches on the WLASL dataset and produces competitive results on the MS-ASL dataset, though the MS-ASL results are indicative only, because the original data are unavailable. The Top-1 accuracies of the proposed method on WLASL100 and MS-ASL100 are 79.08% and 82.78%, respectively.
-
Olivier NOURRY, Masanari KONDO, Shinobu SAITO, Yukako IIMURA, Naoyasu ...
Article type: LETTER
Article ID: 2025EDL8005
Published: 2025
[Advance publication] Released: 2025/05/14
[Background] Throughout their lifetime, open-source software systems naturally attract new contributors and lose existing ones. Not all OSS contributors are equal, however: some contributors within a project possess significant knowledge and expertise of the codebase (i.e., core developers). When investigating a project's ability to attract new contributors and how often a project loses contributors, it is therefore important to take the contributors' expertise into account. [Goal] Since core developers are vital to a project's longevity, we aim to find out: can OSS projects attract new core developers, and how often do OSS projects lose core developers? [Results] To investigate core-developer contribution patterns, we calculate the truck factor (TF, also known as the bus factor) of over 36,000 OSS projects and investigate how often TF developers join or abandon them. We find that 89% of the studied projects have experienced losing their core development team at least once. Our results also show that in 70% of cases, this abandonment happens within the first three years of a project's life. We also find that most OSS projects rely on a single core developer to maintain development activities. Finally, we find that only 27% of the abandoned projects were able to attract at least one new TF developer.
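A minimal greedy truck-factor computation looks like the following. Several heuristics exist in the literature; the ownership rule used here ("a file belongs to its most frequent committer") and the 50% coverage threshold are simplifying assumptions, not necessarily the letter's exact algorithm.

```python
from collections import Counter, defaultdict

def truck_factor(commits, coverage=0.5):
    """commits: iterable of (author, filename) pairs.
    Greedily remove the top file owners until the files they still own drop
    to <= `coverage` of all files; the number removed is the truck factor."""
    per_file = defaultdict(Counter)
    for author, fname in commits:
        per_file[fname][author] += 1
    owner = {f: c.most_common(1)[0][0] for f, c in per_file.items()}
    owned = Counter(owner.values())          # files owned per author
    n_files, removed, tf = len(owner), set(), 0
    while sum(n for a, n in owned.items() if a not in removed) > coverage * n_files:
        top = max((a for a in owned if a not in removed), key=owned.__getitem__)
        removed.add(top)
        tf += 1
    return tf
```

A single-owner project yields a truck factor of 1, matching the observation above that most projects depend on one core developer.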
-
Xingxin WAN, Peng SONG, Siqi FU, Changjia WANG
Article type: LETTER
Article ID: 2025EDL8020
Published: 2025
[Advance publication] Released: 2025/05/14
In ideal facial expression recognition (FER) tasks, the training and test data are assumed to share the same distribution. However, in reality, they are often sourced from different domains, which follow different feature distributions and would seriously impair the recognition performance. In this letter, we present a novel Dynamic Graph-Guided Domain-Invariant Feature Representation (DG-DIFR) method, which addresses the issue of distribution shifts across different domains. First, we learn a robust common subspace to minimize the data distribution differences, facilitating the extraction of invariant feature representations. Concurrently, the retargeted linear regression is employed to enhance the discrimination of the proposed model. Furthermore, a maximum entropy based dynamic graph is further introduced to maintain the topological structure information in the low-dimensional subspace. Finally, numerous experiments conducted on four benchmark datasets confirm the superiority of the proposed method over state-of-the-art methods.
-
Shunya ISHIKAWA, Toru NAKASHIKA
Article type: PAPER
Article ID: 2025EDP7029
Published: 2025
[Advance publication] Released: 2025/05/14
Recent research in chord recognition has utilized machine learning models. However, few models adequately consider harmonic co-occurrence, a well-known musical feature. Since the harmonic structure is complex and varies with instrument and pitch, the model itself needs to consider harmonics explicitly, but few such methods exist. We propose the classification semi-restricted Boltzmann machine (CSRBM), a machine learning model that can explicitly consider the co-occurrence of any two pitches. A model parameter learns the co-occurrence function, enabling chord recognition that flexibly accounts for the harmonic structure. We show how to incorporate the structure as prior knowledge by placing a prior distribution on this parameter. We also propose the weight-sharing CSRBM (WS-CSRBM), an extension of the CSRBM that considers time series: it arranges several CSRBMs in parallel, one per frame under consideration, and shares some of their parameters for efficiency. Experimental results show that the recognition accuracies of the proposed methods outperform that of a conventional method that considers the co-occurrence of some harmonics. The effectiveness of the CSRBM parameter that learns pitch co-occurrence, of the prior distribution placed on it, and of the parameter sharing in WS-CSRBM is also confirmed.
-
Koji ABE, Ryoma KITANISHI, Hitoshi HABE, Masayuki OTANI, Nobukazu IGUC ...
Article type: PAPER
Article ID: 2024EDP7282
Published: 2025
[Advance publication] Released: 2025/05/07
At fish farms and fish-farming facilities, the number of fish is continuously monitored from hatching until shipment. In particular, whenever hatchery-produced juvenile fish are transferred from one indoor aquaculture tank to another, the fish farmers managing them must manually count thousands of fish, which places a significant burden on them. This paper presents an automated system for counting hatchery-produced juvenile fish at fish-farming facilities. The system aims to serve as a foundational technology for aquaculture production management, supporting sustainable production through data-driven aquaculture. In the proposed system, a slide is set up with a video camera positioned above it to capture the slide's surface. The flow of juvenile fish carried by water down the slide is recorded, and the number of juvenile fish captured in the video is counted. In every video frame, a starting line and an ending line are defined perpendicular to the direction of fish movement, and fish regions are tracked between these lines. The count is increased by one when a fish region crosses the starting line. Each fish region is then tracked across frames, and the count is increased when a fish region in which an occlusion occurred between multiple fish separates again. Under a custom-built recording setup, experiments were conducted with 10 videos of approximately 200 black medaka being released down the slide, and 2 videos of thousands of hatchery-produced juvenile fish released down the slide, recorded at an aquaculture facility. The results indicate that the proposed system counts fish accurately in most cases, even in the presence of occlusions.
-
Congda MA, Tianyu ZHAO, Manabu OKUMURA
Article type: PAPER
Article ID: 2024EDP7326
Published: 2025
[Advance publication] Released: 2025/05/07
Due to biases inherently present in pre-training data, current pre-trained large language models (LLMs) ubiquitously manifest the same biases. Because this bias influences LLM outputs across various tasks, it hampers the widespread deployment of LLMs. We propose a simple method that utilizes structured knowledge to alleviate this issue, aiming to reduce the bias embedded within LLMs and to ensure they maintain an encompassing perspective when used in applications. Experimental results indicate that our method debiases both existing autoregressive and masked language models effectively. Additionally, it ensures that LLM performance on downstream tasks remains uncompromised. Importantly, our method obviates the need for training from scratch, offering enhanced scalability and cost-effectiveness.
-
Bin YANG, Mingyuan LI, Yuzhi XIAO, Haixing ZHAO, Zhen LIU, Zhonglin YE
Article type: PAPER
Article ID: 2024EDP7152
Published: 2025
[Advance publication] Released: 2025/04/24
Existing graph neural network architectures usually process graph data at a single scale, which leads to information loss and oversimplification. To address this problem, this paper proposes a novel graph neural network framework, M2GNN, which aims to enhance feature learning on graph-structured data through multi-scale fusion and an attention mechanism. In M2GNN, each channel handles graph features at a different scale, and multi-scale fusion integrates local and global information to capture features at different levels of the graph structure. The features learned by each channel are then weighted and fused with an attention mechanism to extract the most representative feature representation. Experimental results show that, compared with traditional graph neural network approaches, M2GNN improves performance by 0.70% to 54.14%, 0.34% to 54.31%, and 0.68% to 54.40% on node classification tasks with different label coverages, verifying the effectiveness of the multi-channel and multi-scale fusion strategies.
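The attention-weighted fusion of per-channel features can be sketched as follows. The scoring rule used here (mean activation) is a stand-in for M2GNN's learned attention scorer, which the abstract does not specify.

```python
import numpy as np

def attention_fuse(channel_feats):
    """Softmax-weight a list of per-channel feature matrices and sum them.
    channel_feats: list of arrays with identical shape (num_nodes, dim)."""
    scores = np.array([f.mean() for f in channel_feats])  # proxy attention scores
    w = np.exp(scores - scores.max())                     # stable softmax
    w /= w.sum()
    return sum(wi * f for wi, f in zip(w, channel_feats))
```

When every channel carries the same features, the softmax weights are uniform and the fusion reduces to that shared feature map, which is a useful sanity check.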
-
Chuanyang LIU, Jingjing LIU, Yiquan WU, Zuo SUN
Article type: PAPER
Article ID: 2024EDP7265
Published: 2025
[Advance publication] Released: 2025/04/24
As a common type of defect, rust on power components is one of the important potential hazards endangering the safe operation of transmission lines. How to quickly and accurately find and repair rusted power components is an urgent problem in power inspection. To address this problem, this study proposes Rust-Defect YOLO (RD-YOLO) for detecting rust defects on the power components of transmission lines. First, the Coordinate Channel Attention Residual Module (CCARM) is proposed to improve multi-scale detection precision. Second, the Receptive Field Block (RFB) and the Efficient Convolutional Block Attention Module (ECBAM) are introduced into the Path Aggregation Network (PANet) to strengthen the fusion of deep and shallow features. Finally, the contrast sample strategy and the focal loss function are adopted to train and optimize RD-YOLO, and experiments are carried out on a self-built dataset. The experimental results show that the average precision of rust-defect detection by RD-YOLO reaches 95%, 9% higher than that of the original YOLOX. Comparative experiments demonstrate that RD-YOLO performs excellently in power-component identification and rust-defect detection, and has broad application prospects for the future automatic visual inspection of transmission lines.
-
Yuewei ZHANG, Huanbin ZOU, Jie ZHU
Article type: LETTER
Article ID: 2024EDL8099
Published: 2025
[Advance publication] Released: 2025/04/23
Multi-resolution spectrum feature analysis has demonstrated superior performance over traditional single-resolution methods in speech enhancement. However, previous multi-resolution-based methods typically have limited use of multi-resolution features, and some suffer from high model complexity. In this paper, we propose a more lightweight method that fully leverages the multi-resolution spectrum features. Our approach is based on a convolutional recurrent network (CRN) and employs a low-complexity multi-resolution spectrum fusion (MRSF) block to handle and fuse multi-resolution noisy spectrum information. We also improve the existing encoder-decoder structure, enabling the model to extract and analyze multi-resolution features more effectively. Furthermore, we adopt the short-time discrete cosine transform (STDCT) for time-frequency transformation, avoiding the phase estimation problem. To optimize our model, we design a multi-resolution STDCT loss function. Experiments demonstrate that the proposed multi-resolution STDCT-based CRN (MRCRN) achieves excellent performance and outperforms current advanced systems.
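The short-time DCT used for the time-frequency transform can be sketched directly; unlike the STFT, it yields a real-valued spectrum, so no phase needs to be estimated. The frame length, hop size, and Hann window below are illustrative choices, not the paper's settings.

```python
import numpy as np

def stdct(x, frame_len=64, hop=32):
    """Short-time DCT-II: a real-valued time-frequency representation."""
    n = np.arange(frame_len)
    k = np.arange(frame_len)[:, None]
    basis = np.cos(np.pi * (n + 0.5) * k / frame_len)   # DCT-II basis matrix
    win = np.hanning(frame_len)
    frames = [x[i:i + frame_len] * win
              for i in range(0, len(x) - frame_len + 1, hop)]
    return np.stack([basis @ f for f in frames], axis=1)  # (freq, time)
```

Running the transform at several frame lengths gives the multi-resolution spectra that an MRSF-style block would fuse.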
-
Trung MINH BUI, Jung-Hoon HWANG, Sewoong JUN, Wonha KIM, DongIn SHIN
Article type: PAPER
Article ID: 2024EDP7261
Published: 2025
[Advance publication] Released: 2025/04/23
This paper develops a grasp pose detection method that achieves high success rates in real-world industrial environments where elongated objects are densely cluttered. Conventional Vision Transformer (ViT)-based methods capture fused feature maps, which successfully encode comprehensive global object layouts, but these methods often suffer from spatial detail reduction. Therefore, they predict grasp poses that could efficiently avoid collisions, but are insufficiently precisely located. Motivated by these observations, we propose Oriented Region-based Vision Transformer (OR-ViT), a network that preserves critical spatial details by extracting a fine-grained feature map directly from the shallowest layer of a ViT backbone and also understands global object layouts by capturing the fused feature map. OR-ViT decodes precise grasp pose locations from the fine-grained feature map and integrates this information into its understanding of global object layouts from the fused map. In this way, the OR-ViT is able to predict accurate grasp pose locations with reduced collision probabilities.
Extensive experiments on the public Cornell and Jacquard datasets, as well as on our customized elongated-object dataset, verify that OR-ViT achieves competitive performance on both public and customized datasets when compared to state-of-the-art methods.
-
Huayang Han, Yundong Li, Menglong Wu
Article type: LETTER
Article ID: 2025EDL8004
Published: 2025
[Advance publication] Released: 2025/04/22
Building damage assessment (BDA) plays a crucial role in accelerating humanitarian relief efforts during natural disasters. Recent studies have shown that the state-space-model-based Mamba architecture performs strongly across various natural language processing tasks. In this paper, we propose a new model, OS-Mamba, which utilizes an Overall-Scan Convolution Module (OSCM) for multidimensional global modeling of image backgrounds, enabling comprehensive capture and analysis of large spatial features from various directions and thereby enhancing the model's understanding and performance in complex scenes. Extensive experiments on the xBD dataset demonstrate that our proposed OS-Mamba model outperforms current state-of-the-art solutions.
-
Hyunsik YOON, Yon Dohn CHUNG
Article type: LETTER
Article ID: 2024EDL8071
Published: 2025
[Advance publication] Released: 2025/04/17
The execution time of an Apache Spark application is heavily influenced by its configuration settings. Accordingly, Bayesian optimization (BO) is commonly used for automated tuning, typically with the Expected Improvement (EI) acquisition function. However, existing works have not empirically compared EI's performance with that of other acquisition functions. In this paper, we show that EI may not work well for Spark applications because their search space is huge compared with other optimization problems. In addition, we demonstrate the performance of BO based on Probability of Improvement (PI), which achieves exploration via rich random initialization and exploitation via the PI acquisition function. Through experimental evaluations, we show that PI-based BO outperforms EI-based BO in both time-to-optimum and optimization cost.
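The two acquisition functions compared in the letter have simple closed forms under a Gaussian posterior. This sketch assumes a minimization problem; `xi` is the usual exploration margin, and the function names are illustrative.

```python
import math

def pi_acquisition(mu, sigma, best, xi=0.01):
    """Probability of Improvement: P(f(x) < best - xi) for f(x) ~ N(mu, sigma^2)."""
    z = (best - xi - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ei_acquisition(mu, sigma, best, xi=0.01):
    """Expected Improvement over (best - xi) for the same Gaussian posterior."""
    z = (best - xi - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best - xi - mu) * cdf + sigma * pdf
```

PI counts only the probability of any improvement, while EI also weighs its magnitude; in a huge search space, PI's sharper exploitation is the behavior the letter leverages.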
-
Rihito SHODA, Seiji MIYOSHI
Article type: LETTER
Article ID: 2024EDL8105
Published: 2025
[Advance publication] Released: 2025/04/17
Anomaly detection is essential in a wide range of fields. In this study, we focus on an Efficient GAN applied to anomaly detection, and aim to improve its performance by random erasing data augmentation and enhancing the loss function to incorporate mapping consistency. Experiments using images of normal lemons and damaged lemons reveal that the proposed method significantly improves the anomaly detection performance of Efficient GAN.
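Random erasing itself is straightforward to sketch. The area-fraction range and zero fill below are illustrative defaults, not the letter's exact augmentation settings.

```python
import numpy as np

def random_erase(img, rng, area_frac=(0.02, 0.2), fill=0.0):
    """Overwrite a random rectangle covering a random fraction of the image,
    so the model (here, the Efficient GAN) learns to tolerate occlusions."""
    h, w = img.shape[:2]
    out = img.copy()
    target_area = rng.uniform(*area_frac) * h * w
    eh = max(1, min(h, int(target_area ** 0.5)))   # erased height
    ew = max(1, min(w, int(target_area / eh)))     # erased width
    y = rng.integers(0, h - eh + 1)
    x = rng.integers(0, w - ew + 1)
    out[y:y + eh, x:x + ew] = fill
    return out
```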
-
Duc-Dung NGUYEN
Article type: LETTER
Article ID: 2024EDL8107
Published: 2025
[Advance publication] Released: 2025/04/17
Compared to general object detection problems, the detection of mathematical expressions (MED) in document images has its own challenges, such as the small size of inline formulas, the rich set of mathematical symbols, and the similarity between variables and normal text characters. To deal with these challenges, we transform the multi-class MED task into a multi-label semantic segmentation problem. With a basic encoder-decoder structure of 3.9 million parameters trained from scratch, our proposed MEDNet model achieves top detection performance on three public datasets: TFD2019, Marmot, and IBEM2021. MEDNet is especially effective at detecting small formulas, achieving F1 scores of 95.40% for inline formulas and 95.82% for all expressions on the test set of the IBEM2021 competition data.
-
Ruidong CHEN, Baohua QIANG, Xianyi YANG, Shihao ZHANG, Yuan XIE
Article type: PAPER
Article ID: 2024EDP7279
Publication year: 2025
[Advance publication] Release date: 2025/04/17
Image-text retrieval (ITR) aims at querying data of one modality with data of another modality. The main challenge is mapping images and texts into a common space. Although existing methods obtain excellent performance on ITR tasks, they suffer from weak information interaction and insufficient capture of deeper associative relationships. To address these problems, we propose CDISA, a Cross-modal Deep Interaction and Semantic Aligning method that combines a vision-language pre-training model with semantic feature extraction capabilities. Specifically, we first design a cross-modal deep interaction module that enhances the interaction of image and text features by performing deep interaction matching computations. Second, to align the image and text features, bidirectional cosine matching is proposed to improve the differentiation of bimodal data within the feature space. We conduct extensive experimental evaluations against recent state-of-the-art ITR methods on three datasets: Wikipedia, Pascal-Sentence, and NUS-WIDE.
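The bidirectional matching mentioned here operates on similarity scores computed in both retrieval directions. A minimal sketch of the underlying computation (names are ours; the paper's exact matching objective is not reproduced):

```python
import math

def cosine(u, v):
    # Cosine similarity between two feature vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def bidirectional_cosine_scores(image_feats, text_feats):
    # Image-to-text and text-to-image similarity matrices over a shared
    # feature space; an alignment loss would push matched pairs toward 1
    # in both directions.
    i2t = [[cosine(i, t) for t in text_feats] for i in image_feats]
    t2i = [[cosine(t, i) for i in image_feats] for t in text_feats]
    return i2t, t2i
```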
-
Xichang CAI, Jingxuan CHEN, Ziyi LIU, Menglong WU, HongYang GUO, Xueji ...
Article type: LETTER
Article ID: 2024EDL8085
Publication year: 2025
[Advance publication] Release date: 2025/04/14
In recent years, convolutional recurrent neural networks (CRNNs) have achieved notable success in sound event detection (SED) tasks by leveraging the strengths of convolutional neural networks (CNNs) and recurrent neural networks (RNNs). However, existing models still face limitations in the temporal dimension, resulting in suboptimal temporal localization accuracy for SED. To address this issue, we design a model called Temporal Enhanced Full-Frequency Dynamic Convolution (TEFFDConv). This model incorporates both temporal and frequency attention mechanisms with full-dynamic convolution, enhancing the model's ability to localize sound events at the frame level. Experimental results demonstrate that our proposed model significantly improves PSDS1, CB-F1, and IB-F1, marking a notable advancement over similar methods, while PSDS2 also improves over most methods. These results demonstrate the superior temporal localization of the proposed method, as well as its stronger event classification performance.
-
ZiYi CHEN, XiaoRan HAO, Ming CHEN, Mao NI, Yi HENG ZHANG
Article type: LETTER
Article ID: 2024EDL8098
Publication year: 2025
[Advance publication] Release date: 2025/04/14
Object detection from a UAV perspective faces challenges such as object occlusion, unclear boundaries, and small target sizes, resulting in reduced detection accuracy. Additionally, traditional object detection algorithms have large parameter counts, making them unsuitable for resource-constrained edge devices. To address these issues, we propose a lightweight small-object detection algorithm, Sky-YOLO. Specifically, we introduce the MSFConv multi-scale feature map fusion convolution module into the backbone network to enhance feature extraction capability. The Neck part is replaced with the L-BiFPN module to reduce the parameter count and strengthen feature fusion between layers. Additionally, based on the characteristics of UAV imagery, we incorporate the WIoU loss function, enabling efficient detection of blurred and occluded targets. Experimental results show that the Sky-YOLO model, with 60% fewer parameters than the original model, achieves 39.7% accuracy on the VisDrone2019 validation set, a 6.7% improvement over the original model.
-
Ye TIAN, Mei HAN, Jinyi ZHANG
Article type: PAPER
Article ID: 2024EDP7246
Publication year: 2025
[Advance publication] Release date: 2025/04/14
This paper proposes a line segment detection method based on false-peak suppression and a local Hough transform. It effectively suppresses the impact of noise in binary images on line segment detection, solves the problems of short-line false detection, missed detection, and over-segmentation, and robustly acquires line segment features in pantograph-catenary panoramic images. In tests on an actual pantograph-catenary panoramic image dataset, comparisons of the detection accuracy of line segments and contact points show that the proposed method reduces missed detections while improving the detection accuracy of the pantograph-catenary contact point. Moreover, tests on the public YorkUrban dataset show that the proposed method achieves the best results in accuracy and processing speed, demonstrating strong generalization to high-quality natural images.
-
Jinsoo SEO
Article type: LETTER
Article ID: 2024EDL8096
Publication year: 2025
[Advance publication] Release date: 2025/04/07
In many applications, data imbalance and annotation challenges limit the size of training datasets, hindering the ability of deep neural networks to fully leverage their representational capacity. Data augmentation is a widely used countermeasure that generates additional training samples by manipulating existing data. This paper investigates spectral-domain data augmentation methods specifically for the cover song identification task, enabling on-the-fly augmentation with minimal computational overhead. We explore various spectral modifications and mixing techniques, applying them directly in the frequency domain, and evaluate their effectiveness on two cover song identification datasets. Among the augmentation methods tested, a mixing approach involving cut-and-paste operations in the spectral domain achieved the highest accuracy, demonstrating the potential of spectral augmentations to enhance the performance of neural networks for cover song identification.
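The cut-and-paste mixing described here amounts to splicing a block of spectral frames from one sample into another. A minimal sketch on list-of-frames "spectrograms" (the function name and interface are ours, not the paper's):

```python
import random

def spectral_cut_and_paste(spec_a, spec_b, width, rng=None):
    """Copy a contiguous block of `width` spectral frames from spec_b
    into spec_a at a random position.

    spec_a, spec_b: lists of frames (each frame a list of magnitudes),
    assumed to have the same shape. Returns a new augmented spectrogram.
    """
    rng = rng or random.Random()
    n = len(spec_a)
    start = rng.randrange(0, n - width + 1)
    out = [frame[:] for frame in spec_a]          # deep-enough copy
    for i in range(start, start + width):
        out[i] = spec_b[i][:]                     # paste frames from spec_b
    return out
```

Because the splice happens on precomputed spectra, no extra time-domain resynthesis is needed, which is what makes on-the-fly augmentation cheap.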
-
Mami TAKEMOTO, Kousei HAYASHI, Sunao HARA, Toru YAMASHITA, Masanobu AB ...
Article type: PAPER
Article ID: 2024EDP7285
Publication year: 2025
[Advance publication] Release date: 2025/04/04
Parkinson's disease (PD) is an intractable neurological disease that affects approximately 100-150 per 100,000 people in Japan, with more than 95% of these patients aged 60 years or older. The Hoehn and Yahr staging scale (H-Y scale) [1] and the Unified Parkinson's Disease Rating Scale (UPDRS) score [2] are typical indicators of PD severity, but subjective aspects cannot be completely excluded, because there are limited means to quantitatively measure and evaluate gait as a motor symptom of PD. In recent years, however, human movement has become measurable in daily life with wearable and other devices. In this paper, we take this wearable-device measurement one step further and aim to apply insole-type pressure sensors to medical care. We collected data from two patients with PD at H-Y stage 2 and nine patients with PD at H-Y stage 3, then generated time-series patterns representing their gait. By analyzing these gait patterns, we found that the sum of the three toe sensors and the sum of the two heel sensors provided stable data. The overlap time of the two summed sensor values and the deviations of each summed sensor value were defined as features. Using these features, we propose PD severity estimation based on the k-means method. Experimental results show that Yahr 3 patients and healthy individuals are distinguished with high accuracy, but Yahr 2 patients are often classified as healthy individuals.
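The overlap-time feature described above can be computed directly from the two summed pressure signals. A minimal sketch, assuming a fixed sampling interval and a contact threshold (both are our assumptions, not the paper's exact parameters):

```python
def overlap_time(toe_sum, heel_sum, threshold, dt):
    """Duration (seconds) during which both the summed toe pressure and
    the summed heel pressure exceed a contact threshold.

    toe_sum, heel_sum: equal-length time series of summed sensor values,
    sampled every dt seconds.
    """
    count = sum(1 for t, h in zip(toe_sum, heel_sum)
                if t > threshold and h > threshold)
    return count * dt
```

A long overlap indicates a flat-footed stance phase in which toes and heel load simultaneously, which is why it can separate gait patterns.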
-
Kazuki SUNAGA, Keisuke SUGIURA, Hiroki MATSUTANI
Article type: PAPER
Article ID: 2024EDP7319
Publication year: 2025
[Advance publication] Release date: 2025/04/04
Recently, graph structures have been utilized in IoT (Internet of Things) environments such as network anomaly detection, smart transportation, and smart grids. A graph embedding is a representation of a graph as a fixed-length, low-dimensional vector that concisely captures the characteristics of the graph. node2vec is a well-known algorithm for obtaining such an embedding by sampling neighboring nodes on a given graph with a random walk technique. However, the original node2vec algorithm relies on conventional batch training with the backpropagation algorithm; that is, the training data must be retained to retrain the model, which makes it unsuitable for real-world applications where the graph structure changes after deployment. To address changes in graph structure after IoT devices are deployed in edge environments, this paper proposes a combination of an online sequential training algorithm and node2vec. The proposed model is implemented on an FPGA (Field-Programmable Gate Array) device for efficient sequential training. The FPGA implementation achieves up to a 205.25x speedup over the original model on an ARM Cortex-A53 CPU. We also evaluate the proposed model on the sequential training task from various perspectives. For example, evaluation results on dynamic graphs show that while the accuracy of the original model decreases, the proposed sequential model obtains better graph embeddings that maintain higher accuracy even when the graph structure changes. In addition, the FPGA implementation is evaluated in terms of power consumption, and the results show that it significantly improves power efficiency compared to CPU and embedded GPU implementations.
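The node2vec sampling step that the paper builds on uses a second-order random walk biased by a return parameter p and an in-out parameter q. A minimal sketch of one such walk (a standard description of node2vec sampling, not the authors' FPGA implementation):

```python
import random

def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """One biased random walk as in node2vec.

    adj: dict node -> list of neighbors (unweighted graph).
    p: return parameter; q: in-out parameter.
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = adj[cur]
        if not nbrs:
            break
        if len(walk) == 1:
            walk.append(rng.choice(nbrs))  # first step is unbiased
            continue
        prev = walk[-2]
        weights = []
        for x in nbrs:
            if x == prev:            # step back to the previous node
                weights.append(1.0 / p)
            elif x in adj[prev]:     # stays at distance 1 from prev
                weights.append(1.0)
            else:                    # moves outward (distance 2)
                weights.append(1.0 / q)
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk
```

Low q biases walks outward (DFS-like, structural roles), while low p keeps them local (BFS-like, communities).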
-
Mingyang XU, Ao ZHAN, Chengyu WU, Zhengqiang WANG
Article type: LETTER
Article ID: 2024EDL8094
Publication year: 2025
[Advance publication] Release date: 2025/04/02
Recognizing driver fatigue is essential for improving road safety. Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have been applied to identify driver states. However, these models frequently face challenges, including large parameter counts and low detection effectiveness. To address these challenges, we propose the Dual-Lightweight-Swin-Transformer (DLS) for driver drowsiness detection. We also propose the Spatial-Temporal Fusion Model (STFM) and the Global Saliency Fusion Model (GSFM), where STFM fuses spatial-temporal features and GSFM fuses features from different layers of STFM to enhance detection efficiency. Simulation results show that DLS increases accuracy by 0.33% and reduces computational complexity by 49.3%, while the running time per test epoch is reduced by 33.1%.
-
Zhe ZHANG, Yiding WANG, Jiali CUI, Han ZHENG
Article type: PAPER
Article ID: 2024EDP7161
Publication year: 2025
[Advance publication] Release date: 2025/04/02
Multimodal Emotion Recognition (MER) is a critical task in sentiment analysis. Current methods primarily focus on multimodal fusion and representation of emotions, but they fail to effectively capture the collaborative interaction between modalities. In this study, we propose an MER model with intra-modal enhancement and inter-modal interaction (IEII). First, the model extracts emotion information from the text, audio, and video modalities through RoBERTa, openSMILE, and DenseNet architectures, respectively. It then applies the Large Enhanced Kernel Attention (LEKA) module, which uses a simplified attention mechanism with large convolutional kernels to enhance intra-modal emotional information and align the modalities effectively. Next, a multimodal representation space constructed with transformer encoders is used to explore inter-modal interactions. Finally, a Dual-Branch Multimodal Attention Fusion (DMAF) module based on grouped query attention and rapid attention mechanisms integrates the multimodal emotion representations to perform MER. Experimental results indicate that the model achieves superior overall accuracy and F1-scores on the IEMOCAP and MELD datasets compared to existing methods, demonstrating that it effectively enhances intra-modal emotional information and captures inter-modal interactions.
-
Kazuhiro WADA, Masaya TSUNOKAKE, Shigeki MATSUBARA
Article type: PAPER
Article ID: 2024EDP7149
Publication year: 2025
[Advance publication] Release date: 2025/03/28
Citations using URLs (URL citations) that appear in scholarly papers can be used as an information source for research resource search engines. In particular, information about the types of cited resources and the reasons for their citation is crucial for describing the resources and their relations in search services. To obtain this information, previous studies proposed methods for classifying URL citations. However, these methods trained the model with a simple fine-tuning strategy and exhibited insufficient performance. We propose a classification method using a novel intermediate task: our method trains the model to identify whether sample pairs belong to the same class before fine-tuning it on the target task. In experiments, our method outperformed previous methods based on simple fine-tuning, with higher macro F-scores across model sizes and architectures. Our analysis indicates that the model learns the class boundaries of the target task by training on our intermediate task. Our intermediate task also demonstrated higher performance and computational efficiency than an alternative intermediate task using triplet loss. Finally, we applied our method to other text classification tasks and confirmed its effectiveness when a simple fine-tuning strategy does not work stably.
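The intermediate task described here needs labeled same-class/different-class pairs built from the target-task data. A minimal sketch of one way to construct a balanced pair dataset (the sampling scheme is our illustration, not necessarily the authors' exact procedure):

```python
import itertools
import random

def make_pair_dataset(samples, rng=None):
    """Build (text_a, text_b, same_class) pairs from labeled samples.

    samples: list of (text, label). Returns all same-class pairs plus an
    equal number of randomly chosen cross-class pairs, so the binary
    intermediate task is roughly balanced.
    """
    rng = rng or random.Random()
    same = [(a[0], b[0], 1) for a, b in itertools.combinations(samples, 2)
            if a[1] == b[1]]
    diff = [(a[0], b[0], 0) for a, b in itertools.combinations(samples, 2)
            if a[1] != b[1]]
    rng.shuffle(diff)
    return same + diff[:len(same)]
```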
-
Kei KOGAI, Yoshikazu UEDA
Article type: PAPER
Article ID: 2024EDP7192
Publication year: 2025
[Advance publication] Release date: 2025/03/28
Information and control systems (ICSs) operated in social infrastructure are required to enhance their quality in terms of safety and reliability, and model checking is an effective technique for validating their behavior in the design phase. Model checking generates a state transition diagram from a model of system behavior and verifies that the model satisfies a system requirement by exploring the state space. However, as the number of model attributes and attribute-value combinations increases, the state space expands, leading to a state explosion that makes completing the search within a realistic time impossible. To solve this problem, methods that reduce the state space by dividing the model are commonly applied, although they require human judgment based on knowledge of the system and the designer's experience. The purpose of this paper is to propose a method for partitioning behavioral models of ICSs without relying on such judgment. The structure of an ICS is represented by attributes, and its behavior is described by rules over these attributes, including attributes characteristic of an ICS. Our method extracts dependency relationships between rules from their references to attributes and generates a dependency graph, which is then partitioned by clustering into clusters corresponding to the rules, thus reducing the state space. Clustering partitions the model at points where the relationships between clusters, such as rule dependencies, are sufficiently weak; modularity is used as a measure to ensure that the total number of states after partitioning is less than before. We confirm the effectiveness of this method on an ICS example by showing the partitioning of the system, comparing the number of states in the behavior models generated from the partitioned system, and presenting the results of model checking with these behavior models.
-
Keigo WAKAYAMA, Takafumi KANAMORI
Article type: PAPER
Article ID: 2024EDP7245
Publication year: 2025
[Advance publication] Release date: 2025/03/28
Neural architecture search (NAS) is very useful for automating the design of DNN architectures. In recent years, a number of training-free NAS methods have been proposed, and their reduced search cost has raised expectations for real-world applications. However, in NASI, a state-of-the-art (SOTA) training-free NAS with a theoretical background, the proxy for estimating the test performance of candidate architectures is based on the training error, not the generalization error. In this research, we propose NAS-NGE, a NAS based on a proxy theoretically derived from the bias-variance decomposition of the normalized generalization error. Specifically, we propose a surrogate of the normalized second-order moment of the Neural Tangent Kernel (NTK) and use it together with the normalized bias to construct NAS-NGE. We use NAS benchmarks and the DARTS search space to demonstrate the effectiveness of the proposed method by comparing it to SOTA training-free NAS within a short search time.
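For reference, the bias-variance decomposition underlying such a proxy has the standard form for squared error (a sketch of the textbook identity; the paper's normalized version and NTK-based surrogate differ in detail):

```latex
% Expected squared error of a learned predictor \hat{f} at input x,
% decomposed into bias, variance, and irreducible noise:
\mathbb{E}\left[(\hat{f}(x) - y)^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - \bar{y}(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathrm{Var}\left[\hat{f}(x)\right]}_{\text{variance}}
  + \sigma^2_{\text{noise}}
```

A training-error proxy captures only the bias-like term; estimating the variance term as well is what motivates a generalization-error-based proxy.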
-
Jiajun LI, Qiang LI, Kui ZHENG, JinZheng LU, Lijuan WEI, Qiang XIANG
Article type: PAPER
Article ID: 2024EDP7280
Publication year: 2025
[Advance publication] Release date: 2025/03/21
For landslides, a serious natural disaster, accurately locating the affected area is crucial for disaster mitigation and relief work. Given the complexity of landslides and the difficulty traditional methods have in quickly and accurately determining where a landslide has occurred, this paper proposes a multi-scale feature recognition algorithm for landslide images (MF-L-UNet++), designed by analyzing the characteristics of landslides and of common semantic segmentation networks. MF-L-UNet++ is based on UNet++ with the following modifications. First, the Dual Large Feature Fusion Selective Kernel Attention (DLFFSKA) module is employed to eliminate background interference in model recognition and enhance the accuracy of landslide location capture. Second, Same Scale Lightweight Kernel Prediction (SSLKP) is designed to significantly reduce the number of parameters while reducing the loss of convolutional feature information and position offset. Third, Large Kernel Content Aware Recombination Upsample (LKCARU) is presented to enhance the model's capacity to delineate landslide boundaries and details, thereby facilitating more precise segmentation. Finally, Atrous Spatial Pyramid Pooling (ASPP) is introduced to address the inadequate coverage and fusion of multi-scale information after the preceding modules, enabling the model to fully integrate global context. Experimental results on the expanded Bijie Landslide Dataset show that the proposed algorithm improves IoU, Precision, and F1-score by 3.68%, 1.29%, and 1.59%, respectively, over the UNet++ algorithm, while the parameter count and loss decrease by 0.86M and 0.05, respectively. Compared to other commonly used segmentation methods, the detection performance of the proposed model is at the optimal level.
-
Huansha Wang, Qinrang Liu, Ruiyang Huang, Jianpeng Zhang, Hongji Liu
Article type: PAPER
Article ID: 2024EDP7173
Publication year: 2025
[Advance publication] Release date: 2025/03/19
Multi-modal entity alignment (MMEA) endeavors to ascertain whether two multi-modal entities originating from distinct knowledge graphs refer to the same real-world object. This alignment is a pivotal technique in knowledge graph fusion, which aims to enhance the overall richness and comprehensiveness of the knowledge base. Existing mainstream MMEA models predominantly leverage graph convolutional networks and pre-trained visual models to extract the structural and visual features of entities, then integrate these features and conduct similarity comparisons. However, given the often suboptimal quality of multi-modal information in knowledge graphs, relying solely on traditional visual feature extraction and on visual and structural features alone may leave insufficient semantic information in the generated multi-modal joint embeddings of entities, potentially hindering the accuracy and effectiveness of multi-modal entity alignment. To address these issues, we propose MSEEA, a Multi-modal Entity Alignment method based on Multidimensional Semantic Extraction. First, MSEEA fine-tunes a large language model using preprocessed entity relationship triples, thereby enhancing its capacity to analyze latent semantic information embedded in structural triples and generate contextually rich entity descriptions. Second, MSEEA employs a combination of multiple advanced models and systems to extract multidimensional semantic information from the visual modality, thereby circumventing the feature quality degradation that can occur when relying solely on pre-trained visual models. Finally, MSEEA integrates the different modal embeddings of entities to generate multi-modal representations and compares their similarities. We designed and executed experiments on FB15K-DB15K/YAGO15K, and the outcomes demonstrate that MSEEA outperforms traditional approaches, achieving state-of-the-art results.
-
Zhifu TIAN, Tao HU, Chaoyang NIU, Di WU, Shu WANG
Article type: PAPER
Article ID: 2024EDP7266
Publication year: 2025
[Advance publication] Release date: 2025/03/19
The deep unfolding network (DUN) for image compressive sensing (ICS) integrates a traditional optimization algorithm with a neural network, providing clear interpretability and demonstrating exceptional performance. Nevertheless, the inherent paradigm of the DUN, with independent proximal mappings between iterations and limited information flux, potentially constrains the mapping capability of the deep unfolding method. This paper introduces a Feature-Domain FISTA-Inspired Deep Unfolding Network (FDFI-DUN) for ICS. FDFI-DUN comprises a Feature-Domain Nesterov Momentum Module (FNMM), a Feature-Domain Gradient Descent Module (FGDM), and a Two-level Multiscale Proximal Mapping Module (TMPMM). Specifically, the Nesterov momentum term and gradient descent term of FISTA are tailored to the feature domain, enhancing the information flux of the entire DUN and augmenting the feature information within and between iterations while maintaining clear interpretability. Furthermore, the TMPMM, encompassing intra-stage and inter-stage components, is designed to further augment the information flux and effectively utilize multiscale feature information for reconstructing image details. Extensive experimental results demonstrate that the proposed FDFI-DUN surpasses state-of-the-art methods in both quantitative and visual quality. Our code is available at: https://github.com/giant-pandada/FDFI-DUN.
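The classical FISTA iteration that such networks unfold alternates a gradient step, a proximal (soft-thresholding) step, and a Nesterov momentum update. A minimal sketch on a tiny dense l1-regularized least-squares problem (textbook FISTA, not the paper's feature-domain variant):

```python
import math

def soft(v, t):
    # Soft-thresholding: proximal operator of t * ||.||_1
    return [math.copysign(max(abs(x) - t, 0.0), x) for x in v]

def fista(A, b, lam, L, iters=200):
    """FISTA for min_x ||Ax - b||^2 + lam * ||x||_1 (tiny dense version).

    L must upper-bound the Lipschitz constant of the gradient,
    i.e. 2 * lambda_max(A^T A).
    """
    m, n = len(A), len(A[0])
    x = [0.0] * n
    y = x[:]
    t = 1.0
    for _ in range(iters):
        # gradient of ||Ay - b||^2 is 2 A^T (Ay - b)
        r = [sum(A[i][j] * y[j] for j in range(n)) - b[i] for i in range(m)]
        g = [2.0 * sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        x_new = soft([y[j] - g[j] / L for j in range(n)], lam / L)
        t_new = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0  # momentum schedule
        y = [x_new[j] + (t - 1.0) / t_new * (x_new[j] - x[j]) for j in range(n)]
        x, t = x_new, t_new
    return x
```

A DUN replaces the fixed gradient/proximal operators of each iteration with learned modules while keeping this iteration structure, which is the source of its interpretability.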
-
Yongfei WU, Daisuke KATAYAMA, Tetsushi KOIDE, Toru TAMAKI, Shigeto YOS ...
Article type: PAPER
Article ID: 2024EDP7283
Publication year: 2025
[Advance publication] Release date: 2025/03/19
In this paper, we propose an automatic segmentation method for detecting lesion areas in full-screen Narrow Band Imaging (NBI) endoscopic image frames using deep learning, for real-time diagnosis support in endoscopy. In existing diagnosis support systems, doctors need to actively align lesion areas to accurately classify lesions. We therefore aim to develop a real-time diagnosis support system that incorporates an automatic lesion segmentation algorithm capable of identifying lesions in full-screen endoscopic images. We created a dataset of over 8000 images and verified the detection performance of multiple existing segmentation model structures, finding a serious problem of missed detections in images with small lesions. We analyzed the possible reasons and proposed using a convolutional backbone network for downsampling to retain effective information, conducting experiments with a model structure using Dense Blocks and U-Net. The experimental results showed that our structure outperforms other models in detecting small lesions. At the same time, CutMix, a data augmentation method added to the training procedure to further improve detection performance, proved effective. The detection performance reached an F-measure of 0.8603 ± 0.006. In addition, our model showed the fastest processing speed in our experiments, which will be advantageous in the subsequent development of a processing system for real-time clinical videos.
-
Jifeng GUO, Yongjie WANG, Jingtan GUO, Shiwei WEI, Xian SHI
Article type: PAPER
Article ID: 2024EDP7135
Publication year: 2025
[Advance publication] Release date: 2025/03/11
The purpose of unsupervised person re-identification (Re-ID) is to improve the recognition performance of a model without using any labeled Re-ID datasets. Recently, camera differences and noisy labels have emerged as critical factors hindering the improvement of unsupervised Re-ID performance. To address these issues, we propose a camera style alignment (CSA) method. In CSA, we first devise the feature mean clustering (FM-clustering) algorithm, which clusters on averaged features to mitigate the impact of camera differences on the clustering results. We then design dual-cluster consistency refinement (DCR), which assesses the reliability of pseudo-labels from the perspective of clustering consistency, thereby reducing the influence of noisy labels. In addition, we introduce a style-aware invariance loss and a camera-aware invariance loss to achieve camera style-invariant learning from different aspects: the style-aware invariance loss improves the similarity between samples and their style-transferred counterparts, and the camera-aware invariance loss improves the similarity between positive samples from different cameras. Experimental results on the Market-1501 and MSMT17 datasets show that CSA outperforms existing fully unsupervised Re-ID and unsupervised domain adaptation Re-ID methods.
-
Xinglong PEI, Yuxiang HU, Yongji DONG, Dan LI
Article type: LETTER
Article ID: 2024EDL8095
Publication year: 2025
[Advance publication] Release date: 2025/03/10
We propose a task scheduling method using resource interleaving and Reinforcement Learning (RL) for edge network systems. We use resource interleaving to schedule task forwarding among edge nodes, reducing the time forwarded tasks wait for resources. We formulate a task scheduling optimization problem and use RL to obtain a real-time policy. Simulations verify the effectiveness of the proposed method.
-
Hee-Suk PANG, Jun-seok LIM, Seokjin LEE
Article type: LETTER
Article ID: 2024EDL8086
Publication year: 2025
[Advance publication] Release date: 2025/03/07
Whereas vibrato is one of the most frequently used techniques to enrich vocal and musical instrument sounds, the performance of fine frequency estimation methods has not been studied much for vibrato tones. We present three models of synthetic vibrato tones and use them to compare three DFT-based fine frequency estimation methods: phase difference estimation (PDE), the zero-padding method (ZPM), and the corrected quadratically interpolated fast Fourier transform (CQIFFT). Experimental results show that CQIFFT and ZPM with a large number of padded zeros are effective for the fine frequency estimation of vibrato tones. We also show an example of applying each method to a flute vibrato tone. We expect these results to be helpful in choosing a DFT-based fine frequency estimation method for analyzing the frequencies of vibrato tones.
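The PDE method mentioned here refines a coarse bin frequency using the phase advance of the same DFT bin between two overlapping frames. A minimal sketch, using a complex tone for clarity (function names and parameters are ours, not the paper's):

```python
import cmath
import math

def dft_bin(x, k):
    # Single-bin DFT: X[k] = sum_n x[n] * exp(-2*pi*j*k*n/N)
    n = len(x)
    return sum(x[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))

def pde_frequency(x, fs, k, frame_len, hop):
    """Fine frequency estimate from the phase advance of DFT bin k
    between two frames `hop` samples apart (phase difference estimation)."""
    f1 = dft_bin(x[:frame_len], k)
    f2 = dft_bin(x[hop:hop + frame_len], k)
    expected = 2.0 * math.pi * k * hop / frame_len        # phase advance of bin center
    dphi = cmath.phase(f2) - cmath.phase(f1) - expected
    dphi = (dphi + math.pi) % (2.0 * math.pi) - math.pi   # wrap to (-pi, pi]
    return (k / frame_len + dphi / (2.0 * math.pi * hop)) * fs
```

The hop must be short enough that the true phase offset stays within half a cycle, otherwise the wrap is ambiguous; with a frequency-modulated (vibrato) tone the estimate tracks a local average frequency, which is exactly the regime the paper studies.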
-
Hui Li, Xiaofeng Yang, Zebin Zheng, Jinyi Li, Shengli Lu
Article type: LETTER
Article ID: 2024EDL8089
Publication year: 2025
[Advance publication] Release date: 2025/03/07
Hardware accelerators using fixed-point quantization run object detection neural networks efficiently, but high-bit quantization demands substantial hardware and power, while low-bit quantization sacrifices accuracy. To address this, we introduce an 8-bit quantization scheme, ASPoT8, which replaces INT8 multiplications with add/shift operations, minimizing hardware area and power consumption without compromising accuracy. ASPoT8 adjusts the distribution of quantized values to match INT8's accuracy. Tests on YOLOv3-Tiny and MobileNetV2 SSDLite show minimal mAP drops of 0.5% and 0.2%, respectively, with significant reductions in power (76.31%), delay (29.46%), and area (58.40%) over INT8, based on SMIC 40 nm synthesis.
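The general idea behind add/shift quantization is to restrict weights to sums of powers of two, so a multiply becomes two shifts and an add in hardware. A minimal illustrative sketch (our own two-term power-of-two scheme, not the exact ASPoT8 codebook):

```python
def quantize_two_powers(w, exp_range=range(-7, 1)):
    """Quantize a weight with |w| <= 1 to the nearest signed sum of two
    powers of two, so x * w reduces to (x >> -a) + (x >> -b) in hardware.

    Illustrative scheme only; ASPoT8's actual value distribution differs.
    """
    sign = -1.0 if w < 0 else 1.0
    best_err, best_val = float("inf"), 0.0
    for a in exp_range:
        for b in exp_range:
            v = 2.0 ** a + 2.0 ** b          # candidate two-term value
            err = abs(abs(w) - v)
            if err < best_err:
                best_err, best_val = err, sign * v
    return best_val
```

Values like 0.375 = 2^-2 + 2^-3 are exactly representable, which is what lets the multiplier be replaced by shift-add datapaths.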
-
Lei ZHOU, Ryohei SASANO, Koichi TAKEDA
Article type: PAPER
Article ID: 2024EDP7126
Publication year: 2025
[Advance publication] Release date: 2025/03/07
In the autonomous driving (AD) scenario, accurate, informative, and understandable descriptions of traffic conditions and ego-vehicle motions can increase the interpretability of an autonomous driving system for the vehicle user. End-to-end free-form video captioning is a straightforward vision-to-text task to address such needs. However, insufficient real-world driving scene descriptive data hinders the performance of caption generation under a simple supervised training paradigm. Recently, large-scale Vision-Language Pre-training (VLP) foundation models have attracted much attention from the community, and tuning large foundation models on task-specific datasets has become a prevailing paradigm for caption generation. For autonomous driving applications, however, we often encounter large gaps between the training data of VLP foundation models and real-world driving scene captioning data, which impedes the immense potential of VLP foundation models. In this paper, we tackle this problem via a unified framework for cross-lingual, cross-domain vision-language tuning empowered by Machine Translation (MT) techniques. We aim to obtain a captioning system for driving scene caption generation in Japanese from a domain-general, English-centric VLP model. The framework comprises two core components: (i) bidirectional knowledge distillation by MT teachers; and (ii) fusing objectives for cross-lingual fine-tuning. Moreover, we introduce three schedulers to operate the vision-language tuning process with fusing objectives. Based on GIT [1], we implement our framework and verify its effectiveness on real-world driving scenes with natural caption texts annotated by experienced vehicle users. The caption generation performance with our framework reveals a significant advantage over the baseline settings.
-
Boago OKGETHENG, Koichi TAKEUCHI
Article type: PAPER
Article ID: 2024EDP7189
Publication year: 2025
[Advance publication] Release date: 2025/03/07
Automatic essay scoring is a crucial task aimed at alleviating the workload of essay graders. Most previous studies have focused on English essays, primarily due to the availability of extensive scored essay datasets, so it remains uncertain whether the models developed for English are applicable to smaller-scale Japanese essay datasets. Recent studies have demonstrated the successful application of BERT-based regression and ranking models. However, downloadable Japanese GPT models, which are larger than BERT, have become available, and it is unclear which types of modeling are appropriate for Japanese essay scoring. In this paper, we explore various aspects of modeling with GPTs, including the type of model (i.e., classification or regression), the size of the GPT models, and the approach to training (e.g., learning from scratch versus continual pre-training). In experiments on Japanese essay datasets, we demonstrate that classification models combined with soft labels score Japanese essays more effectively than simple classification models. Regarding model size, we show that smaller GPT models can produce better results depending on the model, type of prompt, and theme.
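A common way to build the soft labels mentioned here is a Gaussian-shaped target over the discrete score classes, so near-miss scores keep some probability mass instead of a one-hot target. A sketch of that idea (the paper's exact soft-label scheme may differ):

```python
import math

def soft_labels(true_score, num_classes, temperature=1.0):
    """Gaussian-shaped soft label over score classes 0..num_classes-1,
    peaked at true_score; temperature controls how much mass spreads
    to neighboring scores."""
    logits = [-((c - true_score) ** 2) / temperature for c in range(num_classes)]
    m = max(logits)                       # stabilize the softmax
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

Training with such targets penalizes predicting a score of 2 for a true 3 less than predicting a 0, which matches the ordinal nature of essay scores.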
-
Zezhong LI, Jianjun MA, Fuji REN
Article type: LETTER
Article ID: 2024EDL8062
Publication year: 2025
[Advance publication] Release date: 2025/03/04
The past decade has witnessed the rapid development of Neural Machine Translation (NMT). However, NMT approaches tend to generate fluent but sometimes unfaithful translations of the source sentences. In response, we propose a new framework that incorporates bilingual phrase knowledge into the encoder-decoder architecture, allowing the system to make full use of phrase knowledge flexibly without designing a complicated search algorithm. A significant difference from existing work is that we obtain all target phrases aligned to any part of the source sentence and learn representations for them before decoding starts, which alleviates the invisibility of future context in the standard autoregressive decoder, so that the generated target words can be decided more accurately with a global understanding. Extensive experiments on a Japanese-Chinese translation task show that the proposed approach significantly outperforms multiple strong baselines in BLEU scores, verifying the effectiveness of exploiting bilingual phrase knowledge for NMT.
-
Chong-Hui Lee, Lin-Hao Huang, Fang-Bin Qi, Wei-Juan Wang, Xian-Ji Zhan ...
Article type: LETTER
Article ID: 2024EDL8087
Publication year: 2025
[Advance publication] Release date: 2025/03/04
In recent years, environmental sustainability and the reduction of CO2 emissions have become significant research topics. To estimate CO2 emissions precisely, recent studies have used deep learning models, but these models often lack interpretability. In light of this, our study employs an explainable neural network to learn fuel consumption, which is then converted to CO2 emissions. The network includes an explainable layer that reveals the importance of each input variable; through this layer, the study elucidates the impact of different speeds on fuel consumption and CO2 emissions. Validated with real fleet data, our study achieves a mean absolute percentage error (MAPE) of only 3.3%, outperforming recent research methods.
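The reported error metric has a direct definition; a minimal sketch:

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent:
    100/n * sum(|a_i - p_i| / |a_i|). Assumes no actual value is zero."""
    return 100.0 * sum(abs(a - p) / abs(a)
                       for a, p in zip(actual, predicted)) / len(actual)
```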