IEICE Transactions on Information and Systems
Online ISSN : 1745-1361
Print ISSN : 0916-8532
Advance online publication
  • Qianying ZHANG, Dongxu JI, Shijun ZHAO, Zhiping SHI, Yong GUAN
    Article type: PAPER
    Article ID: 2024ICP0004
    Published: 2025
    Advance online publication: June 26, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    ARM TrustZone technology is widely used to provide Trusted Execution Environments (TEEs) for sensitive applications. However, most TEE OSes are implemented as monolithic kernels. In such designs, all components run in the kernel, which leads to a large trusted computing base (TCB). It is difficult to guarantee that no component of the kernel has a security vulnerability. Trusted computing functions, such as integrity measurement and data sealing, can provide further security guarantees. This paper presents MicroTEE, a TEE OS with rich trusted computing primitives based on the microkernel architecture. In MicroTEE, the microkernel provides strong isolation for services and applications. The kernel is responsible only for core services such as address space management, thread management, and inter-process communication. Other fundamental services, such as the trusted service, are implemented as applications at the user layer. Trusted computing primitives provide security features for trusted applications (TAs), including integrity measurement, data sealing, and remote attestation. Our design avoids compromising the whole TEE OS when a single kernel service is vulnerable. A monitor is also added to switch between the secure world and the normal world. Finally, we implemented a MicroTEE prototype on the Freescale i.MX6Q Sabre Lite development board and tested its performance. Evaluation results show that MicroTEE introduces only necessary and acceptable overhead.

  • Nhu NGUYEN, Hideaki TAKEDA
    Article type: PAPER
    Article ID: 2024EDP7258
    Published: 2025
    Advance online publication: June 24, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Wikipedia is a globally used linguistic resource available in over 330 languages, attracting contributions from a diverse, worldwide group of editors. Despite its widespread use, significant disparities persist among language editions, including differences in the number of articles, the range of topics covered, and even the number of contributing community editors. In this paper, we aim to narrow this gap in the coverage of low-resource languages. Although previous work has focused on multilingual interoperability efforts, the potential of hyperlinks has not been fully realized. Therefore, this study introduces a novel approach focused on hyperlinks, specifically emphasizing hyperlink types derived from Wikidata. We extract and analyze patterns related to these hyperlink types across different languages, using them as recommendations to connect topics across languages, particularly low-resource languages. Collaborative filtering experiments suggest that using combined languages leads to good overall results while preserving the uniqueness of each language.

  • Yikang WANG, Xingming WANG, Chee Siang LEOW, Qishan ZHANG, Ming LI, Hi ...
    Article type: PAPER
    Article ID: 2025EDP7044
    Published: 2025
    Advance online publication: June 24, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Currently, research in deepfake speech detection focuses on the generalization of detection systems towards different spoofing methods, mainly for noise-free clean speech. However, speech anti-spoofing countermeasure (CM) systems often perform poorly in more complicated scenarios, such as those involving noise and reverberation. To enhance the robustness of CM systems, we propose a transfer learning-based hybrid approach with Speech Enhancement front-end and Counter Measure back-end Joint optimization (SECM-Joint), and investigate its effectiveness in improving robustness against noise and reverberation. Experimental results show that, compared to a Conformer-based CM baseline system without pre-training, our SECM-Joint method achieves relative equal error rate (EER) reductions of 19.11% to 64.05% in most noisy conditions and 23.23% to 30.67% in reverberant environments. Additionally, our dual-path U-Net (DUMENet) further enhances the robustness for real-world applications. These results demonstrate that the proposed method effectively enhances the robustness of CM systems in noisy and reverberant conditions. Code and experimental data supporting this work are publicly available at: https://github.com/ikou-austin/SECM-Joint
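
    The abstract reports its gains as relative EER reductions. As a minimal sketch of that arithmetic (the function name and the sample EER values are hypothetical and are not taken from the paper), a relative reduction is the drop in EER divided by the baseline EER:

```python
def relative_reduction(baseline_eer, new_eer):
    """Relative reduction in percent: drop in EER divided by the baseline EER."""
    if baseline_eer <= 0:
        raise ValueError("baseline EER must be positive")
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# Hypothetical values for illustration only: a baseline EER of 10.0%
# lowered to 8.0% is a 2.0-point absolute drop and a 20% relative reduction.
example = relative_reduction(10.0, 8.0)
```

    A relative figure therefore says nothing about the absolute EER on its own; the baseline value is needed to recover it.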

  • Reo UENO, Akihiro FUJIWARA
    Article type: LETTER
    Article ID: 2025PAL0001
    Published: 2025
    Advance online publication: June 24, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    In membrane computing, most of the proposed algorithms for computationally hard problems use an exponential number of membranes, and a reduction in the number of membranes must be considered in order to make membrane computing a more realistic model.

    In the present paper, we propose an asynchronous P system using improved branch and bound to solve the minimum Steiner tree problem. The experimental results show the validity and efficiency of the proposed P system.

  • Takashi YOKOTA, Kanemitsu OOTSU
    Article type: LETTER
    Article ID: 2025PAL0002
    Published: 2025
    Advance online publication: June 24, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Interconnection networks are indispensable in parallel computers. The effectiveness of parallel execution depends heavily on the communication performance of the interconnection network. Collective communication is especially important since it is frequently executed in parallel programs. Packet scheduling is one promising method for improving the performance of collective communication. This paper addresses a lazy method for packet scheduling. The proposed method is based on an evolutionary idea to find promising candidates for injection delays and improvement methods. Preliminary evaluation results reveal that the proposed method outperforms the existing method.

  • Cheng XU, Yirong KAN, Renyuan ZHANG, Yasuhiko NAKASHIMA
    Article type: PAPER
    Article ID: 2025PAP0003
    Published: 2025
    Advance online publication: June 24, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    This paper proposes a Field-Programmable Gate Array (FPGA) accelerator for Vision Transformers (ViTs) with quantization and look-up-table (LUT) based operations. First, two improved quantization methods are proposed, achieving comparable performance at lower bit-widths. Furthermore, designs for linear and nonlinear units are proposed to support the diverse operations in ViT models. Finally, the LUT-based accelerator design is implemented and evaluated. Experimental results on the ImageNet dataset demonstrate that our proposed quantization method achieves an accuracy of 80.74% at 2-bit width, outperforming state-of-the-art Vision Transformer quantization methods by 0.1% to 0.5%. The proposed FPGA accelerator achieves higher energy efficiency, with a peak of 7.06 FPS/W and 246 GOPS/W.

  • Aoi KIDA, Hideyuki KAWASHIMA
    Article type: PAPER
    Article ID: 2025PAP0004
    Published: 2025
    Advance online publication: June 24, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    State Machine Replication (SMR) is a fundamental technique for building fault-tolerant distributed systems with strong consistency. Rabia is an SMR protocol that simplifies implementation design through a randomized consensus algorithm. Our analysis reveals a design limitation of the Rabia protocol: under partial network partitioning, replicas can develop inconsistent queue states, leading to a livelock state. We present Qsync, which enhances Rabia's fault tolerance through queue state synchronization mechanisms while preserving its implementation simplicity. Experimental evaluation shows that Qsync maintains stable performance under partial network partitions where the original Rabia throughput drops to zero.

  • Toshiyuki ICHIBA, Yasuhiro WATANABE, Takahide YOSHIKAWA
    Article type: PAPER
    Article ID: 2025PAP0007
    Published: 2025
    Advance online publication: June 24, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Driven by the strong demand for enhanced performance in High-Performance Computing (HPC), Coarse-Grained Reconfigurable Architectures (CGRAs) are promising technologies that offer high performance even under power consumption constraints. Performance on CGRAs is significantly influenced by loop unrolling, a technique that increases computational parallelism by utilizing more processing elements in CGRAs. Determining the optimal loop unrolling factor is challenging in applications with multiple loops. This paper presents a case study demonstrating the determination of optimal loop unrolling factors for an application based on the Lattice Boltzmann Method (LBM). Because the application's process exceeds the capacity of a single CGRA, this paper proposes a method for partitioning the process to fit the CGRA's resources using integer linear programming (ILP). Finally, this paper provides a performance estimation of the CGRA runtime and demonstrates the effectiveness of CGRAs for HPC.

  • Sho SATO, Shinobu MIWA, Hiroki HONDA, Hayato YAMAKI
    Article type: PAPER
    Article ID: 2025PAP0008
    Published: 2025
    Advance online publication: June 24, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    In recent years, it has become increasingly important for Internet Service Providers (ISPs), for whom link installation costs are high, to utilize all network links more effectively and avoid traffic congestion. As a promising approach to this issue, multipath routing, which distributes traffic across multiple reachable paths to the destination, has been attracting attention. In multipath routing, even if a path is congested, congestion can be avoided by using other paths and balancing path loads. Conventionally, realizing load-aware multipath routing has required both the collection of load metrics to track dynamically changing path loads and the distribution of traffic at an appropriate ratio with fine-grained traffic units such as flowlets. However, in ISP networks, existing methods may fail to balance path loads due to the large path delay and the variation in flow bit rates. In this paper, we propose a novel traffic balancing method suitable for ISP networks. In the proposed method, we first derive a target bandwidth for each path to equalize the congestion levels of all paths in the multipath set, and then decide the distribution ratio by feedback control. In addition, the proposed method adopts a modified flow-level traffic distribution, which makes flows reselect their paths at certain time intervals. These approaches enable traffic to be balanced more evenly in ISP networks than with conventional methods. Through network simulations using topologies that model ISP networks, including SINET6, we demonstrated that the proposed method can reduce the average flow completion time (FCT) by 16.0%, 44.5%, and 58.4% compared to ECMP, which performs naive traffic distribution, and to CONGA and W-ECMP, which achieve advanced traffic distribution, respectively.

  • Takuya FUTAGAMI, Noboru HAYASAKA
    Article type: PAPER
    Article ID: 2025PCP0006
    Published: 2025
    Advance online publication: June 19, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    This study proposes a knowledge-based handcrafted building region extraction algorithm that can accurately distinguish a building from its background in street images at the pixel level. The proposed algorithm leverages a customized patch-based graph cut inspired by human visual perception mechanisms. In the patch-based graph cut, the similarity of patches is measured by cutting-edge deep neural networks (DNNs). The graph settings are based on the knowledge that buildings are captured at the center of the image because they are the main subject. Our experiment, which employed 300 images from a well-known open dataset, demonstrated that the proposed method employing GrabCut for pixel-level segmentation significantly increased the comprehensive accuracy of building region extraction, measured by intersection over union (IoU), by 12.29% or more compared with the conventional knowledge-based method using color segmentation. This stems from the fact that the proposed method produces building and background candidates that are more accurate by 8.57% or more. In addition, the GrabCut-based proposed method achieved accuracy similar to that of state-of-the-art DNN-based semantic segmentation based on a transformer architecture. Further comparisons and discussions are provided in this paper to clarify the effectiveness of the proposed method.
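
    The accuracy metric named in the abstract, intersection over union (IoU), can be sketched for binary masks as follows. This is a minimal illustration of the standard definition only; the function name, the nested-list mask representation, and the toy masks are assumptions and do not come from the paper.

```python
def iou(pred, truth):
    """IoU of two same-sized binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            if p and t:
                inter += 1  # pixel labeled "building" in both masks
            if p or t:
                union += 1  # pixel labeled "building" in at least one mask
    # Convention assumed here: two empty masks count as a perfect match.
    return inter / union if union else 1.0


# Toy 3x3 masks (hypothetical): 1 = building pixel, 0 = background pixel.
pred = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]
truth = [[0, 1, 1],
         [0, 1, 0],
         [0, 1, 0]]
score = iou(pred, truth)  # 3 shared pixels out of 5 in the union
```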

  • Onhi KATO, Akira KUBOTA
    Article type: PAPER
    Article ID: 2025PCP0007
    Published: 2025
    Advance online publication: June 19, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    In recent years, zero-shot learning-based haze removal methods using a single image have been proposed and have gained attention for their effectiveness. However, methods that fuse near-infrared (NIR) and color images have not been sufficiently studied. This paper presents a haze removal method based on zero-shot learning that fuses NIR and color images. The proposed method consists of two steps: haze removal and edge fusion. In the first step, the atmospheric scattering model is adapted to remove haze from NIR and color images. This step restores colors in the color image and enhances edges in the NIR image. In the second step, a new method is introduced to fuse haze-removed NIR and color images. This method preserves the natural color and the luminance of the color image and effectively uses the edges of the NIR image. Specifically, a weight map is generated to adjust for luminance changes and is added to the NIR image. The adjusted NIR image is then multiplied by the lightness image to restore the edges. This process allows for a natural fusion of NIR and lightness images and an effective fusion of detailed edges. Our qualitative and quantitative evaluations demonstrated that our method can restore color and edges more naturally than the conventional methods. Furthermore, it was shown to be effective even for images with strong haze.
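
    For orientation, the atmospheric scattering model referred to in the first step is commonly written as below. This is the standard textbook form, stated here as an assumption; the adapted form used in the paper may differ. Here $I(x)$ is the observed hazy image, $J(x)$ the haze-free scene radiance, $t(x)$ the transmission map, and $A$ the global atmospheric light.

```latex
% Standard atmospheric scattering model (assumed form, not quoted from the paper)
I(x) = J(x)\,t(x) + A\,\bigl(1 - t(x)\bigr)

% Solving for the haze-free image once t(x) and A have been estimated
J(x) = \frac{I(x) - A}{t(x)} + A
```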

  • Yan XIANG, Di WU, Yunjia CAI, Yantuan XIAN
    Article type: PAPER
    Article ID: 2024EDP7313
    Published: 2025
    Advance online publication: June 18, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Joint multimodal aspect-based sentiment analysis (JMABSA) aims to extract aspects from multimodal inputs and determine their sentiment polarity. Existing research often faces challenges in effectively aligning aspect features across images and text. To address this, we propose an entity knowledge-guided image-text alignment network that integrates alignment across both modalities, enabling the model to more accurately capture jointly expressed aspect and sentiment information in images and text. Specifically, we introduce an entity class embedding to guide the model in learning entity-related features from text. Additionally, we utilize scene and aspect descriptions in images as entity knowledge, helping the model learn entity-relevant features from visual input. The alignment between entity knowledge in images and the initial text further supports the model in learning consistent aspect and sentiment expressions across modalities. Experimental results on two public benchmark datasets demonstrate that our method achieves state-of-the-art performance.

  • Anlin HU, Wenjiang FENG, Xudong ZHU, Junjie WANG, Shaolong LI
    Article type: LETTER
    Article ID: 2025EDL8015
    Published: 2025
    Advance online publication: June 18, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Deep Learning-based Fault Localization (DLFL) uses metamorphic testing to locate faults in the absence of test oracles. However, these approaches face the class imbalance problem: the violated data (the minority class) are far fewer than the non-violated data (the majority class). To address this issue, we propose MDAug: Metamorphic Diffusion-based Augmentation for improving DLFL without test oracles. MDAug combines metamorphic testing and a diffusion model to generate minority-class data and obtain class-balanced data. We apply MDAug to three state-of-the-art DLFL baselines without test oracles, and the results show that MDAug significantly outperforms all the baselines in the absence of test oracles.

  • Yi LIU, QiaoXing LI, Lu XIAO, Sen ZHANG
    Article type: PAPER
    Article ID: 2025EDP7088
    Published: 2025
    Advance online publication: June 18, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Driver distraction is a primary cause of traffic accidents, and the real-time and effective detection of such behaviors can significantly reduce traffic-related injuries and fatalities. In this paper, we enhance the lightweight YOLOv10n model by integrating the BiFPN structure to bolster its multi-scale feature extraction capabilities. Additionally, we design a CASSA module that combines channel attention, spatial attention, and channel shuffle to strengthen the model's ability to capture long-range dependencies. The model was tested on the CBTDDD dataset, established in this study, which includes data on driver distraction across multiple scenarios involving sedans, passenger buses, and trucks. Compared to the original YOLOv10n model, the proposed model demonstrates a 2.0% improvement in mAP@0.5 and achieves a frame rate of 115.3 FPS. These results indicate that the YOLOv10n-BC model developed in this paper is capable of performing real-time and efficient monitoring of driver distraction.

  • Xuemin Huang, Xiaoliang Zhuang, Fangyuan Tian, Zheng Niu, Lin Peng, Qi ...
    Article type: LETTER
    Article ID: 2025EDL8016
    Published: 2025
    Advance online publication: June 10, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    An FPGA-based fire detection system using a back propagation (BP) neural network was designed for early fire detection in key equipment in converter stations. An 8-5-1 BP network structure was trained, achieving a recognition accuracy of 94.08%. Fixed-point data quantization and pipelining were employed to reduce computational complexity, lowering resource consumption and enhancing speed. The FPGA system used 683 LUTs, achieved a 94.6% detection rate, consumed only 1.342 W of power, and completed a single detection in 3.25 μs, a significant improvement over the 8.56 ms detection time in MATLAB. This system demonstrates excellent reliability, real-time performance, and promising application potential for early fire detection in key equipment in converter stations.

  • Zeyou LIAO, Junguo LIAO
    Article type: PAPER
    Article ID: 2025EDP7001
    Published: 2025
    Advance online publication: June 10, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Object detection in drone-captured scenarios presents significant challenges due to factors such as varying object scales, motion blur, and dense object clusters. Although existing methods, including attention blocks and feature fusion networks, have shown improvements in detection accuracy, they often come with high computational costs, which hinder real-time performance. In this paper, we propose IFN-YOLOv8, an enhanced version of YOLOv8, designed to address these challenges. By integrating the P2 feature scale, IFN-YOLOv8 enhances small object detection through higher-resolution feature maps. Additionally, we introduce a novel convolutional block, RHAConv, to replace traditional convolution layers, improving feature representation in scenes with dense object clusters. A new Information Fusion Module is also proposed to refine object features, reducing both missed and false detections. Experimental results on the VisDrone and DOTA datasets demonstrate that IFN-YOLOv8 outperforms mainstream methods, achieving an mAP@50 of 45.7% and 68.5%, respectively, while maintaining low resource consumption and high detection speed.

  • Zhiwei YU, Weixiang XU, Qianhang DU, Rong-Long WANG, Shangce GAO
    Article type: LETTER
    Article ID: 2024EDL8097
    Published: 2025
    Advance online publication: June 09, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Glaucoma is one of the leading causes of irreversible blindness worldwide. Deep learning methods have made significant strides in predicting glaucoma in recent years. However, existing models continue to encounter challenges in extracting complex and subtle pathological features from fundus images associated with glaucoma. To address this limitation, we propose a novel DMNet model, which aims to enhance the integration of input signals by simulating the dendritic neuron model. This approach can improve the capture of fine details within glaucoma images and significantly boost classification performance. Experimental results indicate that DMNet outperforms traditional deep learning models on the glaucoma fundus image dataset, demonstrating its substantial performance advantages.

  • Hanaki YACHI, Wenzhu GU, Zhenyu LEI, Masaaki OMURA, Shangce GAO
    Article type: PAPER
    Article ID: 2024EDP7320
    Published: 2025
    Advance online publication: June 09, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Deep learning has revolutionized complex tasks such as classification, approximation, and prediction, drawing inspiration from mathematical models of the human brain. Among recent breakthroughs, Google's Transformer architecture has established itself as a leading framework in natural language processing. Its adaptation to computer vision, known as the Vision Transformer (ViT), has set new benchmarks for image-based tasks. In this study, we introduce a novel neural network model that integrates the input layer of the ViT with the dendritic neuron model (DNM). This hybrid architecture combines the advanced feature extraction capabilities of ViT with the adaptability and robustness of DNM to enhance performance. The proposed model is applied to the diagnosis of diabetic retinopathy, effectively identifying critical features associated with the condition. The results underscore its potential to improve the accuracy and reliability of medical image analysis, paving the way for advancements in healthcare diagnostics.

  • Yuka IKEGAMI, Kento HASEGAWA, Seira HIDANO, Kazuhide FUKUSHIMA, Kazuo ...
    Article type: PAPER
    Article ID: 2024EDP7325
    Published: 2025
    Advance online publication: June 09, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    With the rapid increase in demand for IoT devices, malicious attacks targeting their vulnerabilities have become frequent in recent years. Vulnerability assessment is expected to help remove these vulnerabilities. However, the wide variety of IoT devices is not standardized, and it is difficult to set up vulnerability assessment items mechanically for them, which is a major obstacle to automating vulnerability assessment for IoT devices. In this paper, we propose a method to prioritize vulnerability assessment items for each IoT device by effectively utilizing large language models (LLMs). The proposed method uses an LLM with Retrieval Augmented Generation (RAG) to generate answers that take into account the specifications of individual IoT devices, and determines how suitable each vulnerability assessment item is for each IoT device by calculating the suitability using semantic entropy. To retrieve related chunks in RAG, the proposed method uses hybrid search with reranking. Through binary classification of vulnerability assessment items, an average area under the curve (AUC) of 0.753 was achieved for five IoT devices. We confirmed that the proposed method is more effective in evaluating the suitability of the items to the target device specifications than the methods using keyword search, vector search, and hybrid search with RRF (Reciprocal Rank Fusion).
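
    One of the baselines named above, hybrid search with Reciprocal Rank Fusion (RRF), merges several ranked lists by summing reciprocal ranks. The sketch below shows the general RRF technique only, not the paper's pipeline; the function name, the chunk IDs, and the constant `k = 60` (a commonly used default) are assumptions for illustration.

```python
def rrf(rankings, k=60):
    """Fuse ranked lists: each item scores the sum of 1 / (k + rank) over all lists."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by item name for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical chunk IDs returned by a keyword search and a vector search.
keyword_hits = ["c1", "c2", "c3"]
vector_hits = ["c3", "c1", "c4"]
fused = rrf([keyword_hits, vector_hits])
```

    Chunks that appear near the top of both lists rise in the fused ranking, while a chunk found by only one retriever is kept but ranked lower.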

  • Shigeaki Tanimoto, Yoshinori Fujihira, Toru Kobayashi, Takeshi Yamauch ...
    Article type: LETTER
    Article ID: 2024OFL0001
    Published: 2025
    Advance online publication: June 09, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    We propose “bio-inspired UX,” a new method based on the defense mechanisms of ecosystems, for preventing intentional internal fraud within organizations. The proposed method features a function for sharing UX information within groups, inspired by the signal transmission mechanism between plants.

  • Taishin TAKAHATA, Mitsuharu MATSUMOTO
    Article type: LETTER
    Article ID: 2025EDL8026
    Published: 2025
    Advance online publication: June 09, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Disaster relief robots have been studied extensively as a promising approach to realize lifesaving and goods transportation without the need for manpower. Most disaster relief robots are designed to search for and find a person in need of rescue. However, it is not always easy for a robot to find a person in need of rescue at a disaster site, and that person may not even notice a robot that has come very close. In this study, we therefore investigate the effectiveness of smell as a method of communicating the presence of a robot. We conducted a search experiment with and without smell to evaluate whether the sense of smell is useful for search. The experimental results confirmed that searching with smell is highly effective.

  • Kosuke SHIMIZU, Taizo SUZUKI
    Article type: PAPER
    Article ID: 2025PCP0005
    Published: 2025
    Advance online publication: June 09, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    We propose a JPEG format-compliant encryption method in the quantized discrete cosine transform (QDCT) domain for texture protection, called Prediction Error-Propagated Encryption with Modulo Operator (PEPE-MO = WPE-MO, by pronouncing ‘W’ as ‘double’). In the QDCT domain, both the direct current (DC) coefficients, which contain structure information, and alternating current (AC) coefficients, which contain texture information, are encrypted with newly placed prediction, encryption, and reconstruction modules. The resulting propagated prediction error reinforces texture protection. To ensure JPEG compatibility, WPE-MO incorporates a modulo operator into the prediction and reconstruction modules, circulating coefficients within the JPEG-encodable value range. Additionally, to balance attack resilience and coding efficiency, two adjustable parameters are introduced: random length interval (RLI) and random step size (RSS). Experiments on JPEG image encryption demonstrate that WPE-MO exhibits high attack resilience with minimal degradation in coding efficiency. In particular, WPE-MO resists ciphertext-only attacks, including brute-force and replacement attacks, with approximately 19.55 % degradation in coding efficiency, as measured by the Bjøntegaard-delta rate, through careful selection of RLI and RSS.
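
    The role of the modulo operator, keeping encrypted coefficients inside an encodable value range, can be sketched as follows. This is a simplified additive-keystream illustration and not the paper's prediction-error propagation scheme; the range bounds `LO` and `HI`, the function names, and the sample values are all assumptions.

```python
# Assumed illustrative coefficient range; the actual JPEG-encodable range
# used in the paper is not given in the abstract.
LO, HI = -1024, 1023
SPAN = HI - LO + 1


def wrap(value):
    """Map any integer back into [LO, HI] by circulating around the range."""
    return (value - LO) % SPAN + LO


def encrypt(coeffs, keystream):
    """Add a keystream value to each coefficient, wrapping on overflow."""
    return [wrap(c + k) for c, k in zip(coeffs, keystream)]


def decrypt(cipher, keystream):
    """Subtract the same keystream; wrapping makes this the exact inverse."""
    return [wrap(c - k) for c, k in zip(cipher, keystream)]


# Hypothetical quantized coefficients and keystream values.
coeffs = [1020, -1024, 0, 37]
keystream = [10, -3, 2047, 500]
cipher = encrypt(coeffs, keystream)
restored = decrypt(cipher, keystream)
```

    Because every encrypted value stays within the assumed range, the result can still be entropy-coded as ordinary coefficients, and decryption with the same keystream recovers the input exactly.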

  • Lintang Matahari Hasani, Kasiyah Junus, Lia Sadita, Ayano Ohsaki, Tsuk ...
    Article type: LETTER
    Article ID: 2024EDL8025
    Published: 2025
    Advance online publication: June 02, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Learners need to progress through certain inquiry stages to experience a good online discussion. This study analyzes the discussions of two classes that received different preparation: Kit-build concept mapping (KBCM) and summary writing. Epistemic network analysis showed that the KBCM class exhibited close-to-ideal connectivity between the inquiry stages.

  • Ying Liu, Yong Li, Ming Wen, Xiangwei Xu
    Article type: PAPER
    Article ID: 2024EDP7299
    Published: 2025
    Advance online publication: June 02, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Federated Learning (FL) enables multiple organizations to collaboratively train machine learning models without revealing raw data. As a new learning paradigm, FL suffers from statistical challenges on cross-organizational non-IID data, which limit the global model's ability to perform well on each client task. In this paper, we propose a personalized federated meta-learning algorithm (EPer-FedMeta) for heterogeneous clients that uses q-FedAvg as the model aggregation strategy, which helps the global model fairly optimize a reasonable representation across multiple personalized client models, and introduces a contrast loss in local training to bring the meta-learner representations closer together. Also noteworthy is the potential cold-start problem for new tasks in Personalized Federated Learning (PFL), for which EPer-FedMeta uses CondConv to make lightweight modifications to the CNN for more robust model personalization migration. Our extensive empirical evaluation on the LEAF dataset and a real production dataset shows that EPer-FedMeta further mitigates the impact of non-IID data on FL communication costs and model accuracy. In terms of performance and optimization, EPer-FedMeta achieves the best model performance, with faster convergence and lower communication overhead than the leading optimization algorithms in FL.

  • Makoto NAKATSUJI, Yasuhiro FUJIWARA
    Article type: PAPER
    Article ID: 2024OFP0009
    Published: 2025
    Advance online publication: June 02, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Developing personalized chatbots is crucial in the field of AI, particularly when aiming for dynamic adaptability similar to that of human communication. Traditional methods often overlook the importance of both the speaker's and the responder's personalities and their interaction histories, resulting in lower predictive accuracy. Our solution, INTPChat (Interactive Persona Chat), addresses this limitation. INTPChat builds implicit profiles from extensive utterance histories of both speakers and responders and updates these profiles dynamically to reflect current conversational contexts. By employing a co-attention encoding mechanism, INTPChat aligns current contexts with responses while considering historical interactions. This approach effectively mitigates data sparsity issues by iteratively shifting each context backward in time, allowing for a more granular analysis of long-term interactions. Evaluations on long-term Reddit datasets demonstrate that INTPChat significantly enhances response accuracy and surpasses the performance of state-of-the-art persona chat models.

  • Qian Zewen, HAN Zhezhe, Jiang Haoran, Zhang Ziyi, Zhang Mohan, Ma Hao, ...
    Article type: LETTER
    Article ID: 2025EDL8003
    Published: 2025
    Advance online publication: June 02, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Identifying the combustion conditions in power-plant furnaces is crucial for optimizing combustion efficiency and reducing pollutant emissions. Traditional image-processing methods rely heavily on prior empirical knowledge, limiting their ability to comprehensively extract features from flame images. To address these deficiencies, this study proposes a novel approach for combustion condition identification through flame imaging and a convolutional autoencoder (CAE). In this approach, the flame images are first preprocessed, then the CAE is established to extract the deep features of the flame images, and finally a Softmax classifier is employed to determine the combustion conditions. Experimental research is carried out on a 600 MW opposed wall boiler, and the effectiveness of the proposed method is evaluated using captured flame images. Results demonstrate that the proposed CAE-Softmax model achieves an identification accuracy of 98.2% under the investigated combustion conditions, significantly outperforming traditional models. These findings demonstrate the feasibility of the method, offering an intelligent and efficient solution for enhancing the operational performance of power-plant boilers.

    Download PDF (1344K)
  • Jialong LI, Shogo MORITA, Wei WANG, Yan ZHANG, Takuto YAMAUCHI, Kenji ...
    Article type: LETTER
    Article ID: 2025EDL8017
    Published: 2025
    Advance online publication: June 02, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Human-robot collaboration has become increasingly complex and dynamic, highlighting the need for effective and intuitive communication. Two communication strategies for robots have been explored: (i) global-perspective strategy to share an overview of task progress, aimed at achieving consensus on completed and upcoming tasks; and (ii) local-perspective strategy to share the robot's intent, aimed at conveying the robot's immediate intentions and next actions. However, existing studies merely rely on the distinct focus to differentiate between the use of different strategies, lacking a deeper exploration of how these strategies affect user perceptions and responses in practice. For example, a possible concern could be which strategy is more likely to inspire human effort in collaboration. To this end, this paper conducts a user experiment (N=15) within a collaborative cooking scenario, and provides design insights into the strengths and weaknesses of each strategy from three dimensions to inform the design of human-sensitive communication.

    Download PDF (1365K)
  • Ziyue WANG, Yanchao LIU, Xina CHENG, Takeshi IKENAGA
    Article type: PAPER
    Article ID: 2025PCP0002
    Published: 2025
    Advance online publication: June 02, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Automatically reconstructing structured 3D models of real-world indoor scenes is an essential and challenging task in applications such as indoor navigation, evacuation planning, and wireless signal simulation. Despite the increasing demand for up-to-date indoor models, indoor reconstruction from monocular videos is still at an early stage compared with the reconstruction of outdoor scenes. Specific challenges are related to the complex building layouts, which require long-term video recording, and the high presence of elements such as pieces of furniture that cause clutter and occlusions. To accurately reconstruct large-scale indoor scenes with multiple rooms, this paper designs a large-scale indoor multiple-room 3D reconstruction pipeline that explores the topological relations between different rooms from long-term monocular videos. Firstly, video segmentation based on semantic door detection is proposed to segment different rooms in the video for individual reconstruction, avoiding global mismatching noise, and a 3D temporal trajectory is proposed to connect different rooms in the spatial domain. Secondly, the 3D Hough transform and principal component analysis are utilized to refine the room boundaries from reconstructed point clouds, which improves accuracy. Further, an original long-term video dataset for large-scale indoor multiple-room reconstruction is constructed, which contains 12 real-world videos and 4 virtual videos with 30 rooms. Extensive experiments demonstrate that the proposed method reaches the highest performance, with a 3D IoU of 0.70, room distance accuracy of 0.87, and connectivity accuracy of 0.67, which is around 39% better on average compared with various state-of-the-art models.

    Download PDF (1709K)
  • Kosuke KURIHARA, Yoshihiro MAEDA, Daisuke SUGIMURA, Takayuki HAMAMOTO
    Article type: PAPER
    Article ID: 2025PCP0004
    Published: 2025
    Advance online publication: June 02, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    We propose a non-contact heart rate (HR) estimation method that models weak physiological blood volume pulse (BVP) signals and strong noise signals caused by background illumination. Our method integrates BVP signal extraction based on a physiological model and a flexible RGB/NIR integration scheme based on an illumination model in a unified manner. This unified framework enables accurate extraction of the BVP signal while suppressing noise derived from ambient light, and thus improves HR estimation performance. We demonstrate the effectiveness of our method through experiments using several datasets, including various illumination scenes. Our code will be available on https://github.com/kosuke-kurihara/PhysIllumHR.

    Download PDF (5774K)
  • Zhiyao SUN, Peng WANG
    Article type: PAPER
    Article ID: 2024EDP7289
    Published: 2025
    Advance online publication: May 28, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Mobile edge computing (MEC) faces severe challenges in achieving efficient and timely task offloading in heterogeneous network environments. While existing contract-based approaches address incentive compatibility and resource coordination, many either ignore the constraints of age of information (AoI) or suffer from high computational complexity. This paper presents an AoI-guaranteed Optimal Contract (AOC) mechanism that jointly considers information freshness and asymmetric information in MEC systems. We design a three-tier heterogeneous network architecture with non-orthogonal multiple access to enable cooperative task offloading across multiple cells and enhance spectral efficiency. Instead of a model that requires extensive training and is difficult to analyze, our proposed AOC framework uses a lightweight block coordinate descent (BCD) algorithm to solve closed-form contract solutions while ensuring incentive compatibility and individual rationality. Simulation results show that the AOC mechanism significantly improves the utility and AoI performance of the MEC server compared with existing incentive-based methods. In addition, the analysis confirms the robustness and practical deployability of the proposed framework under different system conditions.

    Download PDF (7998K)
  • Qingxia YANG, Deng PAN, Wanlin HUANG, Erkang CHEN, Bin HUANG, Sentao W ...
    Article type: PAPER
    Article ID: 2024EDP7316
    Published: 2025
    Advance online publication: May 23, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Ship detection in maritime monitoring is crucial for ensuring public safety in marine environments. However, maritime surveillance faces significant challenges due to weak targets (small, low-contrast objects) caused by complex environments and long distances. To address these challenges, we propose YOLO-MSD, a maritime surveillance detection model based on YOLOv8. In YOLO-MSD, Receptive-Field Attention Convolution (RFAConv) replaces standard convolution, learning attention maps via receptive-field interaction to enhance detail extraction and reduce information loss. The C2f module in the neck integrates Omni-Dimensional Dynamic Convolution (ODConv), which dynamically adjusts convolution kernel parameters to effectively capture contextual information, thereby achieving superior multi-scale feature fusion. We introduce a dedicated detection head specifically for small objects to enhance detection accuracy. Furthermore, to address detection box quality imbalance, we employ Wise-IoU for bounding box regression loss, enhancing multi-scale target localization and accelerating convergence. The model achieves precision, recall and mean average precision (mAP50) rates of 93.0%, 90.05% and 95.0%, respectively, on the self-constructed Maritime Vessel Surveillance Dataset (MVSD), effectively meeting the requirements for maritime target detection. We further conduct comparative experiments on the public McShips dataset, demonstrating YOLO-MSD's broad applicability in ship detection.

    Download PDF (2746K)
  • Mitsuhiro WATANABE, Go HASEGAWA
    Article type: PAPER
    Article ID: 2025EDP7014
    Published: 2025
    Advance online publication: May 23, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    As the Internet becomes larger-scaled and more diversified, the traditional end-to-end (E2E) congestion control faces various problems such as low throughput on long-delay networks and unfairness among flows with different network situations. In this paper, we propose a novel congestion control architecture, called in-network congestion control (NCC). Specifically, by introducing one or more nodes (NCC nodes) on an E2E network path, we divide the network path into multiple sub-paths and maintain a congestion-control feedback loop on each sub-path. In each sub-path, a specialized congestion control algorithm can be applied according to its network characteristics. This architecture can provide various advantages compared with the traditional E2E congestion control, such as higher data transmission throughput, better per-flow fairness, and incremental deployment nature. In this paper, we describe NCC's advantages and challenges, and clarify its potential performance by evaluation results. We reveal that the E2E throughput improves by as much as 159% by just introducing NCC nodes. Furthermore, increasing the number of NCC nodes improves the E2E throughput and fairness among flows by up to 258% and 151%, respectively.

    Download PDF (3065K)
  • Guanghui CAI, Junguo ZHU
    Article type: PAPER
    Article ID: 2024EDP7292
    Published: 2025
    Advance online publication: May 15, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Deep learning has transformed Neural Machine Translation (NMT), but the complexity of these models makes them hard to interpret, thereby limiting improvements in translation quality. This study explores the widely used Transformer model, utilizing linguistic features to clarify its inner workings. By incorporating three linguistic features—part-of-speech, dependency relations, and syntax trees—we demonstrate how the model's attention mechanism interacts with these features during translation. Additionally, we improved translation quality by masking nodes that were identified to have negative effects. Our approach bridges the complex nature of NMT with clear linguistic knowledge, offering a more intuitive understanding of the model's translation process.

    Download PDF (4232K)
  • Shuhei YAMAMOTO, Yasunori AKAGI, Tomu TOMINAGA, Takeshi KURASHIMA
    Article type: PAPER
    Article ID: 2024EDP7248
    Published: 2025
    Advance online publication: May 14, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Present bias, the cognitive bias that prioritizes immediate rewards over future ones, is considered one of the factors that can hinder goal achievement. Estimation of present bias enables the development of effective intervention strategies for behavioral change. This paper proposes a novel method for estimating present bias from behavior history captured by wearable devices. We employ a Transformer due to its proficiency in learning relationships within sequential data such as behavior history, which includes continuous data (e.g., heart rate) and event data (e.g., sleep onset). To allow the Transformer to capture behavior patterns affected by present bias, we introduce two novel architectures for effectively processing the timestamp information of continuous and event data in behavior history: temporal and event encoders (TE and EE). TE discerns the periodic characteristics of continuous data, while EE examines temporal interdependencies in the event data. These encoders enable our proposed model to capture temporally (ir)regular behavioral patterns associated with present bias. Our experiments using the behavior history logs of 257 subjects collected over 28 days demonstrated that our method estimates the subjects' present bias accurately.

    Download PDF (1149K)
  • Shrey SINGH, Prateek KESERWANI, Katsufumi INOUE, MASAKAZU IWAMURA, Par ...
    Article type: PAPER
    Article ID: 2024EDP7297
    Published: 2025
    Advance online publication: May 14, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Sign language recognition (SLR) from video is a challenging problem. For the SLR problem, the I3D network, which was proposed for action recognition, is the best-performing model. However, action recognition and SLR are inherently different problems. Therefore, there is room to adapt I3D to the SLR problem and achieve better performance by considering the task-specific features of SLR. In this work, we revisit the I3D model to improve its performance in three essential design aspects. They include a better inception module named the dilated inception module (DIM) and an attention mechanism-based temporal attention module (TAM) to identify the essential features of signs. In addition, we propose to eliminate a loss function that deteriorates the performance. The proposed method has been extensively validated on the WLASL and MS-ASL public datasets. The proposed method has outperformed the state-of-the-art approaches on the WLASL dataset and produced competitive results on the MS-ASL dataset, though the results on the MS-ASL dataset are indicative due to unavailability of the original data. The Top-1 accuracies of the proposed method on WLASL100 and MS-ASL100 were 79.08% and 82.78%, respectively.

    Download PDF (2879K)
  • Olivier NOURRY, Masanari KONDO, Shinobu SAITO, Yukako IIMURA, Naoyasu ...
    Article type: LETTER
    Article ID: 2025EDL8005
    Published: 2025
    Advance online publication: May 14, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    [Background] Throughout their lifetime, open-source software systems will naturally attract new contributors and lose existing contributors. Not all OSS contributors are equal, however, as some contributors within a project possess significant knowledge and expertise of the codebase (i.e., core developers). When investigating a project's ability to attract new contributors and how often a project loses contributors, it is therefore important to take into account the expertise of the contributors. [Goal] Since core developers are vital to a project's longevity, we therefore aim to find out: can OSS projects attract new core developers and how often do OSS projects lose core developers? [Results] To investigate core developer contribution patterns, we calculate the truck factor (or bus factor) of over 36,000 OSS projects to investigate how often TF developers join or abandon OSS projects. We find that 89% of our studied projects have experienced losing their core development team at least once. Our results also show that in 70% of cases, this project abandonment happens within the first three years of a project's life. We also find that most OSS projects rely on a single core developer to maintain development activities. Finally, we find that only 27% of projects that were abandoned were able to attract at least one new TF developer.
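
    The truck factor mentioned above can be illustrated with a small sketch. The abstract does not state how the authors compute it, so the following uses one common greedy heuristic as an assumption: repeatedly remove the developer who authors the most files until more than half of the files have no remaining author. The function name `truck_factor`, the `file_authors` mapping, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
def truck_factor(file_authors, threshold=0.5):
    """Greedy truck-factor heuristic (illustrative assumption, not the letter's method).

    file_authors: dict mapping a file path to the set of developers
    considered its authors. Returns the list of developers whose removal
    leaves more than `threshold` of the files without any author.
    """
    remaining = {path: set(devs) for path, devs in file_authors.items()}
    total = len(remaining)
    removed = []
    if total == 0:
        return removed
    while True:
        orphaned = sum(1 for devs in remaining.values() if not devs)
        if orphaned > threshold * total:
            return removed
        # Count how many files each remaining developer still covers.
        coverage = {}
        for devs in remaining.values():
            for dev in devs:
                coverage[dev] = coverage.get(dev, 0) + 1
        if not coverage:
            return removed
        # Remove the developer covering the most files (ties: alphabetical).
        top = max(sorted(coverage), key=coverage.get)
        removed.append(top)
        for devs in remaining.values():
            devs.discard(top)


# Hypothetical project in which one developer authors most files.
example = {
    "a.py": {"alice"},
    "b.py": {"alice"},
    "c.py": {"alice"},
    "d.py": {"bob"},
}
core = truck_factor(example)
```

    Under this heuristic the length of the returned list is the truck factor, and the listed developers play the role of the TF developers discussed in the abstract.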

    Download PDF (613K)
  • Xingxin WAN, Peng SONG, Siqi FU, Changjia WANG
    Article type: LETTER
    Article ID: 2025EDL8020
    Published: 2025
    Advance online publication: May 14, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    In ideal facial expression recognition (FER) tasks, the training and test data are assumed to share the same distribution. However, in reality, they are often sourced from different domains, which follow different feature distributions and would seriously impair the recognition performance. In this letter, we present a novel Dynamic Graph-Guided Domain-Invariant Feature Representation (DG-DIFR) method, which addresses the issue of distribution shifts across different domains. First, we learn a robust common subspace to minimize the data distribution differences, facilitating the extraction of invariant feature representations. Concurrently, the retargeted linear regression is employed to enhance the discrimination of the proposed model. Furthermore, a maximum entropy based dynamic graph is further introduced to maintain the topological structure information in the low-dimensional subspace. Finally, numerous experiments conducted on four benchmark datasets confirm the superiority of the proposed method over state-of-the-art methods.

    Download PDF (714K)
  • Shunya ISHIKAWA, Toru NAKASHIKA
    Article type: PAPER
    Article ID: 2025EDP7029
    Published: 2025
    Advance online publication: May 14, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Recent research in chord recognition has utilized machine learning models. However, few models adequately consider harmonic co-occurrence, a known musical feature. Since the harmonic structure is complex and varies with instrument and pitch, the model itself would need to consider harmonics explicitly, but few such methods exist. We propose the classification semi-restricted Boltzmann machine (CSRBM), a machine learning model that can explicitly consider the co-occurrence of any two pitches. A model parameter learns the co-occurrence function to enable chord recognition with flexible consideration of the harmonic structure. We demonstrate how to incorporate the structure as prior knowledge into the model by setting up a prior distribution of the parameter. We also propose weight-sharing CSRBM (WS-CSRBM), an extension of CSRBM that allows time series to be considered. This model enables the CSRBM to consider time series more efficiently not only by arranging some of the CSRBMs in parallel with the number of frames to be considered but also by sharing some of the parameters. Experimental results show that the recognition accuracies of the proposed methods outperform that of a conventional method that considers the co-occurrence of some harmonics. The effectiveness of the CSRBM's parameter in learning pitch co-occurrence, setting up a prior distribution for the parameter, and sharing some parameters in WS-CSRBM are also confirmed.

    Download PDF (4108K)
  • Koji ABE, Ryoma KITANISHI, Hitoshi HABE, Masayuki OTANI, Nobukazu IGUC ...
    Article type: PAPER
    Article ID: 2024EDP7282
    Published: 2025
    Advance online publication: May 07, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    At fish farms and fish farming facilities, the number of fish is continuously monitored from hatching until shipment. Especially, whenever hatchery-produced juvenile fish are transferred from one indoor aquaculture tank to another, fish farmers who manage the juvenile fish must manually count thousands of the fish, which places a significant burden on them. This paper presents an automated system for counting hatchery-produced juvenile fish in fish farming facilities. This system aims to serve as a foundational technology for aquaculture production management, supporting sustainable production through data-driven aquaculture. In the proposed system, a slide is set up with a video camera positioned above to capture the surface of the slide. The flow of juvenile fish along with water on the slide is recorded, and the number of juvenile fish captured in the video is counted. In every frame of the video, the starting and the ending lines are prepared perpendicular to the direction of fish movement, and the fish regions are tracked between these lines. The count is increased by one when a fish region has crossed the starting line. Subsequently, each fish region is tracked across frames, where the count is increased if a fish region in which an occlusion occurs between multiple fish regions has been separated. Under a custom-built recording setup, experiments were conducted with 10 videos of approximately 200 black medaka being released down the slide, and 2 videos with thousands of hatchery-produced juvenile fish being released down the slide, recorded at an aquaculture facility. The results indicated that the proposed system counted the number of fish accurately in most cases, even in the presence of occlusions.
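
    The counting rule described above (count a region when it crosses the starting line, then add to the count when an occluded region separates) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: each track is assumed to be a dictionary with a hypothetical `positions` list (centroid coordinate along the flow direction, one per frame) and an optional hypothetical `splits` field (number of extra regions that separated from it between the two lines).

```python
def count_fish(tracks, start_line):
    """Count fish from tracked regions (illustrative sketch).

    A track contributes 1 when its position moves from before the
    starting line to at or past it, plus one for every later split of
    a merged (occluded) region.
    """
    count = 0
    for track in tracks:
        positions = track["positions"]
        crossed = any(
            prev < start_line <= cur
            for prev, cur in zip(positions, positions[1:])
        )
        if crossed:
            count += 1 + track.get("splits", 0)
    return count


# Hypothetical tracks: two cross the line, one of them later splits in two.
tracks = [
    {"positions": [0, 5, 12, 20]},
    {"positions": [3, 11, 25], "splits": 1},
    {"positions": [1, 4, 8]},  # never reaches the starting line
]
total = count_fish(tracks, start_line=10)
```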

    Download PDF (4218K)
  • Congda MA, Tianyu ZHAO, Manabu OKUMURA
    Article type: PAPER
    Article ID: 2024EDP7326
    Published: 2025
    Advance online publication: May 07, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Because the data used for pre-training inherently contain biases, current pre-trained Large Language Models (LLMs) ubiquitously manifest the same biases. Since the bias influences the output of LLMs across various tasks, their widespread deployment is hampered. We propose a simple method that utilizes structured knowledge to alleviate this issue, aiming to reduce the bias embedded within LLMs and ensure that they have an encompassing perspective when used in applications. Experimental results indicated that our method has good debiasing ability when applied to both existing autoregressive and masked language models. Additionally, it could ensure that the performance of LLMs on downstream tasks remains uncompromised. Importantly, our method obviates the need for training from scratch, thus offering enhanced scalability and cost-effectiveness.

    Download PDF (954K)
  • Bin YANG, Mingyuan LI, Yuzhi XIAO, Haixing ZHAO, Zhen LIU, Zhonglin YE
    Article type: PAPER
    Article ID: 2024EDP7152
    Published: 2025
    Advance online publication: April 24, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Existing graph neural network architectures usually process graph data at a single scale, which leads to information loss and oversimplification. To address this problem, this paper proposes a novel graph neural network approach, the M2GNN framework, which aims to enhance the feature learning capability for graph-structured data through multi-scale fusion and an attention mechanism. In M2GNN, each channel handles graph features at a different scale separately and integrates local and global information using multi-scale fusion methods to capture features at different levels in the graph structure. The learned features from each channel are then weighted and fused using an attention mechanism to extract the most representative feature representation. The experimental results show that compared with the traditional graph neural network approach, M2GNN improves the performance by 0.70% to 54.14%, 0.34% to 54.31%, and 0.68% to 54.40% for the node classification task with different label coverages, which verifies the effectiveness of the multi-channel and multi-scale fusion strategies.

    Download PDF (2311K)
  • Chuanyang LIU, Jingjing LIU, Yiquan WU, Zuo SUN
    Article type: PAPER
    Article ID: 2024EDP7265
    Published: 2025
    Advance online publication: April 24, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    As a common type of defect, the rust defect of power components is one of the important potential hazards endangering the safe operation of transmission lines. How to quickly and accurately discover and repair the rusted power components is an urgent problem to be solved in power inspection. Aiming at the above problems, this study proposes Rust-Defect YOLO (RD-YOLO) for detecting rust defects in power components of transmission lines. Firstly, the Coordinate Channel Attention Residual Module (CCARM) is proposed to improve the multi-scale detection precision. Secondly, the Receptive Field Block (RFB) and the Efficient Convolutional Block Attention Module (ECBAM) are introduced into the Path Aggregation Network (PANet) to strengthen the fusion of deep and shallow features. Finally, the contrast sample strategy and the Focal loss function are adopted to train and optimize RD-YOLO, and experiments are carried out on a self-built dataset. The experimental results show that the average precision of rust defect detection by RD-YOLO reaches 95%, which is 9% higher than that of the original YOLOX. The comparative experimental results demonstrate that RD-YOLO performs excellently in power components identification and rust defect detection, and has broad application prospects in the future automatic visual inspection of transmission lines.

    Download PDF (3805K)
  • Yuewei ZHANG, Huanbin ZOU, Jie ZHU
    Article type: LETTER
    Article ID: 2024EDL8099
    Published: 2025
    Advance online publication: April 23, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Multi-resolution spectrum feature analysis has demonstrated superior performance over traditional single-resolution methods in speech enhancement. However, previous multi-resolution-based methods typically have limited use of multi-resolution features, and some suffer from high model complexity. In this paper, we propose a more lightweight method that fully leverages the multi-resolution spectrum features. Our approach is based on a convolutional recurrent network (CRN) and employs a low-complexity multi-resolution spectrum fusion (MRSF) block to handle and fuse multi-resolution noisy spectrum information. We also improve the existing encoder-decoder structure, enabling the model to extract and analyze multi-resolution features more effectively. Furthermore, we adopt the short-time discrete cosine transform (STDCT) for time-frequency transformation, avoiding the phase estimation problem. To optimize our model, we design a multi-resolution STDCT loss function. Experiments demonstrate that the proposed multi-resolution STDCT-based CRN (MRCRN) achieves excellent performance and outperforms current advanced systems.
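
    To make the phase remark above concrete, here is a minimal sketch of a short-time discrete cosine transform: the signal is split into overlapping windowed frames and a DCT-II is applied to each frame, so every coefficient is real-valued and no separate phase has to be estimated. The function names, the periodic Hann window, and the unnormalized DCT-II are assumptions made for illustration; the letter's exact transform settings are not given in the abstract.

```python
import math


def dct2(frame):
    """Unnormalized DCT-II of one frame (real input, real output)."""
    n = len(frame)
    return [
        sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(frame))
        for k in range(n)
    ]


def stdct(signal, frame_len, hop):
    """Short-time DCT: window each frame, then apply the DCT-II."""
    # Periodic Hann window (an illustrative choice).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_len) for i in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        segment = [signal[start + i] * window[i] for i in range(frame_len)]
        frames.append(dct2(segment))
    return frames


# Hypothetical test tone; two frame lengths give two time-frequency resolutions.
tone = [math.sin(2 * math.pi * 3 * t / 16) for t in range(16)]
coarse = stdct(tone, frame_len=8, hop=4)
fine = stdct(tone, frame_len=4, hop=2)
```

    Calling `stdct` with several frame lengths, as in the last two lines, is one simple way to obtain spectra at multiple resolutions that a fusion block could then combine.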

    Download PDF (822K)
  • Trung MINH BUI, Jung-Hoon HWANG, Sewoong JUN, Wonha KIM, DongIn SHIN
    Article type: PAPER
    Article ID: 2024EDP7261
    Published: 2025
    Advance online publication: April 23, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    This paper develops a grasp pose detection method that achieves high success rates in real-world industrial environments where elongated objects are densely cluttered. Conventional Vision Transformer (ViT)-based methods capture fused feature maps, which successfully encode comprehensive global object layouts, but these methods often suffer from spatial detail reduction. Therefore, they predict grasp poses that could efficiently avoid collisions, but are insufficiently precisely located. Motivated by these observations, we propose Oriented Region-based Vision Transformer (OR-ViT), a network that preserves critical spatial details by extracting a fine-grained feature map directly from the shallowest layer of a ViT backbone and also understands global object layouts by capturing the fused feature map. OR-ViT decodes precise grasp pose locations from the fine-grained feature map and integrates this information into its understanding of global object layouts from the fused map. In this way, the OR-ViT is able to predict accurate grasp pose locations with reduced collision probabilities.

    Extensive experiments on the public Cornell and Jacquard datasets, as well as on our customized elongated-object dataset, verify that OR-ViT achieves competitive performance on both public and customized datasets when compared to state-of-the-art methods.

    Download PDF (5896K)
  • Huayang Han, Yundong Li, Menglong Wu
    Article type: LETTER
    Article ID: 2025EDL8004
    Published: 2025
    Advance online publication: April 22, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Building damage assessment (BDA) plays a crucial role in accelerating humanitarian relief efforts during natural disasters. Recent studies have shown that the state-space model-based Mamba architecture exhibits significant performance across various natural language processing tasks. In this paper, we propose a new model, OS-Mamba, which utilizes an Overall-Scan Convolution Module (OSCM) for multidimensional global modeling of image backgrounds, enabling comprehensive capture and analysis of large spatial features from various directions, thereby enhancing the model's understanding and performance in complex scenes. Extensive experiments on the xBD dataset demonstrate that our proposed OS-Mamba model outperforms current state-of-the-art solutions.

    Download PDF (25608K)
  • Hyunsik YOON, Yon Dohn CHUNG
    Article type: LETTER
    Article ID: 2024EDL8071
    Published: 2025
    Advance online publication: April 17, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    The execution time of an Apache Spark application is heavily influenced by its configuration settings. Accordingly, Bayesian Optimization (BO) is commonly used for automated tuning, typically with Expected Improvement (EI) as the acquisition function. However, existing work has not empirically compared EI with other acquisition functions. In this paper, we show that EI may not work well for Spark applications because their search space is huge compared with other optimization problems. In addition, we demonstrate the performance of BO based on Probability of Improvement (PI), which achieves exploration via rich random initialization and exploitation via the PI acquisition function. Through experimental evaluations, we show that the PI-based BO outperforms the EI-based BO in both optimal time and optimization cost.
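
    For readers unfamiliar with the two acquisition functions compared above, the sketch below gives their standard textbook forms for a minimization objective such as execution time. It assumes a surrogate model that returns a Gaussian prediction (mean `mu`, standard deviation `sigma`) for a candidate configuration; the function names and the `xi` margin parameter are illustrative assumptions and are not taken from the letter.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probability_of_improvement(mu, sigma, best, xi=0.0):
    """Probability that the candidate beats the best observed value."""
    if sigma <= 0.0:
        return 1.0 if mu < best - xi else 0.0
    return normal_cdf((best - xi - mu) / sigma)


def expected_improvement(mu, sigma, best, xi=0.0):
    """Expected amount by which the candidate beats the best observed value."""
    if sigma <= 0.0:
        return max(best - xi - mu, 0.0)
    z = (best - xi - mu) / sigma
    return (best - xi - mu) * normal_cdf(z) + sigma * normal_pdf(z)


# Hypothetical candidate: predicted 8 s with 1 s uncertainty, best so far 10 s.
pi_value = probability_of_improvement(mu=8.0, sigma=1.0, best=10.0)
ei_value = expected_improvement(mu=8.0, sigma=1.0, best=10.0)
```

    PI only scores how likely an improvement is, while EI also weighs its size and therefore rewards uncertain regions more strongly. That difference is consistent with the abstract's description of PI as the exploitation component, with exploration supplied by random initialization.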

    Download PDF (1555K)
  • Rihito SHODA, Seiji MIYOSHI
    Article type: LETTER
    Article ID: 2024EDL8105
    Published: 2025
    Advance online publication: April 17, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Anomaly detection is essential in a wide range of fields. In this study, we focus on an Efficient GAN applied to anomaly detection, and aim to improve its performance by random erasing data augmentation and enhancing the loss function to incorporate mapping consistency. Experiments using images of normal lemons and damaged lemons reveal that the proposed method significantly improves the anomaly detection performance of Efficient GAN.

    Download PDF (2190K)
  • Duc-Dung NGUYEN
    Article type: LETTER
    Article ID: 2024EDL8107
    Published: 2025
    Advance online publication: April 17, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Compared to general object detection problems, the detection of mathematical expressions (MED) in document images has its own challenges, like the small size of inline formulas, the rich set of mathematical symbols, and the similarity between variables and normal text characters. To deal with those challenges, we transform the multi-class MED task into a multi-label semantic segmentation problem. With a basic encoder-decoder structure of 3.9 million parameters and trained from scratch, our proposed MEDNet model can achieve top detection performance on three public datasets: TFD2019, Marmot, and IBEM2021. MEDNet is especially effective in detecting small formulas when achieving the F1 score of 95.40% for the inline and 95.82% for all expressions on the test set of the IBEM2021 competition data.

    Download PDF (1228K)
  • Ruidong CHEN, Baohua QIANG, Xianyi YANG, Shihao ZHANG, Yuan XIE
    Article type: PAPER
    Article ID: 2024EDP7279
    Published: 2025
    Advance online publication: April 17, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    Image-text retrieval (ITR) aims at querying one type of data given data of another type. The main challenge is mapping images and texts to a common space. Although existing methods obtain excellent performance on ITR tasks, they also have the drawbacks of weak information interaction and insufficient capture of deeper associative relationships. To address these problems, we propose CDISA, a Cross-modal Deep Interaction and Semantic Aligning method that combines a vision-language pre-training model with semantic feature extraction capabilities. Specifically, we first design a cross-modal deep interaction module to enhance the interaction of image and text features by performing deep interaction matching computations. Secondly, to align the image and text features, bidirectional cosine matching is proposed to improve the differentiation of bimodal data within the feature space. We present an extensive experimental evaluation against recent state-of-the-art ITR methods on three datasets: Wikipedia, Pascal-Sentence, and NUS-WIDE.

    Download PDF (9576K)
  • Xichang CAI, Jingxuan CHEN, Ziyi LIU, Menglong WU, HongYang GUO, Xueji ...
    Article type: LETTER
    Article ID: 2024EDL8085
    Published: 2025
    Advance online publication: April 14, 2025
    JOURNAL FREE ACCESS ADVANCE PUBLICATION

    In recent years, convolutional recurrent neural networks (CRNNs) have achieved notable success in sound event detection (SED) tasks by leveraging the strengths of convolutional neural networks (CNNs) and recurrent neural networks (RNNs). However, existing models still face limitations in the temporal dimension, resulting in suboptimal temporal localization accuracy for SED. To address this issue, we designed a model called Temporal Enhanced Full-Frequency Dynamic Convolution (TEFFDConv). This model incorporates both temporal and frequency attention mechanisms with the full-dynamic convolution, enhancing the model's ability to localize sound events at the frame level. Experimental results demonstrate that our proposed model significantly improves PSDS1, CB-F1, and IB-F1 compared with similar methods. Additionally, PSDS2 also improves over most methods. These results show the superior performance of our proposed method in temporal localization, as well as better performance in event classification.

    Download PDF (966K)