In recent years, growing attention has been directed toward data trading markets. However, because the value of data depends strongly on its intended use and is difficult to assess in advance, most datasets are traded under fixed pricing schemes. Under fixed pricing, the choice of the initial price significantly affects the provider's revenue, creating substantial uncertainty and risk in forecasting returns. To address this issue, this study proposes a dynamic pricing mechanism in which the price increases in accordance with the discussion rate of the data. We compare the proposed approach with a fixed pricing scheme using agent-based modeling. By observing changes in provider revenue and market participants' surplus, our results suggest that the proposed dynamic pricing mechanism mitigates the revenue variability caused by the selection of initial prices.
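The pricing rule described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the linear update form, the `alpha` sensitivity parameter, and the clipping of the discussion rate to [0, 1] are all assumptions made for the example.

```python
def dynamic_price(base_price: float, discussion_rate: float,
                  alpha: float = 0.5) -> float:
    """Assumed pricing rule: price grows with the observed discussion
    rate of the dataset. The rate is clipped to [0, 1]; alpha controls
    how strongly discussion activity raises the price."""
    rate = min(max(discussion_rate, 0.0), 1.0)
    return base_price * (1.0 + alpha * rate)
```

Under this sketch, `dynamic_price(100.0, 0.0)` returns the initial price unchanged, so a poorly chosen initial price matters less once discussion activity begins to move the price.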
This study introduces utility estimation via inverse reinforcement learning and agent strategy formulation via reinforcement learning to construct a realistic market simulation environment. Within this environment, comparative experiments are conducted between existing and new reputation systems by varying parameters such as market size and the proportion of attackers (non-reviewers and false evaluators). This allows us to verify system robustness by examining changes in data price, quality, transaction success rate, market profitability, and provider Gini coefficient. After analyzing the characteristics of each reputation system, we propose the one best suited to data trading markets.
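One of the robustness measures above, the provider Gini coefficient, quantifies how unevenly revenue is distributed across providers (0 = perfect equality, approaching 1 = one provider captures everything). A minimal stdlib sketch of the standard formula, not tied to the paper's simulation code:

```python
def gini(values):
    """Gini coefficient of a non-negative sample via the sorted
    rank-weighted formulation; returns 0.0 for empty or all-zero input."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with x sorted
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n
```

For example, four providers with equal revenues give `gini([1, 1, 1, 1]) == 0.0`, while one provider taking all revenue gives `gini([0, 0, 0, 1]) == 0.75`.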
Automated data trading between organizations requires both flexible discovery of relevant datasets and rigorous verification of contract conditions. This paper proposes a hybrid approach that combines a Large Language Model (LLM) for semantic matching of data requirements with formal policy verification based on the Open Digital Rights Language (ODRL). The LLM identifies candidate datasets by interpreting user intents expressed in natural language, while ODRL-based verification ensures that contract terms are formally consistent. We formulate contract feasibility as a binary classification problem and compare three methods: Naive LLM, ODRL-only, and the proposed LLM+ODRL hybrid. Experimental results on 50 test cases show that Naive LLM achieves the best overall performance (Accuracy=0.86, F1=0.88), while the proposed method does not demonstrate substantial improvement over the other methods. In particular, Recall remains comparable to ODRL-only, suggesting that the formal verification component introduces conservative rejection behavior. We discuss the implications and identify improvements to the ODRL matching rules as a key direction for future work.
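The metrics used to compare the three methods follow directly from the confusion-matrix counts of the binary feasibility classification. A small sketch of those standard definitions (the counts in the usage example are hypothetical, not the paper's reported results):

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts,
    with zero-division guarded as 0.0."""
    total = tp + fp + fn + tn
    acc = (tp + tn) / total if total else 0.0
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    return acc, prec, rec, f1
```

The "conservative rejection behavior" noted above would show up here as a lower `rec` (feasible contracts wrongly rejected raise `fn`) even when `prec` stays high.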
As data from diverse domains continue to increase, analysts combine multiple datasets rather than relying on a single data source. While prior research has focused on improving techniques such as schema matching and entity resolution, relatively little attention has been paid to how datasets are actually combined in real-world analytical practice. This study investigates data integration structures observed in large-scale analytical archives, using Kaggle notebooks as a case study. We extract relationships among datasets from notebooks that use multiple datasets and represent these relationships as networks. Focusing on three-dataset notebooks, we classify integration patterns—such as serial, parallel, and mediating structures—based on network motifs, and examine their frequency of occurrence. Although thirteen three-node structures are theoretically possible, only a limited subset appears frequently in practice. Furthermore, by assigning domain labels to datasets based on tag information, we analyze how structural roles (Start, Middle, End) are associated with domain characteristics. The results suggest that certain domains are more likely to assume specific structural roles, such as mediators or endpoints, within integration processes.
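The motif-based classification above can be illustrated with a degree-signature check on the directed edges among three datasets. This is an assumed sketch: the category names echo the paper's serial/parallel terminology, but the exact rules used in the study are not specified here.

```python
def classify_triad(edges):
    """Classify the directed structure among three datasets 'A','B','C'
    by their sorted (in-degree, out-degree) signature.

    edges: set of directed (source, target) pairs, e.g. {("A","B"),("B","C")}.
    """
    out_deg = {n: 0 for n in "ABC"}
    in_deg = {n: 0 for n in "ABC"}
    for s, d in edges:
        out_deg[s] += 1
        in_deg[d] += 1
    sig = sorted((in_deg[n], out_deg[n]) for n in "ABC")
    if sig == [(0, 1), (1, 0), (1, 1)]:
        return "serial"        # A -> B -> C: the middle node mediates
    if sig == [(0, 1), (0, 1), (2, 0)]:
        return "parallel-in"   # A -> C <- B: two sources feed one sink
    if sig == [(0, 2), (1, 0), (1, 0)]:
        return "parallel-out"  # B <- A -> C: one source feeds two sinks
    return "other"             # remaining three-node structures
```

Enumerating all such signatures over connected directed three-node graphs is what yields the thirteen theoretically possible structures mentioned above.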
The rapid growth of data across diverse domains offers great potential for data-driven decision making, yet many datasets remain underutilized. A major reason is that understanding a dataset and applying it effectively requires substantial domain expertise. Recent work has proposed AI-based natural-language summarization to improve dataset interpretability, but these methods tend to focus on statistical properties and local patterns, offering limited insight into how a dataset as a whole can be used in practice. This paper focuses on automated dataset understanding through dataset profile generation, which summarizes a dataset's content, structure, quality, and potential uses. We propose an LLM-based framework that generates and ranks profiles based on sentence coherence and relevance, and, to examine the limitations of its automatic evaluation, we conducted a qualitative study with 10 participants using semi-structured interviews. Participants evaluated profiles for four datasets, including both unimodal and multimodal data. The results indicate that the framework is useful as an early-stage sensemaking tool that supports understanding the content of unfamiliar datasets and exploring how they might be used. We also identified a gap between automatic evaluation and human judgment: participants valued specificity, interpretability, trustworthiness, and feasibility over linguistic coherence.