Customer segmentation is a process that divides a business's total customers into groups according to their diversity of purchasing behavior and characteristics. The data mining clustering technique can be used to accomplish this customer segmentation. This technique clusters the customers in such a way that the customers in one group behave similarly when compared to the customers in other groups. Customer-related data are categorical in nature. However, clustering algorithms for categorical data are few and are unable to handle uncertainty. Rough set theory (RST) is a mathematical approach that handles uncertainty and is capable of discovering knowledge from a database. This paper proposes a new clustering technique called MADO (Minimum Average Dissimilarity between Objects) for categorical data based on elements of RST. The proposed algorithm is compared with other RST-based clustering algorithms, such as MMR (Min-Min Roughness), MMeR (Min Mean Roughness), SDR (Standard Deviation Roughness), SSDR (Standard deviation of Standard Deviation Roughness), and MADE (Maximal Attributes DEpendency). The results show that for the real customer data considered, the MADO algorithm achieves clusters with higher cohesion, lower coupling, and less computational complexity when compared to the above-mentioned algorithms. The proposed algorithm has also been tested on a synthetic data set to show that it is suitable for high dimensional data.
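The rough set notions that the compared algorithms build on can be illustrated with a minimal sketch. It is not the MADO algorithm, whose details the abstract does not give; it only computes the standard lower and upper approximations of a target set and the resulting roughness, 1 - |lower| / |upper|. The customer table, attribute names, and function names are hypothetical.

```python
from collections import defaultdict


def equivalence_classes(objects, attribute):
    """Group object ids that share the same value of one categorical attribute."""
    classes = defaultdict(set)
    for obj_id, row in objects.items():
        classes[row[attribute]].add(obj_id)
    return list(classes.values())


def approximations(target, classes):
    """Return the (lower, upper) approximations of `target` w.r.t. `classes`."""
    lower, upper = set(), set()
    for cls in classes:
        if cls <= target:   # class lies entirely inside the target set
            lower |= cls
        if cls & target:    # class overlaps the target set
            upper |= cls
    return lower, upper


def roughness(target, classes):
    """Roughness = 1 - |lower| / |upper|; 0 means the set is crisp."""
    lower, upper = approximations(target, classes)
    if not upper:
        return 0.0
    return 1.0 - len(lower) / len(upper)


# Hypothetical customer table, for illustration only.
customers = {
    1: {"region": "north", "tier": "gold"},
    2: {"region": "north", "tier": "silver"},
    3: {"region": "south", "tier": "gold"},
    4: {"region": "south", "tier": "gold"},
}
gold = {cid for cid, row in customers.items() if row["tier"] == "gold"}
by_region = equivalence_classes(customers, "region")
gold_roughness = roughness(gold, by_region)
```

Here the "south" class falls wholly inside the gold customers while the "north" class only overlaps them, so the set of gold customers is rough with respect to region.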
In recent times, the mining of association rules from XML databases has received attention because of its wide applicability and flexibility. Many mining methods have been proposed. Because of the inherent flexibility of the structures and the semantics of the documents, however, these methods are challenging to use. In order to accomplish the mining, an XML document must first be converted into a relational dataset, and an index table with node encoding is created to extract transactions and interesting items. In this paper, we propose a new method to mine association rules from XML documents using a new type of node encoding scheme that employs a Unique Identifier (UID) to extract the important items. The node scheme modified with UID encoding speeds up the mining process. A significance measure is used to identify the important rules found in the XML database. Finally, the mining procedure calculates the confidence that the identified rules are indeed meaningful. Experiments are conducted using XML databases available in the XML data repository. The results illustrate that the proposed method is efficient in terms of computation time and memory usage.
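As a rough illustration of the confidence measure mentioned above, the sketch below computes support and confidence over transactions assumed to have already been extracted from XML nodes. The UID node encoding and the significance measure of the paper are not reproduced; the transactions and function names are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from the transactions."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / base


# Hypothetical transactions, e.g. sets of element names found under one record.
transactions = [
    frozenset({"author", "title", "year"}),
    frozenset({"author", "title"}),
    frozenset({"author", "year"}),
    frozenset({"title", "publisher"}),
]

rule_support = support({"author", "title"}, transactions)
rule_confidence = confidence({"author"}, {"title"}, transactions)
```

A rule such as author → title would be kept only if both values exceed user-chosen thresholds.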
A Materials Engineering Application (MEA) has been presented as a solution for the problems of materials design, solutions simulation, production and processing, and service evaluation. Large amounts of data are generated in the distributed and heterogeneous MEA environment. As the demand for intelligent engineering information applications increases, the challenge is to effectively organize these complex data and provide timely and accurate on-demand services. In this paper, based on the supporting environment of Open Cloud Services Architecture (OCSA) and Virtual DataSpace (VDS), a new semantic-driven knowledge representation model for MEA information is proposed. Faced with the constantly changing user requirements of MEA, this model elaborates the semantic representation of data, services and their relationships to support the construction of domain knowledge ontology. Then, based on the ontology modeling in VDS, the semantic representations of association mapping, rule-based reasoning, and evolution tracking are analyzed to support MEA knowledge acquisition. Finally, an application example of knowledge representation in the field of materials engineering is given to illustrate the proposed model, and some experimental comparisons are discussed for evaluating and verifying the effectiveness of this method.
There is a clear need for a public domain data set of road networks with high spatial accuracy and global coverage for a range of applications. The Global Roads Open Access Data Set (gROADS), version 1, is a first step in that direction. gROADS relies on data from a wide range of sources and was developed using a range of methods. Traditionally, map development was highly centralized and controlled by government agencies due to the high cost or required expertise and technology. In the past decade, however, high resolution satellite imagery and global positioning system (GPS) technologies have come into wide use, and there has been significant innovation in web services, such that a number of new methods to develop geospatial information have emerged, including automated and semi-automated road extraction from satellite/aerial imagery and crowdsourcing. In this paper we review the data sources, methods, and pros and cons of a range of road data development methods: heads-up digitizing, automated/semi-automated extraction from remote sensing imagery, GPS technology, crowdsourcing, and compiling existing data sets. We also consider the implications for each method in the production of open data.
A peer review scheme comparable to that used in traditional scientific journals is a major element missing in bringing publications of raw data up to standards equivalent to those of traditional publications. This paper introduces a quality evaluation process designed to analyse the technical quality as well as the content of a dataset. This process is based on quality tests, the results of which are evaluated with the help of the knowledge of an expert. As a result, the quality is estimated by a single value only. Further, the paper includes an application and a critical discussion on the potential for success, the possible introduction of the process into data centres, and practical implications of the scheme.
This article documents a systematic bias in surface wind directions between the TAO buoy measurements at 0°, 170°W and the ECMWF analysis and forecasts. This bias was of the order 10° and persisted from November 2008 to January 2010, which was consistent with a post-recovery calibration drift in the anemometer vane. Unfortunately, the calibration drift was too time-variant to be used to correct the data so the quality flag for this deployment was adjusted to reflect low data quality. The primary purpose of this paper is to inform users in the modelling and remote-sensing community about this systematic, persistent wind directional bias, which will allow users to make an educated decision on using the data and be aware of its potential impact to their downstream product quality. The uncovering of this bias and its source demonstrates the importance of continuous scientific oversight and effective user-data provider communication in stewarding scientific data. It also suggests the need for improvement in the ability of buoy data quality control procedures of the TAO and ECMWF systems to detect future wind directional systematic biases such as the one described here.
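One way to quantify a directional bias of this kind is a circular mean of the buoy-minus-analysis differences, which avoids the wrap-around problem at 0°/360°. The sketch below is an illustration under that assumption, not the procedure used in the article; the sample values and function names are made up.

```python
import math


def angular_difference(a_deg, b_deg):
    """Signed smallest difference a - b in degrees, in the range [-180, 180)."""
    return (a_deg - b_deg + 180.0) % 360.0 - 180.0


def mean_direction_bias(obs_deg, model_deg):
    """Circular mean of the per-sample direction differences, in degrees."""
    diffs = [angular_difference(o, m) for o, m in zip(obs_deg, model_deg)]
    # Average the unit vectors rather than the raw angles.
    s = sum(math.sin(math.radians(d)) for d in diffs)
    c = sum(math.cos(math.radians(d)) for d in diffs)
    return math.degrees(math.atan2(s, c))


# Made-up wind directions (degrees): buoy observations vs. model analysis.
buoy = [10.0, 20.0, 5.0]
analysis = [0.0, 10.0, 355.0]
bias = mean_direction_bias(buoy, analysis)
```

Note that the third pair (5° against 355°) contributes a difference of +10°, not -350°, which is why a plain arithmetic mean of raw directions would be misleading.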
Modern science is increasingly data-intensive, multidisciplinary, and network-centric. There is an emerging consensus among the members of the academic research community that the practices of this new science paradigm should be congruent with “open science”. This entails that the bonanza of research data, the wide availability of algorithms, data tools, and data services produced by the members of the research community must be discoverable, understandable, and usable by overcoming all kinds of heterogeneity and logical inconsistencies. The main concept for coping with the many dimensions of heterogeneity and logical inconsistency is mediation. Mediation is achieved by mediators or brokers. These are software modules that exploit encoded knowledge about certain datasets, data services, and user needs in order to implement an intermediary service. A mediating environment is an environment that provides a core set of intermediary services. Mediation should be a distinct functionality of future research data infrastructures. This paper surveys the different levels of interoperability, i.e., exchangeability, compatibility, and usability, their properties and relationships, mediation concepts, functions, and intermediary services. The current interoperability landscape is also illustrated. Finally, the paper advocates the need for mediating environments to be supported by future research data infrastructures and envisions that one of the most important features of future research data infrastructures will be mediation software.
Selecting an optimal feature subset from a high-dimensional data set with a very large number of features is an NP-complete problem. Because conventional optimization techniques are unable to tackle large-scale feature selection problems, meta-heuristic algorithms are widely used. In this paper, we propose a particle swarm optimization technique that utilizes regression techniques for feature selection. We then use the selected features to classify the data. Classification accuracy is used as a criterion to evaluate classifier performance, and classification is accomplished through the use of k-nearest neighbour (KNN) and Bayesian techniques. Various high dimensional data sets are used to evaluate the usefulness of the proposed approach. Results show that our approach gives better results when compared with other conventional feature selection algorithms.
The Semantic Web (Web 3.0) has been proposed as an efficient way to access the increasingly large amounts of data on the internet. The Linked Open Data Cloud project at present is the major effort to implement the concepts of the Semantic Web, addressing the problems of inhomogeneity and large data volumes. RKBExplorer is one of many repositories implementing Open Data and contains considerable bibliographic information. This paper discusses bibliographic data, an important part of cloud data. Effective searching of bibliographic datasets can be a challenge as many of the papers residing in these databases do not have sufficient or comprehensive keyword information. In these cases, however, a search engine based on RKBExplorer can retrieve papers only by author names and paper titles, without keywords. In this paper we attempt to address this problem by using the data mining algorithm Association Rule Mining (ARM) to develop keywords based on features retrieved from Resource Description Framework (RDF) data within a bibliographic citation. We have demonstrated the applicability of this method for predicting missing keywords for bibliographic entries in several typical databases. −−−−− ¹ Paper presented at 1st International Symposium on Big Data and Cloud Computing Challenges (ISBCC-2014) March 27-28, 2014. Organized by VIT University, Chennai, India. Sponsored by BRNS.
The associative classification method integrates association rule mining and classification. Constructing an efficient classifier with a small set of high quality rules is an important but challenging task. The lazy learning associative classification method successfully removes the need for a classifier but suffers from high computation costs. This paper proposes a Compact Highest Subset Confidence-Based Associative Classification scheme that generates compact subsets based on information gain and classifies the new samples without constructing classifiers. Experimental results show that the proposed system outperforms both the traditional and the existing lazy learning associative classification methods. −−−−− ¹ Paper presented at 1st International Symposium on Big Data and Cloud Computing Challenges (ISBCC-2014) March 27-28, 2014. Organized by VIT University, Chennai, India. Sponsored by BRNS.
Today, science is passing through an era of transformation, where the inundation of data, dubbed the data deluge, is influencing the decision-making process. Science driven by data is being termed data science. In this internet age, the volume of data has grown to petabytes, and this large, complex, structured or unstructured, and heterogeneous data in the form of “Big Data” has gained significant attention. The rapid pace of data growth through various disparate sources, especially social media such as Facebook, has seriously challenged the data analytic capabilities of traditional relational databases. The velocity of the expansion of the amount of data gives rise to a complete paradigm shift in how new age data is processed. Confidence in the data engineering of the existing data processing systems is gradually fading whereas the capabilities of the new techniques for capturing, storing, visualizing, and analyzing data are evolving. In this review paper, we discuss some of the modern Big Data models that are leading contributors in the NoSQL era and claim to address Big Data challenges in reliable and efficient ways. Also, we take the potential of Big Data into consideration and try to reshape the original operational-oriented definition of “Big Science” (Furner, 2003) into a new data-driven definition and rephrase it as “The science that deals with Big Data is Big Science.”
Meta-analyses are studies that bring together data or results from multiple independent studies to produce new and over-arching findings. Current data curation systems only partially support meta-analytic research. Some important meta-analytic tasks, such as the selection of relevant studies for review and the integration of research datasets or findings, are not well supported in current data curation systems. To design tools and services that more fully support meta-analyses, we need a better understanding of meta-analytic research. This includes an understanding of both the practices of researchers who perform the analyses and the characteristics of the individual studies that are brought together. In this study, we make an initial contribution to filling this gap by developing a conceptual framework linking meta-analyses with data paths represented in published articles selected for the analysis. The framework focuses on key variables that represent primary/secondary datasets or derived socio-ecological data, contexts of use, and the data transformations that are applied. We introduce the notion of using variables and their relevant information (e.g., metadata and variable relationships) as a type of currency to facilitate synthesis of findings across individual studies and leverage larger bodies of relevant source data produced in small science research. Handling variables in this manner provides an equalizing factor between data from otherwise disparate data-producing communities. We conclude with implications for exploring data integration and synthesis issues as well as system development.
IBAMar is a regional database that puts together all the physical and biochemical data provided by multiparametric probes and water sample analysis taken during the cruises managed by the Balearic Oceanographic Center of the Instituto Español de Oceanografía (COB-IEO) during the last four decades. Initially, it integrated data from hydrographic profiles obtained from CTDs (conductivity, temperature, depth) equipped with several sensors, but it has been recently extended to incorporate data obtained with hydrocasts using oceanographic Niskin or Nansen bottles. The result is an extensive regional resource database that includes physical hydrographic data such as temperature (T), salinity (S), dissolved oxygen (DO), fluorescence, and turbidity, as well as biochemical data, specifically dissolved inorganic nutrients (phosphate, nitrate, nitrite, and silicate) and chlorophyll-a. Different technologies and methodologies were used by independent teams during the four decades of data sampling. However, in the IBAMar database, data have been reprocessed using the same protocols, and a standard quality control (QC) methodology has been applied to each variable. The result is a homogeneous, quality-controlled data set. The IBAMar database at standard levels is freely available for exploration and download from http://www.ba.ieo.es/ibamar/.
With the growing importance of data to the scholarly record and the critical role journals play in facilitating data sharing, the complex landscape of scholarly journal data publication policies has become an obstacle for research. This paper outlines Data-PE, a framework for evaluating these policies. It takes the form of a conceptual foundation, comprising twelve criteria for evaluation, operationalized through an evaluation tool. Its objective is to function as a flexible means for a variety of stakeholders to appraise individual policies. Examples of the use of the framework are provided and means for the validation of the tool are discussed.
Making scientific data openly accessible and available for re-use is desirable to encourage validation of research results and/or economic development. Understanding what users may, or may not, do with data in online data repositories is key to maximizing the benefits of scientific data re-use. Many online repositories that allow access to scientific data indicate that data is “open,” yet specific usage conditions reviewed on 40 “open” sites suggest that there is no agreed upon understanding of what “open” means with respect to data. This inconsistency can be an impediment to data re-use by researchers and the public.
This paper presents a stewardship maturity assessment model in the form of a matrix for digital environmental datasets. Nine key components are identified based on requirements imposed on digital environmental data and information that are cared for and disseminated by U.S. Federal agencies by U.S. law, i.e., Information Quality Act of 2001, agencies’ guidance, expert bodies’ recommendations, and users. These components include: preservability, accessibility, usability, production sustainability, data quality assurance, data quality control/monitoring, data quality assessment, transparency/traceability, and data integrity. A five-level progressive maturity scale is then defined for each component associated with measurable practices applied to individual datasets, representing Ad Hoc, Minimal, Intermediate, Advanced, and Optimal stages. The rationale for each key component and its maturity levels is described. This maturity model, leveraging community best practices and standards, provides a unified framework for assessing scientific data stewardship. It can be used to create a stewardship maturity scoreboard of dataset(s) and a roadmap for scientific data stewardship improvement or to provide data quality and usability information to users, stakeholders, and decision makers.
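The matrix described above can be sketched as a small data structure: nine components, each rated on the five-level scale. Mapping the stages to the integers 1 to 5 is an assumption made for illustration, as are the function names and the example ratings; the paper's actual scoring practices are not reproduced here.

```python
# The nine key components named in the abstract.
COMPONENTS = [
    "preservability",
    "accessibility",
    "usability",
    "production sustainability",
    "data quality assurance",
    "data quality control/monitoring",
    "data quality assessment",
    "transparency/traceability",
    "data integrity",
]

# Assumed numeric encoding of the five progressive stages.
LEVELS = {1: "Ad Hoc", 2: "Minimal", 3: "Intermediate", 4: "Advanced", 5: "Optimal"}


def scoreboard(ratings):
    """Translate per-component numeric ratings into stage names."""
    missing = [c for c in COMPONENTS if c not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    for component in COMPONENTS:
        if ratings[component] not in LEVELS:
            raise ValueError(f"invalid level for {component}: {ratings[component]}")
    return {c: LEVELS[ratings[c]] for c in COMPONENTS}


def roadmap(ratings, target=3):
    """List the components that are still below the target level."""
    return [c for c in COMPONENTS if ratings[c] < target]


# Hypothetical ratings for one dataset.
example = {c: 3 for c in COMPONENTS}
example["preservability"] = 5
example["data integrity"] = 2

board = scoreboard(example)
todo = roadmap(example, target=3)
```

The scoreboard gives a per-dataset summary for users, while the roadmap lists the components where stewardship improvement would be directed first.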
The rapid growth in the volume of remote sensing data and its increasing computational requirements bring huge challenges for researchers as traditional systems cannot adequately satisfy the huge demand for service. Cloud computing has the advantage of high scalability and reliability, which can provide firm technical support. This paper proposes a highly scalable geospatial cloud platform named the Geospatial Data Cloud, which is constructed based on cloud computing. The architecture of the platform is first introduced, and then two subsystems, the cloud-based data management platform and the cloud-based data processing platform, are described. ––– This paper was presented at the First Scientific Data Conference on Scientific Research, Big Data, and Data Science, organized by CODATA-China and held in Beijing on 24-25 February, 2014.
Highlights of the 2013 International Forum on 'Polar Data Activities in Global Data Systems'.
The Polar Data Catalogue (PDC) is a growing Canadian archive and public access portal for Arctic and Antarctic research and monitoring data. In partnership with a variety of Canadian and international multi-sector research programs, the PDC encompasses the natural, social, and health sciences. From its inception, the PDC has adopted international standards and best practices to provide a robust infrastructure for reliable security, storage, discoverability, and access to Canada’s polar data and metadata. Current efforts focus on developing new partnerships and incentives for data archiving and sharing and on expanding connections to other data centres through metadata interoperability protocols.
Scientific data management is performed to ensure that data are curated in a manner that supports their qualified reuse. Curation usually involves actions that must be performed by those who capture or generate data and by a facility with the capability to sustainably archive and publish data beyond an individual project’s lifecycle. The Australian Antarctic Data Centre is such a facility. How this centre is approaching the administration of Antarctic science data is described in the following paper and serves to demonstrate key facets necessary for undertaking polar data management in an increasingly connected global data environment.
Korea implemented its Antarctic research program in 1987 and diversified to the Arctic in 2002. Since the development of the Joint Committee on Antarctic Data Management, Korea has acknowledged the importance of data management. The launch of the Korea Polar Research Institute in 2004 also saw establishment of the Korea Polar Data Center (KPDC), which outlines and executes a Polar Data Management Policy. KPDC has set up an Information Technology infrastructure and has developed a metadata management system. However, there is still a long way to go, especially in terms of raising researcher recognition for improving data registration and sharing.
Data generated by environmental research in Antarctica are essential in evaluating how its biodiversity and environment are affected by global-scale changes triggered by ever-increasing human activities. In this work, we describe BrAntIS, the Brazilian Information System on Antarctic Environmental Research, which enables the acquiring, storing, and querying of research data generated by the Brazilian National Institute for Science and Technology on Antarctic Environmental Research. BrAntIS' data model reflects data acquisition and analysis conducted by scientists and organized around field expeditions. We describe future functionalities, such as the use of linked data techniques and support for scientific workflows.
The Polar Data Centre of the National Institute of Polar Research has had the responsibility to manage the data for Japan as a National Antarctic Data Centre for the last two decades. During the International Polar Year (IPY) 2007–2008, a considerable number of multidisciplinary metadata that mainly came from IPY-endorsed projects involving Japanese activities were compiled by the data centre. Although long-term stewardship of those amalgamated metadata falls to the data centre, the efforts are in collaboration with the Global Change Master Directory, the Polar Information Commons, and the newly established World Data System of the International Council for Science.
Polar information falls into at least six categories: information about researchers, organizations, research facilities, research projects, research datasets, and publications. The management of polar research datasets has been the focus of significant attention in recent years, but it is only one piece of the polar information world. The other information types are needed to provide context to, and extract knowledge from, the raw data. Here, I discuss the possibilities for linking the various types of information categories in Canada to create a truly holistic view of Canadian Arctic research.
An overview of the Interuniversity Upper atmosphere Global Observation NETwork (IUGONET) project is presented. This Japanese program is building a meta-database for ground-based observations of the Earth’s upper atmosphere, in which metadata connected with various atmospheric radars and photometers, including those located in both polar regions, are archived. By querying the metadata database, researchers are able to access data file/information held by data facilities. Moreover, by utilizing our analysis software, users can download, visualize, and analyze upper-atmospheric data archived in or linked with the system. As a future development, we are looking to make our database interoperable with others.
Ionospheric Prediction Services (IPS) has an extensive collection of data from Antarctic field instruments, the oldest being ionospheric recordings from the 1950s. Its sensor network (IPSNET) spans Australasia and Antarctica collecting information on space weather. In Antarctica, sensors include ionosondes, magnetometers, riometers, and cosmic ray detectors. The (mostly) real-time data from these sensors flow into the IPS World Data Centre at Sydney, where the majority are available online to clients worldwide. When combined with other IPSNET-station data, they provide the basis for Antarctic space weather reports. This paper summarizes the datasets collected from Antarctica and their data management within IPS.
A system to optimize the management of global space-weather observation networks has been developed by the National Institute of Information and Communications Technology (NICT). Named the WONM (Wide-area Observation Network Monitoring) system, it enables data acquisition, transfer, and storage through connection to the NICT Science Cloud, and has been supplied to observatories for supporting space-weather forecast and research. This system provides us with easier management of data collection than our previously employed systems by means of autonomous system recovery, periodical state monitoring, and dynamic warning procedures. Operation of the WONM system is introduced in this report.
The regional HydroMeteorological DataBase (HMDB) was designed for easy access to climate data via the Internet. It contains data on various climatic parameters (temperature, precipitation, pressure, humidity, and wind strength and direction) from 190 meteorological stations in Russia and bordering countries for a period of instrumental observations of over 100 years. Open sources were used to ingest data into HMDB. An analytical block was also developed to perform the most common statistical analysis techniques.
Arctic Options: Holistic Integration for Arctic Coastal-Marine Sustainability is a new three-year research project to assess future infrastructure associated with the Arctic Ocean regarding: (1) natural and living environment; (2) built environment; (3) natural resource development; and (4) governance. For the assessments, Arctic Options will generate objective relational schema from numeric data as well as textual data. This paper will focus on the ‘long tail of smaller, heterogeneous, and often unstructured datasets’ that ‘usually receive minimal data management consideration’, as observed in the 2013 Communiqué from the International Forum on Polar Data Activities in Global Data Systems.
The Arctic Ocean boundary monitoring array has been maintained over many years by six research institutes located worldwide. Our approach to Arctic Ocean boundary measurements is generating significant scientific outcomes. However, it is not always easy to access Arctic data. On the basis of our last five years’ experience of assembling pan-Arctic boundary data, and considering the success of Argo, I propose that Arctic data policy should be driven by specific scientific-based requirements. Otherwise, it will be hard to implement the International Polar Year data policy. This approach would also help to establish a consensus of future Arctic science.
The legacy of the International Polar Year 2007–2008 (IPY) includes advances in open data and meaningful progress towards interoperability of data, systems, and standards. Enabled by metadata brokering technologies and by the growing adoption of international metadata standards, federated data search welcomes diversity in Arctic data and recognizes the value of expertise in community data repositories. Federated search enables specialized data holdings to be discovered by broader audiences and complements the role of metadata registries such as the Global Change Master Directory, providing interoperability across the Arctic web-of-repositories.
The Inter-university Consortium for Political and Social Research (ICPSR), a domain repository with a 50-year track record of archiving social and behavioural science data, applied for—and acquired—the Data Seal of Approval (DSA) in 2010. DSA is a non-intrusive, straightforward approach to assessing organizational, technical, and operational infrastructure, and signifies a basic level of accreditation. DSA assessment helped ICPSR become more transparent, monitor and improve archival processes and procedures, and raise awareness within the organization and beyond about best practices for repositories. We relate our experiences with the DSA process, and describe challenges and opportunities associated with DSA assessment.
The research data landscape of the last International Polar Year was dramatically different from its predecessors. Data scientists documented lessons learned about management of large, diverse, and interdisciplinary datasets to inform future development and practices. Improved, iterative, and adaptive data curation and system development methods to address these challenges will be facilitated by building collaborations locally and globally across the ‘data ecosystem’, thus, shaping and sustaining an international data infrastructure to fulfil modern scientific needs and societal expectations. International coordination is necessary to achieve convergence between domain-specific data systems and hence enable multidisciplinary approaches needed to solve the Global Challenges.
Data management is integral to sound polar science. Through analysis of documents reporting on meetings of the Arctic data management community, a set of priorities and strategies are identified. These include the need to improve data sharing, make use of existing resources, and better engage stakeholders. Network theory is applied to a preliminary inventory of polar and global data management actors to improve understanding of the emerging community of practice. Under the name the Arctic Data Coordination Network, we propose a model network that can support the community in achieving their goals through improving connectivity between existing actors.