This article proposes an innovative utility-sentient approach for mining interesting association patterns from transaction databases. First, frequent patterns are discovered from the transaction database using the FP-Growth algorithm. From the frequent patterns mined, the approach extracts novel interesting association patterns with emphasis on significance, utility, and the subjective interests of users. The experimental results demonstrate the efficiency of this approach in mining utility-oriented and interesting association rules. A comparative analysis is also presented to illustrate the approach's effectiveness.
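The post-mining step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the frequent itemsets are assumed to have been mined already (e.g. by FP-Growth), and the per-item utilities and thresholds are invented for the example.

```python
# Hypothetical post-processing sketch: rank already-mined frequent
# itemsets (e.g. the output of FP-Growth) by a utility score.  The
# item utilities and thresholds below are invented for illustration.

def interesting_patterns(frequent, utilities, min_support, min_utility):
    """Keep patterns that clear both thresholds, highest utility first."""
    kept = []
    for itemset, support in frequent:
        utility = sum(utilities.get(item, 0) for item in itemset)
        if support >= min_support and utility >= min_utility:
            kept.append((itemset, support, utility))
    return sorted(kept, key=lambda t: -t[2])

frequent = [(frozenset({"bread", "butter"}), 10),   # (itemset, support)
            (frozenset({"milk"}), 3)]
utilities = {"bread": 5, "butter": 2, "milk": 1}    # subjective item weights
result = interesting_patterns(frequent, utilities, min_support=5, min_utility=6)
```

Here subjective user interest enters through the utility weights, so a pattern with modest support can still rank highly if its items matter to the user.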
By comparing a hard real-time system with a soft real-time system, this article elicits the risk of over-design in soft real-time system design. To deal with this risk, a novel concept of statistical design is proposed. Statistical design is the process of accurately accounting for and mitigating the effects of variation in part geometry and other environmental conditions while simultaneously optimizing a target performance factor. However, statistical design can be a very difficult and complex task when classical mathematical methods are used. Thus, a simulation methodology for optimizing the design is proposed in order to bridge the gap between real-time analysis and optimization for robust and reliable system design.
It is important to improve data reliability and data access efficiency for data-intensive applications in a data grid environment. In this paper, we propose an Information Dispersal Algorithm (IDA)-based parallel storage scheme for massive data distribution and parallel access in the Scientific Data Grid. The scheme partitions a data file into unrecognizable blocks and distributes them across many target storage nodes according to the user profile and system conditions. A subset of the blocks, which can be downloaded in parallel to remote clients, is sufficient to reconstruct the data file. The scheme can be deployed on top of current grid middleware. A demonstration and experimental analysis show that the IDA-based parallel storage scheme offers better data reliability and data access performance than existing data replication methods. Furthermore, the scheme has the potential to considerably reduce storage requirements for large-scale databases on a data grid.
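The block-and-threshold behaviour of an IDA can be illustrated with a small pure-Python sketch. This is not the authors' implementation; it is a Rabin-style dispersal over the prime field GF(257), with hypothetical parameters of n = 5 blocks and reconstruction threshold m = 3.

```python
# Sketch of Rabin-style information dispersal over GF(257).
# A file is cut into chunks of m bytes; block i stores, for every
# chunk, the chunk polynomial evaluated at point x = i.  Any m of
# the n blocks suffice to reconstruct the file.

P = 257  # smallest prime above 255, so every byte value is a field element

def mat_inv(M, p=P):
    """Invert an m x m matrix over GF(p) by Gauss-Jordan elimination."""
    n = len(M)
    A = [row[:] + [int(i == j) for j in range(n)] for i, row in enumerate(M)]
    for col in range(n):
        piv = next(r for r in range(col, n) if A[r][col] % p)
        A[col], A[piv] = A[piv], A[col]
        s = pow(A[col][col], p - 2, p)          # modular inverse of the pivot
        A[col] = [v * s % p for v in A[col]]
        for r in range(n):
            if r != col and A[r][col]:
                f = A[r][col]
                A[r] = [(v - f * w) % p for v, w in zip(A[r], A[col])]
    return [row[n:] for row in A]

def split(data, n, m):
    """Disperse `data` into n blocks, any m of which reconstruct it."""
    data = data + bytes((-len(data)) % m)       # zero-pad to a multiple of m
    chunks = [data[i:i + m] for i in range(0, len(data), m)]
    blocks = []
    for i in range(1, n + 1):                   # evaluation points x = 1..n
        row = [pow(i, j, P) for j in range(m)]
        blocks.append((i, [sum(r * c for r, c in zip(row, ch)) % P
                           for ch in chunks]))
    return blocks

def reconstruct(blocks, m, length):
    """Rebuild the file from any m blocks (each a (x, values) pair)."""
    xs = [b[0] for b in blocks[:m]]
    vals = [b[1] for b in blocks[:m]]
    Vi = mat_inv([[pow(x, j, P) for j in range(m)] for x in xs])
    out = bytearray()
    for k in range(len(vals[0])):               # solve one chunk at a time
        col = [v[k] for v in vals]
        for row in Vi:
            out.append(sum(a * b for a, b in zip(row, col)) % P)
    return bytes(out[:length])
```

With these parameters, any three of the five blocks rebuild the file, so the loss of two storage nodes is tolerated, and each block is only about 1/m of the file size, which is the storage saving over full replication.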
In some areas of science, sophisticated web services and semantics underlie "cyberinfrastructure". However, in "small science" domains, especially in field sciences such as archaeology, conservation, and public health, datasets often resist standardization. Publishing data in the small sciences should embrace this diversity rather than attempt to corral research into "universal" (domain) standards. A growing ecosystem of increasingly powerful Web-syndication-based approaches to sharing data on the public Web offers a viable alternative. Atom-feed-based services can be used with scientific collections to identify and create linkages across different datasets, even across disciplinary boundaries, without shared domain standards.
We have rich information resources for materials science and engineering - raw measurement data, computational simulation methods, digitized handbooks, and digital libraries. However, these resources come in a wide variety of formats, terminologies, and concepts, which makes it difficult to find appropriate information for materials design, development, and evaluation. One solution to this problem is to integrate these resources into a computer-readable concept map, called a domain ontology, which describes concepts and the relationships among them in materials science and engineering. This paper describes a trial construction of a metadata description standard using an ontology language and demonstrates its validity through data exchange among heterogeneous materials databases. The "Materials Ontology," which consists of several sub-ontologies corresponding to substance, process, environment, and property, is developed using OWL, the ontology language of the Semantic Web, which enables the definition of a flexible and detailed structure for materials information. A versatile "materials data format" is built on the Materials Ontology as a component of the materials information platform and is applied to exchange data among three different thermal property databases maintained by two major materials science research institutes in Japan.
Microdata are a valuable source of information for research. However, publishing data about individuals for research purposes without revealing sensitive information is an important problem. The main objective of privacy-preserving data mining algorithms is to obtain accurate results/rules by analyzing the maximum possible amount of data without unintended information disclosure. Data sets for analysis may reside on a centralized server or in a distributed environment. In a distributed environment, the data may be horizontally or vertically partitioned. We have developed a simple technique by which horizontally partitioned data can be used for any type of mining task without information loss. The partitioned sensitive data at 'm' different sites are transformed using a mapping table or a graded grouping technique, depending on the data type. The transformed data set is given to a third party for analysis. This party need not be trusted, but it is still allowed to perform mining operations on the data set and to release the results to all 'm' parties. The results are interpreted among the 'm' parties involved in the data sharing. Experiments conducted on real data sets show that the proposed transformation procedure fully preserves the performance of any data mining algorithm, relative to the original data set, while preserving privacy.
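The site-side transformation might look like the following sketch. It is illustrative only: the mapping table and grade bands are invented for the example, not taken from the paper. Each of the 'm' sites applies the same shared tables, so the third party can mine the pooled, transformed records without seeing raw values.

```python
# Illustrative site-side transformation (the mapping table and grade
# bands are invented for this sketch).  Categorical attributes go
# through a shared mapping table; numeric attributes are replaced by
# grade labels (graded grouping).

MAPPING_TABLE = {"diabetes": "D17", "asthma": "D05"}           # categorical -> code
AGE_GRADES = [(0, 18, "G1"), (18, 40, "G2"), (40, 130, "G3")]  # numeric -> grade

def grade(value):
    for lo, hi, label in AGE_GRADES:
        if lo <= value < hi:
            return label
    raise ValueError(f"no grade for {value}")

def transform(record):
    """Transform one record before release to the third-party miner."""
    return {"age": grade(record["age"]),
            "condition": MAPPING_TABLE[record["condition"]]}

released = [transform(r) for r in
            [{"age": 25, "condition": "asthma"},
             {"age": 67, "condition": "diabetes"}]]
```

Because every site uses identical tables, equal raw values map to equal released values, which is what lets standard mining algorithms run on the transformed data without loss of pattern structure.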
The use of scientific data is becoming increasingly dependent on the software that fosters such use. Because the ability to reuse software contributes to the ability to reuse software-dependent data, instruments for measuring software reusability contribute to the reuse of both software and related data. The development and current state of a proposed set of Reuse Readiness Levels (RRLs) are summarized, and potential uses of these software reusability measures are described, along with proposed use cases to support sponsorship of software projects, software production, software adoption, and data stewardship during the systems development lifecycle and the data lifecycle.
An alternative ratio-cum-product estimator of the population mean using the coefficient of kurtosis of two auxiliary variates is proposed. The proposed estimator is compared with the simple mean estimator, the usual ratio estimator, the product estimator, and the estimators proposed by Singh (1967) and Singh et al. (2004). An empirical study is also carried out in support of the theoretical findings.
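For orientation, the classical ratio-cum-product estimator of the population mean $\bar{Y}$ with auxiliary variates $x$ and $z$ (Singh, 1967), and a representative kurtosis-shifted variant of the kind proposed here, are shown below. The notation is assumed and the second form is illustrative; the paper's exact estimator is not reproduced.

```latex
% Classical ratio-cum-product estimator (Singh, 1967):
\hat{\bar{Y}}_{RP} = \bar{y}\,\frac{\bar{X}}{\bar{x}}\cdot\frac{\bar{z}}{\bar{Z}}

% Representative variant shifted by the coefficients of kurtosis
% \beta_2(x) and \beta_2(z) of the auxiliary variates (illustrative form):
\hat{\bar{Y}} = \bar{y}\,
  \frac{\bar{X}+\beta_2(x)}{\bar{x}+\beta_2(x)}\cdot
  \frac{\bar{z}+\beta_2(z)}{\bar{Z}+\beta_2(z)}
```

Here $\bar{y}$, $\bar{x}$, $\bar{z}$ are sample means and $\bar{X}$, $\bar{Z}$ are the known population means of the auxiliary variates; the kurtosis coefficients act as known constants that stabilize the ratios.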
NRC-CISTI serves Canada as its National Science Library (as mandated by Canada's Parliament in 1924) and also provides direct support to researchers of the National Research Council of Canada (NRC). By virtue of its mandate, vision, and strategic positioning, NRC-CISTI has been rapidly and effectively mobilizing Canadian stakeholders and resources to become a lead player on both the national and international scenes in matters relating to the organization and management of scientific research data. In a previous communication (CODATA International Conference, 2008), NRC-CISTI's orientation towards this objective and its short- and medium-term plans and strategies were presented. Since then, significant milestones have been achieved. This paper presents NRC-CISTI's most recent activities in these areas, which are progressing well alongside a strategic organizational redesign process that is realigning NRC-CISTI's structure, mission, and mandate to better serve its clients. Throughout this transformational phase, activities relating to data management remain vibrant.
Work toward the creation of a knowledge-sharing system for sustainability science through the application of semantic data modeling is described. An ontology grounded in description logics was developed, based on the ISO 15926 data model, to describe three types of sustainability science conceptualizations: situational knowledge, analytic methods, and scenario frameworks. Semantic statements were then created using this ontology to describe expert knowledge expressed in research proposals and papers related to sustainability science and in scenarios for achieving sustainable societies. Semantic matching based on logic and rule-based inference was used to quantify the conceptual overlap of semantic statements, revealing the semantic similarity of topics studied by different researchers in sustainability science - similarities that might be unknown to the researchers themselves.
Privacy protection is indispensable in data mining, and many privacy-preserving data mining (PPDM) methods have been proposed. One such method is based on singular value decomposition (SVD), which uses SVD to find unimportant information for data mining and removes it to protect privacy. Independent component analysis (ICA) is another data analysis method. If both SVD and ICA are used, unimportant information can be extracted more comprehensively. Accordingly, this paper proposes a new PPDM method using both SVD and ICA. Experiments show that our method performs better in preserving privacy than the SVD-based methods while also maintaining data utility.
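The SVD half of the idea can be sketched in pure Python. The ICA step is omitted, and this is an illustration of truncating a data matrix to its dominant singular component, not the authors' method: the released low-rank matrix discards the small singular components that are unimportant for mining but can expose individual records.

```python
# Pure-Python sketch of the SVD step only (the ICA step is omitted):
# publish a low-rank approximation of the data matrix so that small
# singular components - unimportant for mining but able to expose
# individual records - are discarded.

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def norm(v):
    return sum(x * x for x in v) ** 0.5

def top_component(A, iters=200):
    """Dominant singular triplet of A via power iteration on A^T A."""
    At = transpose(A)
    v = [1.0] * len(A[0])
    for _ in range(iters):
        w = matvec(At, matvec(A, v))
        v = [x / norm(w) for x in w]            # normalized right singular vector
    Av = matvec(A, v)
    sigma = norm(Av)                            # dominant singular value
    u = [x / sigma for x in Av]                 # left singular vector
    return sigma, u, v

def truncate_rank1(A):
    """The released, privacy-perturbed matrix: best rank-1 approximation."""
    sigma, u, v = top_component(A)
    return [[sigma * ui * vj for vj in v] for ui in u]
```

In a realistic setting one would keep the top k components rather than one, choosing k to balance mining accuracy against the amount of individual-level detail removed.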
"Proceedings of the International Symposium: Fifty Years after IGY - Modern Information Technologies and Earth and Solar Sciences -" (Eds. Iyemori, T. et al.) Part 2
Astronomical Virtual Observatories (VOs) are an emerging research environment for astronomy; 16 countries and one region have funded the development of VOs based on international standard protocols for interoperability. The 16 funded VO projects have established the International Virtual Observatory Alliance (IVOA, http://www.ivoa.net/) to develop standard interoperable interfaces such as the registry (metadata), data access, query languages, output format (VOTable), data model, and application interface. The IVOA members have each constructed their VO environments on the IVOA interfaces. The National Astronomical Observatory of Japan (NAOJ) started its VO project (Japanese Virtual Observatory - JVO) in 2002 and developed its VO system. Since December 2004, we have succeeded in interoperating the latest JVO system with other VOs in the USA and Europe. Data observed by the Subaru telescope, satellite data taken by JAXA/ISAS, and other holdings are connected to the JVO system. Successful interoperation of the JVO system with other VOs means that astronomers worldwide will be able to utilize top-level data obtained by these telescopes from anywhere at any time. We describe the design of the JVO system, experiences during its development, including problems with current standard protocols defined by the IVOA, and proposals for resolving these problems in the near future.
The data center of our institute distributes solid earth science data obtained by the Ocean Hemisphere Project (OHP) network through the Pacific 21 website. We have developed Java-based software, "GDSClient", which uses web service technology to collect not only data from the OHP network but also data distributed by other centers. Data requests can be controlled through parameters such as the data center, observatory, and data period, as well as other auxiliary parameters. Users need not know the differences between data centers, provided a WSDL (Web Services Description Language) file is prepared that describes the service interface in XML. The latest GDSClient releases are available from the Pacific 21 website.
Information Technology Challenges in Earth and Solar Sciences (Part 2)
The Space Physics Archive Search and Extract (SPASE) project has developed an information model for interoperable access and retrieval of data within the Heliophysics (also known as space and solar physics) science community. The diversity of science data archives within this community has led to the establishment of many virtual observatories to coordinate the data pathways within Heliophysics subdisciplines, such as magnetospheres, waves, radiation belts, etc. The SPASE information model provides a semantic layer and common language for data descriptions so that searches might be made across the whole of the heliophysics data environment, especially through the virtual observatories.
As informatics becomes embedded in the scientific method, workload shifts from the user to the provider of data and information services and systems. Yet there is little incentive for research scientists to devote time to data management and system development. Our reward system can be adjusted to encourage responsible data management and open access practices, as well as motivate people to develop systems and services for the common good. At the same time, the status and professional infrastructure for those engaged in informatics needs to match traditional scientific and technical disciplines and create an attractive, competitive career path. Five readily achievable steps can be taken to redress these imbalances.
Sea Floor ElectroMagnetic Stations (SFEMSs) are now operating at two deep-seafloor sites, called 'WPB' and 'NWP', in the West Philippine Basin and the Northwest Pacific Basin, respectively. One of the main objectives of the SFEMSs is to detect geomagnetic secular variations on the deep seafloor, where long-term geomagnetic observations have not previously been achieved. SFEMSs can measure the absolute geomagnetic total force as well as the geomagnetic vector field using precise attitude monitoring systems. More than five years of observed vector geomagnetic time series reveal that the westward drift of the equatorial dipole dominates the geomagnetic secular variation at the NWP.
We have further modified our extended dynamic model of a geyser induced by an inflow of gas by taking into consideration the effects, during spouting, of an elbow shape, of pairs of sudden expansions and contractions, and of repetitions of this shape in the underground watercourse. Numerical simulations of this extended dynamic model show that a large number of sudden expansions and contractions, or a large-angle elbow, in the underground watercourse greatly affects the spouting dynamics of the geyser.
The European e-infrastructure is the ICT support for research, although it will be extended for commercial/business use. It supports the research process from funding agencies through research institutions to innovation. It supports experimental facilities, modelling and simulation, communication between researchers, and the workflow of research processes and research management. We propose that its core should be CERIF: an EU Recommendation to member states for exchanging research information and for homogeneous access to heterogeneous information. CERIF can also integrate associated systems (such as finance, human resources, project management, and library services) and provides interoperation among research institutions, research funders, and innovators.
2 The need for a CRIS. Structure and Use of a CRIS - The Common European Research Information Format Model (CERIF)
A CERIF-CRIS consists of base entities, with records describing components of the research, and link entities, describing relationships among records in the base entities. As an example, three base entities may contain records describing a person, a publication, and a project, while two link entities relate, respectively, the person to the publication in the role of author and the person to the project in the role of project leader. This powerful linking or inter-relating capability includes temporal as well as role aspects and inter-relates all the components of R&D dynamically and flexibly. The CERIF model can be extended to inter-relate appropriate information from legacy information systems in an organisation, such as those covering accounting, human resources, project management, assets, stock control, etc. A CERIF-CRIS can thus provide flexible, low-cost integration comparable with an ERP (Enterprise Resource Planning) system, particularly in an organisation with R&D as its primary business.
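The base-entity/link-entity pattern described above can be sketched as follows. The names and fields are illustrative, not the normative CERIF schema; the point is that the role and the validity period live on the link record, not on the base entities.

```python
# Illustrative sketch of the base-entity / link-entity pattern
# (not the normative CERIF schema): base entities hold records,
# link entities carry a role and a validity period.
from dataclasses import dataclass
from datetime import date

@dataclass
class Person:
    id: str
    name: str

@dataclass
class Publication:
    id: str
    title: str

@dataclass
class Project:
    id: str
    title: str

@dataclass
class Link:
    """Link entity: relates two base-entity records with role and period."""
    src: str
    dst: str
    role: str
    start: date
    end: date

people = {"pers1": Person("pers1", "A. Researcher")}
publications = {"publ1": Publication("publ1", "An Example Paper")}
projects = {"proj1": Project("proj1", "An Example Project")}

links = [
    Link("pers1", "publ1", "author", date(2008, 1, 1), date(2008, 12, 31)),
    Link("pers1", "proj1", "project leader", date(2007, 1, 1), date(2009, 12, 31)),
]

def roles_of(person_id):
    """All (target record, role) pairs in which one person appears."""
    return {(l.dst, l.role) for l in links if l.src == person_id}
```

Because every relationship is a dated, role-bearing record, the same person can be author of one item and project leader of another, over different periods, without changing the base-entity tables.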
CRIS (Current Research Information Systems) provide researchers, research managers, innovators, and others with a view over the research activity of a domain. IRs (institutional repositories) provide a mechanism for an organisation to showcase its intellectual property through OA (open access). Increasingly, organizations are mandating that their employed researchers deposit peer-reviewed published material in the IR. Research funders are increasingly mandating that publications be deposited in an open access repository: some mandate a central (or subject-based) repository, some an IR. In parallel, publishers are offering OA but replacing subscription-based access with author (or author institution) payment for publishing. However, many OA repositories have metadata based on DC (Dublin Core), which is inadequate; a CERIF (Common European Research Information Format) CRIS provides metadata describing publications with formal syntax and declared semantics, thus facilitating interoperation, or homogeneous access over heterogeneous sources. This formality is essential for research output metrics, which are increasingly used to determine future funding for research organizations.
With increased computing power, more data than ever are being and will be produced, stored, and (re-)used. Data are collected in databases, computed and annotated, or transformed by specific tools. The knowledge derived from data is documented in research publications, reports, presentations, or other types of files. The management of data and knowledge is difficult, and their re-use, exchange, or integration is more complicated still. To allow for quality analysis or integration across data sets and to ensure access to scientific knowledge, additional information - Research Information - has to be assigned to data and knowledge entities. We present the metadata model CERIF, which adds information to entities such as Publication, Project, Organisation, Person, Product, Patent, Service, Equipment, and Facility and manages the semantically enhanced relationships between these entities in a formalized way. CERIF was released as an EC Recommendation to European Member States in 2000. Here, we refer to the latest version, CERIF 2008-1.0.
CRISs (Current Research Information Systems) are becoming increasingly important for organizations related to research, such as funding organisations, universities, and ministries. A CRIS holds information on research activities, results of research, and competence. A CRIS is useful for assessing a person or department, showing an institution's activity, monitoring scholarly activities, and as a base for the development of research strategy. This information could come from a local CRIS, a national CRIS, or from interoperable CRISs. A CRIS is most useful if it is structured and can interoperate with other CRISs. The CERIF model (Common European Research Information Format) is a structured model able to provide statistics for planning, evaluation, and assessment within an institution, or benchmarking among institutions. CERIF CRISs can provide multiple views, such as a researcher's CV and an overview of an institution's projects (ongoing or ended) with project partners at the organizational or personal level. The output publications of a project are given for an individual researcher or institution, with linkage to the full text (in the local repository), along with a list of the journals where researchers or organizations are publishing, events, and an annual report on an individual researcher. A CERIF CRIS is recommended by the EU for interoperability among CRISs. CERIF provides a one-stop shop for users and gives uniform access to full-text publications and scientific data. A partial model covering people, organisations, and results, but not projects, can be used; it is recommended, however, to implement the full model. To secure consistent information, it is also recommended to establish authority lists in the CERIF CRIS for people (unique ID, name, organization, position, age, sex, etc.), organisations (name, acronym, address, etc.), journals (title, acronym, publisher, URL, etc.), and books (publisher, acronym, address, country, etc.).
4 CRIS and the European e-Infrastructure, Enabling European Research
The ESFRI Roadmap marked a turning point in the evolution of European thinking on research facilities, providing a catalogue of such facilities with their characteristics. In parallel, the ESF (European Science Foundation) completed a questionnaire-based survey of research facilities. Finally, the ERF (European Research Facilities) consortium, representing national facilities with international access, was formed to parallel EIROForum (the European laboratories funded by international subscriptions). It is becoming increasingly clear that management of these facilities and of the research process requires extensive ICT: for research managers this is provided by CRIS (Current Research Information Systems); researchers additionally need access to the facilities to control experiments, with associated modelling and simulation, and access to research datasets and software.
End users demand low-effort-threshold access to systems providing e-information, e-business, and e-entertainment. Innovators and entrepreneurs require equally low-effort access to heterogeneous information homogenised to a form and language familiar to them. On top of that, decision-makers, whether in a control room or in government strategic planning, demand equally easy access to information that is statistically or inductively enhanced to knowledge, and access to modelling or simulation systems to allow 'what if?' requests. Researchers and technical workers additionally require rapid integration of information with statistical, induction, modelling, and simulation systems to generate and verify hypotheses, thereby generating data and information to be used by others, which in turn advances knowledge. Access is required, and can now be provided, anytime, anyhow, anywhere through ambient computing technology. A new paradigm, GRIDs, provides the architectural framework.
Scientific research is supported by infrastructure, and e-infrastructure is one part of this. Repositories of data are a part of the e-infrastructure and have their own particular needs arising from the requirement for permanence of their data holdings. There are many threats to permanence, and there is a growing awareness of these threats and how they may be countered. Current Research Information Systems and other support to the research lifecycle, while focused on facilitating research activities in the present, will have a role in the preservation of the outputs of research into the future.
Scholarly publications are a major part of the research infrastructure. One way to make output available is to store the publications in Open Access Repositories (OAR). A Current Research Information System (CRIS) that conforms to the standard CERIF (Common European Research Information Format) could be a key component in the e-infrastructure. A CRIS provides the structure and makes it possible to interoperate the CRIS metadata at every stage of the research cycle. The international DRIVER projects are creating a European repository infrastructure. Knowledge Exchange has launched a project to develop a metadata exchange format for publications between CRIS and OAR systems.
Academic and industrial users increasingly face the challenge of petabytes of data, but managing and analyzing such large data sets remains a daunting task. The 4th Extremely Large Databases workshop was organized to examine the needs of communities facing these issues that were under-represented at past workshops. Approaches to statistical analytics on big data, as well as opportunities related to emerging hardware technologies, were also debated. Writable extreme-scale databases and the science benchmark were discussed. This paper is the final report of the discussions and activities at this workshop.