CALIFORNIA STATE UNIVERSITY, NORTHRIDGE
MASTER PROJECT PROPOSAL
DATABASE COMMUNICATION IN A HETEROGENEOUS ENVIRONMENT
A graduate project submitted in partial satisfaction of
the requirements for the degree of Master of Science in
Computer Science
by
Son Ky Dao
August 1984
The graduate project of Son Ky Dao is approved:
Rein Turn
Russell J. Abbott, Chairman
Date
California State University, Northridge
ACKNOWLEDGMENTS
Quite a few people assisted me in the completion of this project, and I thank them all. I would specifically like to acknowledge Professor Russ Abbott, my wife Holly, and my daughter Jasmin for their support and inspiration.
LIST OF FIGURES:
Figure 1: Distributed Database Network Configuration
Figure 2: Functional Data Model Description of a University Database
Figure 3: MULTIBASE Schema Architecture
Figure 4: MULTIBASE Component Architecture
Figure 5: GDM Component Architecture
Figure 6: Local Database Interface Architecture
Figure 7: UCLA System Architecture
Figure 8: UCLA Layered Architecture
Figure 9: Communication Flow of a User Query through the Four Translation Levels to the Local Databases
TABLE OF CONTENTS

Acknowledgments
List of Figures
Abstract

SECTION I
HETEROGENEOUS DISTRIBUTED DATABASE ENVIRONMENT
1.1 Introduction
1.2 Reasons for Using Distributed Database
1.2.1 Data Resources Utilization
1.2.2 Data Resources Availability
1.2.3 Data Resources Integrity
1.3 Goals
1.3.1 Adaptability
1.3.2 DBMS Transparency
1.3.3 Existing Network Support
1.3.4 Functionality
1.4 Problem Areas
1.4.1 Distributed Databases
1.4.2 Modeling
1.4.3 Language Support
1.4.3.1 Programming Language Support
1.4.3.2 Data Manipulation Language Support
1.4.4 Functionality
1.4.5 Translation
1.4.5.1 Query Translation
1.4.5.2 Data Translation
1.5 Approaches
1.5.1 Heterogeneous Distributed Database Management System
1.5.2 Distributed Application Generator

SECTION II
HETEROGENEOUS DISTRIBUTED DATABASE MANAGEMENT SYSTEM
2.1 MULTIBASE
2.1.1 Introduction
2.1.2 Capabilities and Design Objectives
2.1.2.1 Capabilities
2.1.2.2 Design Objectives
2.1.3 DAPLEX Language and the Functional Data Model
2.1.3.1 Functional Data Model
2.1.3.2 DAPLEX Language
2.1.4 MULTIBASE Architecture
2.1.4.1 Component Architecture
2.1.4.2 Schema Architecture
2.1.5 Query Processing Architecture
2.1.5.1 Global Data Manager
2.1.5.2 Local Database Interface
2.1.6 View Definition and Generalization for Database Integration
2.1.6.1 Mapping Between the Local Host Schema and the DAPLEX Local Schema
2.1.6.2 Integration of DAPLEX Local Schemata Using Views and Generalization
2.1.7 Query Modifications
2.1.8 Database Integration and Incompatibility Handling
2.1.8.1 Schema Integration
2.1.8.2 Data Integration
2.2 Four-Level Architecture Approach (UCLA)
2.2.1 Introduction
2.2.2 Concept and Architecture
2.2.2.1 System Architecture
2.2.2.2 Four Translation Layers
2.2.3 Conceptual Model Using the E-R Model
2.2.3.1 E-R Model
2.2.3.2 Representation of Layers' Conceptual Model
2.2.4 Representation of Global Conceptual Model

SECTION III
SUMMARY
3.1 Differences Between the Two Approaches
3.1.1 Data Model
3.1.2 Language Support
3.1.3 Architecture
3.1.4 Functionality
3.1.5 Translation
3.2 Conclusion

References
Figures
ABSTRACT
DATABASE COMMUNICATION IN A HETEROGENEOUS ENVIRONMENT
By
Son Ky Dao
Master of Science in Computer Science
In a computer network environment, the problem of communicating between different database management systems is essentially the same whether or not the systems are physically distributed in a network environment. In fact, the distributed case generates extra difficulties which make any solution much more complicated and hard to achieve. Since different databases in general have different schemata, different data models, and different retrieval languages, many difficulties arise in formulating and implementing retrieval requests (or queries) that require data from more than one database. The solution to this problem is still in the research stage. A survey of current research will be performed for this project.

The first part of the project introduces and describes the problems associated with building a Heterogeneous Distributed Database Management System, and the reasons for using distributed databases. Next, two major approaches, MULTIBASE and the UCLA four-translation-layer architecture, are explained in detail. They are described and contrasted with respect to architecture, data modeling, language support, functionality, and translation. Several similarities and differences between the two projects are illustrated.
SECTION I
HETEROGENEOUS DISTRIBUTED DATABASE ENVIRONMENT
1.1 INTRODUCTION
Generally speaking, a distributed database can be defined as a database which has its data contents in separate locations. According to this definition, there are two different kinds of distributed database environments:
- A database with two or more separate files on one computer system.
- A database with data distributed between distinct processor nodes (or CPUs) in a network.
The second is the kind of distributed database discussed in this paper. That is, a distributed database has its data contents spread throughout several independent processors in a network configuration (Figure 1).
The need for communication between these databases has been recognized in recent years as a growing problem (CHU 79). The recognition that data may be interrelated across different databases has become a major thrust behind the need to exploit interdatabase relationships (CHEN 76). Communication between distinct databases is now considered a valuable and necessary tool. These databases can be represented in the same or different data models and can be managed by the same or different database management systems. (CHU 79) classifies these distributed databases into two different categories:
- Homogeneous databases.
- Heterogeneous databases.
Homogeneous and Heterogeneous Distributed Databases
In a homogeneous environment, only one kind of database management system is used at all sites of the computer network. For example, IMS is the Database Management System (DBMS) used for all the databases in the network. In this case, each independent database has an identical data model (hierarchical) and an identical retrieval language (DL/I).

In the heterogeneous environment, different types of databases are used in the network. These databases are represented in different data models (i.e., hierarchical, network, relational, etc.) and are managed by different database management systems (i.e., IMS, IDMS, TOTAL, SYSTEM/2000, SYSTEM R), perhaps on different hardware (CARDENAS 79). Each independent database has its own schema, is expressed in its own data model, and can only be accessed by its own data retrieval language (i.e., DL/I, SEQUEL, etc.).
Intuitively, a homogeneous distributed database can be considered a subset of a heterogeneous distributed database. This paper focuses only on the heterogeneous distributed database (HDDB).
Today, distributed heterogeneous databases occur in many organizations for the following reasons:
- Many databases were created before the benefits of data integration were well understood.
- The lack of a central database administrator (DBA) has encouraged individual groups within an organization to create and maintain separate databases that satisfy their own requirements. Typically, these individual databases are not globally integrated.
- The distribution of data across several sites is essential in some environments to handle large volumes of data processing.
Currently, a great deal of research is being performed concerning communication in heterogeneous database environments (PIRA 79), (SMIT 81), (ADIBA 82), (GLORIEUX 82). The two approaches discussed in Section II are the MULTIBASE and UCLA four-translation-layer projects.

In Section I, the reasons for using distributed databases and the associated problems are explained. The system architecture, query translation, view definition, and database integration are major topics in Section II. In Section III, similarities and differences between the two approaches are described with respect to data models, architecture, functionality, query translation, and language support. This is then followed by a brief remark about the heterogeneous distributed database approaches.
1.2 Reasons for Using Distributed Database

The reasons for and advantages of using distributed database processing are discussed in this subsection. The primary reason for employing a distributed database is to improve the utilization, availability, and integrity of data resources (BRAY 76).
1.2.1 Data Resources Utilization

Within the distributed environment, the utilization of data resources is improved in two distinct ways: the efficient use of the available storage medium and the reduced workload for a given CPU.

The available storage medium can be optimized by providing several nodes access to a common database. This also reduces data redundancy. CPU workload is divided equally among the nodes in the network, as compared to the workload in a non-distributed database. Ideally, the major workload of a CPU concerns only its local database, occasionally exchanging information to and from other nodes in the network.
1.2.2 Data Resources Availability

In a distributed environment, a number of processors are utilized to handle user needs, and a number of separate files in the network are linked together to form a distributed database. The sharing of data between nodes and the availability of a larger view of the entire database improve data availability. Availability of data can also be increased by incurring some redundancy between nodes. For example, more than one node may contain the same information. In this case, the disadvantage of data redundancy is that the consistency of intersection data between nodes is in jeopardy. The enhanced data availability in a distributed database increases the overall productivity of the individual databases composing the distributed database. This is accomplished through the ability to develop more applications concerning the entire database.
1.2.3 Data Resources Integrity

Integrity constraints for the distributed database will be maintained automatically by the distributed database system. In the distributed environment, if two local databases containing intersection data are linked together and the information in one database is modified, then the system will automatically modify the information in the other databases. This same function is handled by the Database Administrator (DBA) in a non-distributed environment.

Of course, the gains described above can only be achieved if a proper distributed database design is accomplished.
1.3 Goals

HDDB has been a growing topic in recent research (GLIG 84, LAND 82, PIRA 80). The need for HDDB management has arisen due to the proliferation of databases and differing DBMSs during the past fifteen years. As DBMSs began to emerge, organizations began implementing databases with the DBMS which best suited their needs. Frequently, a medium to large scale organization had several databases implemented across its corporate locations. These databases were quite frequently implemented without consideration for one another. The result is that the databases are unable to share information and keep common information consistent without complicated mechanisms. This is a result of the differing DBMSs, differing database designs, and distribution.

A fundamental need exists for automation of HDDB management as a means for integrated uniform access. This need arises because HDDB users cannot be expected to learn the use of the many different DBMSs and the operational differences between them.

This need implies several interesting goals to achieve an acceptable solution. These goals can be categorized as follows (GLIG 84):
1. transparency of the existing DBMSs;
2. adaptability to the current environment;
3. existing network support; and
4. functionality.
1.3.1 Adaptability

The automation of an HDDB environment should not interfere with the existing database environments. Application programs currently operational on the individual databases should not have to be modified. Similarly, individual databases and database definitions should not be altered. Local site autonomy must be maintained.

1.3.2 DBMS Transparency

To achieve DBMS transparency is to allow for integrated uniform processing of the HDDB. In this manner, users of the HDDB need not be concerned with the underlying DBMSs. They are concerned with the HDDB as if it were a stand-alone database.
1.3.3 Existing Network Support

To allow for a large user population, a goal is to allow HDDB automation in existing network environments. For example, existing public and production networks must be considered. This support would be provided at the application layer of the ISO Reference Model for Open Systems Interconnection (ZIMM 80).
1.3.4 Functionality

A major objective is to allow as much functional support of the HDDB as possible. At best, the functional support of an HDDB should be the same as the functional support of the individual databases. This is to allow integrated access and maintenance of as much data as possible.

1.4 Problem Areas

Achieving these goals involves the solution of several complex problems. These problem areas are classified into five categories:
1. distributed database;
2. modeling;
3. language support;
4. functionality; and
5. translation.
1.4.1 Distributed Database

Since an HDDB is a distributed database, all problems associated with the general topic of distributed database management must be considered. These problems include directory design, concurrency control, security, deadlock, and reliability issues.

Since most of the proposed solutions to these problems for homogeneous distributed database management could be extended for the management of an HDDB, they are not discussed in any detail in this paper. For a discussion of these problems, refer to (CHU 79).

1.4.2 Modeling
A data model must be provided for integration and uniformity of the HDDB. The model must be able to encompass all supported DBMS data models, including relational, network, hierarchical, and file. Examples of these are SQL, IDMS, IMS (often referred to as DL/I), and VSAM, respectively.

These data models have several inherent differences in structure and use. Because of these differences, the model chosen for the HDDB must be highly data independent. The model should encompass the relationships of the data items in a semantically meaningful way.
1.4.3 Language Support

Language support is divided into two categories. They are:
1. programming language support; and
2. data manipulation language support.

1.4.3.1 Programming Language Support

Programming language support implies the ability to access the HDDB using languages such as COBOL or PL/I. In this manner, application programmers can use the programming language to which they are accustomed.
1.4.3.2 Data Manipulation Language Support

A Data Manipulation Language (DML) is a language provided by a DBMS which allows access to a database. Two approaches have been suggested for the DML supported by an HDDB. One approach is to define a language which allows manipulation of the HDDB data model. This is the approach used in MULTIBASE (LAND 82). A second approach is to intercept the DMLs of existing DBMSs and transform them into queries posed against the HDDB. This is the goal of the UCLA research project (PIRA 80). These two approaches are discussed in Section II.
1.4.4 Functionality

Several inherent differences exist in today's various DBMSs. For example, a delete in a relational database deletes a set of tuples which satisfy some selection criterion, whereas a delete in a hierarchical database deletes a segment and all children of that segment. In a relational database, there is no such thing as children, although relationships between tuples may be established by view mechanisms. The result of these differences is that update semantics across DBMSs are different.

Another major difference is the mechanism for establishing relationships between groups of data items. For network and hierarchical structures, the relationships are implicit through pointer mechanisms. With relational databases, explicit relationships are established by equating data values in the DML.

The most ambitious goal would be to provide full functional support of the HDDB. This implies that processing against the HDDB should be as powerful as DBMS processing against a single database. However, problems such as update semantics and structural differences will always impose limitations.

The exact limitations imposed have not been fully researched yet. This is a major problem in the development of HDDB automation.
1.4.5 Translation

Probably the most discussed issue in the literature is the set of translation problems involved in HDDB automation. This is because translation is the backbone of HDDB automation as well as the most interesting problem.

Translation falls into two categories. They are:
1. query translation; and
2. data translation.

1.4.5.1 Query Translation
Queries posed upon the HDDB must be decomposed into queries on the individual DBMSs. This is an extremely difficult problem when one considers the structural differences of the different DBMSs and the dispersal of data elements across databases. These data elements may be either replicated across the network or, when joined together, may form a logical entity of some sort.
1.4.5.2 Data Translation

Once queries have been decomposed, the retrieved data must be coordinated and compiled into a consistent format. For example, a date has several representations. One is "yy/mm/dd", where yy is the year, mm is the month, and dd is the day of the month. Another is "yy/nnn", where yy is as above and nnn is the day number of a day in the year. However, a user who wishes to retrieve a date may want it in the format "Month dd, year", such as February 29, 1984. It is the responsibility of the HDDB automation to transform the data into the expected format of the original query and user view.
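To make the required transformation concrete, the following minimal sketch (in Python; the function names and sample values are invented for illustration and are not part of any system surveyed here) converts both stored representations into the format the user expects:

    import datetime

    def parse_yy_mm_dd(s):
        # "yy/mm/dd" representation, e.g. "84/02/29"
        yy, mm, dd = (int(part) for part in s.split("/"))
        return datetime.date(1900 + yy, mm, dd)

    def parse_yy_nnn(s):
        # "yy/nnn" representation: nnn is the day number within the year
        yy, nnn = (int(part) for part in s.split("/"))
        return datetime.date(1900 + yy, 1, 1) + datetime.timedelta(days=nnn - 1)

    def format_for_user(d):
        # "Month dd, year" representation, e.g. "February 29, 1984"
        return f"{d.strftime('%B')} {d.day}, {d.year}"

    assert format_for_user(parse_yy_mm_dd("84/02/29")) == "February 29, 1984"
    assert format_for_user(parse_yy_nnn("84/060")) == "February 29, 1984"

An HDDB automation must apply this kind of conversion uniformly, whichever local database a given date happens to come from.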
1.5 Approaches

As of today, the approaches to HDDB automation have been considered in two areas. These are:
1. Heterogeneous Distributed Database Management Systems (HDDBMS); and
2. Distributed Application Generators (DAG).
1.5.1 Heterogeneous Distributed Database Management System

The HDDBMS approach is to provide a DBMS which manages the HDDB while providing transparency of heterogeneity and distribution to the users and application programmers.

This is the approach of the MULTIBASE system (LAND 82) by Computer Corporation of America (CCA) and of the UCLA project (PIRA 80). These two approaches differ in the layer of transparency provided to the users, the data model provided for the unification of the individual databases, and the overall functional objectives. With either approach, several limitations exist due to the complexity of data and query translation. Typically, as a system provides a higher-level interface, these limitations will most likely decrease.

The HDDBMS approach is discussed in much more detail in Section II. In Section III, the two projects mentioned above are illustrated and compared.
1.5.2 Distributed Application Generator

The DAG approach differs substantially from the HDDBMS approach in that a DBMS is not the overall goal. In this approach, a fourth-generation language compiler provides the application programmer with the tools for writing distributed applications involving HDDBs and heterogeneous teleprocessing monitors (e.g., IMS/DC and CICS), without the application programmer worrying about distribution and heterogeneity. Distribution and heterogeneity are the responsibility of the compiler. This approach will not be discussed in this paper.
SECTION II
HETEROGENEOUS DISTRIBUTED DATABASE MANAGEMENT SYSTEM
2.1 MULTIBASE

2.1.1 Introduction

MULTIBASE is a prototype implementation of an HDDBMS designed by Computer Corporation of America. It is a software system that provides a uniform retrieval interface, through a single query language and database schema, to data in pre-existing, heterogeneous, distributed databases. With one query language over one database description (schema), MULTIBASE efficiently allows the user to reference data from different databases having different schemata, data models, and query languages. With this approach, the user needs only to be familiar with one integrated schema and a single query language, instead of numerous local system interfaces. CCA also claims that MULTIBASE does not require any changes to the pre-existing databases, DataBase Management Systems (DBMSs), or application programs.

To formulate and implement queries retrieving data from more than one database, the software must achieve several things. It must have access to the location of all the data and the relationships between them, master all the necessary interfaces, and correctly combine partial results from individual queries into a single answer.
2.1.2 Capabilities and Design Objectives

2.1.2.1 Capabilities

MULTIBASE is designed to provide a logically integrated, retrieval-only user interface to a physically non-integrated environment. It provides a single interface, through a single query language, to all data in all databases. Through this interface, MULTIBASE presents the user with the illusion of a single, integrated, non-distributed database. The system assumes complete responsibility for knowing the location of local databases, for accessing the data through each of the local DBMSs, for resolving data incompatibilities, and for combining the data to produce a single result.

In order to efficiently and correctly execute multi-database queries, MULTIBASE has to solve the following problems:
- The user's query, expressed in a single global query language, must be transformed into a set of subqueries. These subqueries must also be expressed in the different languages (data manipulation languages) supported by the local DBMSs. Example:

      Local DBMS    Data Manipulation Language (DML)
      IMS           DL/1
      SYSTEM R      SEQUEL
      TOTAL
      IDMS
      etc.

- Formulating and organizing an efficient plan to execute a sequence of subqueries and data movement steps.
- Implementing and optimizing the programs for accessing the data at the local sites.
- Moving the results of subqueries among local sites, in case the local sites need the results for processing.
- Resolving incompatibilities between the databases, such as differences in data types and conflicting schema names.
- Resolving inconsistencies in copies of duplicated information that are stored in different databases.
- Finally, combining all the data received from the different databases into a single result.
Local databases must be connected to a communication network before data can be accessed through MULTIBASE. This communication network can be locally or geographically distributed. MULTIBASE is also connected through the same communication network.

In this environment, a global user can access data from different databases through a single interface (or a global database) using a single query language (DAPLEX). This language is specially designed to be used with MULTIBASE and will be discussed in a later section.

Each local site maintains autonomy for local database updates. Local applications can continue to operate under their existing local interface, as before.

MULTIBASE does not provide the capability to update data in the local databases or to synchronize read operations across several sites. One of the problems is the establishment of a global concurrency control mechanism. To implement a concurrency control mechanism for read or update operations, the global user must request and control specific services offered by the local system, such as locking database items. These services must be used to ensure consistency across the databases. Most systems do not make the services necessary to implement global concurrency control available to an external process. Since MULTIBASE is designed to operate without requiring modifications to existing systems, the tools needed to ensure consistency across databases are not globally available. Thus, autonomy for database updates is maintained locally, and MULTIBASE provides the global user with the same level of data consistency that the local host DBMS provides to each local database user.

Example: In IMS, the locking mechanism for read or update operations is offered at the segment level. This mechanism is defined in the Program Communication Block (PCB) and can be enforced at the segment level.
2.1.2.2 Design Objectives

MULTIBASE has three key objectives (LAND 82):

* Generality

MULTIBASE has been designed to be a general tool. It is capable of providing integrated access to various database systems used for various applications, and is not tied to a specific application area.

* Compatibility

MULTIBASE is considered compatible with existing database systems and applications. This is because of its capability to interface with pre-existing systems without changes to the DBMSs or to applications currently in operation. The local sites retain full autonomy for maintaining the databases. All local access and applications continue to operate without modification.

* Extensibility

This last objective requires that it must be relatively easy to connect a new local system into an existing MULTIBASE configuration.
2.1.3 DAPLEX Language and the Functional Data Model

The language provided to MULTIBASE users is DAPLEX. DAPLEX is both a Data Definition Language (DDL) and a Data Manipulation Language (DML), and is based on the Functional Data Model (FDM) (SMIT 81). DAPLEX's goal is to provide a conceptually natural database interface language. The DAPLEX constructs used to model real-world situations are intended to closely match the conceptual constructs that a human being might employ when analyzing real-world situations.
2.1.3.1 Functional Data Model (FDM)

The notion of a Functional Data Model was first introduced by (EDGAR and LARRY 77). This work explored the use of a functional approach as a tool for modeling the data structures representable under the three dominant data models (hierarchical, relational, network).

Irreducible Models

(EDGAR 77) has objected that the n-ary relational model is still somewhat too implementation oriented (a tuple is too much like a record, which is not a 'natural' real-world construct), and that the purpose of a data model is better served by a formalism in which 'atomic facts' are separated from one another.
Example:

Consider the following n-ary relation for suppliers:

    S(s#, sname, status, city)

This n-ary relation might be better expressed as the combination of three binary relations:

    SN(s#, sname)
    ST(s#, status)
    SC(s#, city)

In this latter version, each relation represents a single 'atomic fact'; that is, each relation represents the relationship between a specific entity (a supplier in the example) and one of its properties (name, status, or city).

(DATE 77) defines an n-ary relation as irreducible if it already corresponds to a single atomic fact and cannot be nonloss-decomposed into a set of projections.
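The decomposition can be checked mechanically. In the following sketch (Python; the sample data are invented for illustration), the n-ary supplier relation is projected onto the three binary relations and then rejoined, recovering the original relation exactly, which is what 'nonloss' means:

    # The n-ary relation S(s#, sname, status, city) as a set of tuples.
    S = {
        ("s1", "Smith", 20, "London"),
        ("s2", "Jones", 10, "Paris"),
    }

    # Projections onto binary relations: one 'atomic fact' per tuple.
    SN = {(s, sname) for (s, sname, status, city) in S}
    ST = {(s, status) for (s, sname, status, city) in S}
    SC = {(s, city) for (s, sname, status, city) in S}

    # Joining the projections on s# rebuilds S, so S is nonloss-decomposable
    # and therefore not irreducible.
    rebuilt = {(s, n, t, c)
               for (s, n) in SN
               for (s2, t) in ST if s2 == s
               for (s3, c) in SC if s3 == s}
    assert rebuilt == S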
Out of such considerations, three similar but distinct data models have evolved: the binary-relational model, the irreducible-relational model, and the functional data model.

Functional Data Model

The Functional Data Model is the most popular of the irreducible approaches. The reason is that the functional data model lends itself to the design of an attractive access language. (This reason will be explained later.)
The Functional Data Model involves 'entities' and 'functions'. Entities are represented by unary relations. Each entity type contains a set of entities. Functions map entities to entities. Functions can be single-valued or multi-valued, and can be partially defined or totally defined.

For example:

Students could be represented by a unary relation S of student identifiers (that is, surrogates), plus a collection of functions:

    SSTUD#:  S -> NUM5
    SNAME:   S -> CHAR30
    SDEPT#:  S -> NUM3

The last of these is read as: 'SDEPT# is a function from S to NUM3.' That is, given a value s from the unary relation S, there exists precisely one value corresponding to s under the function SDEPT# in the domain called NUM3 (which is assumed to consist of all numeric strings of length 3). Notice that SDEPT# is used as a function name rather than an attribute name; attributes as such are not part of the model.
Similarly, courses could be represented as a unary relation C of course identifiers, plus functions:

    CCOURSE#: C -> ALPHANUM5
    CTITLE:   C -> CHAR20
    CDEPT#:   C -> NUM3
    CINST#:   C -> NUM3

Enrollments can also be defined as a unary relation E of enrollments, plus functions:

    ESTUD#:   E -> S
    ECOURSE#: E -> C
    EGRADE:   E -> CHAR1

As indicated above, a function is actually a binary relation. The function SDEPT#, for example, consists of all ordered pairs (s, sdept#), where s is a student identifier and sdept# is the 3-digit number of that student's department.
Given a function F: A -> B, the value b of B corresponding under F to a given value a of A is referred to as 'F(a)' or 'F of a'. Thus, for example, we can refer to the department of a given student as SDEPT#(s) or SDEPT# of s. If we wish to refer to the students of a given department, we need the inverse of the SDEPT# function; that is, the set, DEPTS# say, of all ordered pairs (sdept#, s_set), where sdept# is the number of a student department and s_set is the set of all student identifiers for students in that department. Note that this inverse function corresponds to an unnormalized relation because it is set-valued.

Example:

    DEPTS#: NUM3 -> S     (the inverse of SDEPT#)
The concept of the composition of two functions is the basic point in showing that the functional data model lends itself to the design of an attractive access language. The composition of two functions is itself another function; for example, given the following two functions:

    F1: A -> B     and     F2: B -> C

the composition of F1 and F2 (F1.F2) is a function from A to C. Consider, for example, the query 'retrieve the names of the students enrolled in courses belonging to the COMPUTER SCIENCE dept.':

    SNAME of ESTUD# of ECOURSE#* of CTITLE* ('COMPUTER SCIENCE')

('*' indicates that the function is inverted.)
Here:

    SNAME:     S -> CHAR30
    ESTUD#:    E -> S
    ECOURSE#*: C -> E        (ECOURSE#: E -> C)
    CTITLE*:   CHAR20 -> C   (CTITLE: C -> CHAR20)

so SNAME . ESTUD# . ECOURSE#* . CTITLE* is a function from CHAR20 to CHAR30.
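The attraction of this style for an access language can be seen in a small sketch (Python; the sample entities and values are invented). Functions become mappings over surrogate identifiers, inversion is computed mechanically, and the query above is just a chain of applications:

    SNAME   = {"s1": "Adams", "s2": "Baker"}              # S -> CHAR30
    ESTUD   = {"e1": "s1", "e2": "s2"}                    # E -> S
    ECOURSE = {"e1": "c1", "e2": "c2"}                    # E -> C
    CTITLE  = {"c1": "COMPUTER SCIENCE", "c2": "MATH"}    # C -> CHAR20

    def inverse(f):
        # The inverse of a function is set-valued in general.
        inv = {}
        for x, y in f.items():
            inv.setdefault(y, set()).add(x)
        return inv

    CTITLE_inv  = inverse(CTITLE)     # CHAR20 -> sets of C
    ECOURSE_inv = inverse(ECOURSE)    # C -> sets of E

    # SNAME of ESTUD# of ECOURSE#* of CTITLE* ('COMPUTER SCIENCE')
    names = {SNAME[ESTUD[e]]
             for c in CTITLE_inv["COMPUTER SCIENCE"]
             for e in ECOURSE_inv[c]}
    assert names == {"Adams"}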
Note, however, that this simple form of query expression is certainly not adequate to handle all types of queries. Extensions are necessary to deal with (for example) the simultaneous retrieval of multiple attributes of an entity, or the simultaneous retrieval of attributes of multiple entities (a join-like operation). Such extensions require functions that can return tuple-valued results, instead of only single values. Similarly, it is desirable to introduce functions that can accept multiple arguments, in order to handle queries such as: 'find the number of B students in the COMPUTER SCIENCE dept.'

The DAPLEX language, which provides such features, will be discussed in the next section.
Computer Corporation of America claims that the functional data model was selected because it embodies the main structure of both the flat-file data models, such as the relational model, and the link-structured data models, such as the network model. Entity types correspond roughly to relations in the relational model or record types in the network model. Functions correspond to owner-coupled sets in the network data model.

    Network:
    - Record types -> Entity types
    - Set types    -> Functions

    Relational:
    - Tuples of relations -> Entities

One thing that should be noticed here is that CCA never mentions the hierarchical data model. The reason is explained later in this section.
2.1.3.2 DAPLEX Language

DAPLEX is a data definition and manipulation language for database systems, grounded in a concept of data representation called the functional data model. A fundamental goal of DAPLEX is to provide a database system interface that allows the user to more directly model the way he thinks about the problems he is trying to solve. The basic constructs of DAPLEX are the entity and the function. These are intended to model conceptual objects and their properties.
a. Data Definition

DECLARE statements establish functions in the system. Functions are used to express both entity types and properties of an entity.

Example:

    DECLARE student( ) -> ENTITY

    Defines the entity type STUDENT. (ENTITY is the system-provided type of all entities.)

    DECLARE name(student) -> STRING

    States that NAME is a function that maps entities of type STUDENT to entities of type STRING. (STRING is one of a number of entity types provided by the system, along with such other types as INTEGER and BOOLEAN.)

    DECLARE dept(student) -> department

    States that DEPT is a function that applies to a STUDENT entity and returns an entity of type DEPARTMENT. (Note: a DEPARTMENT entity itself is returned, not a department number or other identifier.)

The above two functions are called single-valued, as they always return a single entity. The following statement illustrates a multivalued function:

    DECLARE course(student) -> course

    The COURSE function returns a set of entities of type COURSE.

In all function applications, the sets are considered unordered and contain no duplicates. So far, the functions discussed take only one argument.
* Multiple-Argument Functions

DAPLEX provides the capability to declare functions of any number of arguments.

Example:

    DECLARE grade(student, course(student)) -> INTEGER

    The GRADE function might return the grades which a student obtained in courses.

(Note: In the E-R model, the creation of a new entity type is necessary for this situation. In this system, the enrollment of a student in a course is viewed as a conceptual object, and the 'grade of' property is then assigned to that 'object'.)
* Derived Functions (Function Inversion), Derived Data

Often, some properties of an object are derived from properties of other objects to which it is related. For example, assume that courses have an 'instructor of' property. We may then consider an 'instructor of' property of the student enrolled in those courses. The principle of conceptual naturalness dictates that it be possible for users to treat such derived properties as if they were primitives (SHIPMAN 81). Such alternative representations of the same facts are modeled in DAPLEX by the notion of a derived function. Essentially, we are defining new properties of objects based on the values of other properties. Function inversion is another type of derived function. It is used to represent the function mapping in the reverse direction. In the context of the functional data model, 'derived data' is interpreted to mean 'derived function definitions.'

Example:

Given a COURSE entity, we may apply the INSTRUCTOR function to obtain the instructor of the course.

    DECLARE instructor(course) -> instructor

Given an INSTRUCTOR entity, we may use function inversion to obtain the courses that he teaches.

    DEFINE course(instructor) -> INVERSE OF instructor(course)

Note the use of the 'DEFINE' statement for the derived function.

    DEFINE instructor(student) -> instructor(course(student))

defines a function INSTRUCTOR over STUDENT entities which returns the instructors of the courses the student is taking. The function INSTRUCTOR may now be used in queries as if it had been a primitive function. The user need not be aware that it is derived data.

Derived functions can also be defined over the system-supplied entity types.

    DEFINE student(STRING) -> INVERSE OF name(student)

    DEFINE gradepointaverage(student) ->
        AVERAGE(grade(student, course)) OVER course(student)

The latter defines a 'grade point average' property of students. (BOOLEAN expressions can also be used.)
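A derived function hides its definition from the caller. The following sketch (Python; data invented for illustration) mirrors the GRADEPOINTAVERAGE definition above: it is written in terms of the primitive COURSE and GRADE functions, yet a user invokes it exactly as if it were primitive:

    course_of = {"s1": ["c1", "c2"]}             # course(student): multivalued
    grade = {("s1", "c1"): 4, ("s1", "c2"): 3}   # grade(student, course): two arguments

    def gradepointaverage(student):
        # AVERAGE(grade(student, course)) OVER course(student)
        grades = [grade[(student, c)] for c in course_of[student]]
        return sum(grades) / len(grades)

    assert gradepointaverage("s1") == 3.5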
Special Operators for Defining Functions

- INVERSE OF (used for function inversion, as discussed above).
- INTERSECTION OF, UNION OF, and DIFFERENCE OF may be used to form set intersections, unions, and differences. They are most useful in creating new types. Example:

      DEFINE studentteacher( ) -> INTERSECTION OF student, instructor

- COMPOUND OF is used to create derived entities corresponding to the elements of the cartesian product of its operands.
Figure 2 provides an FDM graphic description of a University database. The rounded enclosures indicate entity types and the arrows depict functions. Notice that some entities have functions directed toward them without any function directed from them. These entities are the lowest-level descriptions of the properties of entities. For example, NAME and RANK are STRINGs and SALARY is an INTEGER; STRING and INTEGER are the lowest-level descriptions of these functions.

The figure may be interpreted as follows. The University database is composed of four conceptual entities. They are STUDENTs, COURSEs, DEPARTMENTs, and INSTRUCTORs. A STUDENT has a NAME, takes COURSEs, and is a member of a DEPARTMENT. A DEPARTMENT has a NAME and a HEAD INSTRUCTOR. A COURSE has a TITLE and is taught by an INSTRUCTOR. An INSTRUCTOR has a NAME, RANK, and SALARY, and is a member of a DEPARTMENT.

Following is a description of the above natural view of the database formulated in the DAPLEX language.
    DECLARE student( )          -> ENTITY
    DECLARE name(student)       -> STRING
    DECLARE dept(student)       -> department
    DECLARE course(student)     -> course

    DECLARE course( )           -> ENTITY
    DECLARE title(course)       -> STRING
    DECLARE dept(course)        -> department
    DECLARE instructor(course)  -> instructor

    DECLARE instructor( )       -> ENTITY
    DECLARE name(instructor)    -> STRING
    DECLARE rank(instructor)    -> STRING
    DECLARE dept(instructor)    -> department
    DECLARE salary(instructor)  -> INTEGER

    DECLARE department( )       -> ENTITY
    DECLARE name(department)    -> STRING
    DECLARE head(department)    -> instructor
* Subtypes and Supertypes

Subtype and supertype relationships provide the capability to declare entities in a more general way. For example, rather than declare STUDENT and INSTRUCTOR entities as in Figure 2, we can define students as persons, and instructors as employees, who in turn are persons. This declaration will be useful for the concept of generalization explained later in this section.

    DECLARE person( )           -> ENTITY
    DECLARE name(person)        -> STRING

    DECLARE student( )          -> person
    DECLARE dept(student)       -> department
    DECLARE course(student)     -> course

    DECLARE employee( )         -> person
    DECLARE salary(employee)    -> INTEGER
    DECLARE manager(employee)   -> employee

    DECLARE instructor( )       -> employee
    DECLARE rank(instructor)    -> STRING
    DECLARE dept(instructor)    -> department
In this definition, the STUDENT function is defined to return a set of PERSON entities. That is, the set of STUDENT entities is a subset of the set of PERSON entities. This implies that any STUDENT entity also has the NAME function defined over it, since it is necessarily a PERSON entity as well. The same consideration can be made with the EMPLOYEE type specification.
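The effect of these subtype declarations resembles class inheritance in modern programming languages. A rough analogue (Python; illustrative only, not part of DAPLEX or MULTIBASE):

    class Person:                      # DECLARE person( )     -> ENTITY
        def __init__(self, name):
            self.name = name           # DECLARE name(person)  -> STRING

    class Student(Person):             # DECLARE student( )    -> person
        def __init__(self, name, dept):
            super().__init__(name)
            self.dept = dept

    class Employee(Person):            # DECLARE employee( )   -> person
        def __init__(self, name, salary):
            super().__init__(name)
            self.salary = salary

    class Instructor(Employee):        # DECLARE instructor( ) -> employee
        def __init__(self, name, salary, rank):
            super().__init__(name, salary)
            self.rank = rank

    # NAME is defined over a STUDENT because a STUDENT is necessarily a PERSON.
    assert Student("Adams", "CS").name == "Adams"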
2.1.4 MULTIBASE Architecture

The basic architecture is structured to facilitate the achievement of the above objectives. It consists of two basic components: a schema design aid and a run-time query processing subsystem. The schema design aid provides tools for the 'integrated' database designer to design the global schema and to define a mapping from the local databases to the global schema. The run-time query processing subsystem then uses the mapping definition to translate global queries into local queries, ensuring that the local queries are executed correctly and efficiently by the local DBMSs.
2.1.4.1 Component Architecture

The component architecture of MULTIBASE includes two types of modules:
- The Global Data Manager (GDM) handles all global aspects of a query.
- The Local Database Interface (LDI) handles all specific aspects of a local system.

Figure 4 illustrates the MULTIBASE component architecture. All the components and their functions in these two modules are explained in more detail in the 'Query Processing' section.
2.1.4.2 Schema Architecture

One major technical challenge in supporting a heterogeneous distributed database environment is to provide a global view of the data stored in the local databases. The local aspects of this global view must be transparent to the global user; that is, all information about the location of data, data model, access language, etc., must be insulated from the global user. The approach followed in MULTIBASE is to separate the issues of incompatible data handling and data integration from issues related to the 'homogenization' of the heterogeneous local databases. This is accomplished by defining a global schema over the DAPLEX representations of the local host schemata. Incompatible data handling and data integration are handled at the global schema level, and homogenization of the local databases is handled at the DAPLEX representation level. (See Figure 3.)
a. The Local Host Schemata

The local host schemata are the original pre-existing schemata defined in the local data models and used by the local DBMSs. In the university environment, for example, a network data model may be used for the main campus database, a relational data model may be used for the branch campus database, and so on.

A local database need not be remodeled when incorporated into a MULTIBASE configuration, so its schema keeps the same configuration.
b. The DAPLEX Local Schema

A DAPLEX local schema equivalent to the local host schema is defined for each local DBMS. DAPLEX's flexible modeling capabilities make it easy to emulate a variety of data models. (See subsection 2.1.3 for a complete explanation of the DAPLEX functional data model.) As a result, the local schemata are expressed in a single model, and higher levels of MULTIBASE need not be concerned with the different data models among the local DBMSs. Because the handling of local system differences is isolated at this level, the addition of a new database into the MULTIBASE configuration does not significantly affect the higher levels of the system.
c. The DAPLEX Auxiliary Schema

The DAPLEX auxiliary schema describes a DAPLEX auxiliary database that is maintained by an internal DBMS within MULTIBASE. This schema contains the information needed for integrating databases. Two kinds of data are stored in the DAPLEX auxiliary database:
- Data that are not available in any of the local databases.
- Data that are needed to resolve incompatibilities.
(See Section 2.1.8.1 for more details.)
d. The DAPLEX Global Schema

The DAPLEX global schema is an integrated view of the data in the underlying, non-integrated DBMSs, including the auxiliary database. Depending upon the global users' demands, MULTIBASE can support several different global schemata for the same underlying DBMSs. A powerful mapping language (see the DAPLEX language) describes how to achieve these views using the DAPLEX local and auxiliary schemata. This concept is also used in pre-existing DBMSs (IMS, TOTAL, SEQUEL, ...).

To the user, the global schema appears to be the schema of an integrated DBMS. With the availability of the global schema, the user does not need to be aware of the location of the required data, or of the differences in schemata and data models among the local DBMSs. (Some authors refer to this design as full autonomy.) Thus, MULTIBASE presents the user with the illusion of a homogeneous and integrated database, without requiring that the local databases be physically integrated.
2.1.5 Query Processing Architecture

This section describes every step involved in the processing of a global query through MULTIBASE. All the components in the GDM and the LDI, and their query processing responsibilities, are illustrated.

The GDM decomposes the DAPLEX global query into single-site queries. Each single-site query is still expressed in the DAPLEX language, but it is expressed against the local schema and references data at exactly one local host DBMS. The LDI accepts these single-site queries and translates them into the appropriate local query languages. These queries are then executed at the local DBMSs. The results are reformatted by the LDI into a standard representation. The GDM accepts these results and composes them into a single answer which is sent to the global user.
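The overall flow can be summarized in a toy sketch (Python). Every name here is hypothetical and merely stands in for a MULTIBASE component; the 'queries' are plain strings rather than real DAPLEX:

    class ToyLDI:
        def __init__(self, rows):
            self.rows = rows                      # the "local database"
        def translate(self, daplex_subquery):
            return daplex_subquery                # stand-in for DAPLEX -> local DML
        def execute(self, local_query):
            return self.rows                      # stand-in for the local DBMS run
        def reformat(self, raw):
            return [tuple(r) for r in raw]        # into a standard representation

    def process_global_query(single_site_queries, ldis):
        # GDM side: hand each single-site query to its LDI, then compose
        # the partial results into one answer for the global user.
        answer = []
        for site, q in single_site_queries.items():
            ldi = ldis[site]
            answer.extend(ldi.reformat(ldi.execute(ldi.translate(q))))
        return answer

    ldis = {"main": ToyLDI([["Adams", "CS"]]), "branch": ToyLDI([["Baker", "EE"]])}
    print(process_global_query({"main": "query-1", "branch": "query-2"}, ldis))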
2.1.5.1 Global Data Manager (GDM)

The global data manager architecture includes the following components (Figure 5):

a. Transformer

The transformer takes the DAPLEX global user query over the global schema as input. It produces as output a DAPLEX global query which references the DAPLEX local schemata and the auxiliary database schema.

To perform this process, the transformer first parses the query into a parse tree. Components of the parse tree are global entities and functions. The relationships of the local schemata (LSs) to the global schema (GS) are then substituted into the parse tree by the view mapping mechanism. The result is a parse tree consisting of entities and functions of the local schemata.
b. Optimizer

The optimizer interfaces with several components in the GDM, such as the transformer, the decomposer, and the filter. Its responsibility is to provide an optimization strategy for processing the query. First, it examines the global DAPLEX query from the transformer and determines an overall strategy before forwarding the query to the decomposer. The optimization does not stop here. In fact, the optimizer subsequently refines the strategy after the query has been processed by the decomposer and the filter. In some cases, the optimizer may need several stages of processing by the decomposer and the filter. The final output from the optimizer includes a global execution strategy that combines local processing and data movement steps. One objective of the strategy plan is to minimize the query processing cost. The cost is a function of the processing requirements of each local query and the interprocess data communications. A second objective of the access plan is to determine any precedence ordering of local queries which may be required.

Since a given entity or function may exist in more than one LS, the auxiliary schema is used to resolve any incompatibilities of the data items between two LSs. This is indicated in the strategy plan.
c. Decomposer

The decomposer separates the DAPLEX global query from the optimizer into subqueries according to local site. At this point, the single-site queries are still expressed in DAPLEX. Among these single-site queries, any that reference the auxiliary database or that require a combination of data across sites are sent to the internal DBMS for execution.
d. Filter

The responsibility of the filter is to remove all operations which are not allowed or not supported by the local DBMSs. These operations are performed at a later stage by the internal DBMS. It is the job of the filter component of the GDM to examine each DAPLEX single-site query to locate operations not supported by the destination host system.
e. Monitor

The monitor supervises the execution of the strategy developed by the optimizer. It initiates the DAPLEX single-site queries at the appropriate LDIs. It also sends to the internal DBMS a final query that may include references to the auxiliary database and operations that compensate for local system limitations. The monitor must ensure that all local queries are carried out and that the results are returned and integrated into the global form. The monitor must also make sure that any precedence relationships, or interdatabase relationships, are carried out. These arise when the data selected from one local DBMS depends on data selected from another DBMS. Each local DBMS is interfaced by a Local Database Interface (LDI). The monitor passes to an LDI the data movement and local processing steps required for the local database. Finally, it combines the data returned by the LDIs and formats the output as requested by the user.
2.1.5.2 Local Database Interface (LDI)

The architecture of an LDI includes the following components (Figure 6):
- Optimizer
- Translator
- Data formatter

a. Optimizer

Based on the DAPLEX single-site queries, the optimizer determines a strategy for local query processing. This strategy depends upon the Data Manipulation Language capability and format of the local host DBMS. If the local DBMS uses a high-level (i.e., set-at-a-time) language such as DAPLEX, query optimization is relatively direct. However, if the local DBMS uses a low-level (i.e., record-at-a-time) language such as CODASYL DML embedded in COBOL, the optimization may be complex. Local optimization is not discussed in this section; interested readers may refer to (LAND 82).
b. Translator

The translator provides a uniform interface to all local databases. It translates DAPLEX single-site queries into local queries expressed in each database's own language. Two queries that look very similar when expressed in DAPLEX over local schemata may be transformed by the LDIs into dramatically different local queries. If the local host system provides restricted query capabilities, the complexity of an LDI increases tremendously. An interesting point is raised about the trade-off between having a limited-capability LDI and having a large amount of data to be moved to the GDM for further processing, or vice versa. This is a separate topic that is not discussed in this paper. (For references, see (LAND 82) and (DAYA 82).)
2.1.6
View Definition and Generalization for Database Integration
Formulating and implementing queries that require data from
more than one database poses many problems for the user.
These
problems include resolving discrepancies between the databases, such
as differences in representation and naming conflicts; resolving
inconsistencies between copies of the same information stored in
different databases; and transforming a query from the user's language
into a set of queries expressed in the different retrieval languages
supported at the different sites.
This section focuses on the issue
of database integration.
2.1.6.1 Mapping Between the Local Host Schema (LHS) and the DAPLEX Local Schema (LS)

The data modeling capabilities of DAPLEX incorporate those of the relational and network models, the principal data models in use today. This suggests the possibility of DAPLEX front ends for existing databases and database systems. In this section, the mapping between the LHS and the LS is discussed in detail. The LHS data models considered are the two popular models: network and relational.
a. Relational Model

A relational database schema consists of a set of relation definitions. To translate a relational LHS to a functional LS, we essentially map each relation to an entity type. A tuple of a relation in a relational data model is similar to an entity in a functional data model. A tuple is uniquely identified by its primary key and has one or more attributes, just as an entity has one or more function values. Therefore, to map a relational-model LHS into a functional-model LS, an entity type is defined in the LS for each relation in the LHS, and a function is defined on the corresponding entity type for each attribute of the relation. The range of the function is the domain of the attribute. If the attribute is a primary key, then the function must be totally defined and one-to-one. In any case, due to the relational format, the function must be single-valued, not set-valued.

It is clear that the relational model is a subset of the functional model, since the isomorphic DAPLEX description of any relational database is subject to the following limitations:
- No multivalued functions are allowed.
- Functions cannot return user-defined entities.
- Multiple-argument functions are not allowed.
- There are no subtypes.
Having specified the isomorphic description, and assuming the existence of a suitable data manipulation translator, DAPLEX requests can be written against the relational database. However, the full benefits of the DAPLEX approach will not be available because of the limitations in the underlying data model. What is needed is to define derived functions which provide a more convenient view of the database. The following figures show the data description in DAPLEX which is mapped from the relational data description, and the definition of the derived functions for the functional view.
    STUDENT    (stud#, name, dept#)
    COURSE     (course#, title, dept#, instructor#)
    ENROLLMENT (stud#, course#)
    INSTRUCTOR (instructor#, name, rank, dept#, salary)
    DEPARTMENT (dept#, name, head#)

* RELATIONAL DATA DESCRIPTION
    DECLARE student( )              -> ENTITY
    DECLARE stud#(student)          -> INTEGER
    DECLARE name(student)           -> STRING
    DECLARE dept#(student)          -> INTEGER

    DECLARE course( )               -> ENTITY
    DECLARE course#(course)         -> INTEGER
    DECLARE title(course)           -> STRING
    DECLARE dept#(course)           -> INTEGER
    DECLARE instructor#(course)     -> INTEGER

    DECLARE enrollment( )           -> ENTITY
    DECLARE stud#(enrollment)       -> INTEGER
    DECLARE course#(enrollment)     -> INTEGER

    DECLARE instructor( )           -> ENTITY
    DECLARE instructor#(instructor) -> INTEGER
    DECLARE name(instructor)        -> STRING
    DECLARE rank(instructor)        -> STRING
    DECLARE dept#(instructor)       -> INTEGER
    DECLARE salary(instructor)      -> INTEGER

    DECLARE department( )           -> ENTITY
    DECLARE dept#(department)       -> INTEGER
    DECLARE name(department)        -> STRING
    DECLARE head#(department)       -> INTEGER

* The relational description in DAPLEX.
    DEFINE dept(student) -> department
        SUCH THAT dept#(department) = dept#(student)

    DEFINE course(student) -> course
        SUCH THAT FOR SOME enrollment
            stud#(student) = stud#(enrollment)
            AND course#(enrollment) = course#(course)

    DEFINE dept(course) -> department
        SUCH THAT dept#(course) = dept#(department)

    DEFINE instructor(course) -> instructor
        SUCH THAT instructor#(instructor) = instructor#(course)

    DEFINE dept(instructor) -> department
        SUCH THAT dept#(instructor) = dept#(department)

    DEFINE head(department) -> instructor
        SUCH THAT instructor#(instructor) = head#(department)

* Definition of the Functional View
The above derived functions can be thought of as adding semantic
information which is not expressed in the relational data model.
This capability provides the user with convenient access and security.
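The mapping rule itself is mechanical, which the following sketch (Python; the relation definitions are invented for illustration) makes explicit by generating DAPLEX-style declarations from relation definitions, one entity type per relation and one single-valued function per attribute:

    relations = {
        "student":    [("stud#", "INTEGER"), ("name", "STRING"), ("dept#", "INTEGER")],
        "department": [("dept#", "INTEGER"), ("name", "STRING"), ("head#", "INTEGER")],
    }

    def daplex_local_schema(relations):
        decls = []
        for rel, attributes in relations.items():
            decls.append(f"DECLARE {rel}( ) -> ENTITY")
            for attr, domain in attributes:
                # The range of each function is the domain of the attribute.
                decls.append(f"DECLARE {attr}({rel}) -> {domain}")
        return decls

    for line in daplex_local_schema(relations):
        print(line)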
b. Network Data Model (CODASYL Model)

If an LHS is defined in the CODASYL model, then it consists of record types and set types. The functional data model consists of entity types and functions on entity types. So, to map the LHS into an LS, one simply maps record types and set types into entity types and functions, respectively.

The concept of a record type in the CODASYL model is very similar to that of an entity type in the functional data model. A record in the CODASYL model has a record ID and one or several attributes. The record ID uniquely identifies the record, and the attributes describe properties of the record. Similarly, in the functional data model, an entity is an object of interest, and the functions defined on the entity return values that describe the properties of the entity. Therefore, a record type corresponds to an entity type, and the attributes of the record type correspond to functions defined on the entity type.

If an attribute of a record type is a unique key (no duplicates allowed), then the corresponding function must be a totally defined one-to-one mapping. If the attribute is a repeating group (declared to have multiple occurrences in a CODASYL model), then the corresponding function is a set-valued function.
In the CODASYL data model, a set type maps an owner record to a set of member records or, conversely, maps a member record type to a unique owner record. Therefore, a set type resembles a function that maps an owner entity to a set of member entities or, conversely, maps a member entity to a unique owner entity.

Again, in the CODASYL data model, a set type implies not only certain semantic information but also the existence of an access path. For example, a set type WORK-IN between the DEPARTMENT and EMPLOYEE record types implies that the employees owned by a department work in that department. But it also implies that there is an access path from a department record to the employee records owned by that department, and another access path from each employee record to its own department record. Since the LSs are used for query optimization, this access path information must be captured in the LSs. Therefore, for each set type in an LHS, the corresponding LS must define not only a set-valued function from the owner entity type to the member entity type, but also a single-valued function from each member entity type to the owner entity type.
In a CODASYL model, a record type can be declared to have an index file created for its key, making the record directly accessible through the index key. For such a record type, a system set function whose domain is the key value and whose range is the entity type (corresponding to the record type) must be defined in the LS. This system set function is used only for query processing optimization. It is not visible to the database designer. Therefore, it cannot be incorporated in the global schema. This restriction is imposed to preserve the data independence of the global schema.
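To make this record-and-set mapping concrete, the following minimal Python sketch (illustrative only; the schema, names, and representation are assumptions, not part of MULTIBASE) derives a functional-model description from a CODASYL-style schema: each record type becomes an entity type with one function per attribute, and each set type yields both a set-valued owner-to-member function and a single-valued member-to-owner function.

    # Sketch of the CODASYL-to-functional-model mapping described above.
    # All names and structures are illustrative assumptions.
    codasyl_schema = {
        "records": {
            "DEPARTMENT": ["DEPT#", "NAME"],
            "EMPLOYEE": ["EMP#", "NAME", "SALARY"],
        },
        # set type: (name, owner record type, member record type)
        "sets": [("WORK-IN", "DEPARTMENT", "EMPLOYEE")],
    }

    def to_functional_model(schema):
        """Map record types to entity types and set types to paired functions."""
        entities = {}
        for record, attrs in schema["records"].items():
            # Each attribute becomes a single-valued function on the entity.
            entities[record] = {attr: "single-valued" for attr in attrs}
        functions = []
        for name, owner, member in schema["sets"]:
            # Owner -> set of members; member -> unique owner.
            functions.append((name + "_members", owner, member, "set-valued"))
            functions.append((name + "_owner", member, owner, "single-valued"))
        return entities, functions

    entities, functions = to_functional_model(codasyl_schema)
    print(entities["EMPLOYEE"])
    print(functions)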
2.1.6.2 Integration of DAPLEX Local Schema Using View and Generalization
This subsection deals with how to define the GS so as to resolve discrepancies between the local databases. Specifically, the concept of generalization (SS 77, LG 78), coupled with the extensive view definition facilities, is used as a powerful tool for database integration (KG 81). Besides generalization, another useful modeling tool is classification.
a. Generalization Concept

Generalization is an abstraction that groups classes of objects with common properties into a generic class of objects (SS 77). For example, consider two entities:
- A student entity described by the properties SSNO, NAME, ADDRESS, DEPT#, MAJOR, GPA.
- An employee entity with the properties SSNO, NAME, ADDRESS, SALARY, JOBHISTORY.
Because of their common properties SSNO, NAME, and ADDRESS, a generic type PERSON can be formed. Applications that are not concerned with the special properties of students or employees need not distinguish between them and can, instead, treat them as persons. Thus, incorporating the concept of generalization in the data model provides greater flexibility in semantic modeling.
* Generalization Over Entities

The concept of generalization assumes even greater importance for database integration in MULTIBASE. The local schemas, having been designed independently, may contain entities at different levels of generalization.
Example:

    LS1:  PERSON (entity)
              SSNO, NAME, ADDRESS

    LS2:  STUDENT (entity)
              SSNO, NAME, ADDRESS, DEPT#, MAJOR
          EMPLOYEE (entity)
              SSNO, NAME, ADDRESS, SALARY, JOBHISTORY
The integrated global schema may not be satisfactorily defined either at the level of LS2 (for LS1 has no information available to distinguish between persons who are students and those who are employees), or at the level of LS1 (for then the distinction between students and employees in LS2 would be lost to global users). A reasonable solution is to include all three entity types STUDENT, EMPLOYEE, and PERSON in the GS, with the relationships STUDENT ISA PERSON and EMPLOYEE ISA PERSON explicitly defined between them. (ISA relationships express generalization; STUDENT ISA PERSON implies that every entity of type STUDENT is also of type PERSON.)
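A minimal Python sketch of this three-type solution (illustrative assumptions only; MULTIBASE defines such views in DAPLEX, not Python): STUDENT and EMPLOYEE keep their full properties, while PERSON is a generic extent over both, exposing only the common functions.

    # Sketch: PERSON as a generic type over STUDENT and EMPLOYEE.
    students = [
        {"SSNO": "111", "NAME": "Ann", "ADDRESS": "LA", "DEPT#": 5, "MAJOR": "CS"},
    ]
    employees = [
        {"SSNO": "222", "NAME": "Bob", "ADDRESS": "SF", "SALARY": 30000,
         "JOBHISTORY": ["clerk"]},
    ]

    COMMON = ("SSNO", "NAME", "ADDRESS")

    def person_view():
        """Generic PERSON extent: every STUDENT and every EMPLOYEE ISA PERSON."""
        for e in students + employees:
            yield {k: e[k] for k in COMMON}   # only the common properties

    for p in person_view():
        print(p)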
Example: Suppose the same entity type occurs in two different schemas but with different properties.

    LS1:  EMPLOYEE1                LS2:  EMPLOYEE2
              SSNO                           SSNO
              NAME                           NAME
              SALARY                         ADDRESS
              AGE                            AGE
A conventional data model would suggest an 'outer join' of the entity sets in LS1 and LS2 for the GS. The 'outer join' operator assigns NULL ADDRESS values to employees in the first database and NULL SALARY values to employees in the second (CODD 79). Instead, by using generalization to define the GS, the same effect is achieved without introducing artificial null values. In a later section, we shall see that query modification is improved by using this additional structure on views.
* Generalization Over Functions

The concept of generalization can also be extended to generalization over functions. Reconsider the above example. Suppose that two additional functions, HOMEPHONE and WORKPHONE, were defined for EMPLOYEE entities in LS1, and a multivalued function PHONES for EMPLOYEE entities in LS2. Then a convenient abstraction would be to define the generic multivalued function PHONES for the generic entity type EMPLOYEE in the GS. (In the disjoint case, this takes the union of the two sets, as explained by set theory.) Thus, if an employee has workphone p1 and homephone p2 in the first database, and the set of phones (p2, p3) in the second database, then p1, p2, and p3 are all included in PHONES(e) in the global view.
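A small sketch of this function generalization (hypothetical Python; the field names are assumptions): for an employee drawn from the first database, PHONES is derived from WORKPHONE and HOMEPHONE; for one drawn from the second, the stored PHONES set is used directly, so the union appears in the global view without null values.

    # Sketch: generic multivalued function PHONES over the generalized
    # EMPLOYEE type, conditionally defined by originating database.
    def phones(employee):
        if employee["db"] == 1:
            # LS1 stores two single-valued functions.
            return {employee["WORKPHONE"], employee["HOMEPHONE"]}
        # LS2 stores a multivalued function directly.
        return set(employee["PHONES"])

    e1 = {"db": 1, "WORKPHONE": "p1", "HOMEPHONE": "p2"}
    e2 = {"db": 2, "PHONES": ["p2", "p3"]}

    # The same real-world employee seen through the global view:
    print(phones(e1) | phones(e2))   # {'p1', 'p2', 'p3'}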
b. Classification Concept

Besides generalization, another useful modeling tool is classification. Classification takes a generic object and partitions it into disjoint subtypes. To illustrate how classification and generalization work together to integrate databases, consider two databases: one contains PRODEQUIP1, and the other contains AUDIO2, VIDEO2, and COMPUTER2. Two different approaches are possible for integrating these. The first is to generalize the latter into PRODEQUIP2 and then generalize PRODEQUIP1 and PRODEQUIP2 into EQUIP. The alternative is to classify PRODEQUIP1 into AUDIO1, VIDEO1, COMPUTER1 and then to generalize these into AUDIO, VIDEO, and COMPUTER. The choice of which view to use depends on what integrated information is of interest to the user.
c. View Definition Language for DAPLEX

DAPLEX, as explained in the previous section, is a language that permits the user to express interactions over a variety of different data models in a unified way. This section introduces the constructs of the DAPLEX language for defining the global schema as a view of the local schemas.
* Defining Virtual Entity Sets and Virtual Functions

The basic form of a view definition for a virtual entity set V is:

    EXTENT OF V IS
        FOR EACH x1 IN X1 WHERE P1(x1)
        FOR EACH x2 IN X2 WHERE P2(x1,x2)
        ...
        FOR EACH xn IN Xn WHERE Pn(x1,x2,...,xn)
        DEFINE V TO BE
            (vf1 -> expression-1,
             vf2 -> expression-2,
             ...
             vfm -> expression-m)

where
    X1, X2, ..., Xn are entity sets in the base schema;
    Pj(x1, x2, ..., xj) is a predicate over the entity variables x1, x2, ..., xj;
    expression-i is a valid DAPLEX expression over x1, x2, ..., xn;
    vf1, vf2, ..., vfm are virtual functions whose domain is the virtual entity set V and whose ranges are expression-1, expression-2, ..., expression-m, respectively.
* PROJ(v, i)

The PROJ operator allows the view definer to access the entity sets over which a virtual entity has been defined. Here v is an entity variable ranging over a virtual entity set, and i is the index of the component entity set, i.e., of the ith FOR EACH statement in the view definition. PROJ operators never appear in user queries; they are evaluated during query modification.
* Defining the Generalization Hierarchy

In addition to virtual entity sets and virtual functions, the final virtual object supported is the generalization hierarchy. A generalized object is formed from the disjoint union of specific objects. Functions over a generalized object are conditionally defined over the specific objects. Overlapping specific objects can be generalized by first defining views to separate them into non-overlapping collections of entities, and then defining the generalization view over the disjoint virtual entity sets. The view definition for a generalized object G over specific objects S1, S2, ..., Sn is:
    EXTENT OF G IS
        FOR EACH g IN S1 U S2 U ... U Sn
        CASE g IS
            WHEN S1 ->
                DEFINE G TO BE
                    (vf1 -> expression-11,
                     ...
                     vfm -> expression-1m)
            ...
            WHEN Sn ->
                DEFINE G TO BE
                    (vf1 -> expression-n1,
                     ...
                     vfm -> expression-nm)

2.1.7 Query Modifications
A view is a schema object derived from existing database objects. A view is usually defined as a query which could materialize it. However, it is not necessary to physically construct the view: a query over a virtual object can be mapped into a query over actual schema objects by replacing the view with its definition. This mapping is called query modification (STON 75).
Query Modification Algorithm

Step 1. A query which references a generalized object is expanded into the union of queries over the alternative specific objects. For each generalization hierarchy G referenced in the range part of the query, these steps are followed:
- Substitute the specific objects S1, ..., Sn for g.
- Replace the query by the union of queries over each Si separately.
- Substitute the definitions of conditionally defined functions for the appropriate case under consideration.
- Evaluate projection operators.
Repeat this step for all generalization hierarchies in the query.
Step 2. For each query produced in step 1, substitute the iteration part of the virtual entity set definition for the iteration over the virtual entity set in the range part of the user query.

Step 3. For each virtual function whose domain was substituted in step 2, substitute the definition of the function for the function in the qualification and the target list.

Step 4. If the resulting query still contains virtual objects, repeat steps 1-3.
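The expansion pattern of steps 1-3 can be pictured with a toy implementation (a Python sketch under heavy simplifying assumptions; the real GDM operates on DAPLEX queries, and every name below is invented): a query over a generalized object becomes a union of queries over the specific extents, with the conditional function definitions substituted before evaluation.

    # Toy query-modification sketch: queries are Python predicates over
    # dictionaries; the point is the expansion pattern, not the machinery.
    defs = {   # conditional virtual-function definitions per specific object
        "virprod1": {"number": lambda p: p["prod1no"],
                     "prodname": lambda p: p["prod1name"]},
        "virprod2": {"number": lambda p: p["prod2no"],
                     "prodname": lambda p: p["prod2name"]},
    }
    extents = {
        "virprod1": [{"prod1no": 1, "prod1name": "bolt"}],
        "virprod2": [{"prod2no": 2, "prod2name": "nut"}],
    }

    def run(pred, target):
        """Step 1: expand the generalized object into a union of queries over
        each specific extent. Steps 2-3: substitute the function definitions
        when evaluating the qualification and the target list."""
        results = []
        for specific, rows in extents.items():
            fns = defs[specific]
            results += [target(fns, p) for p in rows if pred(fns, p)]
        return results

    # User query: print prodname of products whose number is below 10.
    print(run(lambda f, p: f["number"](p) < 10,
              lambda f, p: f["prodname"](p)))   # ['bolt', 'nut']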
Example: To illustrate the algorithm, consider a view constructed from:
    - PRODUCT1 (no., name, price, etc.)
    - WAREHSE1 (no., quantity)
    - PRODUCT2
    - WAREHSE2

Define virtual functions PRODWARE1 and PRODWARE2:

    EXTENT OF virprod1 IS
        FOR EACH p IN product1
        DEFINE virprod1 TO BE
            (virprod1no -> prod1no(p),
             prodware1 -> THE (w IN warehse1
                               WHERE prod1no(p) = ware1no(w)),
             virprod1name -> prod1name(p))

    EXTENT OF virprod2 IS
        FOR EACH p IN product2
        DEFINE virprod2 TO BE
            (virprod2no -> prod2no(p),
             prodware2 -> THE (w IN warehse2
                               WHERE prod2no(p) = ware2no(w)),
             virprod2name -> prod2name(p))
The definitions of the generalized objects PRODUCTS and WAREHSE, and the conditional function PRODWARE between them, are:

    EXTENT OF products IS
        FOR EACH p IN virprod1 U virprod2
        CASE p IS
            WHEN virprod1 ->
                DEFINE products TO BE
                    (number -> virprod1no(p),
                     prodware -> THE (w IN warehse
                                      WHERE PROJ(w,1) = prodware1(p)),
                     prodname -> virprod1name(p))
            WHEN virprod2 ->
                DEFINE products TO BE
                    (number -> virprod2no(p),
                     prodware -> THE (w IN warehse
                                      WHERE PROJ(w,2) = prodware2(p)),
                     prodname -> virprod2name(p))

    EXTENT OF warehse IS
        FOR EACH w IN warehse1 U warehse2
        CASE w IS
            WHEN warehse1 ->
                DEFINE warehse TO BE
                    (warnum -> war1num(w),
                     warquan -> war1quan(w))
            WHEN warehse2 ->
                DEFINE warehse TO BE
                    (warnum -> war2num(w),
                     warquan -> war2quan(w))
Normally, the definitions of PRODUCTS and WAREHSE would be in terms of the base entity sets PRODUCT1, PRODUCT2, WAREHSE1, and WAREHSE2, rather than the virtual entity sets VIRPROD1 and VIRPROD2. The user query is to find all product numbers that have a quantity greater than 100 and print the product name:

    FOR EACH w IN warehse WHERE warquan(w) > 100
    FOR EACH p IN products WHERE prodware(p) = w
    PRINT prodname(p)
First, generalizations are replaced by the union of queries over the specific objects:

(1) Objects: VIRPROD1, WAREHSE1
    FOR EACH w IN warehse1 WHERE war1quan(w) > 100
    FOR EACH p IN virprod1
        WHERE THE (w' IN warehse1 WHERE w' = prodware1(p)) = w
    PRINT virprod1name(p)
        U
(2) Objects: VIRPROD2, WAREHSE1
    FOR EACH w IN warehse1 WHERE war1quan(w) > 100
    FOR EACH p IN virprod2
        WHERE THE (w' IN warehse2 WHERE w' = prodware2(p)) = w
    PRINT virprod2name(p)
        U
(3) Objects: VIRPROD1, WAREHSE2
    FOR EACH w IN warehse2 WHERE war2quan(w) > 100
    FOR EACH p IN virprod1
        WHERE THE (w' IN warehse1 WHERE w' = prodware1(p)) = w
    PRINT virprod1name(p)
        U
(4) Objects: VIRPROD2, WAREHSE2
    FOR EACH w IN warehse2 WHERE war2quan(w) > 100
    FOR EACH p IN virprod2
        WHERE THE (w' IN warehse2 WHERE w' = prodware2(p)) = w
    PRINT virprod2name(p)
Next, the definitions of VIRPROD1 and VIRPROD2 are substituted into the respective subqueries. The first subquery is used as an example:

    FOR EACH w IN warehse1 WHERE war1quan(w) > 100
    FOR EACH p IN product1
        WHERE THE (w' IN warehse1 WHERE w' = prodware1(p)) = w
    PRINT virprod1name(p)

Finally, the definitions of functions are substituted into the query. In this case, PRODWARE1 and VIRPROD1NAME are replaced by their definitions:

    FOR EACH w IN warehse1 WHERE war1quan(w) > 100
    FOR EACH p IN product1
        WHERE THE (w' IN warehse1
                   WHERE w' = THE (w" IN warehse1
                                   WHERE prod1no(p) = ware1no(w"))) = w
    PRINT prod1name(p)
The resulting query is rather complex and, in its current form, would be extremely difficult to process. Certain query transformations can be applied to put the query into a form which is easier to process. The first transformation eliminates THE operators from a query: a clause of the form

    x = THE (y IN X WHERE P(y))

in the qualification becomes

    P(x)

That is, substitute x for y in P, giving

    x = THE (x IN X WHERE P(x))

and then eliminate the associated set constructor and THE operator, leaving

    P(x)

If x does not iterate over X, then the clause is automatically false. After this transformation, if the result of the qualification is false, the query is not executed.
    FOR EACH w IN warehse1 WHERE war1quan(w) > 100
    FOR EACH p IN product1
        WHERE THE (w' IN warehse1
                   WHERE prod1no(p) = ware1no(w')) = w
    PRINT prod1name(p)
Applying the transformation again results in a query without any THE operators:

    FOR EACH w IN warehse1 WHERE war1quan(w) > 100
    FOR EACH p IN product1
        WHERE prod1no(p) = ware1no(w)
    PRINT prod1name(p)
In subqueries (2) and (3), a comparison is made between warehse1 and warehse2. These subqueries can be eliminated from further consideration because their qualification will always evaluate to false. The final form of the sample query is shown below:

    FOR EACH w IN warehse1 WHERE war1quan(w) > 100
    FOR EACH p IN product1
        WHERE prod1no(p) = ware1no(w)
    PRINT prod1name(p)
        U
    FOR EACH w IN warehse2 WHERE war2quan(w) > 100
    FOR EACH p IN product2
        WHERE prod2no(p) = ware2no(w)
    PRINT prod2name(p)
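The THE-elimination rule itself can be expressed over a tiny clause representation (a hedged Python sketch, not CCA's implementation): a clause x = THE(y IN X WHERE P(y)) is rewritten to P(x) when x iterates over X, and to false otherwise, exactly as described above.

    # Sketch of the THE-elimination rewrite on a minimal clause encoding:
    # ("eq-the", x, X, P) stands for  x = THE (y IN X WHERE P(y)).
    def eliminate_the(clause, var_ranges):
        _, x, y_range, pred = clause
        if var_ranges[x] != y_range:
            return lambda env: False          # x does not iterate over X
        return lambda env: pred(env, env[x])  # substitute x for y in P

    # Example: w' = THE (w" IN warehse1 WHERE prod1no(p) = ware1no(w")).
    var_ranges = {"w'": "warehse1"}
    clause = ("eq-the", "w'", "warehse1",
              lambda env, w: env["p"]["prod1no"] == w["ware1no"])
    check = eliminate_the(clause, var_ranges)
    env = {"w'": {"ware1no": 7}, "p": {"prod1no": 7}}
    print(check(env))   # True: P(w') holds after the substitution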
So far, we have assumed that the object types that make up the generalized object are disjoint. If this is not the case, then virtual entity sets, which are restrictions of the original objects, can be defined to be non-overlapping:

    E1' = E1 - E2
    E2' = E2 - E1
    E12 = E1 ∩ E2 (the overlap)
* Overlapping Entity Sets

We assume that it is possible to identify the overlapping entities by some of their properties:

    IDE1: identifier of E1
    IDE2: identifier of E2
Then the three non-overlapping regions can be defined as:

    EXTENT OF E1' IS
        FOR EACH e IN E1
        WHERE (FOR NO f IN E2 WHERE IDE1(e) = IDE2(f))
        DEFINE E1' TO BE ( ... )

    EXTENT OF E2' IS
        FOR EACH e IN E2
        WHERE (FOR NO f IN E1 WHERE IDE2(e) = IDE1(f))
        DEFINE E2' TO BE ( ... )

    EXTENT OF E12 IS
        FOR EACH e IN E1
        WHERE (FOR SOME f IN E2 WHERE IDE1(e) = IDE2(f))
        DEFINE E12 TO BE ( ... )
Since IDE1 and IDE2 are incompatible, the integration schema must include a directory which lists equivalent identifiers; the views are then defined as above. Processing queries containing qualifications of the above form would be very expensive. It might be more reasonable, from a processing efficiency viewpoint, to assume that all entity sets are always disjoint.
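When the overlap must nevertheless be represented, the partition into E1', E2', and E12 can be sketched as follows (illustrative Python; the directory of equivalent identifiers is invented for the example).

    # Sketch: split overlapping entity sets into three disjoint regions
    # using a directory of equivalent identifiers.
    E1 = [{"IDE1": "a"}, {"IDE1": "b"}]
    E2 = [{"IDE2": "x"}, {"IDE2": "y"}]
    directory = {("b", "y")}   # IDE1 'b' and IDE2 'y' denote the same entity

    def same(e, f):
        return (e["IDE1"], f["IDE2"]) in directory

    E1_only = [e for e in E1 if not any(same(e, f) for f in E2)]   # E1'
    E2_only = [f for f in E2 if not any(same(e, f) for e in E1)]   # E2'
    E12     = [e for e in E1 if any(same(e, f) for f in E2)]       # overlap

    print(E1_only, E2_only, E12)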
2.1.8 Database Integration and Incompatibility Handling
If the LSs were identical (i.e., contained the same entity types and functions) and there were no conflicts among data values, then database integration would be a trivial task. In this case, the global schema could be defined identically to each LS, and the extension of each global entity type or function defined to be the union of the corresponding extensions in the local databases. However, in general, there are two sources of difficulty:
- Schema differences: the LSs are not identical.
- Data differences: data values stored in different databases might conflict.
It might seem that a simple solution to the above problem is to define the GS (and its extensions) to be the disjoint union of the LSs (and their extensions). This solution, however, places the burden of integration entirely on the users, who must then understand the semantics of all the local databases to formulate queries. This section explains and suggests an approach by which the DBA can design global user views to resolve various kinds of schema and data differences. It uses the generalization and view definition concepts of the above subsections.
The approach includes two steps:
- Resolve schema differences so that the entity types of interest to the user's application in all the LSs look similar, and combine these entity types using generalization.
- Resolve data differences by appropriately defining the functions on the supertype.
2.1.8.1 Schema Integration

Schema integration includes the following resolutions:
- Naming conflicts.
- Scale differences.
- Structural differences.
- Differences in abstraction.
a. Name Conflicts

Naming conflicts are easily handled by renaming.
b. Scale Differences

The same function values might be stored using different scales in different databases. To resolve this problem, a conversion table might be created, stored in the integration database, and used in the view definition. In more complex situations, conversion may require calling an arbitrary DBA-defined procedure.
Example:

    LS1:  EMP1                        LS2:  EMP2
          ID      STRING                    ID       STRING
          HEIGHT  REAL (inches)             HEIGHT   REAL (cms)
          WT      INTEGER (lbs)             WT.CODE  (light, medium, heavy)
CASE 1: Height is measured in inches in LS1 and in cms in LS2. There is a bijection between inches and cms, so we can choose either representation. In this example, we choose cms and include the conversion formula from inches to cms in the view definition. Define the virtual function HEIGHTRANGE in the following view:

    EXTENT OF superemp IS
        FOR EACH e IN EMP1
        DEFINE superemp TO BE
            (superid -> emp1id(e),
             heightrange -> 2.54 * emp1height(e),
             ...)
CASE 2: For weight, there is no conversion formula from lbs to the code (light, medium, heavy). Instead, there is a table for converting between lbs and encoded weights. This table is stored in the integration database and used in the view definition. The table relating the 'lbs' weight to the 'light/medium/heavy' weight code is represented by the entity set TABLECONV. For some TABLECONV entity t, TABCODE(t) is 'medium' while TABLBS(t) is '50 lbs'. We can define a virtual entity set relating EMP2 weights to EMP1 lbs as follows:

    EXTENT OF emp1emp2weight IS
        FOR EACH w IN emp2weight
        FOR EACH t IN tableconv
            WHERE tabcode(t) = emp2weight(w)
        DEFINE emp1emp2weight TO BE
            (emp1emp2weightlbs -> tablbs(t))
In general, the approach to handling incompatibilities of this type is to:
- Define a directory in the integration schema to contain the equivalences.
- Construct a virtual 'composite' entity set from the base entity set and the directory.
- Define a virtual function, over the composite, from the directory function.
A sketch of these three steps follows.
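The sketch below is a minimal Python rendering of the three steps (TABLECONV and the field names follow the example above, while the table values are invented):

    # Directory stored in the integration database: weight code -> lbs range.
    tableconv = [
        {"TABCODE": "light",  "TABLBS": range(0, 120)},
        {"TABCODE": "medium", "TABLBS": range(120, 180)},
        {"TABCODE": "heavy",  "TABLBS": range(180, 500)},
    ]
    emp2 = [{"ID": "e7", "WT_CODE": "medium"}]

    def emp1emp2weight():
        """Composite virtual entity set built from the base set and the
        directory, carrying a virtual lbs function drawn from the directory."""
        for w in emp2:
            for t in tableconv:
                if t["TABCODE"] == w["WT_CODE"]:
                    yield {"ID": w["ID"], "LBS_RANGE": t["TABLBS"]}

    for row in emp1emp2weight():
        print(row["ID"], row["LBS_RANGE"])   # e7 range(120, 180)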
c. Structural Differences

Structural differences include:
- Missing functions or entity types (differences in aggregation).
- Modeling a real-world object or relationship by an entity in one schema and a function in another schema.
Missing functions can be resolved by using virtual generalization hierarchies (i.e., by defining common functions in the virtual views). Differing functions or entity types are resolved by first defining virtual entity types and functions, and then using generalization.
d. Differences in Abstraction

The entity types and functions in the LSs may have been defined at different levels of generalization. These can be integrated using generalization. Besides different levels of generalization, the LSs might differ in other forms of abstraction.
2.1.8.2 Data Integration

This section discusses data discrepancies among local databases.
a. The local databases are mutually inconsistent, but correct.

One reason for this might be that entities which appear to be the same are actually different. For example:

    LS1:  entity EMP                     LS2:  entity EMP
          functions:                           functions:
              EMPNO: EMP -> integer                EMPNO: EMP -> integer
              SAL:   EMP -> integer                SAL:   EMP -> integer
* Entity Discrepancy

Suppose there is an entity e1 with EMPNO(e1) = 1 and SAL(e1) = 25 in one database, and an entity e2 with EMPNO(e2) = 1 and SAL(e2) = 45 in the other database. Also, suppose that although entities e1 and e2 have the same EMPNO, they represent different employees in the real world. This implies that EMPNO cannot be an identifier in the GS. A database ID concatenated with a local ID is one solution for defining the GS ID.
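This resolution is a one-liner in any notation (a sketch; the tuple encoding is one of several possibilities): the GS identifier pairs a database ID with the local ID, so equal local EMPNOs in different databases remain distinct.

    # Sketch: global identifier = (database ID, local ID).
    def gs_id(db_id, empno):
        return (db_id, empno)

    e1 = {"EMPNO": 1, "SAL": 25}   # in database DB1
    e2 = {"EMPNO": 1, "SAL": 45}   # in database DB2
    print(gs_id("DB1", e1["EMPNO"]) == gs_id("DB2", e2["EMPNO"]))   # False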
* Function Discrepancy

Suppose that e1 and e2 represent the same employee, but the SAL functions represent the salaries of two different jobs. This implies that the SAL functions are homonyms. In this case, the resolution is either to include both functions (renamed) in the virtual entity type, or to derive the SAL function on the virtual type by some computation over the two SAL functions on the base types.
b. The local databases are mutually inconsistent and incorrect.

In this case, several options can be used:
- Using the more credible data;
- Triggering some proper action, e.g., an error message;
- Executing procedures embedded in the view definition statement.
2.2 Four Levels Architecture Approach (UCLA)
2.2.1 Introduction

To solve the heterogeneous data translation problem, the UCLA research group suggests providing four translation layers (or levels) between the user who submits queries to the database and the network DBMSs (PIRA 80). This section describes the concept and architecture of this approach. The four translation levels are:
- The DBMS data model dependent level (local level, or LL).
- The data model independent local level (unified local level, or ULL).
- The unified global database level (UGL).
- The end user's virtual database level (VL).
As can be seen, this heterogeneous database network modeling architecture is similar to the MULTIBASE system introduced in 2.1.4. The difference between the two approaches is that MULTIBASE queries are directed toward the global database model in a language called DAPLEX, whereas, in the four-level architecture, user queries are directed at a DBMS dependent subschema of the global database.
2.2.2 Concept and Architecture

This section explains the proposed system architecture and the four translation levels.
2.2.2.1 System Architecture

The proposed system architecture is illustrated in Figure 7. The GLOBAL QUERY TRANSLATOR processes the query initially submitted by the user and, with knowledge of the virtual database model associated with that query, translates it to a form acceptable to the global conceptual model and global internal model. The query is then decomposed by a QUERY DECOMPOSER and ACCESS PATH SELECTOR into the appropriate subqueries. The subqueries must then be translated into the query language or data manipulation language of a specific GDBMS. After being processed by the corresponding site, the results of these subqueries are joined together and reformatted by the QUERY RECOMPOSER according to the virtual database model. The final result is the answer to the original query based on the user's virtual model. In purely local processing, the local DBMS processes the queries directly and completely, and all of the above processing is bypassed. A number of important catalogues or directories and mapping or data definition translation procedures are necessary.
2.2.2.2 The Four Translation Layers

As described above, the four translation layers are (see Figure 8):
- Local Level (LL)
- Unified Local Level (ULL)
- Unified Global Level (UGL)
- Virtual Level (VL)
Each of the translation levels is composed of two types of data modeling. The conceptual model at a given level is a means of visualizing the database in a meaningful form relative to data content and relationships. The internal model, also referred to as the physical or access path model, is provided for the system's use in the decision-making processes relative to the management of data. It maps the conceptual model to the actual physical placement of data so as to provide efficient data management while keeping the physical nature transparent to the users.
The internal model is not discussed in this paper.
Figure 9
illustrates the communication flow of a user query through the four
translation levels to the local databases.
a. The Local Level (LL)

This level is the DBMS dependent level. At this level, final query translations are directed to manipulate a local database. In other words, this is the final level before an actual local DBMS acts upon one of the communication network's databases.
b. The Unified Local Level (ULL)

The ULL contains a generalized translation model of each local database, from its local data model type to a data model independent model.
c. The Unified Global Level (UGL)

The UGL provides a generalized transformation of the combination of all the ULL translations. This provides an overall global view of the entire network of databases: it represents the global information content and data relationships of the network as a whole. This representation is independent of physical database locations.
d. The Virtual Level (VL)

The VL is best described as the end user's subview of the global database. It is the logical view in which the user poses queries upon the data. Thus, it is the first level of translation. Ideally, the user's logical view will be in the logical model to which the user is accustomed and which the user's host DBMS uses. This provides more flexibility and reduces the time needed to learn a new language and to reprogram. This is a major difference between MULTIBASE and this approach: because the virtual level is missing in MULTIBASE, a new language called DAPLEX is introduced to define the global view and to build users' queries and applications.
2.2.3 Conceptual Model Using the E-R Model

This section explains the E-R model and the representation of the conceptual model for the four-layer architecture.
2.2.3.1 E-R Model

The Entity-Relationship model (CHEN 76) is a highly logical representation of database information content. The E-R model can be used as a basis for the unification of different views of data: the network data model, the relational data model, and the hierarchical data model. It incorporates some of the important semantic information about the real world and can achieve a high degree of data independence. The E-R model consists of a set of objects. Objects can be either of type entity or of type relationship. Entities represent independently recognizable things in the real world, while relationships exhibit the associations between instances of entities.
a. Entities

An entity is a 'thing' which can be distinctly identified. A specific person, company, or event is an example of an entity. According to the needs of the enterprise, entities can be classified into different entity types, for example EMPLOYEE, PARTS, SUPPLIERS. An entity set is a group of entities of the same type. There are many 'things' in the real world which may be viewed differently depending upon the enterprise's objectives. It is the responsibility of the database administrator to select suitable entity types for the company.
b. Relationships

Entities are related to each other, and different types of relationships may exist between different types of entities. A relationship set is a set of relationships of the same type. (Cardenas 77) classifies an E-R relationship as hierarchical or non-hierarchical. A hierarchical relationship associates occurrences of two or several entities such that at least one of the entities needs the primary keys of the other entities to specify it uniquely.
For example, consider a relationship set EMP-DEP between the entity sets EMP and DEP (the original shows a diagram of EMP linked to DEP through EMP-DEP). The relationship set EMP-DEP indicates that the existence of an entity in the entity set DEPENDENT depends on the corresponding entity in the entity set EMPLOYEE. A hierarchical relationship should be 1:M or 1:1.
c. Roles

The role of an entity in a relationship is the function that it performs in the relationship. For example, a 'marriage' is a relationship between two entities in the entity set PERSON; 'husband' and 'wife' are roles. The concept of role is very important in the E-R model: it helps define the relationship in both directions. (In the network, hierarchical, and functional data models, only the unidirectional relationship is defined.)
d. Attributes and Value Sets

The information about an entity or a relationship is obtained by observation or measurement and is expressed by a set of attribute-value pairs. For example, '3', 'red', and 'JOHN' are values. Values are classified into different value sets, such as FEET, COLOR, and FIRST-NAME. An attribute can be formally defined as a function which maps from an entity set or a relationship set into a value set or a Cartesian product of value sets. Attributes and value sets are different concepts, even though they may have the same name in some cases. More than one attribute may map from the same entity set into the same value set; for example, NAME and ALTERNATE-NAME both map from the entity set EMPLOYEE into the value set FIRST-NAME. Note that relationships also have attributes. The concept of a relationship attribute is important in understanding the semantics of data and in determining the functional dependencies among data.
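The functional reading of attributes can be written down directly (an illustrative Python sketch with invented values): an attribute is a mapping from an entity set into a value set, and two attributes may map the same entity set into the same value set.

    # Sketch: attributes as functions from an entity set into value sets.
    EMPLOYEE = ["e1", "e2"]
    NAME           = {"e1": "JOHN", "e2": "MARY"}    # EMPLOYEE -> FIRST-NAME
    ALTERNATE_NAME = {"e1": "JACK", "e2": "MOLLY"}   # EMPLOYEE -> FIRST-NAME

    # Both attributes map the same entity set into the same value set.
    FIRST_NAME = set(NAME.values()) | set(ALTERNATE_NAME.values())
    for e in EMPLOYEE:
        print(e, NAME[e], ALTERNATE_NAME[e])
    print(sorted(FIRST_NAME))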
2.2.3.2 Representation of the Layers' Conceptual Models

The E-R data model explained above is used in this section to discuss the layers' conceptual models.
* Relational Data Model

The relational data model of a database consists of a set of tables, or relations. Both entities and relationships of the E-R model are translated into tables and are treated in the same way. The representation of entities is the same in the E-R and relational models; an E-R relationship is defined as a table which contains the mapping between instances of the participating entities.
Example:

(a) E-R model: entity E1 (PART: P#, PD) and entity E2 (SUP: S#, SD); R is a non-hierarchical M:N relationship with attribute QTY.

    Equivalent relations:
        Relation T1 (P#, PD)
        Relation T2 (P#, S#, QTY)
        Relation T3 (S#, SD)

(b) One relation T (P#, S#, PD, SD, QTY)
    Integrity constraint (functional dependencies): P# determines PD; S# determines SD.

(c) Two relations: T4 (P#, PD) and T5 (P#, S#, SD, QTY).

In case (a), three relations T1, T2, T3 are used: T1 and T3 represent the entities, while T2 represents the relationship and contains the keys of PARTS and SUPPLIERS together with QTY. In case (b), there is one relation T; using the functional dependencies, T decomposes into T1, T2, and T3. (Note: the information about the relationship is hard to visualize in this case.) In case (c), there are two relations, T4 and T5.
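The M:N translation of case (a) can be mechanized in a few lines (a hedged Python sketch, not part of the UCLA system): each entity becomes a relation over its key and attributes, and the relationship becomes a relation over the participating keys plus its own attributes.

    # Sketch: translating a non-hierarchical M:N E-R relationship
    # into relations T1, T2, T3 (schema generation only).
    er = {
        "entities": {"PART": ["P#", "PD"], "SUP": ["S#", "SD"]},
        "relationship": {"name": "R", "keys": ["P#", "S#"], "attrs": ["QTY"]},
    }

    def to_relations(er):
        relations = {name: attrs for name, attrs in er["entities"].items()}
        r = er["relationship"]
        relations[r["name"]] = r["keys"] + r["attrs"]   # keys of both + QTY
        return relations

    print(to_relations(er))
    # {'PART': ['P#', 'PD'], 'SUP': ['S#', 'SD'], 'R': ['P#', 'S#', 'QTY']}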
In the case of 1:n and 1:1 relationships which are non-hierarchical, different relations are possible. Hierarchical relationships are translated in a different way:

    E-R model: entity E1 (EMP: E#, EN); R is a hierarchical 1:n relationship; entity E2 (DEP: NAME, AGE).

    Equivalent relations:
        Relation T1 (E#, EN)
        Relation T2 (E#, NAME, AGE), with its integrity constraint.

In the above example, the dependent entity can be translated to one relation only by including the key E#, so that tuples are unique.
Since relational databases with different conceptual (E-R) models can be defined, one conceptual model should be chosen to avoid ambiguity at the time of defining the map between the LL and the ULL. This is a crucial responsibility of the DBA specialist in charge of integrating the system.

In a multi-database environment, one goal is the consistency of the representative conceptual models. The same entities and relationships appearing in different conceptual models should be examined carefully. The concepts of generalization and classification, discussed in a previous section, should be used in merging the unified local levels to build the unified global level (global view). To achieve this, it is probably necessary to convert some of the relationships to entities and vice versa.
* Network Data Model (CODASYL DBTG GDBMS, CINCOM's TOTAL, or HIS' IDS)

The mapping is usually simple because in the network model, in contrast to the relational model, entities and relationships are separated explicitly. M:n relationships are represented through linked records.
* IMS

The data structure used in IMS 'physical databases' is hierarchical. The relationship between two segments, in terms of the E-R model, can be hierarchical or non-hierarchical, depending on the IMS child segment. A problem with a non-hierarchical relationship is that it is not allowed in an IMS 'physical database' (a child segment must have a parent segment). One feature of IMS is that it allows the definition of a 'logical' relationship between different physical databases. This relationship is non-hierarchical and can be 1:n or m:n. The attributes of the relationship are held in the IMS pointer segment, or intersection segment. Because of the hierarchical data structure used in IMS, a set of integrity constraints is needed for the conceptual and internal models.
* An Example of ULCMs and the UGCM

RELATIONS:
    relation PART (P#, PD, CL)
        part information: part number, part description, and its class
    relation WH (W#, WD)
        warehouse information: warehouse number and its description
    relation PSTR (P#A, P#C, QTY-USED)
        product structure: assembly part number, component part number, and quantity used per assembly
    relation PW (P#, W#, QTY)
        part-warehouse: part number, warehouse number, and the quantity of a part at a warehouse

INTEGRITY CONSTRAINTS:
    All the part numbers in PW and PSTR must be in PART. All the warehouse numbers in PW must be in WH.
(Diagram: entity E11: PART (P#, PD, CL); entity E12: WH (W#, WD); relationships R11 and R12.)

Unified Local Conceptual Model of the Relational Database

MAPPING:
    ULCM Elements        LLM Elements
    Entity E11           Relation PART
    Entity E12           Relation WH
    Relation R11         Relation PSTR
    Relation R12         Relation PW
IMS DDL:

    DBD   NAME=DB2
    SEGM  NAME=PART,BYTES=37
    FIELD NAME=(P#,SEQ),BYTES=16,START=1
    FIELD NAME=PD,BYTES=20,START=17
    FIELD NAME=CL,BYTES=1,START=37
    SEGM  NAME=WH,PARENT=PART,BYTES=27
    FIELD NAME=(W#,SEQ),BYTES=5,START=1
    FIELD NAME=WD,BYTES=16,START=6
    FIELD NAME=QTY,BYTES=6,START=22
(Diagram: entity E21: PART (P#, PD, CL); entity E22: WH (W#, WD); relationship R21.)

Unified Local Conceptual Model of the IMS Database

MAPPING:
    ULCM Elements        LLM Elements
    Entity E21           Segment PART with all fields
    Entity E22           Segment WH (W#, WD): the projection of the WH segment over (W#, WD)
    Relation R21         WH is the 'physical child' of PART (the relationship between parent and child)
    R21.QTY              QTY field in the WH segment
CODASYL DDL:

    SCHEMA NAME IS DB4.
    AREA NAME IS DATA-AREA.
    RECORD NAME IS PART.
        LOCATION MODE IS CALC HASH-P# USING P# IN PART
            DUPLICATES ARE NOT ALLOWED.
        WITHIN DATA-AREA.
        02 P#;  TYPE IS CHAR 16.
        02 PD;  TYPE IS CHAR 20.
        02 CL;  TYPE IS CHAR 1.
    RECORD NAME IS WH.
        WITHIN DATA-AREA.
        02 W#;  TYPE IS CHAR 5.
        02 WD;  TYPE IS CHAR 16.
        02 QTY; TYPE IS DEC 6.
    SET NAME IS INVENTORY
        OWNER IS PART.
        MEMBER IS WH.
            MANDATORY AUTOMATIC.
            ASCENDING KEY IS W# IN WH.
            DUPLICATES ARE ALLOWED.
        SET OCCURRENCE SELECTION IS THRU LOCATION MODE OF OWNER.

(Diagram: entity E31: PART (P#, PD, CL); entity E32: WH (W#, WD); relationship R32.)

Unified Local Conceptual Model of the CODASYL Database

MAPPING:
    ULCM Elements        LLM Elements
    Entity E31           Record PART with all fields
    Entity E32           Record WH with two fields (W#, WD)
    Relation R32         R32.QTY: QTY field in the WH record
(Diagram: entities PART and WAREHOUSE (W#, WD) with relationships R1 and R2.)

    E1: Union of entities E11, E21, and E31, with all their attributes
    E2: Union of entities E12, E22, and E32
    R1: Relation R11
    R2: Union of relations R12, R22, and R32

Unified Global Conceptual Model
SECTION III
SUMMARY

3.1 Differences Between the Two Approaches
In this section, the differences and similarities between the two approaches are explained.

3.1.1 Data Model Differences

The representation of data by the FDM and by the E-R model pursues a common goal: a natural view of a heterogeneous distributed database. This view must encompass all existing database models. The E-R data model achieves this goal because it satisfies data independence. The FDM also achieves this goal, but to a lesser degree: hierarchical data model representation is one of the major problems for the FDM. The FDM is not totally data independent because its access paths are explicitly stated at the external level (when used in conjunction with the DAPLEX language). Relationships of entities in hierarchical databases are implicit through parent-child relationships; because of this, their access paths are internal constructs. The explicit access path specifications of the FDM and the implicit access paths of hierarchical databases do not permit the FDM to encompass hierarchical databases. The UCLA project intends to overcome this problem by using the E-R model externally with SDAGs internally: access paths are indicated in the E-R diagrams, whereas explicit and implicit relationships exist internally through the use of SDAGs. (SDAG: Semantic Direct Access Path; see HORO 82.)
3.1.2 Language Support Tradeoffs

The differences in language support have both positive and negative tradeoffs. In the case of MULTIBASE, queries expressed in DAPLEX require users to learn a new query language. One major disadvantage is that all current retrieval applications must be rewritten in DAPLEX. However, the fact that there are fewer query translation levels implies a higher probability of consistent responses to a wide variety of queries. In contrast, the UCLA project's support of the languages currently running on the local DBMSs provides users with an environment to which they are accustomed. Furthermore, current applications can be used without modification. In providing these capabilities, one might suspect that the variety of queries properly processed would not be as great as it would be using a single global query language such as DAPLEX.
3.1.3 Architectural Differences

Two fundamental differences exist in these layered architectures: the virtual level of the UCLA approach and the auxiliary schema of MULTIBASE. The reasons for these differences are apparent if one considers the query translation approach and objectives of each architecture. The query translation algorithms depend upon the internal representation of the various layers of the architecture. In the case of MULTIBASE, the FDM is used to interpret queries as a target for each physical database; thus, the auxiliary schema exists to resolve differences among the physical databases once data has been retrieved. With the UCLA approach, the internal models indicate the access paths containing the inter-relationships of the database along those paths. The MULTIBASE auxiliary schema is embedded in the internal access path model at the unified local level of the UCLA architecture.

The second difference is the number of layers. This difference is the result of the two approaches' objectives: UCLA wants to support existing DBMS query languages, whereas CCA chooses to provide a uniform query language interface.
3.1.4 Functionality

The functionality of an HDDBMS is the overall database management capability for an HDDB. The basic goal is to allow integrated access to as much of the HDDB as possible. Considering the functions a DBMS provides for a database, the best an HDDBMS can do is provide the same capability for an HDDB. Unfortunately, due to the difficulty of translating the differing data models into a common representation, and the differing concurrency control mechanisms of the DBMSs, full functionality will never be realized. Because of this, both CCA and UCLA have imposed limitations on their respective HDDBMSs; the limitations of each project are summarized below.
a. Functionality of MULTIBASE

The objective of MULTIBASE is to provide a uniform interface through a single HDDB schema and query language. Due to the differences in the update semantics and concurrency mechanisms of each DBMS, this has been further reduced to a retrieval-only HDDBMS. For this reason, it is difficult to consider MULTIBASE an HDDBMS; it is more like an HDDB reporter. Retrieval can be accomplished through DAPLEX. However, maintenance of the physical databases is left to the local application programs. Thus, HDDB integrity requires manual intervention. That is, common information spread across two or more physical databases must be indirectly kept consistent; usually, such a task requires migration of data, via manual procedures, from one database site to another.
Another limitation of MULTIBASE is the set of supported DBMSs. No hierarchical data model is supported, although the most frequently used DBMS today is IMS, a hierarchical DBMS. Thus, a large population has been left out.

A third limitation of MULTIBASE is its non-support of existing applications which report the data at a local site. Since MULTIBASE requires the use of the DAPLEX language for HDDB retrieval, existing applications will never run on the HDDB; they will have to be rewritten if they are to access the HDDB.

Of course, imposing these limitations also has a benefit: the simplicity due to the reduction in functionality. Less capability implies a simpler solution. MULTIBASE is now in the development stage and is expected to become available sometime in 1986.
b. Functionality of the UCLA Approach

An objective of the UCLA approach is to provide a true HDDBMS. Unfortunately, due to semantics and concurrency differences, a fully capable HDDBMS will never be realized. Because of this, a goal of the UCLA project is to provide an HDDBMS that is limited in the case of updating. Just exactly what these limitations are has yet to be fully identified. Aside from this limitation, all other goals of a full HDDBMS are objectives, and ambitious ones at that, of the UCLA project. Thus, existing local applications should run against the HDDB, and all DBMS data models should be supported. Of course, this ambition leads to complications: the UCLA objectives are far beyond those of MULTIBASE. Because of this, the project is still in the research stages, and many issues have yet to be resolved, most notably the limitations on update capability and the translation of queries and data.
3.1.5 Translation

As mentioned, the UCLA architecture for the global-local level translation is equivalent to MULTIBASE's, differing only in the external and access path models. The challenge for the UCLA group is to extend this architecture to encompass virtual level processing; this has not been researched yet.

There is much discussion about the translation complexity of the four-layer architecture. One might suspect the need for another layer between the virtual and unified global levels, which could be referred to as the unified virtual level. This unified virtual level would provide to the virtual and global levels what the unified local level provides for the local and global levels; that is, an intermediate mapping. The reasoning is as follows. To transform a virtual level DBMS data manipulation language into the global E-R model requires an association between the virtual level database definition, expressed in the data definition language of the virtual level DBMS, and the unified global level, expressed as a set of internal access paths using SDAGs. Thus, some sort of mapping between the virtual and unified global levels, preferably E-R, must be defined. Therefore, an intermediate mapping has been identified; the result is another layer.
3.2 Conclusion

Two approaches to HDDBMS have been discussed and compared with respect to data modeling, database integration, and query translation. These approaches are based on their objectives. In the case of CCA, the basic objective is to build an HDDBMS that provides uniform reporting capabilities on the HDDB through the use of the FDM and the DAPLEX language. Because of the simplification due to the limitations imposed by this goal, a prototype version of MULTIBASE is currently under development.

The UCLA project has a much more ambitious goal in that its objective is to provide a complete HDDBMS (with limited update capabilities) in such a manner that existing DBMS languages and applications can operate on the HDDB. Because of this, the UCLA project development has yet to begin; it may begin only when the issues regarding update limitations and query and data translations are resolved.

A general consensus is that automating the management of an HDDB is a very complex and difficult problem. The need for it exists today. Ideally, one would like to see a full working HDDBMS whose objectives are those of the UCLA project. However, promises for achieving these are well into the future, if ever. The only alternative in the near future is to utilize tools such as DAG or MULTIBASE to provide integrated and uniform access, along with the manual efforts needed to ensure HDDB integrity.
REFERENCES
(ADIBA 82)    Adiba, M., 'Distributed Database Research at Grenoble University,' Database Engineering, Vol. 5, No. 4, December 1982.

(BRAY 76)     Bray, O. H., 'Distributed Database Design Considerations,' Trends and Applications: Computer Networks, 1976.

(CARDENAS 79) Cardenas, A. F., 'Data Base Management Systems,' Allyn and Bacon, Inc., 1979.

(CHAN 81)     Chan, A., 'The Design of an ADA Compatible Local Database Manager (LDM),' Technical Report CCA-81-09, Computer Corporation of America, Cambridge, Mass., November 1981.

(CHEN 76)     Chen, P. P., 'The Entity-Relationship Model: Toward a Unified View of Data,' ACM Transactions on Database Systems, Vol. 1, No. 1, March 1976, pp. 9-36.

(CHU 79)      Chu, W. W. and Chen, P. P., 'Tutorial: Centralized and Distributed Data Base Systems,' IEEE Computer Society, Long Beach, California, 1979.

(CODD 79)     Codd, E. F., 'Extending the Database Relational Model to Capture More Meaning,' ACM Transactions on Database Systems, December 1979, pp. 397-434.

(DATE 77)     Date, C. J., 'An Introduction to Database Systems,' 2nd Edition, Addison-Wesley, 1977.

(DAY 82)      Dayal, U. and Hwang, H. Y., 'View Definition and Generalization for Database Integration in MULTIBASE: A System for Heterogeneous Distributed Databases,' Proc. Sixth Berkeley Workshop on Distributed Data Management and Computer Networks, February 1982.

(GLIG 84)     Gligor, V. D. and Luckenbaugh, G. L., 'Interconnecting Heterogeneous Database Management Systems,' Computer, Vol. 17, No. 1, January 1984, pp. 33-43.

(GLORIEUX 82) Glorieux, A. M. and Litwin, W., 'Distributed Data User's Needs: Experience from Some SIRIUS Project Prototypes,' Database Engineering, Vol. 5, No. 4, December 1982.

(HM 78)       Hammer, M. and McLeod, D., 'The Semantic Data Model: A Modeling Mechanism for Database Applications,' Proc. ACM SIGMOD Conf., 1978, pp. 26-36.

(HORO 82)     Horowitz, J. R., 'Specification of the Generalized Internal Models in a Heterogeneous Database Management System Network,' Master's Thesis, UCLA Department of Computer Science, March 1982.

(SS 77)       Smith, J. M. and Smith, D. C. P., 'Database Abstraction: Aggregation and Generalization,' ACM Transactions on Database Systems, Vol. 2, No. 2, June 1977.

(KG 81)       Katz, R., Goodman, N., Landers, T. A., Smith, J. M. and Yedwab, L., 'Database Integration and Incompatibility Handling in MULTIBASE: A System for Integrating Heterogeneous Distributed Databases,' Technical Report CCA-81-06, Computer Corporation of America, Cambridge, Mass., May 1981.

(LAND 82)     Landers, T. and Rosenberg, R. L., 'An Overview of MULTIBASE,' in Distributed Databases, North-Holland Publishing Company, 1982, pp. 153-183.

(LG 78)       Lee, R. M. and Gerritsen, R., 'Extended Semantics for Generalization Hierarchies,' Proc. ACM SIGMOD, 1978, pp. 18-25.

(PIRA 80)     Pirahesh, M. H. and Cardenas, A. F., 'Data Base Communication in a Heterogeneous Data Base Management System Network,' Information Systems, Vol. 5, 1980, pp. 55-79.

(SHIPMAN 81)  Shipman, D. W., 'The Functional Data Model and the Data Language DAPLEX,' ACM Transactions on Database Systems, Vol. 6, No. 1, March 1981.

(SK 77)       Sibley, E. H. and Kerschberg, L., 'Data Architecture and Data Model Considerations,' AFIPS Conference Proceedings, Vol. 46, 1977.

(SMITH 81)    Smith, J. M., et al., 'MULTIBASE: Integrating Heterogeneous Distributed Database Systems,' Proceedings of the 1981 National Computer Conference, AFIPS Press, 1981, pp. 487-499.

(STON 75)     Stonebraker, M., 'Implementation of Integrity Constraints and Views by Query Modification,' Proc. ACM SIGMOD Conference, 1975, pp. 65-78.

(ZIMM 80)     Zimmermann, H., 'OSI Reference Model: The ISO Model of Architecture for Open Systems Interconnection,' IEEE Transactions on Communications, Vol. COM-28, April 1980, pp. 425-432.
FIGURE 1: Distributed Database Network Configuration (diagram not recoverable from the scan)
92
NAME
STUDENT
COURSE
NAHE
DEPT
TITLE
HEAD
RANK
SALARY
FIGURE 2
Functional Data Model Description Of A University Database
93
FIGURE 3: MULTIBASE Schema Architecture (a DAPLEX global schema over DAPLEX local schemas 1..N and local host schemas 1..N, plus a DAPLEX auxiliary schema)
FIGURE 4: MULTIBASE Component Architecture (the Global Data Manager splits a DAPLEX global query into DAPLEX single-site queries; Local Database Interfaces 1..N translate them into local queries against the local host DBMSs and return formatted data)
FIGURE 5: GDM Component Architecture (transformer, optimizer, decomposer, filter, and monitor components; the global query on the global view is transformed onto the DAPLEX local and auxiliary schemata, decomposed into single-site queries, and the combined data is assembled from the sites and the auxiliary database)
FIGURE 6: Local Database Interface Architecture (network interface, optimizer, translator, and data formatter between the DAPLEX single-site query and the local host DBMS, with a host interface issuing the local query and returning raw data)
97
GLOaAL
OUEAY
DATABASE
COMMUNICATION
SYSTEM
NETWOIIK
TIIIANSLA TOfl
CATALOGUE
ULCM
ULIIIII
UGCM
UGIIIII
UM
VIII
OUE"Y
'niANSLATOfl
-------~-1
I
SUIOUE"Y
IUeOUE"Y
·---------,
I
\
FIGURE 7
UCLA System Architecture
98
FIGURE 8: UCLA Layered Architecture (virtual level, unified local level, and local level repeated per site)
FIGURE 9: Communication Flow of a User Query Through the Four Translation Levels to the Local Databases (the user query and its results pass through the virtual level, unified global level, unified local level, and local level, each with a conceptual and an internal model, down to the local query and retrieved data)