Appendix 3 - Product Architecture
 
Previous Chapter: Appendix - SQL
Next Chapter: Appendix - Special Issues

This chapter is included in the manual to provide information about the design and implementation issues of Ovrimos. Some technical details are described, but none of the issues addressed here are by any means necessary for the non-technical user.

Communication Model

The communication model of the SQL Server is best represented by the following picture, where the basic modules used for the database implementation appear. The rectangles represent program modules, while the ellipses represent data structures (e.g. the shared memory). The classical cylinder representation is used for data files. No arrows appear in the picture. The connection lines denote a two-way communication.

Picture app-3.1 The communication model for the SQL server

In the connection with the sqlcore, the grey lines indicate that a request is addressed to the multiplexer, while the solid ones that the request is delegated to and handled by the appropriate processing mechanism. More on communication and request handling is described in the working ina pools section.

Parts of the model

The basic modules which compose the database are the dbmcore, the sqlcore, the virtual processors (working ina pools), and the SQL and HTTP clients.
The dbmcore and sqlcore modules constitute the server kernel and accept calls from the client programs. In the following paragraphs, the dbmcore and sqlcore modules, as well as their communication and exchange methods are analyzed.

Before proceeding to more detailed descriptions, we give a short definition of the elements used in the model.

dbmcore
It is a set of modules handling low level calls to the data base (filing system).

sqlcore
It is a set of modules that interpret the SQL calls to low-level calls and transmit them to the dbmcore. An HTTP server is embedded into the sqlcore, allowing the module to double as a Web server.

The client-server model, in combination with the fact that the two basic kernel parts reside on the network server, minimizes useless data traffic and maximizes the speed of execution of SQL commands. The communication protocol between the server part and the client applications is TCP/IP.

network endpoints and working ina pools (wip)
Before reaching the sqlcore, the client calls for initiating a connection are "caught" by a multiplexer that, depending on the call, delegates it to an SQL serving mechanism (ina) or an HTTP serving mechanism (ina). The communication protocol at this point is TCP/IP. Before accepting a call, the multiplexer checks the working ina pool for a free processing mechanism.

SQL-clients
SQL clients are either SQL terminal programs or any other application (Java, Perl, etc.) that uses SQL.

HTTP clients
HTTP clients are Web browsers or programs and scripts that use the HTTP protocol.

The network endpoint communication with the database is achieved through a protocol where a shared memory and three semaphores are used. Details on the use of semaphores are presented in paragraph dbmcore/sqlcore communication protocol.

Connection terminology
Four terms, namely session, session group, statement and connection are used throughout this technical part of the manual and they are explained below to clarify an unavoidable overloading of these words in the context of a database user's guide.

Session and Session group
Every application program retains one or more sessions, which are grouped in session groups. In its simplest definition, a session is a communication channel, identified uniquely through a handle (session handle). A group of sessions is a session group. More on these two terms is written in the paragraph where the dbmcore communication protocol is described.

Statement and Connection
For every SQL application there is an active entity through which the SQL commands are executed. This is called a statement. Every SQL application may have grouped statements in one or more Connections. The Connection is a handle for these statements. A statement is a subordinate notion of connection.
More or less, a statement corresponds to a dbmcore session and a connection to a dbmcore session group. Therefore, all statements of a connection end up to sessions of the same session group.

Note!
The term statement is used throughout this chapter to denote an active entity that executes SQL commands and does not have the meaning of an SQL statement (i.e. an SQL command). The term SQL command is used whenever the need arises to talk about one in this chapter.

In the following discussion, the terms table and data file, key and index file, and row and record have the same meaning.
 

The DBMCORE

The dbmcore is the central module of Ovrimos. Together with the sqlcore they constitute the kernel of the product. It is a layer that serves low level calls to the relational filing system.
The basic submodules of the dbmcore are the following:

Calls to the dbmcore are derived from the clients. These calls (see also app-3.1) are calls issued by the sqlcore. Before getting into a detailed description of the submodules, it is a good point to pause and examine the way the communication between the dbmcore and its client, the sqlcore, is performed.

Dbmcore/Sqlcore communication protocol

The communication to the dbmcore is demand driven. The implementation of a shared memory data structure mandates the use of a semaphore that allows access to the shared memory. For the dispatching of a message from the shared memory to the dbmcore and vice-versa, two more semaphores are used.
The three semaphores are supported by the operating system where the dbmcore and the sqlcore reside. Their implementation is operating system dependent.

The steps for the timing and serving of requests are as follows:

 1. The client addresses semaphore 1 (which controls message dispatching to the shared memory). The client is a statement handling ina.

2. The message is written into the shared memory. 3. The dbmcore is notified, through semaphore 2 (which allows or denies the reading of the shared memory from the dbmcore), that there is a message to read.
Through semaphore 3, the client waits for reading from the shared memory messages sent by the dbmcore.

4. The dbmcore reads the message and executes the request (if all other conditions are satisfied) and puts the reply into the shared memory.

5. The client is signaled, via semaphore 3, that the reply is ready and may be accessed. The dbmcore waits again on semaphore 2 for reading.

6. The client quits the critical section and releases, through semaphore 1, the shared memory, which, thus, becomes available to another client.

All messages have a limited length and correspond to a simple function, which may be executed very fast. The testing with this architecture (in previous applications with the use of dbmcore) shows that the multiplexing of many sessions is performed easily and without problems.
Every message is an elementary call to the dbmcore, which is served immediately. The only exception is when the client requests a wait-lock on a record. In this case the client may have to repeat the same message several times, until either a successful return, a time-out, or a definite failure of the request.

Description of the dbmcore

The following picture, app-3.2, shows the main submodules of the dbmcore. The rectangles denote parts of the program (submodules) and the data in the hexagon rules and procedures. The arrows denote a dependence (every rectangle with leaving arrows depends on the rectangles where the arrows point). Even though it does not appear in the picture, the communication is two-way, which means that when a submodule delegates the execution of an operation to one of the submodules it depends on, it will wait for a reply.
The lower half of the shared memory structure appears on the top of the picture to remind that it is the only means of communication between the dbmcore and the sqlcore, which issues the calls.

Picture app-3.2. The dbmcore structure

The coreserv is the upper level of the dbmcore and, basically, it is a wait loop that reads messages, placed by the client applications into a shared memory data structure. See, also the description of communication to the dbmcore via semaphores. Messages to the server are requests for low level processing in the database files, therefore the basic function of coreserv is to delegate the requests to the appropriate submodule.

Most requests to the dbmcore will end up to the database submodule (file and index manager) for execution. But since the data file manager should not have to deal with users, access permissions, transactions, etc., the coreserv first addresses the submodule dbnet, which, in turn, addresses the submodule locktrans (lock and transaction manager). Through the locktrans the commands to the database are integrated in a wider frame of actions, which result to the positioning of the request into the proper transaction and the handling of the request in a multi-user environment.
When the message is accessed from the shared memory and is delegated to dbnet, the session and the session_group where the message was issued are known to the locktrans. The session parameters are passed to dbnet. The dbnet examines if the message belongs to an active transaction. It also makes a request for a lock (if necessary) to the lock manager. Only when the lock permission is granted, the dbnet proceeds to pass the request to the file manager.
The sessions of one session_group have access to the bookmarks of another and may use the locks in common. More on bookmarks in the Data and Index Manager paragraph.
If the process fails, then a rollback of the transaction where the request belongs is performed.

Yet another of the core submodules is the schema. The schema contains information of the file and index catalog. Many calls to the database require access to this catalog.
One of the supported SQL client commands is the modification of the catalog data (data definition). In this case, the schema addresses the dbnet and requests proper locking actions, in order to implement the requested modification. The coreserv reserves a session specifically for this purpose.

The lock and transaction manager is one submodule, locktrans, since these functions are tightly interwoven and there is no point in attempting to separate them. The file and index manager and the data integrity mechanisms are also described in the following sections.

The Lock/Transaction Manager

The transaction is basic in all data management of Ovrimos. Commit and Rollback are used with their classical meanings and they are not described any further.

Dbmcore transactions are executed under the following rules:

Picture app-3.3 shows the transaction nesting. The pseudocode that follows describes the nesting.
 Picture app-3.3 Nested transactions

T1 is an example of a simple transaction, without nesting. T2, which contains nested transactions, may be described by the following pseudocode:

Start T2
...
Statements of T2
...

...
Commit T2
 

The nested transaction mechanism is implemented in the dbmcore but is not yet available to the user through the existing SQL terminal interface. It may, however be used by sqlapp4 applications. The nesting of transactions is possible through the nesting of cursors (declaring a cursor within a cursor).
Checkpoint and nested transactions will be available to SQL Terminal users in the Gold Edition of Ovrimos.

The lock manager is mandatory in a multi-user environment for two reasons:
1. For consistency
The data available to one user are not modified (by another user) until the first user completes any transaction with these data.
2. For integrity
The data in the database files must reflect in the correct order all updates made to them.

Data locking is performed on the row, but there are many isolation levels, as we will see below. If no locking is done, three unwanted conditions may arise in the process of concurrent data access and modification by client requests.

1. dirty read (reading of non-existing or incorrect data)
Suppose there is a transaction T1. T1 accesses and updates a field X from a row.
Before the T1 transaction is committed, another transaction T2 may 'see' the updated field.
T1 is rolled back and all its updates are canceled.
Now T2 has performed a dirty read, because it has read something that does not really exist.

2. non-repeatable read
Suppose there is a transaction T1. T1 accesses and reads one row.
Suppose there is a second transaction T2, which accesses this row, updates it and commits the updates.
During the same transaction T1, the same row is accessed, but this row is now modified.
Now T1 has performed a non-repeatable read, because it could not read twice the same data from the row.

3. phantoms (sudden appearance of previously non-existing data)
Suppose there is a transaction T1, which selects, during a query, a set of rows, all of them satisfying a certain predicate.
Suppose there is a second transaction T2, which inserts a row, that also satisfies the same predicate.
T1 performs the same query and accesses a set of rows, which contain a row that was not there before.
Now T1 has read a phantom.

To avoid all these abnormal situations, there exist, as mentioned before, four isolation levels, according to the SQL standard. In particular, the four isolation levels are:

Each isolation level puts more severe restrictions from the previous one.

The locking in Ovrimos is made in the record level (table row). There are two different locks:
1. read-lock
2. write-lock

Locks expire automatically at the end of the transaction.

The four isolation levels are achieved with the use of the locks in the following way:

1. READ UNCOMMITTED

2. READ COMMITTED 3. REPEATABLE READ 4. SERIALIZABLE The lock requests are examined and if it is not possible to grant a lock immediately, the request is not rejected at once. Instead, a wait indication is placed in the waiting queue. (See also, communication between the dbmcore and the client). This waiting time is enough to allow short transactions to commit their actions and release the requested objects.
Nevertheless, a wait of this sort allows interdependencies between the client programs and may cause a cycle (deadlock).
In order to prevent the deadlock, a graph of interdependencies is created, which is checked every time a request is candidate for the waiting list. If the insertion results in a cycle, then the request is immediately rejected.
Internally, the processing mechanism arranges to keep repeating the request until the expiration of the waiting time. In this case the client program is notified that a timeout has occurred, and it is permitted to repeat the request with the hope that meanwhile the owner of the desired object has released it. This kind of action, however, may lead to an infinite cycle of request-rejection (livelock), which should be avoided.

The Data and Index Manager

Static information

The lower layer of Ovrimos is a file and index manager. The database files are structured files, where every record is placed in the first available blank or after the last record. Every data file consists of a header and an unlimited number of records.
For every data key there is an index file, where the index is stored in a B-tree form. The B-tree is a multi-level tree. The nodes may be terminal or non-terminal. Index information is stored only in the B-tree leaves. All the non-terminal nodes are used for access to the leaves.

Dynamic information

The data files grow automatically as the stored data increase and the process is transparent to either developer or user. Similarly, the index files are reorganized dynamically every time a leaf data exceed its storing capacity. The leaf becomes automatically a non-terminal node with actions that are transparent to the user.
The file manager opens files when a request is made for them and closes the opened files only if there is a handle shortage. A memory map is used for caching file data and about a thousand B-tree nodes are also cached, allowing a very fast accessing of data. The data caching uses the virtual memory facilities of the operating system. It is operating system dependent.

For every dbmcore session there is an anonymous bookmark in every file, which shows the current record according to the most recently used access key. If the last access was sequential, the bookmark is not related with any key. A set of functions is available to the user (through ODBC, Java, Perl etc., but not SQL terminal) to allow the positioning of named bookmarks. There is no limit to the number or the place where these bookmarks are set.
This way, the anonymous bookmark allows only one current position for each data file (table) per session but as many named bookmarks as the user needs.
All sessions of one session_group are allowed to have access to the anonymous bookmarks of each other. We will return to bookmarks when we talk about the sqlcore.

Server Actions for data integrity

Ovrimos ensures data and referential integrity.
The data integrity is guaranteed by disallowing null values in any one component of a primary key.
The data file manager of dbmcore examines the record and allows insertion of the row only when it complies to this rule.

Other integrity rules, such as unique values in the primary key and, in general, of every key that is defined as having unique values are checked at this point.
Finally, fields other than keys, defined as not null are checked to ensure that they comply to the rule.

Data log, backup and recovery

A log is used to record all actions performed to the database. All actions recorded on the log are based on transactions. The beginning and end points of each transaction, as well as the actions on the data base are recorded on the log and used for a possible roll back of open transactions.

The log is particularly useful to allow on-line backup. Since a previous backup exists and the log is always updated, it is possible to take a new backup any time without having to stop the server from running. A copy of the log and the previous backup are made and may be used in combination if data recovery is required. See also, Database Maintenance.

A backup is automatically taken at shutdown. Every time the database is shutdown, the log is run forward, the backup files are replaced by the new backup, the dirty bit is turned to No and, subsequently, the log is erased. When the database is started again, a new log file starts and the dirty bit is turned to Yes.

It is also necessary to have a log for data recovery from a backup after an abnormal termination, such as power failure. In this case, the log is run and the database returns to the status it was before the crash. Afterwards, all uncommitted transactions are rolled back.

The procedures of restoring the database after a soft or a hard failure are described in Database Maintenance: Data recovery.
 

The SQLCORE

The sqlcore is a set of submodules, where the SQL commands are translated to low-level calls and are passed to the dbmcore of Ovrimos.
The hub part of the sqlcore is the SQL compiler, which receives the SQL commands and interprets them to elementary calls to the dbmcore.

The calls to the sqlcore come from either SQL clients or HTTP clients.

Before proceeding to the description of the SQL compiler, we examine the connection of SQL calls and HTTP calls to the sqlcore.

Working ina pools

There is a pool of working virtual processors (called inas), similar to threads, dedicated to handling either SQL commands or HTTP commands and a multiplexer monitoring active sockets.
The number of inas for each database is a parameter that may be tuned by the database manager (see also parameters). When a request for connection, an HTTP request or an SQL message arrives, the multiplexer examines if there are free inas to handle the request. If an appropriate free ina exists, then the socket is assigned to it. Otherwise, (if no free ina exists) the connection call is put on a priority wait-queue. This mechanism is a built-in TP Monitor and limits the number of inas to a manageable quantity and the system resources are used more efficiently.

Note! To avoid forcing the casual user to read technical details, inas are called threads throughout the rest of the manual sections.

The handling of an asynchronous SQL request from there on is graphically represented in picture app-3.4


 

Picture app-3.4 The communication protocol with the client ina
 

The SQL compiler

The process of translating each SQL statement to simple calls to the dbmcore is the compilation of each command followed by the execution, which generates the elementary calls.
The steps of this translation are similar to the steps taken by a standard compiler, which generates, instead of machine language instructions, executions plans. The execution of such a plan generates dbmcore commands. These simple dbmcore commands are the messages passed by the sqlcore to the shared memory of the dbmcore (See also, dbmcore/sqlcore communication).

Translation of an SQL command

Like all compilers, the SQL compiler is based on a number of passes, where the input and actions of each pass depend on the results of the previous pass.
The grammar, describing SQL is LALR(1) and is implemented with YACC. Since YACC uses inherently static internal data structures, the resulting code is not re-entrant. Thus, a critical section, shared by all the statement handles (stmthdl) exists. At this point only a few changes are required to turn the code into re-entrant and they will implemented in a future version. Since this part of the code is small, the bottleneck is not severe.

During the syntax analysis of a command the semantic tree is also generated. Up to 200 types of tree nodes (terminal and non-terminal) are used at this point.

Next picture, app-3.5 is a general layout of the compilation phases an SQL command has to go through until it becomes an executable code by the dbmcore.

Picture app-3.5 Compilation steps of an SQL command

Each one of the four passes is described in detail below.

Syntax Analysis
It has already been mentioned that YACC is used for syntax analysis of the SQL commands. The syntax analysis leads eventually to the analysis of semantic data and the generation of the semantic data tree. Early during syntax analysis, syntax errors and mistyped words are flagged and the user is notified by an error message about the possible source of the problem. If no syntax errors are found, then the preparation proceeds to  the rest of the passes, which reform the command to the optimal form it may take before execution.

For the following discussion, the term expression will be used for an SQL command.

The first pass (preliminary actions)

  1. For every table used within the expression, a correlation name, called range variable is defined.

  2. Thus, for tables e.g. ALPHA and BETA ranges variables ALPHA_0, BETA_0 are defined.
  3. Every table column used within the expression without the prefix of the table where it belongs, is prefixed with the range variable of the appropriate table.

  4. Thus, in an expression of the sort: select names from ALPHA .... where there is no table prefix to the column names, names becomes ALPHA_0.names.

    Also, if an * exists in select, i.e. selection of all columns of a table, the * is replaced by all the columns of the table. These columns are prefixed with the range variable of the appropriate table, e.g. if table ALPHA has columns X, Y, Z and ALPHA has range variable ALPHA_0, then select * from ALPHA where ... is interpreted as: select ALPHA_0.X, ALPHA_0.Y, ALPHA_0.Z from ALPHA_0 where...

  5.  For every syntax tree node expected to enter the execution plan (see also Execution plan), a data structure is inserted in a structure table. This will be used later.
  6. For select subqueries (select commands that contain a select within the context of another select), a correlation study of the select expressions is made and all those that may be executed in an outer loop are moved in a tree branch that is closer to the top (loop-invariant optimization). This is an inter-select optimization.

  7. For example, the expression:
    select X from alpha where x < (select agv(y) from beta)
    contains the subquery select agv(y), i.e, the average of the values in column y, which does not depend on other expressions and may easily be promoted to a higher branch in the tree, where it will be evaluated fewer times.
  8. The conditions of WHERE are examined and simplified, e.g. the BETWEEN is replaced by an equivalent condition where the symbols < and > are used.

  9. Respectively, the quantified predicate (ANY, ALL) and every condition which contains IN, is replaced by functions containing EXISTS.

    If the predicate ANY is used, e.g.

    is replaced by If the predicate ALL is used, e.g. is replaced by If  IN is used, e.g. is replaced by If the expression contains aggregates e.g. GROUP BY or HAVING, then the condition will be placed in HAVING.

    Conditions with EXISTS and UNIQUE are not simplified.

  10. For every condition which contains NOT in an outer point, the NOT is distributed to an inner loop according to De Morgan's rules.

  11. Thus, condition NOT (A AND B) becomes (NOT A OR NOT B)
  12. Constant folding. Whenever possible, a constant is replaced by its value.

  13. Also, expressions such as 5+3, are replaced by the result (8).
  14. A first process of the ORDER BY clause is made.

  15. Specifically, all sorting requests by a column data are examined and initially, they are all translated to a form that uses the internal numbers of its column (i.e. their positions in the table).
    For example, ORDER BY X, Y, Z
    becomes
    ORDER by 4, 1, 0 (if these are the positions of the columns within the table)
    The next step is an attempt to serve this sorting request using one of the existing keys. If this is not possible, then the sorting is examined again in the fourth pass, where the program decides on the access path. Otherwise, the sorting by the existing keys is considered earlier.
The second pass

During the second pass the types are checked for compatibility and error messages are issued. If some types are compatible but not the same, they will be converted in the third pass.

The third pass

The actions in the third pass are:

  1. Conversions of all types that need be converted to a relation or expression.

  2. In case of comparisons, no conversion is made, because it is possible to be more effective to make the conversion dynamically later.
    One example is the comparison for a.x < b.y,
    where it is not certain if the condition a.x <= pred(b.x) is more effective than the condition succ(a.x) <= b.y
  3. Recognition of the dynamic parameters (?)
The fourth pass

During the fourth pass the translation and the production of the execution plan is made. From the execution plan the elementary calls to the dbmcore are generated.
The execution plan is a set of actions which must be done in order to perform the required action.
Before proceeding any further into the description of the execution plan production, the notion of the iterator is introduced, as iterators constitute the building blocks of the execution plan.

Interators and results

For every command that contains a select statement, the user expects a sequence of results. A set of results is fully defined when we can traverse it and study its values. The mechanism through which we can traverse the results is the cursor. More technically, one could say that a cursor is the name of a pointer to the memory which is used for the execution of a given command.
The notion of a sequence is a little narrower than the meaning of a set, because a set has no order in its members, while the elements of a sequence (at least as long as the traversing cursor is active) must always be returned in the same order.

A sequence of results, like a set of results, may be described in two ways, which are either through enumeration of its members e.g. {1,3,5,7,9} or through a description of the relationship that generates the set, e.g. {X/X odd, between 1 and 10}.
When the sequence of results is generated from the elements of an existing table (either permanent or temporary) it is defined as an enumeration or description by extension. On the other hand, when the sequence is produced dynamically, as a combination of results that do not physically exist in a table, then this is defined as a description by intension.
In both cases, the main point is to have a result generation in a way that does not allow the user (as the final receiver of the results) to distinguish the way the sequence has been obtained.

To describe the answers to SQL queries, a set of mechanisms, called iterators, has been defined. The iterators are software mechanisms, which can be combined recursively to produce the required result sequences. These combinations are dynamic relationships between iterators and deal with the way these iterators are embedded one inside the other to set out the execution plan. Before setting the execution plan of any SQL query, the compiler must always use an iterator, that describes the results. This is necessary whenever results are produced, either when the rows physically exist in a permanent data table, or when they are dynamically calculated at traversal time. If the table is temporary, then the results are placed in this table through the use of another iterator.
According to what was previously mentioned, the user should not be able to distinguish what iterator combinations have been used and in what way.

The iterators are the building blocks of the execution plan and will be briefly described before going into the details of the execution plan setup.

For the generation of an execution plan, the SQL compiler uses a well-defined set of iterators, which may be combined together to result to all possible sequences. This way, iterators belong to other iterators, but they are all handled in a uniform manner.

There is a cursor for each SQL statement. But each cursor that executes a command uses the root iterator, which consists of many elementary iterators.
The root iterator is the abstract class iter. The notion of an abstract class in this context is similar to the object-oriented language abstract classes. All other classes are generated from an abstract class, but no instances of this abstract class exist.
Picture app-3.6 is the abstract tree.

Picture app-3.6 Iterator class hierarchy (static tree)

The static hierarchy tree shows how a class is generaterd from another and what properties does it own. For the dynamic tree, which deals with the embedding of iterators in the execution plan, we will talk in the execution plan paragraph.

iter
This is the hierarchy root. It defines the mobility rules in the result sequence. It allows movement to the first, last, previous and next object of the sequence. All iterators are scrollable.

Access path

Every SQL command that results in a table is called table expression. Usually, the commands that have sequences of rows as a result (a table form) are select commands.

The notion of the access path has to do with the way the compiler decides about the best path for a table expression, i.e. which is the best path, depending on the conditions, that the compiler will have to follow in order to traverse the records of a set of tables involved in a select statement. Since the select statements are, in general, complex and nested, there are many access paths in one execution plan.
The current paragraph refers to the path generation and the criteria, according to which a path is defined as optimal.

For every select command there is always an elementary select command, that does not contain the clauses distinct, group by and having. The access path for such a command is called elementary access path.

General rules for finding the elementary access path

For a set of tables, a combination of tables and condition placing on these tables is made. For every possible combination of table access, a path is generated as follows:

  1. The compiler examines the order of tables and the select conditions. The access path for every table is either sequential or indexed, depending on the table keys and the expression conditions.
  2. The conditions pertinent to each table are placed as close to an external point in the path as possible.
  3. If some conditions are left out, these are placed to the most external point of the path possible.

  4. The placement of conditions to external points of  select path is an intra-select optimization, as opposed to the attempt of placing of many select statements in the proper position, which is an inter-select optimization.
The compiler computes the cost of each path, according to certain criteria, which will be mentioned below, and selects as best the one with the lower cost, i.e. the one with the smaller number of table accesses.

Although this strategy is based on combinations and, theoretically, it could be proved very time-consuming, it has, practically, very good results, because the tables involved in a select statement are rarely more than three or four, rendering thus the number of possible combinations very low.

We call subquery the part of an SQL statement that follows a select (i.e. the tables, rows, and conditions involved a query). The order by clause does not belong to the subquery.

The access to every table is obtained through a fileiter and every table is traversed either sequentially or via an index (seq_iter or indexed_iter).
There is an exception, where the access is made via a compo_iter, if  the creation of the compo_iter is justified by the IN conditions (e.g. IN (value1, value2, ..., valueN)), when these conditions fill in the first components of a subquery key or when sets of values (union_iter composed of value_iter) end up in an indexed_iter. The indexed_iter exploits the existing indexes and makes optimal use of the database.

As a criterion for the finding of the optimum access path, the minimization of the number of accesses to the tables is considered. The criterion is local, in the sense that every access path is examined locally (in isolation for the select under consideration) and not in combination with any other select.
The only exception where a non-local criterion may be used is when an order by clause, which defines a desired sequence of results exists in the subquery. In this case, it is possible to use the table keys and avoid the generation of a temporary table. In this case the benefit of using the keys may be deemed greater than the selection of an optimum access path.

Calculation of the access path cost

For every access path there is a sequence of access to the tables and selection conditions from these tables: (Table1, Conditions1), ... (TableN, ConditionsN). As mentioned before, the conditions are placed as close to Table1 as possible, i.e. as close to the most external loop as possible.

Range of repetition
We call range, Ni, of each repetition the number of records that we calculate that have to be examined (accesses to the table). These accesses are either the size of the table (i.e. the number of rows) for sequential iterators or a number that is calculated depending on the key for indexed iterators.

Condition selectivity
We call selectivity of each condition, fi, the evaluated percentage of records that fulfill the condition. Selectivity is calculated according to the following criteria:
The = has a 0.01 selectivity
The <> has a 0.99 selectivity
The > and < have a 0.5 selectivity
When the condition is composite and contains the logical operators AND and OR, then the AND has as selectivity the minimum of the conjunction, while the OR has as selectivity the maximum of the disjunction.

In general, the selectivity is less than one with the exception of conditions containing table expressions. In this case the selectivity calculation will take into account the range of this expression as well.
The following discussion is a very simple example with a few nodes, where the evaluation of the access path cost is presented.
Suppose there is a select statement, using tables A, B and C and there exist conditions for each one of the tables.
In picture app-3.7 we can see one of the possible access paths, where the tables and the conditions are placed and the evaluation of the cost is performed.

Picture app-3.7 Evaluation of the access path cost

The repetition range of the first level is N1 (i.e. N1 accesses to table 1). The number of records selected for the next level (i.e. satisfying the condition f1) is a total of N1*f1. The repetition range of the second level is N2, so there will be (N1*f1)*N2 accesses for this level. ((N1*f1)*(N2*f2) of them will be satisfying this condition and these will be the records to participate to the record selection of the next level.
In the third level, the repetition range is N3, so we have
(N1*f1)*(N2*f2)*N3 records.
Condition f3 participates in the definition of the final number of records that will be selected, but does not limit the number of accesses to the last level.
The total number of accesses is the sum of accesses to each level, i.e:
N1+(N1*f1)*N2+(N1*f1)*(N2*f2)*N3

The access path cost is evaluated by the generic formula:

N1+(N1*f1)*N2+(N1*f1)*(N2*f2)*N3+...+(Pi=1,n-1Ni*fi)*Nn

The last selectivity fn is not used in the evaluation, because it does not have any effect at this point. It does not prevent the access to the database, it merely shows which records will be selected because of the condition.
When the cost evaluation of each access path is completed, the best access path is selected (i.e. the one with the minimum cost).

Iterators and bookmarks

As mentioned before, every cursor corresponds to an SQL statement, which generates an execution plan, which in its turn is a tree of iterators.
The leaves of this tree always refer to a traversable table (they are fileiters). The fileiter may be a simple sequence or an indexed iterator, but in both cases, there is a bookmark in the table and this bookmark corresponds to the bookmark described in the data and index manager of the dbmcore.
An SQL command may refer several times to the same table. Every fileiter has its own position for every table. All in all, a cursor, as a tree of iterators, has a number of current positions (bookmarks), which correspond to the leaves of the tree.

Execution plan

The table expression found in the outermost select of the command is called main table expression. In general, the main table expression may contain a number of secondary table expressions. Each of them also contains table expressions. Consequently, all internal select queries must have been recursively evaluated before executing a select command.
Every table expression is correlated up to a certain degree to those containing it. For every table expression execution an attempt is made to place it as close to the outer point as possible (this way it is executed fewer times).
It is possible, of course, to have an internal expression that is completely uncorrelated. This expression may be transferred to the outmost part of the execution plan.

Most of the time, a select expression is executed in two phases:
1. Evaluation of the table expression.
2. Traversal of the results.

The evaluation of a table expression may be considered as producing a table as a result. In fact, it may either generate a real, temporary table or refer to a virtual table via an iterator.
In case of a real, temporary table generation, the execution plan determines the iterator, which will traverse another table and fill the real temporary table.  If prerequisites for the expression execution exist, then we have a recursive evaluation.

Every SQL statement query is presented abstractly as a tree of nodes. Each of the nodes defines its own iterator. In the end, we define also, a basic operator, a primitive operator and a temporary table iterator.

Basic iterator of a node is the node that traverses the sequence generated by the evaluation of the expression represented by this node.

Primitive iterator is the one traversing a table expression that does not contain distinct, group by and having.

Temporary table iterator is the one that fills the temporary table with values, fetched by the traversal of an expression.

In the following table, the node type (i.e. the query type) is presented in one column and the iterator used to evaluate the query is presented in the other column. We use symbol S for the node, symbol iter(S) for the basic iterator and symbol iter'(S) for the primitive iterator.

The term subquery is used to describe a classical select statement with syntax:
select <column-name-1>, ..., <column-name-2> | *
from <table-name-1>, ..., <table-name-2>
where condition
 
 
Name  Node S =  iter(S) =
a  Select(without order by) with subquery x iter (x)
a'  Delete searched with subquery x  iter(x)
a'' Update searched with subquery x iter(x)
b  Select with order by and subquery x indexed_iter to a temporary table, generated by iterx(x) or iterx(x) if it is compatible with order by
c  Union all of x and y union_iter (iter(x), iter(y)) 
d  Union (without all) of x and y seq_iter in a temporary table generated by union_iter
e  Subquery (with aggregation) seq_iter in a temporary table generated by using iter'(S) for aggregation
f  Subquery with distinct (without aggregation) seq_iter in a temporary distinct table generated by iter'(S)
g  Subquery with all (without aggregation) or select for update iter'(S)
g'  Subquery inside a unique condition No iterator is defined. 
iter'(S) is used to fill a temporary table and check the unique condition.
h  Subquery count(*) in a non-scalar position value_iter(S), where execution of S as a scalar select produces a value.
The system can generate distinct tables, inside which only distinct result rows are placed.

Since the last line of the iterator table requires some explanation, we note that it is always the case that a select count(*) returns a number and not a table row. If, however, it becomes necessary to have the results returned as a sequence, then we use a value_iter, which returns only one value, but its existence allows a uniform handling of the iterator tree.

For every execution plan of a select statement there is an access path, which defines the order by which the real table accesses are made. If we wish to express concisely the relationship between the execution plan and the access path, we may say that the execution plan is a tree showing what is going to happen, In other words, the execution plan shows the order of the expression evaluation, while the access path shows how the evaluation is to be done, i.e. what is the order of accesses in the tables.

A compilation and execution example

In the following example there is a table (a system table), called _columns, which has information about the columns of all the database tables.
Every row of the table contains information about the characteristics of each column defined by the system.
The table has, among others, a composite key the (table_id, column_id), which will be used in the example.

SQL statement: select * from _columns where table_id = 8 and column_id < 3;

Steps:

Replacement of the * by the table columns.

The range variable _COLUMNS_0 is used as a prefix to every table column. The types in conditions are checked for consistency and converted if necessary. The condition < is evaluated.

Query:
select _COLUMNS_0.table_id, _COLUMNS_0.column_id, _COLUMNS_0.column_name, _COLUMNS_0.data_type, _COLUMNS_0.precision, _COLUMNS_0.length, _COLUMNS_0.scale, _COLUMNS_0.radix, _COLUMNS_0.nullable from _COLUMNS as _COLUMNS_0
where (_COLUMNS_0.TABLE_ID = $cvt (8, smallint) and _COLUMNS_0.COLUMN_ID < $pred ($cvt(3, smallint)))

where $cvt() is the type conversion function. In the first case, number 8 is converted to smallint to be consistent with the TABLE_ID type and in the second case, number 3 is converted to smallint to be consistent to the COLUMN_ID type. Function $pred() returns the predecessor.

The execution plan requires the proper iterator and shows the form of the row the iterator will select.

Execution Plan:
  Composite iterator {
     Traversal of table 2 _columns by key 0 {
       Bounds:
       0: Equal to
           $cvt (8, smallint)
       1: From
           Minimum
          To
           $cvt (3, smallint)
     }
     Calculating:
       _COLUMNS_0.table_id, _COLUMNS_0.column_id, _COLUMNS_0.column_name,
       _COLUMNS_0.data_type, _COLUMNS_0.precision, _COLUMNS_0.length,
       _COLUMNS_0.scale, _COLUMNS_0.radix, _COLUMNS_0.nullable
  }
The execution plan shows a traversal of table 2 (_columns) with the key table_id. Since (table_id, column_id) is a key (key 0), the condition is evaluated using an indexed_iter, with bounds, the table_id = 8 for the first part of the key (Equal To $cvt(3, smallint)), and  column_id < 3 (1:From Minimum To $cvt(3, smallint)).

Finally, we see the rows returned after finding which rows fulfill the conditions:

*
0 table_id: 8
1 column_id: 0
2 column_name: 'table_id'
3 data_type: 5
4 precision: (null)
5 length: 1
6 scale: (null)
7 radix: (null)
8 nullable: off
*
0 table_id: 8
1 column_id: 1
2 column_name: 'fkey_no'
3 data_type: 5
4 precision: (null)
5 length: 1
6 scale: (null)
7 radix: (null)
8 nullable: off
*
0 table_id: 8
1 column_id: 2
2 column_name: 'col_no'
3 data_type: 5
4 precision: (null)
5 length: 1
6 scale: (null)
7 radix: (null)
8 nullable: off
The three table rows returned are these that fulfill the condition table_id = 8 and column_id < 3, as it appears in the results.
 
 
Previous Chapter: Appendix - SQL