Home >>Distributed DBMS Tutorial >DDBMS - Relational Algebra for Query
It is initially scanned, parsed and validated when a query is placed. An internal representation of the query, such as a query tree or a query graph, is then generated. To extract results from the database tables, alternative execution methods are then devised. The method of selecting the most suitable query processing execution strategy is called query optimization.
Query optimization in DDBMS is a crucial task. As the number of alternative strategies will increase exponentially due to the following factors, the complexity is high.
Hence, the goal in a distributed system is sometimes to find a successful query processing execution strategy rather than the right one. The time for a query to operate is the sum of the following −
Query processing is a collection of all activities starting from the placement of the query to the display of the query results. As seen in the following diagram, the steps are-
Relational algebra defines the core set of relational database model operations. A relational algebra expression forms a sequence of relational algebra operations. The result of this expression reflects the result of a query from a database.
The basic operations are –
Projection
Projection operation displays a subset of fields of a table. This gives a vertical partition of the table.
Syntax in Relational Algebra
$$\pi_{<{AttributeList}>}{(<{Table Name}>)}$$
For example, let us consider the following Student database −
Roll_No | Name | Course | Semester | Gender |
---|---|---|---|---|
2 | Ankur | BCA | 1 | Male |
4 | Mohit | MCA | 1 | Male |
5 | Azam | BCA | 2 | Male |
6 | Anshika | BCA | 1 | Female |
8 | Shivani | MCA | 1 | Female |
We will use the following relational algebra expression if we want to show the names and courses of all students.
$$\pi_{Name,Course}{(STUDENT)}$$
Selection
The selection operation shows a subset of tuples that satisfy those conditions in a table. This gives the table a horizontal partition.
Syntax in Relational Algebra
$$\sigma_{<{Conditions}>}{(<{Table Name}>)}$$
For example, we can use the following relational algebra expression in the Student table if we want to show the information of all students who have selected MCA courses.
$$\sigma_{Course} = {\small "BCA"}^{(STUDENT)}$$
Combination of Projection and Selection Operations
We need a combination of projection and selection operations for most queries. These expressions can be written in two ways:
To show the names of all female BCA course students , for example-
$$\pi_{Name}(\sigma_{Gender = \small "Female" AND \: Course = \small "BCA"}{(STUDENT)})$$
$$FemaleBCAStudent \leftarrow \sigma_{Gender = \small "Female" AND \: Course = \small "BCA"} {(STUDENT)}$$
$$Result \leftarrow \pi_{Name}{(FemaleBCAStudent)}$$
Union
If P is the result of an operation, and Q is the result of another operation, then the union of P and Q ($p \cup Q$) is the set of all tuples without duplicates in either P or Q, or both.
To reveal all students who are either in semester 1 or in the BCA course, for example-
$$Sem1Student \leftarrow \sigma_{Semester = 1}{(STUDENT)}$$ $$BCAStudent \leftarrow \sigma_{Course = \small "BCA"}{(STUDENT)}$$ $$Result \leftarrow Sem1Student \cup BCAStudent$$
Intersection
If P is the result of an operation and Q is the result of another operation, then the P and Q intersection ($p \cap Q$) is the set of all the tuples in both P and Q.
Provided the following two schemas, for example,-
EMPLOYEE
EmpID | Name | City | Department | Salary |
---|
PROJECT
PId | City | Department | Status |
---|
To display the names of all cities where a project is located and also an employee resides −
$$CityEmp \leftarrow \pi_{City}{(EMPLOYEE)}$$ $$CityProject \leftarrow \pi_{City}{(PROJECT)}$$ $$Result \leftarrow CityEmp \cap CityProject$$
Minus
P-Q is the set of all tuples that are in P and not in Q, if P is the result of an operation and Q is the result of another operation.
For example, to list all departments that do not have an ongoing project (status projects = ongoing) –
$$AllDept \leftarrow \pi_{Department}{(EMPLOYEE)}$$ $$ProjectDept \leftarrow \pi_{Department} (\sigma_{Status = \small "ongoing"}{(PROJECT)})$$ $$Result \leftarrow AllDept - ProjectDept$$
Join
The Join operation combines the related tuples of two separate tables (query results) into one single table.
For example , consider two schemas in a bank database, Customer and Branch, as follows.
CUSTOMER
CustID | AccNo | TypeOfAc | BranchID | DateOfOpening |
---|
BRANCH
BranchID | BranchName | IFSCcode | Address |
---|
To list the employee details along with branch details −
$$Result \leftarrow CUSTOMER \bowtie_{Customer.BranchID=Branch.BranchID}{BRANCH}$$
Until optimization, SQL queries are converted into equivalent relational algebra expressions. A query is initially broken down into smaller blocks of queries. The equivalent relational algebra expressions are translated from these blocks. Optimization involves optimizing each block and then optimizing the whole of the query.
Examples
Let us consider the following schemas −
EMPLOYEE
EmpID | Name | City | Department | Salary |
---|
PROJECT
PId | City | Department | Status |
---|
WORKS
EmpID | PID | Hours |
---|
Example 1
We write the SQL query to show the details of all employees who earn a salary LESS than the average salary.
SELECT * FROM EMPLOYEE WHERE SALARY < ( SELECT AVERAGE(SALARY) FROM EMPLOYEE ) ;
This query contains one nested sub-query. So, this can be broken down into two blocks.
The inner block is −
SELECT AVERAGE(SALARY)FROM EMPLOYEE ;
If the result of this query is AvgSal, then outer block is −
SELECT * FROM EMPLOYEE WHERE SALARY < AvgSal;
Relational algebra expression for inner block −
$$AvgSal \leftarrow \Im_{AVERAGE(Salary)}{EMPLOYEE}$$
Relational algebra expression for outer block −
$$\sigma_{Salary <{AvgSal}>}{EMPLOYEE}$$
Example 2
We write a SQL query to show the project ID and status of all employee 'Arun Kumar' projects.
SELECT PID, STATUS FROM PROJECT WHERE PID = ( SELECT FROM WORKS WHERE EMPID = ( SELECT EMPID FROM EMPLOYEE WHERE NAME = 'ARUN KUMAR'));
This query contains two nested sub-queries. Thus, can be broken down into three blocks, as follows −
SELECT EMPID FROM EMPLOYEE WHERE NAME = 'ARUN KUMAR'; SELECT PID FROM WORKS WHERE EMPID = ArunEmpID; SELECT PID, STATUS FROM PROJECT WHERE PID = ArunPID;
(Here ArunEmpID and ArunPID are the results of inner queries)
Relational algebra expressions for the three blocks are −
$$ArunEmpID \leftarrow \pi_{EmpID}(\sigma_{Name = \small "Arun Kumar"} {(EMPLOYEE)})$$
$$ArunPID \leftarrow \pi_{PID}(\sigma_{EmpID = \small "ArunEmpID"} {(WORKS)})$$
$$Result \leftarrow \pi_{PID, Status}(\sigma_{PID = \small "ArunPID"} {(PROJECT)})$$
Computation of Relational Algebra Operators
It is necessary to analyse relational algebra operators in several different ways, and any alternative is called an access road.
The alternative to computation relies on three main factors:
The time for a relational algebra operation to be performed is the sum of-
Because the time a tuple is processed is much shorter than the time it takes to get the tuple from the storage , especially in a distributed system, disk access is very often considered to be the metric for calculating the cost of relational expression.
The selection operation estimation depends on the complexity of the selection situation and the availability of indexes for the table attributes.
Depending on the indexes, the computation alternatives below are
Each tuple in P must be compared with each tuple in Q to test whether the join condition is satisfied when we want to join two tables, say P and Q. If the condition is satisfied, the corresponding tuples are concatenated, duplicate fields are deleted and added to the result relationship. This is, consequently, the most costly operation.
The popular approaches for computing joins are −
Nested-loop Approach
This is the conventional join approach. It can be illustrated through the following pseudocode (Tables P and Q, with tuples tuple_p and tuple_q and joining attribute a) −
For each tuple_p in P For each tuple_q in Q If tuple_p.a = tuple_q.a Then Concatenate tuple_p and tuple_q and append to Result End If Next tuple_q Next tuple-p
Sort-merge Approach
In this method, based on the joining attribute, the two tables are individually sorted and then the sorted tables are merged. As the number of records is very high and can not be accommodated in the memory, external sorting methods are introduced. Once the individual tables are sorted, one page is brought into the memory of each of the sorted tables, merged based on the joining attribute, and the joined tuples are written out.
Hash-join Approach
This approach consists of three steps: the phase of partitioning and the phase of probing. Tables P and Q are broken into two sets of disjointed partitions in the partitioning phase. A common feature of hash is agreed upon. For assigning tuples to partitions, this hash function is used. Tuples in a partition of P are contrasted with tuples in the corresponding partition of Q during the testing process. They are written out if they match, then.