Teradata
Architecture:
The biggest strength of the Teradata is the parallelism. So the architecture of the Teradata is designed in such way to keep this strength in mind. Teradata is unique from any other database because of its unique architecture only. The main components of Teradata architecture is PE (Parsing Engine), AMP (Access Module Processor) and BYNET. We look into these components in details after looking the logical view of the architecture.
PARSING ENGINE (PE):
Whenever
a user login to Teradata it actually connect to Parsing Engine (PE). When a
user submits a query, then the PE takes action, it creates a plan and instructs
AMPs what to do in order to get the result from the query. The PE knows all, it
knows how many AMPs are connected to Teradata system, how many rows are in the
table and what is the best possible plan to execute the query. This is why the
PE is also called as the OPTIMIZER Beside making a perfect plan for query
execution PE also make a check on the access right of the user that weather the
user has the privilege to execute the query or not. In this way PE also perform
security feature on the users.
Access Module
Processor (AMP): Each AMP attached to the Teradata system
listens to the PE via the BYNET for instructions. Each AMP is connected to its
own disk and has the privilege to read or write the data to its disk. The AMP
can be best considered as the computer processor with its own disk attached to
it. Whenever it receives the instructions from the PE it fetches the data from
its disk and sends it to back to PE through BYNET. Each AMP is allowed to read
and write in its own disk ONLY. This is known as the SHARED NOTHING
ARCHITECTURE’ Teradata spreads the rows of the table evenly across all the
AMPs, when PE asks for data all AMPs work simultaneously and read the records
from its own DISK. Hence a query will be as slow as the slowest AMP in the
system. This is known as parallelism.
The BYNET is the
communication channel between PE and AMP. It ensures that the communication
between PE and AMP is correct and on right track. In Teradata system there are
always two BYNET systems. They are called as ‘BYNET 0’ and ‘BYNET 1’. But we
refer them as a single BYNET system. The reason two BYNET exist on a Teradata
system is that
1. If one BYNET fails,
the second BYNET takes over it place.
2. Two BYNET improve
the performance of the system, the PE and AMP can talk to each other over both
BYNET which fasten the communication.
Short Summary:
The
PE checks the syntax of the query, check the user security rights
Then
PE comes up with the best optimized plan for the execution of the query
The
PE passes this plan through BYNET to AMPs.
The
AMPs follow the plan and retrieve the data from its DISK.
Then
AMPs passes the data to PE through BYNET.
The
PE then passes the data to the user.
Primary Index in
Teradata:
Each
table in Teradata is required to have a primary index. Even if you did not
define any primary index in CREATE table statement, the Teradata system will
automatically takes very first column of the table as the primary index. The
primary index defines where data will reside and which AMP receives the row.
The three most
important roles the primary index does is the following
Data
Distribution
Fastest
way to retrieve Data
Incredibly
important for Joins
In
short primary index provides the fastest physical path to retrieving data
Unique Primary
Index (UPI)
A
unique primary index means that the value for the selected column must be
unique
In
the example below the EMP_ID is UniquePI
Non unique primary
index
means that the value for the selected column can be non unique.
In
the example below the LAST_NAME is
the Non Unique Primary Index.
A
NUPI will almost never spread the rows of the table evenly among the AMP’s.
Key Point - Don’t assume
that UPI is always the BEST CHOICE; sometimes the use of NUPI is very critical.
If you need to perform lot of query with LAST_NAME in the WHERE clause then
NUPI on last name is essential
The
primary index is the key to determine where the ROW of the table will reside on
which AMP.
When
a new row arrive for insert in Teradata the following steps occur
Teradata
Parsing Engine (PE) examines the primary index of the row.
Teradata
takes the primary index of the rows and run it through HASHING ALGORITHM
The
output of the Hashing Algorithm is the 32 bit Row – Hash value
COLLECT STATS is one on the most
useful utility in Teradata. It helps Parsing Engine (PE) to make an effective
plan to execute query, so that less resource are utilized and performance is
improvised.
But
there is a serious confusion between doing COLLECT STATS on column level or
table level.
A thumb rule is to collect
statistics when they changed by 10%. (That would be 10% more rows inserted, or
10% of the rows deleted, or 10% of the rows changed, or some combination.)
Here
i like to explain the difference between both the scenarios with appropriate
example.
Suppose
we have a table and we want to collect statistics on 4 columns. We can do this
by the below mentioned query
Collect
stats on TABLE_NAME column (COL1);
Collect
stats on TABLE_NAME column (COL2);
Collect
stats on TABLE_NAME column (COL3);
Collect
stats on TABLE_NAME column (COL4);
The
other way of defining COLLECT STATS on the same table is
Collect
stats on TABLE_NAME;
The
second query is collecting stats on table level. Both the approach will do the
same thing, but we cannot directly collect stats on table level. If you are
collecting STATS on table level, then the STATS must already be defined on the
above mentioned 4 columns of the table. This can be done at the time of
creation of table. If we are not
defining the STATS for the columns earlier, then our COLLECT STATS on table
level will give an error message. Collect stats on table can only be used
on a table which has stats defined on it, on any no. of columns for that be.
Once stats are defined on the columns you can you use collect stats on table
for refresh the stats for all the defined columns. If you use collect stats on
TABLE_NAME column (COL1) it will refresh the stats on the mentioned column
(COL1) only.
Pros & Cons:
We can say that COLLECT STATS
on table level is just the shortcut of collecting stats on all the columns on
whom we have already defined stats. It saves the overhead of writing COLLECT
STATS query on each column, each time we want to gather statistics. In our case
there are only 4 columns whose statistics we want, but suppose in a huge table if there are more than 10-20 columns which are
required for COLLECT STATS, then our COLLECT STATS on table level saves us a
lot of typing time. Its good practice to perform collect stats on the columns
of even an empty table, when data is loaded into the table collect stats on
table can be used to collect all the statistics without having to collect
statistics on the individual columns.
But from a DBA
point of view it’s always wise to collect stats on column level. Because if we are
doing collect stats on TABLE level then stats are not committed on any column
until all stats are collected. Since collecting stats required resources, so if
you have huge data table then it might take a long time. If you need to free
the resource then you have to abort the whole process, and no stats will be
collected on any column. On the other hand if your are doing it on COLUMN
level, and killed the process in between then at least we have stats collected
on columns which are already done before killing the script. And we can again
start from that column where we left.
Reference: http://www.teradatatech.com/
No comments:
Post a Comment