SET vs MULTISET
Table Type Specifications of SET vs MULTISET
There are two different table type philosophies, so there are two different table types: SET and MULTISET. It has been said, "A man with one watch knows the time, but a man with two watches is never sure." When Teradata was originally designed, it did not allow duplicate rows in a table. If any row in the same table had the same values in every column, Teradata would throw one of the rows out. The designers believed a second row was a mistake: why would someone need two watches, and why would someone need two rows exactly the same? This is SET theory, and a SET table kicks out duplicate rows. The ANSI standard follows a different philosophy: if two rows that are exact duplicates are entered into a table, this is acceptable. If a person wants to wear two watches, they probably have a good reason. This is a MULTISET table, and duplicate rows are allowed. If you do not specify SET or MULTISET, one is used as a default.
Here is the issue: the default in Teradata mode is SET, and the default in ANSI mode is MULTISET.
Therefore, to eliminate confusion it is important to explicitly define which one is desired. Otherwise, you must know in which mode the CREATE TABLE will execute so that the correct type is used for each table. The implications of using a SET or MULTISET table are discussed further below.
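As a minimal sketch (table and column names are illustrative), stating the type explicitly makes the definition independent of the session mode:

CREATE SET TABLE employee_set
( emp_id INTEGER NOT NULL,
  emp_nm VARCHAR(50)
) PRIMARY INDEX (emp_id);

CREATE MULTISET TABLE employee_ms
( emp_id INTEGER NOT NULL,
  emp_nm VARCHAR(50)
) PRIMARY INDEX (emp_id);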
SET and MULTISET Tables
A SET table does not allow duplicate rows, so Teradata checks to ensure that no two rows in a table are exactly the same. This can be a burden. One way around the duplicate row check is to have a column in the table defined as UNIQUE. This could be a Unique Primary Index (UPI), a Unique Secondary Index (USI), or even a column with a UNIQUE or PRIMARY KEY constraint. Since all of these must be unique, a duplicate row can never exist, so the check on the index or constraint eliminates the need to examine the whole row for uniqueness. As a result, inserting new rows can be much faster because the duplicate row check is eliminated.
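As a minimal sketch (the customer table is hypothetical), a SET table with a UPI never needs the duplicate row check, because the index alone guarantees that no two rows can be identical:

CREATE SET TABLE customer
( cust_id INTEGER NOT NULL,
  cust_nm VARCHAR(50)
) UNIQUE PRIMARY INDEX (cust_id);
-- The UPI check rejects duplicate cust_id values, so a full-row comparison is never required.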
However, if the table is defined with a NUPI and the table type is SET, a duplicate row check must be performed. Since SET tables do not allow duplicate rows, a check must be performed every time a NUPI DUP (a row whose NUPI value duplicates that of an existing row) is inserted or updated in the table. Do not be fooled! A duplicate row check can be a very expensive operation in terms of processing time, because every new row inserted must be compared against every existing row with the same NUPI row hash value. The number of checks therefore grows quadratically as rows with the same NUPI value are added to the table.
What is the solution? There are two: either make the table a MULTISET table (only if you want duplicate rows to be possible) or define at least one column or combination of columns as UNIQUE. If neither is an option, the SET table with no unique columns will still work, but inserts and updates will take more time because of the mandatory duplicate row check.
Defining a UPI on a MULTISET table is absolutely valid; however, it defeats the actual purpose of a MULTISET table, since the UPI prevents duplicate rows anyway.
SET tables do not allow duplicate rows to be inserted, nor is it allowed to create a duplicate row with an UPDATE statement. MULTISET tables do not have these restrictions.
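A small sketch of both restrictions (table name and values are illustrative):

CREATE SET TABLE t_set (a INTEGER NOT NULL, b INTEGER) PRIMARY INDEX (a);
INSERT t_set VALUES (1, 10);
INSERT t_set VALUES (1, 10);          /* fails: a full duplicate of an existing row */
INSERT t_set VALUES (1, 20);          /* succeeds: same PI value, but a different row */
UPDATE t_set SET b = 10 WHERE b = 20; /* fails: the updated row would duplicate (1, 10) */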
Apart from the differences mentioned above, you may have to deal with several performance issues when working with SET tables.
In the case of an INSERT INTO ... SELECT ... FROM statement, duplicate rows are filtered out automatically for SET tables, which means that no error occurs.
If you run an INSERT INTO ... VALUES (a, b) statement, you will receive an error message if you try to insert a duplicate row.
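To make the contrast concrete, a sketch with hypothetical tables:

CREATE SET TABLE demo_set (a INTEGER NOT NULL, b INTEGER) PRIMARY INDEX (a);
CREATE MULTISET TABLE demo_src (a INTEGER NOT NULL, b INTEGER) PRIMARY INDEX (a);

INSERT demo_src VALUES (1, 1);
INSERT demo_src VALUES (1, 1);             /* allowed: MULTISET accepts the duplicate */

INSERT demo_set SELECT a, b FROM demo_src; /* succeeds: the duplicate is discarded silently, one row is stored */
INSERT demo_set VALUES (1, 1);             /* fails with a duplicate row error */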
There is no way to change a SET table into a MULTISET table after it has been created. I must admit that I have not yet found out why this limitation exists.
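The practical workaround is to create a new MULTISET table with the same structure, copy the data over, and swap the names. A sketch (my_table is a hypothetical two-column table):

CREATE MULTISET TABLE my_table_ms
( pk INTEGER NOT NULL,
  descr INTEGER NOT NULL
) PRIMARY INDEX (pk);

INSERT INTO my_table_ms SELECT * FROM my_table;
DROP TABLE my_table;
RENAME TABLE my_table_ms TO my_table;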
There is a performance impact involved with SET tables. Each time a row is inserted or updated, Teradata has to check whether the new row would violate the no-duplicate-rows rule. This is called the DUPLICATE ROW CHECK, and it will seriously degrade performance if many rows with the same primary index value are inserted. The number of checks grows quadratically with each new row added under the same primary index value!
There is no
performance impact for SET tables when there is a UPI (Unique primary
index) defined on the table. As the UPI itself ensures uniqueness, no DUPLICATE
ROW CHECK will be performed.
Instead of a UPI, you can just as well use a USI (Unique Secondary Index) or any column with a UNIQUE or PRIMARY KEY constraint. Basically, anything that ensures uniqueness, and therefore allows Teradata to bypass the DUPLICATE ROW CHECK, is welcome in such a situation.
SET tables are good candidates for performance improvement.
The easiest way to find all of them is to look at the relevant DBC.TABLES
entries:
SELECT * FROM DBC.TABLES
WHERE CheckOpt = 'N' AND TableKind = 'T'; -- SET tables

SELECT * FROM DBC.TABLES
WHERE CheckOpt = 'Y' AND TableKind = 'T'; -- MULTISET tables
Whenever the uniqueness of entries is guaranteed programmatically (such as when the loading statement contains a GROUP BY over all columns), you can easily redesign your table from SET to MULTISET; see the sketch below.
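A sketch of what "guaranteed programmatically" can look like (target_ms and staging_src are hypothetical names): the GROUP BY over all selected columns removes duplicates before the rows reach the MULTISET target, so SET semantics are not needed.

INSERT INTO target_ms
SELECT pk, descr
FROM staging_src
GROUP BY pk, descr;  -- uniqueness enforced by the GROUP BY, not by the table type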
You can achieve some performance gain, depending on how
many records per PI value you have.
Keep in mind that the number of DUPLICATE ROW CHECKS grows quadratically with the number of records per Primary Index (PI) value: with 500 rows sharing one PI value, the 500th insert alone requires 499 comparisons, and loading all 500 rows costs roughly 125,000 comparisons in total.
Here is an example to demonstrate the impact of having a SET table and many duplicate primary index values. We create two identical tables, which will be the target tables for our example:
CREATE SET TABLE TMP_SET
(
PK INTEGER NOT NULL,
DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);

CREATE MULTISET TABLE TMP_MULTISET
(
PK INTEGER NOT NULL,
DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);
-- In the next step we figure out our session id, as we will need it to analyze the resource usage:
SELECT SESSION;
7376827
-- We insert random data into the SET and the MULTISET table, but use only 500 distinct
-- primary index values to cause some impact on performance. The DESCR column has to be
-- fairly unique, as row-level duplicates would otherwise be filtered out.
INSERT INTO TMP_MULTISET
SELECT
RANDOM(1,500) AS x,
RANDOM(1,999999999) AS descr
FROM <source_table>;  -- the source table name is missing in the original post

INSERT INTO TMP_SET
SELECT
RANDOM(1,500) AS x,
RANDOM(1,999999999) AS descr
FROM <source_table>;
-- We check the CPU and disk IO usage for the SET and the MULTISET table:
SELECT * FROM DBC.DBQLOGTBL
WHERE SESSIONID = 7376827;
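(Note that this assumes query logging was already enabled for the user, for example via a BEGIN QUERY LOGGING statement; otherwise DBC.DBQLOGTBL holds no rows for the session.)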
Table    | Total IO   | CPU Seconds
SET      | 126,827.00 | 20.15
MULTISET | 3,556.00   | 0.96
The example above shows that far more disk accesses and CPU seconds are used in the case of a SET table.
Actually, there IS
a difference between SET and MULTISET even if you have a UNIQUE index:
CREATE TABLE T1 (c1 INTEGER NOT NULL, c2 INTEGER) PRIMARY INDEX (c1);
INSERT T1 VALUES (1,1);
INSERT T1 VALUES (1,2);
INSERT T1 VALUES (2,1);

CREATE SET TABLE T2_SET (c1 INTEGER NOT NULL) UNIQUE PRIMARY INDEX (c1);
CREATE MULTISET TABLE T2_MULTISET (c1 INTEGER NOT NULL) UNIQUE PRIMARY INDEX (c1);

INSERT T2_SET SELECT c1 FROM T1;      /* succeeds: quietly eliminates the duplicate and inserts 2 rows */
INSERT T2_MULTISET SELECT c1 FROM T1; /* fails with a duplicate key error */
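This ties back to the INSERT ... SELECT behavior described earlier: for the SET table, the duplicate c1 = 1 rows are filtered out before they are stored, so only one of them ever reaches the UPI check. For the MULTISET table no such filtering takes place, so both c1 = 1 rows arrive and the second one violates the unique primary index, which fails the statement.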
References:
http://www.dwhpro.com/teradata-multiset-tables/
http://lakshmikishore.blogspot.com/2010/12/table-type-specifications-of-set-vs.html