Feb 7, 2016

Teradata: Teradata Multiset vs. Set Tables

SET VS MULTISET
Table Type Specifications of SET VS MULTISET
There are two table type philosophies, so there are two table types: SET and MULTISET. It has been said, "A man with one watch knows the time, but a man with two watches is never sure." When Teradata was originally designed, it did not allow duplicate rows in a table. If any row in a table had the same values in every column as another row, Teradata would throw one of the rows out. The designers believed a second identical row was a mistake: why would someone need two watches, and why would someone need two rows exactly the same? This is SET theory, and a SET table kicks out duplicate rows. The ANSI standard took a different philosophy: if two rows entered into a table are exact duplicates, that is acceptable. If a person wants to wear two watches, they probably have a good reason. This is a MULTISET table, and duplicate rows are allowed. If you do not specify SET or MULTISET, one is used as a default. Here is the issue: the default in Teradata mode is SET, and the default in ANSI mode is MULTISET.
Therefore, to eliminate confusion it is important to explicitly define which one is desired. Otherwise, you must know in which mode the CREATE TABLE will execute so that the correct type is used for each table. The implications of using a SET or MULTISET table are discussed further below.
SET and MULTISET Tables
A SET table does not allow duplicate rows, so Teradata checks to ensure that no two rows in the table are exactly the same. This can be a burden. One way around the duplicate row check is to have a column in the table defined as unique. This could be a Unique Primary Index (UPI), a Unique Secondary Index (USI), or a column with a UNIQUE or PRIMARY KEY constraint. Since all values in that column must be unique, a duplicate row can never exist. Therefore, the check on the index or constraint eliminates the need to examine the whole row for uniqueness. As a result, inserting new rows can be much faster because the duplicate row check is eliminated.
However, if a SET table is defined with a NUPI, a duplicate row check must be performed. Since SET tables do not allow duplicate rows, a check must be performed every time a NUPI DUP (a row duplicating an existing NUPI value) is inserted or updated in the table. Do not be fooled! A duplicate row check can be a very expensive operation in terms of processing time, because every new row inserted must be compared against every existing row with the same NUPI row hash value. The amount of checking grows rapidly as more and more rows share the same NUPI value.
What is the solution? There are two: either make the table a MULTISET table (only if you want duplicate rows to be possible) or define at least one column, or combination of columns, as unique. If neither is an option, a SET table with no unique columns will still work, but inserts and updates will take more time because of the mandatory duplicate row check.
Defining a UPI on a MULTISET table is perfectly valid, but it defeats the purpose of a MULTISET table: the unique index prevents exactly the duplicate rows the table type is meant to allow.
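To make the combinations concrete, here is a sketch of the three common cases (table and column names are illustrative, not from any particular schema):

```sql
-- SET table with a NUPI: Teradata must perform the duplicate row check
-- on every INSERT/UPDATE that lands on an existing row hash.
CREATE SET TABLE employee_set
(
   emp_no  INTEGER NOT NULL,
   dept_no INTEGER NOT NULL
) PRIMARY INDEX (dept_no);

-- SET table with a UPI: the index itself guarantees uniqueness,
-- so the duplicate row check is bypassed.
CREATE SET TABLE employee_set_upi
(
   emp_no  INTEGER NOT NULL,
   dept_no INTEGER NOT NULL
) UNIQUE PRIMARY INDEX (emp_no);

-- MULTISET table with a NUPI: duplicate rows are allowed,
-- so no duplicate row check is needed at all.
CREATE MULTISET TABLE employee_multiset
(
   emp_no  INTEGER NOT NULL,
   dept_no INTEGER NOT NULL
) PRIMARY INDEX (dept_no);
```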


SET tables do not allow duplicate rows to be inserted, nor can a duplicate row be created with an UPDATE statement. MULTISET tables have neither restriction.
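A sketch of the UPDATE restriction (names illustrative): if an update would make a row identical to an existing row, a SET table rejects it.

```sql
CREATE SET TABLE orders_set
(
   order_no INTEGER NOT NULL,
   status   INTEGER NOT NULL
) PRIMARY INDEX (order_no);

INSERT INTO orders_set VALUES (1, 10);
INSERT INTO orders_set VALUES (1, 20);

-- This UPDATE would turn row (1, 20) into (1, 10), an exact duplicate
-- of the existing row, so on a SET table it fails with a duplicate row error.
UPDATE orders_set SET status = 10 WHERE order_no = 1 AND status = 20;
```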

Apart from differences mentioned above, you may have to deal with several performance issues when working with SET tables.

In the case of an INSERT INTO ... SELECT statement, duplicate rows will be filtered out automatically for SET tables, which means that no error occurs.

If you run an INSERT INTO ... VALUES statement, you will receive an error message if you try to insert a duplicate row.

There is no way to change a SET table into a MULTISET table after it has been created. I must admit that I have not yet found out why this limitation exists.
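The usual workaround is to recreate the table with the desired type and copy the data over. A sketch (table names illustrative):

```sql
-- Create an empty MULTISET copy of the SET table's definition ...
CREATE MULTISET TABLE tmp_new AS tmp_set WITH NO DATA;

-- ... copy the rows, then swap the tables.
INSERT INTO tmp_new SELECT * FROM tmp_set;
DROP TABLE tmp_set;
RENAME TABLE tmp_new TO tmp_set;
```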

There is a performance impact involved with SET tables. Each time a row is inserted or updated, Teradata has to check whether the new row would duplicate an existing one. This is called the DUPLICATE ROW CHECK, and it will seriously degrade performance if many rows with the same primary index value are inserted: each new row is compared against all existing rows with the same row hash, so the total amount of checking grows quadratically with the number of rows per primary index value.

There is no performance impact for SET tables when a UPI (Unique Primary Index) is defined on the table. As the UPI itself ensures uniqueness, no DUPLICATE ROW CHECK is performed.

Instead of an UPI you can as well use any USI (Unique Secondary Index), or any column with a UNIQUE or PRIMARY KEY constraint.
Basically, anything which ensures uniqueness and therefore allows Teradata to bypass the DUPLICATE ROW CHECK, is welcome in such a situation.
SET tables are good candidates for performance improvement. The easiest way to find all of them is to look at the relevant DBC.TABLES entries:
SELECT * FROM DBC.TABLES WHERE CHECKOPT = 'N' AND TABLEKIND = 'T'; -- SET tables
SELECT * FROM DBC.TABLES WHERE CHECKOPT = 'Y' AND TABLEKIND = 'T'; -- MULTISET tables

Whenever the uniqueness of the inserted rows is guaranteed programmatically – as is the case with a GROUP BY statement – you can easily redesign a table from SET to MULTISET.
You can achieve some performance gain, depending on how many records per PI value you have.
Keep in mind that the amount of DUPLICATE ROW CHECK work grows quadratically with the number of records per Primary Index (PI) value.
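For example, when the load statement itself guarantees distinct rows, a MULTISET target loses nothing (table and column names illustrative):

```sql
-- The GROUP BY guarantees that no two inserted rows are identical,
-- so the duplicate row check of a SET table would be pure overhead here.
INSERT INTO sales_multiset (store_no, sales_total)
SELECT store_no, SUM(amount)
FROM daily_sales
GROUP BY store_no;
```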
Here is an example to demonstrate the impact of having a SET table, and many duplicate primary index values:
We create two identical tables, which will be the target tables for our example:
CREATE SET TABLE TMP_SET
(
   PK    INTEGER NOT NULL,
   DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);

CREATE MULTISET TABLE TMP_MULTISET
(
   PK    INTEGER NOT NULL,
   DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);
-- In the next step we figure out our session id, as we will need it to analyze the resource usage:
SELECT SESSION;
7376827
-- We insert random data into the SET and the MULTISET table, but use only 500 distinct
-- primary index values to cause some impact on performance. The "descr" column has to be
-- fairly unique, as row-level duplicates would otherwise be filtered out.
-- (<any_large_table> is a placeholder: use any table with enough rows to drive the inserts.)
INSERT INTO TMP_MULTISET
SELECT
   RANDOM(1,500) AS x,
   RANDOM(1,999999999) AS descr
FROM <any_large_table>;

INSERT INTO TMP_SET
SELECT
   RANDOM(1,500) AS x,
   RANDOM(1,999999999) AS descr
FROM <any_large_table>;
-- We check the CPU and disk I/O usage for the SET and the MULTISET table:
SELECT * FROM DBC.DBQLOGTBL WHERE SESSIONID = 7376827;
            Total IO     CPU Seconds
Set         126,827.00   20.15
Multiset      3,556.00    0.96

The example above shows that far more disk accesses and CPU seconds are used in the case of the SET table.

Actually, there IS a difference between SET and MULTISET even if you have a UNIQUE index:
CREATE TABLE T1 (c1 INTEGER NOT NULL, c2 INTEGER) PRIMARY INDEX (c1);
INSERT T1 VALUES (1,1);
INSERT T1 VALUES (1,2);
INSERT T1 VALUES (2,1);

CREATE SET TABLE T2_SET (c1 INTEGER NOT NULL) UNIQUE PRIMARY INDEX (c1);

CREATE MULTISET TABLE T2_MULTISET (c1 INTEGER NOT NULL) UNIQUE PRIMARY INDEX (c1);

INSERT T2_SET SELECT c1 FROM T1; /* succeeds - quietly eliminates the duplicate and inserts 2 rows */

INSERT T2_MULTISET SELECT c1 FROM T1; /* fails with duplicate key error */

Reference:
http://www.dwhpro.com/teradata-multiset-tables/
http://lakshmikishore.blogspot.com/2010/12/table-type-specifications-of-set-vs.html
