Monday 21 April 2014

Oracle - Case Insensitive Sorts & Compares

Oracle enhanced it’s case-insensitive sorts and compares feature with 10g Release 2. Version 11g has the following types of sorts:

• Binary sort
• Monolingual linguistic sort
• Multilingual linguistic sort

NLSSORT

NLSSORT returns the string of bytes used to sort char.
Both char and 'nlsparam' can be any of the datatypes CHAR, VARCHAR2, NCHAR, or NVARCHAR2. The string returned is of RAW datatype.
The value of 'nlsparam' can have the form :
'NLS_SORT = sort'
where sort is a linguistic sort sequence or BINARY. If you omit 'nlsparam', then this function uses the default sort sequence for your session. If you specify BINARY, then this function returns char.
If you specify 'nlsparam', then you can append to the linguistic sort name the suffix _ai to request an accent-insensitive sort or _ci to request a case-insensitive sort.
This function does not support CLOB data directly. However, CLOBs can be passed in as arguments through implicit data conversion.

-------------------------------------------------------/ Test Cases /-------------------------------------------------------------
Sample data and table used for examples throughout the scope of the document:-
To get us started, here's a simple sample table with a few rows in it.

Connected to Oracle Database 11g Enterprise Edition Release 11.2.0.2.0
Connected as Lalit
 
SQL> CREATE TABLE NAMES AS SELECT * FROM (
2 WITH DATA AS(
3 SELECT 'Lalit' NAME from dual UNION ALL
4 SELECT 'Srikanth' NAME from dual UNION ALL
5 SELECT 'Rahul' NAME from dual UNION ALL
6 SELECT 'Praveen' NAME from dual UNION ALL
7 SELECT 'Swetha' NAME from dual UNION ALL
8 SELECT 'Karthik' NAME from dual UNION ALL
9 SELECT 'Suresh' NAME from dual)
10 SELECT * from DATA);

Table created

SQL> select * from names;
NAME
--------
Lalit
Srikanth
Rahul
Praveen
Swetha
Karthik
Suresh

7 rows selected

The default: Binary sort

The default sort order is always binary, meaning that the characters are sorted in the order of their character number in the character set.

SQL> SHOW PARAMETER NLS_SORT;  NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
nls_sort string BINARY

SQL> SHOW PARAMETER NLS_COMP;

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
nls_comp string BINARY

So this is how the default sorting happens -

SQL> SELECT name, ascii(substr(name,1,1)) FROM names order by name;

NAME ASCII(SUBSTR(NAME,1,1))
-------- -----------------------
Karthik 75
Lalit 76
Praveen 80
Rahul 82
Srikanth 83
Suresh 83
Swetha 83
7 rows selected

SQL> SELECT name, ascii(name) FROM names order by name;

NAME ASCII(NAME)
-------- -----------
Karthik 75
Lalit 76
Praveen 80
Rahul 82
Srikanth 83
Suresh 83
Swetha 83

7 rows selected

What happens if the data is case-sensitive, the sorting of the characters changes as per the ASCII value returned. However, internally everything happens in binary format. Let’s see this happening when we alter the case sensitivity-

SQL> update names set name='kARTHIK' where name='Karthik';


1 row updated

SQL> SELECT name, ascii(name) FROM names order by NLSSORT(name,'NLS_SORT=BINARY');

NAME ASCII(NAME)
-------- -----------
Lalit 76
Praveen 80
Rahul 82
Srikanth 83
Suresh 83
Swetha 83
kARTHIK 107

7 rows selected

SQL> SELECT name, ascii(name) FROM names order by NLSSORT(name,'NLS_SORT=BINARY_CI');

NAME ASCII(NAME)
-------- -----------
kARTHIK 107
Lalit 76
Praveen 80
Rahul 82
Srikanth 83
Suresh 83
Swetha 83

7 rows selected

SQL> rollback;

Rollback complete

Let’s see the difference in the binary values when case-sensitivity matters-

SQL> select name, NLSSORT(name,'NLS_SORT=BINARY') BINARY ,NLSSORT(name,'NLS_SORT=BINARY_CI') BINARY_CI FROM names ORDER BY name;

NAME BINARY BINARY_CI
-------- ------------------- --------------------
Karthik 4B61727468696B00 6B61727468696B00
Lalit 4C616C697400 6C616C697400
Praveen 5072617665656E00 7072617665656E00
Rahul 526168756C00 726168756C00
Srikanth 5372696B616E746800 7372696B616E746800
Suresh 53757265736800 73757265736800
Swetha 53776574686100 73776574686100

7 rows selected

Now let’s see the use of these session parameters.

SQL> select * from names where name = 'Rahul';


NAME
--------
Rahul

SQL> select * from names where name = 'rahul'; --case does not match with actual data
NAME
--------
--no rows returned in this case

So, in such a case usually we tend to use UPPER/LOWER function on both sides of the operator.

SQL> select * from names where UPPER(NAME) = UPPER('rahul');

NAME
--------
Rahul

OR

SQL> select * from names where LOWER(NAME) = LOWER('RAHUL');

NAME
--------
Rahul

Now we got our output, but, the question is did we do it optimally?
What if the NAMES table is a MASTER TABLE of the national data for AADHAR CARD/VOTER ID. It will have billions of rows making the table size to some Gigabytes.
Usually, if the top SQLs have the name column in their predicate, then it is must to have a Balanced-tree index. Or even a bitmap-index if the mapping criteria fulfils for not null and unique conditions. In such a case, using UPPER or LOWER function will skip the Index scan and we hit a FTS case.

Let’s create a normal balance-tree index.

SQL> CREATE INDEX names_indx ON names(NAME);

Index created

Now let’s check the execution plan for the query using the case function (UPPER/LOWER).

SQL> explain plan for SELECT * FROM names WHERE UPPER(name)=UPPER('kArtHIk');

Explained

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 4048250084
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 6 | 15 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| NAMES | 1 | 6 | 15 (0)| 00:00:01 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(NLSSORT(UPPER("NAME"),'nls_sort=''BINARY_CI''')=HEXTORAW('
6B61727468696B00') )
Note
-----
- dynamic sampling used for this statement (level=2)
18 rows selected

So, we just ended up with a FTS. A very obvious solution that would strike in the mind is to create a function-based index.

SQL> create index names_fb_indx on names(upper(name));

Index created

SQL> alter session set nls_comp='BINARY';

Session altered

SQL> alter session set nls_sort='BINARY';

Session altered

SQL> explain plan for SELECT * FROM names WHERE UPPER(name)=UPPER('kArtHIk');

Explained

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 3644882080
--------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU
--------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 12 | 2 (0
| 1 | TABLE ACCESS BY INDEX ROWID| NAMES | 1 | 12 | 2 (0
|* 2 | INDEX RANGE SCAN | NAMES_FB_INDX | 1 | | 1 (0
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access(UPPER("NAME")='KARTHIK')
Note
-----
- dynamic sampling used for this statement (level=2)
18 rows selected

Although creating a virtual column in 11g and up would also suffice. But it is similar to a function-based index. But doing either of them is nothing but adding an overhead to the existing design. We need to maintain the indexes in an index table space. It will maintain the rowid, value of the data, but in the same case that of the original data in table.
So, let’s see for a better approach.

SQL> alter session set nls_comp='LINGUISTIC';

Session altered

SQL> alter session set nls_sort='BINARY_CI';

Session altered

SQL> SELECT * FROM names WHERE name='kArtHIk';

NAME
--------
Karthik

SQL> explain plan for SELECT * FROM names WHERE name='kArtHIk';

Explained

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 4048250084
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 6 | 15 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| NAMES | 1 | 6 | 15 (0)| 00:00:01 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(NLSSORT("NAME",'nls_sort=''BINARY_CI''')=HEXTORAW('6B61727
468696B00') )
Note
-----
- dynamic sampling used for this statement (level=2)
18 rows selected

Everything seems fine. Task is achieved, but still we can tune it to next level.

Case Insensitive Indexes

A normal index uses the default sort order, by default, and so it's useless in a case-insensitive search - and Oracle won't use it. For large tables, the resulting full table scan can be quite a performance penalty. Fortunately, it's easy to create an index that uses a specific sort order. You simply create a function-based index that uses the NLSSORT function we saw above.

SQL> create index names_ci_indx on names(NLSSORT(name,'NLS_SORT=BINARY_CI'));

Index created

SQL> explain plan for SELECT * FROM names WHERE name='kArtHIk';

Explained

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 2561608627
--------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU
--------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 45 | 2 (0
| 1 | TABLE ACCESS BY INDEX ROWID| NAMES | 1 | 45 | 2 (0
|* 2 | INDEX RANGE SCAN | NAMES_CI_INDX | 1 | | 1 (0
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access(NLSSORT("NAME",'nls_sort=''BINARY_CI''')=HEXTORAW('6B61727468696B0
Note
-----
- dynamic sampling used for this statement (level=2)
18 rows selected

No comments:

Post a Comment