Posted: Sun Mar 23, 2003 8:06 am Post subject: FAQ on SQL Statement Performance and Evaluation
FAQ For Evaluating SQL queries by Glenn (old, but helpful here too)
This is a guide for asking and evaluating questions about SQL queries, pieced together from prior posts and examples. It should help you ask questions about your SQL statements and supply the information we need to answer them quickly and without a huge fuss.
1) What is meant by evaluation of SQL queries?
This information relates to evaluating performance factors and to questions about tuning SQL queries. It is not related in any way to proper syntax or to completing a particular desired query - just present your query and we can attempt to help. Remember, though, as with most help on SQL, that any queries given in reply may not work, because a few of us do not have the means to test the statements right at hand.
2) What kind of questions are acceptable in this sense?
Questions should be asked about concrete cases occurring in specific databases. Any theoretical case will always get the answer "it depends". There are too many variables in specific database implementations to say definitively that one way of doing something is always better than another, though there are some general consistencies. (note, find article link for this)
3) What are these variables in database implementations?
These variables include which indexes exist on any of the fields, how the indexes are ordered, the indexes' clustering factors, the number of rows in the tables, and so on.
4) OK, I have a real SQL statement case which I need evaluated. What do I need to do in asking this question?
Present the following information: your complete SQL statement, the table definitions that apply to the fields used in the statement, any and all indexes the tables possess, and the current EXPLAIN of the statement. This gives us the complete story and lets us understand your issue. Please do not abbreviate or leave anything out (one condition in your statement can change the answer). Also be prepared to answer statistical questions about all programs which access the tables in question (i.e. generally which fields are used in WHERE clauses most of the time, which fields are requested in ORDER BY/GROUP BY, and in what sort order).
5) I do not know how to EXPLAIN a statement. How do I do this?
The EXPLAIN can be run from any standard SQL interface, or on a program when a bind is executed. To run the EXPLAIN, a plan table must be defined for the current user by the DBA. This table is defined as follows:
Code:
CREATE TABLE "<userid>"."PLAN_TABLE"
("QUERYNO" INTEGER NOT NULL,
"QBLOCKNO" SMALLINT NOT NULL,
"APPLNAME" CHAR(8) NOT NULL,
"PROGNAME" CHAR(8) NOT NULL,
"PLANNO" SMALLINT NOT NULL,
"METHOD" SMALLINT NOT NULL,
"CREATOR" CHAR(8) NOT NULL,
"TNAME" CHAR(18) NOT NULL,
"TABNO" SMALLINT NOT NULL,
"ACCESSTYPE" CHAR(2) NOT NULL,
"MATCHCOLS" SMALLINT NOT NULL,
"ACCESSCREATOR" CHAR(8) NOT NULL,
"ACCESSNAME" CHAR(18) NOT NULL,
"INDEXONLY" CHAR(1) NOT NULL,
"SORTN_UNIQ" CHAR(1) NOT NULL,
"SORTN_JOIN" CHAR(1) NOT NULL,
"SORTN_ORDERBY" CHAR(1) NOT NULL,
"SORTN_GROUPBY" CHAR(1) NOT NULL,
"SORTC_UNIQ" CHAR(1) NOT NULL,
"SORTC_JOIN" CHAR(1) NOT NULL,
"SORTC_ORDERBY" CHAR(1) NOT NULL,
"SORTC_GROUPBY" CHAR(1) NOT NULL,
"TSLOCKMODE" CHAR(3) NOT NULL,
"TIMESTAMP" CHAR(16) NOT NULL,
"REMARKS" VARCHAR(254) NOT NULL,
"PREFETCH" CHAR(1) NOT NULL WITH DEFAULT,
"COLUMN_FN_EVAL" CHAR(1) NOT NULL WITH DEFAULT,
"MIXOPSEQ" SMALLINT NOT NULL WITH DEFAULT,
"VERSION" VARCHAR(64) NOT NULL WITH DEFAULT,
"COLLID" CHAR(18) NOT NULL WITH DEFAULT,
"ACCESS_DEGREE" SMALLINT ,
"ACCESS_PGROUP_ID" SMALLINT ,
"JOIN_DEGREE" SMALLINT ,
"JOIN_PGROUP_ID" SMALLINT ,
"SORTC_PGROUP_ID" SMALLINT ,
"SORTN_PGROUP_ID" SMALLINT ,
"PARALLELISM_MODE" CHAR(1) ,
"MERGE_JOIN_COLS" SMALLINT ,
"CORRELATION_NAME" CHAR(18) ,
"PAGE_RANGE" CHAR(1) NOT NULL,
"JOIN_TYPE" CHAR(1) NOT NULL,
"GROUP_MEMBER" CHAR(8) NOT NULL,
"IBM_SERVICE_DATA" VARCHAR(254) NOT NULL,
"WHEN_OPTIMIZE" CHAR(1) NOT NULL WITH DEFAULT,
"QBLOCK_TYPE" CHAR(6) NOT NULL WITH DEFAULT,
"BIND_TIME" TIMESTAMP NOT NULL WITH DEFAULT,
"OPTHINT" CHAR(8) NOT NULL WITH DEFAULT,
"HINT_USED" CHAR(8) NOT NULL WITH DEFAULT,
"PRIMARY_ACCESSTYPE" CHAR(1) NOT NULL WITH DEFAULT)
IN "<database>"."<tablespace>";
In DB2 version 6 or above there is another table which I've found useful in other ways - it can help point out statements in a complete system which might cause problems. Its definition is below:
Code:
CREATE TABLE "<userid>"."DSN_STATEMNT_TABLE"
("QUERYNO" INTEGER NOT NULL WITH DEFAULT,
"APPLNAME" CHAR(8) NOT NULL WITH DEFAULT,
"PROGNAME" CHAR(8) NOT NULL WITH DEFAULT,
"COLLID" CHAR(18) NOT NULL WITH DEFAULT,
"GROUP_MEMBER" CHAR(8) NOT NULL WITH DEFAULT,
"EXPLAIN_TIME" TIMESTAMP NOT NULL WITH DEFAULT,
"STMT_TYPE" CHAR(6) NOT NULL WITH DEFAULT,
"COST_CATEGORY" CHAR(1) NOT NULL WITH DEFAULT,
"PROCMS" INTEGER NOT NULL WITH DEFAULT,
"PROCSU" INTEGER NOT NULL WITH DEFAULT,
"REASON" VARCHAR(254) NOT NULL WITH DEFAULT)
IN "<database>"."<tablespace>";
To run the explain, the tables as defined must be present - an error will be given if they are not present and accessible. To explain an SQL statement from the SPUFI environment, the line "EXPLAIN ALL SET QUERYNO=X FOR" is added to the beginning of the statement. To run explain on a program, the parm EXPLAIN(YES) should be added to the bind for the program, then the bind executed.
As an aid, examples of SQL to access information in the EXPLAIN tables are presented below - all the fields likely to be of interest are included in these selects:
In SPUFI for a specific statement:
Code:
EXPLAIN ALL SET QUERYNO = 1 FOR
<statement>;
SELECT QUERYNO, PLANNO, TABNO, QBLOCK_TYPE, TNAME, METHOD,
TSLOCKMODE, INDEXONLY, ACCESSNAME, ACCESSTYPE, MATCHCOLS,
PREFETCH, PAGE_RANGE, COLUMN_FN_EVAL, SORTN_UNIQ, SORTN_JOIN,
SORTN_ORDERBY, SORTN_GROUPBY, SORTC_UNIQ, SORTC_JOIN,
SORTC_ORDERBY, SORTC_GROUPBY
FROM <userid>.PLAN_TABLE
WHERE QUERYNO = 1
ORDER BY QUERYNO, QBLOCKNO, TABNO;
DELETE FROM <userid>.PLAN_TABLE
WHERE QUERYNO = 1;
COMMIT;
In SPUFI for a program which was bound with EXPLAIN(YES):
Code:
SELECT PROGNAME, QUERYNO, TABNO, PLANNO, QBLOCK_TYPE, TNAME, METHOD,
TSLOCKMODE, INDEXONLY, ACCESSNAME, ACCESSTYPE, MATCHCOLS,
PREFETCH, PAGE_RANGE, COLUMN_FN_EVAL, SORTN_UNIQ, SORTN_JOIN,
SORTN_ORDERBY, SORTN_GROUPBY, SORTC_UNIQ, SORTC_JOIN,
SORTC_ORDERBY, SORTC_GROUPBY
FROM <userid>.PLAN_TABLE
WHERE PROGNAME = '<program name>'
ORDER BY QUERYNO, QBLOCKNO, TABNO;
COMMIT;
If DSN_STATEMNT_TABLE is available, PROGNAME, STMT_TYPE, PROCMS, and PROCSU are the usual fields of interest.
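For example, a query along these lines (a sketch - substitute your own table owner and program name) lists the optimizer's cost estimates for each statement in a program bound with EXPLAIN(YES), most expensive first:
Code:
SELECT PROGNAME, QUERYNO, STMT_TYPE, COST_CATEGORY, PROCMS, PROCSU, REASON
FROM <userid>.DSN_STATEMNT_TABLE
WHERE PROGNAME = '<program name>'
ORDER BY PROCSU DESC;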
6) OK, so I have this information now and might want to experiment a bit to find the answer for modifying an inefficient statement. What approach should I take in gathering information from these SELECT statements?
I will describe one possible approach for piecing together the information returned by these statements.
First look at how many rows are returned. Multiple rows indicate that multiple steps were taken to satisfy the statement. Note what information is or is not present on each row. If information is missing from a row, then the data was retrieved and then sorted as a separate step. Usually one of the sort fields (the columns whose names begin with SORT) will have a Y on that row. If one of the SORT fields is Y, the suffix of the column name indicates why the sort was performed: UNIQ means the sort was performed to make the data unique (i.e. only one occurrence returned rather than multiple occurrences); JOIN means the sort was performed to satisfy a join in the statement; ORDERBY means the sort was performed as requested by an ORDER BY clause; GROUPBY means the sort was performed to satisfy a GROUP BY clause in your statement.
A next step is to look at which table names (TNAME) and indexes (ACCESSNAME, INDEXONLY, MATCHCOLS) are listed on the rows. What indexes are used (ACCESSNAME)? Was the index alone enough to satisfy the statement (INDEXONLY)? How many columns of the index were matched (MATCHCOLS)?
The next step is to consider how the data are accessed in each step (ACCESSTYPE). Was a tablespace scan used (ACCESSTYPE = R)? Was direct indexed access used (I)? Or was some form of multiple index access used (several rows with access types M, MX, and MI)?
The final step is to look at the PREFETCH column. This indicates whether multiple pages were read into memory at the same time while performing the statement. The field is blank unless some form of prefetch is used. PREFETCH = S means sequential prefetch was used (i.e. DB2 could pick up the next several pages and hold them in memory while it processes the current rows). PREFETCH = L means list prefetch was used: the system built a list of the required rows, typically from the index, so it could retrieve them in sequence and save time locating the rows.
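As a made-up illustration (the table, index, and values here are hypothetical, not from a real system), a pair of PLAN_TABLE rows roughly like the following would be read as: step 1 accessed table ORDERS through index ORDERS_IX1, matching two index columns and using sequential prefetch; step 2 has no table or access information, so it is a separate sort of the result to satisfy an ORDER BY:
Code:
QUERYNO  PLANNO  TNAME    METHOD  ACCESSTYPE  ACCESSNAME  MATCHCOLS  PREFETCH  SORTC_ORDERBY
      1       1  ORDERS        0  I           ORDERS_IX1          2  S         N
      1       2                3                                  0            Y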
7) Now that I have a general picture of the statement, what kinds of things can I try looking at to improve upon the statement?
Improving an SQL statement is usually an act of experimentation - you will have to rewrite and re-test statements until you hit upon the one which satisfies your current needs.
A good first step is to visualize the current process and the process you would use if you were writing the program yourself. For example, does the logical process (given a two-table join) seem to be to access both indexes and match up values, while the statement is actually using one index and then tablespace scanning for the other value? Does the process you expect match up with what the EXPLAIN indicates? What should be the minimum work necessary to satisfy the request? Could the statement be rewritten in another way? (Note: I've found you can usually write most mildly complex statements in about four different ways to return the same result.)
The next step, if the statement cannot be improved upon, is to start looking at the database. Remember the general rule that most program performance is 80% coding and 20% database. Look at the WHERE and ORDER BY clauses (along with joins). Do they match up with any of the indexes you have? Say your index is in F1, F2 DESC, F3 order and the ORDER BY is F3, F2 DESC, F1: the index is out of order for that request and therefore unusable, even though the fields are all present. Remember that an index may be used for one or more of its leading fields as long as the statement supports using the index. The example index above may be used to directly locate rows for any WHERE clause on F1; on F1 and F2; or on F1, F2, and F3. If the index is out of order for enough statements in the system, given the kinds of WHERE clauses and orderings used, this may indicate that the index needs to be changed. That is of course a judgement call based on how often the fields are used and how frequently those statements execute - which is why general statistics on everything that accesses a table can be useful.
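As a sketch of that point (the table T1, index IX1, and host variables here are made up for illustration):
Code:
-- Assumed index: CREATE INDEX IX1 ON T1 (F1, F2 DESC, F3);

-- Matching index access is possible: the leading column F1 is supplied.
SELECT F1, F2, F3
FROM T1
WHERE F1 = :VALUE1 AND F2 = :VALUE2;

-- The leading column F1 is not supplied, so the index cannot be matched
-- and a tablespace scan (or non-matching index scan) is likely.
SELECT F1, F2, F3
FROM T1
WHERE F3 = :VALUE3;

-- An ORDER BY in the same order as the index can avoid a sort;
-- ORDER BY F3, F2 DESC, F1 (the reverse) generally cannot.
SELECT F1, F2, F3
FROM T1
WHERE F1 = :VALUE1
ORDER BY F1, F2 DESC, F3;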
If the statement still cannot be improved to an acceptable level and must be, then there is a final option. Take a close look at the frequency of joins occurring in the system and the fields in each of the tables. If a join must occur to access the data at every instance, or it seems a tablespace scan must happen to access a subset of the data, this may mean the database design is faulty (study your normalization rules) or that a denormalization may be useful. Of course, this is a drastic choice, but it may be necessary to improve the performance of the complete system.
Remember too that a few statements may not be optimized completely. I have encountered a couple of cases like this.
8) What are some things I can look at to change the performance of SQL statements?
Here is a list of things to look at with short descriptions where necessary - a list of glittering generalities as it were:
Attempt to simplify the statement as much as possible. This means removing any fields which are unnecessary to access, any sorts on fields which are unnecessary, and any processing which is unnecessary. It even means looking at the corresponding program to evaluate whether the statement is needed at all (an example: doing a SELECT COUNT(*) to check for a duplicate key before performing an INSERT - the duplicate can instead be detected when the INSERT itself is performed).
Construct a statement using an ordinary WHERE comparison in place of a LIKE clause if possible. WHERE FIELD1 LIKE 'JOHN%' can often be replaced with a simple equality or range condition.
Add conditions to JOIN statements. For example, if a condition exists on a join field for one table, try adding the same condition for the other side of the join. If a situation is present in a join where an index would be useful but it is not being used by any condition, try adding a condition on the indexed field (e.g. FIELD1 >= ' ') - see the sketch after this list.
Look at the indexes for the following factors: the order of the columns (the most significant limiting values should be specified first - i.e. an index for a location should be CITY, STATE and not STATE, CITY), whether the index supports the ordering requested in most of the statements, and the clustering factor.
Use predictive processing - the most common condition should occur first in any set of values being evaluated. Example: if most of our customers are in the state of MD, place that value at the beginning of an IN list.
Use IN in place of OR when the OR conditions involve the same field (see the sketch after this list).
Rewrite subselects as joins. This is one of the mysteries I've come across - sometimes joining two tables is quicker than performing a subselect (see the sketch after this list).
Always recompile and rebind after the DBA makes any structural changes to your tables or upgrades DB2. Making sure REORG and RUNSTATS are run is also useful at this point.
Do not use functions on indexed fields if possible. Doing so generally removes the index from consideration.
Add an extra simple condition to force usage of the index. An example: WHERE (F1 > :VALUE1) OR ((F1 = :VALUE1) AND (F2 > :VALUE2)) can be improved by changing it to WHERE (F1 >= :VALUE1) AND ((F1 > :VALUE1) OR ((F1 = :VALUE1) AND (F2 > :VALUE2))). The index is not used in the first case, while it is used in the second.
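To make a few of the points above concrete (the items on join conditions, IN versus OR, and subselects), here is a sketch; the tables CUST and ORDERS, their columns, and the host variables are hypothetical, and only the shape of each rewrite matters:
Code:
-- IN in place of OR on the same field:
--   before: WHERE STATE = 'MD' OR STATE = 'VA' OR STATE = 'PA'
SELECT CUST_ID, NAME
FROM CUST
WHERE STATE IN ('MD', 'VA', 'PA');

-- A subselect rewritten as a join (sometimes quicker, as noted above):
--   before: SELECT NAME FROM CUST
--           WHERE CUST_ID IN (SELECT CUST_ID FROM ORDERS
--                             WHERE ORDER_DT = CURRENT DATE);
SELECT DISTINCT C.NAME
FROM CUST C, ORDERS O
WHERE C.CUST_ID = O.CUST_ID
AND O.ORDER_DT = CURRENT DATE;

-- Repeating a limiting condition on both sides of the join predicate,
-- which can give the optimizer an indexed path into either table:
SELECT C.NAME, O.ORDER_NO
FROM CUST C, ORDERS O
WHERE C.CUST_ID = O.CUST_ID
AND C.CUST_ID = :CUST-ID
AND O.CUST_ID = :CUST-ID;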
Final Thought
I hope this will turn out to be a good reference we can point people to as a start. Please feel free to add to, change, or correct anything that was mentioned here...
A trick I use to increase the possibility of indexed processing on a multi-column index in COBOL is to use BETWEEN low-value and high-value.
For example (simplified): let's say I have a 3-column index on the Accounts table as follows:
store
branch
acct
One additional column (not in the index) is product. Without knowing the acct(s) ahead of time, my SQL is as follows:
Code:
Select store, branch, acct, product
from Accounts
where Store = '001'
and branch = '123'
and product = 'Q';
In my experience, DB2 doesn't use the index. What I do to steer DB2 into using the index is to add acct into the equation as follows:
Code:
Select store, branch, acct, product
from Accounts
where Store = '001'
and branch = '123'
and (acct between :acct-low and :acct-high)
and product = 'Q';
Where acct-low is LOW-VALUES and acct-high is HIGH-VALUES. Although I don't know ahead of time which accounts I'm looking for, I can still use acct as described above to steer DB2 into using the index. This can have a huge effect on performance.