Sep 30 2008

Avoiding Cross Database Ownership Chaining

Tag: SQL Server, sqlJeremiah Peschka @ 7:23 pm

Cross database ownership chaining is bad. Very bad. How bad? Despite being included in SQL Server 2005, this feature is disabled by default. SQL Server 2005 Books Online goes on to say “Setting cross db ownership chaining to 1 is not recommended unless all of the databases hosted by the instance of SQL Server must participate in cross-database ownership chaining and you are aware of the security implications of this setting. For more information, see Ownership Chains.”

Why would you want to use this? Well, you might want to write a stored procedure or function that pulls data from two databases in the same query joining between the two of them.

At my current client, I have two databases on the server — a data mart and a web reporting database. There are materialized report tables in the web reporting database. Some of these tables are only used to provide filters for pulling data from the data mart. Access to the database is very limited — the web user can only execute stored procedures.

This afternoon, after struggling with cross database permission chaining for a day and a half, I posted on sqlservercentral.com’s forums. I was ultimately helped out by a user who suggested using the database option TRUSTWORTHY. I had never heard of the TRUSTWORTHY option, so I had to do some digging to get a better understanding before I suggested it as an option to my client.

What does the TRUSTWORTHY option do?

“When ON, database modules (for example, user-defined functions or stored procedures) that use an impersonation context can access resources outside the database.

When OFF is specified, in an impersonation context cannot access resources outside the database.

TRUSTWORTHY is set to OFF whenever the database is attached.”

TRUSTWORTHY allows databases to use impersonation to access resources in a different database. Here’s an example:

  • There are two databases on the server: data_mart and web_reports
  • There is a server level login: test_user
  • The test_user login is mapped to the test_user user on both data_mart and web_reports

At this point, TRUSTWORTHY is set to OFF. If I try to execute a stored procedure in the web_reports database, I will get an error message that looks like:

The server principal “test_user” is not able to access the database “data_mart” under the current security context.

The only way to fix this is to turn on the TRUSTWORTHY option on both databases:

ALTER DATABASE web_reports SET TRUSTWORTHY ON;
GO
ALTER DATABASE data_mart SET TRUSTWORTHY ON;
GO

After executing the ALTER DATABASE code, stored procedures that cross databases will execute properly.

It’s important to note that, in this situation, the stored procedures will either need to be owned by test_user or else they will need to be set to EXECUTE AS ‘test_user’.


Sep 24 2008

Cursing at cursors

Tag: SQL Server, code, sqlJeremiah Peschka @ 7:39 pm

Cursors. Many database developers have to suppress a shudder of horror when they open up a stored procedure only to find a cursor lurking inside. They certainly have their place in the database developer’s arsenal of tools, but they can usually be rewritten as an easily understood set operation.

I’ve spent the first three days at my new client learning the database schema and migrating four load routines from the old to the new database schema. The new schema is, admittedly, easier to understand simply because of the virtue of using design up front principles. But I digress.

These four load processes make use of cursors and, in one instance, nested cursors.

Without divulging too much of the client’s inner workings, the first cursor I encountered looked something like this:


DECLARE @start_date DATETIME;
DECLARE @end_date DATETIME;
DECLARE @end_date2 DATETIME;
DECLARE @account_no INT;

DECLARE my_cursor CURSOR
FOR
  SELECT account_no, start_date, end_date
  FROM Bill;
-- obviously the real query was more complicated than this

OPEN my_cursor;
FETCH NEXT FROM my_cursor
INTO @account_no, @start_date, @end_date;

SET @end_date2 = @start_date

WHILE @@fetch_status = 0
BEGIN
  IF DATEDIFF(dd, @start_date1, @end_date2) > 7
    INSERT INTO some_table (some values);
  END
END;

SET @end_date2 = @end_date;

FETCH NEXT FROM my_cursor
INTO @account_no, @start_date, @end_date;

Essentially, this will insert any records that are at least 7 days older than the previous record. The downside is that, even in my overly simplified example, this isn’t easy to read. When the full query logic is involved this becomes very difficult to read.

I re-wrote the original query using a CTE:


WITH my_cte (id, account_no, start_date, end_date)
AS (
  SELECT
    ROW_NUMBER() OVER (account_no, start_date, end_date, some_values),
    account_no,
    start_date,
    end_date,
    some_values
  FROM Bills
  GROUP BY account_no, start_date, end_date
)
INSERT INTO some_table (some_columns)
SELECT DISTINCT
  account_no,
  start_date,
  end_date,
  some_values
FROM my_cte AS o
INNER JOIN my_cte AS o2 ON o2.account_no = o.account_no
WHERE o2.start_date < o.start_date
  AND DATEDIFF(dd, o2.start_date, o.start_date) > 7

Instead of having to process each record one at a time, the new query is able to process several million records and insert approximately 60,000 records in 11 seconds and is much more readable and reusable as a result.

This CTE pattern has become the basis for several data load procedures that I’ve re-written from cursors, including a pair of nested cursors.

Special thanks, once again, to my brother for helping me rubber duck my way through this and for providing the initial idea that resulted in using a CTE.


Sep 22 2008

Get dates quickly with SQL

Tag: SQL Server, code, sqlJeremiah Peschka @ 6:39 pm

That’s right, now you too can use T-SQL to get dates quickly!

Today at the client I came across the following chunk of SQL in a where clause:

dateadd(m,-24,(cast(datepart(m,getdate()) as varchar(2))+'/1/'+cast(datepart(yyyy,getdate()) as varchar(4))))

It’s used to return the first of the month two years ago. So, for today, this would return 2006-09-01

What’s wrong with this? Well, in this query there are two string concatenations, two explicit conversions from an INTEGER to a VARCHAR, and an implicit conversion from VARCHAR to DATETIME. When run across a large number of dates, this can rapidly become an expensive operation. Especially considering that this was half of a BETWEEN statement that had something similar on the other side of the AND.

Is there a better way to do this? You bet!

SELECT DATEADD(mm, -24, DATEADD(mm,DATEDIFF(mm,0,GETDATE()),0))

This involves no implicit or explicit casting, just some basic date math. While I haven’t done any benchmarking on either option, I’m going to go out on a limb and state that the second option is much more efficient and, arguably, a lot clearer.

In addition to being cleaner, using DATEADD and DATEDIFF makes no assumption about the current locale of your database. This could save you if you are dealing with databases that are spread across the globe or if you have to deal with localized tables within a database.

More information can be found either from Pinal Dave or from Shane Cooper. I’ve bookmarked both of these links and, if you’ve ever had to write SQL code, you should to.

N.B. If you’re using this over any more than a few records you will want to create a variable to store the result of the date math since date functions are non-deterministic. Meaning, if you perform a SELECT on 2 million records, the math will be performed at least 2 million times (4 million if you’re using a BETWEEN).

Edit: Special thanks to my brother for coming to my rescue today and helping me track down an easier, SQL-based, way to do this.


Sep 21 2008

CodeThinked | What is your strategy for becoming a better developer?

Tag: codeJeremiah Peschka @ 12:08 pm

My amazing co-worker pointed out a great post from Justin Etheredge on strategies for becoming a better developer.

Self-improvement is a huge thing for me, both personally and professionally. I like both Justin and Rick’s ideas - Emulate, Get Involved, Read, and Stretch.

My strategies tie these all together and are pretty simple, really: Learn and Teach.

Teaching

I’ve gained the greatest understanding of programming by sharing my knowledge with other developers. Teaching has forced me to develop a solid understand of whatever subject I’m teaching.

But teaching doesn’t just involve teaching brown bags or presenting at user groups/conferences. Teaching also includes mentoring colleagues, co-workers, and junior developers. By mentoring, you develop a longer term teaching relationship. You grow along with the person you’re mentoring. Their questions often delve into areas of expertise that you may never have investigated which gets you both learning that subject together and makes use of the collective intelligence to improve your mutual understanding.

Learning

Never stop learning. Learn new techniques in your current language. Learn how your language’s compiler/VM works. Learn about different types of garbage collection. Learn a new language. Just keep learning. Never stop following your passion for technology. Sometimes these technologies may not be immediately applicable to your current project, but learning new concepts will help you approach your current work differently and more efficiently (check out Array.ForEach() in the .NET Framework)


Sep 13 2008

Shingling - it’s not just for roofers!

Tag: Information ArchitectureJeremiah Peschka @ 8:41 pm

I was catching up on the Information Architecture Institute mailing list and some feed reader backlog when I came across the concept of shingling. Seeing as how I have never heard of this term in the 8+ years I’ve been working, I decided it was high time that I learned about it.

In essence, shingling seeks to solve the problem of indexing large quantities of data:

  • How can you tell if two pieces of content are the same? You compare them.
  • What happens when you want to compare a lot of pages? You have to make a very large number of comparisons.
  • What happens when you’re google? You can’t solve this problem by bulk comparisons

The number of computations required to compare all of your content rapidly trends toward infinity. Enter shingling.

Shingling is nothing more than taking a single document, splitting it into smaller chunks, and generating a sufficiently sized unique fingerprint from the shingles. This way, when you are indexing content, you can immediately break down the document into the constituent shingles, generate a fingerprint, and then see if that fingerprint already exists. What you do at this point is up to you.

I ran into a similar problem when implementing an n-gram search engine that ultimately proved to be inefficient. Essentially, the computation required to generate the n-grams for a given page of content, check the index for an existing n-gram, and associate the indexed word/content chunk with the given n-gram soon proved terribly inefficient. This n-gram search engine only had to index names, addresses, and email addresses — it wasn’t even attempting to provide an index of user generated content.

I have to wonder how full-text indexing providers, like SQL Server and Ferret, will (or already have) handle the challenge of shingling indexed content. It seems like it would be a concern both on the front of storage consolidation and for the purposes of optimizing CPU cycles — for retrieval, indexing, and comparison.

References
Near-duplicates and shingling
Navigating the network of knowledge: Mining quotations from massive-scale digital libraries of books
N-gram, From Wikipedia, the free encyclopedia


Next Page »