Why Don’t We Have Federated Databases?

March 22, 2011

Federated databases are a dream that have not materialized. The SQL/MEDextension to the ANSI/ISO SQL specification is a step in the right direction. In addition, both SQL Server and Oracle have proprietary extensions that make it possible to query external data sources. If all of this technology is available today, why aren’t more people using it?

Why Don’t We Have a Federated Database?

If federated databases are such a powerful thing, why hasn’t anyone built one? Surely this is something that many businesses are clamoring for, or at least in need of. There are a number of problems facing anyone attempting to implement a federated database. Frankly, federated databases are fraught with technical difficulties. Every database vendor supports a different subset of the ANSI/ISO SQL standard, different vendors use different data types and metadata, concurrency is a huge concern with dealing with remote resources, and technology is a moving target.

Different Dialects of SQL

Anyone who has attempted to port an application from one database engine to another knows about the pain involved in translating queries. Different vendors adhere to the SQL standard to varying degrees. This problem can be solved by creating wrappers to translate sub-queries between different querying languages, but it’s still a problem that exists. This problem could be partially solved in a federated database by limiting the database vendors to a small subset of the ANSI/ISO SQL standard, but this doesn’t solve the problem, it merely avoids it by limiting functionality.

Different Metadata

Different databases may have different metadata – different data types are used to represent the same data and different structures are used to describe data. To solve this problem it becomes necessary to create elaborate mappings between columns that represent the same data. There may be situations where such a mapping becomes computationally intensive or functionally impossible. SQL Server allows the creation of .NET data types with custom methods for data searching, access, and manipulation. PostgreSQL features several data types (notably tsvector, hstore, and arrays) that would be difficult to convert to data types in other databases. Metadata differences don’t just stop at the data type level. It’s possible to model data in a number of different ways; the type of an address could be indicated using an integer key value that references a lookup table in one database or as a VARCHAR column with values constrained by the database. It’s even possible for something as simple as Unicode text encodings to cause problems: SQL Server uses the NVARCHAR data type for storing Unicode strings while other databases do not use a separate data type.

Concurrency

Concurrency, depending on your database, may be a concern. Managing concurrent operations within a single database is a difficult task, much less managing concurrency across multiple databases. Unfortunately, correct handling of concurrency across all components of a federated database is critical. Many potential problems of a federated database can be solved through different trade offs. However, managing concurrency is a nearly impossible task. To properly and effectively manage concurrency across multiple databases is to ask the impossible. Not only would this require the federated database vendor be able to account for all possible concurrency issues in relational databases, but they would need to be able to account for potential concurrency issues in any database that integrates with the federated database.

The Moving Target

Even were a database vendor to take on this task, they’d be consistently aiming for a moving target. New features are added to relational databases all the time, and there are enough major players in the market to make it difficult for users to keep up to date, much less a federated database vendor. Once you factor in the wealth of other, non-relational databases, the idea of creating a federated database system to handle metadata mapping, concurrency control, and query language resolution trends toward impossible.

The State of The Industry

Where we stand now, there is almost no chance of any independent software vendor creating a true federated database. There is hope, but not from where you would expect it. Enterprise data warehouses can fulfill much of the function of a federated database, but they still require complex ETL and data mapping to be truly useful. Adding additional information to an enterprise data warehouse can require extra work to prepare the data warehouse and ETL processes for the new data. Unfortunately, enterprise data warehouses require too much manual intervention to be a candidate for a federated database. As we’ve discussed, SQL Server and Oracle provide ways to reference remote database servers. These methods have their own problems. SQL Server linked tables are prone to problems with some objects not being remoteable. When we’re querying a remote server, we need to make sure that the parts of our query going to the remote server are handled on the remote server. This is difficult to get right. On some occasions we might even see an entire remote table be streamed across the wire to be filtered on the originating server. This is something that we don’t want to see happening. For a federated database to be a tenable product there must be an easy way to offload queries to a remote table and a guarantee of adequate performance. Properly remoting queries is incredibly complex. Assume, for a moment, that we have a report that queries data on the sales department’s database server and we also need to include data from human resources data. Our query might look something like this:SELECT sp.first_name, sp.last_name, eh.employment_duration, ts.year, ts.total_sales_by_year, r.average_review_score FROM public.sales_person sp JOIN HRDB01.employee_info.public.employee_history eh ON sp.employee_id = eh.employee_id JOIN ( SELECT o.employee_id, o.year, SUM(o.total) total_sales_by_year FROM public.orders AS o GROUP BY o.employee_id, o.year ) AS ts ON sp.employee_id = ts.employee_id JOIN ( SELECT r.year, r.employee_id, AVG(r.score) AS average_review_score FROM HRDB01.employee_info.reviews.review r GROUP BY r.year, r.employee_id ) AS rv ON sp.employee_id = rv.employee_id AND ts.year = rv.year; Looking at this query we’re hitting two separate remote objects in one remote database. In order for this query to be effective, our query optimizer must b able to re-write the query in a way that lets it build an intelligent query for the two remote tablesHRDB01.employee_info.public.employee_history andHRDB01.employee_info.reviews.review. But, in order to effectively re-write the local query to properly reference remote objects, we need to know everything about the remote objects – our calling server must be aware of as much metadata as possible so the remoted query can be re-written before being sent to the remote server. While this is doable, it puts additional load on the calling server. This server now has to maintain information about remote database objects. But that’s not all! If we want our queries to be truly optimal, our federating sever will need to be aware of how data types will behave on the remote servers and how the remote data types will interact with local data types. Once you examine the intricacies of a federated database, it becomes obvious why the federated database, as a boxed product, is beyond our reach. It’s not that the task is impossible; on the contrary such a task is very possible. The difficulty lies in coordinating all of the information available and using it to deliver data quickly. There are enough moving and potentially unknown parts that it’s non-trivial to create heterogeneous systems capable of filling out the promise of federated databases. Faced with this situation, the only viable solution is to build your own solution that answers the needs of the business.