Ten Reasons PostgreSQL is Better Than SQL Server

Why would anyone want to use PostgreSQL instead of SQL Server? There are a lot of factors to consider when choosing how to store your data. Sometimes we need to look deeper than the standard choice and consider something new. If you’re starting a brand new project, where should you store your data? Here are ten reasons why you might want to consider PostgreSQL over SQL Server.

Releases Every Year

Let’s face it, waiting three to five years for new functionality to roll out in any product is painful. I don’t want to be constantly learning new functionality, but on the flip side I don’t want to write hack solutions to critical business problems just because I know something is coming down the pipe and I can’t wait a few more years for it. A rapid release cycle means the PostgreSQL development team can ship the features users need quickly and make frequent improvements.

Starting with version 9.0, PostgreSQL has moved to a yearly release cycle. Before that, PostgreSQL released whenever the features were done. Looking at the list of major releases on Wikipedia, it’s clear that even then, major releases rolled out about once every 18 months. An 18 month release cycle isn’t bad for any software product, much less a mission critical one like a database.

True Serialization

Snapshot isolation guarantees that all reads in a transaction see a consistent snapshot of data. In addition, a transaction should only commit if the ways that it changes data don’t conflict with other changes made since the snapshot was taken. Unfortunately, snapshots allow anomalies to exist. It’s possible to create a situation where two valid transactions occur that leave the database in an inconsistent state – the database doesn’t pass its own rules for data integrity.

Serializable snapshot isolation was added to PostgreSQL in version 9.1. SSI emulates strict serial execution – transactions behave as if they are executing one after another. If there is a conflict, or even a potential conflict, the database engine throws an error back to the caller (who is left to figure out the appropriate next step).

Serializable snapshot isolation sounds painful, but the kicker is that it lets the database guarantee a much stronger level of consistency. Applications can be written to assume that data modifications will occasionally fail and to retry failed transactions. The real benefit is that well written software avoids data inconsistencies and maintains the illusion that everything is executing in order.
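
In practice it looks something like the sketch below (the accounts table is made up for the example). If a concurrent transaction creates a dangerous conflict, the COMMIT fails with SQLSTATE 40001 (serialization_failure) and the application retries the whole transaction:

BEGIN ISOLATION LEVEL SERIALIZABLE;

-- hypothetical transfer between two accounts
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;

COMMIT;  -- catch SQLSTATE 40001 here and retry from BEGIN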

Sane Defaults, Ridiculous Tuning

Okay, to be fair, PostgreSQL ships with some ridiculously conservative shared memory settings. Most of the other defaults are conservative too, but sensible enough for generic workloads. Many people deploying PostgreSQL won’t have to change much at all (probably just increasing shared_buffers to 25% of total RAM to start).

Once a PostgreSQL installation is up and running, there are a number of settings that can be changed. The best part, though, is that most of these settings can be changed at the server, database, user, or even individual query level. Mixed workload servers are very common – most activity on the server is basic CRUD, but a small percentage is reporting that needs to be aggressively tuned. Instead of moving individual reports off to separate servers, databases, or resource pools, we can simply tune a few queries with the appropriate parameters, including the memory allocated for sorting and joins.
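
For example, work_mem (the memory available to each sort or hash operation) can be set for a whole database, for a role, or for a single transaction. This is only a sketch – the database, role, and table names are made up:

ALTER DATABASE reporting SET work_mem = '64MB';   -- every session in this database
ALTER ROLE report_user SET work_mem = '128MB';    -- every session for this role

BEGIN;
SET LOCAL work_mem = '512MB';                     -- just this transaction's big report
SELECT customer_id, sum(order_total) FROM orders GROUP BY customer_id;
COMMIT;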

Unlogged Tables

Are you sick of trying to get minimally logged bulk inserts to work? Me too. Instead of fiddling with various mechanisms to minimally log some tables, PostgreSQL gives us the option of creating an unlogged table – simply add the UNLOGGED keyword to a CREATE TABLE statement and everything is ready to go.

Unlogged tables bypass the write ahead log; they aren’t crash safe, but they’re stupid fast. Data in an unlogged table will be truncated after a crash or unclean shutdown; otherwise it’ll still be there. Unlogged tables are also excluded from replication to a standby server. This makes them ideal for ETL or other data manipulation processes that can easily be repeated from source data.
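
A throwaway staging table for an ETL load might look something like this (the table and column names are made up):

CREATE UNLOGGED TABLE staging_events (
    event_id  bigint,
    payload   text,
    loaded_at timestamptz DEFAULT now()
);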

KNN for Geospatial… and More

Yeah, I hear ya, SQL Server will have this soon, but PostgreSQL already has it. If K Nearest Neighbor searches are critical for your business, you’ve already gone through some pain trying to get this working in your RDBMS. Or you’ve given up and implemented the solution elsewhere. I can’t blame you for that – geospatial querying is nice, but not having KNN features is a killer.

PostgreSQL’s KNN querying works on specific types of indexes (there are a lot of index types in PostgreSQL). Not only can you use KNN querying to find the 5 nearest Dairy Queens, but you can also use a KNN search on other data types. It’s completely possible to perform a KNN search and find the 10 phrases that are closest to “ice cream”.
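
Here’s roughly what that looks like – the stores and phrases tables are made up, but the <-> distance operator and the GiST indexes that back it are real:

CREATE INDEX stores_location_idx ON stores USING gist (location);

SELECT name
FROM stores
ORDER BY location <-> point '(40.44, -79.99)'  -- nearest first
LIMIT 5;

-- the same idea works for text with the pg_trgm extension
CREATE EXTENSION pg_trgm;
CREATE INDEX phrases_trgm_idx ON phrases USING gist (phrase gist_trgm_ops);

SELECT phrase
FROM phrases
ORDER BY phrase <-> 'ice cream'
LIMIT 10;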

KNN search capability makes PostgreSQL a serious contender for anyone looking at implementing geospatial querying. The additional flexibility puts PostgreSQL in a leadership position for many other kinds of search driven applications.

Transaction-Controlled Synchronous Replication

One of the easiest ways to keep another copy of your database is to use some kind of database replication. SQL Server DBAs will largely be used to transactional replication – a dedicated agent reads the SQL Server log, collects outstanding commands, and then ships them over to the subscriber where they are applied.

PostgreSQL’s built-in replication is closer to SQL Server’s mirroring than to SQL Server’s replication (with the bonus that PostgreSQL’s standby is readable). Log activity is hardened on the primary and then streamed to the secondary, either synchronously or asynchronously. Up until PostgreSQL 9.1, replication was an all or nothing affair – every transaction on the server was replicated the same way. Starting with 9.1, developers can control this per transaction by setting the synchronous_commit configuration value for that single transaction. This is important because it makes it possible to write copious amounts of data to logging tables for debugging purposes without having performance suffer from synchronously committing those writes across the network.
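
Here’s a sketch of what that looks like (the debug_log table is made up). SET LOCAL only affects the current transaction, so everything else on the connection keeps the synchronous default:

BEGIN;
SET LOCAL synchronous_commit = off;  -- 'local' waits for the local WAL flush but skips the standby
INSERT INTO debug_log (logged_at, message)
VALUES (now(), 'noisy diagnostic output');
COMMIT;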

Any time we have more choice in how we develop our applications, I’m happy.

Writeable CTEs

CTEs are great for reads, but if I need to do something more complex with them, there are other issues involved. An example will make this much easier. Let’s say I want to delete stale data, but I want to store it in an archive table. To do this with SQL Server, the easiest route (from a development standpoint) is to elevate my isolation level to at least snapshot, if not serializable, and rely on that isolation level to guarantee that no data changes underneath me. I could also load the PK values of the comments to be deleted into a temp table and reference that table multiple times.

Both methods work, but both have problems. The first requires that the code run at a specific isolation level, relying on settings that may not be in place. The code could also be copied out of the procedure and run in SSMS, leading to potential anomalies where a few rows are deleted but not archived. That’s no big deal for spam comments, but it could be critical in other situations. The second method isn’t bad, but it adds code noise. That temporary table isn’t necessary to solve the problem; it’s a byproduct of dealing with isolation levels.

PostgreSQL has a different way to solve this problem: writeable CTEs. The CTE is constructed the same way it would be constructed in T-SQL. The difference is that in PostgreSQL, the data can be modified inside the CTE. The output is then used just like the output of any other CTE:

CREATE TABLE old_text_data (text_data text); 

WITH deleted_comments AS ( 
  DELETE FROM comments 
  WHERE comment_text LIKE '%spam%' 
  RETURNING comment_id, email_address, created_at, comment_text 
) 
INSERT INTO spam_comments 
SELECT * 
FROM deleted_comments;

This can be combined with default values, triggers, or any other data modification to build very rich ETL chains. Under the covers it may be doing the same things we’ve outlined for SQL Server, but the conciseness is a real benefit.
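
For instance, building on the example above, a second data-modifying CTE can do the archiving while the outer query summarizes what was removed (this assumes spam_comments has the same column layout as the RETURNING list):

WITH deleted_comments AS (
    DELETE FROM comments
    WHERE comment_text LIKE '%spam%'
    RETURNING comment_id, email_address, created_at, comment_text
),
archived AS (
    INSERT INTO spam_comments
    SELECT * FROM deleted_comments
    RETURNING email_address
)
SELECT email_address, count(*) AS spam_removed
FROM archived
GROUP BY email_address;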

Extensions

Ever want to add some functionality to SQL Server? What about keeping that functionality up to date? This can be a huge problem for DBAs. It’s very easy to skip a server when you roll out new administrative scripts across your production environment. Furthermore, how do you even know which version is installed where?

The PostgreSQL Extension Network is a centralized repository for extra functionality. It’s a trusted source for open source PostgreSQL libraries – no sneaky binaries are allowed. Plus, everything in PGXN is versioned: when you update PGXN-provided functionality, the extension mechanism takes care of the upgrade path for you.

There are extensions for everything from K-means clustering and Oracle compatibility functions to remote queries against Amazon S3.

Pushing this functionality out into extensions makes it easy for developers and DBAs to build custom packages that look and act like core functionality of PostgreSQL without trying to get the package through the PostgreSQL release process. These packages can then be developed independently, advance at their own rate, and provide complex functionality that may not fit within the release plan of the PostgreSQL core team. In short, there’s a healthy ecosystem of software being built around PostgreSQL.
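
The mechanics look something like this – pg_trgm here is just a stand-in for any packaged extension, whether it ships with PostgreSQL or comes from PGXN:

CREATE EXTENSION pg_trgm;                        -- install into the current database
ALTER EXTENSION pg_trgm UPDATE;                  -- move to the newest installed version
SELECT extname, extversion FROM pg_extension;    -- what's installed, and at which version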

Rich Temporal Data Types

One of my favorite features of PostgreSQL is the rich support for temporal data types. Sure, SQL Server 2008 finally brought some sophistication to SQL Server’s support for temporal data, but it’s still a pretty barren landscape. Strong support for temporal data is critical in many industries and, unfortunately, a lot of work goes on in SQL Server just to work around the limitations of its temporal support.

PostgreSQL brings intelligent handling of time zones. In addition to supporting the ISO 8601 format (1999-01-08 04:05:06 -08:00), PostgreSQL lets you identify the time zone by an abbreviation (PST) or by a location identifier (America/Tijuana). Abbreviations are treated as a fixed offset from UTC, while location identifiers change with daylight saving rules.

On top of time zone flexibility, PostgreSQL has an interval data type. The interval data type can store an interval of up to 178,000,000 years with precision out to 14 digits, and intervals can measure time at granularities from as broad as a year to as narrow as a microsecond.
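
A few quick examples (the exact results depend on your server’s time zone setting):

SELECT timestamptz '1999-01-08 04:05:06-08:00' AT TIME ZONE 'America/Tijuana';  -- convert to local wall-clock time
SELECT now() + interval '3 months 2 hours';                                     -- interval arithmetic
SELECT age(timestamptz '2011-01-01', timestamptz '1999-01-08');                 -- difference expressed as an interval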

Exclusion Constraints

Have you ever tried to write any kind of scheduling functionality using SQL Server? If you have, you’ll know that business requirements like “two people cannot occupy the same conference room at the same time” are difficult to enforce with code and usually require additional trips to the database. There are many ways to implement this purely in application code, and none of them lead to happy users or developers.

PostgreSQL 9.0 introduced exclusion constraints. In short, we define a table and then add a constraint listing a set of comparisons; for any two rows, at least one of those comparisons must be false, so no two rows can conflict on all of them at once. Exclusion constraints are backed by indexes, so these checks are as fast as our disks and the index we’ve designed. It’s possible to use exclusion constraints with temporal or geospatial data to make sure that different people aren’t reserving the same room at the same time or that plots of land don’t overlap.
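
Here’s a sketch of the conference room example. It uses the btree_gist extension (so plain equality can participate in a GiST index) and a range type; range types arrived in PostgreSQL 9.2, and earlier versions typically built the same thing with the temporal add-on or geometric types:

CREATE EXTENSION btree_gist;

CREATE TABLE room_reservations (
    room   text,
    during tsrange,
    EXCLUDE USING gist (room WITH =, during WITH &&)
);

INSERT INTO room_reservations VALUES ('101A', '[2011-12-15 09:00, 2011-12-15 10:00)');
-- an overlapping booking for the same room fails with a conflict error
INSERT INTO room_reservations VALUES ('101A', '[2011-12-15 09:30, 2011-12-15 11:00)');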

There was a presentation at PGCon 2010 that went into the details of exclusion constraints. While there is no video, the slides are available and contain enough examples and explanation to get you started.

Bonus Feature – Cost

It’s free. All the features are always there. There are no editions of PostgreSQL – the features always exist in the database. Commercial support is available from a number of companies, some of them even provide additional closed source features, but the core PostgreSQL database is always available, always free, and always contains the same features.

Getting Started

Want to get started with PostgreSQL? Head on over to the download page and pull down a copy for your platform of choice. If you want more details, the documentation is thorough and well written, or you can check out the tutorials in the wiki.

Comments

  1. Nice list, Jeremiah. I once argued with a Python dev about Postgres versus SQL Server. Perhaps if he had communicated as clearly as you have in this post, we might’ve actually had a discussion rather than a Nope-You’re-Wrong back-and-forth.

    • Sadly, I think too many discussions end up going the way of arguments because someone’s unwilling to listen to the other person’s point. I know I’m guilty of that all the time.

      Glad you enjoyed this list, though.

  2. Jeremiah, as a rebuttal to your Writeable CTEs example, what about using the OUTPUT clause? Ex:

    DELETE FROM comments
    OUTPUT DELETED.*
    INTO spam_comments
    WHERE comment_text LIKE '%spam%'

    • It certainly works, but the downside to the OUTPUT clause is that I have to immediately drop the data somewhere before picking it back up to perform additional querying. So if I really want to display a list of rows changed in this query plus the last time they were changed, then I have to save the OUTPUT clause to a temporary table before querying my history data. It’s a bit of syntactical sugar, but it’s quite convenient. Plus I get all of the other syntactical niceties of a CTE.

      Using the OUTPUT clause is definitely a workaround that can be made to do the same things as a writeable CTE, but it requires a heck of a lot more work. PostgreSQL also has the RETURNING clause to do the same thing as OUTPUT.

        • I’m not buying that. For the example you gave, it’s actually easier to do what you’re trying to do using SQL Server. As Eric stated, all you have to do is use the OUTPUT clause. Calling the ways to do this using OUTPUT “workarounds” is just a scam. They aren’t workarounds, and the code is actually simpler.
        FAIL

  3. Tim Benninghoff,

    I loved the timing of this post. Over the past month I’ve implemented a CMS and a Wiki on PostgreSQL, and I could certainly stand to learn more about it.

  4. I’m a very happy PostgreSQL user. I’d like to step in and highlight a few areas where users of some other DBMSes get frustrated, so people are aware Pg is no more perfect than any other DB. Don’t get me wrong, I’m a big fan; that’s part of why I’m highlighting the downsides, because I get annoyed when something is hyped so much that people find actually using it to be a let-down. Let’s get the let-down and the warts over now:

    • Limited CPU concurrency. Pg can run many queries on many different CPUs, but one query can only run on one CPU at a time. If you have one really big, CPU-expensive query, Pg can’t complete it faster by using multiple CPUs. There are some limitations on concurrent disk use, too, but they’re much less significant. None of this affects workloads with larger numbers of smaller queries, where Pg easily maxes out the machine.
    • Admission control is only by connection count. Pg can’t natively be told to “run no more than 10 queries concurrently” for example. Every connection has its own query executor, so high idle connection counts are expensive in synchronization and (somewhat) in RAM. Too many actively working connections lead to contention for resources, excess context switches, etc and slow overall throughput. Pg has a “sweet spot” for number of working connections that varies depending on the hardware, and it’s usually best to use a connection pool to queue queries and stop it exceeding that sweet spot. Good connection pools like PgPool-II and PgBouncer are available, so this is more of a wart than a real problem.
    • PostgreSQL doesn’t have true stored procedures with autonomous transactions. PL/PgSQL functions always run within a transaction; they cannot commit that transaction or create others. (They *can* use savepoints via EXCEPTION blocks, though). dblink can be used as a workaround, but has its own issues. It looks like MS SQL Server may have the same limitation, but I’m not up to date with MS SQL Server and wanted to point this out in case it doesn’t.
    • All-or-nothing replication. You can’t currently control which databases or tables get replicated when you’re using wal-shipping or streaming replication, it’s all or nothing. This can be a pain if you have a busy but unimportant database and a less-busy but very important database, where replication requirements differ. You can work around it by running two PostgreSQL instances, but they can’t use the same shared memory segment so you’ll pay a RAM cost for it, plus they have to be on different ports or IPs.
    • No automatic savepoints. If one statement in a transaction fails, the whole transaction is automatically rolled back. You can’t pick it up again from the last successful statement unless you’ve been issuing SAVEPOINT statements before each statement that might fail. Savepoints have a performance cost, too, so this can slow things down as well as being a pain. In practice Pg’s behaviour is usually a good thing, but it’s occasionally a real pain.
    • Limited XML functionality. Support for in-db XSLT in particular is currently in an add-on module with known issues.
    • Limited performance monitoring and somewhat primitive performance monitoring tools.
    • work_mem is per-sort (etc) not per-connection, so it’s hard to give a query or set of queries the maximum possible RAM to work with w/o risking exceeding available resources or pushing too much cached disk data out. In practice this isn’t too big an issue, but it’s a bit ugly.

    In other words: Pg isn’t perfect. It has quirks, and you should be aware of them. Despite that, it’s *REALLY* good for lots of workloads, and the price is hard to argue with. The generally really helpful and friendly community doesn’t hurt, either.

  5. Heh … looks like your weblog style doesn’t like <ul><li></ul> being used. A preview button would be handy.

  6. I’ve now written up my previous post on some of PostgreSQL’s limitations and downsides, from the perspective of a happy PostgreSQL user, here:

    http://blog.ringerc.id.au/2011/12/postgresql-great-even-when-you-can-see.html

    … where it can be read without the munched formatting of the comment as it appears here. Hopefully this’ll help people see whether PostgreSQL really is suitable for them or not and make a better-informed decision without later disillusionment if it turns out not to do what they need.

    • Thanks for sharing both in the horrible formatting of my comments and on your own blog. It’s always good to get people’s opinions of what makes something good and also what to watch out for. Again, thanks!

  7. Dang, I messed up those embedded URLs. Try this comment instead.

    Jeremiah – In your Writeable CTE section, what I interpret you as wanting to do is to immediately insert into one table what you are deleting from a different table. Well, SQL 2008+ does this already with a “dml table source” feeding an insert statement – see the INSERT statement in BOL (http://msdn.microsoft.com/en-us/library/ms174335.aspx) for a definition and syntax.

    Your above code, re-written to utilize this:
    INSERT INTO spam_comments
    SELECT comment_id, email_address, created_at, comment_text
    FROM (
        DELETE FROM comments
        OUTPUT deleted.comment_id,
               deleted.email_address,
               deleted.created_at,
               deleted.comment_text
        WHERE comment_text LIKE '%spam%'
    ) MyDerivedTableAlias

    Granted, this only works with the INSERT statement, and it has a bunch of restrictions (which can be read in the OUTPUT clause in BOL at http://msdn.microsoft.com/en-us/library/ms177564.aspx), but it is there.

    • You’re correct, much like the people who suggested I can also use OUTPUT are correct: it’s a limited value of correct. I can use workarounds like this one, or OUTPUT, but I’m not looking to immediately insert into another table. I am looking to re-use the output of a DML statement in another query, just as I would re-use the output of any other statement – as a virtual table. The upside of writeable CTEs is that I can combine multiple writeable CTEs and produce summary data or even use the writeable CTE to produce aggregations.

      In theory, I can do all of that with temporary tables/INSERT/OUTPUT. The problem is that when I do this, I’m resorting to a workaround; I’m telling the database exactly how I want to do something. One of the primary advantages of SQL is that SQL lets us tell the database what to do, and then the database figures that out for us. The limitations in SQL Server mean I end up telling SQL Server how to do the work. I like writeable CTEs because they let me tell the database what to do and then I just sit back and let it do the work.

  8. Noel,

    Very good read, and a little tasty morsel of what’s available in other DBMSes. I’ve seen a lot of DBAs/developers think about this switch because SQL Server is increasing in price and MS is focused on their Azure offerings. MySQL just doesn’t look especially appealing for greenfield projects, considering that it lacks a bunch of features and it is now owned by Oracle.

    I’m curious, have you had any significant systems running on Postgres, like say, 99.995% uptime requirements, > 500 GB of data, and 2K q/s? I’ve found it really hard to find examples I can take before my team to say, “Look, they moved a big system to Postgres, and it worked great!”

    • While I personally haven’t worked with systems that have those requirements (defining 99.995% uptime is tricky), I know that many systems out there are working well within those limits. Emma have a particularly large PostgreSQL installation. Heroku have also chosen PostgreSQL as the RDBMS backend for all applications on Heroku. If you have control of the hardware, hitting those requirements shouldn’t be difficult at all. If you’d like, you can hit me up via this delightful form and we can chat via email a bit more about what you’re trying to do.

    • Two other quick bits of info around PostgreSQL and scalability. You can see some of the current benchmarks here: Scalability, in Graphical Form, Analyzed. As far as other scalable companies using PostgreSQL – Instagram have recently published how they scale with PostgreSQL.

      TL;DR version: Benchmarks show PostgreSQL 9.1 scaling up to around 45,000 transactions per second with 9.2's code base edging past the 200,000 TPS mark.

    • Brad,

      Just found this on Google, a bit late to the party – but one case commonly pointed to is the .org & .info domain registries operated by Afilias and run on Postgres. These are large, high transactional volume databases with very strict uptime requirements. Core internet infrastructure depends on these systems being available.

      .org was migrated away from Oracle in the early to mid 2000s.

      • Brad,

        The FAA also uses Postgres/PostGIS. They replaced the Oracle spatial stuff with it.

        If you fly in the US, you depend on Postgres.

  9. Completely changed my perspective on Postgres – I always thought it was junk. Any leads on a VM to fool around with?
