Category Archives: postgresql

Connecting Postgres to Active Directory for Authentication

I have connected our Postgres instance to Active Directory for authentication because I didn’t want to manage passwords for hundreds of users accessing mission-critical databases. Authentication is performed by Active Directory, but authorization (managing groups and permissions at the table/column/row level) is still handled by Postgres. The biggest advantage of connecting Postgres to AD for real users is that I don’t have to worry about passwords and password policies, because those are handled by AD. On the other hand, database authentication for users becomes dependent on the AD infrastructure. Please make sure your organization's AD infrastructure is sufficiently redundant, with a low-latency connection, before making this decision.

I have come up with simple steps to connect Postgres to Active Directory. TBH, if you can get a service account to use for connecting to your organization’s Active Directory, the rest of the setup is fairly straightforward.

Step-1 :

Get the Active Directory server name and test the connection from the Postgres server:

postgres-dbserver$ nc -v 389
Connection to 389 port [tcp/ldap] succeeded!
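If nc isn’t available on the database host, the same reachability check can be sketched in a few lines of Python — a sketch only, and the AD host name you pass in would be your own:

```python
import socket

def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (placeholder host name):
# can_connect("ad.example.com", 389)
```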
Step-2 :

Install the LDAP client package so you can use the LDAP command-line tools to test connections.

yum install openldap-clients    # RHEL/CentOS
apt-get install ldap-utils      # Debian/Ubuntu
Step-3 :

Ask your AD/LDAP admins for the full path of the OU and CN for your organization and construct an ldapsearch query. I’m using sAMAccountName as the username attribute, but you can use other attributes, e.g. uid.

ldapsearch -v -x -H ldap:// -b "OU=Users,OU=local,DC=example,DC=com" -D "CN=service-account-username,OU=ServiceAccounts,OU=Users,OU=local,DC=example,DC=com" -W "(sAMAccountName=dpatel)"

If you are using uid for the validation, your search query will look like this:

ldapsearch -v -x -H ldap:// -b "OU=Users,OU=local,DC=example,DC=com" -D "CN=service-account-username,OU=ServiceAccounts,OU=Users,OU=local,DC=example,DC=com" -W "(uid=dpatel)"

If the above query returns a result, you have successfully found the user in Active Directory at the specified path. If you can’t find the user, please work with your Active Directory admin to find the correct base DN (the -b option in the above query).

Step-4 :

Add the following line to pg_hba.conf and reload the config. Postgres reads pg_hba.conf from top to bottom, so make sure to add this line at the end so that authentication for other application users can still be performed without Active Directory.

host all all ldap ldapbasedn="OU=Users,OU=local,DC=example,DC=com" ldapbinddn="CN=service-account-username,OU=ServiceAccounts,OU=Users,OU=local,DC=example,DC=com" ldapbindpasswd="Testing123" ldapsearchattribute="sAMAccountName"

** Passwords with special characters like @ or ! don’t parse properly in pg_hba.conf. Also, I had to provide the AD service account password in plain text in pg_hba.conf. (Simple-bind mode with ldapprefix/ldapsuffix avoids storing a bind password, at the cost of losing the search step.) If you know another alternative, please let me know 🙂
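For reference, the search+bind line above can be assembled with a tiny helper — a sketch only, with a hypothetical function name; the DN values are the ones used in this post, and the database/user/address fields are hard-coded to keep it short:

```python
def ldap_hba_line(basedn: str, binddn: str, bindpasswd: str,
                  attr: str = "sAMAccountName") -> str:
    """Build a pg_hba.conf search+bind LDAP line (database/user fields
    fixed to 'all all' here for brevity)."""
    return (
        'host all all ldap '
        f'ldapbasedn="{basedn}" '
        f'ldapbinddn="{binddn}" '
        f'ldapbindpasswd="{bindpasswd}" '
        f'ldapsearchattribute="{attr}"'
    )

line = ldap_hba_line(
    "OU=Users,OU=local,DC=example,DC=com",
    "CN=service-account-username,OU=ServiceAccounts,OU=Users,OU=local,DC=example,DC=com",
    "Testing123",
)
```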

Step-5 :

Create a matching role in Postgres for the AD user. Only the password check is delegated to AD; Postgres still needs a role with the LOGIN privilege:

psql> create role dpatel login;
Step-6 :

Test the connection and check the Postgres logs for success/error messages.

Hopefully, this post will help someone integrate Postgres with Active Directory using the LDAP protocol.

Tracing Tableau to Postgres connectivity issue using Wireshark!

I spent the last couple of weeks trying to resolve connection issues from Tableau Server 10.3 to Postgres 9.6.6. If you are not familiar with Tableau, it is a popular enterprise-grade visualization tool with advanced analytic capabilities for understanding and visualizing data. As it is very popular in the industry, it’s no surprise that Tableau has to talk to a popular database in the industry: Postgres!

Issue: My developers could connect to the Postgres DB using Tableau Desktop on their laptops without any issue and make a LIVE connection to the Postgres 9.6 database. However, when they published the same dashboard with a LIVE connection to the database, it would hang on the Tableau server.

To reproduce the issue, we installed Tableau Desktop on the Tableau server itself. The Desktop application hung while connecting to the Postgres database, which made it clear that the problem existed only when connecting from the Tableau server. Over the last couple of weeks I looked at various things to understand and resolve the issue:

  1. Database encoding: UTF8 vs. ANSI
  2. Tested Postgres ODBC drivers on the Tableau Server (Windows 2016)
  3. A lot of troubleshooting, because the connection to an empty database was working!!
  4. Working with Tableau support to explain and track down the issue
  5. Windows settings

None of the above helped!

Finally, Wireshark came to the rescue! We traced the network traffic on port 5432 on Windows 2016 while it was making the connection to the Postgres database running on a Linux server.

In the Wireshark logs, filtering the TCP conversation down to the Postgres server showed that the TCP/IP sequencing was inconsistent. When a connection is initiated, sequence numbers are maintained on each side of the connection — in this case, by the Postgres server (PDB) and by Tableau Desktop (TD). Each side’s next sequence number is its current sequence number plus the length of the data it just sent, which is how both ends verify that no data has been lost.

Typically, TCP data transmission is sequentially ordered. Each packet carries an acknowledgement number, known as the ACK. Its value is equal to the next chronological sequence number expected after the packet the receiver has just received.

Like pages in a book, we expect a sequential order. The Wireshark logs indicate the PDB sequence numbers are not consistent with the TD sequence numbers, as shown below:

1. PDB > TD Sequence=1;Length=14;Next Sequence=15
2. TD > PDB Sequence=1;Length=34;Next Sequence=35;Acknowledge=15
3. PDB > TD Sequence=15;Length=304;Next Sequence=319;Acknowledge=35
4. TD > PDB Sequence=35;Length=174;Next Sequence=209;Acknowledge=319

In the good packet sequence above — sticking with the book analogy — the first line says that PDB sent 14 pages, so the next packet from PDB should start on page 15. In the second line, Tableau Desktop responds starting on page 1 with 34 pages of its own and acknowledges page 15; the next packet from TD will start on page 35. Third, on the database side, PDB continues from page 15 with a total of 304 pages, so the next packet from PDB should start at page 319, which Tableau Desktop duly acknowledges on the fourth line.

Now let’s look at the problematic lines in the screenshot below, for the packets captured between numbers 7701-7710:

(Screenshot: Wireshark capture of the bad TCP packets)

PDB > TD Sequence=1283;Length=32;Next Sequence=1315;Acknowledge=1963
TD > PDB Sequence=1963;Length=40;Next Sequence=2003;Acknowledge=1315
PDB > TD Sequence=1315;Length=22;Next Sequence=1337;Acknowledge=2003
TD > PDB Sequence=2003;Length=343;Next Sequence=2346;Acknowledge=1337
PDB > TD Sequence=9529;Length=1449;Next Sequence=10978;Acknowledge=2346
TD > PDB Sequence=2346;Length=0;Acknowledge=1337

In the second-to-last line above, data has been skipped. TD expects the next packet from PDB to start at byte #1337, but instead it starts at byte #9529. Normally, when the bytes don’t match like this, it would indicate that some of the packets simply didn’t get captured.

For example, Wireshark could have missed a few messages. But the ACK value of that packet points to the last packet we captured from TD > PDB, which clearly states that TD expects PDB to begin its next packet at byte #1337. Not only did our sequence number increment unexpectedly, it jumped by a suspicious amount: 9529 − 1337 = 8,192. That’s the exact number of bytes in 8KB (8*2^10=8192).
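The bookkeeping above is easy to check mechanically: each packet’s expected successor is its sequence number plus its length, so comparing that against the next packet actually seen flags the jump. A small sketch using the numbers from this capture:

```python
def sequence_gap(prev_seq: int, prev_len: int, next_seq: int) -> int:
    """Bytes skipped between the end of one packet and the start of the next.
    0 means the stream is contiguous."""
    expected = prev_seq + prev_len
    return next_seq - expected

# Good packets from the first trace: PDB sent seq=1, len=14, so TD's ACK is 15.
assert sequence_gap(1, 14, 15) == 0

# Bad packet: PDB's previous segment ended at byte 1337 (seq=1315, len=22),
# but the next PDB segment started at byte 9529.
gap = sequence_gap(1315, 22, 9529)
```

The gap comes out to exactly 8192 bytes, matching the suspicious 8KB jump in the trace.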

As a result of the sequence mismatch, the Tableau Desktop machine recognizes that data has been skipped and re-requests a packet starting at byte #1337; however, PDB never responds.

To track down the packet loss, you can ping the other server with different packet sizes:

 PS C:\Windows\system32> ping -l 2000 -f
 Pinging with 2000 bytes of data:
 Reply from bytes=2000 time<1ms TTL=64
 Reply from bytes=2000 time<1ms TTL=64
 Reply from bytes=2000 time<1ms TTL=64
 Reply from bytes=2000 time<1ms TTL=64
 PS C:\Windows\system32> ping -l 2100 -f
 Pinging with 2100 bytes of data:
 Request timed out.
 Request timed out.
 Request timed out.
 Request timed out.

As you can see in the output above, when the packet size is larger than 2000 bytes, the packet loss is 100%.

In terms of root cause, it turned out to be an MTU size mismatch. Once the MTU size was adjusted on the switch and on both the Tableau Windows server and the Postgres Linux database server, the packet loss was resolved and Tableau managed to connect to Postgres without any issue.
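The ping results line up with simple MTU arithmetic: an ICMP echo payload of N bytes needs N + 28 bytes on the wire (20-byte IP header plus 8-byte ICMP header), and with the don’t-fragment flag (-f) set, a packet larger than the path MTU is simply dropped. A quick sketch — the 2048-byte path MTU below is an assumed value, chosen only to be consistent with the ping test above:

```python
IP_HEADER = 20   # bytes
ICMP_HEADER = 8  # bytes

def fits_path_mtu(payload: int, path_mtu: int) -> bool:
    """True if a DF-flagged ICMP echo of `payload` bytes fits in one frame."""
    return payload + IP_HEADER + ICMP_HEADER <= path_mtu

# Assuming a path MTU of 2048 somewhere along the route:
assert fits_path_mtu(2000, 2048)      # ping -l 2000 -f succeeds
assert not fits_path_mtu(2100, 2048)  # ping -l 2100 -f times out
```

The same arithmetic explains the classic Ethernet case: with a 1500-byte MTU, the largest unfragmented ping payload is 1472 bytes.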

Dealing with this mysterious packet loss was a roller-coaster experience. If you come across a similar connectivity issue between an application and a database, you might want to make sure you are not suffering packet loss. Wireshark can be your friend 🙂

Thanks for reading and happy holidays 🙂

Christmas Gift!

Merry Christmas!!

If you are using SQL Server or Oracle databases, I’m giving away 1 hour of my time for the rest of this week, free of cost, to discuss how you can save $$$$ by migrating them to PostgreSQL while keeping the same features and achieving better performance.

you can shoot an email to

Happy holidays!

Running Postgres in Docker

For the last six months, I have been working on moving Postgres from bare-metal and VM-based systems into Docker. As of today, we have migrated a couple of mission-critical Postgres DBs (~2TB) onto the Docker environment.

During the migration journey, I have listed some of the things to consider when running Postgres production instances in a Docker environment.

  1.  Do not use the default Postgres Docker image. Start with your own Docker image built from scratch.
  2.  Handle Postgres service shutdown gracefully in docker with SIGTERM
  3.  OS can be stored in Docker container
  4.  Data volume MUST be stored in persistent storage
  5. Use some kind of framework to manage docker containers
    • Apache Mesos & Aurora
    • OpenStack & Kubernetes
  6.  You can mount NetApp for backups and WAL files on to container
  7.  Make templates for resources for different kind of workloads
    • Aurora job for resource templates
    • postgresql.conf templates
  8.  Use static IPs for services: DBs, front-end and back-end servers
    • It will be easier to control access at container level for better security
    • failover is easy to manage with static IP for master DB server
  9.  Benchmark your existing system and compare with new system
    • Keep eyes on TPS using pgbench and benchmarksql
  10.  Monitoring solution for Postgres DB
    • collectd or other agent based monitoring
      • pg_stat_statements is very useful
    • Docker container should be monitored separately
      •  docker stats
  11.  Backup container for taking backups
  12.  Standby container for setting up standby jobs

I hope this will be useful for someone working on migrating Postgres into a Docker environment!

YUM repo location & HTTPS changes for Postgres packages !

As Devrim blogged about the Postgres YUM repo changes, I wanted to write down the procedure I had to follow this morning 🙂

For example,

If you want to update the YUM repo for Postgres 9.4 on CentOS x86_64, you can update it by executing the command below:

yum localinstall

Now, you can follow the normal Postgres version upgrade procedure.

Hats off to Devrim for building the new packages and for all the recent improvements in the repo management process! He updated the wiki page as well.

Hope this will help someone 🙂

Recovering Postgres database from disk level corruption!!

A couple of weeks ago, I had to deal with a corrupted Postgres database cluster. In the end, we couldn’t recover some of the data, but we managed to recover most of it. Having experience working with dozens of database systems, I was pleasantly surprised by the resiliency of Postgres.

Kudos to the Postgres development team for building the most resilient database in the world 🙂

Here is my Postgres database recovery story 🙂


I'm posting the steps carried out during the recovery process for informational purposes only. This post doesn't provide any guarantee that they will work for your use case and/or environment.

Note: The actual database name has been replaced with “dbname”  and actual table names with “tablename”.

One of the databases in the Postgres DB cluster experienced disk-level corruption, so we were hitting this error:

postgres=# \c dbname
FATAL: could not read block 0 in file "base/16389/11930": Input/output error

Uh oh?? Really bad, isn’t it? Fortunately, it wasn’t a mission-critical system, so we managed to take an extended outage and work on a partial recovery, because we didn’t want to lose all the data!!

Once we received the complaint, we immediately backed up the corrupted database and created a recovery cluster on a different server so we could go through the recovery efforts!!

Trial 1:

As many of you know, the first option is to bring up the recovery database cluster with zero_damaged_pages=on. You can set the value in the Postgres config file and try to reindex the system catalog:

reindexdb -p 5433 --system dbname
reindexdb: could not connect to database dbname: FATAL: index "pg_index_indexrelid_index" contains unexpected zero page at block 0 
HINT:  Please REINDEX it.

Doh! We still could not connect to the database!!

Trial 2:

If you aren’t aware, you should note that there is a way to ignore indexes at the system level. We started up the recovery cluster with the ignore_system_indexes=true setting:

pg_ctl start -D /data -o '-c ignore_system_indexes=true'
dbname=# \c dbname

Yay! I could connect to the DB now!

Trial 3:

Let’s try to reindex the database…

dbname=# reindex database "dbname";
NOTICE: table "pg_catalog.pg_class" was reindexed
2016-08-22 15:53:14.179 PDT rhost=[local] app=psql:user=postgres:db=dbname:ERROR: could not create unique index "pg_statistic_relid_att_inh_index"
2016-08-22 15:53:14.179 PDT rhost=[local] app=psql:user=postgres:db=dbname:DETAIL: Key (starelid, staattnum, stainherit)=(2608, 5, f) is duplicated.
2016-08-22 15:53:14.179 PDT rhost=[local] app=psql:user=postgres:db=dbname:STATEMENT: reindex database "dbname";
ERROR: could not create unique index "pg_statistic_relid_att_inh_index"
DETAIL: Key (starelid, staattnum, stainherit)=(2608, 5, f) is duplicated.

As the table is corrupted with duplicate entries, let’s find them and fix them:

dbname=# select starelid, staattnum, stainherit from pg_catalog.pg_statistic where starelid=2608 order by 2;
 starelid | staattnum | stainherit
----------+-----------+------------
     2608 |         1 | f
     2608 |         2 | f
     2608 |         3 | f
     2608 |         4 | f
     2608 |         5 | f
     2608 |         5 | f
     2608 |         6 | f
     2608 |         7 | f
(8 rows)

Let’s remove one of the entries based on its XMIN:

dbname=# delete from pg_catalog.pg_statistic where starelid=2608 and staattnum=5 and xmin=1228447;
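The duplicate-hunting step generalizes to any unique-index violation: group the rows by the index key and report any key that appears more than once. A hedged Python sketch (the rows mirror the pg_statistic output above; the function name is my own):

```python
from collections import Counter

def find_duplicate_keys(rows, key_cols):
    """Return key tuples that appear more than once among rows (list of dicts)."""
    counts = Counter(tuple(r[c] for c in key_cols) for r in rows)
    return [k for k, n in counts.items() if n > 1]

# Mirror of the pg_statistic rows above, with staattnum=5 duplicated:
rows = [
    {"starelid": 2608, "staattnum": n, "stainherit": False}
    for n in (1, 2, 3, 4, 5, 5, 6, 7)
]
dupes = find_duplicate_keys(rows, ("starelid", "staattnum", "stainherit"))
```

This flags (2608, 5, False) as the offending key — the same row pair the SQL query surfaced.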
Trial 4:

We restarted the REINDEX, but it failed again!!

2016-08-22 16:01:29.698 PDT rhost=[local] app=psql:user=postgres:db=dbname:ERROR: 1 constraint record(s) missing for rel tablename
2016-08-22 16:01:29.698 PDT rhost=[local] app=psql:user=postgres:db=dbname:STATEMENT: reindex database "dbname";
ERROR: 1 constraint record(s) missing for rel tablename
Trial 5:

Let’s try to VACUUM ANALYZE the table:

dbname=# vacuum analyze tablename;
2016-08-22 16:04:01.282 PDT rhost=[local] app=psql:user=postgres:db=dbname: 1 constraint record(s) missing for rel tablename
2016-08-22 16:04:01.282 PDT rhost=[local] app=psql:user=postgres:db=dbname:STATEMENT: vacuum analyze tablename;
ERROR: 1 constraint record(s) missing for rel tablename

Hrm… it’s still complaining about the constraint.

Trial 6:

Let’s disable the constraint check…

dbname=# update pg_class set relchecks=0 where relname='tablename';
The above update fixed the constraint error.
Trial 7:

Let’s reindex the database again!

dbname =# reindex database "dbname";

Yay, the reindex was successful!

Once the reindex completed successfully, we restarted the recovery cluster without the zero_damaged_pages and ignore_system_indexes settings.

Partial tables recovery through pg_dump  process:

As the database was corrupted, it made sense to kick off pg_dump on the database… We kicked off pg_dump, but it was still showing errors for some of the sequences!!

/usr/lib/postgresql/9.4/bin/pg_dump dbname -p 5433 -Fc >recovery_dbname.dmp
2016-08-22 16:22:09.517 PDT rhost=[local] app=pg_dump:user=postgres:db=dbname:ERROR: invalid page in block 0 of relation base/16389/2825248
2016-08-22 16:22:09.517 PDT rhost=[local] app=pg_dump:user=postgres:db=dbname:STATEMENT: SELECT sequence_name, start_value, increment_by, CASE WHEN increment_by > 0 AND max_value = 9223372036854775807 THEN NULL WHEN increment_by < 0 AND max_value = -1 THEN NULL ELSE max_value END AS max_value, CASE WHEN increment_by > 0 AND min_value = 1 THEN NULL WHEN increment_by < 0 AND min_value = -9223372036854775807 THEN NULL ELSE min_value END AS min_value, cache_value, is_cycled FROM XX_id_seq
pg_dump: [archiver (db)] query failed: ERROR: invalid page in block 0 of relation base/16389/2825248
pg_dump: [archiver (db)] query was: SELECT sequence_name, start_value, increment_by, CASE WHEN increment_by > 0 AND max_value = 9223372036854775807 THEN NULL WHEN increment_by < 0 AND max_value = -1 THEN NULL ELSE max_value END AS max_value, CASE WHEN increment_by > 0 AND min_value = 1 THEN NULL WHEN increment_by < 0 AND min_value = -9223372036854775807 THEN NULL ELSE min_value END AS min_value, cache_value, is_cycled FROM XX_id_seq

We had issues recovering a couple of tables, but we managed to recover most of the tables in the database!

This was our Postgres database  recovery story.

Hopefully, it will help someone who falls into a corruption situation. Any feedback or other ideas are welcome 🙂


Analyzing effect on default_statistics_target on ANALYZE

Last week, I received a request to increase PostgreSQL’s config parameter default_statistics_target on one of our very large DB clusters. The developers filed a ticket to increase default_statistics_target to 10,000 at the cluster level because one of their queries had started doing a full sequential scan instead of an index scan.

By default, Postgres sets default_statistics_target to 100, but you can set it as high as 10,000. This parameter defines how many values are stored in the list of most common values, and also determines the number of rows inspected by the ANALYZE process.

Out of curiosity, I dug into the Postgres ANALYZE code to figure out how many rows are scanned by ANALYZE based on the default_statistics_target value. As you can see on line 1763, ANALYZE inspects up to 300 × default_statistics_target rows.

1763  stats->minrows = 300 * attr->attstattarget;

If the default_statistics_target value is 100, ANALYZE could scan up to 300 × 100 = 30,000 rows to gather accurate statistics.
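The relationship is just a multiplication, but it’s worth tabulating for the candidate settings — ANALYZE’s sample grows linearly with the target, so jumping straight to 10,000 means sampling a hundred times more rows per column:

```python
def analyze_sample_rows(stats_target: int) -> int:
    """Upper bound on rows ANALYZE samples: 300 * attstattarget
    (from the stats->minrows assignment in the Postgres source)."""
    return 300 * stats_target

for target in (100, 1000, 10000):
    print(f"default_statistics_target={target}: up to {analyze_sample_rows(target)} rows")
```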

As the Postgres documentation on ANALYZE notes, blindly increasing the value at the cluster level could hurt smaller queries, because query planning time can increase significantly when default_statistics_target is set to its maximum, which can eventually degrade performance for simple queries. The next logical step for me is to find the optimal default_statistics_target value for the specific table with the slow query. I will probably start by increasing the value to 1,000 and then raise or lower it based on EXPLAIN ANALYZE results.

Hopefully this finding will help when you try to tune default_statistics_target for your database server. Let me know if you have alternate views or ideas for tuning this setting. Thanks!

Let’s hangout w/ Postgres people on Slack!

During casual discussions with Postgres users at conferences and/or online, I came across this question at least 3-4 times: “Is there a Slack channel for Postgres?” The answer was “No”.

The idea of bringing up a Slack channel had been cooking in my mind for a long time, but I did not have time to execute it! I finally brought up the Postgres Team Slack at the end of Nov 2015, and there are already 100+ users on it. You can sign up here.

Here are the benefits of connecting to the Slack channel:

  • If you are already using Slack for other communication, it is easy to add another company/team to the Slack application.
  • Slack provides integration with a number of services, e.g. Twitter with the #postgres and #postgresql hashtags, Postgres Weekly news, and messages from the @postgres Twitter handle
  • I have yet to integrate Slack with the Postgres git repo because I don’t have access to add a post-commit hook on the repo. Please email me @, if you can help me set up the integration for the Postgres git repo. I will provide instructions and the webhook URL to add to the GitHub repo settings.
  • You can access Slack from your mobile devices
  • Even if you aren’t connected, you still have the history of the chat

I hope this channel will be useful in driving more people to discuss Postgres at scale!!


If you are already using the Postgres IRC channel, I’m not insisting that you use this new Slack channel. I have set up the Slack channel mainly for Slack users who are already using it for other purposes! Thanks for understanding.

Postgres Replication Slot & pg_basebackup

On the #postgresql IRC channel, someone was having a problem bringing up a standby database server from a backup taken using pg_basebackup, with recovery.conf configured to use a replication slot.

The reporter was following the steps below:

  1. Set max_replication_slots=5 and wal_keep_segments=0, and restart the primary database
  2. Create a replication slot: SELECT * FROM pg_create_physical_replication_slot('standby')
  3. Run pg_basebackup -D - -Ft | bzip2 > backup.tar.bz2
  4. Copy backup.tar.bz2 to the standby server
  5. Start PostgreSQL on the standby

The standby server was throwing the following error during startup:

 requested WAL segment 000000010000000C000000AE has already been removed.

Question from Reporter:

Should I set wal_keep_segments to some non-zero value if I want to use replication slots? From reading the docs, I understood that I shouldn’t need to set wal_keep_segments if I use replication slots.

Short answer :

As of now, pg_basebackup doesn’t have support for replication slots, so it can’t tell the primary to retain the WAL files the new standby will need; the primary is free to remove them before the standby ever connects to its slot.

While that feature is being worked on, you can use one of the following solutions to work around the problem for now.

Solution 1:

On the primary database server, set wal_keep_segments to a sane value relative to the database traffic volume and the time it takes to make and transfer the backup and bring up the standby server. This setting keeps that number of WAL files around on the primary, so when you bring up the standby using the replication slot, it can start applying WAL files and eventually connect to the physical slot you created.
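A back-of-the-envelope way to pick that value — a sketch under assumed numbers, not a formula from the docs: retain enough 16 MB segments to cover the WAL generated during the backup-and-restore window, padded by a safety factor.

```python
import math

WAL_SEGMENT_MB = 16  # default WAL segment size in 9.x

def wal_keep_segments(wal_mb_per_min: float, window_min: float,
                      safety: float = 2.0) -> int:
    """Segments to retain: WAL produced during the backup/transfer/startup
    window, times a safety factor, divided by the segment size."""
    return math.ceil(wal_mb_per_min * window_min * safety / WAL_SEGMENT_MB)

# e.g. ~8 MB of WAL per minute and a 2-hour backup-and-restore window:
n = wal_keep_segments(8, 120)
```

With those assumed numbers, you would set wal_keep_segments = 120 (about 1.9 GB of retained WAL); plug in your own traffic and timing measurements.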

Solution 2:

Create the slot before taking the backup and use the slot at least once, so the primary DB server will keep the WAL files around.

  1. On the primary db server, create the slot
  2. Connect to the slot using pg_receivexlog: /usr/pgsql-9.4/bin/pg_receivexlog -h primary-db-server -p 5432 -U replication -S 'standby1' -n -v -D /var/lib/pgsql/9.4/archive
  3. Kill the pg_receivexlog process
  4. Take the backup using pg_basebackup
  5. Copy the backup and start up the standby database
  6. The standby database should be able to connect to the slot now!
  7. Don’t forget to drop the slot if you end up not using it; otherwise the primary DB server will keep the WALs indefinitely

I would prefer Solution #2 over Solution #1 because you don’t have to figure out an optimal value for wal_keep_segments. Plus, you can use the pg_receivexlog method as a permanent archiving solution. If you want more details, I will be presenting a tutorial on replication, “Out of the Box Replication in Postgres 9.4”, at the Postgres Conference in San Francisco on Nov 17-18. It will be a hands-on tutorial on setting up robust replication for production DB servers without any third-party tools in Postgres 9.4.

If you haven’t registered, now is the time, because the conference schedule looks amazing! You can still register with the discount code SeeMeSpeak, which provides a 20% discount.

Let me know if you have better solutions than those discussed in this post. I’m looking forward to seeing you at the conference.