Tag Archives: postgres

Recovering Postgres database from disk level corruption!!

A couple of weeks ago, I had to deal with corrupted Postgres database cluster. At the end, we couldn’t able to recover some of the data but managed to recover most part of it. Having experience working with dozens of database systems, I’m pleasantly surprised¬†to experience resiliency of Postgres database.

Kudos to Postgres Development team for building the most resilience database in the world ūüôā

Here is my Postgres database recovery story ūüôā

Disclaimer:

I'm posting the steps carried out during the recovery process for information purpose only. This post doesn't provide any guarantee that it will work for your use-case and/or environment.

Note:¬†The actual database name has been replaced with “dbname” ¬†and actual table names with “tablename”.

One of the Postgres DB cluster database experienced disk level corruption thus we were hitting this error:

postgres=# \c dbname
FATAL: could not read block 0 in file "base/16389/11930": Input/output error

Uh oh?? Really bad, isn’t it? Fortunately, it wasn’t mission critical system so we managed to take extended ¬†outage and work on partial¬†recovery process because we didn’t want to loose all the data!!

Once we received the complaint, we immediately backed up corrupted  database and created recovery cluster to bring up on different  server so we can go through recovery efforts!!

Trial 1:

As many of you know, the first option is to bring up recovery database cluster with  zero_damaged_pages=on   . You can set the value in Postgres config file and try to  reindex system catalog:

reindexdb -p 5433 --system dbname
reindexdb: could not connect to database dbname: FATAL: index "pg_index_indexrelid_index" contains unexpected zero page at block 0 
HINT:  Please REINDEX it.

Doh! Still, we could still not be able to connect to database !!

Trial 2:

If you aren’t aware, you should note down that there is a way to ignore indexes at system level. We started up recovery cluster with ignore_system_indexes=true¬†setting:

pg_ctl -D /data -o '-c ignore_system_indexes=true'
restarted
dbname=# \c dbname

Yay! I could able to connect to DB now!

Trial 3:

Let’s try to reindex the database…

dbname=# reindex database "dbname";
NOTICE: table "pg_catalog.pg_class" was reindexed
2016-08-22 15:53:14.179 PDT rhost=[local] app=psql:user=postgres:db=dbname:ERROR: could not create unique index "pg_statistic_relid_att_inh_index"
2016-08-22 15:53:14.179 PDT rhost=[local] app=psql:user=postgres:db=dbname:DETAIL: Key (starelid, staattnum, stainherit)=(2608, 5, f) is duplicated.
2016-08-22 15:53:14.179 PDT rhost=[local] app=psql:user=postgres:db=dbname:STATEMENT: reindex database "dbname";
ERROR: could not create unique index "pg_statistic_relid_att_inh_index"
DETAIL: Key (starelid, staattnum, stainherit)=(2608, 5, f) is duplicated.

As the table is corrupted with duplicate entries, let’s find out and fix them.

dbname=# select starelid, staattnum, stainherit from pg_catalog.pg_statistic where starelid=2608 order by 2;
starelid | staattnum | stainherit
----------+-----------+------------
2608 | 1 | f
2608 | 2 | f
2608 | 3 | f
2608 | 4 | f
2608 | 5 | f
2608 | 5 | f
2608 | 6 | f
2608 | 7 | f
(8 rows)

Let’s remove one of the entry based on XMIN :

dbname=# delete from pg_catalog.pg_statistic where starelid=2608 and staattnum=5 and xmin=1228447;
DELETE 1
Trial 4:

Restart REINDEX but it failed again!!

2016-08-22 16:01:29.698 PDT rhost=[local] app=psql:user=postgres:db=dbname:ERROR: 1 constraint record(s) missing for rel tablename
2016-08-22 16:01:29.698 PDT rhost=[local] app=psql:user=postgres:db=dbname:STATEMENT: reindex database "dbname";
ERROR: 1 constraint record(s) missing for rel tablename
Trial 5:

Let’s try to vacuum analzye the table

dbname=# vacuum analyze tablename;
2016-08-22 16:04:01.282 PDT rhost=[local] app=psql:user=postgres:db=dbname: 1 constraint record(s) missing for rel tablename
2016-08-22 16:04:01.282 PDT rhost=[local] app=psql:user=postgres:db=dbname:STATEMENT: vacuum analyze tablename;
ERROR: 1 constraint record(s) missing for rel tablename

hrm…it’s still complaining about constraint

Trial 6:

let’s disable constraint check….

dbname=# update pg_class set relchecks=0 where relname='tablename';
UPDATE 1
The above update fixed the  the constraint error
Trial 7:

Let’s reindex the database again!

dbname =# reindex database "dbname";

Yay, Reindex is successful.

Once the reindex is successfully completed, we restarted recovery cluster without zero_damaged_page and ignore_system_indices settings.

Partial tables recovery through pg_dump  process:

As the database is corrupted, it makes sense to kick off the pg_dump on the database … we kicked off the pg_dump but it was still showing some of the sequences with errors!!

/usr/lib/postgresql/9.4/bin/pg_dump dbname -p 5433 -Fc >recovery_dbname.dmp
2016-08-22 16:22:09.517 PDT rhost=[local] app=pg_dump:user=postgres:db=dbname:ERROR: invalid page in block 0 of relation base/16389/2825248
2016-08-22 16:22:09.517 PDT rhost=[local] app=pg_dump:user=postgres:db=dbname:STATEMENT: SELECT sequence_name, start_value, increment_by, CASE WHEN increment_by > 0 AND max_value = 9223372036854775807 THEN NULL WHEN increment_by < 0 AND max_value = -1 THEN NULL ELSE max_value END AS max_value, CASE WHEN increment_by > 0 AND min_value = 1 THEN NULL WHEN increment_by < 0 AND min_value = -9223372036854775807 THEN NULL ELSE min_value END AS min_value, cache_value, is_cycled FROM XX_id_seq
pg_dump: [archiver (db)] query failed: ERROR: invalid page in block 0 of relation base/16389/2825248
pg_dump: [archiver (db)] query was: SELECT sequence_name, start_value, increment_by, CASE WHEN increment_by > 0 AND max_value = 9223372036854775807 THEN NULL WHEN increment_by < 0 AND max_value = -1 THEN NULL ELSE max_value END AS max_value, CASE WHEN increment_by > 0 AND min_value = 1 THEN NULL WHEN increment_by < 0 AND min_value = -9223372036854775807 THEN NULL ELSE min_value END AS min_value, cache_value, is_cycled FROM XX_id_seq

We had issue recovering a couple of tables but we managed to recover the most of the tables in the database  !

This was our Postgres database  recovery story.

Hopefully, it will help someone in case they fall into corruption situation. Any feedback or other ideas are welcome ūüôā

 

pgbadger log analysis for Postgres RDS instances

Last year, I have presented on¬†Postgres RDS¬†in Amazon at couple of conferences but I haven’t got change to follow¬†up on my experience working with RDS after them. If you aren’t¬† aware, I recently changed job & started working at WithMe, ¬†a disruptive¬†start-up that focuses on three primary solutions for commerce, including WithMe Business, Financial and Retail.

At WithMe, we¬†are using a¬†fair bit¬†of Postgres RDS instances.¬†After joining, I reviewed the setup and found out the¬†pgabadger log analysis is not setup on the database servers. Even though the¬†databases are still in development phase, it makes sense to have reports that will help to provide more insight to software developers on the slow queries. I had other projects (schema(s) design, database migrations,DW system) going on so i wasn’t able to spend time identify and tuning queries to assist¬†software engineers to expedite their develpement so¬†we¬†decided to prioritize log analysis project.

At previous job, I have setup pgbadger reports on number of servers but most of the time I had to deal with real db servers or at the least had ssh access on the server.  On the RDS, you can only have psql access, no shell level access on the instance.  Additionally, You have to deal with AWS tools and access methods.

I followed following steps to setup pgbadger reports for Postgres RDS instances:

  1. Brought up pgadmin ec2 instance to be used as utility server.
  2. Install AWS tool to interact with API
  3. Make sure /root/.aws/aws_credential_file file in place

    [denishpatel@pgadmin1¬†~]$ aws configure –profile oregon
    AWS Access Key ID [None]: XXX
    AWS Secret Access Key [None]: XXX
    Default region name [None]: us-west-2
    Default output format [None]: text

    [denishpatel@pgadmin1¬†~]$ aws rds describe-db-instances –profile oregon –output table

  4. Postgres logging parameters are set in parameter groups :

    log_min_duration_statement=5

    log_line_prefix=’%t:%r:%u@%d:[%p]:’

  5. I have come up with quick shell script to run reports. The  script is located @ https://github.com/denishpatel/pgbadger_on_rds
  6. You can easily cron them. As the new servers will be added, I will be adding new cron entries for the server:

    # pgbadger reports

    0 7 * * * ~/run_pgbadger_rds.sh -i devel-postgres -r us-west-2 -c 1 >/dev/null 2>&1
    15 7 * * * ~/run_pgbadger_rds.sh -i prod-postgres-rds -r us-east-1 -c 1 >/dev/null 2>&1

  7. I have setup nginx web server on pgadmin to server to serve¬†html reports so team doesn’t have to download html files to review them.

In case, you are tasked to setup¬†pgbadger report on Postgres RDS instance, this post¬†might¬†help¬†! I haven’t spent a lot of time writing a script so script provides very bare minimal to run reports¬†¬†but it can have a lot of¬†improvements.¬†Hope to see your contribution to improve the script ūüôā

As always, I’m looking forward to see some comments from¬†your experience working with Postgres RDS ūüôā