Capistrano Multistage file encoding gotcha

January 31st, 2013

We use Capistrano exclusively for deployment to live and pre-live environments, so imagine my surprise when, after copying a configuration into files on a new app, I started seeing this:

Flexo:app dansketcher$ cap production misc:show_path
    triggering load callbacks
  * 2013-01-31 13:17:53 executing `production'
    triggering start callbacks for `misc:show_path'
  * 2013-01-31 13:17:53 executing `multistage:ensure'
  * 2013-01-31 13:17:53 executing `misc:show_path'
  * executing "echo $PATH"
`misc:show_path' is only run for servers matching {}, but no servers matched

The ‘production’ stage was configured, but even adding things like raise '' in the config/deploy/production.rb file failed to have an effect. It was like the file wasn’t even being loaded, except that trying to execute against a non-existent stage (‘work_damn_you’ stage, anyone?) did raise an error.

In the end, out of desperation, I ran dos2unix on the files, which weirdly complained that they were binary files (a common symptom of UTF-16 encoding, whose null bytes look binary to line-oriented tools). I re-saved the files as UTF-8 and all started working.
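If you want to spot the culprit files up front rather than by trial and error, a quick check for a UTF-16 byte order mark does the trick. A minimal Python sketch (the helper name `has_utf16_bom` and the glob path are my own, not anything Capistrano provides):

```python
import glob

def has_utf16_bom(path):
    """Return True if the file starts with a UTF-16 byte order mark."""
    with open(path, 'rb') as f:
        bom = f.read(2)
    return bom in (b'\xff\xfe', b'\xfe\xff')  # little- / big-endian BOMs

# Flag any suspicious stage files before Capistrano silently ignores them
for path in glob.glob('config/deploy/*.rb'):
    if has_utf16_bom(path):
        print('%s looks like UTF-16 -- re-save it as UTF-8' % path)
```

Note this only catches BOM-prefixed files; `file config/deploy/*.rb` is another quick way to spot anything that isn't plain UTF-8 text.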

So be warned.

Monitoring PostgreSQL Streaming Replication

January 27th, 2013

Having migrated to PostgreSQL 9.2 and set up Streaming Replication successfully, the next obvious thing you want to know is, “Is it still working?” Fortunately, PostgreSQL comes out-of-the-box with a few tools that can help.

pg_stat_replication

pg_stat_replication is really only useful on a master database server. It shows the slaves connected to the master, along with some other rather opaque information about ‘where it’s up to’. For example:

template1=# select * from pg_stat_replication;
  pid  | usesysid |  usename   | application_name | client_addr | client_hostname | client_port |         backend_start        |   state   | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state 
-------+----------+------------+------------------+-------------+-----------------+-------------+------------------------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
 30213 |   xxxxxx | replicator | walreceiver      | 10.10.0.10  |                 |       52889 | 2013-01-24 17:06:06.11849+00 | streaming | 9/42008AD0    | 9/42008AD0     | 9/42008AD0     | 9/42007FB0      |             0 | async
(1 row)

To a noob like me the _location fields are a bit opaque (apart from being hex and probably related to WAL segments), but Alexander Fortin’s post to the postgresql mailing list helps make this a bit clearer.

I’ve actually converted that statement to not require his custom hex_to_int function, like so:

SELECT
    client_addr,
    sent_offset - (
        replay_offset - (sent_xlog - replay_xlog) * 255 * 16 ^ 6 ) AS byte_lag
FROM (
    SELECT
        client_addr,
        ('x' || lpad(split_part(sent_location,   '/', 1), 8, '0'))::bit(32)::bigint AS sent_xlog,
        ('x' || lpad(split_part(replay_location, '/', 1), 8, '0'))::bit(32)::bigint AS replay_xlog,
        ('x' || lpad(split_part(sent_location,   '/', 2), 8, '0'))::bit(32)::bigint AS sent_offset,
        ('x' || lpad(split_part(replay_location, '/', 2), 8, '0'))::bit(32)::bigint AS replay_offset
    FROM pg_stat_replication
) AS s;

which gives an output like this:

 client_addr | byte_lag 
-------------+----------
 10.10.0.10  |        0

meaning that slave 10.10.0.10 is 0 bytes behind the master. In my experience it stays pretty low most of the time, but how big is too big? As long as it stays below the size of 1-2 WAL segments (use “SHOW wal_segment_size” to see what your configuration is), things are fine. If it falls much further behind, you might want to find out what’s going on.
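For anyone computing this outside the database (say, in a monitoring script that just pulls the raw `sent_location`/`replay_location` strings), the same arithmetic translates directly. A small Python sketch, with the helper names my own, mirroring the query's 255 * 16^6 bytes-per-xlog-file factor:

```python
def wal_location_to_bytes(location):
    """Convert a WAL location like '9/42008AD0' into an absolute byte
    position, using the same 255 * 16**6 factor as the SQL above."""
    xlog, offset = location.split('/')
    return int(xlog, 16) * 255 * 16 ** 6 + int(offset, 16)

def byte_lag(sent_location, replay_location):
    """How many bytes the slave's replay is behind what the master sent."""
    return wal_location_to_bytes(sent_location) - wal_location_to_bytes(replay_location)

# Using the sample row above: sent 9/42008AD0, replayed 9/42007FB0
print(byte_lag('9/42008AD0', '9/42007FB0'))
```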

SELECT pg_last_xact_replay_timestamp()

There are a few functions that only work on the slave too, most notably pg_last_xact_replay_timestamp(), which shows the timestamp of the last transaction applied on the slave. With a reasonably busy server, using something like:

SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::INTEGER AS lag_seconds;

is good enough to see how far behind the slave is. However, when there’s not enough activity, this number will keep increasing even though the slave is completely caught up, because now() gets further and further away from the last committed transaction’s timestamp. Instead, per this post on the mailing list, compare the location information too:

SELECT
    CASE
        WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location()
            THEN 0
        ELSE EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
    END AS log_delay;
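The same caught-up guard is easy to replicate in a monitoring script once you've fetched those three values from the slave. A hedged Python sketch (the function and argument names here are mine, not part of any PostgreSQL client API):

```python
def replication_lag_seconds(receive_location, replay_location,
                            last_replay_timestamp, now):
    """Report replay lag in seconds, treating a fully caught-up slave as
    zero lag even when there have been no recent transactions to replay."""
    if receive_location == replay_location:
        return 0.0  # everything received has been replayed: no lag
    return (now - last_replay_timestamp).total_seconds()
```

This avoids the false alarm where a quiet but healthy slave reports ever-growing "lag" purely because nothing has been committed lately.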

Finally, for anyone using ServerDensity, I updated reinbach’s plugin to include a simple count check of connected slaves on a master, and the slave lag function above.
See it here: https://github.com/dansketcher/sd-postgresql

PostgreSQL on Windows Server 2008 problems

January 25th, 2013

Having been running PostgreSQL 8.4 for a number of years in production (if it ain’t broke…) the time came to do an upgrade (8.4 on Centos 5 was getting a bit dated). We decided, after some discussion, to take up our hosting provider on a managed HA Windows cluster with shared storage and a redundant slave replica. This was mainly to reduce our own internal server management overhead, as they would be taking care of all parts of the servers’ maintenance.

This all sounded great. After some tests at moderate load, we migrated the databases one by one via a piped pg_dump/pg_restore, which was simplicity itself apart from the outage. But then we started getting reports of slowness, which was worrying. I updated the application to log slow queries (easier to do at the application level so that they could go into Splunk) and saw some instances (not all) of queries that previously took ~3 seconds taking anywhere from 9 to 200(!) seconds. As you can imagine, this was very disconcerting.

Of course, the server had been configured beforehand using Windows ‘best practice’ settings, which are mostly the same as on Linux but with smaller shared_buffers and effective_cache_size settings, per the PostgreSQL wiki page.

The primary approach we started with was to use EXPLAIN ANALYSE extensively to work out whether it was just a case of this new server not being ‘run in’ effectively and not having enough stats. Some solid gains were made, but the problem was not ‘average’ query time; it was ‘worst case’ query time. Some queries would run fine one day but terribly the next, and of course once the PostgreSQL cache was primed it was as speedy as ever. Many indexes were changed and updated, the DB was VACUUM ANALYSEd, all the usual stuff. In a display of desperation, we even ‘defragged’ the disk (which would have felt very retro if it were not so unpleasant).

At no point was RAM, CPU or Disk IO even stretched. The servers were effectively over-specced, as we ramped them up to see if that would help, and it did, but not to the extent needed in those worst-case queries.

Another very annoying thing was that the PostgreSQL server was apparently not able to maintain connections properly. Despite max_connections being set more than high enough, we consistently saw these kinds of errors reported by the client (class names are from the ruby-pg client):

PGError: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

PGError: no connection to the server

PG::Error: could not receive data from server: Connection timed out

The last of these errors was particularly nasty, as usually the statement was executed successfully, but the client had no idea and no record, so while each DB was internally consistent, inter-db communication processes would get out of sync and needed to be manually reconciled.

So, how did we solve these problems?

We didn’t. We moved everything off the cluster to a new Linux PostgreSQL 9.2 master/slave streaming replication setup. Many many thanks to Bartek Ciszkowski for his excellent writeup, which laid a very solid foundation for our new setup. Since the new DB servers went into production, all of the work we’ve done with indexes has meant the sites are performing faster than ever, and we’ve not seen one of the errors above either (touch wood!).

Incidentally, we use (async) streaming replication alongside a separate PITR (point-in-time recovery) backup setup, which uses the old-style backup commands rather than pg_basebackup so that we can rsync-update the base backup regularly without having to store a new copy every time. Also note that, because of how replication is implemented, you can’t run these backup commands on the slave; they have to be run on the master, which is a shame. Still, we rsync to the slave for the PITR backup, so at least the master isn’t doing double disk IO.
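For the curious, the old-style approach is roughly this shape. This is only a sketch: the paths, backup label, and rsync destination are hypothetical, though pg_start_backup()/pg_stop_backup() are the real 9.2-era backup control functions, and as noted they must run on the master:

```shell
# Tell the master a base backup is starting (second arg requests a fast checkpoint)
psql -U postgres -c "SELECT pg_start_backup('nightly_base', true);"

# Refresh the existing base copy in place rather than storing a new one each time,
# skipping pg_xlog since WAL is shipped separately
rsync -a --delete --exclude=pg_xlog/ /var/lib/pgsql/9.2/data/ backuphost:/backups/base/

# Tell PostgreSQL the backup is finished so it can complete the WAL bookkeeping
psql -U postgres -c "SELECT pg_stop_backup();"
```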

Useful links:
A PostgreSQL 9.2 Streaming Replication Primer
New Replication and Recovery Features in PostgreSQL 9.1
And of course, the PostgreSQL docs

Unfuddle Git deploy gotcha

September 27th, 2011

Having provisioned a new server, I was updating my Capistrano deploy.rb to handle the new box, and on testing a new deploy, I was getting this:

** [server :: out] == Unfgit: =============================================================
** [server :: out] Permission denied public key error. account: xxxxx; repository: xxxxx; person: none; message: no person found on this account using this public key
** [server :: out] ========================================================================

Annoying, seeing as it worked just fine on the old box.

I went onto the machine and confirmed that, using the ssh key there, I could check out the repo properly. Unfuddle support tried to help but could not. Fortunately, I came across this GitHub resource which explains some of the settings. I think on the third or so re-reading I noticed the real impact of this:

Agent Forwarding

If you’re using your own private keys for git you might want to tell Capistrano to use agent forwarding with this command. Agent forwarding can make key management much simpler as it uses your local keys instead of keys installed on the server.

ssh_options[:forward_agent] = true

So, I tried removing that line from my cap deploy.rb, and the checkout worked fine.

Hope this helps someone else!

Configuring Rails and Splunk (Part 3)

September 1st, 2010

Success! I have managed to get Splunk to group Rails logs into logical groups.

First, I changed my Rails web server to be a light forwarder, as I wanted the indexer to do the grouping (as per the behaviour breakdown from the previous part of this series). Then:

$SPLUNK_HOME/etc/system/local/inputs.conf

[default]
host = web.server.example.com

[monitor:///var/www/sites/web.application.example.com/shared/log/production.log*]
disabled = false
followTail = 0
sourcetype = rails_log

[monitor:///var/log/httpd]
disabled = false
followTail = 0
sourcetype = access_combined

I’m not sure if I need to have the [rails_log] declaration in the LightForwarder’s etc/system/local/props.conf file, but it’s there as an empty declaration just in case.

Before we go further, note that my Rails logging uses log4r and a custom formatter, so my logs look like this:

[ApplicationName][INFO] 100819 06:37:07 :: 

Processing SessionsController#new (for 172.0.0.1 at 2010-08-19 06:37:07) [GET]
[ApplicationName][INFO] 100819 06:37:07 ::   Parameters: {"unauthenticated"=>"true", "action"=>"new", "controller"=>"sessions"}
[ApplicationName][INFO] 100819 06:37:07 :: Rendering template within layouts/agent
[ApplicationName][INFO] 100819 06:37:07 :: Rendering sessions/new
[ApplicationName][INFO] 100819 06:37:07 :: Completed in 20ms (View: 12, DB: 5) | 200 OK [https://web.application.example.com/user/sign_in?unauthenticated=true]
[ApplicationName][INFO] 100820 01:01:25 :: 

Processing Index#index (for 172.0.0.1 at 2010-08-20 01:01:25) [GET]
[ApplicationName][INFO] 100820 01:01:25 ::   Parameters: {"action"=>"index", "controller"=>"index"}
[ApplicationName][INFO] 100820 01:01:25 :: Filter chain halted as [:authenticate] rendered_or_redirected.
[ApplicationName][INFO] 100820 01:01:25 :: Completed in 3ms (View: 1, DB: 4) | 401 Unauthorized [https://web.application.example.com/]

The important thing to note here is that Rails inserts 2 linebreaks before putting “Processing blah blah”, so we’ll use that to split. The props.conf documentation recommends using the LINE_BREAKER setting to handle multiline breaks, and it seemed easier to me as it uses just one regexp to break the input stream. The important part for this kind of splitting is to know:

* Wherever the regex matches, the start of the first matching group is considered the end of the 
  previous event, and the end of the first matching group is considered the start of the next event.
* The contents of the first matching group is ignored as event text.

What this means is that the LINE_BREAKER regexp (using PCRE) must have a match group that can be discarded from the matched text. If you grep over the default provided files (`grep -Rn "LINE_BREAKER" /opt/splunk/`), you’ll see that most of the options use a linebreak, or multiples thereof, as the first match group. We’ll do the same. Other than that, we just want to match on the “Processing” text as the start delimiter, which comes out to the below:

On the indexer’s $SPLUNK_HOME/etc/system/local/props.conf

[rails_log]
LINE_BREAKER = ([\r\n]).* \:\: [\r\n]+Processing

Now, on the indexer, the rails_log items are multiline events!
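If you want to sanity-check a LINE_BREAKER regexp without repeatedly re-indexing, the splitting semantics described above (event boundary at the first capture group, group contents discarded) can be mimicked in a few lines of Python. This is only a simulation of the documented rule, not Splunk itself, and Python's re is close enough to PCRE for this pattern:

```python
import re

# The same pattern as in props.conf above
LINE_BREAKER = re.compile(r"([\r\n]).* :: [\r\n]+Processing")

def split_events(stream):
    """Split a raw log stream the way Splunk applies LINE_BREAKER:
    each event ends at the start of the first capture group, the next
    begins at its end, and the group itself is discarded."""
    events, pos = [], 0
    for match in LINE_BREAKER.finditer(stream):
        events.append(stream[pos:match.start(1)])
        pos = match.end(1)
    events.append(stream[pos:])
    return events

sample = (
    "[ApplicationName][INFO] 100819 06:37:07 :: Completed in 20ms\n"
    "[ApplicationName][INFO] 100820 01:01:25 :: \n\n"
    "Processing Index#index (for 172.0.0.1) [GET]\n"
    "[ApplicationName][INFO] 100820 01:01:25 :: Completed in 3ms\n"
)

for event in split_events(sample):
    print(repr(event))
```

Running this on the sample shows two events, with the second starting at the log line that precedes the double-linebreak-plus-“Processing” marker, which is exactly the grouping we want in the index.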

Configuring Rails and Splunk (Part 2)

August 30th, 2010

As my wrestle with configuring Splunk continues, I discovered the following helpful article explaining what happens, and where, when forwarding data to a Splunk server.

Basically, it depends. Which is not surprising, given how complicated Splunk is, but it’s good to have on hand!

Combined with details on configuring linebreaking for multi-line events, perhaps I can get my rails logs handled as one entity in the near future…

Configuring Rails and Splunk (Part 1)

August 19th, 2010

To increase our application intelligence, I’ve started installing Splunk. However, what was not clear to me was the best way of getting my Web/App Rails logs into Splunk. The Web server is on a different box (as is the DB server) so I wanted to know the best way of getting the log data to the Splunk host.

Posts like this imply that syslog-ng is a good way of getting data to Splunk, but the hidden detail is that if the Splunk server is down, the data is lost. A better way (thanks to the helpful folk on IRC) is to set up a Splunk forwarder on the Web/App/DB hosts and go from there. I initially set them up as (free licenced) Standard Forwarders, so that the web interface is still running for configuration. Remember that any firewall (iptables) will need to be opened for the management ports (8000, 8089) and the forwarding port (9997).
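The firewall side of that can be sketched as iptables rules. Treat these as illustrative only: rule placement, chains, and any source restrictions will depend on your existing ruleset.

```shell
# On the forwarder hosts: allow the Splunk web UI and management port
iptables -A INPUT -p tcp --dport 8000 -j ACCEPT
iptables -A INPUT -p tcp --dport 8089 -j ACCEPT

# On the indexer: allow forwarded data in from the forwarders
iptables -A INPUT -p tcp --dport 9997 -j ACCEPT
```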

Something else I discovered (but can no longer find in the docs) is that if you configure data sources using the web interface on a Forwarder, but subsequently change the instance to a LightForwarder, you also have to move the configuration files from $SPLUNK_HOME/etc/apps/search/local/ to $SPLUNK_HOME/etc/system/local/.

Getting Phusion Passenger to run under SELinux on Centos 5.4

February 25th, 2010

Edit 2010-03-17: Harder than it first seemed! That permission only allows Apache to load Passenger. See here for the full requirements

I’ve just started playing with SELinux on Centos, and while the idea is great, it’s not exactly what I’m used to. Take, for example, adding Phusion Passenger to Apache. When I first restarted, I got this:

[root@localhost modules]# /etc/init.d/httpd restart
Stopping httpd:                                            [FAILED]
Starting httpd: httpd: Syntax error on line 210 of /etc/httpd/conf/httpd.conf: Syntax error on line 1 of /etc/httpd/conf.d/passenger.conf: Cannot load /opt/ruby-enterprise-1.8.7-2009.10/lib/ruby/gems/1.8/gems/passenger-2.2.9/ext/apache2/mod_passenger.so into server: /opt/ruby-enterprise-1.8.7-2009.10/lib/ruby/gems/1.8/gems/passenger-2.2.9/ext/apache2/mod_passenger.so: failed to map segment from shared object: Permission denied
                                                           [FAILED]

Huh?

So it turns out, after reading through some Google results and subsequently man httpd_selinux, that the following will fix it:

chcon -R -h -t httpd_sys_script_exec_t /opt/ruby-enterprise-1.8.7-2009.10/lib/ruby/gems/1.8/gems/passenger-2.2.9/ext/apache2/mod_passenger.so

And so it does. “httpd_sys_script_exec_t” allows Apache to execute the SO, and we’re all good to restart.

autocomplete sh: <( compgen -d -- '' ): No such file or directory

January 15th, 2010

Not much of a title, I know, but something for fellow stumped Google-wanderers.

I’ve been playing around with deprec for automatically provisioning new servers for my Rails applications. However, I hit a strange problem where certain command-line auto-completions did not work; I’d get an error message like this:

$ ls
temp

$ cd [hit tab to get autocomplete]
-sh: <( compgen -d -- '' ): No such file or directory
-sh: <( compgen -f -X  -- '' ): No such file or directory

The confusing thing was that only the unprivileged user exhibited this behaviour, not the root user.

I tried reinstalling the bash-completion package to no avail (even an aptitude purge followed by aptitude install did not work). Eventually I came across this rather long post on bugs.debian.org where someone else had the same problem. Essentially, it comes down to the unprivileged user using /bin/sh as the default shell; under these circumstances, autocompletion is supposed to not function (although presumably without error). This is despite /bin/sh being a symlink to /bin/bash.

The fix, then, is actually quite simple. Change the default shell for all your unprivileged users to bash:

# usermod -s /bin/bash dansketcher

and then when you make new users, make sure you select bash explicitly:

# useradd newuser -m -s /bin/bash

Free year-long SSL Certificate

January 10th, 2010

Just discovered StartSSL’s Free SSL Certificate: a 1-year (chained) SSL certificate for free.

I’m not affiliated in any way – I just like free stuff.