software development

PostgreSQL on Windows Server 2008 problems

Posted by dansketcher on January 25, 2013
hosting, postgresql, replication, software development / No Comments

Having been running PostgreSQL 8.4 for a number of years in production (if it ain’t broke…) the time came to do an upgrade (8.4 on Centos 5 was getting a bit dated). We decided, after some discussion, to take up our hosting provider on a managed HA Windows cluster with shared storage and a redundant slave replica. This was mainly to reduce our own internal server management overhead, as they would be taking care of all parts of the servers’ maintenance.

This all sounded great. After some tests at moderate load, we migrated the databases one by one via a piped pg_dump/pg_restore, which was simplicity itself apart from the outage. But then, we started getting some reports of slowness, which was worrying. I updated the application to log slow queries (easier to do this at the application level so that they could go into Splunk) and I was seeing SOME instances (not all) of query execution previously taking ~3sec take anywhere from 9 seconds to 200(!) seconds. As you can imagine, this was very disconcerting.

Of course, beforehand the server was configured using Windows ‘best practice’ settings, which is mostly the same as Linux but to have a smaller shared_buffers and effective_cache_size settings per the PostgreSQL Wiki page.

The primary approach we started to take was to use EXPLAIN ANALYSE extensively to work out if it was just a case of this new server not being ‘run in’ effectively and not having enough stats. Some solid gains were made, but the problem was not ‘average’ query time, it was the ‘worst case’ query time. Some queries would run fine one day but terribly the next, and of course once the PostgreSQL cache was primed it was as speedy as ever. Many indexes were changed and updated, the DB was VACUUM ANALYSEd, all the usual stuff. In a display of desperation, we even ‘defragged’ the disk (which would have felt very retro if it was not so unpleasant)

At no point was RAM, CPU or Disk IO even stretched. The servers were effectively over-specced, as we ramped them up to see if that would help, and it did, but not to the extent needed in those worst-case queries.

Another very annoying thing was that apparently the PostgreSQL server was not able to maintain connections properly. Despite having more than enough in the max_connections setting, we consistently saw these kinds of errors reported by the client (namespace from the ruby-pg client):

PGError: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

PGError: no connection to the server

PG::Error: could not receive data from server: Connection timed out

The last of these errors was particularly nasty, as usually the statement was executed successfully, but the client had no idea and no record, so while each DB was internally consistent, inter-db communication processes would get out of sync and needed to be manually reconciled.

So, how did we solve these problems?

We didn’t. We moved everything off the cluster to a new Linux PostgreSQL 9.2 master/slave streaming replication setup. Many many thanks to Bartek Ciszkowski for his excellent writeup, which laid a very solid foundation for our new setup. Since the new DB servers went into production, all of the work we’ve done with indexes has meant the sites are performing faster than ever, and we’ve not seen one of the errors above either (touch wood!).

Incidentally, we use (async) streaming replication with another PITR recovery setup, which uses old-style backup commands rather than pg_basebackup, so that we can rsync update the base regularly without having to store a new copy every time. Also note, because of the implementation of replication, you can’t run these backup commands on the slave, they have to be run on the master, which is a shame. Still, we rsync to the slave for the PITR backup so at least it’s not doing double disk IO.

Useful links:
Relink A PostgreSQL 9.2 Streaming Replication Primer
New Replication and Recovery Features in PostgreSQL 9.1
And of course, the PostgreSQL docs

Thinking-Sphinx :with => @geodist returns no results

Posted by dansketcher on January 01, 2010
gems, Rails, rant, software development / No Comments

Having just recently started using Thinking Sphinx, I’ve been toying with the geograhical search that it provides to simply do distance-based searches while querying other information. The Geo-searching part of the manual shows how to do searches limiting distance:

# Searching for places within 10km
Place.search "pancakes", :geo => [@lat, @lng],
  :with => {"@geodist" => 0..10_000}
# Searching for places sorted by closest first
Place.search "pancakes", :geo => [@lat, @lng],
  :order => "@geodist ASC, @relevance DESC"

Unfortunately, the first style of query where @geodist is limited in range – when used as-is – does not work I found in my testing that I could order by @geodist as per the second query but when I added the :with range, I had no results.

The solution was actually simple, but really annoying. You have to use Floats instead of Integers for the range numbers! So changing the first query to:

Place.search "pancakes", :geo => [@lat, @lng],
  :with => {"@geodist" => 0.0..10_000.0}

# or more neatly

Place.search "pancakes", :geo => [@lat, @lng],
  :with => {"@geodist" => Range.new(0.0, 10_000.0)}

makes it all work. Now to get someone to update those docs!
Update: Updated!

Workling support for Synchronous AMQP RabbitMQ Clients and Amazon SQS Queues

Posted by dansketcher on November 16, 2009
gems, Rails, software development / No Comments

As a part of contracting work I have been doing for TouchLocal, I have just opensourced some code I wrote to support new Workling clients. As you may know, Workling is a Rails-oriented system for performing asynchronous processing and optionally returning data from these background workers. However, because of the implementation of the original AMQP client, you could not use RabbitMQ queues from non-evented Mongrel or Phusion Passenger servers (only evented Mongrel or Thin).

Building on the work of celdee-bunny and famoseagle-carrot, I implemented a RabbitMQ workling client that could be used from within Phusion Passenger and Mongrel. The Synchronous AMQP Workling client allows RabbitMQ to be used from Workling without requiring complicated changes to deployment scenarios. Also, I implemented the Return Store functionality, so that RabbitMQ users can get data back from the workers, just like when using Starling.

Additionally, it was useful at the time to add support for an Amazon SQS Workling client, more as an exercise in testing its performance than anything else. As with the SyncAmqpClient, support for the Return Store is present. One of the discoveries in working with Amazon SQS (via the RightAws gem) was the discovery that the default key structure for Workling (which uses colon characters for segment delimiters) is not supported by Amazon. As a result, if you define the keys used for AWS configuration, even if you don’t use them, they will change the Workling key structure. This is not a problem for new implementations, but for existing deployments adding this will mean that Workling cannot see the old queues and you may not be able to access them without removing the AWS configuration… not a deal breaker, but something to be aware of.

So, the TouchLocal github account holds the version of Workling that has these two implementation for now, at least until my pull request to the main branch is accepted :)

touchlocal-openx gem released

Posted by dansketcher on November 04, 2009
gems, hosting, Rails, software development / No Comments

After the last post, I spent a bit of time integrating my work (along with other upgrades) into a fork of the openx code on github and have released a gem version of it on GemCuttter.org

As you may or may not know, GemCutter.org will be the new default gem source by becoming rubygems.org. This is pretty exciting, because previously you had to have you project registered on rubygems.org in order to publish to it, or use the non-standard github gem host. Good stuff.

In any case, the new OpenX gem can be installed now by executing

sudo gem install touchlocal-openx --source "http://gemcutter.org"

# Load it using
require 'rubygems'
gem 'touchlocal-openx' 
require 'openx'

In Rails, include it like this in your Rails::Initializer block:

  config.gem "touchlocal-openx", :lib => "openx", :source => "http://gemcutter.org"

I’m a proud parent!

Minimum OpenX XMLRPC Ruby Client

Posted by dansketcher on September 29, 2009
hosting, Rails, software development / No Comments

2009-11-04 Update: Gem version released

For TouchLocal I am currently reworking some of the internal advertising systems to use OpenX. That way we get better reporting and better reliability, but we can use things like OpenX Direct Selection to make the best use of our existing infrastructure.

While there are 2 Ruby projects (1 a more recent fork of the other), both are oriented around the API for administering OpenX rather than serving ads via the API. So, here is the minimum you need to do from Ruby to get a banner served:

(Based on http://www.openx.org/en/docs/tutorials/Advanced+XML-RPC)

require 'xmlrpc/client'
# The settings are the  HTTP Headers - the PHP client sets many, but this is the minimum requirement
settings = {:cookies => [], :remote_addr => 'localhost'}

# 'what' in our case is a Direct Selection
params = [
  what = 'Plumber',
  campaignId = 0,
  target = '',
  source = '',
  withText = false,
  xmlContext = [{'!=' => 'campaignid:2'}]
]

b = nil
begin
  server = XMLRPC::Client.new2("http://www.example.com/www/delivery/axmlrpc.php")
  b = server.call('openads.view', settings, *params)
  b['html']
rescue Exception => e
  puts e.message
end

In particular, note the content of the xmlContext parameter – that took me quite a few hours of digging through the PHP to work out. It’s an array of hashes, that use the keys “==” and “!=” to include or exclude banners based on the values. The values must be of the form “type:id”, with type being one of “campaignid”, “clientid”, “companionid”, and “bannerdid” (although technically, anything other than the first 3 is treated as a banner ID as of version 2.8.1 of OpenX)

I hope that saves someone the pain that I just went through to discover it!

Boolean.parse

Posted by dansketcher on April 06, 2009
Rails, software development / No Comments

So often in Ruby on Rails I find myself needing to convert a variable to a Boolean, and there is no built in way of doing it. So, how about this:

class Boolean
  def self.parse(obj)
    %w(true t 1 y).include?(obj.to_s.downcase.strip)
  end
end

We use this at work and it’s a nice neat way of ensuring non-repetitive code reuse :)

Rails ActionWebService SOAP error “No valid method call – missing method name”

Posted by dansketcher on June 06, 2007
Rails, software development / 3 Comments

For a Salesforce.com integration project, I need to create a SOAP server to accept messages from Salesforce. Seeing as there is no way directly import a wsdl into ActionWebService, I instead used wsdl2ruby from the soap4r distribution to generate server stubs. Then, I used ActionWebService to emulate the same service. I had problems with using the default layout, so I changed my structure to use a delegated structure:

class NotificationServiceController < ApplicationController
  web_service_dispatching_mode :delegated
  web_service_scaffold :invoke
  web_service :notifications, NotificationService.new
end
class NotificationServiceApi < ActionWebService::API::Base
  inflect_names false
  require_soap_action_header false
  api_method :notifications,
             :expects => [Notifications],
             :returns => [:bool]
end
class NotificationService < ActionWebService::Base
  web_service_api NotificationServiceApi
  def logger
    RAILS_DEFAULT_LOGGER
  end
  def notifications(organization_id, action_id, session_id, enterprise_url, partner_url, notification)
    my_object_id = notification.sObject.id
    ack = false
    begin
      ack = so_something(my_object_id)
    rescue Exception => e
      logger.error("Error processing payment: #{e.message}")
    end
    ack
  end
end

But STILL SOMETHING WAS WRONG. I was getting "No valid method call - missing method name" with the Salesforce outbound message queue reporting "org.xml.sax.SAXParseException: Content is not allowed in prolog." MMmmmmm helpful. The stack trace was showing that rails was trying to process the request as XMLRPC not SOAP, which was all wrong.

The stack trace looked something like this:

RuntimeError (No valid method call - missing method name!):
    /usr/lib/ruby/1.8/xmlrpc/parser.rb:476:in `parseMethodCall'
    /usr/lib/ruby/1.8/xmlrpc/marshal.rb:63:in `load_call'
    /usr/lib/ruby/1.8/xmlrpc/marshal.rb:32:in `load_call'
    /vendor/rails/actionwebservice/lib/action_web_service/protocol/xmlrpc_protocol.rb:36:in `decode_request'
    /vendor/rails/actionwebservice/lib/action_web_service/protocol/xmlrpc_protocol.rb:32:in `decode_action_pack_request'
    /vendor/rails/actionwebservice/lib/action_web_service/protocol/discovery.rb:20:in `discover_web_service_request'
    /vendor/rails/actionwebservice/lib/action_web_service/protocol/discovery.rb:18:in `discover_web_service_request'
    /vendor/rails/actionwebservice/lib/action_web_service/dispatcher/action_controller_dispatcher.rb:49:in `dispatch_web_service_request'
    (eval):1:in `notifications'
..........

Then, I came across patch 7077 (indirectly via this and then this), so, using my new best friend Piston I checked out Rails 1.2.3 into vendor/rails, and patched with the diff in the change. It was not entirely smooth - the xmlrpc.rb file could not be patched, but I merged the change manually.

All done! Works!

One thing I did discover the hard way though is that you need to have the whole stack of definitions with the same naming scheme for this to work. That is, if you've got a NotificationServiceController, you need to ensure that you have a NotificationService and a NotificationServiceApi defined and in use - no other class names will work. No Reuse of API Definitions, which is a bit of a bugger.

DhtmlCalendar and Piston

Posted by dansketcher on June 06, 2007
Rails, software development / No Comments

Over the weekend I decided to try (again) to find a Rails plugin for a Dhtml Calendar. The previous one I tried relied on an Engine, and for some totally unreasonable errrrrrrrr reason, I don’t like that. So I came across this: http://dry.4thebusiness.com/info/dhtml_calendar. At first it was all sunshile, lollipops and rainbows, but I soon realised that the plugin did not like Firefox.

This was, of course, a right pain. A quick test in IE determined it worked fine. In Firefox, it popup worked, but when you select a date the select boxes did not get updated. I waded through the javascript that comes with the plugin, and it seemed ok, but it simply WAS NOT firing in Firefox.

After some Javascript mangulation (that’s my word but you can use it if you want) I figured out that it wasn’t picking up the HTML form. So, even though the documentation says:

“Note: :form_name is optional unless your form is named. If it is named then supply the name of the form.”

I included it anyway.

And it worked. Yay.

Also, in the past I have used svn:externals to include external plugin into my project, and at the most inopportune time the external site as either (and I have had both of these happen):

  1. The external site is inaccessible at deployment time, meaning that your site is offline, or
  2. The external site updates the codebase to suit a new Rails version that you have not upgraded to, meaning that your site is broken and once again, offline

After doing that once or twice, I gave up on svn:externals and just exported the remote source and checked it into my repository, which is a bit shit because it removes the link to the origin of the code. This time I used Piston.

Piston fixes this. It checks out the remote code as an svn export but it stores the synchronization information as svn properties so that it can later be updated, or locked, or even merge the remove changes wth your own changes. Spiffy!

New Job, more code, lots of ‘life’ things

Posted by dansketcher on November 16, 2006
rant, software development / No Comments

So it turns out that sometimes when someone calls a job a Project Managment role, they actually mean Sales. The interview is no help, because it’s all agreement and “we want to go that way” stuff. But you walk in the door and it’s something else.

Not fun.

So anyway, I’ve got another job.

And I’m married now. It’s been a busy coupla months!

The good thing is that I am away from phone calls long enough now to actually do some coding, and it”s a nice change. (truth be told, in the final days at the old place I was debugging Mambo just for something different.

On the plus side, I met some nice people at the old place, namely the PPQ girls and boys. If you’re not in Queensland, you’ve got no chance of knowing who they are, and if you are and you want a personalised plate, they’re who you have NO CHOICE but to use ;) But they’re tops.

Anyway, wine in hand, Ruby on Rails open in Editplus, and I’m home.

Aaaaaahhhhhhhhhh.

gem, rake, webricks hangs – resolved! (damn you windows)

Posted by dansketcher on September 03, 2006
Rails, rant, software development / 2 Comments

You know you’re in for a fun day when you go to the code that you were working on yesterday and it stops working today.

In this case, I tried to start webrick to check my site, and the damn thing would not start. It would get as far as “=> Booting WEBrick…” and then hang. Ok, fine. Google a bit, maybe it’s the firewall? stop the firewall, no good.

Then, I tried to upgrade rails. ‘gem install’ hangs, before it says “upgrading source index”. Hrm. Must be something wrong with ruby. Try “gem list -r” which does not work on this machine but does work on my other workstation. Ok, got to be a ruby problem – upgrade ruby. no good. upgrade again. no good.

Then I uninstalled, rebooted, and reinstalled. “gem install” works! hooray! I install all the dependencies I need, and try to start webrick. Same bloody issue.

At this point, it’s fairly clear that it’s not a ruby or rails issue, it’s got to be a windows issue. I google for (out of desperation) “webrick hang” and I see, 4 down, a blog entry called WEBrick Server Hangs in Windows XP – have a look and it recommends running the following command to flush the Winsock Catalogue:

netsh winsock reset

And everything is fixed. Rake runs, webrick runs, gems still work. Bye bye 3 hours of my life. Thanks Windows. You’re a champ.