Categories
big sur osx Rails

Homebrew on M1 is apparently all working – if you’re up to date

While checking how to install Ruby 3 on my new M1, I saw (also from brandur) that the full homebrew is now ready to roll!

However…

This is really only true if you’re on the latest and greatest for all your projects. If, for example, you need a Ruby <2.4 for something, then OpenSSL needs to be <1.1, and that isn’t supported on arm, per https://github.com/rbenv/ruby-build/issues/1353#issuecomment-573414540. So I still set up the ibrew/abrew stuff, and then had to install the old OpenSSL in the ibrew variant per the above (or was it this? or this?). Btw, I used the iTerm Rosetta.app variant to get things installed, and then tried to keep using it as I installed and built things, just to minimise the variables on the way to something functional.
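For reference, the ibrew/abrew split is just a pair of shell aliases pointing at the Intel and ARM Homebrew installs. A minimal sketch, assuming the default prefixes for each architecture (/usr/local for Intel, /opt/homebrew for ARM):

# Intel (Rosetta) Homebrew
alias ibrew='arch -x86_64 /usr/local/bin/brew'

# Native ARM Homebrew
alias abrew='arch -arm64 /opt/homebrew/bin/brew'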

ibrew install rbenv/tap/[email protected]

This only works on x86_64, so in order to get an old Ruby installed, I needed double installs. In fact, I needed to use the x86_64 readline to get 2.4 and 2.5 installed.

# 2.3, but also works for 1.9.3-p551
RUBY_CONFIGURE_OPTS="--with-openssl-dir=$(/usr/local/bin/brew --prefix [email protected]) --with-readline-dir=/usr/local/opt/readline" arch -x86_64 rbenv install 2.3.3

# 2.4.10, 2.5.8
RUBY_CONFIGURE_OPTS="--with-readline-dir=/usr/local/opt/readline" arch -x86_64 rbenv install 2.4.10

Make sure you install rbenv yourself, not from brew, as you may need to pass some arch details.
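If you want rbenv and ruby-build installed outside of brew, the usual git-based install is enough; a sketch, assuming the standard ~/.rbenv location:

# install rbenv and ruby-build manually rather than via brew
git clone https://github.com/rbenv/rbenv.git ~/.rbenv
git clone https://github.com/rbenv/ruby-build.git ~/.rbenv/plugins/ruby-build
echo 'export PATH="$HOME/.rbenv/bin:$PATH"' >> ~/.profile
echo 'eval "$(rbenv init -)"' >> ~/.profile
source ~/.profile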

capybara-webkit still needs an old version of Qt, so you need to install the old 5.5 just like before. Make sure to follow all the steps, including adding it to your .profile, and make sure you run source ~/.profile before trying to compile again.
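The .profile addition is essentially just putting the old Qt’s binaries ahead of everything else on your PATH. A sketch, assuming the 5.5 formula ends up under /usr/local/opt/qt@5.5 (adjust to wherever your tap actually installs it):

# put the old Qt's qmake first so capybara-webkit builds against 5.5
export PATH="/usr/local/opt/qt@5.5/bin:$PATH"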

pg will also probably freak out about not being able to get a clear line on which Homebrew to use (especially as I installed the arm one as well), so you may need to specify it explicitly:

env ARCHFLAGS="-arch x86_64" gem install pg -v '0.21.0' -- --with-pg-config="$(ibrew --prefix postgresql-common)/bin/pg_config"

or

bundle config build.pg --with-pg-config="$(ibrew --prefix postgresql-common)/bin/pg_config"

libv8 and therubyracer continue to be pains in the arse:

brew install [email protected]
bundle config build.libv8 --with-system-v8
bundle config build.therubyracer --with-v8-dir=$(ibrew --prefix [email protected])
bundle install 
Categories
big sur osx postgresql

SHMMAX or SHMALL error on Big Sur

When upgrading to Big Sur with Homebrew, you may find that Postgres fails to start with shared memory errors:

2021-03-30 17:26:07.203 AEST [47951] FATAL: could not create shared memory segment: Cannot allocate memory
2021-03-30 17:26:07.203 AEST [47951] DETAIL: Failed system call was shmget(key=5432001, size=56, 03600).
2021-03-30 17:26:07.203 AEST [47951] HINT: This error usually means that PostgreSQL's request for a shared memory segment exceeded your kernel's SHMALL parameter. You might need to reconfigure the kernel with larger SHMALL.
The PostgreSQL documentation contains more information about shared memory configuration.
2021-03-30 17:26:07.204 AEST [47951] LOG: database system is shut down
stopped waiting
pg_ctl: could not start server

The short term (until reboot) solution lies here:

https://stackoverflow.com/questions/2017004/setting-shmmax-etc-values-on-mac-os-x-10-6-for-postgresql/10629164#10629164

Basically:

sudo sysctl -w kern.sysv.shmmax=12582912
sudo sysctl -w kern.sysv.shmall=12582912

But for a more permanent fix: since Catalina, editing /etc/sysctl.conf no longer works.

After further digging, the solution to this problem is to place a plist file into /Library/LaunchDaemons/ which runs and sets the shared memory parameters on boot. On Catalina this has to be preceded by disabling SIP, and remounting the / file system as writable, before installing the plist file.

https://developer.apple.com/forums/thread/669625

So, go to /Library/LaunchDaemons/, create a file (I called mine memory.plist, and make sure the owner is correct!), add the content below, and then run:
sudo launchctl load memory.plist

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
 <key>Label</key>
 <string>shmemsetup</string>
 <key>UserName</key>
 <string>root</string>
 <key>GroupName</key>
 <string>wheel</string>
 <key>ProgramArguments</key>
 <array>
  <string>/usr/sbin/sysctl</string>
  <string>-w</string>
  <string>kern.sysv.shmmax=268435456</string>
  <string>kern.sysv.shmmni=128</string>
  <string>kern.sysv.shmseg=32</string>
  <string>kern.sysv.shmall=65536</string>
 </array>
 <key>KeepAlive</key>
 <false/>
 <key>RunAtLoad</key>
 <true/>
</dict>
</plist>
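On the ownership point above: launchd will complain about (and refuse to load) daemon plists with the wrong ownership or permissions, so if you created the file as your own user, fix it up before loading. Something like:

sudo chown root:wheel /Library/LaunchDaemons/memory.plist
sudo chmod 644 /Library/LaunchDaemons/memory.plist
sudo launchctl load /Library/LaunchDaemons/memory.plist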

Once you’ve rebooted, you can confirm by running:

$ sysctl -A kern.sysv.shmall
kern.sysv.shmall: 65536

$ sysctl -A kern.sysv.shmmax
kern.sysv.shmmax: 268435456
Categories
Uncategorized

Capistrano Multistage file encoding gotcha

We use Capistrano exclusively for deployment to live and pre-live environments, so imagine my surprise when, copying a configuration into files on a new app, I started seeing this:

Flexo:app dansketcher$ cap production misc:show_path
    triggering load callbacks
  * 2013-01-31 13:17:53 executing `production'
    triggering start callbacks for `misc:show_path'
  * 2013-01-31 13:17:53 executing `multistage:ensure'
  * 2013-01-31 13:17:53 executing `misc:show_path'
  * executing "echo $PATH"
`misc:show_path' is only run for servers matching {}, but no servers matched

The ‘production’ stage was configured, but even adding things like raise '' in the config/deploy/production.rb file failed to have an effect. It was like the file wasn’t even being loaded, except that trying to execute against a non-existent stage (‘work_damn_you’ stage, anyone?) did raise an error.

In the end, out of desperation, I ran dos2unix on the files, which weirdly complained about a binary file. Hrm. I re-saved the files as UTF-8 and all started working.
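If you hit something similar, it’s worth checking the file encoding before re-saving by hand; a quick sketch with standard tools (paths assumed):

# a UTF-16 or "data" result here is the giveaway
file config/deploy/production.rb

# convert to UTF-8 and swap it back in
iconv -f UTF-16 -t UTF-8 config/deploy/production.rb > production.rb.utf8
mv production.rb.utf8 config/deploy/production.rb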

So be warned.

Categories
hosting monitoring postgresql replication

Monitoring PostgreSQL Streaming Replication

Having migrated to PostgreSQL 9.2 and set up Streaming Replication successfully, the next obvious thing you want to know is, “Is it still working?” Fortunately, PostgreSQL comes out-of-the-box with a few tools that can help.

pg_stat_replication

pg_stat_replication is really only useful on a master database server. It shows the slaves connected to the master, and some other opaque information about ‘where it’s up to’, for example

template1=# select * from pg_stat_replication;
  pid  | usesysid |  usename   | application_name | client_addr | client_hostname | client_port |         backend_start        |   state   | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state 
-------+----------+------------+------------------+-------------+-----------------+-------------+------------------------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
 30213 |   xxxxxx | replicator | walreceiver      | 10.10.0.10  |                 |       52889 | 2013-01-24 17:06:06.11849+00 | streaming | 9/42008AD0    | 9/42008AD0     | 9/42008AD0     | 9/42007FB0      |             0 | async
(1 row)

To a noob like me the _location fields are a bit opaque (apart from being hex and probably related to WAL segments), but Alexander Fortin’s post to the postgresql mailing list helps make this a bit clearer.

I’ve actually converted that statement to not require his custom hex_to_int function, like so:

SELECT
    client_addr,
    sent_offset - (
        replay_offset - (sent_xlog - replay_xlog) * 255 * 16 ^ 6 ) AS byte_lag
FROM (
    SELECT
        client_addr,
        ('x' || lpad(split_part(sent_location,   '/', 1), 8, '0'))::bit(32)::bigint AS sent_xlog,
        ('x' || lpad(split_part(replay_location, '/', 1), 8, '0'))::bit(32)::bigint AS replay_xlog,
        ('x' || lpad(split_part(sent_location,   '/', 2), 8, '0'))::bit(32)::bigint AS sent_offset,
        ('x' || lpad(split_part(replay_location, '/', 2), 8, '0'))::bit(32)::bigint AS replay_offset
    FROM pg_stat_replication
) AS s;

which gives an output like this:

 client_addr | byte_lag 
-------------+----------
 10.10.0.10  |        0

meaning that slave 10.10.0.10 is 0 bytes behind the master. Most of the time, in my experience, it stays pretty low, but how big is too big? As long as it stays below the size of 1-2 WAL segments (use “SHOW wal_segment_size” to work out what your config is), then things are fine. If it drops back further, you might want to find out what’s going on.

SELECT pg_last_xact_replay_timestamp()

There are a few functions that only work on the slave too, most notably pg_last_xact_replay_timestamp(). It shows the timestamp of the last transaction applied to the slave. With a reasonably busy server, using something like:

SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::INTEGER AS lag_seconds;

is good enough to see how far behind the slave is. However, if there’s not enough activity, this number will continue to increase even though the server is completely caught up, as now() gets further and further away from the last committed transaction timestamp. Instead, per this post on the mailing list, compare the location information too:

SELECT
CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() 
THEN 0
ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp()) END AS log_delay;
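Either query is easy to wrap in a small shell script for cron or whatever monitoring system you use. A sketch, with the threshold, database name and alert address all assumed:

#!/bin/sh
# run on the slave: alert if replay lag exceeds a threshold (in seconds)
THRESHOLD=60
LAG=$(psql -At -c "SELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0 ELSE EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())::integer END;" template1)

if [ "$LAG" -gt "$THRESHOLD" ]; then
  echo "Replication lag is ${LAG}s (threshold ${THRESHOLD}s)" | mail -s "PostgreSQL slave lag" ops@example.com
fi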

Finally, for anyone using ServerDensity, I updated reinbach’s plugin to include a simple count check of connected slaves on a master, and the slave lag function above.
See it here: https://github.com/dansketcher/sd-postgresql

Categories
hosting postgresql replication software development

PostgreSQL on Windows Server 2008 problems

Having been running PostgreSQL 8.4 for a number of years in production (if it ain’t broke…) the time came to do an upgrade (8.4 on Centos 5 was getting a bit dated). We decided, after some discussion, to take up our hosting provider on a managed HA Windows cluster with shared storage and a redundant slave replica. This was mainly to reduce our own internal server management overhead, as they would be taking care of all parts of the servers’ maintenance.

This all sounded great. After some tests at moderate load, we migrated the databases one by one via a piped pg_dump/pg_restore, which was simplicity itself apart from the outage. But then we started getting some reports of slowness, which was worrying. I updated the application to log slow queries (easier to do this at the application level so that they could go into Splunk) and I was seeing SOME instances (not all) of queries that previously took ~3 seconds taking anywhere from 9 to 200(!) seconds. As you can imagine, this was very disconcerting.

Of course, beforehand the server had been configured using Windows ‘best practice’ settings, which are mostly the same as on Linux but with smaller shared_buffers and effective_cache_size settings, per the PostgreSQL Wiki page.

The primary approach we started to take was to use EXPLAIN ANALYSE extensively to work out if it was just a case of this new server not being ‘run in’ effectively and not having enough stats. Some solid gains were made, but the problem was not ‘average’ query time, it was ‘worst case’ query time. Some queries would run fine one day but terribly the next, and of course once the PostgreSQL cache was primed it was as speedy as ever. Many indexes were changed and updated, the DB was VACUUM ANALYSEd, all the usual stuff. In a display of desperation, we even ‘defragged’ the disk (which would have felt very retro if it were not so unpleasant).

At no point was RAM, CPU or Disk IO even stretched. The servers were effectively over-specced, as we ramped them up to see if that would help, and it did, but not to the extent needed in those worst-case queries.

Another very annoying thing was that apparently the PostgreSQL server was not able to maintain connections properly. Despite having more than enough in the max_connections setting, we consistently saw these kinds of errors reported by the client (namespace from the ruby-pg client):

PGError: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

PGError: no connection to the server

PG::Error: could not receive data from server: Connection timed out

The last of these errors was particularly nasty, as usually the statement was executed successfully, but the client had no idea and no record, so while each DB was internally consistent, inter-db communication processes would get out of sync and needed to be manually reconciled.

So, how did we solve these problems?

We didn’t. We moved everything off the cluster to a new Linux PostgreSQL 9.2 master/slave streaming replication setup. Many many thanks to Bartek Ciszkowski for his excellent writeup, which laid a very solid foundation for our new setup. Since the new DB servers went into production, all of the work we’ve done with indexes has meant the sites are performing faster than ever, and we’ve not seen one of the errors above either (touch wood!).

Incidentally, we use (async) streaming replication with a separate PITR recovery setup, which uses the old-style backup commands rather than pg_basebackup, so that we can rsync-update the base regularly without having to store a new copy every time. Also note that, because of the implementation of replication, you can’t run these backup commands on the slave; they have to be run on the master, which is a shame. Still, we rsync to the slave for the PITR backup, so at least it’s not doing double disk IO.
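For anyone who hasn’t seen the old-style approach, it boils down to bracketing an rsync of the data directory with pg_start_backup()/pg_stop_backup() on the master; a rough sketch, with hosts and paths assumed:

# on the master: mark the start of the base backup (true = fast checkpoint)
psql -c "SELECT pg_start_backup('nightly_base', true);"

# rsync the data directory to the PITR host (pg_xlog comes from archived WAL instead)
rsync -av --delete --exclude=pg_xlog/ /var/lib/pgsql/9.2/data/ backup-host:/backups/pg_base/

# mark the end of the backup so the required WAL gets archived
psql -c "SELECT pg_stop_backup();"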

Useful links:
A PostgreSQL 9.2 Streaming Replication Primer
New Replication and Recovery Features in PostgreSQL 9.1
And of course, the PostgreSQL docs

Categories
Uncategorized

Unfuddle Git deploy gotcha

Having provisioned a new server, I was updating my Capistrano deploy.rb to handle the new box, and on testing a new deploy, I was getting this:

** [server :: out] == Unfgit: =============================================================
** [server :: out] Permission denied public key error. account: xxxxx; repository: xxxxx; person: none; message: no person found on this account using this public key
** [server :: out] ========================================================================

Annoying, seeing as it worked just fine on the old box.

I went onto the machine and confirmed that, using the ssh key there, I could check out the repo properly. Unfuddle support tried to help but could not. Fortunately, I came across this GitHub resource, which explains some of the settings. I think on the third or so re-reading I noticed the real impact of this:

Agent Forwarding

If you’re using your own private keys for git you might want to tell Capistrano to use agent forwarding with this command. Agent forwarding can make key management much simpler as it uses your local keys instead of keys installed on the server.

ssh_options[:forward_agent] = true

So, I tried removing that line from my cap deploy.rb, and the checkout worked fine.

Hope this helps someone else!

Categories
Uncategorized

Configuring Rails and Splunk (Part 3)

Success! I have managed to get Splunk to group Rails logs into logical groups.

First, I changed my Rails web server to be a light forwarder, as I wanted the indexer to do the grouping (as per the behaviour breakdown from the previous part of this series). Then:

$SPLUNK_HOME/etc/system/local/inputs.conf

[default]
host = web.server.example.com

[monitor:///var/www/sites/web.application.example.com/shared/log/production.log*]
disabled = false
followTail = 0
sourcetype = rails_log

[monitor:///var/log/httpd]
disabled = false
followTail = 0
sourcetype = access_combined

I’m not sure if I need to have the [rails_log] declaration on the LightForwarder’s etc/system/local/props.conf file, but it’s there as an empty declaration just in case.

Before we go further, note that my Rails logging uses log4r and a custom formatter, so my logs look like this:

[ApplicationName][INFO] 100819 06:37:07 :: 

Processing SessionsController#new (for 172.0.0.1 at 2010-08-19 06:37:07) [GET]
[ApplicationName][INFO] 100819 06:37:07 ::   Parameters: {"unauthenticated"=>"true", "action"=>"new", "controller"=>"sessions"}
[ApplicationName][INFO] 100819 06:37:07 :: Rendering template within layouts/agent
[ApplicationName][INFO] 100819 06:37:07 :: Rendering sessions/new
[ApplicationName][INFO] 100819 06:37:07 :: Completed in 20ms (View: 12, DB: 5) | 200 OK [https://web.application.example.com/user/sign_in?unauthenticated=true]
[ApplicationName][INFO] 100820 01:01:25 :: 

Processing Index#index (for 172.0.0.1 at 2010-08-20 01:01:25) [GET]
[ApplicationName][INFO] 100820 01:01:25 ::   Parameters: {"action"=>"index", "controller"=>"indes"}
[ApplicationName][INFO] 100820 01:01:25 :: Filter chain halted as [:authenticate] rendered_or_redirected.
[ApplicationName][INFO] 100820 01:01:25 :: Completed in 3ms (View: 1, DB: 4) | 401 Unauthorized [https://web.application.example.com/]

The important thing to note here is that Rails inserts two linebreaks before writing “Processing blah blah”, so we’ll use that to split. The props.conf documentation recommends using the LINE_BREAKER setting to handle multi-line events, and it seemed easier to me as it uses just one regexp to break the input stream. The important part for this kind of splitting is to know:

* Wherever the regex matches, the start of the first matching group is considered the end of the 
  previous event, and the end of the first matching group is considered the start of the next event.
* The contents of the first matching group is ignored as event text.

What this means is that the LINE_BREAKER regexp (using PCRE) must have a first match group that can be discarded. If you grep over the default provided files (`grep -Rn "LINE_BREAKER" /opt/splunk/`), you’ll see that most of the options use a linebreak or multiples thereof as the first match group. We’ll do the same. Other than that, we just want to match on the “Processing” text as the start delimiter… which comes out to the below…

On the indexer’s $SPLUNK_HOME/etc/system/local/props.conf

[rails_log]
LINE_BREAKER = ([\r\n]).* \:\: [\r\n]+Processing

Now, on the indexer, the rails_log items are multiline events!

Categories
Rails splunk

Configuring Rails and Splunk (Part 2)

As my wrestle with configuring Splunk continues, I discovered the following helpful article explaining what happens where when forwarding data to a Splunk server.

Basically, it depends. Which is not surprising, given how complicated Splunk is, but it’s good to have on hand!

Combined with details on configuring linebreaking for multi-line events, perhaps I can get my rails logs handled as one entity in the near future…

Categories
log4r Rails splunk

Configuring Rails and Splunk (Part 1)

To increase our application intelligence, I’ve started installing Splunk. However, what was not clear to me was the best way of getting my Web/App Rails logs into Splunk. The Web server is on a different box (as is the DB server) so I wanted to know the best way of getting the log data to the Splunk host.

Posts like this imply that Syslog-ng is a good way of getting data to Splunk, but the hidden detail is that if the Splunk server is down, the data is lost. A better way (thanks to the helpful folk on IRC) is to set up a Splunk forwarder on the Web/App/DB hosts and go from there. I went initially with setting them up as (free licenced) Standard Forwarders, so that the web interface is still running for configuration. Remember that any firewall (iptables) will need to be opened for the management ports (8000, 8089) and the forwarding port (9997).
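For what it’s worth, opening those ports on a stock iptables setup looks roughly like this (rule ordering and interfaces assumed, so adjust to your own firewall):

# Splunk web (8000), management (8089) and forwarding (9997)
iptables -A INPUT -p tcp -m multiport --dports 8000,8089,9997 -j ACCEPT
service iptables save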

Something else I discovered (but can no longer find in the docs) is that if you configure data sources using the web interface on a Forwarder, but subsequently change the instance to a LightForwarder, you also have to move the configuration files from $SPLUNK_HOME/etc/apps/search/local/ to $SPLUNK_HOME/etc/system/local/.
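In practice that move is just shifting the relevant .conf files across and restarting Splunk; something like the below (file names assumed, so check what the web UI actually wrote first):

mv $SPLUNK_HOME/etc/apps/search/local/inputs.conf $SPLUNK_HOME/etc/system/local/
$SPLUNK_HOME/bin/splunk restart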

Categories
Uncategorized

Getting Phusion Passenger to run under SELinux on Centos 5.4

Edit 2010-03-17: Harder than it first seemed! That permission only allows Apache to load Passenger. See here for the full requirements.

I’ve just started playing with SELinux on Centos, and while the idea is great, it’s not exactly what I’m used to. Take, for example, adding Phusion Passenger to Apache. When I first restarted, I got this:

[root@localhost modules]# /etc/init.d/httpd restart
Stopping httpd:                                            [FAILED]
Starting httpd: httpd: Syntax error on line 210 of /etc/httpd/conf/httpd.conf: Syntax error on line 1 of /etc/httpd/conf.d/passenger.conf: Cannot load /opt/ruby-enterprise-1.8.7-2009.10/lib/ruby/gems/1.8/gems/passenger-2.2.9/ext/apache2/mod_passenger.so into server: /opt/ruby-enterprise-1.8.7-2009.10/lib/ruby/gems/1.8/gems/passenger-2.2.9/ext/apache2/mod_passenger.so: failed to map segment from shared object: Permission denied
                                                           [FAILED]

Huh?

So it turns out, after reading through some Google results and then man httpd_selinux, that the following will fix it:

chcon -R -h -t httpd_sys_script_exec_t /opt/ruby-enterprise-1.8.7-2009.10/lib/ruby/gems/1.8/gems/passenger-2.2.9/ext/apache2/mod_passenger.so

And so it does. “httpd_sys_script_exec_t” allows Apache to execute the SO, and we’re all good to restart.
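You can confirm the new context stuck with ls -Z before restarting Apache; a quick check along these lines:

# the type field should now show httpd_sys_script_exec_t
ls -Z /opt/ruby-enterprise-1.8.7-2009.10/lib/ruby/gems/1.8/gems/passenger-2.2.9/ext/apache2/mod_passenger.so
/etc/init.d/httpd restart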