I love me some tmux.

However, sometime between CentOS 5 and CentOS 6 something broke and tmux no longer compiles. Here’s my error message:

...
control.c: In function ‘control_callback’:
control.c:62: warning: implicit declaration of function ‘evbuffer_readln’
control.c:62: warning: nested extern declaration of ‘evbuffer_readln’
control.c:62: error: ‘EVBUFFER_EOL_LF’ undeclared (first use in this function)
control.c:62: error: (Each undeclared identifier is reported only once
control.c:62: error: for each function it appears in.)
control.c:71: warning: implicit declaration of function ‘time’
control.c:71: warning: nested extern declaration of ‘time’
make: *** [control.o] Error 1
[root@testcentos6 tmux-tmux-code]#

After some googling, the common advice is to use the latest version of libevent. Well, I wasn’t using the latest; I was using whatever comes with the CentOS repos. So I decided to fix this by removing those packages:

yum remove libevent libevent-devel libevent-headers

Then I compiled my own libevent. I don’t do anything out of the ordinary on this server, so I’m not even sure anything I use depends on it:

wget https://github.com/downloads/libevent/libevent/libevent-2.0.21-stable.tar.gz
tar xf libevent-2.0.21-stable.tar.gz
cd libevent-2.0.21-stable

Their README was very basic, so rock it:

./configure
make
make install

I then returned to my tmux source:

cd /usr/local/src/tmux-1.7
make clean
make
make install
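
If the rebuild still can’t find the new libevent headers or libraries, pointing tmux’s configure script at /usr/local explicitly is a common trick. This is only a sketch; I didn’t need it here, and it assumes libevent landed under /usr/local:

CPPFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/lib" ./configure
make
make install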

Then another issue:

[root@testcentos6 tmux-1.7]# tmux
tmux: error while loading shared libraries: libevent-2.0.so.5: cannot open shared object file: No such file or directory

Luckily this error has been seen before. After another quick google:

ln -s /usr/local/lib/libevent-2.0.so.5 /usr/lib64/libevent-2.0.so.5
tmux

AWESOME!
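
As an aside, an alternative to the symlink (assuming libevent installed itself under /usr/local/lib) is to tell the dynamic linker about that directory instead:

echo "/usr/local/lib" > /etc/ld.so.conf.d/libevent.conf
ldconfig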


So let me share with you a little story about upgrading application systems.

Where I work, there is a server that runs some seemingly antiquated software; however, it is highly robust and very actively maintained. This system is designed to process insurance claims. It is extremely complex due to the large variety of variables involved in processing claims and the amount of data that has to be combed through to validate even the smallest change.

Every once in a while we get upgrades to this thing. As with any system, testing is required. We use a test system to which we copy the data, install the enhancement, and tell the team that performs the audit to have at it. One would assume it’s cut and dried. Here’s why it took four tries for us to be successful:

Deployment One. It was met with utter disaster. The users testing the system found that it was not performing in any way like the test system. The enhancement was not operating as expected, and it was blatantly obvious something was different between our test and production versions of the app server. After reverting, we continued to have issues later that week. It was later discovered that the instructions provided with the enhancement were missing steps that were crucial for ensuring the system would operate. These steps may have been done, undocumented, while the test system was being used. So the proper steps were added, and the deployment to live was rescheduled.

Deployment Two. It was met with utter disaster. The users testing the system found that it was performing, kind of. As they were auditing the claims, they found that some claims were not being processed per the enhancement. Because we had no idea how many claims this would affect if left in the system for an extended period, we decided to roll back the live system. Another thing to note: the test system was performing flawlessly. In this case we went to the vendor with the results from the audit and later had the deployment reissued. After another round of testing, deployment three.

Deployment Three. It was met with utter disaster. The users testing the system found that the claims being audited were processing perfectly fine. Awesome! But past experience has taught us to perform a full sweep of testing, and those other tests were failing. Added bonus: our test system was producing the exact same results. With the vendor’s help we were able to provide them extra details, but we could not solve the problem before the end of the maintenance window. We were forced to roll back yet again. It was decided that our method of copying data between our live and test systems was flawed: we would copy the data once, and then test from that point forward with only specific data files, which created a scenario in which the two systems were not necessarily in sync. So from then on, each and every day, including the day of the deployment, we would copy the data over and apply the enhancement before testing. This showed that our earlier testing had been flawed when it came time to move the deployment into place. After more rounds of testing and another reissue of the deployment, number four.

Deployment Four. It was met with success. So much success, in fact, that not only had our department perfected the documentation and the deployment steps, but so had the people doing the auditing. Things were completed much faster, with greater attention to the proper details, and in general everything went very smoothly.

Sometimes working in IT can be sucky. In this case, four weekends were destroyed by this ONE enhancement. But in the future, with the improved documentation and testing strategies, we can all make better decisions and better plan for future implementations. Collaboration is also very important; only teamwork and close communication with the vendor led us to be successful.

Cheers for the next enhancement.

I created a small PowerShell script to roll through all security groups in Active Directory.

github.com/jtslear/get_ad_group_memberships

Believe it or not, I found this difficult to google. There are plenty of results for listing what groups a user is a member of, but not many that display each group with the users assigned to it!

It isn’t pretty, and by no means do I know PowerShell at all, but it was pretty quick. Note the README! Enjoy.

The Problem:

zmconfigd would indicate that it is not running, both in the services status view on the admin console and in the output of zmcontrol status.

The fix:

Ensure that the which command is installed. After that is completed, remove the stale PID file, then start the service and check status:

# yum install which
# sudo su - zimbra
$ rm -rf /opt/zimbra/log/zmconfigd.pid
$ zmconfigdctl start
$ zmcontrol status

Why?

My suspicion can be found in the source code for zmconfigdctl:

NC=`which nc 2>/dev/null`; NC=${NC:-`which netcat 2>/dev/null`}

Obviously, if which is missing, nc can’t be located, NC ends up empty, and Zimbra will fail. Why does that matter?

status=`echo STATUS | $NC -w 10 -i 1 localhost ${zmmtaconfig_listen_port} 2>/dev/null`

This is used to see if it’s running. Zimbra will notify you during install that netcat is required but does not complain about the which command. Even my manager complains that I build servers light on installed packages. Add this to the list of “problems-john-created-due-to-missing-dependencies.”
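
A quick sanity check before starting zmconfigd might look like this (assuming CentOS-style package names, where netcat is packaged as nc):

rpm -q which nc || yum install -y which nc
which nc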

This service’s purpose is to monitor for specific configuration changes and restart the necessary services so that new configurations take effect automatically.

Now I know.


I’ve always had a dream of working in a datacenter with thousands of servers running some sort of High Performance Computing solution. I have zero experience, but here’s my daydream:

The Node:

The actual compute node consists of some sort of server that contains one or more GPUs, as many fast CPU cores as possible, a good chunk of RAM, no RAID card, and no hard drives. My imagination leads me toward a solution similar to the Dell C410x or the Cubix GPU Xpander, with some sort of Host Interface Card to interconnect the systems and provide massive GPU compute power. An individual server would be tied to that system and provided a certain number of GPUs. This type of setup lets us use smaller computers without the need for large motherboards with many PCI interfaces. In this case only a single PCI card is required, and a set of one or more GPUs can be mapped to the node. This allows a high CPU/GPU-per-U ratio. Check out the Dell C410x and C6100 solution for details on what Dell provides, and the Cubix Xpander Rackmount for proposed solutions from Cubix.

The node would use PXE booting to download an OS provided via TFTP, which would run entirely in RAM. The OS, dubbed the ‘image’ for the rest of this document, is discussed in more detail below.
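
A rough sketch of the boot menu the management server might serve, using a classic pxelinux.cfg/default (the file names are made up, and an IPv6-only shop would need a boot loader with IPv6 support, such as iPXE or a UEFI loader). If the entire root filesystem is packed into the initramfs, the node never has to mount a disk at all:

# pxelinux.cfg/default
DEFAULT hpcnode
PROMPT 0
LABEL hpcnode
  KERNEL vmlinuz-hpc
  APPEND initrd=hpc-rootfs.img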

The Management Server:

The management server is the system that runs the TFTP service all nodes boot from to download the image. This server is set up to have the best network connectivity so it can send the Linux image down to each node quickly. To ensure it can do this well, I would suggest keeping the image for the nodes on a RAM disk. A reverse proxy in front of a couple of back-end TFTP servers may also be very beneficial. If the power goes out, thousands of servers asking for an image at once is bound to slow this configuration down. That is a risk, but it is the cost of avoiding installing hard drives and managing that extra hardware on every node.
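
Getting the image onto a RAM disk for serving could be as simple as the following; the paths are only examples, and the copy has to be repeated after every reboot since tmpfs is volatile:

mkdir -p /var/lib/tftpboot
mount -t tmpfs -o size=2g tmpfs /var/lib/tftpboot
cp /srv/images/hpc-rootfs.img /srv/images/vmlinuz-hpc /var/lib/tftpboot/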

The Distribution Server:

As a preface, I have no experience in this field. The distribution server would be the head honcho that handles all requests for the compute services, maintaining an inventory of all workloads and distributing them to all the nodes in the facility. This server would be the proxy between the data submitted for processing and all of the nodes that perform the compute.

DHCP Server:

Keeping with the times, let us run a full IPv6 environment. It’s been written about; let’s rock with it. A special option for PXE booting, pointing nodes at the management server for TFTP services, will be required.
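
With ISC dhcpd’s DHCPv6 support, the relevant bits might look roughly like this (addresses are placeholders, and the node firmware would need to support UEFI PXE over IPv6):

# /etc/dhcp/dhcpd6.conf
option dhcp6.bootfile-url "tftp://[2001:db8::10]/bootloader.efi";

subnet6 2001:db8::/64 {
  range6 2001:db8::1000 2001:db8::1fff;
}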

The Node OS (Image):

The nodes will be running a very lightweight version of Linux that has been stripped down and heavily modified to get rid of as many packages as possible. The kernel would be stripped down to whatever is required to run the system and the services that need to run on it. The only packages left in the image would be those required for the services to run. The node is not meant to be a device for troubleshooting. If there is a problem with it, reboot it. If a hardware issue is suspected, pull the node and boot a different Linux image built to help a technician discover any potential issues. Running thousands of computers allows this to occur without any major impact to the business.
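
Building such an image could be as simple as packing the trimmed-down root filesystem into an initramfs that the nodes boot straight into; a minimal sketch with made-up paths:

cd /srv/build/hpc-rootfs
find . | cpio -o -H newc | gzip > /var/lib/tftpboot/hpc-rootfs.img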

Node Compute Cycle:

Over time, updates will be made to the image built for the nodes. New drivers, patches to packages, and updates to scripts and services will warrant the need to download a new image or restart services. To remove human interaction and give nodes the best chance of getting updates, I envision a service that runs on the node and recognizes what the node is doing at any given time.

This service would also check the management server for an updated image. Should a new image be available, it will need some way to notify the compute service or the distribution server to prevent this node from receiving another compute data set. The purpose is to let the node complete its current compute data set and then reboot when it finishes (a rough sketch of such an agent follows the list below). A full cycle would look like this:

  • Node hardware boots, PXE boot, download image, OS boots
  • Services begin, node monitor agent checks in, compute service starts
  • Node begins compute
  • The node’s management service checks in with the management server
  • Should an updated image be available, a flag is set that would instruct the node to reboot at the completion of the latest compute
  • Node reboots
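
A rough sketch of what that node monitor agent might do, run from cron or a simple loop (every path and URL here is made up for illustration):

#!/bin/sh
# compare the locally booted image version with what the management server offers
CURRENT=$(cat /etc/hpc-image-version)
LATEST=$(curl -s http://mgmt.example.com/hpc-image-version)
if [ "$LATEST" != "$CURRENT" ]; then
  # the compute service watches for this flag and reboots the node after the current job
  touch /var/run/reboot-after-compute
fi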

Network:

Network connectivity would involve switches that support bonding multiple Ethernet links together for nodes that have more than one Ethernet port. Jumbo frames would be enabled on all network devices. I imagine redundancy built into the switch infrastructure: two top-of-rack switches configured as a stack, with fiber uplinks from each switch, again using LACP-enabled links, to a set of core switches.

Nodes that have more than one NIC should be bonded using LACP or something similar to allow the maximum bandwidth possible between the node and the network.
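
On a RHEL-style node, the bond might be sketched like this (interface names and options are illustrative):

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
BONDING_OPTS="mode=802.3ad miimon=100"
MTU=9000

# /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes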

Future Discussion:

  • Power savings; turn off nodes when not in use