26.12.07

360 Degree Awareness


IT Operations is a discipline inside information technology. Picture the usual 2-dimensional grid of issues surrounding a product, then expand it to a 3-dimensional view: each product has its own grid of issues, and a problem in one area can impact another product in a completely unforeseen way.


Before I came to The Company I Work For, I had a boss who hammered into the team I was on the importance of knowing which boxes held which applications, what they were connected to and what depended on them. In a 400-server data center with shaky power delivery, this wasn’t an academic exercise. After the second or third data center outage caused by local power problems, we developed a process that gave us that kind of 360-degree view of all systems, so that we knew what had to be turned on first, second, third and so on.
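That power-up ordering is essentially a topological sort of a dependency graph. Here is a minimal sketch in Python (3.9 or later, for the standard graphlib module). The inventory is hypothetical and the system names are invented for illustration; they are not the systems from that data center.

```python
from graphlib import TopologicalSorter

# Hypothetical inventory: each system maps to the systems it needs running first.
DEPENDS_ON = {
    "dns": set(),
    "storage": set(),
    "database": {"storage", "dns"},
    "mail": {"dns"},
    "web-app": {"database", "dns"},
}


def power_on_waves(depends_on):
    """Group systems into waves; everything in one wave can start together."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(power_on_waves(DEPENDS_ON), start=1):
        print(f"Wave {number}: {', '.join(wave)}")
```

Grouping into waves rather than one flat list mirrors the "first, second, third" idea: anything in the same wave has no dependency on its neighbours and can be switched on in parallel.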

At The Company I Work For, our application development environment is physically small but houses several dozen applications pertaining to our job. Our PHP/Apache environment shares space on the same server as our code repository and our SQL schema repository. Upgrading the Bugzilla app can have people coming over to us asking, “Where’s the Wiki?”

It pays to be aware.

Our mandate is to have an environment that requires no more than 25% of a single headcount, so we’re always encouraged to find any way to make our job more efficient, happen faster and require less admin overhead. If we can keep the cost down, so much the better. I haven't found a way to put all systems into a matrix that shows their relation to other systems - a configuration management tool, perhaps? If I find one, I'll publish it here.
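Short of a full configuration management tool, even a small script can answer the "what else lives on this box?" question. The sketch below assumes a hand-maintained mapping of applications to hosts. The host names and the application list are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical inventory: application -> host it runs on.
APP_HOST = {
    "bugzilla": "dev1",
    "wiki": "dev1",
    "code-repo": "dev1",
    "build-server": "dev2",
}


def neighbours(app_host):
    """Return, for each app, the sorted list of other apps on the same host."""
    by_host = defaultdict(list)
    for app, host in app_host.items():
        by_host[host].append(app)
    return {
        app: sorted(other for other in by_host[host] if other != app)
        for app, host in app_host.items()
    }


if __name__ == "__main__":
    impact = neighbours(APP_HOST)
    print("Upgrading bugzilla may affect:", ", ".join(impact["bugzilla"]))
```

It only captures shared hosts, not shared runtimes or network dependencies, so treat it as a starting point rather than the matrix itself.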

Site Monitoring and Operations – Part 3

Burn-in, Close-Out and Lessons Learned

The virtual appliance has been running for two weeks continuously. We’re cautiously optimistic - so far, so good. A couple of points to mention, things to watch out for:

VMware virtual appliances do not run as a Windows service, so when I kill my Windows console session, the virtual appliance dies, too. I’m leaving the console logged in and have Windows Terminal Services configured to keep the session alive after I disconnect. In our organization this isn’t a problem, but in others where Terminal Services is configured to kill idle connections after a period of time, it could be.

Nagios Tuning – I’ll be going back in to tune our warning thresholds in response to Nagios whining when our ISP link gets saturated. If the VPN link between our app dev and data center locations becomes saturated because we’re downloading a large file, I can count on Nagios to send at least one alert complaining about ping response time or a single dropped packet. We’re not that concerned about a saturated line yet; I’d prefer to know only when there’s a real problem.
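One way to think about this tuning is to separate two knobs: how bad a single ping result must be before it counts as a failure, and how many failures in a row it takes before anyone is notified. The sketch below is plain Python, not Nagios configuration. The threshold values and function names are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class PingResult:
    rtt_ms: float    # average round-trip time in milliseconds
    loss_pct: float  # packet loss as a percentage


def is_failure(result, max_rtt_ms=500.0, max_loss_pct=40.0):
    """A single check fails only if it exceeds the (assumed) thresholds."""
    return result.rtt_ms > max_rtt_ms or result.loss_pct > max_loss_pct


def should_alert(history, attempts=3):
    """Alert only when the last `attempts` checks all failed."""
    recent = history[-attempts:]
    return len(recent) == attempts and all(is_failure(r) for r in recent)


if __name__ == "__main__":
    # One slow ping during a big download, surrounded by healthy checks.
    saturated_blip = [PingResult(40, 0), PingResult(650, 20), PingResult(45, 0)]
    # Three consecutive checks with total packet loss.
    real_outage = [PingResult(900, 100)] * 3
    print("blip alerts:", should_alert(saturated_blip))
    print("outage alerts:", should_alert(real_outage))
```

As far as I can tell, Nagios exposes comparable knobs through the warning and critical thresholds passed to check_ping and through the max_check_attempts setting on a host or service, which is where the real tuning would happen.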

Nagios doesn’t pick up changes to its configuration immediately. You have to commit the changes made through Monarch first, then find all Nagios processes using ps -ef | grep nagios and stop them with kill (process numbers). Recommitting your changes in Monarch (which doesn’t die when you kill Nagios) immediately starts the Nagios processes again, and then your changes are visible. Monarch contains a handy backup tool to back up whatever config changes you make. Never, ever forget to back up.
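The "find the process numbers" step can be sketched in Python by parsing text in the layout that ps -ef prints. The sample output and file paths below are made up for illustration, and the column positions are an assumption that holds for the common Linux layout (PID in the second column, command from the eighth onward).

```python
def nagios_pids(ps_output, name="nagios"):
    """Extract PIDs of processes whose command mentions `name`.

    Expects text in the layout printed by `ps -ef`, where the PID is the
    second column and the command starts at the eighth.
    """
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        command = fields[7]
        # Ignore the grep process that shows up when piping ps through grep.
        if name in command and not command.startswith("grep"):
            pids.append(int(fields[1]))
    return pids


# Made-up sample output; the paths are illustrative only.
SAMPLE = """\
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 Dec01 ?        00:00:02 init [2]
nagios    2310     1  0 Dec01 ?        00:01:12 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios    2311  2310  0 Dec01 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
admin     4120  4001  0 10:15 pts/0    00:00:00 grep nagios
"""

if __name__ == "__main__":
    print("PIDs to stop:", nagios_pids(SAMPLE))
```

The function only reports PIDs; actually stopping the processes is left as the manual kill step described above.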

Although I spent more time than I expected getting Exim4 and Sendmail configured, it gave me an opportunity to become more experienced with Linux, and that is always a good thing.

23.12.07

Some Thoughts About Office Politics

When I first got started, I thought office politics was something that people in my line of work didn't worry about.

I was wrong.

How human beings interact, co-exist, is everyone's problem. To suggest otherwise is dangerous to your professional development. With that in mind - I started learning more about office politics - not to use them to some malicious advantage but simply to learn how they could help me or harm me. I found the following information worth considering and am passing it along to you.

Read "How to Improve Your Skills at Office Politics"

21.12.07

Site Monitoring and Operations – Part 2

Building Nagios

We earmarked a single server to act as our host for all virtual servers. Named “Wintermute” (some cyberpunk fans will get a laugh out of this), it was a Windows 2003 RC2 server that had previously been our demo/Roadshow server for the NYC show in May.

Downloading and installing VMware is pretty simple: just follow the default prompts. You can download the Nagios virtual appliance we used here - http://www.vmware.com/vmtn/appliances/directory/372. This particular virtual appliance is helpful because it includes two additional tools to help configure Nagios: Monarch and Monarch-EZ, open-source applications that provide web-based configuration. As a web-based tool, Monarch is fairly intuitive, and we added our production servers in about 15-20 minutes.

But what should we check? Our IIS servers are checked for HTTP and Ping availability; this just means that Nagios verifies that the box can be pinged and is responsive on port 80. Our mail servers are checked for Ping, HTTP, SMTP and POP services, since our users access their mail from the web, from different mail applications and from mobile devices.
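To make "responsive on a port" concrete, here is a rough Python sketch of a TCP reachability check for the services listed above. It is not how the Nagios plugins are implemented (they also speak the protocol and measure response time), and it leaves out ping, which needs ICMP rather than TCP. The port numbers are the well-known defaults; the function names are my own.

```python
import socket

# Well-known default ports for the services mentioned above.
SERVICE_PORTS = {"http": 80, "smtp": 25, "pop3": 110}


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_host(host, services):
    """Map each requested service name to True (reachable) or False."""
    return {name: port_is_open(host, SERVICE_PORTS[name]) for name in services}


if __name__ == "__main__":
    # Replace "localhost" with the server you want to check.
    for service, ok in check_host("localhost", ["http", "smtp", "pop3"]).items():
        print(f"{service}: {'OK' if ok else 'NOT RESPONDING'}")
```

A successful connection only proves that something is listening, which is roughly the level of assurance described in the paragraph above.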

What turned into a very large problem was getting Nagios to email us when it noted a problem on a server. The virtual appliance uses Sendmail and Exim4 to act as a mail transfer agent, and I’m not a Unix guy by nature. How was I going to get this to work? After a few hours looking at various pages, I found the one I wish I had found at the beginning - http://pkg-exim4.alioth.debian.org/README/README.Debian.html. We still needed to modify some configuration and alias files using vi, but the majority of the configuration was handled via the dpkg-reconfigure exim4-config command. After this reconfiguration, we were able to immediately get Exim4 to send mail to our mail server, and we configured a distribution list to forward all Nagios traffic to several people. Although it wasn’t an issue, we also checked that SMTP traffic was allowed from the virtual appliance to our Exchange server.
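Before trusting Nagios to send notifications, it can help to push a test message through the local mail transfer agent by hand. The Python sketch below uses the standard smtplib and email modules. It assumes the MTA accepts mail on localhost port 25, and every address in it is a placeholder.

```python
import smtplib
from email.message import EmailMessage


def build_test_message(sender, recipient, host_name):
    """Build a plain-text message resembling a monitoring notification."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"TEST: notification path check from {host_name}"
    msg.set_content("If you can read this, the appliance can relay mail.")
    return msg


def send_via_local_mta(msg, mta_host="localhost", mta_port=25):
    """Hand the message to the MTA listening locally (assumed to be on port 25)."""
    with smtplib.SMTP(mta_host, mta_port, timeout=10) as smtp:
        smtp.send_message(msg)
```

On the appliance, something like send_via_local_mta(build_test_message("nagios@example.com", "ops-alerts@example.com", "nagios")) would exercise the same path a notification takes, with the two addresses swapped for a real sender and distribution list.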

Now we were ready for testing. Since no one was using POP mail services on our second mail server, I killed our POP virtual server in Exchange 2003 and waited to see how long Nagios would take to notice it. Not long, as it turned out. In about 2 minutes, I had the “Warning” message appear in our web tool and 2 minutes after that, several messages showed up in our mailboxes to let us know that POP wasn’t responding on our mail server. I restarted the POP virtual server and got several “Recovery” messages after that.

As a nice-to-have, we configured a manual DNS entry to allow the virtual appliance to be accessed from http://nagios/nagios. Manual DNS entries for different web servers are a lot easier for your team to remember than IP addresses.

In our last entry, we will talk about the Burn-in process and Lessons Learned. There are a few things you want to watch out for.

- “Site Monitoring and Operations” continues in Part 3…

19.12.07

Site Monitoring and Operations – Part 1

Although the startup I work for has a small infrastructure compared to other organizations, we strive to maintain maximum uptime and respond as quickly as possible when problems come up.

Site monitoring is part of this process – we needed a way to be informed about problems. Nothing is more embarrassing than getting a phone call from a user or customer saying, “Do you guys know your site is down?” We considered performing daily tests on each server – not only was this an inefficient use of our time, it wouldn’t help us if the site was alive at 8am only to be dead around noon. To solve this, we examined several options:

  • Automated Site Test – Using our automated site test application (QTP), we would run a series of simple login tests to verify that the site was alive. This option was rejected for a variety of reasons, chiefly because we needed a monitoring system that could show the health of our whole environment at a single glance. An automated test would cover the site and application, but it wouldn’t test our mail capability.
  • Outsourced Site Monitoring – Companies like Cittio and Siteuptime.com offer managed site monitoring. We may visit these in the future, but wanted to keep our budget as low as possible.
  • Commercial Site Monitoring Software – Cricket, Big Brother and SiteScope are site monitoring packages that are used in data center environments. As with outsourcing, we may visit this in the future, but we were looking for a cheaper way to do this, if it existed.
  • Open-source software – Nagios is an open-source application that runs on Linux to provide site monitoring. Being open-source and free, it was attractive but was going to require that we run a Linux server. We were going to have to allocate a single server or desktop-as-server to run this application.

Another project we had on our plate was examining the use of server virtualization. Our application development environment is housed on three servers and those servers have more than a dozen different commercial and open-source tools installed on them to assist in developing our product. Upgrading a single application can prove problematic, especially if other tools utilize the same underlying program, like ASP, Perl or PHP.

Out of the open-source options for virtualization (we decided early on to stick with open-source and not to go with products like Microsoft’s Virtual Server), we decided to go with VMware. The decision to do so was pretty simple.

VMware has been encouraging people to use their product by allowing individuals to build virtual appliances on their own and publish them to the rest of the world. The link to this page is here - http://www.vmware.com/vmtn/appliances/directory/. After a quick browse through the pages, we noted that someone had already built and published a Nagios virtual appliance. We downloaded it and started configuring it for our use.

Ruling by Policy

Quick Note: I wrote this in 2003 and thought I'd share it with you here. I think it illustrates how much I knew at the time and also how much I had to learn. I don't agree with everything that I've written below - I can spend time later giving my own personal rebuttal to this piece.

I don’t go to normal resources for wisdom and advice when it comes to how to get through my job. I’ve learned that there are enough discrepancies in working in Corporate America that my personal motto is “Sense the Irony”. Among my coping mechanisms, I look to sources like “Dilbert” for inspiration, and I recommend that anyone in or considering joining this monkey house go out and read Work Would Be Great If It Weren't for the People. Kernels of truth erupt out of this place. One of them is that if Bill Gates were to hand me a signed check for $2 Million, I’d be out of here so fast you’d never catch me. I checked around and it turns out most of my co-workers feel the same way, so it drills a few holes in that whole “I’m excited about this company” mantra that every HR department on the planet is chanting.

To the truths I've gleaned I've added my own bits of wisdom. One of them comes from Scott Adams' "The Dilbert Principle". He discusses a lot of different aspects of the corporate system and suggests his own business model to encourage people to contribute to the system by leaving earlier and thus making the time they are here more productive. He mentions in all of this that a company can't really do much to push creativity but it can do a lot to discourage it. His point was: hire the right people, give them the tools and get out of the way. I like his thinking and this is just my personal spin on his idea as it applies to me.

The job doesn't really require much creativity but communication is paramount. Heck, I AM the e-mail admin - my paycheck begins and ends with communication. Inter-group communication is something that the powers that be are pushing and I've been thinking about it: this falls in the same arena as creativity. They can't enforce communication but they can go a long way in stopping it. I find it humorous that my company is layering communication upon communication instead of ensuring that the low-level stuff is taking place.

In my new group, Operations, we've got Computer Operators. Tragically, it's not a very glamorous job. They do 12-hour shifts that consist of being there, changing tapes and calling if something blows up. It's thankless and boring but it's necessary to ensure that we have someone here 24 hours a day if something should happen. Like firefighters without the women chasing after you. Some folks in the company have this breathless "We need to get this going ASAP" line that just makes me want to ban the word 'ASAP' from any conversation that takes place. It's not like someone's going to die if they don't get this e-mail sent. It's not like the world will fall down if a print job doesn't go through. Not to hear them tell it. They get so cranky and upset if a single thing isn't working that you may need to call Grief Counseling. Oddly enough, these are the same people who nastily ask why you’re requesting extra in the budget for all these new toys they want to play with.

Before this turns into a bitter "I-hate-my-job" rant, let me change some focus. If I were running the show, I'd model this a lot more like my previous boss Jearome has it. He encourages his people to get the job done, points them in the direction of the people they need to talk with and then steps back. No staff meetings or meetings to discuss meetings. If he has a question, he'll come to you to discuss it or call. Other than that, nothing more to say. He senses the irony and oftentimes when I was hot under the collar about some mundane little thing he'd remind me "Keep cool, it's just a job." Words to live by. So, what would Tim do? Here are a few suggestions from the trenches:

Good management is more than just writing policy - Any organization can be easily turned into a galumphing, over-managed beast through the overapplication of policy and official decisions. Just look at any governmental organization, GM or AT&T. What makes smaller companies better is that decisions flow faster, people react quicker and the company adapts more rapidly. I was once an employee of McDonald’s (for all of three months. Can you imagine getting fired from a ‘trained monkey’ job?), and it was the only job I’ve ever held where every single detail of the job was documented on 8 ½ x 11 pages in 3-ring binders. Everywhere else, some seat-of-the-pants flying was involved. My limited experience has told me that it’s impossible to write a policy to cover every single event that happens within this place, not without hiring double your current staff and tasking one half to do the job while the other half writes up documents on how to do it. Sometimes you have to make decisions and go. Later on, you can refer to your original decision when making new ones. If it’s major enough, if it happens often enough, policy can be enacted.

Change for the Better – Be very careful that the decisions you make do not make the problem worse. In my current capacity, I’ve seen a lot of decisions made by management that involve changing our schedules. The new edicts they’ve handed down serve only to punish the subordinates, without whom the situation would have been much worse. This kind of goes back to the first suggestion, but it’s unique in that most managers try to create a blanket fix for problems, and blankets aren’t always the answer. Sometimes the problem is specific people, and nothing solves those problems except a nice eyeball-to-eyeball conversation. If movies about the Mob have taught us anything, it’s that sometimes by making an example of one person, you keep the group in line. Managers’ success depends on the efforts of their people. If those people are making things worse and not listening to direction, they’re making it a clear-cut case of them-or-me.

Lead the People, Moses - Leaders are managers, but managers aren’t necessarily leaders, although the good ones are. Lao-Tzu once said, "The wicked leader is he who the people despise. The good leader is he who the people revere. The great leader is he who the people say, 'We did it ourselves.'" Great management is so subtle that those being led usually don’t realize it. Great ideas are ones that don’t have to be sold. The best form of subtle management is leading by example. There will always be the ‘grunt jobs’ that are dished out in any organization. Covering the weekend pager, working the dead watch in a data center, on and on and on. Why not volunteer to take a few of those shifts? It’d certainly foster a “We’re all in this together” spirit. It would also prove that you believe in the decisions you make. As a grunt, I need to know that you know what you’re doing and that you’re serious about the decisions you make.

No job is perfect. If you find you hate a certain person at work, I guarantee you that his or her clone works wherever you’re going next. It’s best to learn to make peace with them, learn to get along or get around them. At the same time, you need to remember that the higher up the ladder you are, the more you’re likely to make someone’s day or ruin it by your decisions. Like all the backups I make before I work on a server, you need to leave yourself room in case something goes south. Be humble, listen and act on some suggestions and you’ll make yourself that much more fun to work for.

Growing Technically / Growing Professionally

It's important to learn that part of your technical job isn't about tech. It is about people.

Apple gets that right in that they are as passionate about their products working for people as they are about the technology itself. The days of snarky, technically-brilliant people with no social skills are over. The days of 'nerds' are over. If you plan to be in a technical field, you must be prepared to polish your professional skills as much as your technical skills. This is what makes you valuable in IT.

Every job has a technical side - carpenters are very skilled in their own way, as are bakers, mechanics, architects, lawyers and doctors. What everyone realizes, almost instinctively, is that it's not enough just to be good at the technical side of your job. How many horror stories have you heard about the doctor with no bedside manner, the shady contractor and the crooked auto mechanic? You must also be professionally competent.

I published a number of 'professional growth' essays on my older blog area. I didn't write them to put myself on a pedestal - I had simply learned something about myself and wanted to jot it down for future reference. Some people have come back and told me how much they enjoyed them and so - I'll be re-publishing them here to put what I've learned about professional growth in a more strict context. I'll tell you when it was re-published; it's not like you would have had to pay to read the old archives anyway.

Hi, there.

Thanks for visiting. Startupgeek is a blog about several things:

1. The things I've learned technically in the past 12 years.
2. The things I've learned professionally in the past 12 years.
3. New things I learn, both tech and professional.
4. Random thoughts about working in IT.

What the blog is not about:
1. My employer - I'm vetting the blog to protect their privacy. This isn't a corporate-sponsored blog, and its content has not been approved as representing the company. To eliminate that as an issue, I'm leaving them out of it and trusting everyone to respect that.
2. Politics - I'm politically neutral...I don't discuss politics.
3. Product reviews - I don't think I'd ever hope to compete with CNET or Tom's Hardware.

About Me
I've been working in IT since August of 1995. In a nutshell, I started in phone support, moved on to desktop support, then to systems administration, then data center operations and finally, deployment/operations for a startup in Silicon Valley.

My opinion is just that: opinion. I do my best to back up my POV with reasonable, empirical data. I reserve the right to be wrong about something and fortunately, I don't make the same mistakes too many times. I'm hoping this information will help someone else do the same.