Wednesday, March 19, 2008

More Issues

Naturally, turning off DNS doctoring has created more issues. This was, to a certain extent, expected. But with our public website down, it was the surefire fix, so we went with it.

The issue now lies on the inside network. Anyone who accesses a site by its public FQDN now gets the outside public IP rather than the inside private IP. As long as a site resided in the DMZ, this was taken care of by reconfiguring its static translation to use the outside public address instead of the inside private one.
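A minimal sketch of that reconfiguration in PIX 7 syntax. It assumes a DMZ server with a hypothetical real address of 10.1.1.100 and a hypothetical public address of 203.0.113.10. The interface names dmz, inside, and outside are also assumptions:

```
! Remove the old (hypothetical) mapping, where inside hosts reached the
! DMZ server at its real, private address
no static (dmz,inside) 10.1.1.100 10.1.1.100 netmask 255.255.255.255
!
! New mapping: inside hosts reach the same server at its public address,
! so the public DNS answer works from the inside as well
static (dmz,inside) 203.0.113.10 10.1.1.100 netmask 255.255.255.255
!
! Translation for Internet users, unchanged
static (dmz,outside) 203.0.113.10 10.1.1.100 netmask 255.255.255.255
```

Existing translations may need to be cleared before inside hosts pick up the new mapping. One way is clear xlate, which is disruptive.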

There are two cases where this could not be resolved, however. The first is when the server resides on the private network instead of the DMZ. Unfortunately, I inherited a bunch of these. I'm trying to get them moved off the private network, but I still have a ways to go. Fortunately, I haven't found one that was a showstopper yet. If I do, I'll probably have to NAT it on our inside core switch, but I'd like to avoid that if possible.

The second case is people who access our DMZ services from remote sites across the VPN. Since they now resolve a public instead of a private address, their traffic crosses the public Internet rather than the VPN. This has broken things like editing the public website remotely, which, naturally, the firewall doesn't allow from the public Internet.

We've been able to survive the little issues thus far. We also discovered the root cause of the original problem: several static NAT translations that I inherited were applied backwards. This was causing DNS replies headed to the public Internet to be doctored! I'm not sure why this wasn't broken before.

So we know the fix; now I just have to convince my VP that we won't break the public website again. It looks like it will be at least 2 weeks before we can push the change through. I just hope we don't put in so many band-aids between now and then that we can't go back to the correct way.

Sunday, March 16, 2008

More Issues

Some previous firewall configuration was carried over as part of the migration as well. Unfortunately, this created some more problems. For some reason, DNS doctoring was turned on pretty much everywhere. It was overused to the point that DNS replies for Internet addresses were being rewritten with DMZ addresses. Naturally, this caused a number of Internet services to fail.

Now, I really haven't messed with DNS doctoring much since the alias days. It seems to be a lot easier now, since all you need to do is add the dns option to the static translation.
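For reference, the PIX 7 form looks roughly like this, with hypothetical addresses. Watch the argument order: the real interface comes first in the parentheses, but the mapped address comes first after them. That makes it easy to enter a translation backwards:

```
! Syntax: static (real_ifc,mapped_ifc) mapped_ip real_ip netmask mask [dns]
! Hypothetical DMZ web server: real 10.1.1.100, public 203.0.113.10.
! The trailing dns keyword asks DNS inspection to rewrite A records
! carrying 203.0.113.10 so that they carry 10.1.1.100 instead.
static (dmz,outside) 203.0.113.10 10.1.1.100 netmask 255.255.255.255 dns
```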

But the real question is, why wasn't this failing before the upgrade? I don't know. I turned off DNS inspection to kill it, since on PIX 7 the doctoring is performed by the DNS inspection engine. That fixed the issue, but it took time to propagate.
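A sketch of turning DNS inspection off globally. It assumes the factory-default policy-map and class names (global_policy and inspection_default). The exact inspect dns line to negate is whatever show running-config policy-map reports, which varies by release:

```
! Assumes the PIX 7.2 default inspect line; 7.0/7.1 defaults use
! "inspect dns maximum-length 512" instead
policy-map global_policy
 class inspection_default
  no inspect dns preset_dns_map
```

This is the blunt instrument. The targeted fix is removing the dns keyword from the statics that should not be doctored.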

Our default DNS TTL is 3 hours, so anyone who grabbed a bad record could have to wait up to 3 hours for their upstream DNS servers' cached copy to expire.

To check how much TTL remains on a cached record, run nslookup in debug mode:

c:\>nslookup
Default Server: cns.manassaspr.va.dc02.comcast.net
Address: 68.87.73.242
> set debug
> www.somecompany.com
Server: cns.manassaspr.va.dc02.comcast.net
Address: 68.87.73.242
------------------------
Got answer: HEADER: opcode = QUERY, id = 5, rcode = NOERROR
header flags: response, want recursion, recursion avail.
questions = 1, answers = 1, authority records = 0, additional = 0
QUESTIONS: www.somecompany.com, type = A, class = IN
ANSWERS: -> www.somecompany.com internet address = 10.1.1.100
ttl = 205 (3 mins 25 secs)
------------
Non-authoritative answer:
Name: www.somecompany.com
Address: 10.1.1.100
>

Late Night Troubleshooting

Unfortunately, issues continued late into the night. The joy of working on intermittent issues. On several occasions we thought we had it resolved, only to have the problems return later.

As it turns out, a transparent firewall that permits everything still runs packet inspection. We had placed one between two mail servers, and it was inadvertently inspecting SMTP and occasionally killing communication between them. This only seemed to happen under a high message load.

Over the years, Cisco has been driving me and many others crazy with SMTP inspection. It was actually the first thing I checked on our external firewall, and it was not enabled there. I didn't even think to look at the transparent firewall until later.

A primer on mailguard, etc.
Starting around PIX OS 4.x, Cisco dabbled in application inspection with the mailguard feature. Since SMTP has only a handful of core commands (HELO, MAIL, RCPT, DATA, etc.), mailguard attempted to allow only those commands, and it also masked the server's banner, to help protect the mail server.
In PIX 6.x this became fixup protocol smtp 25.
And now in PIX 7 it's called inspect esmtp.

Over all these years, the general rule has always been to disable it immediately.
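For the record, disabling it looks like this. The PIX 7 lines assume the factory-default policy-map and class names (global_policy and inspection_default). They apply to a transparent-mode firewall as well:

```
! PIX 6.x
no fixup protocol smtp 25
!
! PIX 7.x (routed or transparent mode)
policy-map global_policy
 class inspection_default
  no inspect esmtp
```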

There are certain things Cisco does that amaze me. After 10 years of everyone changing these settings out of the box, why doesn't Cisco just make them the default? Off the top of my head, some examples are:
  • Turn off SMTP proxying/fixup/inspect/mailguard
  • Disable auto-summarization
  • Turn on service password-encryption
  • Turn on service timestamps
  • Automatic "terminal monitor"
I'm sure Cisco would say it's because of compatibility issues with those upgrading from older code. But come on, somebody upgrading to 12.4 should at least have some inkling about auto-summarization.
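Until then, here is roughly what it takes to change those defaults by hand on an IOS router. The EIGRP AS number is hypothetical. The timestamp options shown are one common choice, not the only one:

```
! Global configuration mode
service password-encryption
service timestamps debug datetime msec
service timestamps log datetime msec
!
! Per routing process (EIGRP AS 100 is hypothetical)
router eigrp 100
 no auto-summary
!
! "terminal monitor" is an exec command, not a configuration command,
! so it has to be re-entered from the router# prompt on every vty session
```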

Saturday, March 15, 2008

Data Center Rebuild

Today we redesigned the core of our datacenter. Some of the configuration tasks included:
  • HSRP with object tracking
  • dot1q trunking to a router
  • transparent firewall BPDU passthrough
  • BGP advertise-map configuration
One of the primary benefits of this configuration was that it allowed us to add some functionality to the network while replacing a number of layer 3 switches with layer 2 ones. This lets us use less expensive equipment and redeploy the freed-up devices to more suitable locations. It also simplifies the network, which is generally a good thing.
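A condensed sketch of the first, second, and fourth items in IOS syntax. Interface names, VLAN 10, addresses, and AS numbers are all hypothetical. The BGP piece shows one common use of advertise-map (conditional advertisement to a backup peer), not necessarily the exact policy we deployed:

```
! HSRP with object tracking: lose 20 priority if the uplink goes down
track 1 interface GigabitEthernet0/1 line-protocol
!
interface Vlan10
 ip address 10.10.10.2 255.255.255.0
 standby 1 ip 10.10.10.1
 standby 1 priority 110
 standby 1 preempt
 standby 1 track 1 decrement 20
!
! dot1q trunk to a router: switch side
interface GigabitEthernet0/24
 switchport trunk encapsulation dot1q
 switchport mode trunk
!
! Router side: one subinterface per VLAN
interface GigabitEthernet0/0.10
 encapsulation dot1Q 10
 ip address 10.10.10.3 255.255.255.0
!
! Conditional advertisement: announce our prefix to the backup peer
! only while a route learned from the primary is missing from BGP
ip prefix-list OUR-PREFIX seq 5 permit 198.51.100.0/24
ip prefix-list PRIMARY-ROUTE seq 5 permit 203.0.113.0/24
!
route-map ADVERTISE permit 10
 match ip address prefix-list OUR-PREFIX
route-map PRIMARY-UP permit 10
 match ip address prefix-list PRIMARY-ROUTE
!
router bgp 65001
 network 198.51.100.0 mask 255.255.255.0
 neighbor 192.0.2.9 remote-as 64497
 neighbor 192.0.2.9 advertise-map ADVERTISE non-exist-map PRIMARY-UP
```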

Based on my experience with the CCIE lab, I changed my normal plan of attack here. In the past I preferred to gut everything and replace it all at once. Now I tend to follow lab strategy and do one piece at a time, verifying as I go.

As in the lab, things go much slower this way, but the work gets done right. When something doesn't come up, I know exactly where to look for the issue instead of having to troubleshoot everything.

As expected, there were a few bugs to work out, but just about everything ended up as designed. The great work up front and during the implementation by the whole team really paid off.

Some of the learning experiences were:
  • A firewall in transparent mode can pass layer 2 traffic, but not CDP, and it drops BPDUs unless an EtherType access list permits them. We needed BPDUs to pass so that we could keep a backup link in place and let RSTP block it. If this hadn't worked, another option would have been the "switchport backup" command on the 2560/3560 series switches, which functions a bit like a layer 2 version of the backup interface router configuration command.
  • It always helps when default gateways are set correctly
  • When using HyperTerminal to paste commands, set a character delay to prevent buffer overflows
  • Duplicate IP addresses are very bad things
Most of these were minor and corrected quickly.
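A sketch of both options, with hypothetical interface names. On PIX/ASA 7.x in transparent mode, the EtherType list has to be applied to both bridged interfaces:

```
! Transparent firewall: permit BPDUs so RSTP on the switches can see
! the loop through the firewall and block the backup link
access-list BPDU ethertype permit bpdu
access-group BPDU in interface inside
access-group BPDU in interface outside
!
! Switch alternative (Flex Links): Gi0/2 stays in standby until Gi0/1
! goes down; spanning tree is not run on these two links
interface GigabitEthernet0/1
 switchport backup interface GigabitEthernet0/2
```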

Saturday, March 8, 2008

How the CCIE helped in the first week

On Friday night I had a BGP implementation to do at our India location. There have been a few transatlantic cable breaks of late, so we really needed to get a redundant provider up and running. The change began at 10pm with a relatively simple setup: all I wanted was a default route from each provider, using AS-path prepending and weight to ensure the primary circuit is preferred.
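That setup can be sketched like this. All values are hypothetical: local AS 65001, a primary ISP at 192.0.2.1 (AS 64496), and a backup ISP at 192.0.2.5 (AS 64497). Weight only affects this router's own choice of default route, while the prepend makes the backup path look longer to the outside world for inbound traffic:

```
! Prepend our own AS on routes sent to the backup provider
route-map BACKUP-OUT permit 10
 set as-path prepend 65001 65001 65001
!
router bgp 65001
 neighbor 192.0.2.1 remote-as 64496
 neighbor 192.0.2.1 weight 200
 neighbor 192.0.2.5 remote-as 64497
 neighbor 192.0.2.5 route-map BACKUP-OUT out
```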

While one ISP got it right, the second did not. My router's memory wasn't maxed out, so I needed to make sure I didn't receive the entire BGP table. Sure enough, once the BGP adjacency came up, I began to receive more than I bargained for.

Now the race was on. I had to filter everything but the default route before the router ran out of memory. Of course, just about any CCIE should be able to do this.

ip prefix-list DEFAULT permit 0.0.0.0/0
!
! Accept only the default route; everything else hits the implicit deny
route-map DEFAULTONLY permit 10
 match ip address prefix-list DEFAULT
!
router bgp xxxxx
 neighbor xxxx route-map DEFAULTONLY in

It took about 30 seconds. No DocCD needed. I didn't even have to use the ?. It worked right the first time.
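Routes that arrived before the inbound route-map was applied stay in the table until the policy is re-run against them. A sketch of that step and a quick check, using the same elided neighbor address. The soft reset assumes the ISP router supports route refresh, or that soft-reconfiguration inbound is configured:

```
! Exec mode: re-apply inbound policy without dropping the session
clear ip bgp xxxx soft in
!
! Only 0.0.0.0/0 should remain from this neighbor
show ip bgp neighbors xxxx routes
```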

This is why the stress and pressure and time constraints exist in the lab. Unexpected things happen. Sometimes you have to act quickly. Consequences can be severe.

A new purpose

It has been almost a week since I passed the CCIE lab exam. There were a number of congratulations and a celebratory dinner alone with my wife. Aside from that it has been life as usual. No ticker tape parades. No job offers for obscene amounts of money.

My life has changed already though. It's Saturday and I did not wake up at 6am. I slept in until nearly 11. My wife and son are out of town this weekend, so I have been watching movies and hanging out with my dog. At some point I need to lower my son's crib since he's very close to pulling himself up.

The next change is that I'm finding it difficult not to brag. I called a few friends and sent a few more emails. But most people would probably not know unless they paid attention to my new email signature. I'm not sure if I'll keep it in there long-term, but I will for at least a week or two.

I'm also not sure what to do next. I'm planning on casually studying for the Security track. I'm really not in a rush. It took me 10 years to get the first CCIE, so that shouldn't be too difficult to improve upon.

One thing I did notice: there are a lot of blogs by people either studying for the CCIE or helping others get there. But what are you supposed to do after you're certified?

There is no official blueprint for what to do after you're certified.

So, the new purpose of this blog is to discuss what happens next.

Sunday, March 2, 2008

Passed!


Now I need to figure out what to do with all my newfound free time.


Ed Balow
CCIE #20152

Saturday, March 1, 2008

CCIE Attempt #2 Task 1/2

Here I am at the hotel in RTP the night before my second attempt. I arrived early enough to have a nice dinner and talk football with the bar patrons for a bit. To my dismay, the Bears traded Berrian to the Vikings. So now they are down two starting wide receivers, and are about to lose Lance Briggs, with only a 1-year extension for Rex that's-just-Gross-Man to show for it. It's not looking like a very promising season.

The first task of the lab, which many seem to fail, is getting a good night's sleep the night before. I bombed that miserably on my first attempt. I came down to RTP with my family and my dog. My 7-month-old son was still colicky at the time and thought he'd help me study until 4am. Unfortunately, I don't understand babytalk, so all it meant was that I only got about 2 hours of sleep. I ended up getting to the lab about 15 minutes late. By 6 hours in I was feeling pretty drained and didn't have much motivation to do a detailed second verification of everything.

Because of some misconceptions I only recognized in later study, I probably would not have passed anyway. But I'm going to do everything I can to get some solid sleep tonight. I did a test drive to the Cisco campus to make sure I know the way. I have a 6am wakeup call scheduled, and my cell phone alarm and the hotel room alarm are set for that time as well. So I'm calling my sleep preparations task 1/2. Let's just hope I can follow through.

If you happen to read this, Laura, thanks for all of your support this year. You've allowed me to get a heck of a lot of studying done on the weekends and I really appreciate it.