Google Reader offline with Google Gears

I noticed a new item in the top right of my Google Reader window today:

Google Reader Offline (new!)

Interesting! I clicked the Offline link and was introduced to Google Gears (beta).

Google Reader Offline with Google Gears

After a short installation and a restart of Firefox, I was greeted with a little green arrow in Google Reader. Clicking it started a background download of 2,000 recent RSS feed items, which became available to me even when I wasn’t “on the wire”.

Once again, Google shows the rest of the world how powerful tools can be made simple. Will GMail Offline be far behind?

Compiled help files for Castle Project

I’ve been a fan of the Castle Project since discovering it last year. One thing that has always irked me is its spartan documentation, so I did something about it: I created some.

Using Sandcastle and the Sandcastle Help File Builder (SHFB), I compiled the help files for the latest Castle build (release 3817). All Castle.* libraries are included.

Help files are available in CHM, HxS, and HTML formats, as well as the SHFB build file.

I hope you find these useful!

Excel 2007’s unresponsive autorecovery feature

For the past 15 minutes, the newly installed Excel 2007 has been unresponsive. Well, not totally unresponsive. It is doing something in the background: saving AutoRecover information. I know this because, every so often, I can get Excel to show me a little something like the following:

Excel autorecover... still waiting...

Task Manager periodically says Excel is “not responding” and other times says it is “running” (maybe so, but it’s still not responding to me).

In defense of Excel, this was a rather large document: over 15,000 rows and 26 columns — 12.9MB on disk. However, there’s no reason for a background process to make an application unresponsive — especially when the background process in question is something that is supposed to protect you from the application becoming unresponsive.

I also can’t figure out why saving an autorecovery file takes about 20 minutes, when saving a new copy of the same file takes about five seconds.

Disclaimer: I happily use OpenOffice for personal use and Google Spreadsheets for shared documents. Office 2007 was installed on my work desktop so I can evaluate it. Let’s say the evaluation isn’t going so well right now.

How much tax do you pay on your phone bill?

I decided to take a closer look at my phone bill today. The total of the bill was $57.82. Of that, $40.74 was for the actual telephone service; the remaining $17.08 was taxes, stipends, and other government fees. You could surmise that about 29.5% of my total bill went to the government in the form of taxes.

Of course, if you did that, you’d be wrong. It’s actually much worse (for you, not for the government).

The $57.82 I use to pay my phone bill is money that I earned and paid taxes on. As a resident of New York City (by many calculations the most-taxed place in the country), I take home about 75 cents for every dollar earned. To take home the $57.82 to pay my phone bill, I have to earn $77.09.

The real cost of my phone bill is $77.09, of which $36.35 was paid in taxes. In other words, 47% of my earnings that go towards paying my phone bill get redirected to the government.
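The arithmetic above can be sketched as a few lines of JavaScript (the figures come from the bill; the 75-cent take-home rate is the estimate from the post):

```javascript
// Figures from the bill
const bill = 57.82;                      // total phone bill
const service = 40.74;                   // actual telephone service
const phoneTaxes = bill - service;       // 17.08 in taxes, stipends, and fees

// Estimated take-home rate for a New York City resident
const takeHomeRate = 0.75;               // keep ~75 cents per dollar earned

const grossNeeded = bill / takeHomeRate;      // ~77.09 earned to take home the bill
const incomeTaxes = grossNeeded - bill;       // ~19.27 paid in income taxes
const totalTaxes = phoneTaxes + incomeTaxes;  // ~36.35 total to the government

console.log((totalTaxes / grossNeeded).toFixed(2)); // prints "0.47"
```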

When I get my cell phone bill, I’ll do a similar analysis. I’m not looking forward to it — it might be quite depressing.

The Internet Politician: When?

In his recent column, Republicans, Democrats, Internet Tubes, Gak!, John Dvorak talks about the lack of technology-savvy politicians in our national government. This should come as no surprise, considering the average age of a Congressperson is 56, and most 56-year-olds I know are not very computer literate — at least not beyond the basics.

Dvorak goes on to wonder when we’ll see a tech-savvy person in Congress, and he figures it won’t be until 2035. His figure is based on an entry age of 40 (which is plausible), which is about 40 years beyond 1995 because…

[b]y my calculations, the first generation of kids who were totally immersed in the computer age was born somewhere between 1984 and 1995.

Now hold on just a second there, John! Of all people, you should know better. I was one of those “kids totally immersed in the computer age”, and I was born in 1970. I’ve had a computer in my house since I was nine years old (the first was an Atari 800), and I was well entrenched in the development of everything we take for granted today: spreadsheets, graphical user interfaces, online communications (first bulletin boards, eventually the Internet)… My generation was among the first to take advantage of pagers and cell phones because we were the first of an age to understand and afford them.

In my opinion, you’ll see a truly tech-savvy politician in the next ten years… That is, if you could pry them away from the technology field, which is a heck of a lot more interesting than politics.

When the going gets tough…

The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood; who strives valiantly; who errs, who comes short again and again, because there is no effort without error and shortcoming; but who does actually strive to do the deeds; who knows the great enthusiasms, the great devotions; who spends himself in a worthy cause; who at the best knows in the end the triumph of high achievement, and who at the worst, if he fails, at least fails while daring greatly, so that his place shall never be with those cold and timid souls who neither know victory nor defeat. — Theodore Roosevelt, April 23, 1910

It was about ten years ago when I first stumbled across that quote, and I’ve carried a copy of it in my wallet ever since. I have shared that quote with many people over the years, with the hope that it inspires and strengthens those the same way it inspires and strengthens me.

I’ve been going through a bit of a rough time lately — a new job that has started demanding lots of hours, a long commute (averaging over two hours each way), some financial struggles (this new job pays half what I was making three years ago and has brought me to a salary level I haven’t seen in over ten years), and a major lack of time to spend on family, friends, and side work (to address the financial situation)…

I’ve spent all too much time over the past week or so feeling sorry for myself (quietly, of course), and finally confessed as much to a friend today. That friend reminded me of the quote posted above — the quote which I told him about years before, but I had forgotten about. As he said, “You are in the arena.”

Needless to say, I feel a heck of a lot better now, and I’m ready to fight the good fight again. Sometimes, you need to stand back, look at things, and refocus your energy in a positive manner.

Five simple rules for creating delimited text files

Here are a few tips for those who provide raw data in text files (think CSV files and the like).

  1. Surround all text fields in double quotes, even when a particular value contains no comma. If you don’t, the field delimiter becomes inconsistent from row to row, forcing each field to be parsed separately (i.e. stripping the “optional” quotes).
  2. Use consistent line endings. Pick one of CR/LF, LF, or CR, and use the same ending in all your files.
  3. Put column headings in the first row only. This is more a convenience than a necessity. If you make your first row column headings, make sure it is only the first row.
  4. Every row past the first should have an identical schema. Don’t try to be fancy and have different row types in one file. Each file should have the same number and sequence of columns.
  5. Provide delimiters for all columns, even when a row does not have data for them. For example, in a comma-delimited file with five columns, if a row has data in only the first two columns, make sure the row ends with three commas. Failure to do so implies missing or incomplete data.

When text files follow these guidelines, I can write a script to import them into a SQL table (using BCP and a simple batch file) in a few minutes. Each broken guideline requires additional cleanup and more complex import steps, adding significant development (and debugging) time that shouldn’t be necessary.
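The rules above can be sketched as a small row-writing function (a minimal sketch — the function name and shape are mine, and it assumes a comma delimiter, double-quoted text fields, and CRLF line endings):

```javascript
// Build one well-formed delimited row from an array of values.
function toRow(values, columnCount) {
	var cells = [];
	for (var i = 0; i < columnCount; i++) {
		var v = values[i];
		if (v === undefined || v === null) {
			cells.push('');                                  // rule 5: keep the delimiter even for empty columns
		} else if (typeof v === 'string') {
			cells.push('"' + v.replace(/"/g, '""') + '"');   // rule 1: always quote text fields
		} else {
			cells.push(String(v));
		}
	}
	return cells.join(',') + '\r\n';                         // rules 2 and 4: one line ending, one schema
}
```

For example, `toRow(['Smith', 42], 5)` produces `"Smith",42,,,` followed by CRLF — the trailing commas make the missing columns explicit.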

Lost in (Google) translation

For the past few days, I’ve been working on importing raw play-by-play data for Japanese baseball. Once the import scripts and queries were written, I needed a way to audit the results. To do that, I needed a source of up-to-date statistics on Japanese baseball players.

Yahoo! provides a rather robust web site for Nippon Professional Baseball (NPB). Unfortunately, the web site is in Japanese, a language I neither read nor have support for on my computer, so the screen was, for the most part, filled with question marks, as seen below.

Yahoo! Sports NPB Baseball (before translation)

By using Google Translator, I was able to transform this into the following:

Yahoo! Sports NPB Baseball (with Google translation)

I wasn’t expecting a perfect translation (it would be silly to do so), but the results were certainly entertaining.

  • A “base on balls” is a “giving Annie Oakley”.
  • A “hit batter” is a “giving dead sphere” (the poor batter).
  • “On base percentage” is “coming out base ratio”.
  • “Slugging percentage” is “long batting average”.

If you look at a translated hitter’s page, you’ll see this unusual description of a player’s at-bat:

Two racketeers, empty three swing, medium flying it is cheap, the left ? flying, two racketeers

Who says there’s no racketeering in professional baseball today?

Three tips for grief-free project estimates

Having spent six of the past ten years as a consultant, I’m all too familiar with the practice of estimating. Every client wants an estimate, and every client wants your estimate to be accurate. Of course, clients don’t want to give you the concrete requirements needed for an accurate estimate, which compounds the problem.

Scott Hanselman has a nice post about estimating, where he mentions two lessons I learned over the years:

  • Make your estimate, then double it. I actually took this a step further. If an estimate had to be given based on very sketchy requirements, I’d double it twice (effectively quadrupling it). This practice leads to…
  • Under-promise and over-perform. Always make sure your estimate gives you sufficient cushion to come in ahead.

A third lesson he doesn’t mention is to be willing to walk away from a client if your estimate is too high. If a client balks at your estimate (even if you double, or quadruple, it), you can either reduce the scope of your proposed work (and thus reduce the estimate) or walk away. I’ve taken projects that I’ve regretted taking after all was said and done, and most of them can be attributed to me skimping on my estimate because the client was scared off by my original (and, usually, more accurate) estimate.

These lessons go not just for programming projects, but for nearly everything in life. Over-estimate, under-promise, over-perform, and don’t shortchange yourself. Words to live by.

Automatically generate (partial) XML format files for BCP

I’ve been working with a lot of raw data files lately — data files that come in some delimited format which I must move into a SQL database. Most of the files are CSV files, and some of them have over 100 columns.

Using BCP to import CSV files into SQL is a pleasure. The only annoyance is writing those XML format files. Actually, the annoyance is not writing them; it’s going through the CSV files (which, of course, are undocumented) to determine what terminators are used for each column.

Here’s a sample of what I mean:

2934,128321,2782,"2007-04-32","Excluded",2321,,22

Fortunately, most CSV files are consistent in their use of quotes, but going through dozens of columns to determine the terminators is a pain. The terminators in the above example aren’t just commas; they can also include quotes. Column three, for example, is terminated by a comma followed by a quote.

To generate the <record> section of my format files, I wrote the following JScript, which reads the first line of a text file, finds each comma, determines whether it is preceded or followed by a quote, and generates the appropriate XML.

//get the filename from the command line
var args = WScript.Arguments;
if (args.length == 0)
{
	WScript.Echo('Usage: cscript getdelims.js <filename>');
	WScript.Quit();
}

var filename = args(0);

//read the first line of the file
var fso = new ActiveXObject("Scripting.FileSystemObject");
var file = fso.OpenTextFile(filename, 1);
var contents = file.ReadLine();
file.Close();
file = null;
fso = null;

//find each comma and check for adjacent quotes
var cnt = 0;
for (var i = 0; i < contents.length; i++)
{
	if ( contents.substr(i,1) != ',' ) continue;
	cnt++;
	var delim = ',';
	if ( i > 0 && contents.substr(i-1,1) == '"' )
		delim = '&quot;,';
	if ( i+1 < contents.length && contents.substr(i+1,1) == '"' )
		delim += '&quot;';
	WScript.Echo('\t<FIELD ID="' + cnt + '" xsi:type="CharTerm" TERMINATOR="' + delim + '" />');
}

The output can be copy/pasted right into your format file. The example content above would generate the following.

        <FIELD ID="1" xsi:type="CharTerm" TERMINATOR="," />
        <FIELD ID="2" xsi:type="CharTerm" TERMINATOR="," />
        <FIELD ID="3" xsi:type="CharTerm" TERMINATOR=",&quot;" />
        <FIELD ID="4" xsi:type="CharTerm" TERMINATOR="&quot;,&quot;" />
        <FIELD ID="5" xsi:type="CharTerm" TERMINATOR="&quot;," />
        <FIELD ID="6" xsi:type="CharTerm" TERMINATOR="," />
        <FIELD ID="7" xsi:type="CharTerm" TERMINATOR="," />

A few minutes writing a script, and I won’t be looking at CSV files with too many columns ever again (at least, not for the reason of writing XML format files).