hayley 365


the dumb things I see

10 Jan 2014

2014-01-10 01:57:21 -0600

One of the downsides of working on Cyclocane is having to work with government agency data, which often doesn't appear to have been very well thought out.

In general, there's the problem that there's absolutely no standard way to deliver the data. JSON? YAML? What's that? No, it's super-custom plain text written for humans. And among the government agencies, no two are using the same custom text format. And then there's the fun of how they like to change certain things on a whim. Or at least, it often feels that way.

By far the dumbest thing I've seen comes from the India Meteorological Department (IMD). They deliver their warnings in PDF format, and the layout appears to change every single time. There's no consistency, and ergo there's no way that I'm going to try to build a scraper against the data. What's so frustrating is that they're not doing anything that even requires a PDF. It appears to be basically just a Word document that someone typed up, which could've easily been turned into plain text instead.

And the IMD is not helping itself either. I often see these government websites go down under the burden of heavy traffic when there's an active tropical cyclone in their neighborhood. So here they are delivering all of their cyclone warnings in a format that's taking like 10-100 times the bandwidth compared to just delivering it in plain text.

So this rant came because Meteo France is doing something stupid. See, they apparently never considered that there might be more than one tropical cyclone active in their area at any given time.

And that's exactly what happened today when there were two different tropical disturbances active at the same time.

So what's the problem? As I understand the system, these meteorological warnings come with headers that instruct the receiving system what filename to use. So how most agencies handle this is by having a few filenames available, so that if there is more than one storm going on at the time, the next warning just uses the next available filename.

Not so with Meteo France. When you look at the headers, they only have one filename available. So basically, what was happening today was that their system would emit one warning with one filename, then emit a different warning but with the same filename, which would result in the first warning being completely overwritten.
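
Here's a toy sketch of the difference, as I understand it. Everything in it is made up for illustration: the `warning-1` / `warning-2` filenames, the `store` and `assign_filenames` helpers, and the advisory text are hypothetical, not the real header format or anyone's actual receiving system.

```python
def store(bulletins):
    """Simulate a receiver that saves each bulletin under the filename its header names."""
    files = {}
    for filename, text in bulletins:
        # Same filename as before? The earlier bulletin is silently replaced.
        files[filename] = text
    return files


def assign_filenames(storm_ids, slots):
    """Give each active storm its own filename slot, in order of first appearance."""
    assigned = {}
    for storm in storm_ids:
        if storm not in assigned:
            if len(assigned) >= len(slots):
                raise ValueError("more active storms than filename slots")
            assigned[storm] = slots[len(assigned)]
    return assigned


# One filename for everything: the second storm clobbers the first.
single = store([
    ("warning-1", "storm A advisory"),
    ("warning-1", "storm B advisory"),
])

# A few filenames to rotate through: both advisories survive.
names = assign_filenames(["A", "B"], ["warning-1", "warning-2"])
multi = store([
    (names["A"], "storm A advisory"),
    (names["B"], "storm B advisory"),
])

print(single)
print(multi)
```

With a single slot, only the storm B advisory is left on disk. With two slots, each storm keeps its own file.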

So, my first task was to just do a whole bunch of digging to find a site that wasn't vulnerable to this overwriting. I found data from both Unisys and WMO and settled on using data from WMO's site.

So basically then I needed to write the code to support scraping the site.

And in the process, I ended up deciding to knock off a to-do list item: I'm now scraping a tiny bit more data out of the Meteo France advisories.

And then later this evening, my Fiji scraper broke again. That's like 3 or 4 times in just a few days.

I write really fragile code apparently.