Monday, December 8, 2008

String concatenation, now in near infinite time!

One particular thing has always surprised me in languages like Java and C#. Consider the following piece of code, taken from this file in ChanServ (a bot in the Spring lobby server for moderation/administration purposes.)

// precondition: there exists a logfile with filename ,
// which may contain a very large amount of lines, which are
// to be inserted in a MySQL database.

ArrayList<Long> stamps = new ArrayList<Long>();
ArrayList
<String> lines = new ArrayList<String>();

String update = "";
BufferedReader in = new BufferedReader(new FileReader(file));
String line;

int lineCount = 0;
while ((line = in.readLine()) != null) {
if (line.trim().equals("")) continue; // empty line

long stamp;
try {
stamp = Long.parseLong(line.substring(0, line.indexOf(' ')));
} catch (NumberFormatException e) {
Log.error("Line from log file " + file.getName() + " does not contain time stamp. Line is: \"" + line + "\"");
return ;
}

if (lineCount == 0) {
update += "INSERT INTO `" + name + "` (stamp, line) values (?, ?)";
} else {
update += ","+ Misc.EOL + "(?, ?)";
}

stamps.add(stamp);
lines.add(line.substring(line.indexOf(' ')+1, line.length()));

lineCount++;
}
in.close();

update += ";";

// it continues by creating a prepared SQL statement and setting
// the parameters using the stamps and lines ArrayLists...

Now, do you already see the issue? Don't worry if you don't - I've seen the same issue in various projects (including commercial projects) already, so you're not the only one who doesn't see it.

Let me give you a hint related to this specific code: One day the box hosting the Spring lobby server went down. So when it came back up I began to restart the lobby server software. First the lobby server itself. All was fine. Then ChanServ, the bot which contained the above code back then. I ran it's startup script, and waited..., and kept waiting...! The script didn't finish. At all.

Any clue yet?

No? Let me explain.

What - after some debugging - happened to be the case, was that this code in ChanServ was being invoked to import some rather large log files into the database. Now take a look at the code again: for every line in the log file, it appends a small string to update. This wouldn't be bad, if appending to a string in Java and .NET wasn't O(n), with n the length of the total string. In other words, the algorithm above as a whole was O(n2) instead of O(n) with respect to the number of lines in the file! (For example, this means that if importing 1000 lines takes 1 second, then importing 1 million lines takes, no, not half an hour, but a whopping 278 hour!)

It was trivial to rewrite the above code using StringBuilder. Afterwards, ChanServ started up and imported the log files into the database in a few minutes only.

StringBuilder update = new StringBuilder();

while ((line = in.readLine()) != null) {

// ...

if (lineCount == 0) {
update.append( "INSERT INTO `" + name + "` (stamp, line) values (?, ?)" );
} else {
update.append( ","+ Misc.EOL + "(?, ?)" );
}

// ...
}

All in all, I want to urge you as C# or Java developer (or any other language with immutable strings) to keep in mind the algorithmic complexity of string concatenation. In simpler terms, I want you to use StringBuilder / StringBuffer whenever you are concatenating more then a few hundred characters into a string, to prevent nasty suprises like the one above.

2 comments:

  1. This is interesting for Spring modders too: Lua has immutable strings. You can use table.concat instead of multiple concatenations, in your scripts.

    ReplyDelete
  2. Yeah, I was wondering whether to include that explicitly or not. In the end I decided not to include it because the post had become fairly long already. Maybe something for another post. Thanks for reading btw :-)

    ReplyDelete