I need to learn more: RegEx

I’m currently working on a new feature for Facebook Stories which involves using the Facebook Graph API to search content on Facebook. The results of a search are going to be displayed, with the search query highlighted in the results using the HTML5 <mark> tag.

My first thought was “easy, search the returned text for the search term and wrap in a mark tag, no problem”. The first way I thought of doing this was to break the string up into words, in an array, check each item of the array to see if it is the search term or not and rebuild the string.

Something like:

function mark_search_term(result_string, search_term) {
    var words = result_string.split(" "),
        term = search_term.toLowerCase(),
        newStr = "";

    $.each(words, function (index, word) {
        if (word === "") {
            // Consecutive spaces produce empty items; skip them
            return;
        }
        if (word.toLowerCase().indexOf(term) === 0) {
            // The word starts with the search term, so wrap it
            newStr = newStr + " <mark>" + word + "</mark>";
        } else {
            newStr = newStr + " " + word;
        }
    });

    return $.trim(newStr);
}

Great! Except if the search term was more than one word…

So then I played around with indexOf. This allows you to search for a string within a string, so naturally I thought I’d cracked it. A couple of substring() calls to add the opening and closing tags and I’d be sorted. Something like:

function mark_search_term(result_string, search_term) {
    var str = result_string,
        position = str.toLowerCase().indexOf(search_term.toLowerCase());

    if (position > -1) {
        // Insert the closing tag first so that position still points at the start of the match
        str = str.substring(0, position + search_term.length) + "</mark>" + str.substring(position + search_term.length);
        str = str.substring(0, position) + "<mark>" + str.substring(position);
    }

    return str;
}

Great! Except this highlights only the first instance of the search term in the string. And worse, if you search for “the” it often puts the highlight in the middle of a word (oTHEr, togeTHEr etc.). I thought about making it recursive and looking for spaces and special characters around the search term and then I thought…

Wait, why am I not using replace()? It accepts a RegEx and you can pass in the “/g” flag to make it find ALL instances. At this point it’s worth repeating the title of this article: I need to learn more RegEx. And I need to recognise earlier when I should be using one rather than trying to dream up other solutions. The problem with RegEx for me is that they quite often end up looking like Matrix-style machine code, eg: ]*>(.*?) and I don’t know what all the symbols mean. It’s worse than reading Luis’ Ruby on Rails code…

So I started with:

var re = new RegExp(search_term,"ig");
var str = result_string.replace(re,"<mark>"+search_term+"</mark>");

This uses the i and g flags to do a case-insensitive search and to globally search the string rather than stopping at the first instance. This highlighted multiple instances of the search term, though it changed the case to match that of the search term and it still highlighted in the middle of words. But I knew I was onto something; I just needed to add more ASCII characters. I contemplated guessing, maybe new RegExp("\\{]!%^"+search_term+"%^&*","ig") would work. Maybe if I added some emoticons.
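To make the case problem concrete, here is a minimal sketch in plain JavaScript (no jQuery). The function names highlight_with_term and highlight_with_match are made up for illustration. The second version uses the $& replacement pattern, which inserts the text that was actually matched, so the casing of the result survives. Neither version deals with matches in the middle of a word.

```javascript
// Hypothetical helper: replaces every match with the search term itself,
// so the casing of the result text is lost.
function highlight_with_term(result_string, search_term) {
  const re = new RegExp(search_term, "ig");
  return result_string.replace(re, "<mark>" + search_term + "</mark>");
}

// Hypothetical helper: "$&" in the replacement string stands for the
// matched text, so the original casing is kept.
function highlight_with_match(result_string, search_term) {
  const re = new RegExp(search_term, "ig");
  return result_string.replace(re, "<mark>$&</mark>");
}

console.log(highlight_with_term("The other day", "the"));
// <mark>the</mark> o<mark>the</mark>r day
console.log(highlight_with_match("The other day", "the"));
// <mark>The</mark> o<mark>the</mark>r day
```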

So then, with a RegEx generator and the MDN docs open, I pieced together this:

var re = new RegExp("\\b(" + search_term + ")\\b", "ig");
var str = result_string.replace(re, "<mark>$1</mark>");

Now we match the search term only when there is a word boundary (\b) on both sides of it. This finds the search term at the start of, or in the middle of, a sentence in a string, and it doesn’t find the search term in the middle of a word. Bingo!

Then we use a backreference to build the replacement term. The parentheses in the RegEx capture whatever text was actually matched, and $1 in the replacement string refers to that first captured group, so the highlighted text keeps its original case.
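One thing the snippet above glosses over is that the search term goes into the RegExp constructor untouched, so a term containing characters such as . or ( would be read as RegEx syntax. Below is a minimal sketch of one way to guard against that, in plain JavaScript. The names escape_regex and mark_search_term_safe are hypothetical, and the escaping pattern is the commonly used one from the MDN regular expressions guide rather than anything built in. It puts a \b word boundary on each side of the escaped term, which assumes the term starts and ends with a letter, digit or underscore.

```javascript
// Hypothetical helper: backslash-escape characters that have a special
// meaning in a RegEx so the term is matched literally.
function escape_regex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical variant of mark_search_term that escapes the term first
// and only matches it as a whole word (or whole phrase).
function mark_search_term_safe(result_string, search_term) {
  const re = new RegExp("\\b(" + escape_regex(search_term) + ")\\b", "ig");
  return result_string.replace(re, "<mark>$1</mark>");
}

console.log(mark_search_term_safe("The other day the cat sat together", "the"));
// <mark>The</mark> other day <mark>the</mark> cat sat together
console.log(mark_search_term_safe("Use node.js, not nodexjs", "node.js"));
// Use <mark>node.js</mark>, not nodexjs
```

Without the escaping step, the . in "node.js" would match any character, so "nodexjs" would be highlighted as well.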

So now I have what I want and it works great. I won’t claim to fully understand RegEx and I can’t be sure I’ll spot when to use them sooner in the future, but I am happy this one worked out.