Regex Fun - Get a substring by length, breaking on word boundary.

Yesterday I had a discussion wherein regular expressions were (sort of) jokingly referred to as "magic". Well, today I had cause to write a clever(ish) one, and I thought I would lay it out real quick.

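What I needed to do was break a string somewhere between 200 and 250 characters, but the break had to land on a word boundary. So it could run past 200 if needed, as long as it stopped on a word boundary. (The pattern below only enforces the 200 minimum; in ordinary prose the next word boundary is never far away.)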

Now, I know you could do this with a couple of other functions mixed together, but I'm going to do it with preg_replace. No idea how this compares on performance. Poorly for large strings, I would guess.
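For comparison, here is a rough sketch of the "couple of other functions" route, using strlen, strpos and substr. The function name cut_at_space is made up for this example, and it is not quite equivalent to the regex: it only breaks on a literal space, not on every word boundary.

```php
<?php
// Hypothetical helper, not from the original post.
// Cuts $text at the first space found at or after byte offset $min.
function cut_at_space(string $text, int $min): string
{
    // Nothing to cut if the text is not longer than the minimum.
    if (strlen($text) <= $min) {
        return $text;
    }

    // Look for the first space at or after $min.
    $pos = strpos($text, ' ', $min);

    // No space left: keep the whole string.
    return $pos === false ? $text : substr($text, 0, $pos);
}

echo cut_at_space("This is a test string.", 11), "\n"; // This is a test
```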

I'll ruin the surprise and show you the answer first:

preg_replace( "/(.{200,}?\b).*/s", '\1', $string );

So let's break this down.


Starting from the left: the ( signifies that I am opening a subpattern. Whatever matches inside these parentheses is the content I want to save.

The dot, ., is a wildcard that matches any character.

The string {200,} is a repetition marker, allowing the previous item (the dot, so, anything, remember) to repeat 200 or more times.

The question mark, ?, is really important here. By default repetition is "greedy"; that is, it will suck up as many characters as it is allowed, with precedence being given out left to right. By putting the ? here we make {200,} un-greedy, so it only grabs as much as required for the match to work.
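If you want to see the difference in isolation, here is a small sketch using preg_match instead of preg_replace. The variable names are just for illustration.

```php
<?php
$subject = "This is a test string.";

// Greedy: grab as much as possible, then back off to the last word boundary.
preg_match('/.{3,}\b/', $subject, $greedy);

// Un-greedy: grab the minimum, then grow one character at a time
// until a word boundary is reached.
preg_match('/.{3,}?\b/', $subject, $lazy);

echo $greedy[0], "\n"; // This is a test string
echo $lazy[0], "\n";   // This
```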

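The \b is an escape sequence for a word boundary. Handy! It matches a position rather than a character: the spot between a word character (letter, digit or underscore) and anything else. That means it matches at the start of a word as well as the end, which is why the first example further down ends with a trailing space.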
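A quick way to see where those boundaries actually fall is to replace every \b with a visible marker. This is only a throwaway sketch; the | marker and the variable name are my own choice.

```php
<?php
// \b is zero-width, so replacing it inserts the marker
// without removing any of the original characters.
$marked = preg_replace('/\b/', '|', "This is a test.");

echo $marked, "\n"; // |This| |is| |a| |test|.
```

Notice there is no marker after the final period: a period followed by the end of the string is not a word boundary.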

After that we close the subpattern with ).

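Next we have .* which is a greedy repetition matching any character. Since this is greedy and {200,} is not, it will pick up any of the slack, meaning the whole rest of the string. Because the replacement is just \1, everything .* matched gets thrown away.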

The last little bit is the s after the pattern delimiter. This is a modifier that tells the dot character to match anything, including newlines.
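To see why the s matters, try the pattern on a string with a newline in it. A small sketch, with a made-up two-line string:

```php
<?php
$text = "one two\nthree four";

// Without s, .* stops at the newline, so the second line is still there
// for preg_replace to match (and trim) separately.
$withoutS = preg_replace('/(.{2,}?\b).*/', '\1', $text);

// With s, .* runs across the newline and swallows the rest of the string.
$withS = preg_replace('/(.{2,}?\b).*/s', '\1', $text);

echo $withoutS, "\n"; // prints "one", a newline, then "three"
echo $withS, "\n";    // one
```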


So, to recap, we say...

Give me any character, 200 or more times, ungreedily, followed by a word boundary, and then collect any following characters greedily. Then replace the whole lot with just the first part.


//  T  h  i  s     i  s     a     t  e  s  t     s  t  r  i  n  g  .
// 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21
$string = "This is a test string.";
echo preg_replace( "/(.{5,}?\b).*/s", '\1', $string );
// "This " (with a trailing space: the boundary at position 5 is the start of "is")
echo preg_replace( "/(.{7,}?\b).*/s", '\1', $string );
// This is
echo preg_replace( "/(.{9,}?\b).*/s", '\1', $string );
// This is a
echo preg_replace( "/(.{11,}?\b).*/s", '\1', $string );
// This is a test
echo preg_replace( "/(.{25,}?\b).*/s", '\1', $string );
// This is a test string.
// (the string is only 22 characters long, so nothing matches and it comes back untouched)

// I'm removing that ? here, see what it does to it...
echo preg_replace( "/(.{3,}\b).*/s", '\1', $string );
// This is a test string
// It made {3,} gobble everything up to the last word boundary. Greedy bugger!
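
If you end up using this in more than one place, it could be wrapped up along these lines. This is only a sketch: the function name and signature are invented for illustration, and it uses preg_match to pull out the first part directly instead of replacing the tail. Like the pattern above, it counts bytes, not multibyte characters.

```php
<?php
// Hypothetical helper, not from the original post.
function truncate_at_word_boundary(string $text, int $min): string
{
    // Lazily take at least $min characters, then stop at the next word boundary.
    $pattern = '/^.{' . max(0, $min) . ',}?\b/s';

    // No match means the text is shorter than $min (or has no boundary
    // past that point), so hand it back unchanged.
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : $text;
}

echo truncate_at_word_boundary("This is a test string.", 11), "\n"; // This is a test
echo truncate_at_word_boundary("Short.", 200), "\n";                // Short.
```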

Got it?