How to Parse Words in PHP

Parsing words in natural human language is not as simple as it appears. Simply calling explode(" ", $text) won’t work as well as you’d hope on most corpora.

For example, consider the string "Hello PHP!". Splitting on spaces gives two tokens, "Hello" and "PHP!", but "PHP!" is not a word; "PHP" is. You might think we can simply trim the punctuation, but that doesn’t work the way you’d expect either. Consider the string "There are many A.I. programs for pattern recognition.", where "A.I." is a valid word that contains punctuation: the periods are part of the word itself, because it is an abbreviation. This kind of complexity goes on and on, with many rules and exceptions governing how to extract valid words from a text.
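To make the problem concrete, here is a minimal sketch of the two naive approaches described above. The helper name naiveWords() and the list of characters passed to trim() are illustrative assumptions, not part of any library.

```php
<?php
// Naive approach 1: split on single spaces.
$naive = explode(" ", "Hello PHP!");
print_r($naive); // the "!" stays glued to "PHP"

// Naive approach 2: split on spaces, then trim punctuation from both ends.
// naiveWords() is a hypothetical helper; the trim() character list is an
// arbitrary illustrative choice.
function naiveWords(string $text): array
{
    $words = [];
    foreach (explode(" ", $text) as $token) {
        $token = trim($token, ".,!?;:\"'()");
        if ($token !== "") {
            $words[] = $token;
        }
    }
    return $words;
}

print_r(naiveWords("Hello PHP!"));
print_r(naiveWords("There are many A.I. programs for pattern recognition."));
```

Trimming fixes "PHP!" but turns "A.I." into "A.I", because the trailing period that belongs to the abbreviation is removed along with ordinary sentence punctuation.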

So how do we do it right?

Well, it turns out there is a standard that addresses this problem for several languages: Unicode Standard Annex #29, "Unicode Text Segmentation", or UAX #29 for short. It is already implemented in ICU (libicu), which PHP exposes through the Intl extension. So we can use it in PHP like this:

<?php
$text = "English is a West Germanic language that was first spoken in early medieval England and eventually became a global lingua franca. It is named after the Angles, one of the Germanic tribes that migrated to the area of Great Britain that later took their name, as England. Both names derive from Anglia, a peninsula in the Baltic Sea. The language is closely related to Frisian and Low Saxon, and its vocabulary has been significantly influenced by other Germanic languages, particularly Norse (a North Germanic language), and to a greater extent by Latin and French.

  English has developed over the course of more than 1,400 years. The earliest forms of English, a group of West Germanic (Ingvaeonic) dialects brought to Great Britain by Anglo-Saxon settlers in the 5th century, are collectively called Old English. Middle English began in the late 11th century with the Norman conquest of England; this was a period in which the language was influenced by French.";

$it = IntlBreakIterator::createWordInstance("en_US");
$it->setText($text);
$parts = $it->getPartsIterator();

// Unicode general categories that count as punctuation.
$punctuation = [
    IntlChar::CHAR_CATEGORY_DASH_PUNCTUATION,
    IntlChar::CHAR_CATEGORY_START_PUNCTUATION,
    IntlChar::CHAR_CATEGORY_END_PUNCTUATION,
    IntlChar::CHAR_CATEGORY_CONNECTOR_PUNCTUATION,
    IntlChar::CHAR_CATEGORY_OTHER_PUNCTUATION,
    IntlChar::CHAR_CATEGORY_INITIAL_PUNCTUATION,
    IntlChar::CHAR_CATEGORY_FINAL_PUNCTUATION,
];
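foreach ($parts as $word) {
    // Skip runs of ASCII whitespace; depending on the ICU version,
    // consecutive spaces may come back as a single segment.
    if (trim($word) === '') {
        continue;
    }
    // Skip segments that are a single punctuation or Unicode whitespace character.
    if (mb_strlen($word) === 1 &&
        (in_array(IntlChar::charType($word), $punctuation, true) || IntlChar::isUWhiteSpace($word)))
    {
        continue;
    }
    echo "$word\n";
}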

Notice how it handles numbers correctly, keeping 1,400 as a single token, yet still splits a string like "There,are" into the two separate words "There" and "are". Your browser uses similar rules to figure out which characters to highlight when you double-click on some text: it selects the sequence of characters based on the same kind of rules that the IntlRuleBasedBreakIterator uses to identify word instances.
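The loop above filters out punctuation and whitespace by inspecting character categories by hand. As an alternative sketch, the break iterator can report what kind of segment it just produced through getRuleStatus(). This assumes that createWordInstance() returns an IntlRuleBasedBreakIterator, which is where getRuleStatus() is defined; extractWords() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: return only the word-like segments of $text.
function extractWords(string $text, string $locale = "en_US"): array
{
    $it = IntlBreakIterator::createWordInstance($locale);
    if (!$it instanceof IntlRuleBasedBreakIterator) {
        // Assumption: word instances are rule-based; bail out otherwise.
        throw new RuntimeException("Rule status is not available for this iterator.");
    }
    $it->setText($text);

    $words = [];
    foreach ($it->getPartsIterator() as $part) {
        // The rule status describes the segment ending at the boundary just reached.
        // Values in [WORD_NONE, WORD_NONE_LIMIT) mark spaces and punctuation.
        $status = $it->getRuleStatus();
        if ($status >= IntlBreakIterator::WORD_NONE
            && $status < IntlBreakIterator::WORD_NONE_LIMIT) {
            continue;
        }
        $words[] = $part;
    }
    return $words;
}

print_r(extractWords("There,are more than 1,400 years."));
```

With this approach the segments are filtered by the iterator's own classification, so the list of punctuation categories is not needed, and numbers such as 1,400 fall under the WORD_NUMBER status range rather than WORD_NONE.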