Edit

Share via


Configure and manage word breakers and stemmers for search (SQL Server)

Applies to: SQL Server Azure SQL Database

Word breakers and stemmers perform linguistic analysis on all full-text indexed data. Linguistic analysis does the following two things:

  • Find word boundaries (word-breaking). The word breaker identifies individual words by determining where word boundaries exist based on the lexical rules of the language. Each word (also known as a token) is inserted into the full-text index using a compressed representation to reduce its size.

  • Conjugate verbs (stemming). The stemmer generates inflectional forms of a particular word based on the rules of that language (for example, "running", "ran", and "runner" are various forms of the word "run").

Word breakers and stemmers are language specific

Word breakers and stemmers are language specific, and the rules for linguistic analysis differ for different languages. Language-specific word breakers make the resulting terms more accurate for that language.

To use the word breakers and stemmers provided for all the languages supported by SQL Server, you typically don't have to take any action.

  • Where there's a word breaker for the language family, but not for the specific sublanguage, the major language is used. For example, the French word breaker is used to handle text that is French Canadian.

  • If no word breaker is available for a particular language, the neutral word breaker is used. With the neutral word breaker, words are broken at neutral characters such as spaces and punctuation marks.

Get the list of supported languages

To see the list of languages supported by SQL Server Full-Text Search, use the following Transact-SQL statement. The presence of a language in this list indicates that word breakers are registered for the language.

SELECT *
FROM sys.fulltext_languages;

Get the list of registered word breakers

For Full-Text Search to use the word breakers for a language, they must be registered. For registered word breakers, associated linguistic resources - stemmers, noise words (stopwords), and thesaurus files - also become available to full-text indexing and querying operations.

To see the list of registered word breaker components, use the following statement.

EXECUTE sp_help_fulltext_system_components 'wordbreaker';
GO

For more information, see sp_help_fulltext_system_components.

If you add or remove a word breaker

If you add, remove, or alter a word breaker, you need to refresh the list of Microsoft Windows locale identifiers (LCIDs) that are supported for full-text indexing and querying. For more information, see View or change registered filters and word breakers.

Set the default full-text language option

For a localized version of SQL Server, SQL Server Setup sets the default full-text language option to the language of the server if an appropriate match exists. For a non-localized version of SQL Server, the default full-text language option is English.

When you create or alter a full-text index, you can specify a different language for each full-text indexed column. If no language is specified for a column, the default is the value of the configuration option default full-text language.

Note

All columns listed in a single full-text query function clause must use the same language, unless the LANGUAGE option is specified in the query. The language used for the full-text indexed column being queried determines the linguistic analysis performed on arguments of the full-text query predicates (CONTAINS and FREETEXT) and functions (CONTAINSTABLE and FREETEXTTABLE).

Choose the language for an indexed column

When creating a full-text index, we recommend that you specify a language for each indexed column. If a language isn't specified for a column, the system default language is used. The language of a column determines which word breaker and stemmer are used for indexing that column. Also, the thesaurus file of that language is used by full-text queries on the column.

There are a couple of things to consider when choosing the column language for creating a full-text index. These considerations relate to how your text is tokenized and then indexed by Full-Text Engine. For more information, see Choose a Language When Creating a Full-Text Index.

To view the word breaker language of specific columns, run the following statement.

SELECT language_id AS 'LCID'
FROM sys.fulltext_index_columns;

For more information, see sys.fulltext_index_columns.

Troubleshoot word-breaking timeout errors

A word-breaking timeout error can occur in various situations. For information about these situations and how to respond in each situation, see MSSQLSERVER_30053.

Info about the MSSQLSERVER_30053 error

Property Value
Product Name SQL Server
Event ID 30053
Event Source MSSQLSERVER
Component SQLEngine
Symbolic Name FTXT_QUERY_E_WORDBREAKINGTIMEOUT
Message Text Word breaking timed out for the full-text query string. This can happen if the wordbreaker took a long time to process the full-text query string, or if a large number of queries are running on the server. Try running the query again under a lighter load.

Explanation

A word-breaking timeout error can occur in the following situations:

  • The word breaker for the query language is configured incorrectly; for example, its registry settings are incorrect.

  • The word breaker malfunctions for a specific query string.

  • The word breaker returns too much data for a specific query string. Excess data is treated as a potential buffer overrun attack, and shuts down the filter daemon process (fdhost.exe), which hosts the word-breaking services.

  • The filter daemon process configuration is incorrect.

    The most common configuration problems are password expiration or a ___domain policy that prevents the filter daemon account from logging on.

  • A very heavy query workload is running on the server instance; for example, the word-breaker took a long time to process the full-text query string, or a large number of queries are running on the server. This is the least likely cause.

User action

Select the user action that is appropriate to the probable cause of the timeout, as follows:

Probable cause User action
The word breaker for the query language is configured incorrectly. If you're using a third-party word breaker, it might be incorrectly registered with the operating system. In this case, re-register the word breaker. For more information, see Revert word breakers used by Search to previous version (SQL Server Search).
The word breaker malfunctions for a specific query string. If the word breaker is supported by SQL Server, contact Microsoft Customer Service and Support.
The word breaker returns too much data for a specific query string. If the word breaker is supported by SQL Server, contact Microsoft Customer Service and Support.
The filter daemon process configuration is incorrect. Ensure that you're using the current password and that a ___domain policy isn't preventing the filter daemon account from logging on.
A very heavy query workload is running on the server instance. Try running the query again under a lighter load.

Understand the impact of updated word breakers

Each version of SQL Server typically includes new word breakers that have better linguistic rules and are more accurate than earlier word breakers. Potentially, the new word breakers might behave differently from the word breakers in full-text indexes that were imported from previous versions of SQL Server.

This is significant if a full-text catalog was imported when a database was upgraded to the current version of SQL Server. One or more languages used by the full-text indexes in the full-text catalog might now be associated with new word breakers. For more information, see Upgrade Full-Text Search.