Big Data

LumisXP has a Big Data framework whose repository is customizable. Support for Elasticsearch is included natively, and connectors for other types of repositories can be implemented and integrated into LumisXP. For information on how to install Elasticsearch, see Installing Elasticsearch.

Features

Each item stored in Big Data is called a document. A document can have several fields, each with its value.

A Java API is available to access the Big Data repository. Through it, documents can be added, removed, and searched. For more information about this API, see the javadoc for lumis.portal.bigdata.

To reindex the data stored in the Big Data repository, go to Settings > Frameworks > Big Data. Reindexing is usually necessary when the stored data has become outdated, for example after the data structure of a service has changed.

The default ProcessActionHandlers of DOUI and of the Content framework update the corresponding documents in the Big Data repository every time an addition, update, or deletion occurs.

Customizations

For each DOUI or Content service, the big data persister used to index its data can be configured at the source level. In addition, each field of the source can specify settings for how that field is stored in big data; these settings are used by the default persister.

The default indexing behavior can also be customized in a custom data type. By default, each field follows the implementation of its respective data type.

Synonyms

The synonyms registered in the "Synonyms for search" service are applied only to the user query of a SearchQuery; they are not applied generically to filters on specific fields.

In the native Big Data repository based on Elasticsearch, synonyms function as described below.

In the default analyzer used by the portal, the synonym words go through the stemmer token filter. Thus, in Portuguese, for example, the words bolo (cake) and bola (ball) are reduced to the same token (bol), since the stemmer removes both the final o of bolo and the final a of bola. As a result, if a synonym is registered with the words bolo and doce (sweet) and the user searches for bola, they will find content containing the word doce.
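
The interaction between stemming and synonyms can be illustrated with a minimal Java sketch. It uses a toy stemmer that only strips one trailing vowel, applied to the Portuguese words bolo (cake), bola (ball) and doce (sweet). This is not the stemmer or synonym filter used by Elasticsearch or by the portal's analyzer, and the class and method names are hypothetical; the sketch only shows why a query term can match a synonym group through a shared stem.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy illustration only: NOT the stemmer or synonym filter used by Elasticsearch. */
final class StemmedSynonymsSketch {

    /** Hypothetical stemmer: lowercases the word and strips one trailing vowel. */
    static String stem(String word) {
        String w = word.toLowerCase();
        if (w.length() > 3 && "aeiou".indexOf(w.charAt(w.length() - 1)) >= 0) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    /** Expands a query term to the stemmed tokens of every synonym group that contains its stem. */
    static Set<String> expand(String queryTerm, List<List<String>> synonymGroups) {
        String token = stem(queryTerm);
        Set<String> result = new HashSet<>();
        result.add(token);
        for (List<String> group : synonymGroups) {
            Set<String> stemmedGroup = new HashSet<>();
            for (String word : group) {
                stemmedGroup.add(stem(word)); // the synonym words are stemmed as well
            }
            if (stemmedGroup.contains(token)) {
                result.addAll(stemmedGroup);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // One registered synonym group: bolo <-> doce
        List<List<String>> groups = List.of(List.of("bolo", "doce"));
        // A search for "bola" shares the stem of "bolo", so it also reaches "doce"
        System.out.println(expand("bola", groups));
    }
}
```

Because both the registered synonyms and the query term are stemmed, the match happens on stems, not on the words as typed.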

The process of using synonyms by the portal is:

1. A publisher registers new synonyms.
2. The portal writes synonym files in Solr format (one for each language registered in the portal) to <lumisdata>/shared/data/elasticsearch/lumis-analysis/synonyms-<locale>.txt, where <locale> is the language of the index (en_US for English, for example). The folder <lumisdata>/shared/data/elasticsearch/lumis-analysis must be mapped into the config folder of each server in the Elasticsearch cluster used by the portal (see the synonyms_path property in the documentation), so that the files written by the portal are available at <config>/lumis-analysis/synonyms-<locale>.txt on the Elasticsearch servers. At this point, the new synonyms are not yet available for search.
3. A background process detects that the registered synonyms have changed.
4. So that Elasticsearch correctly reloads the changed synonyms, this process reloads the analyzers of the affected indices.

When an index is created by the portal through this Big Data API, it is already configured to use the synonym file written by the portal in every search.
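
As a rough illustration of the file-writing step above, the following Java sketch builds the synonym file path for a locale and writes synonym groups in the Solr format, using the form in which each line lists equivalent terms separated by commas. The class and method names are hypothetical and this is not the portal's implementation; only the path layout comes from the text above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Sketch only: hypothetical helper, not LumisXP code. */
final class SynonymFileSketch {

    /** Builds <lumisdata>/shared/data/elasticsearch/lumis-analysis/synonyms-<locale>.txt. */
    static Path synonymFile(Path lumisData, String locale) {
        return lumisData.resolve("shared").resolve("data").resolve("elasticsearch")
                .resolve("lumis-analysis").resolve("synonyms-" + locale + ".txt");
    }

    /** Formats each group as one Solr-style line of comma-separated equivalent terms. */
    static List<String> toSolrLines(List<List<String>> groups) {
        List<String> lines = new ArrayList<>();
        for (List<String> group : groups) {
            lines.add(String.join(", ", group));
        }
        return lines;
    }

    /** Writes the groups to the synonym file of the given locale and returns its path. */
    static Path write(Path lumisData, String locale, List<List<String>> groups) throws IOException {
        Path file = synonymFile(lumisData, locale);
        Files.createDirectories(file.getParent());
        return Files.write(file, toSolrLines(groups), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory stands in for <lumisdata> in this demonstration
        Path lumisData = Files.createTempDirectory("lumisdata");
        Path file = write(lumisData, "en_US", List.of(List.of("cake", "sweet")));
        System.out.println(file + " -> " + Files.readAllLines(file));
    }
}
```
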
Note

The Big Data field lum_all (used in searches) requires the following two analyzers to exist in the created indices: lum_all_search_analyzer (used in the search phase) and lum_all_index_analyzer (used in the indexing phase).
LumisXP creates these two analyzers automatically when the language of the index being created is one of the following:
- English (language code en_US)
- Spanish (language code es_ES)
If the portal uses a language other than these (fr_FR, for example) or another variant of the same language (en_GB or es_MX, for example), the solution must use index templates to ensure that the required analyzers are created whenever an index in that language is added.
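
As a starting point for such a template, the Java sketch below produces a JSON body for an Elasticsearch composable index template that defines the two required analyzers for fr_FR. Several things here are assumptions rather than LumisXP facts: the index pattern is hypothetical and must match the names of the indices the portal actually creates, the filter names lum_stemmer and lum_synonyms are made up, and the analyzer composition (standard tokenizer, lowercase, stemmer, synonyms) is only one plausible choice, not the portal's native definition. Composable templates and the updateable synonym option also require a sufficiently recent Elasticsearch 7.x release.

```java
/** Sketch only: the template content is an assumption, not the portal's native analyzer definition. */
final class AnalyzerTemplateSketch {

    /**
     * Returns a JSON body for an Elasticsearch composable index template.
     * indexPattern is hypothetical: use the pattern that matches the portal's index names.
     */
    static String templateBody(String indexPattern, String locale, String stemmerLanguage) {
        return """
            {
              "index_patterns": ["%1$s"],
              "template": {
                "settings": {
                  "analysis": {
                    "filter": {
                      "lum_stemmer": { "type": "stemmer", "language": "%3$s" },
                      "lum_synonyms": {
                        "type": "synonym_graph",
                        "synonyms_path": "lumis-analysis/synonyms-%2$s.txt",
                        "updateable": true
                      }
                    },
                    "analyzer": {
                      "lum_all_index_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "lum_stemmer"]
                      },
                      "lum_all_search_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "lum_stemmer", "lum_synonyms"]
                      }
                    }
                  }
                }
              }
            }
            """.formatted(indexPattern, locale, stemmerLanguage);
    }

    public static void main(String[] args) {
        // "my-portal-fr-*" is a placeholder pattern used only for this demonstration
        System.out.println(templateBody("my-portal-fr-*", "fr_FR", "light_french"));
    }
}
```

The synonym filter is placed only in the search analyzer because Elasticsearch allows updateable synonym filters in search-time analyzers only, and it comes after the stemmer so that the synonym words themselves are stemmed.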

Content Popularity

By default, LumisXP calculates the popularity of contents once a day, in the early morning.

A class that implements the lumis.portal.bigdata.IDocumentPopularityProvider interface can be used to customize this calculation.
See the documentation for the methods getDocumentPopularities and getDocumentPopularity for details on how to customize the popularities.

The portal uses, by default, the lumis.portal.bigdata.StandardDocumentPopularityProvider implementation for calculating popularities.
The default implementation has the following configurations:

Maximum number of affected contents

This is the maximum number of contents whose popularity is changed by the calculation process. All other contents have a popularity of 1 (which does not influence search relevance).
Default value: 1000.

Unique visitor window

This setting determines the period (in hours) taken into account when calculating which contents have the most unique visitors. That is, when the portal determines which contents had the most unique accesses, it considers only the accesses made between the current moment minus this number of hours and the current moment.
Default: 1440 hours (the equivalent of 60 days).
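
The window described above can be sketched in Java as follows. The class and method names are hypothetical and the portal's actual query is not shown in this documentation; the sketch only shows how the window start is derived from the setting and that 1440 hours correspond to 60 days.

```java
import java.time.Duration;
import java.time.Instant;

/** Sketch only: hypothetical helper, not LumisXP code. */
final class VisitorWindowSketch {

    /** Default window from the documentation: 1440 hours (60 days). */
    static final long DEFAULT_WINDOW_HOURS = 1440;

    /** Start of the window: the current moment minus the configured number of hours. */
    static Instant windowStart(Instant now, long windowHours) {
        return now.minus(Duration.ofHours(windowHours));
    }

    /** True when the access happened between the window start and the current moment. */
    static boolean inWindow(Instant access, Instant now, long windowHours) {
        return !access.isBefore(windowStart(now, windowHours)) && !access.isAfter(now);
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        System.out.println("Window starts at " + windowStart(now, DEFAULT_WINDOW_HOURS));
        System.out.println("Days covered: " + Duration.ofHours(DEFAULT_WINDOW_HOURS).toDays());
    }
}
```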

Unique visitor limit

A configurable limit on the number of unique visitors of a content. Between 0 and the Unique visitor limit, the greater the number of unique visitors of a content, the greater its influence on popularity. Once the Unique visitor limit is reached, the influence on the content's popularity is at its maximum and stays at that maximum for any larger number of visitors.
If this parameter is not set, its default value is calculated as follows:

The portal runs a query to obtain the total number of unique visitors within the period configured in the Unique visitor window.
The default value of the Unique visitor limit is 50% of that total, rounded up to the next integer. For example, if the portal had 100 unique visitors during the period, the Unique visitor limit defaults to 50. If the total is 33, 50% is 16.5, which is rounded up to 17.
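
The default calculation above can be written as a short Java sketch. The class and method names are hypothetical; only the 50% rule, the rounding up, and the saturation at the limit come from the text.

```java
/** Sketch only: hypothetical helper, not LumisXP code. */
final class UniqueVisitorLimitSketch {

    /** 50% of the total unique visitors, rounded up (valid for non-negative totals). */
    static long defaultLimit(long totalUniqueVisitors) {
        return (totalUniqueVisitors + 1) / 2;
    }

    /** Visitor count that still influences popularity: values above the limit saturate at the limit. */
    static long effectiveVisitors(long visitors, long limit) {
        return Math.max(0, Math.min(visitors, limit));
    }

    public static void main(String[] args) {
        System.out.println(defaultLimit(100)); // 50
        System.out.println(defaultLimit(33));  // 17
        System.out.println(effectiveVisitors(80, defaultLimit(100))); // 50
    }
}
```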

Age limit

A configurable age limit for a content (in hours). Between 0 (the content has just been published) and the Age limit, the younger the content, the greater its influence on popularity. From the Age limit onward, the age of the content no longer influences its popularity.
Default: 720 hours (the equivalent of 30 days).
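
The effect of the age limit can be illustrated with the Java sketch below. The linear shape of the freshness factor is an assumption made only for illustration, since the text states just that younger content has more influence and that the influence stops at the limit. The class and method names are hypothetical.

```java
/** Sketch only: the linear shape is an assumption, not the portal's formula. */
final class AgeLimitSketch {

    /** Default age limit from the documentation: 720 hours (30 days). */
    static final long DEFAULT_AGE_LIMIT_HOURS = 720;

    /**
     * Illustrative freshness factor: 1.0 for content just published,
     * decreasing to 0.0 at the age limit and staying at 0.0 afterwards.
     */
    static double freshness(long ageHours, long ageLimitHours) {
        long clamped = Math.max(0, Math.min(ageHours, ageLimitHours));
        return 1.0 - (double) clamped / ageLimitHours;
    }

    public static void main(String[] args) {
        System.out.println(freshness(0, DEFAULT_AGE_LIMIT_HOURS));    // 1.0
        System.out.println(freshness(360, DEFAULT_AGE_LIMIT_HOURS));  // 0.5
        System.out.println(freshness(2000, DEFAULT_AGE_LIMIT_HOURS)); // 0.0
    }
}
```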

This implementation works as follows:

1. Obtains the most recently registered contents, limited to the Maximum number of affected contents.
2. Obtains the contents that had the most unique visitors during the Unique visitor window, limited to the Maximum number of affected contents.
3. Applies the following formula to each of these contents to determine its popularity:

   [Image: Popularity formula]

   In this formula, the popularity of a content can vary from 1 to 25.
4. Sorts these contents in descending order of popularity.
5. Keeps only the first Maximum number of affected contents entries of this list.
6. Resets the popularity of all other contents in the Big Data repository.
7. Sets the popularity of the contents kept in the list in the Big Data repository.
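
The selection steps above can be sketched in Java as follows. The popularity formula itself is not reproduced in this text, so the sketch takes the scoring function as a parameter instead of guessing it. The class and method names are hypothetical, contents are represented only by string identifiers, and the repository updates are reduced to an in-memory map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.ToDoubleFunction;

/** Sketch only: hypothetical helper, not the StandardDocumentPopularityProvider code. */
final class PopularityPipelineSketch {

    /** Popularity of contents not selected by the process (no influence on relevance). */
    static final double NEUTRAL_POPULARITY = 1.0;

    /** Returns the contents that keep a calculated popularity, in descending order of popularity. */
    static Map<String, Double> select(List<String> mostRecent, List<String> mostVisited,
                                      int maxAffected, ToDoubleFunction<String> score) {
        // Candidate contents: both lists, each capped at maxAffected, without duplicates
        Set<String> candidates = new LinkedHashSet<>();
        candidates.addAll(mostRecent.subList(0, Math.min(maxAffected, mostRecent.size())));
        candidates.addAll(mostVisited.subList(0, Math.min(maxAffected, mostVisited.size())));

        // Apply the (externally supplied) popularity formula to each candidate
        Map<String, Double> scored = new LinkedHashMap<>();
        for (String id : candidates) {
            scored.put(id, score.applyAsDouble(id));
        }

        // Sort in descending order of popularity
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(scored.entrySet());
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());

        // Keep only the first maxAffected contents
        Map<String, Double> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : ranked.subList(0, Math.min(maxAffected, ranked.size()))) {
            kept.put(e.getKey(), e.getValue());
        }
        return kept;
    }

    /** Popularity after the process: the calculated value if kept, the neutral value otherwise. */
    static double popularityOf(String id, Map<String, Double> kept) {
        return kept.getOrDefault(id, NEUTRAL_POPULARITY);
    }

    public static void main(String[] args) {
        // Placeholder scores stand in for the real formula
        Map<String, Double> scores = Map.of("a", 3.0, "b", 10.0, "c", 7.0, "d", 20.0);
        Map<String, Double> kept = select(List.of("a", "b", "c"), List.of("c", "d"), 2, scores::get);
        System.out.println(kept.keySet() + ", popularity of a: " + popularityOf("a", kept));
    }
}
```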