Big Data
LumisXP includes a Big Data framework whose repository is customizable. Elasticsearch support is included natively, and connectors for other types of repositories can be implemented and integrated into LumisXP. For information on how to install Elasticsearch, see Installing Elasticsearch.
Features
Each item stored in Big Data is called a document. A document can have several fields, each with its value.
A Java API is available to access the Big Data repository. Through it, documents can be added, removed, and searched. For more information about this API, see the Javadoc for lumis.portal.bigdata.
To reindex the data stored in the Big Data repository, go to Settings > Frameworks > Big Data. Reindexing is usually necessary when the stored data has become outdated, for example after the data structure of a service has changed.
The default ProcessActionHandlers of DOUI and of the Content framework update the corresponding documents in the Big Data repository every time an item is added, updated, or deleted.
Customizations
For each DOUI or Content service, the big data persister used to index its data can be configured at the source level. In addition, each field of the source can specify how that field is stored in the big data repository; these settings are used by the default persister.
The default indexing behavior can also be customized in a custom data type. By default, each field follows the implementation of its data type.
Synonyms
The synonyms registered in the "Synonyms for search" service are applied only to the user query of a SearchQuery. They are not applied generically to filters on specific fields.
In the native Big Data repository based on Elasticsearch, synonyms function as described below.
In the default analyzer used by the portal, synonym words go through the stemmer token filter. Thus, in Portuguese, for example, the words bolo (cake) and bola (ball) are reduced to the same token (bol), since the stemmer removes both the final o in bolo and the final a in bola.
Thus, if a synonym is registered with the words bolo and doce (sweet) and the user searches for bola, the search will find content containing the word doce.
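The effect of stemming on synonyms can be sketched with a toy model. The example below uses the Portuguese words bolo (cake), bola (ball), and doce (sweet). The class StemmedSynonymSketch and its stemmer, which only strips one trailing vowel, are hypothetical stand-ins for illustration; they are not the stemmer token filter that Elasticsearch applies, nor part of the LumisXP API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of synonym expansion over stemmed tokens (illustrative only). */
final class StemmedSynonymSketch {

    /** Hypothetical stemmer: lower-cases the word and strips one trailing vowel. */
    static String stem(String word) {
        String w = word.toLowerCase();
        if (w.length() > 1 && "aeiou".indexOf(w.charAt(w.length() - 1)) >= 0) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    /** Maps each stemmed token to the stemmed tokens of its whole synonym group. */
    static Map<String, Set<String>> index(List<List<String>> groups) {
        Map<String, Set<String>> byStem = new HashMap<>();
        for (List<String> group : groups) {
            Set<String> stems = new HashSet<>();
            for (String word : group) {
                stems.add(stem(word));
            }
            for (String s : stems) {
                byStem.computeIfAbsent(s, k -> new HashSet<>()).addAll(stems);
            }
        }
        return byStem;
    }

    /** Tokens a query word expands to: its own stem plus the stems of its synonyms. */
    static Set<String> expand(String queryWord, Map<String, Set<String>> byStem) {
        String s = stem(queryWord);
        Set<String> out = new HashSet<>();
        out.add(s);
        out.addAll(byStem.getOrDefault(s, Set.of()));
        return out;
    }

    public static void main(String[] args) {
        // One registered synonym group: bolo <-> doce.
        Map<String, Set<String>> synonyms = index(List.of(List.of("bolo", "doce")));
        // "bola" stems to the same token as "bolo", so it picks up the stem of "doce".
        System.out.println(expand("bola", synonyms));
    }
}
```

Because the synonym group is keyed by stemmed tokens, any word that stems to the same token as a registered synonym inherits that synonym group, which is the behavior described above.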
The process of using synonyms by the portal is:
- A publisher registers new synonym words.
- The portal then writes synonym files in Solr format (one for each language registered in the portal) at:
<lumisdata>/shared/data/elasticsearch/lumis-analysis/synonyms-<locale>.txt
where <locale> is the language of the index (en_US for English, for example). The folder <lumisdata>/shared/data/elasticsearch/lumis-analysis should be mapped on each server in the Elasticsearch cluster used by the portal within its config folder (see the synonyms_path property of the documentation), so that the files written by the portal are available in <config>/lumis-analysis/synonyms-<locale>.txt on the Elasticsearch servers.
At this moment, these synonyms will not yet be available for search.
- A process that runs in the background detects that there have been changes in the synonym registration.
For the changes in synonyms to be correctly reloaded by Elasticsearch, this process reloads the analyzers of the appropriate indices.
- When an index is created by the portal, using this Big Data API, that index will already be configured to use the synonym file created by the portal in any search.
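As a rough illustration of the file layout described above, the following sketch builds the synonym file path for a locale and formats one synonym group as a line in Solr synonym format, where equivalent terms are separated by commas. The class SynonymFileSketch, its method names, and the /opt/lumisdata root are hypothetical. This document does not show the exact lines the portal writes, so the comma-separated equivalent form is an assumption.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

/** Illustrative helpers only; not part of the LumisXP API. */
final class SynonymFileSketch {

    /** Resolves <lumisdata>/shared/data/elasticsearch/lumis-analysis/synonyms-<locale>.txt. */
    static Path synonymFile(Path lumisData, String locale) {
        return lumisData
                .resolve("shared")
                .resolve("data")
                .resolve("elasticsearch")
                .resolve("lumis-analysis")
                .resolve("synonyms-" + locale + ".txt");
    }

    /** Formats a group of equivalent synonyms as one Solr-format line, e.g. "cake, sweet". */
    static String solrLine(List<String> group) {
        return String.join(", ", group);
    }

    public static void main(String[] args) {
        // "/opt/lumisdata" is a made-up location for the lumisdata folder.
        System.out.println(synonymFile(Paths.get("/opt/lumisdata"), "en_US"));
        System.out.println(solrLine(List.of("cake", "sweet")));
    }
}
```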
Content Popularity
By default, LumisXP calculates the popularity of contents once a day, during the early morning.
A class that implements the lumis.portal.bigdata.IDocumentPopularityProvider interface can be used to customize this calculation. See the documentation for the methods getDocumentPopularities and getDocumentPopularity for details on how to customize the popularities.
By default, the portal uses the lumis.portal.bigdata.StandardDocumentPopularityProvider implementation to calculate popularities.
The default implementation has the following configurations:
- Maximum number of affected contents
This is the maximum number of contents that will have their popularities altered by the calculation process. All other contents will have a popularity of 1 (which does not influence search relevance).
Default value: 1000.
- Unique visitor window
This configuration determines the period (in hours) taken into account when calculating the contents with the most unique visitors. That is, when the portal calculates which contents had the most unique accesses, it considers only the accesses made between the current moment minus this number of hours and the current moment.
Default: 1440 hours (the equivalent of 60 days).
- Unique visitor limit
A configurable limit of unique visitors for a content. Between 0 and Unique visitor limit, the greater the number of unique visitors to a content, the greater its influence on popularity. Upon reaching Unique visitor limit, the influence on the content's popularity reaches its maximum value and stays at that maximum from that limit onward.
If not filled, the default value for this parameter is calculated as follows:
  - The portal makes a query to obtain the total number of unique visitors within the period configured in Unique visitor window.
  - The default value for Unique visitor limit will be 50% of the total unique visitors. That is, if during the period the portal had 100 unique visitors, the Unique visitor limit parameter will assume a default value of 50. This value is rounded up to the next integer. For example, if the total number of unique visitors is 33, 50% would be 16.5, which is rounded up to 17.
- Age limit
A configurable age limit for a content (in hours). Between 0 (the content has just been published) and Age limit, the smaller the age of the content, the greater its influence on popularity. From Age limit onward, the age of the content no longer influences its popularity.
Default: 720 hours (the equivalent of 30 days).
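The arithmetic behind these defaults can be checked with a small sketch. The class PopularityDefaultsSketch and its method names are hypothetical and do not come from the LumisXP API; only the numeric values (1000, 1440, 720, and the 50% rule with rounding up) are taken from the text above. Treating both ends of the visitor window as inclusive is an assumption.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative arithmetic for the documented defaults; not the LumisXP implementation. */
final class PopularityDefaultsSketch {

    static final int DEFAULT_MAX_AFFECTED_CONTENTS = 1000;
    static final int DEFAULT_UNIQUE_VISITOR_WINDOW_HOURS = 1440; // 60 days
    static final int DEFAULT_AGE_LIMIT_HOURS = 720;              // 30 days

    /** 50% of the total unique visitors, rounded up to the next integer. */
    static long defaultUniqueVisitorLimit(long totalUniqueVisitors) {
        return (totalUniqueVisitors + 1) / 2; // integer form of ceil(total / 2) for total >= 0
    }

    /** Start of the unique visitor window: the current moment minus the window in hours. */
    static Instant windowStart(Instant now, int windowHours) {
        return now.minus(Duration.ofHours(windowHours));
    }

    /** True when an access falls inside the window (inclusive bounds are an assumption). */
    static boolean inWindow(Instant access, Instant now, int windowHours) {
        return !access.isBefore(windowStart(now, windowHours)) && !access.isAfter(now);
    }

    public static void main(String[] args) {
        System.out.println(defaultUniqueVisitorLimit(33));
        System.out.println(Duration.ofHours(DEFAULT_UNIQUE_VISITOR_WINDOW_HOURS).toDays());
        System.out.println(Duration.ofHours(DEFAULT_AGE_LIMIT_HOURS).toDays());
    }
}
```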
This implementation works as follows:
- Obtains the most recently registered contents, limited to Maximum number of affected contents.
- Obtains the contents that had the most unique visitors during the Unique visitor window, limited to Maximum number of affected contents.
- Applies a formula to each of these contents to determine its popularity. Under this formula, the popularity of a content can vary from 1 to 25.
- Sorts these contents in descending order of popularity.
- Keeps only the first Maximum number of affected contents contents in this list.
- Resets the popularity of all other contents in the Big Data repository.
- Sets the popularity of the kept contents in the Big Data repository.
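The selection steps above can be sketched over an in-memory list. This is a minimal illustration, not the StandardDocumentPopularityProvider code: the class PopularityPipelineSketch, the Content type, and the compute method are hypothetical, and the popularity formula itself is not reproduced here, so it is passed in as a placeholder function. Clamping the result to the range 1 to 25 and resetting other contents to 1 follow the ranges stated in this section; the clamp is an added guard for arbitrary placeholder formulas.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.ToDoubleFunction;

/** In-memory sketch of the selection steps; the real formula is supplied by the caller. */
final class PopularityPipelineSketch {

    static final double NEUTRAL = 1.0; // popularity that does not influence relevance
    static final double MAX = 25.0;    // upper bound stated for the formula

    /** Hypothetical content summary used only by this sketch. */
    static final class Content {
        final String id;
        final long ageHours;
        final long uniqueVisitors;

        Content(String id, long ageHours, long uniqueVisitors) {
            this.id = id;
            this.ageHours = ageHours;
            this.uniqueVisitors = uniqueVisitors;
        }
    }

    static Map<String, Double> compute(List<Content> all, int maxAffected,
                                       ToDoubleFunction<Content> formula) {
        // Steps 1 and 2: newest contents and most visited contents, each limited to maxAffected.
        List<Content> newest = new ArrayList<>(all);
        newest.sort(Comparator.comparingLong((Content c) -> c.ageHours));
        List<Content> mostVisited = new ArrayList<>(all);
        mostVisited.sort(Comparator.comparingLong((Content c) -> c.uniqueVisitors).reversed());
        Set<Content> candidates = new LinkedHashSet<>();
        candidates.addAll(newest.subList(0, Math.min(maxAffected, newest.size())));
        candidates.addAll(mostVisited.subList(0, Math.min(maxAffected, mostVisited.size())));

        // Step 3: apply the placeholder formula, clamped to [1, 25].
        Map<Content, Double> scored = new LinkedHashMap<>();
        for (Content c : candidates) {
            scored.put(c, Math.max(NEUTRAL, Math.min(MAX, formula.applyAsDouble(c))));
        }

        // Steps 4 and 5: sort by popularity, descending, and keep at most maxAffected.
        List<Content> ranked = new ArrayList<>(candidates);
        ranked.sort(Comparator.comparingDouble((Content c) -> scored.get(c)).reversed());
        List<Content> kept = ranked.subList(0, Math.min(maxAffected, ranked.size()));

        // Steps 6 and 7: reset every content to neutral, then set the kept popularities.
        Map<String, Double> result = new LinkedHashMap<>();
        for (Content c : all) {
            result.put(c.id, NEUTRAL);
        }
        for (Content c : kept) {
            result.put(c.id, scored.get(c));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Content> all = List.of(
                new Content("a", 1, 0), new Content("b", 5, 10),
                new Content("c", 100, 30), new Content("d", 200, 2));
        // Placeholder formula for demonstration only: 1 + unique visitors.
        System.out.println(compute(all, 2, c -> 1.0 + c.uniqueVisitors));
    }
}
```

Taking the union of both candidate lists before scoring means up to twice Maximum number of affected contents contents are scored, which is why the list is cut back to the limit after sorting.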