Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BundlesCMS][SearchBundle][NodeSearchBundle] Remove all deprecations in our ElasticaSearch related code. #1185

Merged
merged 6 commits into from
Jun 20, 2016
Merged

[BundlesCMS][SearchBundle][NodeSearchBundle] Remove all deprecations in our ElasticaSearch related code. #1185

merged 6 commits into from
Jun 20, 2016

Conversation

Numkil
Copy link

@Numkil Numkil commented Jun 3, 2016

Q A
Bug fix? no
New feature? yes
BC breaks? yes
Deprecations? no
Fixed tickets #1117
  • Remove _analyzer mapping
  • Create documentation/explanation about removing _analyzer mapping
  • Rewrite queries to use aggregations instead of facets
  • Create documentation/explanation about removing facets in favor of aggregations
  • Remove _boost mapping
  • Create documentation/explanation about removing _boost mapping
  • Remove index_name in view_roles mapping
  • Create documentation/explanation about removing index_name mapping
  • Extend upgrade.markdown file

Remove _analyzer mapping

Why is it changed?

At Kunstmaan we are looking to also provide support for Elasticsearch 2.* in the near future. This is one of the preliminary cleanups before we can look at that upgrade.

Using an _analyzer field per document had already been flagged as deprecated since 1.5! See elastic/elasticsearch#9279 for an in depth reasoning as to why they did not want to support it any more.
This meant for us that our current way of using Elasticsearch to search in our pages would not work any more if we decided to upgrade to ElasticSearch 2.*.
A secondary reason for rewriting this part of the code is to improve scaling for websites with a very large amount of pages. For example a high traffic blog with 1000 pages each translated in to 4 different languages. Our old query would have to parse all 4000 generated documents and filter afterwards each time you searched for something. This has been improved as well. The new query will only parse the 1000 relevant translations.

What has changed?

How did our code work before?

When invoking the setup and populate command 1 index would be created, in this index all pages of the website would be indexed. This index had many different analyzers for each language. On all these indexed documents a property lang and analayzer was set. When querying the index all documents would be parsed using the analyzer defined in the field per document. Afterwards the results would be filtered based on the lang field on this document.

How does it work now?

When invoking the setup and populate command a different index will be created for each configured locale. Documents will only be added to the relevant index. Each index has 1 index analyzer and 1 suggestion analyzer specific for the language in the index. No extra properties need to be saved on the documents. When querying, the query will only be run on the relevant index using the index specific suggestion_analyzer. Documents we do not care about in that specific query because of the language will not be parsed anymore and thus no extra filtering afterwards is necessary.

Rewrite queries to use aggregations instead of facets.

What has changed

Facets have been disabled in ElasticSearch 2.*, they have been superseded by aggregations. Aggregations are very simply explained a more powerful implementation for facets.

We already used aggregations in some parts of the code but still kept using facets in other parts to avoid BC breaks. These facets have now been completely removed and replaced by aggregations. This means however that upon upgrading your KunstmaanBundles project you will have to fix small parts of the twig files that are created when using app/console kuma:generate:search.

Remove _boost mapping

Why is it changed?

Together with removing the option for assigning a specific analyzer to a specific document, the ElasticSearch team decided to remove the option to assign a specific _boost on a document level. A more thorough explanation on why document boosting has been removed can be found here. http://blog.brusic.com/2014/02/document-boosting-in-elasticsearch.html

It could make the queries and scoring very difficult to understand because some field on the result documents that you can't really see could change the scoring of your query. This could result in people writing their own NodeSearcher with custom queries and not being able to fine tune the scoring.

What has changed?

How did our code work before?

When populating the indices we would map a _boost field on each document. This _boost field would be incremented, if the page for that document implemented the BoostSearchInterface or if a NodeSearch existed for that node in the database. When querying the indices these _boost fields would magically affect the end score.

How does it work now?

The _boost field has been removed. Because of this we can not do anything at population time anymore. Instead for each query that is now done the top 500 results will be "rescored". This means a second query will be run on top of those results that will affect the end score. This second query will apply extra boosts based on the fact if your page implements the BoostSearchInterface or has a NodeSearch in the database. This is more verbose, easier to understand and configurable than our old implementation. Because of how rescoring works the extra load this second query brings should be minimal.
https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-rescore.html

Replace index_name in view_roles mapping

What has changed?

index_name was a really old and obscure option that didn't really do anything anymore. Removing the mapping did not break any of the functionality and makes our mappings more ElasticSearch 2.* compatible. The option has also been removed from our configuration as you really should not be using it anymore. It has been replaced by it's actually useful predecessor "copy_to".

For more information about the index_name mapping check this page from the ElasticSearch wiki:
https://www.elastic.co/guide/en/elasticsearch/reference/1.3/mapping-object-type.html#_include_in_all_2
And it's predecessor:
https://www.elastic.co/guide/en/elasticsearch/guide/current/custom-all.html

BC breaks!

Website with a default SearchPage without changes to the underlying code.

These websites will have to do 2 things to become fully operational again.
First they have to set up their indices again. This can be achieved with the following commands.

 app/console kuma:search:setup && app/console kuma:search:populate full

And the twig files that were generated by the app/console kuma:generate:search have to be fixed. All references to facets in these files have to be rewritten to aggregations. In this commit you can see exactly how to do that. 5e4a7c1

Changed interfaces and implementations

If you have overwritten, implemented, extended any of these interfaces and classes you should check if you need to do some rewriting of your custom code.

_src/Kunstmaan/SearchBundle/Search/AnalysisFactoryInterface_

  • Method setupLanguages has been removed since you want create 1 analysis per index/language.
  • Getter and Setter for the variable stopwords have been removed. They are not used in the default implementation and it does not make sense to have getters and setters in an interface. If you want to create your own AnalysisFactory implementation which uses a custom list of stopwords you can inject that list however you want.
  • Method addStemmerFilter($language) has been added. Usually a better alternative for NGrams.

_src/Kunstmaan/SearchBundle/Search/AnalysisFactory_

  • Names of the analyzers have been changed so that the index can automatically recognize what to use them for. "index_analyzer" is changed to "default" and suggestion_analyzer has been changed to "default_search".

_src/Kunstmaan/NodeSearchBundle/Search/SearcherInterface_

  • Parameter $lang has been removed from the method defineSearch. It does not make sense to provide this parameter anymore since the index in the query is already language specific, and thus no need to filter on this in the query.

_src/Kunstmaan/NodeSearchBundle/Search/AbstractElasticaSearcher_

  • Parameter $lang has been removed from the method defineSearch. It does not make sense to provide this parameter anymore since the index in the query is already language specific, and thus no need to filter on this in the query.
  • Search function has been altered to search in the language specific index.

_src/Kunstmaan/NodeSearchBundle/Configuration/NodePagesConfiguration_

  • Default behaviour for createIndex() creates a specific index for each configured locale
  • Default behaviour for populateIndex() has been altered to reflect the new index structure
  • Mapping _boost has been removed
  • Mapping index_analyzer and suggestion_analyzer are gone because the index now has to detect the function of the analyzers based on their name.

_src/Kunstmaan/NodeSearchBundle/Services/SearchService_

  • The applySearchParams function does not apply facets to the query anymore.

_src/Kunstmaan/NodeSearchBundle/pagerFanta/Adapter/SearcherRequestAdapterInterface_

  • Method getFacets() has been removed.

_src/Kunstmaan/NodeSearchBundle/PagerFanta/Adapter/SearcherRequestAdapter_

  • All references to Facets have been removed.

_src/Kunstmaan/NodeSearchBundle/Search/NodeSearcher_

  • Applying extra boost to the title field should not be hardcoded in the query because it is easily configured in the field mapping and hardcoding this would always override user configuration.
  • Extra rescoring step added to the queries as an alternative to document boosting.

_Configuration_

  • Document field 'lang' has been removed from the default mapping configuration.
  • Document field 'contentanalyzer' has been removed from the default mapping configuration.
  • Parameter 'index_name' is not available anymore. Has been with the option 'copy_to'

@Numkil Numkil changed the title [WIP] Feature/remove decrepated analyzer mapping from elasticsearch code Feature/remove decrepated analyzer mapping from elasticsearch code Jun 3, 2016
@Numkil Numkil changed the title Feature/remove decrepated analyzer mapping from elasticsearch code [WIP] remove all decrepations in our ElasticaSearch related code. Jun 3, 2016
@Numkil Numkil changed the title [WIP] remove all decrepations in our ElasticaSearch related code. [WIP] remove all deprecations in our ElasticaSearch related code. Jun 6, 2016
@Numkil Numkil changed the title [WIP] remove all deprecations in our ElasticaSearch related code. [BundlesCMS][SearchBundle][NodeSearchBundle] Remove all deprecations in our ElasticaSearch related code. Jun 7, 2016
Kevin Jossart added 6 commits June 20, 2016 16:23
…from elasticsearch code

Rewrite Setup and Population of elasticasearch indexes. Instead of 1 shared index with all our pages
together we use an index per language.
Remove decrepated _analyzer mapping and specify index and suggestion analyzer per index.
Rewrite queries -> filtering on language no longer necessary -> better for scaling because less
documents to parse
Cleanup code interfaces and AnalysisFactory to reflect changes made to indexes
…by aggregations

Replace all references to facets by aggregations
Remove all facets code
…rameter

Add method to nodeRepository to retrieve a list off all the registered refEntityNames
Remove all references to _boost in mapping and populate code
add rescoring to our query to boost results in the correct way
Remove deprecated and useless index_name mapping
Allow copy_to mapping in configuration because it is useful replacement for index_name
@jockri jockri added this to the 3.6.0 milestone Jun 20, 2016
@jockri jockri merged commit 6d000b9 into Kunstmaan:master Jun 20, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants