Investigate additional Search Engines. #178

Jieiku · 2024-07-01T21:17:34Z

elasticlunr was the first search engine implemented in abridge because its index generation was directly supported by Zola.

elasticlunr also supports CJK, stemmers, and stop words, so it is a good solution for a wide range of people.

I then implemented both tinysearch and stork; the demos are here:

https://jieiku.github.io/abridge-tinysearch/

https://jieiku.github.io/abridge-stork/

Those demos are static builds from an older version of abridge. I lost interest in stork because it actually used more bandwidth than elasticlunr. I am, however, interested in getting tinysearch working again.

Zola now supports building a json based index:

getzola/zola#2507

https://www.getzola.org/documentation/content/search/#fuse

I think I may have looked at flexsearch, but I cannot remember all the details; it has been a while. Another one I am interested in is pagefind: CloudCannon/pagefind#277

I opened a new issue at tinysearch, but I don't have time to work on it at the moment: tinysearch/tinysearch#178

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uFuzzy,FlexSearch,Elasticlunr,Fuse&search=super%20ma

After looking here, it looks like flexsearch supports stemmers and CJK: nextapps-de/flexsearch#51

I am not sure how automatic that support is, but it looks like flexsearch is worth looking into.

Hysterelius · 2024-07-01T23:27:23Z

I have been having a look at FlexSearch especially due to its faster speeds and lower bundle size, but I am still working out how the index is constructed and whether Zola could support it.

If that doesn't work, I am happy to try and fix tinysearch, but I have limited time currently so it might take me a while either way.

Jieiku · 2024-07-02T06:34:17Z

flexsearch is a smaller script, 5.87 kB instead of 18.05 kB for elasticlunr.

But elasticlunr loads faster and uses less memory, which might be important on mobile devices.

The author of uFuzzy set up the benchmarks, and he admits that he may not have the other search engines tuned perfectly, but he made a best effort when documentation was available. You can see the full table at the bottom of this github page: https://github.com/leeoniya/uFuzzy

On the flip side, flexsearch performs a search in 3ms and elasticlunr takes 14ms, but I think this difference is inconsequential compared to the enormous difference in load speed and memory usage.

elasticlunr uses 89 MB and takes 978ms to load
flexsearch uses 336 MB and takes 3,088ms to load

here is elasticlunr: (screenshot not preserved)

here is flexsearch: (screenshot not preserved)

Hysterelius · 2024-07-02T06:46:58Z

Yeah, it is probably a bit pointless chasing those extra milliseconds, especially on a static site, and even more so if it increases load times.

Jieiku · 2024-07-02T06:52:12Z

I have been on the lookout for something that is better across the board (mostly because elasticlunr is no longer maintained). pagefind is interesting because it chunks the index, so if you had a really enormous site with lots of articles, it would not download the entire index, only the chunks relevant to your search. I think that until a site is pretty big elasticlunr would actually perform better than pagefind, but that is just my hunch without having actually tried pagefind.
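To make the chunking idea concrete, here is a toy sketch. It is not pagefind's actual index format; the helper names `buildChunks` and `searchChunks` and the "group terms by first letter" scheme are assumptions made up purely for illustration. The point is only that a query needs a single chunk rather than the whole index.

```javascript
// Toy illustration of a chunked index. NOT pagefind's real format.
// Terms are grouped by their first letter, so a query touches one chunk only.
function buildChunks(pages) {
  const chunks = new Map();
  for (const page of pages) {
    const words = page.text.toLowerCase().split(/\W+/).filter(Boolean);
    for (const word of words) {
      const key = word[0];
      if (!chunks.has(key)) chunks.set(key, new Map());
      const chunk = chunks.get(key);
      if (!chunk.has(word)) chunk.set(word, new Set());
      chunk.get(word).add(page.url);
    }
  }
  return chunks;
}

function searchChunks(chunks, term) {
  const word = term.toLowerCase();
  // On a real site, this is the only chunk the client would need to download.
  const chunk = chunks.get(word[0]);
  return chunk && chunk.has(word) ? [...chunk.get(word)] : [];
}

const chunks = buildChunks([
  { url: "/a/", text: "Zola themes" },
  { url: "/b/", text: "Search with zola" },
]);
console.log(searchChunks(chunks, "zola"));
```

With a monolithic index the client pays for every term up front, which is why the trade-off only starts to favour chunking once the index is large.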

Jieiku · 2024-07-02T07:15:29Z

All of that said, I like the idea of Abridge being flexible, so if anyone wants to submit a pull request to support a given search engine, I would likely add it as long as it didn't cause an unavoidable problem.

Hysterelius · 2024-07-02T08:04:49Z

I was looking for a search engine that supports fuzzy matching (elasticlunr, tinysearch and stork all don't seem to), to help provide a better user experience. That being said, pagefind looks to support this (and also supports indexing other files, like PDFs, which I have also been looking for), and I might have a go at seeing what I can do to support it.

I'm not sure how well I'll go with chunking and manipulating the index - but I'll give it a try :)
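For anyone unfamiliar with the term, fuzzy matching means a query still matches when it is a typo or two away from an indexed word. A minimal sketch of one common approach (Levenshtein edit distance) follows. This is a generic illustration, not how pagefind or any other engine discussed here implements it, and `editDistance` and `isFuzzyMatch` are hypothetical names.

```javascript
// Levenshtein edit distance using a single rolling row.
function editDistance(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diagonal = row[0]; // value for (i-1, j-1)
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]; // value for (i-1, j)
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, diagonal + cost);
      diagonal = above;
    }
  }
  return row[b.length];
}

// Hypothetical helper: accept a term if it is within maxDistance edits of the query.
function isFuzzyMatch(query, term, maxDistance = 1) {
  return editDistance(query.toLowerCase(), term.toLowerCase()) <= maxDistance;
}

console.log(isFuzzyMatch("tinysearcch", "tinysearch"));
```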

Hysterelius · 2024-07-02T08:27:09Z

Just checking what you think of this flow for the search process.

I think it would have to pass through a node script during the build step, which I guess is already done in the process for elasticlunr.

I don't know quite yet how to support multilingual, but at least pagefind seems to support it out of the box.

Jieiku · 2024-07-02T19:49:41Z

Yes, that is exactly what I was thinking. I believe that if we configured Zola to output a json index, we should be able to send it to pagefind through their node api using this method:

https://pagefind.app/docs/node-api/#indexaddcustomrecord

If the json format that zola outputs is missing some things we need then maybe we can look at the recently added fuse json output and create a pull request to zola adding an additional json index format that is compatible with pagefind.

Once the pagefind index is built I am guessing we would use this to save it:

https://pagefind.app/docs/node-api/#indexwritefiles

Hysterelius · 2024-07-03T03:51:27Z

It is pretty simple to construct an index; however, it does seem to have a heavy reliance on node (I tried to package it using esbuild, and it didn't like it).

I wrote this script based off this information:
CloudCannon/pagefind#277
I was just wondering where zola puts this intermediate data?

import * as pagefind from 'pagefind';

async function createIndex() {
    // Create a new Pagefind index
    const { index } = await pagefind.createIndex({
        forceLanguage: 'en', // Force the language to English
    });

    // Define your data
    const data = [{
        "title": "Abridge Zola Theme",
        "url": "https://abridge.netlify.app/overview-abridge/",
        "meta": "Abridge is a fast and lightweight Zola theme using semantic html, only ~6kb css before svg icons and syntax highlighting, no mandatory JS, and perfect…",
        "body": "Abridge is a fast and lightweight Zola theme using semantic html..."
    }, {
        "title": "Code Blocks and Themes",
        "url": "https://abridge.netlify.app/overview-code-blocks/",
        "meta": "This article shows various Code Blocks allowing to easily compare sublime themes.\n",
        "body": "This article shows various Code Blocks allowing to easily compare sublime themes..."
    }, {
        "title": "Markdown and Style Guide",
        "url": "https://abridge.netlify.app/overview-markdown-and-style/",
        "meta": "This article offers a sample of basic Markdown syntax that can be used in Zola content files, also it shows if basic HTML elements are decorated with …",
        "body": "This article offers a sample of basic Markdown syntax that can be used in Zola content files, also it shows if basic HTML elements are decorated with CSS in a Zola theme..."
    }, {
        "title": "Image Shortcodes",
        "url": "https://abridge.netlify.app/overview-images/",
        "meta": "This post covers the imghover and img shortcodes. Images can also be embedded directly using markdown ![Ferris](ferris.svg), but it is better to use a…",
        "body": "This post covers the imghover and img shortcodes. Images can also be embedded directly using markdown..."
    }, {
        "title": "Rich Content",
        "url": "https://abridge.netlify.app/overview-rich-content/",
        "meta": "Several custom shortcodes are included to augment CommonMark (courtesy of d3c3nt theme), in addition to those already provided by Zola. video, image, …",
        "body": "Several custom shortcodes are included to augment CommonMark (courtesy of d3c3nt theme), in addition to those already provided by Zola. video, image, gif,..."
    }, {
        "title": "Embedded Youtube Videos",
        "url": "https://abridge.netlify.app/overview-embedded-youtube/",
        "meta": "Zola has many shortcodes, and new are easily added, this example shows youtube.\n",
        "body": "Zola has many shortcodes, and new are easily added, this example shows youtube.\nYoutube\nwith yt(id=&quot;the_id_here&quot;)\n\nid: the video id (mandatory)\nplaylist: the playlist id (optional)\nclass: a class to add to the &lt;div&gt; surrounding the iframe (optional)\nautoplay: when set to &quot;true&quot;, the video autoplays on load (optional)\ntitle - set alt title for the iframe (optional, defaults to &quot;Youtube&quot;)\ncookie - set to &quot;true&quot; if you want tracking cookies, otherwise it defaults to false.\n\n\n\t\n\n"
    }, {
        "title": "Embedded Vimeo Videos",
        "url": "https://abridge.netlify.app/overview-embedded-vimeo/",
        "meta": "Zola has many shortcodes, and new are easily added, this example shows vimeo.\n",
        "body": "Zola has many shortcodes, and new are easily added, this example shows vimeo.\nVimeo\nwith vm(id=&quot;id_here&quot;)\n\nid: the video id (mandatory)\nclass: a class to add to the &lt;div&gt; surrounding the iframe (optional)\nautoplay: when set to &quot;true&quot;, the video autoplays on load (optional)\nloop: when set to &quot;true&quot;, the video plays on a loop (optional)\nnoautopause: when set to &quot;true&quot;, the video will not autopause (optional)\ntitle - set alt title for the iframe (optional, defaults to &quot;Vimeo&quot;)\ncookie - set to &quot;true&quot; if you want tracking cookies, otherwise it defaults to false.\n\n\n\t\n\n"
    }, {
        "title": "Mathematical Notations",
        "url": "https://abridge.netlify.app/overview-math/",
        "meta": "You can use KaTeX to render mathematical notations.\nYou can enable the $\\KaTeX$ support globally, per-section or per-page basis.\n",
        "body": "You can use KaTeX to render mathematical notations.\nYou can enable the $\\KaTeX$ support globally, per-section or per-page basis.\nEnable..."
    }];

    // Add each record to the index
    for (const record of data) {
        await index.addCustomRecord({
            url: record.url,
            content: record.body,
            language: 'en',
            meta: {
                title: record.title,
                description: record.meta,
            }
        });
    }

    // Write the index files to disk
    await index.writeFiles({
        outputPath: 'public/pagefind'
    });

    // Stop the pagefind backend once the index has been written
    await pagefind.close();

    console.log('Index created successfully!');
}

createIndex().catch(console.error);

Then search is even easier (it handles the chunking for you!), it could easily be put in:

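  async function search() {
    const pagefind = await import("./public/pagefind/pagefind.js");
    // init() loads pagefind's dependencies and site metadata before the first query
    await pagefind.init();
    const search = await pagefind.search("zola");
    if (search.results.length === 0) {
      console.log("No results");
      return;
    }
    // Each result loads its own data on demand
    const oneResult = await search.results[0].data();
    console.log(oneResult);
  }

  search().catch(console.error);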

The index it spits out is pretty small, only 127B for that test data, yet pagefind itself is pretty big: ~32kB not including the WASM, giving a whole bundle size of around 100kB, which is massive compared to the elasticlunr bundle size (18kB, as far as I could tell). But the searches are very fast :)

I am just a bit confused, as this would require users to rerun the index build process on every zola build to enable search. Is that what abridge is already designed to do?

Jieiku · 2024-07-03T04:12:55Z

That is what I was thinking, yet pagefind will still pull ahead for sites with a ton of content, because as you add content the elasticlunr index gets bigger and bigger.

yes, every time you do a zola build, it generates a new index, that much is true of elasticlunr, tinysearch, stork, etc.

Hysterelius · 2024-07-03T04:17:20Z

> That is what I was thinking, yet pagefind will still pull ahead for sites with a ton of content, because as you add content the elasticlunr index gets bigger and bigger.

Does elasticlunr include all those lunr files I see in public/js? As that would lead to a dramatically different size.

And pagefind wasn't bundled, so I am guessing you could get some savings off that.

> yes, every time you do a zola build, it generates a new index, that much is true of elasticlunr, tinysearch, stork, etc.

Cool! I am just not quite sure how to hook into that.

Jieiku · 2024-07-03T04:47:07Z

I would configure your config.toml to output the json format:

https://www.getzola.org/documentation/content/search/#fuse

# config.toml
[search]
index_format = "fuse_json"

After you do that, just issue a zola build and take a look in the public folder for the json index (look it over to see if it will be compatible).

Then you could wrap the script you wrote in a function within package_abridge.js and call it after the first zola build. (zola build gets called twice within package_abridge.js: once to generate the index, then, after the js files are minified, a second time to update the integrity hashes for the newly minified files.)

EDIT: Currently your function feeds some static data (the const data array) into pagefind for testing purposes. Instead, you would use a node module to parse the json index that zola outputs when you configure it as index_format = "fuse_json", and feed that to pagefind.
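A minimal sketch of that mapping step, under stated assumptions: the field names `url`, `title`, `body` and `description` are guesses at what the json index contains and should be checked against the real file in the public folder, and `toPagefindRecords` is a hypothetical helper. The output objects follow the shape already passed to index.addCustomRecord in the script above.

```javascript
// Hypothetical helper: convert parsed entries from zola's json index into the
// objects passed to pagefind's index.addCustomRecord().
// Assumption: each entry has `url`, `title`, `body` and an optional `description`.
function toPagefindRecords(entries, language = "en") {
  return entries
    .filter((entry) => entry.url && entry.body) // skip entries with nothing to index
    .map((entry) => ({
      url: entry.url,
      content: entry.body,
      language,
      meta: {
        title: entry.title ?? "",
        description: entry.description ?? "",
      },
    }));
}

// Inline sample standing in for the parsed index file.
const sample = [
  {
    title: "Abridge Zola Theme",
    url: "https://abridge.netlify.app/overview-abridge/",
    body: "Abridge is a fast theme",
  },
  { title: "Empty page", url: "https://abridge.netlify.app/empty/", body: "" },
];
const records = toPagefindRecords(sample);
console.log(records);
```

In the real script the entries would come from reading and JSON.parse-ing whichever index file zola writes, and each record would then be passed to index.addCustomRecord in the existing loop.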

If you look in package_abridge.js you will see that I check the values of many things in config.toml to handle logic. In config.toml you will find search_library = 'elasticlunr' under config.extra. We can update the readme for that section to also include search_library = 'pagefind', and when it is set that way we would call your function within package_abridge.js.
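A rough sketch of that kind of check, not the actual logic in package_abridge.js: `getSearchLibrary` is a hypothetical helper, and the regex is only a dependency-free stand-in for a real TOML parser.

```javascript
// Hypothetical helper: read the value of `search_library` from config.toml text.
// A regex keeps this sketch dependency-free; a real TOML parser is more robust.
// Commented-out lines (starting with #) are ignored.
function getSearchLibrary(tomlText) {
  const match = tomlText.match(/^\s*search_library\s*=\s*['"]([^'"]+)['"]/m);
  return match ? match[1] : null;
}

const exampleConfig = `
[extra]
search_library = 'pagefind'
`;

if (getSearchLibrary(exampleConfig) === "pagefind") {
  // This is where the pagefind index-building function would be called.
  console.log("search_library is pagefind");
}
```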

EDIT: elasticlunr does NOT load all those js files, those files are for other languages, so they only get used on pages that use other languages. (in google chrome you can press ctrl+shift+i and load the abridge demo and search for something in the searchbox and see exactly which files get loaded for elasticlunr)

EDIT: ah yes package_abridge.js can be used to minify and bundle any of the js files that pagefind uses and it should save some space.

or in firefox, which I prefer: (screenshot not preserved)

Hysterelius · 2024-07-04T10:31:10Z

I was just wondering if you think it is better for the users to install pagefind as a node package or do you want it bundled in static/js?

Jieiku · 2024-07-04T17:16:52Z

Because node is required to build the index, we might as well just have it as a dependency that gets installed as a node package.

Any javascript for the client side search can of course go into static/js, but anything related to building the index can just be installed as a node package.

Jieiku · 2024-07-28T20:43:40Z

Now that I have merged your pull request for pagefind, I am thinking I will go ahead and close this for now. Thanks again for your work on implementing pagefind.

Hysterelius · 2024-07-29T09:54:33Z

No problem!
Sorry about leaving the wrong sw.js files in

typo3ua · 2024-08-04T18:32:02Z

> That is what I was thinking, yet pagefind will still pull ahead for sites with a ton of content, because as you add content the elasticlunr index gets bigger and bigger.
>
> yes, every time you do a zola build, it generates a new index, that much is true of elasticlunr, tinysearch, stork, etc.

...and Jekyll too...

Jieiku · 2024-08-06T01:13:14Z

uploaded a pagefind demo: https://abridge-pagefind.pages.dev/

Jieiku · 2024-08-08T00:14:35Z

@Hysterelius where does this pagefind.js file come from?

https://github.com/Jieiku/abridge/blob/master/static/js/pagefind.js

In case I ever need to update it? I poked around on the pagefind repo but have not found it yet.

If you had to build the file let me know what is required to do that.

I noticed that there are 51 global variables: http://yellowlab.tools:8282/result/gys1dxx7b8/rule/globalVariables

I plan to open an issue with pagefind to see if that could be fixed, but would like to have an understanding of where this file comes from before I do that.

Edit: It looks like the file is created from this line:

abridge/static/js/pagefind.index.cjs, line 67 in 41ea5d3:

const pagefindPath = path.join(__dirname, "pagefind.js");

Edit2: Since we are creating that file, maybe I can wrap it in an anonymous function, will see what I can do.
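The general idea of wrapping a generated script in an anonymous function can be sketched as below. This is only an illustration of the technique, not what commit 39606e8 actually does, and `wrapInIife` is a hypothetical helper. It assumes a classic script: source containing top-level import or export statements cannot be wrapped this way.

```javascript
// Hypothetical helper: wrap a classic script's source in an immediately invoked
// function expression so top-level `var` and function declarations stay local
// instead of becoming global variables.
function wrapInIife(source) {
  return `(function () {\n${source}\n})();\n`;
}

const wrapped = wrapInIife("var leaked = 1;");
console.log(wrapped);
```

Anything the rest of the site still needs from the wrapped file has to be exposed deliberately, which is why inlining it into the script that uses it is a natural fit.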

Edit3: figured it out: 39606e8

http://yellowlab.tools:8282/result/gys4idfdw3/rule/globalVariables

Hysterelius · 2024-08-08T21:05:06Z

I was going to say that if we remove the file from the bundle, the search functions wouldn't be accessible to the search_pagefind.js file, but if you are inlining it into the search file it should be good :)

Jieiku · 2024-08-09T00:54:58Z

yep, I inlined the file. This demo gets built at every commit to abridge: https://abridge-pagefind.pages.dev/

Jieiku mentioned this issue Jul 1, 2024

Fix bug in package_abridge.js #176

Merged

Jieiku closed this as completed Jul 28, 2024

Jieiku mentioned this issue Aug 4, 2024

Can pagefind pull its data from a json index file? CloudCannon/pagefind#277

Closed
