Duplicated words in word cloud #17

Closed
TDC-mobstation opened this issue Dec 6, 2022 · 4 comments · Fixed by #18

@TDC-mobstation

Hello and thanks for a great lib!

We noticed that when our cloud has only a few words, the resulting image (or SVG representation) contains duplicated words. For instance, the base text "negative negative negative text quite quite quite quite quite containing words" comes out like this after grouping:
[image]

Our setup is basically copied from the examples:

        var k = 2;
        var wordCloud = new WordCloudInput(wces)
        {
            Width = 1024 * k,
            Height = 256 * k,
            MinFontSize = 8 * k,
            MaxFontSize = 32 * k,
        };

        var sizer = new LogSizer(wordCloud);
        using var engine = new SkGraphicEngine(sizer, wordCloud);
        var layout = new SpiralLayout(wordCloud);
        var colorizer = new RandomColorizer(); // optional
        var wcg = new WordCloudGenerator<SKBitmap>(wordCloud, engine, layout, colorizer);

        IEnumerable<(LayoutItem Item, double FontSize)> items = wcg.Arrange();

We don't really notice this problem with a larger number of words, but is this the expected behavior for word clouds with only a few words in them?

Thanks in advance!

@jjonescz
Member

jjonescz commented Dec 6, 2022

Hi, thanks for reaching out. Text preprocessing is not (yet?) part of this library. You haven't shown that part of your code (i.e., where wces comes from), but it should look something like this:

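using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

var text = "negative negative negative text quite quite quite quite quite containing words";

// Count how often each word occurs.
var freqs = new Dictionary<string, int>();
var whitespaces = new Regex(@"\s+");
foreach (var word in whitespaces.Split(text))
{
    // Regex.Split yields an empty string for leading or trailing whitespace; skip it.
    if (word.Length == 0)
    {
        continue;
    }
    if (!freqs.TryGetValue(word, out var freq))
    {
        freq = 0;
    }
    freqs[word] = freq + 1;
}

// One entry per distinct word, so the input cannot contain duplicates.
var entries = freqs.Select(p => new WordCloudEntry(p.Key, p.Value));
var wordCloud = new WordCloudInput(entries)
{
    // ...
};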

That code tokenizes the text into words (separating them by whitespace), counts their frequency (populating the freqs dictionary), and then creates a WordCloudInput. This should ensure there are no duplicate words. I have tried this with your short text sample and it works as expected, producing:

[image: short-text]

Let me know if that helps.

jjonescz closed this as not planned on Dec 6, 2022
@TDC-mobstation
Author

Hello and thanks for the quick reply!

We figured out what our issue was. In our example we generated both an SVG and a PNG image, so basically we did both step 5 and step 6 in the readme (https://github.com/knowledgepicker/word-cloud/blob/master/README.md) in the same request.

We guess that they share state in some way and hence we get double of everything?

Just thought it might be interesting for others if they end up with the same thing.

@jjonescz
Member

jjonescz commented Dec 9, 2022

> We guess that they share state in some way and hence we get double of everything?

You're right, WordCloudGenerator reuses Layout, which it probably shouldn't do. I will fix this. Thank you for providing these details!
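
For readers who hit the same symptom, here is a toy sketch of how a reused, stateful layout can double every word. `ToyLayout` and `ToyGenerator` are hypothetical stand-ins written only for illustration; they are not the library's classes and say nothing about its actual internals.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a layout that remembers every item it has placed.
public sealed class ToyLayout
{
    private readonly List<string> placed = new List<string>();

    public void Add(string word) => placed.Add(word);

    public IReadOnlyList<string> Items => placed;
}

// Hypothetical stand-in for a generator that keeps one layout for its whole lifetime.
public sealed class ToyGenerator
{
    private readonly ToyLayout layout;
    private readonly IReadOnlyList<string> words;

    public ToyGenerator(ToyLayout layout, IEnumerable<string> words)
    {
        this.layout = layout;
        this.words = words.ToList();
    }

    // Places all words into the (never cleared) layout and returns a snapshot of it.
    public IReadOnlyList<string> Arrange()
    {
        foreach (var word in words)
        {
            layout.Add(word);
        }
        return layout.Items.ToList();
    }
}

public static class Demo
{
    public static void Main()
    {
        var words = new[] { "negative", "quite", "text" };

        // Same generator used twice: the second call sees the first call's items too.
        var shared = new ToyGenerator(new ToyLayout(), words);
        Console.WriteLine(shared.Arrange().Count); // prints 3
        Console.WriteLine(shared.Arrange().Count); // prints 6

        // Fresh layout and generator per output: no leftover state.
        var fresh = new ToyGenerator(new ToyLayout(), words);
        Console.WriteLine(fresh.Arrange().Count); // prints 3
    }
}
```

If the real cause is similar, building a separate layout and generator for each output (one for the SVG, one for the PNG) would likely avoid the duplicates on older versions, though that is an assumption rather than a confirmed workaround.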

jjonescz reopened this on Dec 9, 2022
jjonescz added the bug label on Dec 10, 2022
jjonescz self-assigned this on Dec 10, 2022
@jjonescz
Member

jjonescz commented Dec 10, 2022

This should be fixed in the latest version 1.2.0 (also on NuGet). Thanks again for your help.
