feat: webbrowser tool #714

Merged
merged 45 commits into from
Apr 14, 2023

jacobrosenthal
Contributor

@jacobrosenthal jacobrosenthal commented Apr 10, 2023

Maybe something better exists out there? I didn't see one in the Python repo either, but working towards an AutoGPT-like agent I wanted to play with one.

The page is trimmed using cheerio down to just text and HTML links, then cut to 3000 tokens, then summarized by a chain, so it doesn't bloat the agent's context space.

I'm not sure it'll get merged. I know people are trying to trim deps so we can use langchain as a pure JS experience, and I had to fall back to puppeteer because I couldn't get axios headers/CSRF/cookies/whatever to work. I left a test so maybe someone can help. But this works often enough and is kinda cool.

It would also need to be hardened: probably actually count tokens instead of splitting to ~1000 characters. Pass in the agent? Stuff like that.
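
For whoever picks up the hardening, here is a minimal sketch of splitting on an approximate token budget rather than a fixed character count. Nothing here comes from the PR: `approxTokenCount` and `splitByTokenBudget` are hypothetical names, and the "about 4 characters per token" heuristic is only a rough rule of thumb for English text that a real tokenizer would replace.

```typescript
// Hypothetical stand-in for a real tokenizer: assumes ~4 characters per token.
function approxTokenCount(text: string): number {
  return Math.ceil(text.length / 4);
}

// Greedily pack whitespace-separated words into chunks whose token count,
// as reported by countTokens, does not exceed maxTokens.
// A single word larger than the budget still becomes its own oversized chunk.
function splitByTokenBudget(
  text: string,
  maxTokens: number,
  countTokens: (s: string) => number = approxTokenCount
): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const word of text.split(/\s+/).filter((w) => w.length > 0)) {
    const candidate = current ? `${current} ${word}` : word;
    if (current && countTokens(candidate) > maxTokens) {
      // Adding this word would overflow the budget: close the current chunk.
      chunks.push(current);
      current = word;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Taking the counter as a parameter means a real tokenizer could be swapped in later without touching the packing loop.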

langchain-examples:start: cache bypass, force executing 625974587a4ddec7
langchain-examples:start: Loaded agent.
langchain-examples:start: Executing with input "Whats the word of the day on https://www.merriam-webster.com/word-of-the-day?"...
langchain-examples:start: Entering new agent_executor chain...
langchain-examples:start: [
langchain-examples:start:   "Answer the following questions as best you can. You have access to the following tools:\n\nsearch: a search engine. useful for when you need to answer questions about current events. input should be a search query.\ncalculator: Useful for getting the result of a math expression. The input to this tool should be a valid mathematical expression that could be executed by a simple calculator.\nweb-browser: a web browser. useful for when you need to find something or summarize a url. input should be a comma seperated list of \"url\",\"what you want to find on the page\".\n\nUse the following format:\n\nQuestion: the input question you must answer\nThought: you should always think about what to do\nAction: the action to take, should be one of [search\ncalculator\nweb-browser]\nAction Input: the input to the action\nObservation: the result of the action\n... (this Thought/Action/Action Input/Observation can repeat N times)\nThought: I now know the final answer\nFinal Answer: the final answer to the original input question\n\nBegin!\n\nQuestion: Whats the word of the day on https://www.merriam-webster.com/word-of-the-day?\nThought:"
langchain-examples:start: ]
langchain-examples:start:  I need to find the word of the day on a website.
langchain-examples:start: Action: web-browser
langchain-examples:start: Action Input: "https://www.merriam-webster.com/word-of-the-day", "word of the day"
langchain-examples:start:  I need to find the word of the day on a website.
langchain-examples:start: Action: web-browser
langchain-examples:start: Action Input: "https://www.merriam-webster.com/word-of-the-day", "word of the day"
langchain-examples:start: [
langchain-examples:start:   "I need  \"word of the day from the following webpage text, also provide up to 5 html links from within that would be of interest. Text:\n Word of the Day: Foible | Merriam-Webster <iframe src=\"https://www.googletagmanager.com/ns.html?id=GTM-WW4KHXF\"     height=\"0\" width=\"0\" style=\"display:none;visibility:hidden\"></iframe> [](https://www.merriam-webster.com/) Merriam-Webster Logo [](javascript:void(0);) Menu Toggle [](javascript:void(0);) [](https://www.merriam-webster.com/) Merriam-Webster Logo Hello, Username [Log In](https://www.merriam-webster.com/login) [Sign Up](https://www.merriam-webster.com/register) [](javascript:void(0);) Username [](https://www.merriam-webster.com/saved-words) My Words [](https://www.merriam-webster.com/recents) Recents [](https://www.merriam-webster.com/settings) Settings [](https://www.merriam-webster.com/logout) Log Out [Games & Quizzes](https://www.merriam-webster.com/games) [Thesaurus](https://www.merriam-webster.com/thesaurus) [Features](https://www.merriam-webster.com/words-at-play) [Word of the Day](https://www.merriam-webster.com/word-of-the-day) [Shop](https://shop.merriam-webster.c"
langchain-examples:start: ]
langchain-examples:start: 
langchain-examples:start: 
langchain-examples:start: Word of the Day: Foible 
langchain-examples:start: 
langchain-examples:start: HTML Links: 
langchain-examples:start: 1. https://www.merriam-webster.com/
langchain-examples:start: 2. https://www.merriam-webster.com/login
langchain-examples:start: 3. https://www.merriam-webster.com/register
langchain-examples:start: 4. https://www.merriam-webster.com/saved-words
langchain-examples:start: 5. https://www.merriam-webster.com/recents
langchain-examples:start: 
langchain-examples:start: 
langchain-examples:start: Word of the Day: Foible 
langchain-examples:start: 
langchain-examples:start: HTML Links: 
langchain-examples:start: 1. https://www.merriam-webster.com/
langchain-examples:start: 2. https://www.merriam-webster.com/login
langchain-examples:start: 3. https://www.merriam-webster.com/register
langchain-examples:start: 4. https://www.merriam-webster.com/saved-words
langchain-examples:start: 5. https://www.merriam-webster.com/recents
langchain-examples:start: [
langchain-examples:start:   "Answer the following questions as best you can. You have access to the following tools:\n\nsearch: a search engine. useful for when you need to answer questions about current events. input should be a search query.\ncalculator: Useful for getting the result of a math expression. The input to this tool should be a valid mathematical expression that could be executed by a simple calculator.\nweb-browser: a web browser. useful for when you need to find something or summarize a url. input should be a comma seperated list of \"url\",\"what you want to find on the page\".\n\nUse the following format:\n\nQuestion: the input question you must answer\nThought: you should always think about what to do\nAction: the action to take, should be one of [search\ncalculator\nweb-browser]\nAction Input: the input to the action\nObservation: the result of the action\n... (this Thought/Action/Action Input/Observation can repeat N times)\nThought: I now know the final answer\nFinal Answer: the final answer to the original input question\n\nBegin!\n\nQuestion: Whats the word of the day on https://www.merriam-webster.com/word-of-the-day?\nThought: I need to find the word of the day on a website.\nAction: web-browser\nAction Input: \"https://www.merriam-webster.com/word-of-the-day\", \"word of the day\"\nObservation: \n\nWord of the Day: Foible \n\nHTML Links: \n1. https://www.merriam-webster.com/\n2. https://www.merriam-webster.com/login\n3. https://www.merriam-webster.com/register\n4. https://www.merriam-webster.com/saved-words\n5. https://www.merriam-webster.com/recents\nThought:"
langchain-examples:start: ]
langchain-examples:start:  I now know the final answer
langchain-examples:start: Final Answer: The word of the day on https://www.merriam-webster.com/word-of-the-day is Foible.
langchain-examples:start: Finished chain.
langchain-examples:start: Got output The word of the day on https://www.merriam-webster.com/word-of-the-day is Foible.
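
The tool description in the log above asks for input shaped like `"url","what you want to find on the page"`. A minimal sketch of parsing that shape, assuming the first comma separates the two parts (so a URL containing a comma would be split incorrectly). `parseBrowserInput` is a hypothetical helper, not the PR's code.

```typescript
// Hypothetical helper: split `"url","task"` into its two parts.
function parseBrowserInput(input: string): { url: string; task: string } {
  const commaIndex = input.indexOf(",");
  const rawUrl = commaIndex === -1 ? input : input.slice(0, commaIndex);
  const rawTask = commaIndex === -1 ? "" : input.slice(commaIndex + 1);

  // Drop surrounding whitespace and any wrapping quote characters.
  const strip = (s: string): string => s.trim().replace(/^["']+|["']+$/g, "");

  const url = strip(rawUrl);
  // The URL constructor throws on input without a protocol, e.g. "example.com".
  new URL(url);
  return { url, task: strip(rawTask) };
}
```

An empty `task` could be treated as a request for a plain summary of the page; that convention is an assumption here too.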

@jacobrosenthal
Contributor Author

jacobrosenthal commented Apr 10, 2023

The only dep I've seen that might make sense is https://github.com/mixmark-io/turndown, meh.
Could turn HTML to markdown with cheerio, which is basically what I'm doing. No idea what its deps are, though.

Also ChatGPT seems to think there's a way to do CSRF stuff to get around the 403.
Worked, we're on axios now, but not fetch yet.

@jacobrosenthal
Contributor Author

Preliminary tokenizer instead of the character splitter; presumably we need to take that in as an arg to the browser.

Also played with the CSRF and headers and found a set of headers that works in axios for the one site I remember that 403'd me, so we're not on puppeteer at least. No idea if there are shims for axios for serverless, or if we can use headers with fetch instead.
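
On the "can we use headers with fetch instead" question, a minimal sketch of sending browser-like headers through the global `fetch` (available in Node 18+ and in browsers). The header values below are illustrative assumptions, not the set that fixed the 403 in this PR, and `HeaderMap`, `buildHeaders`, and `fetchHtml` are hypothetical names.

```typescript
type HeaderMap = Record<string, string>;

// Illustrative browser-like defaults. These values are assumptions,
// not the headers the PR settled on.
const DEFAULT_HEADERS: HeaderMap = {
  Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "Accept-Language": "en-US,en;q=0.5",
  "User-Agent":
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0",
};

// Caller-supplied headers win over the defaults.
function buildHeaders(overrides: HeaderMap = {}): HeaderMap {
  return { ...DEFAULT_HEADERS, ...overrides };
}

// Fetch a page as text, failing loudly on non-2xx responses such as 403.
async function fetchHtml(url: string, headers: HeaderMap = {}): Promise<string> {
  const response = await fetch(url, { headers: buildHeaders(headers) });
  if (!response.ok) {
    throw new Error(`HTTP ${response.status} for ${url}`);
  }
  return response.text();
}
```

Whether a given site accepts these headers is something only testing against that site can show.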

@jacobrosenthal
Contributor Author

jacobrosenthal commented Apr 11, 2023

Took a look at turndown.

    import * as cheerio from "cheerio";
    import TurndownService from "turndown";

    // htmlResponse (presumably the axios response) and baseUrl (the page URL)
    // come from the surrounding tool code.
    // turndown does a bad job on some elements, so strip them first
    const $ = cheerio.load(htmlResponse.data);
    $("style, script, svg").remove();
    $("a, img").each((_i, el) => {
      const $el = $(el);
      const attribute = $el.is("a") ? "href" : "src";
      const relativeUrl = $el.attr(attribute);

      // turndown doesn't seem to resolve relative URLs, so make them absolute
      if (relativeUrl && !relativeUrl.startsWith("http")) {
        const absoluteUrl = new URL(relativeUrl, baseUrl).toString();
        $el.attr(attribute, absoluteUrl);
      }
    });
    const strippedHtml = $.html();

    const turndownService = new TurndownService();
    let text = turndownService.turndown(strippedHtml);
    // we don't want newlines or extra whitespace, for token concerns
    text = text.split("\n").map((line) => line.trim()).join(" ");

It seemed I still had to do HTML cleaning so it wouldn't attempt style and script tags, and strip whitespace after the fact since markdown is whitespace-based. So turndown might be a bit more costly, if it matters.
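
The two dependency-free pieces of the snippet above, URL resolution and whitespace collapsing, can be sketched in isolation. `toAbsoluteUrl` and `collapseWhitespace` are hypothetical names; the only API relied on is the standard `URL` constructor, which resolves a relative reference against a base and leaves an already absolute URL unchanged.

```typescript
// Resolve a possibly relative href/src against the page URL.
function toAbsoluteUrl(maybeRelative: string, baseUrl: string): string {
  return new URL(maybeRelative, baseUrl).toString();
}

// Trim each line, drop empty lines, and join with single spaces so that
// layout whitespace does not cost tokens.
function collapseWhitespace(text: string): string {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .join(" ");
}

const base = "https://www.merriam-webster.com/word-of-the-day";
console.log(toAbsoluteUrl("/login", base));
// prints "https://www.merriam-webster.com/login"
console.log(collapseWhitespace("  Word of the Day  \n\n  Foible \n"));
// prints "Word of the Day Foible"
```

Unlike the snippet above, `collapseWhitespace` also drops empty lines, so blank lines in the markdown do not turn into runs of spaces.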

As far as deps, people have swapped out the node DOM dependency:
mixmark-io/turndown#390

Honestly, results are pretty similar. We get headings, which I do think might help semantically, but they waste more tokens... I think it would be nicer for, say, ingesting documents.

I probably won't PR this, but I'll leave these notes here for whoever comes after.

output-me.md
output-turndown.md

Only thing left on my list is to pass in the tokenizer. That has a decent amount of abstraction, so I'll let you all look at it.

@jacobrosenthal
Contributor Author

A silly idea: instead of cutting off content at the context token limit, why not check if it's bigger than max tokens, and if so text-split it and use an in-memory vector search? hnswlib requires an install, right? We don't want the dep... I could do it manually...
Is it worth embedding 2-10 times?
Would we be OK with a math.js dep?
mathjs.dot(embedding, document.embedding)
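
For a sense of what "doing it manually" might involve, a minimal sketch of an in-memory top-k search by cosine similarity using plain arrays, with no math.js or hnswlib. The `EmbeddedChunk` shape and the function names are hypothetical, and the embeddings in the usage lines are toy 2-dimensional vectors, not real model output.

```typescript
interface EmbeddedChunk {
  text: string;
  embedding: number[];
}

function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimension mismatch");
  return a.reduce((sum, value, i) => sum + value * b[i], 0);
}

// Cosine similarity: dot product divided by the product of the vector norms.
function cosineSimilarity(a: number[], b: number[]): number {
  const denominator = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
  return denominator === 0 ? 0 : dot(a, b) / denominator;
}

// Linear scan: score every chunk against the query and keep the k best.
function topK(query: number[], chunks: EmbeddedChunk[], k: number): EmbeddedChunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map(({ chunk }) => chunk);
}

const toyChunks: EmbeddedChunk[] = [
  { text: "word of the day", embedding: [1, 0] },
  { text: "site navigation", embedding: [0, 1] },
  { text: "word definition", embedding: [0.9, 0.1] },
];
console.log(topK([1, 0], toyChunks, 2).map((c) => c.text));
// prints [ 'word of the day', 'word definition' ]
```

A linear scan like this is O(n) per query, which seems reasonable for the handful of chunks a single page produces.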

@jacobrosenthal
Contributor Author

jacobrosenthal commented Apr 11, 2023

I'm using (basically) this to ingest some pages. I'm through about 1000 URLs and am catching more errors and things, so it's starting to be hardened. I haven't viewed all 1000, but they've largely looked OK scrolling by.

@jacobrosenthal
Contributor Author

I swapped the token concat for the in-memory vector store from #753.
Presumably we could pass in embeddings, and/or the entire summarizer chain and prompt.

});
const texts = await textSplitter.splitText(text);

// if we have a summary grab first 4
Collaborator

nit: if we don't have a summary

langchain/src/agents/tools/planner-gpt.ts Outdated
@jacobrosenthal
Contributor Author

OK, grabbed ~300 URLs from a set I have. fetch fails on about 5% of them; I think axios doesn't. Added tests to show some of the cases I've seen.

Not sure what to do. axios is better, but I strongly support the serverless case.

I think the best option is to keep axios but use a fetch adapter for platforms where it doesn't exist?

Different behavior on different platforms, but what else to do?

We can skip a lot of these tests afterwards or comment them out; I just want to show the differences.
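
The "keep axios but use a fetch adapter elsewhere" idea comes down to a runtime check. A minimal sketch of that decision, assuming Node is detected via `process.versions.node`; how the merged code actually detects the platform is not shown in this thread, and `isNodeLike`, `pickTransport`, and the `Transport` labels are hypothetical.

```typescript
type Transport = "axios-default" | "fetch-adapter";

// Minimal shape of the globals this check looks at.
interface RuntimeGlobals {
  process?: { versions?: { node?: string } };
}

// Treat the runtime as Node when process.versions.node is a string.
function isNodeLike(globals: RuntimeGlobals): boolean {
  return typeof globals.process?.versions?.node === "string";
}

// Node keeps axios' default transport; other runtimes get a fetch-based adapter.
function pickTransport(globals: RuntimeGlobals): Transport {
  return isNodeLike(globals) ? "axios-default" : "fetch-adapter";
}
```

Passing the globals in as an argument, rather than reading `globalThis` directly, keeps the decision testable on any platform.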

…rg and default to using fetch adapter outside of node

- Also fix an issue in fetch adapater producing a headers interface that conforms to axios schema
@nfcampos nfcampos merged commit f913496 into langchain-ai:main Apr 14, 2023
RohitMidha23 pushed a commit to RohitMidha23/langchainjs that referenced this pull request Apr 18, 2023
* feat: webbrowser tool

* fix: lint

* refactor: webbrowser splits on tokens not characters

* refactor: webbrowser swap axios with headers for puppeteer

* refactor: webbrowser take headers object in constructor

* chore: webbrowser test fail to use fetch

* refactor: webbrowser default headers

* feat: webbrowser fix lost htmlResponse.status check

* fix: webbrowser dont use br header tag, axios doesnt seem to support

* feat: webbrowser catch errors

* feat: webbrowser only bother parsing utf8 text data

* feat: webbrowser remove noscript tags from selector

* feat: webbrowser remove hack for 404, i think chatgpt can handle it

* feat: webbrowser trim more whitespace

* feat: MemoryVectorStore using mathjs cos

* feat: lint and export inmemoryvectorstore

* feat: webbrowser swap token truncation for an in memory vector store

* refactor: webbrowser

* clean up garbage files

* fix: webbrowser example broken paths

* refactor: webbrowser cleanup example

* chore: webbrowser make search harder

* feat: webbrowser requires url with protocol

* chore: webbrowser serpapi without location is non deterministic

* chore: webbrowser serpapi without location is non deterministic

* feat: webbrowser clean inputs and move request to bottom of prompt

* fix: webbrowser summary tests and fix

* fix: webbrowser example serpapi add gl and hl

* fix: webbrowser support noscript tags

* chore: webbrowser more fiddling

* Remove memory vector store added in separate pr

* Update import paths, split test into unit and integration test

* Move fixture

* Add separate entrypoint because of dependencies

* Sort out the args

* Fix example

* chore: webbrowser update tests

* chore: webbrowser update tests

* feat: webbrowser swap axios for fetch, tests fail

* Revert "feat: webbrowser swap axios for fetch, tests fail"

This reverts commit 261a2ee.

* Downgrade axios to same version used by openai sdk, add axiosConfig arg and default to using fetch adapter outside of node

- Also fix an issue in fetch adapater producing a headers interface that conforms to axios schema

* Fix side effect

* Adjust the prompt for higher chance of providing links with both url and label

* Add docs

---------

Co-authored-by: Nuno Campos <nuno@boringbits.io>