feat: webbrowser tool #714
Switched to a preliminary tokenizer instead of the character splitter; presumably we need to take that in as an arg to the browser. I also played with the CSRF and headers and found a set of headers that works in axios for the one site I remember that 403'd me, so we're not using puppeteer, at least. No idea if there are shims for axios for serverless, or if we can use headers with fetch instead.
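A rough sketch of the "default headers plus caller overrides" idea, for anyone picking this up. The header values below are placeholders, not the set that actually fixed the 403 (that set is not quoted in this thread), and `DEFAULT_HEADERS` and `buildHeaders` are hypothetical names. The only detail taken from the PR is leaving `br` out of `Accept-Encoding`.

```typescript
// Hypothetical defaults: browser-like request headers. The real set used in
// the PR is not shown in this thread; these values are illustrative only.
const DEFAULT_HEADERS: Record<string, string> = {
  "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  // No "br" here: per the commit notes, axios did not seem to support it.
  "Accept-Encoding": "gzip, deflate",
  "Accept-Language": "en-US,en;q=0.5",
  "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0",
};

// Merge caller overrides over the defaults. Header names are compared
// case-insensitively so "user-agent" replaces "User-Agent" instead of
// producing two entries.
function buildHeaders(overrides: Record<string, string> = {}): Record<string, string> {
  const byLowerName: Record<string, { name: string; value: string }> = {};
  for (const source of [DEFAULT_HEADERS, overrides]) {
    for (const name of Object.keys(source)) {
      byLowerName[name.toLowerCase()] = { name, value: source[name] };
    }
  }
  const merged: Record<string, string> = {};
  for (const key of Object.keys(byLowerName)) {
    merged[byLowerName[key].name] = byLowerName[key].value;
  }
  return merged;
}
```

The resulting plain object should be usable as the `headers` option of either axios or `fetch`, which would keep the transport choice separate from the header choice.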
Took a look at turndown.
It seemed I still had to do HTML cleaning so it wouldn't attempt style and script tags, plus whitespace cleanup after the fact, since markdown is whitespace-based. So turndown might be a bit more costly, if it matters. As far as deps go, people have swapped out the node DOM dependency. Honestly, results are pretty similar. We get headings, which I do think might help semantically, but they waste more tokens... I think it would be nicer for, say, ingesting documents. I probably won't PR this, but I'll leave these notes here for whoever comes after: output-me.md. The only thing left on my list is to pass in the tokenizer. That has a decent amount of abstraction, so I'll let you all look at it.
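To make the cleaning step concrete, here is a dependency-free approximation. This is not what the PR does (the PR uses cheerio and keeps links); it is a regex sketch, and regexes are not a robust HTML parser. `htmlToText` is a hypothetical name.

```typescript
// Regex-based approximation of the cleaning step. Assumption: the input is
// reasonably well-formed HTML. A real implementation would use a parser.
function htmlToText(html: string): string {
  return html
    // Drop <script> and <style> elements together with their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ")
    // Drop HTML comments.
    .replace(/<!--[\s\S]*?-->/g, " ")
    // Drop the remaining tags but keep the text between them.
    .replace(/<[^>]+>/g, " ")
    // Collapse runs of whitespace, which matters for token counts later.
    .replace(/\s+/g, " ")
    .trim();
}
```

Unlike the cheerio version, this sketch throws away link targets, so it only illustrates the script/style removal and whitespace trimming.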
A silly idea: instead of cutting off content at the context token limit, why not check if it's bigger than max tokens and, if so, split the text and use an in-memory vector search? hnswlib requires an install, right? We don't want the dep... I could do it manually...
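Doing it manually could look roughly like the sketch below: embed each chunk, score it against the query with cosine similarity, and keep the best `k`. The `toyEmbed` function is a stand-in that counts vocabulary words, used only so the example runs without a model; a real setup would call an embedding model instead. All names here are hypothetical.

```typescript
type Vector = number[];

// Cosine similarity of two equal-length vectors; 0 if either is all zeros.
function cosineSimilarity(a: Vector, b: Vector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy stand-in for an embedding model: one dimension per vocabulary word,
// valued by how often that word occurs in the text.
function toyEmbed(text: string, vocabulary: string[]): Vector {
  const words = text.toLowerCase().split(/\W+/);
  return vocabulary.map((term) => words.filter((w) => w === term).length);
}

// Brute-force search: score every chunk against the query, return the top k.
function topK(
  query: string,
  chunks: string[],
  k: number,
  embed: (text: string) => Vector
): string[] {
  const queryVector = embed(query);
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(queryVector, embed(chunk)) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((result) => result.chunk);
}
```

A linear scan like this is probably fine for a single page's worth of chunks, which is the case here; an index such as hnswlib only starts to pay off at much larger sizes.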
I'm using (basically) this to ingest some pages. I'm through about 1000 URLs and am catching more errors and edge cases, so it's starting to be hardened. I haven't viewed all 1000, but they've largely looked OK scrolling by.
I swapped the token concat for the in-memory vector store from #753.
});
const texts = await textSplitter.splitText(text);

// if we have a summary grab first 4
nit: if we don't have a summary
OK, grabbed ~300 URLs from a set I have. fetch fails on about 5%, I think; axios doesn't. Added tests to show some of the cases I've seen. Not sure what to do: axios is better, but I strongly support the serverless case. I think the best option is to keep axios but use a fetch adapter for platforms where it doesn't exist? Different behavior on different platforms, but what else to do? We can skip a lot of these tests afterwards or comment them out; I just want to show the differences.
* feat: webbrowser tool
* fix: lint
* refactor: webbrowser splits on tokens not characters
* refactor: webbrowser swap axios with headers for puppeteer
* refactor: webbrowser take headers object in constructor
* chore: webbrowser test fail to use fetch
* refactor: webbrowser default headers
* feat: webbrowser fix lost htmlResponse.status check
* fix: webbrowser dont use br header tag, axios doesnt seem to support
* feat: webbrowser catch errors
* feat: webbrowser only bother parsing utf8 text data
* feat: webbrowser remove noscript tags from selector
* feat: webbrowser remove hack for 404, i think chatgpt can handle it
* feat: webbrowser trim more whitespace
* feat: MemoryVectorStore using mathjs cos
* feat: lint and export inmemoryvectorstore
* feat: webbrowser swap token truncation for an in memory vector store
* refactor: webbrowser
* clean up garbage files
* fix: webbrowser example broken paths
* refactor: webbrowser cleanup example
* chore: webbrowser make search harder
* feat: webbrowser requires url with protocol
* chore: webbrowser serpapi without location is non deterministic
* chore: webbrowser serpapi without location is non deterministic
* feat: webbrowser clean inputs and move request to bottom of prompt
* fix: webbrowser summary tests and fix
* fix: webbrowser example serpapi add gl and hl
* fix: webbrowser support noscript tags
* chore: webbrowser more fiddling
* Remove memory vector store added in separate pr
* Update import paths, split test into unit and integration test
* Move fixture
* Add separate entrypoint because of dependencies
* Sort out the args
* Fix example
* chore: webbrowser update tests
* chore: webbrowser update tests
* feat: webbrowser swap axios for fetch, tests fail
* Revert "feat: webbrowser swap axios for fetch, tests fail"
  This reverts commit 261a2ee.
* Downgrade axios to same version used by openai sdk, add axiosConfig arg and default to using fetch adapter outside of node
  - Also fix an issue in fetch adapater producing a headers interface that conforms to axios schema
* Fix side effect
* Adjust the prompt for higher chance of providing links with both url and label
* Add docs

---------

Co-authored-by: Nuno Campos <nuno@boringbits.io>
Maybe something better out there exists? I didn't see it in the Python repo either, but on the way towards an AutoGPT-like agent I wanted to play with one.
The page is trimmed using cheerio to have just text and HTML links, then cut to 3000 tokens, then summarized by a chain so it doesn't bloat the agent's context space.
I'm not sure it'll get merged. I know people are trying to trim deps so we can use langchain as a pure JS experience, and I had to fall back to puppeteer because I couldn't get axios headers/CSRF/cookies/whatever to work. I left a test so maybe someone can help. But this works often enough and is kinda cool.
It would also need to be hardened: probably actually count tokens instead of splitting to ~1000 characters, maybe pass in the agent, stuff like that.
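For the "count tokens instead of characters" part, a minimal sketch is below. It assumes the tokenizer is passed in as a function; the default `whitespaceTokenizer` is only a stand-in so the example runs on its own, and it will not match any model's real token counts. `splitByTokens` and `Tokenizer` are hypothetical names, not part of langchain.

```typescript
// A tokenizer maps text to a list of tokens. Assumption: a real model
// tokenizer would be injected here by the caller.
type Tokenizer = (text: string) => string[];

// Stand-in tokenizer: whitespace-delimited words.
const whitespaceTokenizer: Tokenizer = (text) =>
  text.split(/\s+/).filter((token) => token.length > 0);

// Split text into chunks of at most maxTokens tokens each.
// Chunks are re-joined with single spaces, which is only faithful for the
// whitespace stand-in; a real tokenizer would need its own decode step.
function splitByTokens(
  text: string,
  maxTokens: number,
  tokenize: Tokenizer = whitespaceTokenizer
): string[] {
  if (maxTokens <= 0) throw new Error("maxTokens must be positive");
  const tokens = tokenize(text);
  const chunks: string[] = [];
  for (let start = 0; start < tokens.length; start += maxTokens) {
    chunks.push(tokens.slice(start, start + maxTokens).join(" "));
  }
  return chunks;
}
```

Taking the tokenizer as an argument keeps the splitting logic independent of which model the agent ends up using.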