diff --git a/README.md b/README.md index e73d8ea..fa41ae6 100644 --- a/README.md +++ b/README.md @@ -60,6 +60,7 @@ For more examples, you may also find our [Blog](https://haystack.deepset.ai/blog | RAG Evaluation with Prometheus 2 | Open In Colab| | Build quizzes and adventures with Character Codex and llamafile | Open In Colab| | Invoking APIs with `OpenAPITool` | Open In Colab| +| RAG Web Search and Analysis with Apify and Haystack | Open In Colab| | Extract and use website content for RAG with Apify | Open In Colab| | Analyze Your Instagram Comments’ Vibe with Apify and Haystack | Open In Colab| | Conversational RAG using Memory | Open In Colab| diff --git a/index.toml b/index.toml index da35a27..5447f44 100644 --- a/index.toml +++ b/index.toml @@ -13,6 +13,11 @@ title = "Question Answering with Amazon Sagemaker, Chroma and Haystack" notebook = "amazon_sagemaker_and_chroma_for_qa.ipynb" topics = ["RAG"] +[[cookbook]] +title = "RAG: Web Search and Analysis with Apify and Haystack" +notebook = "apify_haystack_rag_web_browser.ipynb" +topics = ["Web Search", "Analysis", "RAG"] + [[cookbook]] title = "RAG: Extract and use website content for question answering with Apify-Haystack integration" notebook = "apify_haystack_rag.ipynb" diff --git a/notebooks/apify_haystack_rag_web_browser.ipynb b/notebooks/apify_haystack_rag_web_browser.ipynb new file mode 100644 index 0000000..ac3a5e7 --- /dev/null +++ b/notebooks/apify_haystack_rag_web_browser.ipynb @@ -0,0 +1,479 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "t1BeKtSo7KzI" + }, + "source": [ + "# Search and browse the web with Apify and Haystack\n", + "\n", + "Want to give any of your LLM applications the power to search and browse the web? In this cookbook, we'll show you how to use the [RAG Web Browser Actor](https://apify.com/apify/rag-web-browser) to search Google and extract content from web pages, then analyze the results using a large language model - all within the Haystack ecosystem using the apify-haystack integration.\n", + "\n", + "This cookbook also demonstrates how to leverage the RAG Web Browser Actor with Haystack to create powerful web-aware applications. We'll explore multiple use cases showing how easy it is to:\n", + "\n", + "1. [Search interesting topics](#search-interesting-topics)\n", + "2. [Analyze the results with OpenAIGenerator](#analyze-the-results-with-openaigenerator)\n", + "3. [Use the Haystack Pipeline for web search and analysis](#use-the-haystack-pipeline-for-web-search-and-analysis)\n", + " \n", + "**We'll start by using the RAG Web Browser Actor to perform web searches and then use the OpenAIGenerator to analyze and summarize the web content**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-7zY6NIsCj_5" + }, + "source": [ + "## Install dependencies" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "collapsed": true, + "id": "r5AJeMOE1Cou", + "outputId": "63663073-ccc5-4306-ae18-e2720d937407" + }, + "outputs": [], + "source": [ + "!pip install apify-haystack==0.1.4 haystack-ai" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "h6MmIG9K1HkK" + }, + "source": [ + "## Set up the API keys\n", + "\n", + "You need to have an Apify account and obtain [APIFY_API_TOKEN](https://docs.apify.com/platform/integrations/api).\n", + "\n", + "You also need an OpenAI account and [OPENAI_API_KEY](https://platform.openai.com/docs/quickstart)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "yiUTwYzP36Yr", + "outputId": "d79acadc-bd18-44d3-c812-9b40c51d5124" + }, + "outputs": [], + "source": [ + "import os\n", + "from getpass import getpass\n", + "\n", + "os.environ[\"APIFY_API_TOKEN\"] = getpass(\"Enter YOUR APIFY_API_TOKEN\")\n", + "os.environ[\"OPENAI_API_KEY\"] = getpass(\"Enter YOUR OPENAI_API_KEY\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HQzAujMc505k" + }, + "source": [ + "## Search interesting topics\n", + "\n", + "First, we need to understand the output format of the [RAG Web Browser Actor](https://apify.com/apify/rag-web-browser). \n", + "\n", + "For query: `rag web browser` it returns data in the following format:\n", + "\n", + "```json\n", + "[\n", + " {\n", + " \"crawl\": {\n", + " \"httpStatusCode\": 200,\n", + " \"httpStatusMessage\": \"OK\",\n", + " \"loadedAt\": \"2024-11-25T21:23:58.336Z\",\n", + " \"uniqueKey\": \"eM0RDxDQ3q\",\n", + " \"requestStatus\": \"handled\"\n", + " },\n", + " \"searchResult\": {\n", + " \"title\": \"apify/rag-web-browser\",\n", + " \"description\": \"Sep 2, 2024 — The RAG Web Browser is designed for Large Language Model (LLM) applications ...\",\n", + " \"url\": \"https://github.com/apify/rag-web-browser\"\n", + " },\n", + " \"metadata\": {\n", + " \"title\": \"GitHub - apify/rag-web-browser: RAG Web Browser is an Apify Actor to feed your LLM applications ...\",\n", + " \"description\": \"RAG Web Browser is an Apify Actor to feed your LLM applications ...\",\n", + " \"languageCode\": \"en\",\n", + " \"url\": \"https://github.com/apify/rag-web-browser\"\n", + " },\n", + " \"markdown\": \"# apify/rag-web-browser: RAG Web Browser is an Apify Actor ...\"\n", + " }\n", + "]\n", + "```\n", + "\n", + "We will convert this JSON to a Haystack Document using the `dataset_mapping_function` as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "id": "OZ0PAVHI_mhn" + }, + "outputs": [], + "source": [ + "from haystack import Document\n", + "\n", + "def dataset_mapping_function(dataset_item: dict) -> Document:\n", + " return Document(\n", + " content=dataset_item.get(\"markdown\"),\n", + " meta={\n", + " \"title\": dataset_item.get(\"metadata\", {}).get(\"title\"),\n", + " \"url\": dataset_item.get(\"metadata\", {}).get(\"url\"),\n", + " \"language\": dataset_item.get(\"metadata\", {}).get(\"languageCode\")\n", + " }\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xtFquWflA5kf" + }, + "source": [ + "Now set up the `ApifyDatasetFromActorCall` component: " + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "SUWXxT4y55lH" + }, + "outputs": [], + "source": [ + "from apify_haystack import ApifyDatasetFromActorCall\n", + "\n", + "document_loader = ApifyDatasetFromActorCall(\n", + " actor_id=\"apify/rag-web-browser\",\n", + " run_input={\n", + " \"maxResults\": 2,\n", + " \"outputFormats\": [\"markdown\"],\n", + " \"requestTimeoutSecs\": 30\n", + " },\n", + " dataset_mapping_function=dataset_mapping_function,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Check out other `run_input` parameters at [Github for the RAG web browser](https://github.com/apify/rag-web-browser?tab=readme-ov-file#query-parameters).\n", + "\n", + "Note that you can also manualy set your API key as a named parameter `apify_api_token` in the constructor, if not set as environment variable." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GxDNZ7LqAsWV" + }, + "source": [ + "### Run the Actor and fetch results\n", + "\n", + "Let's run the Actor with a sample query and fetch the results. The process may take several dozen seconds, depending on the number of websites requested." + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": { + "id": "yiUTwYzP36Yr" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Title: 7 Recent AI Developments: Artificial Intelligence News\n", + "Truncated content: 7 Recent AI Developments: Artificial Intelligence NewsContact phone +1-888-840-3252Koombea [Skip to Content](#maincontent)\n", + "\n", + "[+1-888-840-3252](tel:+18888403252)\n", + "\n", + "[](https://www.koombea.com)\n", + "\n", + "[get in touch](/contact/)\n", + "\n", + "HiTech\n", + "\n", + "9 minutes read\n", + "\n", + "# 7 Recen ...\n", + "---\n", + "Title: Artificial Intelligence News -- ScienceDaily\n", + "Truncated content: Artificial Intelligence News -- ScienceDaily\n", + "\n", + "[Skip to main content](#main)\n", + "\n", + "[![ScienceDaily](/images/sd-logo.png)](/ \"ScienceDaily\")\n", + "\n", + "* * *\n", + "\n", + "Your source for the latest research news\n", + "\n", + "[Follow:](#) [_Facebook_](https://www.facebook.com/sciencedaily) [ ...\n", + "---\n" + ] + } + ], + "source": [ + "query = \"Artificial intelligence latest developments\"\n", + "\n", + "# Load the documents and extract the list of document\n", + "result = document_loader.run(run_input={\"query\": query})\n", + "documents = result.get(\"documents\", [])\n", + "\n", + "for doc in documents:\n", + " print(f\"Title: {doc.meta['title']}\")\n", + " print(f\"Truncated content: \\n {doc.content[:100]} ...\")\n", + " print(\"---\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "45YxSr6v__fI" + }, + "source": [ + "## Analyze the results with OpenAIGenerator\n", + "\n", + "Use the OpenAIGenerator to analyze and summarize the web content." + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "id": "yiUTwYzP36Yr" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Summary for 7 Recent AI Developments: Artificial Intelligence News available from https://www.koombea.com/blog/7-recent-ai-developments/: \n", + "AI is making waves in various industries, from robotics to healthcare to brewing beer. As technology advances, AI is beginning to take on more complex tasks and solve unique problems. The implications for AI in app development are also profound, making apps smarter, more efficient, and user-friendly. Stay tuned for even more groundbreaking developments in the world of AI.\n", + "\n", + "Summary for Artificial Intelligence News -- ScienceDaily available from https://www.sciencedaily.com/news/computers_math/artificial_intelligence/: \n", + "[Machine Psychology: A Bridge to General AI?](https://www.sciencedaily.com/releases/2024/12/241219190259.htm)\n", + "\n", + "[Scientists Create AI That 'Watches' Videos by Mimicking the Brain](https://www.sciencedaily.com/releases/2024/12/241209163200.htm)\n", + "\n", + "[Bird-Inspired Drone Can Jump for Take-Off](https://www.sciencedaily.com/releases/2024/12/241206111951.htm)\n", + "\n", + "[Robot That Watched Surgery Videos Performs With Skill of Human Doctor, Researchers Report](https://www.sciencedaily.com/releases/2024/11/241111123037.htm)\n", + "\n", + "Summary for 8 AI and machine learning trends to watch in 2025 | TechTarget available from https://www.techtarget.com/searchenterpriseai/tip/9-top-AI-and-machine-learning-trends: \n", + "The article discusses eight AI and machine learning trends to watch in 2025:\n", + "\n", + "1. Pragmatic approaches to generative AI\n", + "2. Expansion of generative AI beyond chatbots\n", + "3. Rise of AI agents capable of independent action\n", + "4. Evolution of generative AI models into commodities\n", + "5. Domain-specific AI applications and data sets\n", + "6. Importance of AI literacy for everyone\n", + "7. Adaptation to an evolving regulatory environment\n", + "8. Escalation of AI-related security concerns\n", + "\n", + "These trends reflect the current state and future direction of AI and machine learning technologies as they continue to impact various industries and sectors.\n", + "\n" + ] + } + ], + "source": [ + "from haystack.components.generators import OpenAIGenerator\n", + "\n", + "generator = OpenAIGenerator(model=\"gpt-3.5-turbo\")\n", + "\n", + "for doc in documents:\n", + " result = generator.run(prompt=doc.content)\n", + " summary = result[\"replies\"][0] # Accessing the generated text\n", + " print(f\"Summary for {doc.meta.get('title')} available from {doc.meta.get('url')}: \\n{summary}\\n ---\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Use the Haystack Pipeline for web search and analysis\n", + "\n", + "Now let's create a more sophisticated pipeline that can handle different types of content and generate specialized analyses. We'll create a pipeline that:\n", + "\n", + "1. Searches the web using RAG Web Browser\n", + "2. Cleans and filters the documents\n", + "3. Routes them based on content type\n", + "4. Generates customized summaries for different types of content\n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "from haystack import Pipeline\n", + "from haystack.components.preprocessors import DocumentCleaner\n", + "from haystack.components.builders import PromptBuilder\n", + "\n", + "# Improved dataset_mapping_function with truncation of the content\n", + "def dataset_mapping_function(dataset_item: dict) -> Document:\n", + " max_chars = 10000\n", + " content = dataset_item.get(\"markdown\", \"\")\n", + " return Document(\n", + " content=content[:max_chars], \n", + " meta={\n", + " \"title\": dataset_item.get(\"metadata\", {}).get(\"title\"),\n", + " \"url\": dataset_item.get(\"metadata\", {}).get(\"url\"),\n", + " \"language\": dataset_item.get(\"metadata\", {}).get(\"languageCode\")\n", + " }\n", + " )\n", + " \n", + "def create_pipeline(query: str) -> Pipeline:\n", + "\n", + " document_loader = ApifyDatasetFromActorCall(\n", + " actor_id=\"apify/rag-web-browser\",\n", + " run_input={\n", + " \"query\": query,\n", + " \"maxResults\": 2,\n", + " \"outputFormats\": [\"markdown\"]\n", + " },\n", + " dataset_mapping_function=dataset_mapping_function,\n", + " )\n", + "\n", + " cleaner = DocumentCleaner(\n", + " remove_empty_lines=True,\n", + " remove_extra_whitespaces=True,\n", + " remove_repeated_substrings=True\n", + " )\n", + "\n", + " prompt_template = \"\"\"\n", + " Analyze the following content and provide:\n", + " 1. Key points and findings\n", + " 2. Practical implications\n", + " 3. Notable conclusions\n", + " Be concise.\n", + "\n", + " Context:\n", + " {% for document in documents %}\n", + " {{ document.content }}\n", + " {% endfor %}\n", + "\n", + " Analysis:\n", + " \"\"\"\n", + "\n", + " prompt_builder = PromptBuilder(template=prompt_template)\n", + "\n", + " generator = OpenAIGenerator(model=\"gpt-3.5-turbo\")\n", + "\n", + " pipe = Pipeline()\n", + " pipe.add_component(\"loader\", document_loader)\n", + " pipe.add_component(\"cleaner\", cleaner)\n", + " pipe.add_component(\"prompt_builder\", prompt_builder)\n", + " pipe.add_component(\"generator\", generator)\n", + "\n", + " pipe.connect(\"loader\", \"cleaner\")\n", + " pipe.connect(\"cleaner\", \"prompt_builder\")\n", + " pipe.connect(\"prompt_builder\", \"generator\")\n", + "\n", + " return pipe\n", + "\n", + "# Function to run the pipeline\n", + "def research_topic(query: str) -> str:\n", + " pipeline = create_pipeline(query)\n", + " result = pipeline.run(data={})\n", + " return result[\"generator\"][\"replies\"]" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Analysis Result:\n", + "1. Key Points and Findings:\n", + "- The global conversation on AI ethics has gained momentum, with calls for inclusive global governance frameworks.\n", + "- Different cultural perspectives shape ethical considerations in AI technologies.\n", + "- AI technologies can have varying social impacts across different cultural contexts.\n", + "- Initiatives like UNESCO's Recommendation on the Ethics of AI aim to promote responsible AI development.\n", + "- Fragmented regulatory approaches in different regions pose challenges for multinational tech companies.\n", + " \n", + "2. Practical Implications:\n", + "- Businesses need to consider cultural diversity and social impacts when implementing AI technologies.\n", + "- Collaboration between industry, academia, and regulators is essential to address ethical challenges.\n", + "- Initiatives promoting responsible AI development, such as global standards like UNESCO's Recommendation, should be supported.\n", + "- Harmonized global standards can facilitate innovation while addressing ethical concerns in AI governance.\n", + "\n", + "3. Notable Conclusions:\n", + "- Local cultural contexts significantly influence perceptions and understandings of AI, highlighting the need for nuanced AI governance.\n", + "- Initiatives like the UNESCO Recommendation on the Ethics of AI and forums like the Global Forum on the Ethics of Artificial Intelligence aim to address ethical challenges in AI governance.\n", + "- The push for more harmonized global standards reflects the industry's commitment to responsible AI development in a culturally diverse landscape.\n" + ] + } + ], + "source": [ + "query = \"latest developments in AI ethics\"\n", + "analysis = research_topic(query)[0]\n", + "\n", + "print(\"Analysis Result:\")\n", + "print(analysis)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can customize the pipeline further by:\n", + "- Adding more sophisticated routing logic\n", + "- Implementing additional preprocessing steps\n", + "- Creating specialized generators for different content types\n", + "- Adding error handling and retries\n", + "- Implementing caching for improved performance\n", + "\n", + "This completes our exploration of using Apify's RAG Web Browser with Haystack for web-aware AI applications. The combination of web search capabilities with sophisticated content processing and analysis creates powerful possibilities for research, analysis and many other tasks." + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "pythonenv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.8" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +}