
Fix mixed prompts - Catalogue is mixed with Graph and Table
Fix prompts for StructRAG
kbeaugrand authored Jan 13, 2025
2 parents e0dbb9b + b6723a8 commit 1f8d08b
Showing 5 changed files with 167 additions and 88 deletions.
9 changes: 8 additions & 1 deletion sample/Program.cs
@@ -4,6 +4,7 @@
using KernelMemory.StructRAG;
using Microsoft.Extensions.Configuration;
using Microsoft.KernelMemory;
using Microsoft.KernelMemory.Configuration;
using Microsoft.KernelMemory.FileSystem.DevTools;
using Microsoft.KernelMemory.MemoryStorage.DevTools;
using Microsoft.SemanticKernel;
@@ -31,6 +32,12 @@
{
AnswerTokens = 4096
})
.WithCustomTextPartitioningOptions(new TextPartitioningOptions()
{
MaxTokensPerLine = 100,
MaxTokensPerParagraph = 200,
OverlappingTokens = 25
})
.WithSimpleTextDb(new SimpleTextDbConfig()
{
StorageType = FileSystemTypes.Volatile
@@ -50,7 +57,7 @@ await memory

var question = "In the current landscape where privacy laws are becoming increasingly stringent, and the global economy is experiencing a downturn, how can a technology company strategically leverage advancements in artificial intelligence (AI) to maintain competitive advantage and financial stability?";

var answer = await memory.AskAsync(question);
var answer = await memory.AskAsync(question, minRelevance: 0.9);

Console.WriteLine("Standard Kernel Memory Answer");
Console.WriteLine(answer.Result);
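
For reference, a minimal sketch of comparing the raised relevance threshold against looser settings. It reuses the memory and question variables from the sample above and assumes the MemoryAnswer.RelevantSources collection exposed by Kernel Memory; the threshold values and output formatting are chosen purely for illustration.

// Compare answers at increasing minimum relevance thresholds; higher values
// keep fewer, more on-topic partitions in the context sent to the model.
foreach (var threshold in new[] { 0.0, 0.5, 0.9 })
{
    var thresholdAnswer = await memory.AskAsync(question, minRelevance: threshold);
    Console.WriteLine($"--- minRelevance {threshold} ---");
    Console.WriteLine($"Relevant sources: {thresholdAnswer.RelevantSources.Count}");
    Console.WriteLine(thresholdAnswer.Result);
}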
@@ -1,16 +1,71 @@
Instruction:
Extract complete relevant tables from Raw Content based on the requirements described in the Requirement.
Note that when building a table, it is important to retain the table title and source information, such as which company and report the table comes from.
Instruction:
Extract the required directory structure (a hierarchical summary) from Raw Content based on the requirements described in the Requirement. The number of layers and the number of nodes in each layer are determined according to the specific circumstances.
Please follow the thinking style and output format in the Examples. Note that each level of the summary needs a number to distinguish it from other levels, and each summary needs to be very detailed.
Note that you need to extract as much relevant information as possible from the Raw Content based on the entity names and person names mentioned in the Requirement, in order to build a complete directory structure.

Hints:
1. First, identify the keywords in the Requirement, including entity names and attribute names, and then extract the relevant information from the Raw Content based on these keywords.
2. If the Raw Content does not contain the information required by the Requirement, extract the small amount of information most relevant to the Requirement from the Raw Content.
3. When analyzing the Requirement and extracting from the Raw Content, do not translate; maintain the original language.
Examples:
#################
#################
Requirement:
Query is How do guests perceive the impact of privacy laws on technology development?, please extract relevant catalogues from the document based on the Query.

Raw Content:
{{$content}}
Episode 48 - Randall Munroe
RANDALL MUNROE: Then fell into doing comics where I can spend all day diving into some rabbit hole, and then draw comics about it. And then the next day, move on to a different thing. I found a way to grab all the candy in the candy store.
KEVIN SCOTT: Hi, everyone. Welcome to Behind the Tech. I'm your host, Kevin Scott, Chief Technology Officer for Microsoft.
In this podcast, we're going to get behind the tech. We'll talk with some of the people who have made our modern tech world possible and understand what motivated them to create what they did. So, join me to maybe learn a little bit about the history of computing and get a few behind-the-scenes insights into what's happening today. Stick around.
CHRISTINA WARREN: Hello, and welcome to Behind the Tech. I'm Christina Warren, Senior Developer Advocate at GitHub.
KEVIN SCOTT: And I'm Kevin Scott.
CHRISTINA WARREN: And today, we have a super exciting guest with us, Randall Munroe. He's famous for creating the webcomic xkcd.
KEVIN SCOTT: Yeah, it's -- Randall is maybe my favorite cartoonist. So I very rarely post cartoons onto social media; 100% of them are Randall's stuff.
CHRISTINA WARREN: 100%.
KEVIN SCOTT: And so, once a month, once every other month, like he'll write something that I just think is so fabulously funny that I have to share it with my other nerd friends.
CHRISTINA WARREN: No, I mean, well -- well that -- well, that's what makes the comic so good is -- and -- and you know, I'm interested to hear what you two talk about. But it's such a -- it covers such a wide spectrum of -- of nerd-adjacent topics. And -- and because the comic's been going on so long, they're literally -- it's like The Simpsons. There literally is one for everything. You know, like there's -- there's an xkcd that you can apply to any situation.
.......
.......
.......

Output:
In the question in the Requirement, the main topic is the impact of privacy laws on technology development. Thus we extract information related to "impact of privacy laws on technology development" from the raw content and construct a hierarchical summary based on the extracted information.
1. First-Level Summary 1: AI Technology and Regulatory Challenges
• The podcast explores the complex relationship between AI advancements and existing legal frameworks, with a particular focus on privacy laws like HIPAA and how they interact with technological innovation.
(1) Second-Level Summary 1: Regulatory Concerns in Financial Services
• Ethan Mollick highlights concerns that the current regulatory environment in financial services is not well-suited to address the unique challenges posed by AI, particularly the uncertainty surrounding the applicability of existing regulations.
• (a) Third-Level Summary 1: Innovation Hindered by Regulatory Ambiguity
• Mollick discusses how the lack of clarity in regulations impedes the ability of industries, like finance, to fully harness the potential of AI technologies.
• (b) Third-Level Summary 1: Need for Adaptive Regulations
• He advocates for a more dynamic and responsive regulatory framework that can evolve alongside technological advancements, ensuring both safety and innovation.
(2) Second-Level Summary 2: AI in Healthcare and Privacy Concerns
• The podcast also delves into the intersection of AI experimentation in healthcare and the need to comply with privacy regulations like HIPAA.
• (a) Third-Level Summary 2: Balancing Privacy and AI Benefits
• Discussions emphasize the challenge of ensuring privacy while leveraging AI to improve healthcare systems and access to medical services.
• (b) Third-Level Summary 2: Ethical Considerations in AI Use
• Mollick touches on concerns over AI misuse, such as "data rape," and underscores the importance of regulating AI to promote positive outcomes while preventing harmful practices.
2. First-Level Summary 2: The Call for Responsive AI Regulation
• Mollick and other guests advocate for a regulatory approach that allows for experimentation and innovation, particularly in areas like healthcare, while mitigating potential risks.
(1) Second-Level Summary 1: The Need for Smart and Responsive Regulation
• Mollick calls for a "fast, smart, responsive regulation" that monitors emerging harms in AI and carves out space for experimentation in critical sectors like medicine.
• (a) Third-Level Summary 1: Evolving with Technological Advancements
• He stresses that regulations must evolve as quickly as the technology itself to ensure they are effective in addressing both the opportunities and risks associated with AI.
(2) Second-Level Summary 2: AI as a General-Purpose Technology
• The conversation highlights the far-reaching implications of AI, recognizing it as a general-purpose technology with the potential to significantly impact various sectors.
• (a) Third-Level Summary 2: Promoting Innovation While Protecting Rights
• Experts argue that while privacy laws are crucial to prevent misuse, they must also be flexible enough to allow for innovation, ensuring AI's positive potential is not stifled.
• (b) Third-Level Summary 2: The Need for Balance
• The guests suggest that a balanced approach to regulation is necessary, one that promotes innovation while protecting individual rights and societal interests.
3. First-Level Summary 3: Conclusion on the Future of AI Regulation
• The episode concludes with a call for a balanced regulatory framework that can adapt to the evolving nature of AI, ensuring that both privacy and innovation are protected.
(1) Second-Level Summary 1: Regulatory Agility for AI's Future
• Experts emphasize that regulations must be agile enough to keep pace with AI developments, ensuring that the technology can be used safely while minimizing potential harms.
• (a) Third-Level Summary 1: Agility in Regulation
• The need for regulatory frameworks that evolve in tandem with technological advancements is underscored as a key factor in supporting AI's positive societal impact.
#################
#################

Requirement:
{{$instruction}}

Output:
Raw Content:
{{$raw_content}}

Output:
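
As a rough sketch of how this template's placeholders could be filled before the call to the model, the helper below substitutes {{$instruction}} and {{$raw_content}} with plain string replacement. The helper name and file path are assumptions for illustration only; in the actual pipeline, Kernel Memory resolves these variables through its own prompt templating.

using System.IO;

// Hypothetical helper (not part of KernelMemory.StructRAG): fill the catalogue
// prompt's variables with simple string replacement.
static string RenderCataloguePrompt(string instruction, string rawContent)
{
    // The path below is assumed for illustration; the commit page does not show
    // the location of this prompt file.
    var template = File.ReadAllText("Prompts/StructRAG/ConstructCatalogue.txt");
    return template
        .Replace("{{$instruction}}", instruction)
        .Replace("{{$raw_content}}", rawContent);
}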
90 changes: 81 additions & 9 deletions src/KernelMemory.StructRAG/Prompts/StructRAG/ConstructGraph.txt
@@ -1,16 +1,88 @@
Instruction:
Extract complete relevant tables from Raw Content based on the requirements described in the Requirement.
Note that when building a table, it is important to retain the table title and source information, such as which company and report the table comes from.
Instruction:
Extract the required triplets from Raw Content according to the requirements described in the Requirement.
Each triplet is output in the format {{"head": "...", "relation": "...", "tail": ["...", "..."]}}.
Note that not all triplets in the text need to be extracted. You need to analyze the relationships and entities mentioned in the Requirement and only extract the relevant triplets.
Note that the head and tail you output should be kept as complete as possible. They may not be just a word or phrase, but can also be a sentence or a paragraph of text. Stay consistent with the original text and do not abbreviate.

Hints:
1. First, identify the keywords in the Requirement, including entity names and attribute names, and then extract the relevant information from the Raw Content based on these keywords.
2. If the Raw Content does not contain the information required by the Requirement, extract the small amount of information most relevant to the Requirement from the Raw Content.
3. When analyzing the Requirement and extracting from the Raw Content, do not translate; maintain the original language.
Examples:
#################
#################
Requirement:
It is necessary to construct a graph based on a given document, where the entity is the title of the paper, the relationship is a reference, and the title of the given document is used as the head, while the titles of other papers are used as the tail

Noting:
You only need to consider the following paper titles,
Generative AI and Large Language Models for Cyber Security: All Insights You Need
WHEN LLMs MEET CYberSECURITY: A SYStEMATIC LITERATURE REVIEW
Can Large Language Models Be an Alternative to Human Evaluations?
LLM4Vuln: A Unified Evaluation Framework for Decoupling and Enhancing LLMs' Vulnerability Reasoning
Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers

Raw Content:
{{$content}}
# Generative AI and Large Language Models for Cyber Security: All Insights You Need
Mohamed Amine Ferrag, Fatima Alwahedi, Ammar Battah, Bilel Cherif, Abdechakour Mechri,<br>and Norbert Tihanyi
#### Abstract
The rapid evolution of cyber threats requires innovative approaches to enhance cybersecurity defenses. In this paper,
Index Terms-Generative AI, LLM, Transformer, Security, Cyber Security.
M. A. Ferrag is the corresponding author.
## LIST OF ABBREVIATIONS
AI Artificial Intelligence
## I. INTRODUCTION
The history of Natural Language Processing (NLP) dates back to the 1950s when the Turing test was developed. However, NLP has seen significant advancements in
[141] ZySec-AI, "Zysec-ai: Project zysec," Webpage, accessed: 2024-05-01. [Online]. Available: https://github.com/ZySec-AI/project-zysec
[205] M. Bhatt, S. Chennabasappa, C. Nikolaidis, S. Wan, I. Evtimov, D. Gabi, D. Song, F. Ahmad, C. Aschermann, L. Fontana et al., "Purple llama cyberseceval: A secure coding benchmark for language models," arXiv preprint arXiv:2312.04724, 2023.
[206] Z. Liu, "Secqa: A concise question-answering dataset for evaluating large language models in computer security," arXiv preprint arXiv:2312.15838, 2023.
[207] M. Bhatt, S. Chennabasappa, Y. Li, C. Nikolaidis, D. Song, S. Wan, F. Ahmad, C. Aschermann, Y. Chen, D. Kapil, D. Molnar, S. Whitman, and J. Saxe, "Cyberseceval 2: A wide-ranging cybersecurity evaluation suite for large language models," 2024.
[208] N. Li, A. Pan, A. Gopal, S. Yue, D. Berrios, A. Gatti, J. D. Li, A.K. Dombrowski, S. Goel, L. Phan et al., "The wmdp benchmark: Measuring and reducing malicious use with unlearning," arXiv preprint arXiv:2403.03218, 2024.
[209] Y. Sun, D. Wu, Y. Xue, H. Liu, W. Ma, L. Zhang, M. Shi, and Y. Liu, "Llm4vuln: A unified evaluation framework for decoupling and enhancing llms\' vulnerability reasoning," 2024.
[210] Z. Liu, J. Shi, and J. F. Buford, "Cyberbench: A multi-task benchmark for evaluating large language models in cybersecurity." [Online]. Available: http://aics.site/AICS2024/AICS_CyberBench.pdf

Output:
Among the paper titles that need to be considered, "Generative AI and Large Language Models for Cyber Security: All Insights You Need" is the title of the given document, so it should be used as the head. Among the other paper titles that need to be considered, "Llm4vuln: A unified evaluation framework for decoupling and enhancing llms' vulnerability reasoning" appears in the references of the given document, so it should be used as the tail. The remaining paper titles that need to be considered do not appear in the given document, so they are not considered.
{{"head": "Generative AI and Large Language Models for Cyber Security: All Insights You Need", "relation": "reference", "tail": ["Llm4vuln: A unified evaluation framework for decoupling and enhancing llms\' vulnerability reasoning"]}}
#################
#################

Requirement:
{{$instruction}}
It is necessary to construct a graph based on a given document, where the entity is the title of the paper, the relationship is a reference, and the title of the given document is used as the head, while the titles of other papers are used as the tail

Noting:
You only need to consider the following paper titles,
Generative AI and Large Language Models for Cyber Security: All Insights You Need
WHEN LLMs MEET CYberSECURITY: A SYStEMATIC LITERATURE REVIEW
Can Large Language Models Be an Alternative to Human Evaluations?
LLM4Vuln: A Unified Evaluation Framework for Decoupling and Enhancing LLMs' Vulnerability Reasoning
Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers

Raw Content:
# LLM4Vuln: A Unified Evaluation Framework for Decoupling and Enhancing LLMs\' Vulnerability Reasoning
Daoyuan $\\mathrm{{Wu}}^{{*}}$<br>Nanyang Technological University<br>Singapore, Singapore<br>daoyuan.wu@ntu.edu.sg<br>Wei Ma<br>Nanyang Technological University<br>Singapore, Singapore<br>ma_wei@ntu.edu.sg
Yue Xue<br>MetaTrust Labs<br>Singapore, Singapore<br>xueyue@metatrust.io<br>Lyuye Zhang<br>Nanyang Technological University<br>Singapore, Singapore<br>zh0004ye@e.ntu.edu.sg
Miaolei Shi<br>MetaTrust Labs<br>Singapore, Singapore<br>stan@metatrust.io
Yang Liu<br>Nanyang Technological University<br>Singapore, Singapore<br>yangliu@ntu.edu.sg
#### Abstract
Large language models (LLMs) have demonstrated significant potential for many downstream tasks, including those requiring humanlevel intelligence, such as vulnerability detection. However, recent attempts to use LLMs for vulnerability detection are still preliminary, as they lack an in-depth understanding of a subject LLM\'s vulnerability reasoning capability - whether it originates from the model itself or from external assistance, such as invoking tool support and retrieving vulnerability knowledge.
## REFERENCES
[1] 2023. Ethereum Whitepaper. https://ethereum.org/whitepaper
[2] 2023. Solidity Programming Language. https://soliditylang.org
[21] Yizheng Chen, Zhoujie Ding, Lamya Alowain, Xinyun Chen, and David Wagner 2023. DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection. In Proceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses. ACM, Hong Kong China, 654-668. https://doi.org/10.1145/3607199.3607242
[22] Cheng-Han Chiang and Hung-yi Lee. 2023. Can Large Language Models Be an Alternative to Human Evaluations?. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). Association for Computational Linguistics, Toronto, Canada, 15607-15631. https://doi.org/10.18653/v1/2023.acllong. 870
[23] Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, and Furu Wei 2023. Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers. arXiv:2212.10559 (May 2023). https //doi.org/10.48550/arXiv.2212.10559 arXiv:2212.10559 [cs].

Output:
Among the paper titles that need to be considered, "LLM4Vuln: A Unified Evaluation Framework for Decoupling and Enhancing LLMs' Vulnerability Reasoning" is the title of the given document, so it should be used as the head. Among the other paper titles that need to be considered, "Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers" and "Can Large Language Models Be an Alternative to Human Evaluations?" appear in the references of the given document, so they should be considered as tails. The remaining paper titles that need to be considered are not included in the given document, so they are not considered.
{{"head": "LLM4Vuln: A Unified Evaluation Framework for Decoupling and Enhancing LLMs\' Vulnerability Reasoning", "relation": "reference", "tail": ["Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers.", "Can Large Language Models Be an Alternative to Human Evaluations?"]}}
#################
#################

Requirement:
{{$instruction}}

Noting:
You only need to consider the following paper titles,
{{$titles}}

Raw Content:
{{$raw_content}}

Output:
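
To make the triplet format above concrete, here is a minimal C# sketch of reading one output line into a record, assuming the doubled braces in the template render as ordinary JSON braces at generation time. The Triplet record and the sample line are illustrative, not types or output taken from the library.

using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

// Parse a single triplet line of the form shown in the examples above.
var json = """{"head": "Generative AI and Large Language Models for Cyber Security: All Insights You Need", "relation": "reference", "tail": ["LLM4Vuln: A Unified Evaluation Framework for Decoupling and Enhancing LLMs' Vulnerability Reasoning"]}""";
var triplet = JsonSerializer.Deserialize<Triplet>(json)!;
Console.WriteLine($"{triplet.Head} --{triplet.Relation}--> {string.Join("; ", triplet.Tail)}");

// Illustrative shape matching {{"head": "...", "relation": "...", "tail": ["..."]}}
// once the escaped braces are rendered as plain JSON.
public sealed record Triplet(
    [property: JsonPropertyName("head")] string Head,
    [property: JsonPropertyName("relation")] string Relation,
    [property: JsonPropertyName("tail")] List<string> Tail);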
