update readme

Improved chat mode
Use DnD API
4 changed files with 140 additions and 27 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,3 @@
-/data
+/transcripts
 /index
 /.idea
--- a/README.MD
+++ b/README.MD
@@ -2,8 +2,10 @@
 ## What is Darknet Diaries?
 Well, let's ask our LLM:
-> &gt; What is Darknet Diaries?
+> What is Darknet Diaries?
-> > Darknet Diaries is a podcast that explores true stories from the dark side of the internet. Hosted by Jack Rhysider, each episode delves into various topics related to hacking, cybercrime, and the underground world of the internet. The podcast features interviews with individuals who have been involved in hacking incidents, cyber espionage, or have experienced the consequences of online criminal activities. Darknet Diaries aims to educate and entertain listeners by providing an in-depth look into the fascinating and often mysterious world of the darknet.
+>> Darknet Diaries is a podcast that explores true stories from the dark side of the internet. Hosted by Jack Rhysider, each episode delves into various topics related to hacking, cybercrime, and the underground world of the internet. The podcast features interviews with individuals who have been involved in hacking incidents, cyber espionage, or have experienced the consequences of online criminal activities. Darknet Diaries aims to educate and entertain listeners by providing an in-depth look into the fascinating and often mysterious world of the darknet.
 [darknetdiaries.com](http://darknetdiaries.com/)
 ## How to run
 ### Install dependencies
@@ -28,4 +30,43 @@ python3 main.py
 On the first run, it will generate the index. This can take a while, but it will be cached on disk for the next runs.
-You can then ask it any questions about Darknet Diaries! Currently, it does hallucinate a lot about episode numbers and titles. Other than that, it's pretty accurate!
+You can then ask it any questions about Darknet Diaries!
 ## Examples
 > What is the intro of the podcast?
 >> Typically, the host Jack Rhysider starts the podcast with a brief introduction about the episode's topic, followed by "These are true stories from the dark side of the internet. I'm Jack Rhysider. This is Darknet Diaries." Please note that the exact wording may vary from episode to episode.
 (this is exactly the intro)
 ### Episode 137: Predator
 [link to transcript](https://darknetdiaries.com/transcript/137/)
 > What is the spyware predator ?
 >> The Predator is a type of spyware that is capable of infecting a phone and erasing its tracks of the infection process. Once installed, it can watch what's going on with the camera, listen in on the microphone, scrape passwords, read texts, and report the person's location. It is typically delivered through a phishing text message that encourages the recipient to click on a link, triggering a series of exploits on the phone. The Predator spyware is supposedly only sold to intelligence agencies.
 >
 > What happened with predator and the egyptian journalists?
 >> Predator spyware was found on the phones of an Egyptian politician living in exile and an Egyptian journalist. This led to the suspicion that the Egyptian government might be using this technology to spy on their civil society. The use of the spyware was not to stop terrorists or catch criminals, but seemingly to monitor the stories the journalist was working on, which was considered a misuse of the technology.
 >
 > How and when was Predator first discovered?
 >> The Predator spyware was found on the phones of three people in Greece's civil society, including a journalist and an opposition leader. However, the specific timeline or details about the initial discovery are not provided.
 >
 >Which episode should I listen to in order to learn more about this topic?
 >> To learn more about the Predator spyware, you should listen to episode 137 titled "Predator" from the Darknet Diaries.
 ### Episode 130: Jason's Pen Test
 [link to transcript](https://darknetdiaries.com/transcript/130/)
 > Who is jason haddix
 >> Jason Haddix is a well-known figure in the cybersecurity industry. He has held various roles in the field, including Director of Technical Operations at Bugcrowd and Head of Trust and Security at HackerOne. Haddix is particularly known for his expertise in bug bounty programs and penetration testing. He has also contributed to the cybersecurity community by developing methodologies and tools for security testing.
 >
 > in which episode is he featured
 >> Jason Haddix is featured in Episode 130, titled "Jason's Pen Test".
 >
 > Can you summarize that episode
 >> Episode 130, titled "Jason's Pen Test," features a professional penetration tester named Jason who worked for HP. Jason shares his experiences conducting hundreds of penetration tests for Fortune 500 companies, primarily focusing on network hacking. He explains the process of testing a company's security measures, both physically and digitally.
 >>
 >>One of his memorable stories involves a physical penetration test where he had to break into a former employer's building. He used his knowledge of the building's layout and security mechanisms to gain access, even falling through a ceiling into a server room at one point.
 >>
 >>The episode also covers a project where Jason was tasked with hacking into a large, worldwide bank. His job was to examine the bank's mobile app for any potential security vulnerabilities that could expose customer or sensitive information. The episode provides a detailed look into the world of penetration testing, highlighting the importance of robust security measures in both physical and digital spaces.
 > 
 > How many downloads does this episode have?
 >> Episode 130 of Darknet Diaries, titled "JASON'S PEN TEST", has 667,528 downloads.
--- a/download_transcripts.py
+++ b/download_transcripts.py
@@ -1,13 +1,30 @@
 import requests
 import os
 from bs4 import BeautifulSoup
 import json
 folder_path = "transcripts"
 if not os.path.exists(folder_path):
    os.makedirs(folder_path)
 for i in range(1, 139):
    try:
        url = f"https://darknetdiaries.com/transcript/{i}"
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
-    pre_section = soup.find('pre')
-    if pre_section:
-        text = pre_section.get_text()
-        with open(f"data/episode_{i}.txt", "w") as f:
-            f.write(text)
+        transcript = soup.find('pre').get_text()
+
+        url = f"https://api.darknetdiaries.com/{i}.json"
+        r = requests.get(url)
        parsed_json = json.loads(r.text)
        title = parsed_json["episode_name"]
        number = parsed_json["episode_number"]
        downloads = parsed_json["total_downloads"]
        with open(f"{folder_path}/episode_{number}.txt", "w") as f:
            f.write(f"{title}\n{downloads}\n{transcript}")
        print(f"{number} {title}")
    except Exception:
        print(f"Failed scraping episode {i}")
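Review note: the new `download_transcripts.py` writes each episode file as a title line, a downloads line, then the transcript, and `main.py` reads it back with two `readline()` calls followed by `read()`. The sketch below checks that round trip offline. The JSON field names come from the diff; the helper names `format_episode_file` and `parse_episode_file` are hypothetical, and the sample values in the usage note are made up.

```python
import json
from typing import Tuple


def format_episode_file(api_json: str, transcript: str) -> str:
    """Build the on-disk text: title line, downloads line, then the transcript."""
    meta = json.loads(api_json)
    title = meta["episode_name"]         # field name taken from the diff
    downloads = meta["total_downloads"]  # field name taken from the diff
    return f"{title}\n{downloads}\n{transcript}"


def parse_episode_file(text: str) -> Tuple[str, str, str]:
    """Split the file back into (title, downloads, transcript)."""
    # Only the first two newlines are separators; the transcript keeps its own.
    title, downloads, content = text.split("\n", 2)
    return title.strip(), downloads.strip(), content
```

For example, `format_episode_file('{"episode_name": "Example", "total_downloads": 123}', "hello")` produces a file whose first two lines are the metadata. The format depends on the title containing no newline, which this sketch assumes rather than verifies.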
--- a/main.py
+++ b/main.py
@@ -1,29 +1,40 @@
-from llama_index import (SimpleDirectoryReader, ServiceContext, StorageContext, PromptTemplate,
+from llama_index import (ServiceContext, StorageContext,
                         load_index_from_storage, Document, set_global_service_context)
 from llama_index.node_parser import SimpleNodeParser
 from llama_index import VectorStoreIndex
-from llama_index.llms import OpenAI
+from llama_index.llms import OpenAI, ChatMessage, MessageRole
 from llama_index.prompts import ChatPromptTemplate
 from llama_index import set_global_handler
 from llama_index.chat_engine.types import ChatMode
 import os
 import re
-llm = OpenAI(model="gpt-3.5-turbo", temperature=0, max_tokens=256)
+# set_global_handler("simple")
 llm = OpenAI(model="gpt-4", temperature=0, max_tokens=256)
 service_context = ServiceContext.from_defaults(llm=llm)
 set_global_service_context(service_context)
 if not os.path.exists("./index/lock"):
    documents = []
-    for filename in os.listdir("./data"):
+    for filename in os.listdir("./transcripts"):
        episode_number = re.search(r'\d+', filename).group()
-        with open("./data/" + filename, 'r') as f:
+        with open("./transcripts/" + filename, 'r') as f:
            title = f.readline().strip()
            downloads = f.readline().strip()
            content = f.read()
        document = Document(
            text=content,
            doc_id=filename,
            metadata={
-                "episode_number": episode_number
+                "episode_number": episode_number,
                "episode_title": title,
                "episode_downloads": downloads,
                "episode_url": f"https://darknetdiaries.com/episode/{episode_number}/"
            }
        )
        documents.append(document)
-    documents = SimpleDirectoryReader('./data').load_data()
    parser = SimpleNodeParser.from_defaults()
    nodes = parser.get_nodes_from_documents(documents)
@@ -31,21 +42,65 @@ if not os.path.exists("./index/lock"):
    index.storage_context.persist(persist_dir="./index")
    open("./index/lock", 'a').close()
 else:
    print("Loading index...")
    storage_context = StorageContext.from_defaults(persist_dir="./index")
    index = load_index_from_storage(storage_context)
-template = (
+chat_text_qa_msgs = [
-    "You are now an expert on the Darknet Diaries podcast. \n"
+    ChatMessage(
-    "Please answer this question by referring to the podcast: {query_str}\n"
+        role=MessageRole.SYSTEM,
         content=(
             "You have been trained on the Darknet Diaries podcast transcripts with data from October 6, 2023. "
             "You are an expert about it and will answer as such. You know about every episode up to number 138. "
             "Always answer the question, even if the context isn't helpful. "
             "Mention the number and title of the episodes you are referring to."
        )
    ),
    ChatMessage(
        role=MessageRole.USER,
         content=(
             "Context information is below.\n"
             "---------------------\n"
             "{context_str}\n"
             "---------------------\n"
             "Given the context information and not prior knowledge, "
             "answer the question: {query_str}\n"
        )
    )
 ]
 text_qa_template = ChatPromptTemplate(chat_text_qa_msgs)
 chat_refine_msgs = [
    ChatMessage(
        role=MessageRole.SYSTEM,
        content="Always answer the question, even if the context isn't helpful.",
    ),
    ChatMessage(
        role=MessageRole.USER,
        content=(
            "We have the opportunity to refine the original answer "
            "(only if needed) with some more context below.\n"
            "------------\n"
            "{context_msg}\n"
            "------------\n"
            "Given the new context, refine the original answer to better "
            "answer the question: {query_str}. "
            "If the context isn't useful, output the original answer again.\n"
            "Original Answer: {existing_answer}"
        ),
    ),
 ]
 refine_template = ChatPromptTemplate(chat_refine_msgs)
 chat_engine = index.as_chat_engine(
    text_qa_template=text_qa_template,
    refine_template=refine_template,
    chat_mode=ChatMode.OPENAI
 )
-qa_template = PromptTemplate(template)
-query_engine = index.as_query_engine(text_qa_template=qa_template)
 while True:
    try:
-        user_prompt = input("Prompt: ")
+        chat_engine.chat_repl()
-        response = query_engine.query(user_prompt)
-        print(response)
    except KeyboardInterrupt:
        break
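Review note: the prompt messages in `main.py` are built from adjacent string literals, which Python joins with no separator, so each fragment has to carry its own trailing space or newline. The sketch below is a quick way to inspect the text the model would receive. It uses plain `str.format` as a stand-in for `ChatPromptTemplate`, which is an assumption about how the two placeholders are filled, and the name `render_user_prompt` and the shortened system prompt are hypothetical.

```python
# Adjacent literals inside parentheses are concatenated at compile time.
SYSTEM_PROMPT = (
    "You are an expert on the podcast. "
    "Mention the number and title of the episodes you are referring to."
)

USER_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)


def render_user_prompt(context_str: str, query_str: str) -> str:
    """Fill the two placeholders the way a template engine would (stand-in only)."""
    return USER_TEMPLATE.format(context_str=context_str, query_str=query_str)
```

Printing `render_user_prompt("some retrieved text", "Who hosts the show?")` before wiring the template into the chat engine makes fused sentences or leftover placeholders easy to spot.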
Author	SHA1	Message	Date
Romain Quinet	fbcf0168f9	update readme	2023-10-07 09:38:32 +02:00
Romain Quinet	1a46ea4816	Improved chat mode	2023-10-07 08:48:08 +02:00
Romain Quinet	bf3fd878ac	Use DnD API	2023-10-07 08:32:07 +02:00
Romain Quinet	c47ff3d9ed	Added more metadata	2023-10-07 00:57:45 +02:00
Romain Quinet	a582d89c57	updated examples	2023-10-07 00:42:49 +02:00
Romain Quinet	d0dd93a5ab	updated examples	2023-10-07 00:38:32 +02:00
Romain Quinet	4e587aed9e	document metadata	2023-10-07 00:31:01 +02:00
Romain Quinet	395de571a4	removed unused code	2023-10-06 23:55:36 +02:00
Romain Quinet	fb42ee0eee	updated readme	2023-10-06 23:52:02 +02:00
Romain Quinet	be2c064d40	Better prompts	2023-10-06 23:45:47 +02:00
Romain Quinet	3d734a3064	Chat mode	2023-10-06 23:22:10 +02:00
Romain Quinet	b41d8288b7	Use GPT4	2023-10-06 22:52:42 +02:00
Romain Quinet	a77d41c6ec	Better prompt	2023-10-06 22:43:39 +02:00
Romain Quinet	96c692aef7	Include episode title and number into data	2023-10-06 22:43:28 +02:00