Compare commits

14 Commits

| Author | SHA1 | Message | Date |
|---|---|---|---|
| Romain Quinet | fbcf0168f9 | update readme | 2023-10-07 09:38:32 +02:00 |
| Romain Quinet | 1a46ea4816 | Improved chat mode | 2023-10-07 08:48:08 +02:00 |
| Romain Quinet | bf3fd878ac | Use DnD API | 2023-10-07 08:32:07 +02:00 |
| Romain Quinet | c47ff3d9ed | Added more metadata | 2023-10-07 00:57:45 +02:00 |
| Romain Quinet | a582d89c57 | updated examples | 2023-10-07 00:42:49 +02:00 |
| Romain Quinet | d0dd93a5ab | updated examples | 2023-10-07 00:38:32 +02:00 |
| Romain Quinet | 4e587aed9e | document metadata | 2023-10-07 00:31:01 +02:00 |
| Romain Quinet | 395de571a4 | removed unused code | 2023-10-06 23:55:36 +02:00 |
| Romain Quinet | fb42ee0eee | updated readme | 2023-10-06 23:52:02 +02:00 |
| Romain Quinet | be2c064d40 | Better prompts | 2023-10-06 23:45:47 +02:00 |
| Romain Quinet | 3d734a3064 | Chat mode | 2023-10-06 23:22:10 +02:00 |
| Romain Quinet | b41d8288b7 | Use GPT4 | 2023-10-06 22:52:42 +02:00 |
| Romain Quinet | a77d41c6ec | Better prompt | 2023-10-06 22:43:39 +02:00 |
| Romain Quinet | 96c692aef7 | Include episode title and number into data | 2023-10-06 22:43:28 +02:00 |

4 changed files with 140 additions and 27 deletions

.gitignore (vendored): 2 lines changed

@@ -1,3 +1,3 @@
-/data
+/transcripts
 /index
 /.idea


@@ -2,8 +2,10 @@
 ## What is Darknet Diaries?
 Well, let's ask our LLM:
-> > What is Darknet Diaries?
-> > Darknet Diaries is a podcast that explores true stories from the dark side of the internet. Hosted by Jack Rhysider, each episode delves into various topics related to hacking, cybercrime, and the underground world of the internet. The podcast features interviews with individuals who have been involved in hacking incidents, cyber espionage, or have experienced the consequences of online criminal activities. Darknet Diaries aims to educate and entertain listeners by providing an in-depth look into the fascinating and often mysterious world of the darknet.
+> What is Darknet Diaries?
+>> Darknet Diaries is a podcast that explores true stories from the dark side of the internet. Hosted by Jack Rhysider, each episode delves into various topics related to hacking, cybercrime, and the underground world of the internet. The podcast features interviews with individuals who have been involved in hacking incidents, cyber espionage, or have experienced the consequences of online criminal activities. Darknet Diaries aims to educate and entertain listeners by providing an in-depth look into the fascinating and often mysterious world of the darknet.
+[darknetdiaries.com](http://darknetdiaries.com/)
 ## How to run
 ### Install dependencies
@@ -28,4 +30,43 @@ python3 main.py
 On the first run, it will generate the index. This can take a while, but it will be cached on disk for the next runs.
-You can then ask it any questions about Darknet Diaries! Currently, it does hallucinate a lot about episode numbers and titles. Other than that, it's pretty accurate!
+You can then ask it any questions about Darknet Diaries!
+## Examples
+> What is the intro of the podcast?
+>> Typically, the host Jack Rhysider starts the podcast with a brief introduction about the episode's topic, followed by "These are true stories from the dark side of the internet. I'm Jack Rhysider. This is Darknet Diaries." Please note that the exact wording may vary from episode to episode.
+(this is exactly the intro)
+### Episode 137: Predator
+[link to transcript](https://darknetdiaries.com/transcript/137/)
+> What is the spyware predator ?
+>> The Predator is a type of spyware that is capable of infecting a phone and erasing its tracks of the infection process. Once installed, it can watch what's going on with the camera, listen in on the microphone, scrape passwords, read texts, and report the person's location. It is typically delivered through a phishing text message that encourages the recipient to click on a link, triggering a series of exploits on the phone. The Predator spyware is supposedly only sold to intelligence agencies.
+>
+> What happened with predator and the egyptian journalists?
+>> Predator spyware was found on the phones of an Egyptian politician living in exile and an Egyptian journalist. This led to the suspicion that the Egyptian government might be using this technology to spy on their civil society. The use of the spyware was not to stop terrorists or catch criminals, but seemingly to monitor the stories the journalist was working on, which was considered a misuse of the technology.
+>
+> How and when was Predator first discovered?
+>> The Predator spyware was found on the phones of three people in Greece's civil society, including a journalist and an opposition leader. However, the specific timeline or details about the initial discovery are not provided.
+>
+>Which episode should I listen to in order to learn more about this topic?
+>> To learn more about the Predator spyware, you should listen to episode 137 titled "Predator" from the Darknet Diaries.
+### Episode 130: Jason's Pen Test
+[link to transcript](https://darknetdiaries.com/transcript/130/)
+> Who is jason haddix
+>> Jason Haddix is a well-known figure in the cybersecurity industry. He has held various roles in the field, including Director of Technical Operations at Bugcrowd and Head of Trust and Security at HackerOne. Haddix is particularly known for his expertise in bug bounty programs and penetration testing. He has also contributed to the cybersecurity community by developing methodologies and tools for security testing.
+>
+> in which episode is he featured
+>> Jason Haddix is featured in Episode 130, titled "Jason's Pen Test".
+>
+> Can you summarize that episode
+>> Episode 130, titled "Jason's Pen Test," features a professional penetration tester named Jason who worked for HP. Jason shares his experiences conducting hundreds of penetration tests for Fortune 500 companies, primarily focusing on network hacking. He explains the process of testing a company's security measures, both physically and digitally.
+>>
+>>One of his memorable stories involves a physical penetration test where he had to break into a former employer's building. He used his knowledge of the building's layout and security mechanisms to gain access, even falling through a ceiling into a server room at one point.
+>>
+>>The episode also covers a project where Jason was tasked with hacking into a large, worldwide bank. His job was to examine the bank's mobile app for any potential security vulnerabilities that could expose customer or sensitive information. The episode provides a detailed look into the world of penetration testing, highlighting the importance of robust security measures in both physical and digital spaces.
+>
+> How many downloads does this episode have?
+>> Episode 130 of Darknet Diaries, titled "JASON'S PEN TEST", has 667,528 downloads.

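The README's note that the index is generated on the first run and then cached on disk can be illustrated with the standard library alone. This is a minimal sketch, not the repository's code: `build_index` and `load_index` are hypothetical stand-ins for the llama_index calls, the JSON file is an assumed storage format, and the `lock` marker file only mirrors the `./index/lock` check that main.py performs.

```python
import json
import os
import tempfile


def build_index(index_dir):
    """Hypothetical stand-in for the slow first-run indexing step."""
    index = {"episodes": [1, 2, 3]}
    with open(os.path.join(index_dir, "index.json"), "w") as f:
        json.dump(index, f)
    return index


def load_index(index_dir):
    """Hypothetical stand-in for loading the persisted index."""
    with open(os.path.join(index_dir, "index.json")) as f:
        return json.load(f)


def get_index(index_dir):
    """Return (index, built); built is True only when the index was just built."""
    lock_path = os.path.join(index_dir, "lock")
    if not os.path.exists(lock_path):
        os.makedirs(index_dir, exist_ok=True)
        index = build_index(index_dir)
        # Create the marker last, so an interrupted build is retried next run.
        open(lock_path, "a").close()
        return index, True
    return load_index(index_dir), False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        index_dir = os.path.join(tmp, "index")
        _, built_first = get_index(index_dir)
        _, built_second = get_index(index_dir)
        print(built_first, built_second)
```

Writing the marker only after a successful build is the design choice worth noting: if indexing fails halfway, the next run starts over instead of loading a partial index.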

@@ -1,13 +1,30 @@
 import requests
+import os
 from bs4 import BeautifulSoup
+import json
+folder_path = "transcripts"
+if not os.path.exists(folder_path):
+    os.makedirs(folder_path)
 for i in range(1, 139):
+    try:
         url = f"https://darknetdiaries.com/transcript/{i}"
         r = requests.get(url)
         soup = BeautifulSoup(r.text, 'html.parser')
-    pre_section = soup.find('pre')
-    if pre_section:
-        text = pre_section.get_text()
-        with open(f"data/episode_{i}.txt", "w") as f:
-            f.write(text)
+        transcript = soup.find('pre').get_text()
+        url = f"https://api.darknetdiaries.com/{i}.json"
+        r = requests.get(url)
+        parsed_json = json.loads(r.text)
+        title = parsed_json["episode_name"]
+        number = parsed_json["episode_number"]
+        downloads = parsed_json["total_downloads"]
+        with open(f"{folder_path}/episode_{number}.txt", "w") as f:
+            f.write(f"{title}\n{downloads}\n{transcript}")
+        print(f"{number} {title}")
+    except Exception:
+        print(f"Failed scraping episode {i}")

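The scraper above stores each episode as a text file whose first line is the title, whose second line is the download count, and whose remainder is the transcript; main.py reads the file back in the same order and takes the episode number from the filename. A minimal round-trip sketch of that layout follows. The helper names `write_episode` and `read_episode` are hypothetical, and the sample values are placeholders rather than real episode data.

```python
import os
import re
import tempfile


def write_episode(folder, number, title, downloads, transcript):
    """Write one episode file: title line, downloads line, then the transcript."""
    path = os.path.join(folder, f"episode_{number}.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"{title}\n{downloads}\n{transcript}")
    return path


def read_episode(path):
    """Read an episode file back into the metadata fields used for indexing."""
    match = re.search(r"\d+", os.path.basename(path))
    if match is None:
        raise ValueError(f"no episode number in filename: {path}")
    with open(path, encoding="utf-8") as f:
        title = f.readline().strip()
        downloads = f.readline().strip()
        transcript = f.read()  # everything after the two header lines
    return {
        "episode_number": match.group(),  # kept as a string, as in main.py
        "episode_title": title,
        "episode_downloads": downloads,
        "transcript": transcript,
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        path = write_episode(folder, 7, "Sample Title", 12345, "line one\nline two")
        print(read_episode(path))
```

The layout relies on the title never containing a newline; if it did, the download count would be read from the wrong line.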
main.py: 85 lines changed

@@ -1,29 +1,40 @@
-from llama_index import (SimpleDirectoryReader, ServiceContext, StorageContext, PromptTemplate,
+from llama_index import (ServiceContext, StorageContext,
                          load_index_from_storage, Document, set_global_service_context)
 from llama_index.node_parser import SimpleNodeParser
 from llama_index import VectorStoreIndex
-from llama_index.llms import OpenAI
+from llama_index.llms import OpenAI, ChatMessage, MessageRole
+from llama_index.prompts import ChatPromptTemplate
+from llama_index import set_global_handler
+from llama_index.chat_engine.types import ChatMode
 import os
 import re
-llm = OpenAI(model="gpt-3.5-turbo", temperature=0, max_tokens=256)
+# set_global_handler("simple")
+llm = OpenAI(model="gpt-4", temperature=0, max_tokens=256)
 service_context = ServiceContext.from_defaults(llm=llm)
 set_global_service_context(service_context)
 if not os.path.exists("./index/lock"):
     documents = []
-    for filename in os.listdir("./data"):
+    for filename in os.listdir("./transcripts"):
         episode_number = re.search(r'\d+', filename).group()
-        with open("./data/" + filename, 'r') as f:
+        with open("./transcripts/" + filename, 'r') as f:
+            title = f.readline().strip()
+            downloads = f.readline().strip()
             content = f.read()
         document = Document(
             text=content,
             doc_id=filename,
             metadata={
-                "episode_number": episode_number
+                "episode_number": episode_number,
+                "episode_title": title,
+                "episode_downloads": downloads,
+                "episode_url": f"https://darknetdiaries.com/episode/{episode_number}/"
             }
         )
         documents.append(document)
-    documents = SimpleDirectoryReader('./data').load_data()
     parser = SimpleNodeParser.from_defaults()
     nodes = parser.get_nodes_from_documents(documents)
@@ -31,21 +42,65 @@ if not os.path.exists("./index/lock"):
     index.storage_context.persist(persist_dir="./index")
     open("./index/lock", 'a').close()
 else:
+    print("Loading index...")
     storage_context = StorageContext.from_defaults(persist_dir="./index")
     index = load_index_from_storage(storage_context)
-template = (
-    "You are now an expert on the Darknet Diaries podcast. \n"
-    "Please answer this question by referring to the podcast: {query_str}\n"
+chat_text_qa_msgs = [
+    ChatMessage(
+        role=MessageRole.SYSTEM,
+        content=(
+            "You have been trained on the Darknet Diaries podcast transcripts with data from october 6 2023."
+            "You are an expert about it and will answer as such. You know about every episode up to number 138."
+            "Always answer the question, even if the context isn't helpful."
+            "Mention the number and title of the episodes you are referring to."
+        )
+    ),
+    ChatMessage(
+        role=MessageRole.USER,
+        content=(
+            "Context information is below.\n"
+            "---------------------\n"
+            "{context_str}\n"
+            "---------------------\n"
+            "Given the context information and not prior knowledge,"
+            "answer the question: {query_str}\n"
+        )
+    )
+]
+text_qa_template = ChatPromptTemplate(chat_text_qa_msgs)
+chat_refine_msgs = [
+    ChatMessage(
+        role=MessageRole.SYSTEM,
+        content="Always answer the question, even if the context isn't helpful.",
+    ),
+    ChatMessage(
+        role=MessageRole.USER,
+        content=(
+            "We have the opportunity to refine the original answer "
+            "(only if needed) with some more context below.\n"
+            "------------\n"
+            "{context_msg}\n"
+            "------------\n"
+            "Given the new context, refine the original answer to better "
+            "answer the question: {query_str}. "
+            "If the context isn't useful, output the original answer again.\n"
+            "Original Answer: {existing_answer}"
+        ),
+    ),
+]
+refine_template = ChatPromptTemplate(chat_refine_msgs)
+chat_engine = index.as_chat_engine(
+    text_qa_template=text_qa_template,
+    refine_template=refine_template,
+    chat_mode=ChatMode.OPENAI
 )
-qa_template = PromptTemplate(template)
-query_engine = index.as_query_engine(text_qa_template=qa_template)
 while True:
     try:
-        user_prompt = input("Prompt: ")
-        response = query_engine.query(user_prompt)
-        print(response)
+        chat_engine.chat_repl()
     except KeyboardInterrupt:
         break
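
The user message in the question-answering prompt above carries two placeholders, `{context_str}` and `{query_str}`. The sketch below shows how such a template could be filled with plain `str.format`; it is only an illustration, since llama_index performs its own substitution, and `render_user_message` is a hypothetical helper. Python joins adjacent string literals with no separator, so the sketch keeps an explicit trailing space where one sentence fragment runs into the next.

```python
USER_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)


def render_user_message(context_chunks, question):
    """Join the retrieved chunks and substitute both placeholders."""
    context_str = "\n\n".join(context_chunks)
    return USER_TEMPLATE.format(context_str=context_str, query_str=question)


if __name__ == "__main__":
    chunks = ["First retrieved passage.", "Second retrieved passage."]
    print(render_user_message(chunks, "Who hosts the podcast?"))
```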