Compare commits
14 Commits
f4a9c9bed7...master
| Author | SHA1 | Date | |
|---|---|---|---|
| | fbcf0168f9 | | |
| | 1a46ea4816 | | |
| | bf3fd878ac | | |
| | c47ff3d9ed | | |
| | a582d89c57 | | |
| | d0dd93a5ab | | |
| | 4e587aed9e | | |
| | 395de571a4 | | |
| | fb42ee0eee | | |
| | be2c064d40 | | |
| | 3d734a3064 | | |
| | b41d8288b7 | | |
| | a77d41c6ec | | |
| | 96c692aef7 | | |
2
.gitignore
vendored
@@ -1,3 +1,3 @@
-/data
+/transcripts
 /index
 /.idea
47
README.MD
@@ -2,8 +2,10 @@
 ## What is Darknet Diaries?
 Well, let's ask our LLM:
 
-> > What is Darknet Diaries?
-> > Darknet Diaries is a podcast that explores true stories from the dark side of the internet. Hosted by Jack Rhysider, each episode delves into various topics related to hacking, cybercrime, and the underground world of the internet. The podcast features interviews with individuals who have been involved in hacking incidents, cyber espionage, or have experienced the consequences of online criminal activities. Darknet Diaries aims to educate and entertain listeners by providing an in-depth look into the fascinating and often mysterious world of the darknet.
+> What is Darknet Diaries?
+>> Darknet Diaries is a podcast that explores true stories from the dark side of the internet. Hosted by Jack Rhysider, each episode delves into various topics related to hacking, cybercrime, and the underground world of the internet. The podcast features interviews with individuals who have been involved in hacking incidents, cyber espionage, or have experienced the consequences of online criminal activities. Darknet Diaries aims to educate and entertain listeners by providing an in-depth look into the fascinating and often mysterious world of the darknet.
 
+[darknetdiaries.com](http://darknetdiaries.com/)
+
 ## How to run
 ### Install dependencies
@@ -28,4 +30,43 @@ python3 main.py
 
 On the first run, it will generate the index. This can take a while, but it will be cached on disk for the next runs.
 
-You can then ask it any questions about Darknet Diaries! Currently, it does hallucinate a lot about episode numbers and titles. Other than that, it's pretty accurate!
+You can then ask it any questions about Darknet Diaries!
+## Examples
+
+> What is the intro of the podcast?
+>> Typically, the host Jack Rhysider starts the podcast with a brief introduction about the episode's topic, followed by "These are true stories from the dark side of the internet. I'm Jack Rhysider. This is Darknet Diaries." Please note that the exact wording may vary from episode to episode.
+
+(this is exactly the intro)
+
+### Episode 137: Predator
+[link to transcript](https://darknetdiaries.com/transcript/137/)
+
+> What is the spyware predator ?
+>> The Predator is a type of spyware that is capable of infecting a phone and erasing its tracks of the infection process. Once installed, it can watch what's going on with the camera, listen in on the microphone, scrape passwords, read texts, and report the person's location. It is typically delivered through a phishing text message that encourages the recipient to click on a link, triggering a series of exploits on the phone. The Predator spyware is supposedly only sold to intelligence agencies.
+>
+> What happened with predator and the egyptian journalists?
+>> Predator spyware was found on the phones of an Egyptian politician living in exile and an Egyptian journalist. This led to the suspicion that the Egyptian government might be using this technology to spy on their civil society. The use of the spyware was not to stop terrorists or catch criminals, but seemingly to monitor the stories the journalist was working on, which was considered a misuse of the technology.
+>
+> How and when was Predator first discovered?
+>> The Predator spyware was found on the phones of three people in Greece's civil society, including a journalist and an opposition leader. However, the specific timeline or details about the initial discovery are not provided.
+>
+>Which episode should I listen to in order to learn more about this topic?
+>> To learn more about the Predator spyware, you should listen to episode 137 titled "Predator" from the Darknet Diaries.
+
+### Episode 130: Jason's Pen Test
+[link to transcript](https://darknetdiaries.com/transcript/130/)
+> Who is jason haddix
+>> Jason Haddix is a well-known figure in the cybersecurity industry. He has held various roles in the field, including Director of Technical Operations at Bugcrowd and Head of Trust and Security at HackerOne. Haddix is particularly known for his expertise in bug bounty programs and penetration testing. He has also contributed to the cybersecurity community by developing methodologies and tools for security testing.
+>
+> in which episode is he featured
+>> Jason Haddix is featured in Episode 130, titled "Jason's Pen Test".
+>
+> Can you summarize that episode
+>> Episode 130, titled "Jason's Pen Test," features a professional penetration tester named Jason who worked for HP. Jason shares his experiences conducting hundreds of penetration tests for Fortune 500 companies, primarily focusing on network hacking. He explains the process of testing a company's security measures, both physically and digitally.
+>>
+>>One of his memorable stories involves a physical penetration test where he had to break into a former employer's building. He used his knowledge of the building's layout and security mechanisms to gain access, even falling through a ceiling into a server room at one point.
+>>
+>>The episode also covers a project where Jason was tasked with hacking into a large, worldwide bank. His job was to examine the bank's mobile app for any potential security vulnerabilities that could expose customer or sensitive information. The episode provides a detailed look into the world of penetration testing, highlighting the importance of robust security measures in both physical and digital spaces.
+>
+> How many downloads does this episode have?
+>> Episode 130 of Darknet Diaries, titled "JASON'S PEN TEST", has 667,528 downloads.
@@ -1,13 +1,30 @@
 import requests
+import os
 from bs4 import BeautifulSoup
+import json
 
+folder_path = "transcripts"
+
+if not os.path.exists(folder_path):
+    os.makedirs(folder_path)
+
 for i in range(1, 139):
+    try:
         url = f"https://darknetdiaries.com/transcript/{i}"
         r = requests.get(url)
         soup = BeautifulSoup(r.text, 'html.parser')
-    pre_section = soup.find('pre')
 
-    if pre_section:
-        text = pre_section.get_text()
-        with open(f"data/episode_{i}.txt", "w") as f:
-            f.write(text)
+        transcript = soup.find('pre').get_text()
+
+        url = f"https://api.darknetdiaries.com/{i}.json"
+        r = requests.get(url)
+        parsed_json = json.loads(r.text)
+        title = parsed_json["episode_name"]
+        number = parsed_json["episode_number"]
+        downloads = parsed_json["total_downloads"]
+
+        with open(f"{folder_path}/episode_{number}.txt", "w") as f:
+            f.write(f"{title}\n{downloads}\n{transcript}")
+        print(f"{number} {title}")
+    except Exception:
+        print(f"Failed scraping episode {i}")
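Note on the file layout written by the scraper above: each transcript file now starts with the episode title on the first line and the download count on the second, followed by the transcript itself. A minimal round-trip sketch of that layout, using only the standard library, might look like the following. The helper names `write_episode_file` and `read_episode_file` are hypothetical and do not appear in the diff, and the transcript text in the demo is a placeholder.

```python
import os
import tempfile


def write_episode_file(folder, number, title, downloads, transcript):
    """Write one episode as: title line, downloads line, then the transcript."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, f"episode_{number}.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"{title}\n{downloads}\n{transcript}")
    return path


def read_episode_file(path):
    """Read the file back: two header lines first, then the remaining text."""
    with open(path, "r", encoding="utf-8") as f:
        title = f.readline().strip()
        downloads = f.readline().strip()
        content = f.read()
    return title, downloads, content


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Placeholder values, for illustration only.
        path = write_episode_file(tmp, 130, "Jason's Pen Test", 667528, "placeholder transcript text")
        print(read_episode_file(path))
```

One consequence of this layout is that a title containing a newline would shift the header lines, so a reader of these files has to trust that titles are single-line.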
85
main.py
@@ -1,29 +1,40 @@
-from llama_index import (SimpleDirectoryReader, ServiceContext, StorageContext, PromptTemplate,
+from llama_index import (ServiceContext, StorageContext,
                          load_index_from_storage, Document, set_global_service_context)
 from llama_index.node_parser import SimpleNodeParser
 from llama_index import VectorStoreIndex
-from llama_index.llms import OpenAI
+from llama_index.llms import OpenAI, ChatMessage, MessageRole
+from llama_index.prompts import ChatPromptTemplate
+from llama_index import set_global_handler
+from llama_index.chat_engine.types import ChatMode
 import os
 import re
 
-llm = OpenAI(model="gpt-3.5-turbo", temperature=0, max_tokens=256)
+# set_global_handler("simple")
+
+llm = OpenAI(model="gpt-4", temperature=0, max_tokens=256)
 service_context = ServiceContext.from_defaults(llm=llm)
 set_global_service_context(service_context)
 
 if not os.path.exists("./index/lock"):
     documents = []
-    for filename in os.listdir("./data"):
+    for filename in os.listdir("./transcripts"):
         episode_number = re.search(r'\d+', filename).group()
-        with open("./data/" + filename, 'r') as f:
+        with open("./transcripts/" + filename, 'r') as f:
+            title = f.readline().strip()
+            downloads = f.readline().strip()
             content = f.read()
             document = Document(
                 text=content,
+                doc_id=filename,
                 metadata={
-                    "episode_number": episode_number
+                    "episode_number": episode_number,
+                    "episode_title": title,
+                    "episode_downloads": downloads,
+                    "episode_url": f"https://darknetdiaries.com/episode/{episode_number}/"
                 }
             )
+            documents.append(document)
 
-    documents = SimpleDirectoryReader('./data').load_data()
     parser = SimpleNodeParser.from_defaults()
     nodes = parser.get_nodes_from_documents(documents)
 
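Note on the metadata built in the hunk above: the episode number is taken from the first run of digits in the filename, and the episode URL is derived from it. A standalone sketch of that step, assuming filenames shaped like `episode_137.txt`, could look like this. The function name `episode_metadata` is hypothetical, and the explicit error for filenames without digits is an addition that the diff does not contain.

```python
import re


def episode_metadata(filename, title, downloads):
    """Build the per-document metadata dict from a transcript filename."""
    match = re.search(r"\d+", filename)
    if match is None:
        # Assumption: fail loudly instead of raising AttributeError on None.
        raise ValueError(f"no episode number in {filename!r}")
    episode_number = match.group()
    return {
        "episode_number": episode_number,
        "episode_title": title,
        "episode_downloads": downloads,
        "episode_url": f"https://darknetdiaries.com/episode/{episode_number}/",
    }
```

Keeping the number as a string matches what `re.search(...).group()` returns in the diff; converting it to `int` would be a separate design choice.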
@@ -31,21 +42,65 @@ if not os.path.exists("./index/lock"):
     index.storage_context.persist(persist_dir="./index")
     open("./index/lock", 'a').close()
 else:
+    print("Loading index...")
     storage_context = StorageContext.from_defaults(persist_dir="./index")
     index = load_index_from_storage(storage_context)
 
-template = (
-    "You are now an expert on the Darknet Diaries podcast. \n"
-    "Please answer this question by referring to the podcast: {query_str}\n"
+chat_text_qa_msgs = [
+    ChatMessage(
+        role=MessageRole.SYSTEM,
+        content=(
+            "You have been trained on the Darknet Diaries podcast transcripts with data from october 6 2023."
+            "You are an expert about it and will answer as such. You know about every episode up to number 138."
+            "Always answer the question, even if the context isn't helpful."
+            "Mention the number and title of the episodes you are referring to."
+        )
+    ),
+    ChatMessage(
+        role=MessageRole.USER,
+        content=(
+            "Context information is below.\n"
+            "---------------------\n"
+            "{context_str}\n"
+            "---------------------\n"
+            "Given the context information and not prior knowledge,"
+            "answer the question: {query_str}\n"
+        )
+    )
+]
+text_qa_template = ChatPromptTemplate(chat_text_qa_msgs)
+
+chat_refine_msgs = [
+    ChatMessage(
+        role=MessageRole.SYSTEM,
+        content="Always answer the question, even if the context isn't helpful.",
+    ),
+    ChatMessage(
+        role=MessageRole.USER,
+        content=(
+            "We have the opportunity to refine the original answer "
+            "(only if needed) with some more context below.\n"
+            "------------\n"
+            "{context_msg}\n"
+            "------------\n"
+            "Given the new context, refine the original answer to better "
+            "answer the question: {query_str}. "
+            "If the context isn't useful, output the original answer again.\n"
+            "Original Answer: {existing_answer}"
+        ),
+    ),
+]
+refine_template = ChatPromptTemplate(chat_refine_msgs)
+
+chat_engine = index.as_chat_engine(
+    text_qa_template=text_qa_template,
+    refine_template=refine_template,
+    chat_mode=ChatMode.OPENAI
 )
-qa_template = PromptTemplate(template)
-query_engine = index.as_query_engine(text_qa_template=qa_template)
 
 while True:
     try:
-        user_prompt = input("Prompt: ")
-        response = query_engine.query(user_prompt)
-        print(response)
+        chat_engine.chat_repl()
     except KeyboardInterrupt:
         break
 
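Note on the `./index/lock` check in main.py: the index is built and persisted on the first run, then an empty lock file marks it as complete so later runs load from disk instead. The pattern can be sketched without llama_index as below. This is an illustration under stated assumptions, not the project's code: `load_or_build` and `data.json` are hypothetical names, and a JSON file stands in for the persisted index.

```python
import json
import os


def load_or_build(cache_dir, build):
    """Return cached data if the lock file exists, else build and persist it."""
    lock_path = os.path.join(cache_dir, "lock")
    data_path = os.path.join(cache_dir, "data.json")  # stand-in for the persisted index
    if os.path.exists(lock_path):
        with open(data_path, "r", encoding="utf-8") as f:
            return json.load(f)
    data = build()
    os.makedirs(cache_dir, exist_ok=True)
    with open(data_path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    # Create the lock only after the data is fully written, so an
    # interrupted build is retried on the next run.
    open(lock_path, "a").close()
    return data
```

Writing the lock last is the point of the pattern: checking for the directory alone could pick up a half-written cache. Deleting the lock file forces a rebuild on the next run.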