deps-managment-and-dotenv #1

Open
EndMove wants to merge 3 commits from EndMove/darknet_diaries_llm:deps-managment-and-dotenv into master
6 changed files with 157 additions and 108 deletions

14
.editorconfig Normal file

@@ -0,0 +1,14 @@
+root = true
+
+[*]
+charset = utf-8
+end_of_line = lf
+indent_size = 4
+indent_style = space
+insert_final_newline = true
+trim_trailing_whitespace = true
+max_line_length = 120
+
+[*.md]
+trim_trailing_whitespace = false
+max_line_length = 0

4
.gitignore vendored

@@ -1,3 +1,5 @@
+.env
 /transcripts
 /index
 /.idea
+/venv

README.md

@@ -9,9 +9,15 @@ Well, let's ask our LLM:
 ## How to run
 ### Install dependencies
-I have no idea what the correct way to install dependencies with python is. Somehow install these libraries and their dependencies:
-- llama_index
-- beautifulsoup4
+It is recommended to use a Python version greater than or equal to ``3.10.0``.
+
+It is also recommended to create a venv, or to use an IDE that supports venv creation, so that all dependencies are installed locally to the project rather than globally. Alternatively, you can use https://virtualenv.pypa.io/en/latest/ to create isolated environments.
+
+Install the dependencies required to run the project by running the following command at the project root:
+```shell
+pip install -r requirements.txt
+```
+
 ### Execution
 Download transcripts:
 ```shell
@@ -31,6 +37,7 @@ python3 main.py
 On the first run, it will generate the index. This can take a while, but it will be cached on disk for the next runs.
 You can then ask it any questions about Darknet Diaries!
+
 ## Examples
 > What is the intro of the podcast?
@@ -69,4 +76,4 @@ You can then ask it any questions about Darknet Diaries!
 >>The episode also covers a project where Jason was tasked with hacking into a large, worldwide bank. His job was to examine the bank's mobile app for any potential security vulnerabilities that could expose customer or sensitive information. The episode provides a detailed look into the world of penetration testing, highlighting the importance of robust security measures in both physical and digital spaces.
 >
 > How many downloads does this episode have?
->> Episode 130 of Darknet Diaries, titled "JASON'S PEN TEST", has 667,528 downloads.
\ No newline at end of file
+>> Episode 130 of Darknet Diaries, titled "JASON'S PEN TEST", has 667,528 downloads.
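The venv recommendation in the README hunk above can also be scripted with the standard library alone; a minimal sketch, assuming a POSIX layout (on Windows the pip executable lives under venv\Scripts instead):

```python
# Sketch: create ./venv and install the project requirements into it.
import subprocess
import venv

venv.create("venv", with_pip=True)  # roughly equivalent to `python -m venv venv`
subprocess.run(
    ["venv/bin/pip", "install", "-r", "requirements.txt"],
    check=True,  # fail loudly if any dependency cannot be installed
)
```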

scrape.py

@@ -5,26 +5,30 @@ import json
 folder_path = "transcripts"
-if not os.path.exists(folder_path):
+if __name__ == '__main__':
Review

Is it really necessary to have this in a file that isn't imported by other files? It's just a script, not a module.

Yup, to indicate that it's an "executable" script.
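For readers outside the thread: the guard only changes behaviour when a file can be imported; for a plain script it mostly documents intent. A minimal sketch with a hypothetical module name:

```python
# demo.py (hypothetical file name, for illustration only)
def run():
    print("scraping...")

if __name__ == '__main__':
    # Runs for `python demo.py`, but not for `import demo` --
    # the "executable script" marker discussed above.
    run()
```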
-    os.makedirs(folder_path)
-for i in range(1, 139):
-    try:
-        url = f"https://darknetdiaries.com/transcript/{i}"
-        r = requests.get(url)
-        soup = BeautifulSoup(r.text, 'html.parser')
-        transcript = soup.find('pre').get_text()
-        url = f"https://api.darknetdiaries.com/{i}.json"
-        r = requests.get(url)
-        parsed_json = json.loads(r.text)
-        title = parsed_json["episode_name"]
-        number = parsed_json["episode_number"]
-        downloads = parsed_json["total_downloads"]
-        with open(f"{folder_path}/episode_{number}.txt", "w") as f:
-            f.write(f"{title}\n{downloads}\n{transcript}")
-        print(f"{number} {title}")
-    except Exception:
-        print(f"Failed scraping episode {i}")
+    if not os.path.exists(folder_path):
+        os.makedirs(folder_path)
+    for i in range(1, 139):
+        try:
+            # fetch transcript
+            url = f"https://darknetdiaries.com/transcript/{i}"
+            r = requests.get(url)
+            soup = BeautifulSoup(r.text, 'html.parser')
+            transcript = soup.find('pre').get_text()
+            # fetch transcript metadata
+            url = f"https://api.darknetdiaries.com/{i}.json"
+            r = requests.get(url)
+            parsed_json = json.loads(r.text)
+            title = parsed_json["episode_name"]
+            number = parsed_json["episode_number"]
+            downloads = parsed_json["total_downloads"]
+            # write transcript
+            with open(f"{folder_path}/episode_{number}.txt", "w", encoding='utf-8') as f:
+                f.write(f"{title}\n{downloads}\n{transcript}")
+            print(f"{number} {title}")
+        except Exception as err:
+            print(f"Failed scraping episode {i} : [{err}]")

174
main.py

@@ -4,103 +4,109 @@ from llama_index.node_parser import SimpleNodeParser
 from llama_index import VectorStoreIndex
 from llama_index.llms import OpenAI, ChatMessage, MessageRole
 from llama_index.prompts import ChatPromptTemplate
-from llama_index import set_global_handler
+# from llama_index import set_global_handler
 from llama_index.chat_engine.types import ChatMode
+from dotenv import load_dotenv
 import os
 import re
 # set_global_handler("simple")
-llm = OpenAI(model="gpt-4", temperature=0, max_tokens=256)
+# load .env
+load_dotenv()
+OPEN_API_KEY = os.getenv('OPEN_API_KEY')
Review

Not necessary, llama-index already picks the API key up from the env vars.

For those who don't go that route :p like me ^^
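Some context for the exchange above: load_dotenv() copies the entries of a local .env file into os.environ, so any library that reads an environment variable afterwards (as llama-index does for OPENAI_API_KEY) will see them. A minimal sketch with a hypothetical DEMO_KEY entry:

```python
# Sketch: given a .env file containing `DEMO_KEY=secret` (hypothetical)
import os
from dotenv import load_dotenv

load_dotenv()                  # looks for .env in the working directory
print(os.getenv("DEMO_KEY"))   # -> "secret"; returns None if unset
```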
+# config llm context
+llm = OpenAI(model="gpt-4", temperature=0, max_tokens=256, api_key=OPEN_API_KEY)
Outdated
Review

Thanks for the free key xD

Oops ;')
 service_context = ServiceContext.from_defaults(llm=llm)
 set_global_service_context(service_context)
-if not os.path.exists("./index/lock"):
+if __name__ == '__main__':
Review

no
-    documents = []
-    for filename in os.listdir("./transcripts"):
-        episode_number = re.search(r'\d+', filename).group()
-        with open("./transcripts/" + filename, 'r') as f:
-            title = f.readline().strip()
-            downloads = f.readline().strip()
-            content = f.read()
-        document = Document(
-            text=content,
-            doc_id=filename,
-            metadata={
-                "episode_number": episode_number,
-                "episode_title": title,
-                "episode_downloads": downloads,
-                "episode_url": f"https://darknetdiaries.com/episode/{episode_number}/"
-            }
-        )
-        documents.append(document)
-    parser = SimpleNodeParser.from_defaults()
-    nodes = parser.get_nodes_from_documents(documents)
-    index = VectorStoreIndex(nodes, show_progress=True)
-    index.storage_context.persist(persist_dir="./index")
-    open("./index/lock", 'a').close()
-else:
-    print("Loading index...")
-    storage_context = StorageContext.from_defaults(persist_dir="./index")
-    index = load_index_from_storage(storage_context)
-chat_text_qa_msgs = [
-    ChatMessage(
-        role=MessageRole.SYSTEM,
-        content=(
-            "You have been trained on the Darknet Diaries podcast transcripts with data from october 6 2023."
-            "You are an expert about it and will answer as such. You know about every episode up to number 138."
-            "Always answer the question, even if the context isn't helpful."
-            "Mention the number and title of the episodes you are referring to."
-        )
-    ),
-    ChatMessage(
-        role=MessageRole.USER,
-        content=(
-            "Context information is below.\n"
-            "---------------------\n"
-            "{context_str}\n"
-            "---------------------\n"
-            "Given the context information and not prior knowledge,"
-            "answer the question: {query_str}\n"
-        )
-    )
-]
-text_qa_template = ChatPromptTemplate(chat_text_qa_msgs)
-chat_refine_msgs = [
-    ChatMessage(
-        role=MessageRole.SYSTEM,
-        content="Always answer the question, even if the context isn't helpful.",
-    ),
-    ChatMessage(
-        role=MessageRole.USER,
-        content=(
-            "We have the opportunity to refine the original answer "
-            "(only if needed) with some more context below.\n"
-            "------------\n"
-            "{context_msg}\n"
-            "------------\n"
-            "Given the new context, refine the original answer to better "
-            "answer the question: {query_str}. "
-            "If the context isn't useful, output the original answer again.\n"
-            "Original Answer: {existing_answer}"
-        ),
-    ),
-]
-refine_template = ChatPromptTemplate(chat_refine_msgs)
-chat_engine = index.as_chat_engine(
-    text_qa_template=text_qa_template,
-    refine_template=refine_template,
-    chat_mode=ChatMode.OPENAI
-)
-while True:
-    try:
-        chat_engine.chat_repl()
-    except KeyboardInterrupt:
-        break
+    if not os.path.exists("./index/lock"):
+        documents = []
+        for filename in os.listdir("./transcripts"):
+            episode_number = re.search(r'\d+', filename).group()
+            with open("./transcripts/" + filename, 'r') as f:
+                title = f.readline().strip()
+                downloads = f.readline().strip()
+                content = f.read()
+            document = Document(
+                text=content,
+                doc_id=filename,
+                metadata={
+                    "episode_number": episode_number,
+                    "episode_title": title,
+                    "episode_downloads": downloads,
+                    "episode_url": f"https://darknetdiaries.com/episode/{episode_number}/"
+                }
+            )
+            documents.append(document)
+        parser = SimpleNodeParser.from_defaults()
+        nodes = parser.get_nodes_from_documents(documents)
+        index = VectorStoreIndex(nodes, show_progress=True)
+        index.storage_context.persist(persist_dir="./index")
+        open("./index/lock", 'a').close()
+    else:
+        print("Loading index...")
+        storage_context = StorageContext.from_defaults(persist_dir="./index")
+        index = load_index_from_storage(storage_context)
+    chat_text_qa_msgs = [
+        ChatMessage(
+            role=MessageRole.SYSTEM,
+            content=(
+                "You have been trained on the Darknet Diaries podcast transcripts with data from october 6 2023."
+                "You are an expert about it and will answer as such. You know about every episode up to number 138."
+                "Always answer the question, even if the context isn't helpful."
+                "Mention the number and title of the episodes you are referring to."
+            )
+        ),
+        ChatMessage(
+            role=MessageRole.USER,
+            content=(
+                "Context information is below.\n"
+                "---------------------\n"
+                "{context_str}\n"
+                "---------------------\n"
+                "Given the context information and not prior knowledge,"
+                "answer the question: {query_str}\n"
+            )
+        )
+    ]
+    text_qa_template = ChatPromptTemplate(chat_text_qa_msgs)
+    chat_refine_msgs = [
+        ChatMessage(
+            role=MessageRole.SYSTEM,
+            content="Always answer the question, even if the context isn't helpful.",
+        ),
+        ChatMessage(
+            role=MessageRole.USER,
+            content=(
+                "We have the opportunity to refine the original answer "
+                "(only if needed) with some more context below.\n"
+                "------------\n"
+                "{context_msg}\n"
+                "------------\n"
+                "Given the new context, refine the original answer to better "
+                "answer the question: {query_str}. "
+                "If the context isn't useful, output the original answer again.\n"
+                "Original Answer: {existing_answer}"
+            ),
+        ),
+    ]
+    refine_template = ChatPromptTemplate(chat_refine_msgs)
+    chat_engine = index.as_chat_engine(
+        text_qa_template=text_qa_template,
+        refine_template=refine_template,
+        chat_mode=ChatMode.OPENAI
+    )
+    while True:
+        try:
+            chat_engine.chat_repl()
+        except KeyboardInterrupt:
+            break

16
requirements.txt Normal file

@@ -0,0 +1,16 @@
+# =====================
+# Required dependencies
+# =====================
+
+# general deps
+requests~=2.31.0
+llama-index~=0.8.40
+beautifulsoup4~=4.12.2
+python-dotenv~=1.0.0
+
+# llama sub deps
+transformers~=4.34.0
+torch~=2.1.0
+
+# =====================
+# Development dependencies
+# =====================
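A note on the pins: `~=` is pip's compatible-release operator, so `requests~=2.31.0` accepts patch releases but not 2.32. A quick check with the `packaging` library (not itself a dependency of this project):

```python
# Sketch: what a ~= pin from requirements.txt actually accepts.
from packaging.specifiers import SpecifierSet

spec = SpecifierSet("~=2.31.0")  # equivalent to >=2.31.0, <2.32.0
print("2.31.5" in spec)          # True: patch updates are allowed
print("2.32.0" in spec)          # False: minor bumps are excluded
```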