[RAG] Information Retrieval Competition: The Score Just Won't Go Up...

Why...

Why on earth.....!

Why is the score coming out this low?

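This competition assumes a scenario in which users ask general science questions, so about 4,200 science-knowledge documents are indexed in a search engine ahead of time.

When a chat message or question comes in, the system first decides whether it is a science question. If it is, the system pulls the relevant documents from the search engine and generates an answer based on them.

If the question is not about science, the system skips the search engine and generates an appropriate answer directly.

Finally, this project is not a modeling-focused competition; it is about building a RAG (Retrieval Augmented Generation) system. That makes it different from a typical modeling competition, where you push performance with multiple models, assorted techniques, and ensembles. What matters here is checking for yourself whether the search engine indexed and returns the right documents, and whether the generated answers are appropriate.

Participants are therefore advised to run their first experiments on a small toy dataset (fewer than 10 items) and only then evaluate on the full dataset. Real RAG systems are usually built the same way, and it makes experiments much more efficient. For that reason the competition runs for two weeks, with submissions limited to 5 per day.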
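The toy-dataset advice above can be sketched roughly as follows. This is only a minimal sketch: the function name `write_toy_subset` and both file paths are hypothetical, and it assumes the evaluation file is JSONL with one question per line.

```python
import json


def write_toy_subset(src_path: str, dst_path: str, n: int = 5) -> int:
    """Copy the first n non-empty JSONL records from src_path to dst_path."""
    written = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            json.loads(line)  # fail early on a malformed record
            dst.write(line if line.endswith("\n") else line + "\n")
            written += 1
            if written == n:
                break
    return written
```

Running the pipeline on the small file first (for example `write_toy_subset("eval.jsonl", "eval_toy.jsonl")`, both names made up) lets you read every retrieved document and answer by hand before spending one of the 5 daily submissions.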

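- Used FAISS

- Split the documents with SemanticChunker (4,272 originals -> 8,540 chunks)

- Used OpenAI models for both the embeddings and the LLM

- Confirmed that standalone_query is extracted well

- Ran the similarity search with the Euclidean distance that FAISS provides by default

- Changed the answer, which used to be written from the 3 finally selected documents, so that it is written from the top-ranked document only

(The first document is usually retrieved well, but the 2nd and 3rd were fairly often only loosely related)

I did all of this, and yet

the score came out lower than a plain keyword-matching search in ElasticSearch (which doesn't even need embeddings).......... 😢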
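For reference, the "plain keyword matching" side can be sketched like this. Elasticsearch's default similarity is BM25 with k1=1.2 and b=0.75; the function below is a simplified, pure-Python Okapi BM25 under the assumption of whitespace tokenization, which is not how a real Elasticsearch analyzer (especially a Korean one) tokenizes. The name `bm25_scores` and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str],
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each doc against the query with Okapi BM25 over whitespace tokens."""
    if not docs:
        return []
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: the number of docs each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.split():
            if term not in tf:
                continue  # a term absent from the doc contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


sample_docs = [
    "venus brightness cause atmosphere reflection",
    "tree classification method",
    "venus orbit",
]
sample_scores = bm25_scores("venus brightness", sample_docs)
```

A document scores only on the query terms it literally contains, with rarer terms weighted more heavily. That may be part of why short keyword-style queries do well on fact-heavy documents, though this is a guess and not something measured here.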

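While brushing my teeth, a thought suddenly hit me:

'Since these are science-knowledge documents,

could plain keyword matching actually score better at retrieving them than computing similarity with Euclidean distance?'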
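One thing worth checking for this hunch: when embedding vectors have unit length (OpenAI's documentation describes its embeddings as normalized to length 1), squared Euclidean distance equals 2 - 2·cos, so Euclidean and cosine rankings are identical. The sketch below illustrates that identity with made-up 3-dimensional vectors; the helper names are hypothetical and nothing here calls FAISS or OpenAI.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings", normalized the way unit-length embeddings would be.
query = normalize([0.9, 0.1, 0.3])
chunks = [normalize(v) for v in ([1.0, 0.0, 0.2], [0.1, 0.9, 0.1], [0.5, 0.5, 0.5])]

# Rank chunk indices by ascending L2 distance and by descending cosine similarity.
by_l2 = sorted(range(len(chunks)), key=lambda i: l2_distance(query, chunks[i]))
by_cos = sorted(range(len(chunks)), key=lambda i: -cosine_similarity(query, chunks[i]))
```

If that holds for the embeddings used here, switching the metric alone would not change which chunks come back, so the gap to keyword matching would more likely come from somewhere else, such as the chunking or the shape of the query.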

For now I'm going to bed 🥱

and I'll try again tomorrow... 😤

1. Loading the documents

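import json

# Import paths assume the LangChain 0.2-era package layout
from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document


class JSONLLoader(BaseLoader):
    """Load a JSONL file in which every line is one document record."""

    def __init__(self, file_path: str):
        self.file_path = file_path

    def load(self):
        documents = []

        with open(self.file_path, 'r', encoding='utf-8') as file:
            # seq_num is the 1-based line number of the record
            for seq_num, line in enumerate(file, 1):
                data = json.loads(line)
                doc = Document(
                    page_content=data['content'],
                    metadata={
                        'docid': data['docid'],
                        'src': data.get('src', ''),  # empty string when the 'src' field is missing
                        'source': self.file_path,
                        'seq_num': seq_num,
                    }
                )
                documents.append(doc)

        return documents

file_path = "/data/ephemeral/home/upstage-ai-final-ir2/HM/data/documents.jsonl"
loader = JSONLLoader(file_path)
documents = loader.load()

print(f"Number of documents: {len(documents)}")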

2. Splitting the documents

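from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# SemanticChunker picks split points using embedding similarity between sentences
semantic_text_splitter = SemanticChunker(
    OpenAIEmbeddings(), add_start_index=True)

# Split the documents; each chunk keeps its parent document's metadata (including 'docid')
semantic_split_documents = semantic_text_splitter.split_documents(documents)

print(f"Number of original documents: {len(documents)}")
print(f"Number of split documents: {len(semantic_split_documents)}")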

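3. Creating the embeddings and the vector store

- With ElasticSearch I had to install es, create it, and configure the index separately,

- but with FAISS the single FAISS.from_documents() call takes care of all the setup 👍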

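from langchain_community.vectorstores import FAISS

# Build the vector store from semantic_split_documents
vectorstore = FAISS.from_documents(documents=semantic_split_documents, embedding=OpenAIEmbeddings())

#>> FAISS.from_documents()
#>> converts the content of semantic_split_documents into high-dimensional vectors with the OpenAI embedding model
#>> creates a FAISS index
#>> adds the document vectors created above to the FAISS index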

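4. Creating the retriever and checking an example

- I wrote the query_prompt so that standalone_query contains only the key terms (confirmed that the output comes out the way I wanted)

- However, because the documents were split above, the same document ('docid') can be retrieved more than once for a single query

- This competition requires picking 3 results by 'docid', so I retrieve K*3 chunks from the start

-> then remove duplicates by 'docid' and return the top 3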

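from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_openai import OpenAI

# Initialize the OpenAI LLM
llm = OpenAI(temperature=0)

# Query for the similarity search test
# (English: "What is a way to investigate the classification of trees?")
query = "나무의 분류에 대해 조사해 보기 위한 방법은?"

# Standalone Query Generator prompt template.
# The prompt is left in Korean because the questions and documents are Korean.
# English gloss: "I want to summarize the question query. Output the topic,
# including the key content. Here is an example. Original question: 'Why does
# Venus look so bright?' standalone_query to generate: 'Venus brightness cause'.
# Original question: {question} / Standalone question:"
standalone_query_prompt = PromptTemplate(
    input_variables=["question"],
    template="""질문 query를 요약하려고 합니다. 핵심내용을 포함한 주제를 출력해주세요.
    아래는 예시입니다. 
    
    원래의 질문: "금성이 밝게 보이는 이유가 뭐야?"
    생성할 standalone_query: "금성 밝기 원인"

    원래의 질문: {question}
    독립적인 질문:
    """
)

# Create the Standalone Query Generator chain
standalone_query_chain = LLMChain(llm=llm, prompt=standalone_query_prompt)

# Generate the standalone query
standalone_query = standalone_query_chain.run(query)

print(f"Original Query: {query}")
print(f"Standalone Query: {standalone_query}")


# Search with the generated standalone query (retrieve K*3 chunks)
K = 3
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": K*3})
search_result = retriever.get_relevant_documents(standalone_query)

# Remove duplicate docids and keep the top 3
unique_docs = []
seen_docids = set()

for doc in search_result:
    docid = doc.metadata.get('docid')
    if docid not in seen_docids:
        unique_docs.append(doc)
        seen_docids.add(docid)
        if len(unique_docs) == K:
            break

# Print the results
for i, doc in enumerate(unique_docs, 1):
    print(f"\nDocument {i}:")
    print(f"Content: {doc.page_content[:100]}...")  # first 100 characters only
    print(f"Metadata: {doc.metadata}")
    print("---")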
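The de-duplication loop above is needed again when generating the submission, so it could be pulled out into a small helper. This is just one possible sketch: the name `top_k_unique` is made up, and it only assumes that each item has a `metadata` dict, as LangChain `Document` objects do.

```python
from typing import Any, Iterable


def top_k_unique(docs: Iterable[Any], k: int = 3, key: str = "docid") -> list[Any]:
    """Return the first k docs with distinct metadata[key], keeping rank order."""
    if k <= 0:
        return []
    unique, seen = [], set()
    for doc in docs:
        value = doc.metadata.get(key)
        if value in seen:
            continue  # a higher-ranked chunk of this document is already kept
        seen.add(value)
        unique.append(doc)
        if len(unique) == k:
            break
    return unique
```

With this in place the loop would shrink to `unique_docs = top_k_unique(search_result, k=K)`. Note that if more than K*2 of the retrieved chunks are duplicates, fewer than K documents come back, so the K*3 over-fetch is a heuristic and not a guarantee.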

5. Generating the submission output

from langchain import hub

# Pull the RAG prompt and build the chain once, not on every question
rag_prompt = hub.pull("rlm/rag-prompt")
rag_chain = LLMChain(llm=llm, prompt=rag_prompt)

def answer_question(query):
    # Generate the standalone query once; strip whitespace and surrounding quotes
    standalone_query = standalone_query_chain.run(query).strip().strip('"')

    # Run the search (retrieve K*3 chunks)
    K = 3
    retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": K*3})
    search_result = retriever.get_relevant_documents(standalone_query)

    # Remove duplicate docids and keep the top 3
    unique_docs = []
    seen_docids = set()

    for doc in search_result:
        docid = doc.metadata.get('docid')
        if docid not in seen_docids:
            unique_docs.append(doc)
            seen_docids.add(docid)
            if len(unique_docs) == K:
                break

    # Use only the content of the first document as context
    context = unique_docs[0].page_content if unique_docs else ""

    # Generate the answer (based on the first document only)
    answer = rag_chain.run(context=context, question=query)

    # Extract the topk and references information
    topk = [doc.metadata.get('docid') for doc in unique_docs]
    references = [
        {
            # The metadata built by JSONLLoader has no 'score' key, so this falls back to 0
            "score": doc.metadata.get('score', 0),
            "content": doc.page_content
        }
        for doc in unique_docs
    ]

    return {
        "standalone_query": standalone_query,
        "topk": topk,
        "answer": answer,
        "references": references
    }

def eval_rag(eval_filename, output_filename):
    with open(eval_filename, encoding="utf-8") as f, open(output_filename, "w", encoding="utf-8") as of:
        for idx, line in enumerate(f):
            j = json.loads(line)
            print(f'Test {idx}\nQuestion: {j["msg"]}')
            response = answer_question(j["msg"])
            print(f'Answer: {response["answer"]}\n')

            output = {
                "eval_id": j["eval_id"],
                "standalone_query": response["standalone_query"],
                "topk": response["topk"],
                "answer": response["answer"],
                "references": response["references"]
            }
            of.write(f'{json.dumps(output, ensure_ascii=False)}\n')
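With only 5 submissions a day, it may be worth sanity-checking the output file before uploading it. The sketch below checks each line against the fields that eval_rag writes. The function name `validate_submission_lines` and the specific checks are assumptions drawn from the code above, not from the official competition rules.

```python
import json

REQUIRED_KEYS = {"eval_id", "standalone_query", "topk", "answer", "references"}


def validate_submission_lines(lines, k=3):
    """Return a list of (line_number, problem) tuples; empty means nothing was flagged."""
    problems = []
    for number, line in enumerate(lines, 1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append((number, f"missing keys: {sorted(missing)}"))
            continue
        topk = record["topk"]
        if len(topk) > k:
            problems.append((number, f"more than {k} docids in topk"))
        if len(set(topk)) != len(topk):
            problems.append((number, "duplicate docid in topk"))
    return problems
```

A typical use would be to open the file that eval_rag produced and pass the file object to `validate_submission_lines`, then read through whatever it flags.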
