Similarity scores in Chroma

2024-09-07 16:16:31
The docstring of similarity_search_with_score in LangChain's Chroma wrapper describes its arguments and return value:

        Args:
            query (str): Query text to search for.
            k (int): Number of results to return. Defaults to 4.
            filter (Optional[Dict[str, str]]): Filter by metadata. Defaults to None.

        Returns:
            List[Tuple[Document, float]]: List of documents most similar to
            the query text and cosine distance in float for each.
            Lower score represents more similarity.

docs = vectordb.similarity_search_with_score(query)
docs[0]
# (Document(page_content='Tonight. I c..inds, who will...xcellence.', metadata={'source': '../../../state_of_the_union.txt'}), 1.1972057819366455)

eg2: with a filter
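docs = vectordb.similarity_search_with_score(query=ask, k=2, filter=dict(source='./uploads/yisheng_update.docx'))
# [(Document(page_content='更多详细咨询营养师...的原因。', metadata={'source': './uploads/yisheng_update.docx'}), 186.72679092063976), (Document(page_content='益生菌就定植于..最直接的关系,能够在.的...畅,自然和谐统一的状态。', metadata={'source': './uploads/yisheng_update.docx'}), 221.41649154602675)]

The relevance value (score) is still rather puzzling: how low does it have to be to count as relevant, and how high before it counts as irrelevant? Scores such as 186.7 and 221.4 are far outside the 0 to 2 range of a cosine distance. Chroma collections use squared L2 distance unless configured otherwise, which would explain why the scale depends so heavily on the embedding model.

I ran some tests to find out. All of the texts were in Chinese, and the embeddings were computed with two libraries, OpenAIEmbeddings and text2vec. Comparing the resulting scores gave: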
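With OpenAIEmbeddings, a score below 0.3 means high relevance, and a score above 0.4 is basically irrelevant;
With text2vec, a score below 256 counts as high relevance, and a score above 300 is basically irrelevant!

eg: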
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
from dotenv import dotenv_values

env_vars = dotenv_values('.env')

# Load the database
persist_directory = './chromac'
collection = 'bigccx'
embedding = OpenAIEmbeddings(
    model="text-embedding-ada-002",
    openai_api_key=env_vars['OPENAI_API_KEY']
)
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding, collection_name=collection)

# Look up the text most similar to the query
def queryVectorDB(ask):
    s = vectordb.similarity_search_with_score(query=ask, k=1) 
    if len(s) == 0:
        return ""
    else:
        if s[0][1] < 0.3:   # Return the text only when it is strongly related, otherwise return nothing. shiba < 256  openai < 0.3
            return s[0][0].page_content
        else:
            return ""
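
The same cutoff idea can be applied to more than one result at a time. Below is a minimal sketch, assuming the thresholds observed above (0.3 for OpenAIEmbeddings, 256 for text2vec). The names RELEVANT_BELOW and filter_relevant are hypothetical helpers, not part of LangChain or Chroma, and the sample list is stand-in data shaped like the output of similarity_search_with_score rather than real query results.

```python
from typing import Any, List, Tuple

# Empirical cutoffs from the tests described above; these are NOT library
# constants and will likely need retuning for other data or models.
RELEVANT_BELOW = {"openai": 0.3, "text2vec": 256.0}


def filter_relevant(
    results: List[Tuple[Any, float]], embedding_name: str
) -> List[Tuple[Any, float]]:
    """Keep only (document, score) pairs whose distance is under the cutoff."""
    cutoff = RELEVANT_BELOW[embedding_name]
    # Lower score means more similar, so keep scores strictly below the cutoff.
    return [(doc, score) for doc, score in results if score < cutoff]


# Stand-in data shaped like the return value of similarity_search_with_score.
sample = [("doc A", 0.21), ("doc B", 0.35), ("doc C", 0.47)]
print(filter_relevant(sample, "openai"))
```

With a real store, the first argument would be something like vectordb.similarity_search_with_score(query=ask, k=4). Scores between the two observed bounds (0.3 to 0.4 for OpenAIEmbeddings) are treated as not relevant here, which is a conservative choice rather than something the tests settled.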
eg2: the text2vec model is of mediocre quality; OpenAIEmbeddings is recommended
from langchain.vectorstores import Chroma
import sentence_transformers
from