During the Hugging Face community event, I learned how to use FAISS (Facebook AI Similarity Search) to find the documents that are most semantically similar to a given query. The goal of this project is to extend this idea into a retrieval and reranking system, where the retriever returns possibly relevant results and the reranker evaluates how relevant these hits are to the query.
An example of the architecture might look as follows (taken from the
sentence-transformers models on the Hub are great for the reranking task.
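The retrieve-then-rerank flow described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the code used in the app: the names `overlap_score`, `cosine_score`, `retrieve_and_rerank`, `top_k`, and `top_n` are hypothetical, and both scorers are toy word-count functions standing in for a FAISS embedding search (stage 1) and a sentence-transformers reranking model (stage 2).

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query: str, doc: str) -> float:
    """Cheap stage-1 score: number of distinct query tokens found in the doc.
    Assumption: stands in for a nearest-neighbour search over embeddings."""
    return float(len(set(tokenize(query)) & set(tokenize(doc))))


def cosine_score(query: str, doc: str) -> float:
    """Stage-2 score: cosine similarity between bag-of-words count vectors.
    Assumption: stands in for a learned reranking model."""
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


def retrieve_and_rerank(
    query: str,
    corpus: List[str],
    retriever: Scorer = overlap_score,
    reranker: Scorer = cosine_score,
    top_k: int = 3,
    top_n: int = 2,
) -> List[Tuple[str, float]]:
    # Stage 1: keep the top_k candidates according to the cheap retriever.
    candidates = sorted(corpus, key=lambda doc: retriever(query, doc), reverse=True)[:top_k]
    # Stage 2: rescore only those candidates with the more careful reranker.
    scored = [(doc, reranker(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


corpus = [
    "FAISS builds an index for fast similarity search over dense vectors.",
    "A reranker scores each query and document pair to refine the ranking.",
    "Streamlit and Gradio make it easy to build demo apps.",
    "Similarity search finds the vectors closest to a query vector.",
]
results = retrieve_and_rerank("fast similarity search", corpus)
for doc, score in results:
    print(f"{score:.3f}  {doc}")
```

The point of the two stages is cost: the retriever touches every document and so has to be cheap, while the reranker only sees the `top_k` candidates and can afford to be slower and more accurate.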
Wikipedia is usually a good corpus to test retrieval systems on, and you can find a dump in various languages here:
Implementing the full retriever-reranker architecture might be a challenge, so a simpler place to start is with a single long document. You can then chunk that document into paragraphs and compute a relevance score for each paragraph.
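As a rough sketch of this single-document starting point, the snippet below splits a document on blank lines and ranks the paragraphs against a query. The helper names (`split_paragraphs`, `relevance`, `top_paragraphs`) are hypothetical, and `relevance` is a deliberately simple placeholder (the fraction of query words that appear in the paragraph); in practice you would presumably swap in scores from an embedding or reranking model.

```python
import re
from typing import List, Set, Tuple


def split_paragraphs(document: str) -> List[str]:
    """Chunk a document into paragraphs, using blank lines as separators."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def words(text: str) -> Set[str]:
    """Return the set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance(query: str, paragraph: str) -> float:
    """Placeholder score (an assumption, not a real model):
    the fraction of query words that also occur in the paragraph."""
    query_words = words(query)
    if not query_words:
        return 0.0
    return len(query_words & words(paragraph)) / len(query_words)


def top_paragraphs(query: str, document: str, k: int = 5) -> List[Tuple[float, str]]:
    """Score every paragraph and return the k highest-scoring ones."""
    scored = [(relevance(query, p), p) for p in split_paragraphs(document)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


document = "\n\n".join(
    [
        "Paris is the capital of France.",
        "The Eiffel Tower is a landmark in Paris, the capital city.",
        "Bananas are rich in potassium.",
    ]
)
hits = top_paragraphs("capital of France", document, k=2)
for score, paragraph in hits:
    print(f"{score:.2f}  {paragraph}")
```

The default `k=5` mirrors the "top 5 most relevant paragraphs" target; when the document has fewer than `k` paragraphs, all of them are returned.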
The complete code used to develop this app during the event can be found on my GitHub:
Desired project outcomes:
- Create a Streamlit or Gradio app on Spaces that allows a user to enter a search query about a document (or a whole corpus of documents), and returns the top 5 most relevant paragraphs.
A demo of this app can be found on Hugging Face Spaces: