What is LongCite?
LongCite is an open-source project by Tsinghua University designed to enhance the credibility and verifiability of large language models (LLMs) in long-text question-answering tasks. It generates fine-grained sentence-level citations, allowing users to verify the accuracy of the model's responses. The project includes the LongBench-Cite evaluation benchmark, the CoF automated data construction process, the LongCite-45k dataset, and the LongCite-8B and LongCite-9B models trained on this dataset. These models can process long texts and provide accurate answers with direct citations, improving transparency and reliability.
Main Features of LongCite
- Generate Fine-Grained Citations: LongCite enables language models to generate precise, sentence-level citations when answering long-text questions, allowing users to trace back to specific information in the original text.
- Improve Answer Faithfulness: LongCite helps ensure that the model's answers are more faithful to the original text, reducing "hallucinations" (i.e., generating information that does not match the original text).
- Enhance Verifiability: Users can verify the authenticity and accuracy of answers based on the fine-grained citations provided by the model, increasing the credibility of the model's output.
- Automated Data Construction: LongCite uses the CoF (Coarse to Fine) process to automatically generate high-quality long-text question-answering data with fine-grained citations, providing rich annotated resources for model training.
- Evaluation Benchmark: LongCite introduces the LongBench-Cite evaluation benchmark, which measures a model's ability to generate citations in long-text question-answering tasks in terms of both answer correctness and citation quality.
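To make the idea of sentence-level citations concrete, here is a minimal sketch of how an answer statement could be traced back to numbered sentences in the source text. It is an illustration only, not LongCite's actual data format or code: the `Statement` class, the `split_sentences` and `resolve` helpers, and the naive punctuation-based splitter are all assumptions introduced for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class Statement:
    """One answer statement plus the sentence ranges it cites (hypothetical format)."""
    text: str
    spans: list[tuple[int, int]]  # inclusive (start, end) sentence indices


def split_sentences(context: str) -> list[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]


def resolve(statement: Statement, sentences: list[str]) -> list[str]:
    """Return the source sentences a statement cites, ignoring out-of-range spans."""
    cited = []
    for start, end in statement.spans:
        if 0 <= start <= end < len(sentences):
            cited.extend(sentences[start:end + 1])
    return cited


context = (
    "The river floods every spring. Farmers plant rice after the water recedes. "
    "Harvest takes place in autumn. The village holds a festival in winter."
)
sentences = split_sentences(context)

answer = [
    Statement("Rice is planted once the spring flood recedes.", [(0, 1)]),
    Statement("The crop is harvested in autumn.", [(2, 2)]),
]

# Show each statement next to the exact sentences that support it.
for st in answer:
    print(st.text, "->", resolve(st, sentences))
```

A reader (or an automated checker) can then compare each statement against only the few sentences it cites instead of rereading the whole document, which is what makes fine-grained citations easier to verify than document-level ones.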
Technical Principles of LongCite
- Long-Text Processing Capability: LongCite builds on large language models with ultra-long context windows (e.g., GLM-4-9B-1M, Gemini 1.5), which can process and understand texts of tens of thousands of words.
- Fine-Grained Citation Generation: LongCite trains models to generate precise, sentence-level citations, allowing each answer to be traced back to specific sentences in the original text, enhancing answer verifiability.
- Automated Data Construction Process (CoF): The pipeline first uses the Self-Instruct method to automatically generate question-answer pairs from long texts. It then retrieves the sentence blocks related to each answer from the long text and generates block-level citations. Finally, working from the block-level citations, it extracts the specific sentences that support each statement to produce sentence-level citations.
- Supervised Fine-Tuning (SFT): Large language models are fine-tuned on the high-quality, citation-annotated dataset generated by the CoF process, improving their performance in long-text question-answering tasks.
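The coarse-to-fine idea behind CoF (block-level citations first, sentence-level citations second) can be sketched in a few lines. This is a toy illustration under strong assumptions: the real pipeline relies on an LLM and a retriever, whereas the sketch below substitutes simple word overlap for both. The function names (`coarse_to_fine`, `make_blocks`, `overlap`) and the parameters `block_size`, `top_k`, and `min_overlap` are hypothetical.

```python
import re


def words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a: str, b: str) -> int:
    """Number of shared tokens; a stand-in (assumption) for a retriever or LLM judgment."""
    return len(words(a) & words(b))


def make_blocks(sentences: list[str], size: int) -> list[tuple[int, list[str]]]:
    """Group consecutive sentences into (start_index, sentences) blocks."""
    return [(i, sentences[i:i + size]) for i in range(0, len(sentences), size)]


def coarse_to_fine(statement, sentences, block_size=2, top_k=1, min_overlap=2):
    # Coarse step: rank blocks by overlap with the statement and keep the top_k.
    blocks = make_blocks(sentences, block_size)
    ranked = sorted(
        blocks, key=lambda b: overlap(statement, " ".join(b[1])), reverse=True
    )
    # Fine step: inside the kept blocks, cite only sentences with enough overlap.
    cited = []
    for start, block in ranked[:top_k]:
        for offset, sentence in enumerate(block):
            if overlap(statement, sentence) >= min_overlap:
                cited.append(start + offset)
    return sorted(cited)


sentences = [
    "The bridge opened in 1932.",
    "It spans the harbour and carries rail traffic.",
    "Local cafes sell coffee nearby.",
    "Tourists often photograph the skyline.",
]

# Indices of the sentences selected as citations for this statement.
print(coarse_to_fine("The bridge opened in 1932 and carries rail traffic.", sentences))
```

Narrowing to a few candidate blocks before selecting individual sentences keeps the fine-grained step cheap, since only a small part of the long text has to be examined sentence by sentence.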
Application Scenarios of LongCite
- Academic Research: Researchers and scholars use LongCite to query extensive literature and obtain detailed answers with citations, supporting research work.
- Legal Consultation: Legal professionals use LongCite to analyze legal documents, obtaining specific legal provisions or case citations to support legal analysis and case studies.
- Financial Analysis: Financial analysts and investors use LongCite to understand complex financial reports and market research, obtaining accurate citations for key data and trends.
- Medical Consultation: Medical professionals rely on LongCite to query medical literature, obtaining diagnostic and treatment recommendations based on the latest research findings with citations.
- News Reporting: Journalists and news agencies use LongCite to verify information in reports, ensuring the accuracy of published news content and providing reliable source citations.