LongCite

by Tsinghua University

What is LongCite?

LongCite is an open-source project from Tsinghua University that aims to make large language models (LLMs) more trustworthy and verifiable in long-text question answering. Its models attach fine-grained, sentence-level citations to their answers, so users can check each statement against the source text. The project comprises the LongBench-Cite evaluation benchmark, the CoF (Coarse to Fine) automated data construction process, the LongCite-45k dataset, and two models trained on that dataset, LongCite-8B and LongCite-9B. Both models read a long document and answer questions about it with citations that point directly to the supporting sentences, which improves transparency and reliability.

Main Features of LongCite

  • Generate Fine-Grained Citations: LongCite enables language models to generate precise, sentence-level citations when answering long-text questions, allowing users to trace back to specific information in the original text.
  • Improve Answer Faithfulness: LongCite helps ensure that the model's answers are more faithful to the original text, reducing "hallucinations" (i.e., generating information that does not match the original text).
  • Enhance Verifiability: Users can verify the authenticity and accuracy of answers based on the fine-grained citations provided by the model, increasing the credibility of the model's output.
  • Automated Data Construction: LongCite uses the CoF (Coarse to Fine) process to automatically generate high-quality long-text question-answering data with fine-grained citations, providing rich annotated resources for model training.
  • Evaluation Benchmark: LongCite introduces the LongBench-Cite benchmark, which measures how well a model answers long-text questions with citations, covering both answer correctness and citation quality.
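
To make the verifiability feature above concrete, here is a minimal Python sketch of how a cited answer could be mapped back to source sentences. The output format is an assumption for illustration only: each statement is followed by `<cite>[a-b]</cite>`, where `a-b` is an inclusive range of zero-based sentence indices. The function names (`split_sentences`, `parse_statements`, `resolve`) are hypothetical and are not part of the LongCite codebase.

```python
import re

# Assumed (hypothetical) citation markup: "<cite>[0-1][4-4]</cite>" after a statement,
# where each [a-b] is an inclusive range of sentence indices in the source text.
CITE_RE = re.compile(r"<cite>(.*?)</cite>", re.DOTALL)
SPAN_RE = re.compile(r"\[(\d+)-(\d+)\]")


def split_sentences(text):
    """Naive splitter on ., ! and ?; a real system would use a proper tokenizer."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def parse_statements(answer):
    """Return (statement, [sentence indices]) pairs from a cited answer.

    Text after the last <cite> tag is ignored in this sketch.
    """
    results = []
    pos = 0
    for match in CITE_RE.finditer(answer):
        statement = answer[pos:match.start()].strip()
        indices = []
        for start, end in SPAN_RE.findall(match.group(1)):
            indices.extend(range(int(start), int(end) + 1))
        results.append((statement, indices))
        pos = match.end()
    return results


def resolve(answer, context):
    """Attach the cited source sentences to each statement in the answer."""
    sentences = split_sentences(context)
    resolved = []
    for statement, indices in parse_statements(answer):
        # Out-of-range indices are skipped rather than raising an error.
        evidence = [sentences[i] for i in indices if 0 <= i < len(sentences)]
        resolved.append({"statement": statement, "evidence": evidence})
    return resolved


if __name__ == "__main__":
    context = "The Nile is in Africa. It flows north. Paris is in France."
    answer = ("The Nile flows north through Africa.<cite>[0-1]</cite> "
              "Paris is in France.<cite>[2-2]</cite>")
    for item in resolve(answer, context):
        print(item["statement"], "<-", item["evidence"])
```

A reader can then compare each statement with its evidence sentences directly instead of searching the whole document.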

Technical Principles of LongCite

  • Long-Text Processing Capability: LongCite targets large language models with ultra-long context windows (e.g., GLM-4-9B-1M, Gemini 1.5), which can read and reason over texts of tens of thousands of words or more in a single pass.
  • Fine-Grained Citation Generation: LongCite trains models to generate precise, sentence-level citations, allowing each answer to be traced back to specific sentences in the original text, enhancing answer verifiability.
  • Automated Data Construction Process (CoF): The process runs in three stages: (1) it uses the Self-Instruct method to automatically generate question-answer pairs from long texts; (2) it retrieves the sentence blocks in the long text that relate to each answer and generates block-level citations; (3) within the cited blocks, it extracts the specific sentences that support each statement, producing sentence-level citations.
  • Supervised Fine-Tuning (SFT): Fine-tunes large language models using the high-quality dataset generated by the CoF process with fine-grained citations, improving the model's performance in long-text question-answering tasks.
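
The coarse-to-fine idea behind CoF can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's implementation: a toy word-overlap score stands in for the retriever and LLM calls, the Self-Instruct question-answer generation stage is skipped (the statements are given as input), and all function names and parameters (`chunk_size`, `top_k`, `min_overlap`) are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase word tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a, b):
    """Toy relevance score: number of shared word tokens (stand-in for a real retriever or LLM)."""
    return len(tokenize(a) & tokenize(b))


def make_chunks(sentences, size):
    """Group sentence indices into consecutive chunks of at most `size` sentences."""
    return [list(range(i, min(i + size, len(sentences))))
            for i in range(0, len(sentences), size)]


def coarse_step(statement, sentences, chunks, top_k):
    """Chunk-level citation: keep the top_k chunks most similar to the statement."""
    scored = []
    for chunk in chunks:
        text = " ".join(sentences[i] for i in chunk)
        scored.append((overlap(statement, text), chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in scored[:top_k] if score > 0]


def fine_step(statement, sentences, cited_chunks, min_overlap):
    """Sentence-level citation: keep only supporting sentences inside the cited chunks."""
    keep = []
    for chunk in cited_chunks:
        for i in chunk:
            if overlap(statement, sentences[i]) >= min_overlap:
                keep.append(i)
    return sorted(keep)


def coarse_to_fine(statements, sentences, chunk_size=2, top_k=1, min_overlap=3):
    """Return (statement, [cited sentence indices]) for each statement."""
    chunks = make_chunks(sentences, chunk_size)
    results = []
    for statement in statements:
        cited = coarse_step(statement, sentences, chunks, top_k)
        results.append((statement, fine_step(statement, sentences, cited, min_overlap)))
    return results


if __name__ == "__main__":
    sentences = [
        "The Nile is the longest river in Africa.",
        "It flows north into the Mediterranean Sea.",
        "Paris is the capital of France.",
        "The Seine runs through Paris.",
    ]
    statements = ["The Nile is a river in Africa.", "Paris is the capital of France."]
    for statement, cites in coarse_to_fine(statements, sentences):
        print(statement, cites)
```

Narrowing from chunks to sentences keeps the fine-grained step cheap, because it only inspects the few chunks the coarse step selected rather than the whole document.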

Application Scenarios of LongCite

  • Academic Research: Researchers and scholars can use LongCite to query extensive literature and obtain detailed answers with citations that support their research.
  • Legal Consultation: Legal professionals can use LongCite to analyze legal documents and obtain citations to specific provisions or cases in support of legal analysis and case studies.
  • Financial Analysis: Financial analysts and investors can use LongCite to work through complex financial reports and market research, with citations for key data and trends.
  • Medical Consultation: Medical professionals can use LongCite to query medical literature and trace diagnostic or treatment information back to the cited research findings.
  • News Reporting: Journalists and news agencies can use LongCite to check information in reports against source documents and to provide reliable source citations.

Model Capabilities

Model Type: language
Supported Tasks: Long-Text Question-Answering, Citation Generation, Verification
Tags: LLM, Citation, Verifiability, Natural Language Processing, Open Source, Academic Research, Legal Consultation, Financial Analysis, Medical Consultation, News Reporting

Usage & Integration

Pricing: Free
License: Apache-2.0 (open source)

Stats

484 GitHub Stars

Similar Models

LongWriter by Tsinghua University and Zhipu AI
Pixtral12B by Mistral AI