LongCite

by Tsinghua University

LongCite is a project by Tsinghua University aimed at improving the credibility and verifiability of large language models (LLMs) in long-text question-answering tasks by generating fine-grained sentence-level citations.

What is LongCite?

LongCite is an open-source project that trains LLMs to answer questions over long documents and to attach fine-grained, sentence-level citations to each statement, so users can check every claim against the source text. The project comprises the LongBench-Cite evaluation benchmark, the CoF (Coarse to Fine) automated data construction process, the LongCite-45k dataset, and two models trained on that dataset, LongCite-8B and LongCite-9B. These models process long texts and return answers with direct citations, which makes their output more transparent and easier to verify.

Main Features of LongCite

Generate Fine-Grained Citations: LongCite enables language models to generate precise, sentence-level citations when answering long-text questions, allowing users to trace back to specific information in the original text.
Improve Answer Faithfulness: LongCite helps ensure that the model's answers are more faithful to the original text, reducing "hallucinations" (i.e., generating information that does not match the original text).
Enhance Verifiability: Users can verify the authenticity and accuracy of answers based on the fine-grained citations provided by the model, increasing the credibility of the model's output.
Automated Data Construction: LongCite uses the CoF (Coarse to Fine) process to automatically generate high-quality long-text question-answering data with fine-grained citations, providing rich annotated resources for model training.
Evaluation Benchmark: LongCite introduces the LongBench-Cite evaluation benchmark to measure the model's ability to generate citations in long-text question-answering tasks, including correctness and citation quality.
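
To make sentence-level citations concrete, the following Python sketch shows how a cited answer could be traced back to numbered source sentences. The tag format (<statement> and <cite>[i-j]</cite>), the naive sentence splitter, the toy context, and the function names are all assumptions for illustration; they are not taken from LongCite's actual output format or API.

```python
import re

def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# Hypothetical tag format: <statement>text<cite>[i-j][k-l]</cite></statement>
STATEMENT_RE = re.compile(r"<statement>(.*?)<cite>(.*?)</cite></statement>", re.S)
SPAN_RE = re.compile(r"\[(\d+)-(\d+)\]")

def resolve_citations(answer, sentences):
    """Return (statement, [cited source sentences]) pairs for a tagged answer."""
    resolved = []
    for statement, cite in STATEMENT_RE.findall(answer):
        cited = []
        for start, end in SPAN_RE.findall(cite):
            # Spans are 1-based and inclusive in this sketch.
            cited.extend(sentences[int(start) - 1:int(end)])
        resolved.append((statement.strip(), cited))
    return resolved

# Toy context and answer, invented for this example.
context = ("The archive opened in 1990. It stores printed maps. "
           "Access is free for students. Opening hours vary by season.")
answer = ("<statement>The archive holds printed maps.<cite>[2-2]</cite></statement>"
          "<statement>Students can use it at no cost, though hours change "
          "seasonally.<cite>[3-4]</cite></statement>")

sentences = split_sentences(context)
resolved = resolve_citations(answer, sentences)
for statement, cited in resolved:
    print(statement, "->", cited)
```

Each statement is paired with the exact sentences it cites, so a reader (or an automated checker) only has to compare a claim against a handful of sentences instead of the whole document.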

Technical Principles of LongCite

Long-Text Processing Capability: LongCite targets large language models with ultra-long context windows (e.g., GLM-4-9B-1M, Gemini 1.5), which can process and understand texts of tens of thousands of words.
Fine-Grained Citation Generation: LongCite trains models to generate precise, sentence-level citations, allowing each answer to be traced back to specific sentences in the original text, enhancing answer verifiability.
Automated Data Construction Process (CoF): The pipeline runs in three stages. First, it uses the Self-Instruct method to generate question-answer pairs from long texts. Second, it retrieves the sentence blocks related to each answer from the long text and produces block-level citations. Third, within the cited blocks, it extracts the specific sentences that support each statement, yielding sentence-level citations.
Supervised Fine-Tuning (SFT): LongCite fine-tunes large language models on the high-quality, citation-annotated dataset produced by the CoF process, improving their performance in long-text question-answering tasks.
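
The coarse-to-fine idea behind CoF can be sketched in a few lines of Python. This is only an illustration under stated assumptions: word overlap stands in for the retriever and the LLM judgments that a real pipeline would use, and the function names, chunk size, and threshold are hypothetical rather than taken from the LongCite code.

```python
import re

def tokens(text):
    """Lowercased word set; a crude stand-in for a real retriever or LLM judgment."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(a, b):
    """Fraction of the words in `a` that also appear in `b`."""
    ta = tokens(a)
    return len(ta & tokens(b)) / len(ta) if ta else 0.0

def make_chunks(sentences, size):
    """Group consecutive sentences into chunks of 0-based sentence indices."""
    return [list(range(i, min(i + size, len(sentences))))
            for i in range(0, len(sentences), size)]

def coarse_to_fine(statement, sentences, chunk_size=2, top_k=1, threshold=0.3):
    """Coarse step: pick the top_k chunks. Fine step: keep supporting sentences."""
    chunks = make_chunks(sentences, chunk_size)
    ranked = sorted(
        chunks,
        key=lambda idxs: overlap(statement, " ".join(sentences[i] for i in idxs)),
        reverse=True,
    )
    chunk_citations = ranked[:top_k]
    # Within the cited chunks, keep only sentences that overlap enough.
    sentence_citations = sorted(
        i for idxs in chunk_citations for i in idxs
        if overlap(statement, sentences[i]) >= threshold
    )
    return chunk_citations, sentence_citations

# Toy data, invented for this example.
sentences = [
    "The archive opened in 1990.",
    "It stores printed maps.",
    "Access is free for students.",
    "Opening hours vary by season.",
]
statement = "Access to the archive is free for students."
chunk_citations, sentence_citations = coarse_to_fine(statement, sentences)
print(chunk_citations, sentence_citations)
```

The coarse step narrows the search to a small block of text, and the fine step then discards sentences inside that block that do not support the statement, which is why the final citation is tighter than the retrieved block.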

Project Links for LongCite

GitHub Repository: https://github.com/THUDM/LongCite
HuggingFace Model Library: https://huggingface.co/THUDM
arXiv Technical Paper: https://arxiv.org/pdf/2409.02897

Application Scenarios of LongCite

Academic Research: Researchers and scholars can use LongCite to query extensive literature and obtain detailed answers with citations that point to the supporting passages.
Legal Consultation: Legal professionals can use LongCite to analyze legal documents and obtain citations to specific provisions or cases in support of legal analysis and case studies.
Financial Analysis: Financial analysts and investors can use LongCite to work through complex financial reports and market research, with citations for key data and trends.
Medical Consultation: Medical professionals can use LongCite to query medical literature and retrieve cited findings relevant to diagnosis and treatment, which they can then check against the source.
News Reporting: Journalists and news agencies can use LongCite to verify information in reports against source documents and to provide reliable source citations.

Model Capabilities

Model Type: Language
Supported Tasks: Long-Text Question-Answering, Citation Generation, Verification

Usage & Integration

Pricing: Free
License: Apache-2.0 (open source)
