RAG pipeline for synthesizing peer-reviewed scientific literature using open-source LLMs with directional ablation, deployed on private infrastructure.

Architecture

01

Corpus

11 async data sources, deduplication by DOI, paragraph chunking, MiniLM embeddings into ChromaDB.
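The deduplication and chunking steps can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function names, the record shape (a dict with an optional `doi` key), and the `min_chars` threshold are all assumptions made for the example.

```python
import re


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip URL or 'doi:' prefixes so variants compare equal."""
    doi = doi.strip().lower()
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi)


def dedupe_by_doi(records: list[dict]) -> list[dict]:
    """Keep the first record seen per DOI; records without a DOI are all kept."""
    seen: set[str] = set()
    unique: list[dict] = []
    for rec in records:
        doi = rec.get("doi")
        if not doi:
            unique.append(rec)  # nothing to key on, so keep it
            continue
        key = normalize_doi(doi)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def chunk_paragraphs(text: str, min_chars: int = 40) -> list[str]:
    """Split on blank lines and drop fragments shorter than min_chars (assumed threshold)."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text)]
    return [p for p in paragraphs if len(p) >= min_chars]
```

Keeping the first record per DOI means source order decides which copy survives; a real pipeline might instead merge metadata from the duplicates.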

02

Retrieval

Dense vector search + BM25 sparse retrieval, reciprocal rank fusion, cross-encoder reranking.
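Reciprocal rank fusion combines the dense and BM25 result lists using ranks alone, so the two score scales never need to be calibrated against each other. A minimal sketch, assuming each retriever returns an ordered list of document IDs; the function name is hypothetical, and `k = 60` is the commonly used default rather than a value confirmed for this pipeline.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense = ["a", "b", "c"]   # assumed output of the vector search
sparse = ["b", "d", "a"]  # assumed output of BM25
fused = reciprocal_rank_fusion([dense, sparse])
print([doc_id for doc_id, _ in fused])  # ['b', 'a', 'd', 'c']
```

The fused list would then be passed to the cross-encoder, which rescores only the top candidates.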

03

Synthesis

vLLM serving Llama-3.1-70B-heretic-lora on RunPod GPU pods. RAG-grounded output with inline citations.
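One plausible way to ground the output is to number the retrieved passages in the prompt and ask the model to cite them by number. The sketch below is an assumption about the prompt shape, not the project's actual template: `build_grounded_prompt`, the passage keys (`title`, `doi`, `text`), and the `[n]` citation format are all hypothetical.

```python
def build_grounded_prompt(question: str, passages: list[dict]) -> str:
    """Assemble a prompt that numbers each passage so the model can cite it as [n]."""
    lines = [
        "Answer using only the numbered sources below.",
        "Cite each claim inline as [n], where n is the source number.",
        "",
        "Sources:",
    ]
    for i, p in enumerate(passages, start=1):
        lines.append(f"[{i}] {p['title']} (doi:{p['doi']}): {p['text']}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)
```

The resulting string would be sent to the model endpoint. vLLM can expose an OpenAI-compatible HTTP server, so a standard chat-completions client is one option for that call.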

04

Validation

Citation verification, hallucination detection, uncertainty quantification, human review gate.
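A basic form of citation verification checks that every inline marker points at a source that was actually retrieved. The sketch below assumes numeric markers such as `[3]` and a known source count; the function name, the marker format, and the review rule are illustrative assumptions, and the check says nothing about whether the cited passage supports the claim.

```python
import re


def verify_citations(text: str, num_sources: int) -> dict:
    """Compare [n] markers in the text against the valid source range 1..num_sources."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", text)}
    valid = set(range(1, num_sources + 1))
    invalid = sorted(cited - valid)
    return {
        "invalid": invalid,                # markers with no matching source
        "unused": sorted(valid - cited),   # sources never cited
        # Assumed rule: route to human review on any invalid marker or no citations at all.
        "needs_review": bool(invalid) or not cited,
    }


report = verify_citations("Claim A [1]. Claim B [4].", num_sources=3)
print(report)  # {'invalid': [4], 'unused': [2, 3], 'needs_review': True}
```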

Data Sources

Literature & Preprints

Semantic Scholar: Papers, abstracts, citation graphs
PubMed: Biomedical papers, MeSH terms
arXiv: Preprints, PDF links
bioRxiv / medRxiv: Biology and health science preprints
Europe PMC: 40M+ life science articles, OA full-text
CrossRef: DOI resolution, bibliographic metadata
CORE: Full-text open access papers
OpenAlex: Papers, abstracts, citation data
Unpaywall: OA PDF links by DOI
OpenCitations: Open citation graph, 600M+ links
Springer Nature: 14M documents, OA full-text
OpenAIRE: European research graph, grant linkage
OSF Preprints: PsyArXiv, SocArXiv, EarthArXiv, plus 25 more
ClinicalTrials.gov: Trial metadata, status, outcomes

Biomedical & Chemical

OpenFDA: Drug labels, adverse events, recalls
CDC Open Data: U.S. public health surveillance
PubChem: 115M compounds, bioactivity, toxicology
ChEMBL: Drug discovery, bioactivity assays
Open Targets: Target-disease-drug associations
NCBI Gene: Gene records, summaries, annotations
GWAS Catalog: SNP-trait associations, genomic context
WHO GHO: 2,000+ global health indicators

Statistics

U.S. Census Bureau: ACS, decennial census, economic data
FRED: GDP, unemployment, inflation series
OECD: International health, education, economics
World Bank: 16K development indicators, 200+ countries
Eurostat: EU demographics, economy, labor, health
IMF: Global macroeconomic and financial data
UN SDG: Sustainable development goal indicators
data.gov: U.S. federal open data catalog
GBIF: 2.4B species occurrence records

Case Studies

Multi-section literature reviews generated by the pipeline. Each study was produced by per-section retrieval and synthesis over peer-reviewed sources ingested from Semantic Scholar and PubMed, with the text generated by Llama-3.1-70B-heretic-lora. Each includes inline citations and falsifiability criteria.

behavioral genetics

Heritability of Intelligence

"What is the current scientific evidence on the heritability of intelligence?"

Twin studies place heritability of g at 50–80%, increasing from ~30% in infancy to ~80% by adolescence (the Wilson effect). GWAS have identified 200+ loci that explain 5–10% of variance, and SNP-based estimates of 20–30% partially close the "missing heritability" gap. Gene-environment interactions moderate expression: the Scarr-Rowe effect is observed in the US but not consistently in European welfare states.

6 sections, 13 inline citations, ~5,000 words

cognitive neuroscience

Sex Differences in Cognition

"What does the peer-reviewed literature show about sex differences in cognitive abilities?"

Mental rotation shows the largest male advantage (d = 0.50–1.00), extending supramodally to auditory spatial processing. Females outperform in verbal fluency, verbal memory, and reading (d = 0.10–0.35). Math gaps have narrowed from d = 0.30 to d = 0.05–0.10 over fifty years. Cross-national comparisons show cultural moderation: higher gender equality predicts smaller math gaps.

7 sections, 18 inline citations, ~5,500 words

psychometrics

Racial Differences in Cognitive Test Performance

"What does the peer-reviewed literature show about racial and ethnic group differences in cognitive test scores?"

Meta-analyses report a Black-White gap of ~1.0 SD on broad cognitive measures, narrowing by 0.27–0.47 SD since 1972. SES controls reduce the gap by 30–60%. Spearman's hypothesis is confirmed: the largest gaps appear on the most g-loaded tasks. GWAS portability decay and null admixture results complicate genetic interpretations. There is no scientific consensus on the relative genetic vs. environmental contribution to between-group differences.

7 sections, 10 inline citations, ~4,500 words

Quickstart

# Clone and install
$ git clone https://github.com/opensynthesislabs/open-synthesis.git
$ cd open-synthesis
$ uv sync

# List available data sources
$ open-synthesis sources

# Ingest papers on a topic
$ open-synthesis ingest "psilocybin depression" --sources semantic_scholar,pubmed

# Run a synthesis (requires RunPod endpoint)
$ open-synthesis synthesize "What is the evidence for psilocybin as a treatment for MDD?"