A layout analysis pipeline built on Detectron2 (Fast R-CNN) and Label Studio for extracting and annotating sections such as abstracts, tables, figures, and references from academic PDFs

shallowManica/doc-layout-parser


doc-layout-parser

This project develops a layout parsing pipeline to extract key components (e.g., abstract, context, table, reference) from academic PDFs using a Detectron2-based model trained on annotations from Label Studio.

🔍 Purpose

The pipeline identifies and segments document elements such as titles, authors, abstracts, tables, figures, and references using object detection, which supports downstream analysis and semantic classification with LLMs.

⚙️ Features

  • Fast R-CNN architecture (Detectron2) for layout detection
  • Layout categories: Abstract, Author, Context, Header, Image, Reference, Sub-title, Table, Title
  • Integration-ready with LLMs for content-based filtering or labeling
  • Configuration through config.yaml

🗃 File Structure

  • config.yaml - Detectron2 configuration for the layout model
  • result.json - Output annotations from model inference
  • parsing.ipynb - Sample notebook to run detection and visualize results
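
As a rough illustration of how result.json might be inspected, the sketch below counts annotations per category. It assumes the file follows a COCO-style layout (`categories` and `annotations` lists linked by `category_id`), which this README does not state; the helper names `count_by_category` and `load_and_count` and the sample data are hypothetical.

```python
import json
from collections import Counter


def count_by_category(coco: dict) -> dict:
    """Count annotations per category name in a COCO-style dict (assumed format)."""
    id_to_name = {c["id"]: c["name"] for c in coco.get("categories", [])}
    counts = Counter(
        id_to_name.get(a["category_id"], "unknown")
        for a in coco.get("annotations", [])
    )
    return dict(counts)


def load_and_count(path: str) -> dict:
    """Read a JSON file from disk and summarize its annotations."""
    with open(path, encoding="utf-8") as f:
        return count_by_category(json.load(f))


# Hypothetical sample standing in for the contents of result.json.
sample = {
    "categories": [{"id": 0, "name": "Abstract"}, {"id": 7, "name": "Table"}],
    "annotations": [
        {"id": 1, "image_id": 0, "category_id": 0, "bbox": [50, 120, 500, 180]},
        {"id": 2, "image_id": 0, "category_id": 7, "bbox": [60, 400, 480, 220]},
        {"id": 3, "image_id": 1, "category_id": 7, "bbox": [55, 90, 490, 300]},
    ],
}

summary = count_by_category(sample)
print(summary)
```

If the real file does match this layout, `load_and_count("result.json")` would give the same kind of summary; if it uses a different schema, the key names need adjusting.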

📦 Dependencies

Install via pip:

!pip install pycocotools
!pip install layoutparser torchvision
!pip install "git+https://github.com/facebookresearch/detectron2.git@v0.5#egg=detectron2"
!pip install "layoutparser[effdet]"
!pip install "layoutparser[paddledetection]"
!pip install "layoutparser[ocr]"

Install via Conda:

conda install detectron2 pytorch opencv omegaconf hydra-core -c conda-forge
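
After installing, it can help to confirm that the packages are importable before opening the notebook. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the module list is an assumption based on the packages named above (OpenCV is imported as `cv2`).

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the names from module_names that cannot be located for import."""
    return [name for name in module_names if find_spec(name) is None]


# Import names assumed for the packages installed above.
REQUIRED = ["layoutparser", "detectron2", "torch", "torchvision", "cv2", "pycocotools"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("All dependencies found." if not missing else f"Missing: {missing}")
```

This only checks that each module can be found, not that the installed versions are compatible with one another.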

🚀 How to Run

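# Inside parsing.ipynb
from layoutparser.models import Detectron2LayoutModel

# Load the layout model from the local Detectron2 config
model = Detectron2LayoutModel(
    config_path='config.yaml',
    # Keep only detections with a confidence score above 0.8
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
    # Class id -> category name; entries after id 1 are elided here
    label_map={0: "Abstract", 1: "Author"},  # ...
)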
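
The `MODEL.ROI_HEADS.SCORE_THRESH_TEST` value of 0.8 means that only detections scoring above 0.8 are kept at inference time. The following plain-Python sketch illustrates that filtering idea; it is not Detectron2's implementation, `Detection` and `filter_by_score` are hypothetical names, and the sample boxes are made up.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    score: float
    box: tuple  # (x1, y1, x2, y2) in pixels


def filter_by_score(detections, threshold=0.8):
    """Keep only detections whose confidence exceeds the threshold."""
    return [d for d in detections if d.score > threshold]


# Made-up detections for a single page.
detections = [
    Detection("Title", 0.97, (80, 40, 520, 90)),
    Detection("Table", 0.62, (60, 400, 540, 620)),
    Detection("Abstract", 0.88, (80, 120, 520, 300)),
]

kept = filter_by_score(detections)
print([d.label for d in kept])
```

A lower threshold keeps more regions at the cost of more false positives, so the value is worth tuning against a few annotated pages.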

📄 Annotation Categories

  • Abstract
  • Author
  • Context
  • Header
  • Image
  • Reference
  • Sub-title
  • Table
  • Title
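
A full `label_map` could be generated from the category list rather than typed by hand. The sketch below assumes class ids follow the order in which the categories are listed above. That matches the `0: "Abstract", 1: "Author"` fragment in the run example but is not confirmed for the remaining ids, so it should be checked against the category ids in the training annotations.

```python
CATEGORIES = [
    "Abstract", "Author", "Context", "Header", "Image",
    "Reference", "Sub-title", "Table", "Title",
]

# Assumption: id i corresponds to the i-th category in the list above.
label_map = dict(enumerate(CATEGORIES))

print(label_map)
```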
