How to Extract Text from a PDF Document in .NET using the PDF Library

Introduction

A quick-start .NET console project that shows how to extract text from a PDF document using the Syncfusion® PDF Library.

System requirements

Framework and SDKs

  • .NET SDK (version 5.0 or later)

IDEs

  • Visual Studio 2019 or Visual Studio 2022

Extract text from a specific page

Create a new .NET console application, add the Syncfusion® PDF library package, and add the following code to extract the text of the first page:

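//Namespaces required by this snippet (place at the top of the file).
using System.IO;
using Syncfusion.Pdf;
using Syncfusion.Pdf.Parsing;
//Get stream from an existing PDF document. The using declaration disposes the stream when it goes out of scope.
using FileStream docStream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read);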
//Load the PDF document. 
PdfLoadedDocument loadedDocument = new PdfLoadedDocument(docStream);
//Load the first page. 
PdfPageBase page = loadedDocument.Pages[0];
//Extract text from first page. 
string extractedText = page.ExtractText();
//Save the text.
File.WriteAllText("Result.txt", extractedText);
//Close the document.
loadedDocument.Close(true);

Layout-based text extraction

Create a new .NET console application, add the Syncfusion® PDF library package, and add the following code. Passing true to ExtractText requests layout-based extraction of the page text:

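//Namespaces required by this snippet (place at the top of the file).
using System.IO;
using Syncfusion.Pdf;
using Syncfusion.Pdf.Parsing;
//Get stream from an existing PDF document. The using declaration disposes the stream when it goes out of scope.
using FileStream docStream = new FileStream("Invoice.pdf", FileMode.Open, FileAccess.Read);
//Load the PDF document.
PdfLoadedDocument loadedDocument = new PdfLoadedDocument(docStream);
//Load the first page.
PdfPageBase page = loadedDocument.Pages[0];
//Extract text from the first page; true enables layout-based extraction.
string extractedTexts = page.ExtractText(true);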
//Save the text.
File.WriteAllText("data.txt", extractedTexts);
//Close the document.
loadedDocument.Close(true);

Extract text from the entire PDF document

Create a new .NET console application, add the Syncfusion® PDF library package, and add the following code to extract the text of every page in the document:

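//Namespaces required by this snippet (place at the top of the file).
using System.IO;
using System.Text;
using Syncfusion.Pdf;
using Syncfusion.Pdf.Parsing;
//Get stream from an existing PDF document. The using declaration disposes the stream when it goes out of scope.
using FileStream docStream = new FileStream("Data.pdf", FileMode.Open, FileAccess.Read);
//Load the PDF document.
PdfLoadedDocument loadedDocument = new PdfLoadedDocument(docStream);
//Accumulate the text in a StringBuilder to avoid repeated string copies.
StringBuilder extractedText = new StringBuilder();
//Extract all the text from the PDF document pages.
foreach (PdfLoadedPage loadedPage in loadedDocument.Pages) {
    extractedText.Append(loadedPage.ExtractText());
}
//Save the text to file.
File.WriteAllText("data.txt", extractedText.ToString());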
//Close the document.
loadedDocument.Close(true);

Extract text from predefined bounds

Create a new .NET console application, add the Syncfusion® PDF library package, and add the following code to read the word located at a known rectangle on the first page:

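//Namespaces required by this snippet (place at the top of the file).
using System.IO;
using Syncfusion.Drawing;
using Syncfusion.Pdf;
using Syncfusion.Pdf.Parsing;
//Get stream from an existing PDF document. The using declaration disposes the stream when it goes out of scope.
using FileStream docStream = new FileStream("Invoice.pdf", FileMode.Open, FileAccess.Read);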
//Load the PDF document.
PdfLoadedDocument loadedDocument = new PdfLoadedDocument(docStream);
//Get the first page of the loaded PDF document.
PdfPageBase page = loadedDocument.Pages[0];
//Create a line collection to receive the extracted lines.
var lineCollection = new TextLineCollection();
//Extract text from the first page together with its lines and words.
page.ExtractText(out lineCollection);
//Bounds of the word to read, in the page's coordinate space.
RectangleF textBounds = new RectangleF(474.96198f, 161.62997f, 50.040073f, 9);
string invoiceNumber = "";
//Get the word whose bounds are exactly equal to textBounds.
foreach (TextLine textLine in lineCollection.TextLine) {
    foreach (TextWord word in textLine.WordCollection) {
        if (textBounds == word.Bounds) {
            invoiceNumber = word.Text;
            break;
        }
    }
}
//Save the text to file.
File.WriteAllText("data.txt", invoiceNumber);
//Close the PDF document. 
loadedDocument.Close(true);
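
The comparison in the snippet above uses exact equality on floating-point bounds, so it only matches when the rectangle is identical to the one reported for the word. The following is a minimal sketch of a more tolerant comparison, not part of the original sample. It assumes System.Drawing.RectangleF as a stand-in for the rectangle type used above; the NearlyEqual helper and the 0.5 tolerance are hypothetical and are not part of the Syncfusion® API.

```csharp
using System;
using System.Drawing;

// Hypothetical helper: true when the position and size of the two
// rectangles each differ by at most `tolerance` units.
static bool NearlyEqual(RectangleF a, RectangleF b, float tolerance = 0.5f)
{
    return Math.Abs(a.X - b.X) <= tolerance
        && Math.Abs(a.Y - b.Y) <= tolerance
        && Math.Abs(a.Width - b.Width) <= tolerance
        && Math.Abs(a.Height - b.Height) <= tolerance;
}

// Target bounds taken from the sample above.
var target = new RectangleF(474.96198f, 161.62997f, 50.040073f, 9f);
// Assumed example of slightly rounded bounds for the same word.
var reported = new RectangleF(474.96f, 161.63f, 50.04f, 9f);

Console.WriteLine(NearlyEqual(target, reported)); // prints True
```

In the loop above, such a helper could replace the equality check if the stored bounds are rounded or come from a slightly different rendering of the same document. The right tolerance depends on the document and is an assumption here.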

How to run the examples

  • Download this project to a location on your disk.
  • Open the solution file in Visual Studio.
  • Rebuild the solution to restore the required NuGet package.
  • Run the application.

License

This is a commercial product and requires a paid license for possession or use. Syncfusion's licensed software, including this component, is subject to the terms and conditions of Syncfusion's EULA. You can purchase a license or start a free 30-day trial.

About Syncfusion®

Founded in 2001 and headquartered in Research Triangle Park, N.C., Syncfusion® has more than 26,000 customers and more than 1 million users, including large financial institutions, Fortune 500 companies, and global IT consultancies.

Today, we provide 1600+ components and frameworks for web (Blazor, ASP.NET Core, ASP.NET MVC, ASP.NET WebForms, JavaScript, Angular, React, Vue, and Flutter), mobile (Xamarin, Flutter, UWP, and JavaScript), and desktop development (WinForms, WPF, WinUI (Preview), Flutter, and UWP). We provide ready-to-deploy enterprise software for dashboards, reports, data integration, and big data processing. Many customers have saved millions in licensing fees by deploying our software.
