Read Text from Word Files in C# and VB.NET
With GemBox.Document you can open and read many Word file formats (like DOCX, DOC, RTF, and HTML) in the same manner. The documents can be loaded using one of the DocumentModel.Load methods from your C# and VB.NET application. These methods enable you to work with a physical file (when providing the file's path) or with an in-memory file (when providing the file's Stream).
You can specify the format of your Word file by providing an object of a class derived from LoadOptions (like DocxLoadOptions, DocLoadOptions, RtfLoadOptions, and HtmlLoadOptions). Or you can let GemBox.Document choose the appropriate options for you when opening the file by omitting the LoadOptions.
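As a minimal sketch of the two loading styles described above, the snippet below loads the same file once from its path and once from a Stream, passing a DocxLoadOptions object explicitly in both cases. The file name "Input.docx" is a placeholder assumed for illustration, and the sketch assumes the file really is in DOCX format.

```csharp
using System;
using System.IO;
using GemBox.Document;

class Program
{
    static void Main()
    {
        // If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Assumption: "Input.docx" is a placeholder path used only for this sketch.
        const string path = "Input.docx";

        // Load from a physical file, stating the format explicitly.
        var fromPath = DocumentModel.Load(path, new DocxLoadOptions());
        Console.WriteLine(fromPath.Content.ToString());

        // Load from an in-memory file by providing a Stream and the load options.
        using (var stream = new MemoryStream(File.ReadAllBytes(path)))
        {
            var fromStream = DocumentModel.Load(stream, new DocxLoadOptions());
            Console.WriteLine(fromStream.Content.ToString());
        }
    }
}
```

Stating the options explicitly is mostly useful when the source gives no hint about the format, for example a stream received over the network.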
The following example shows the easiest way to read the document's text from a Word file.

using System;
using System.Linq;
using GemBox.Document;

class Program
{
    static void Main()
    {
        // If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Load Word document from file's path.
        var document = DocumentModel.Load("%InputFileName%");

        // Get Word document's plain text.
        string text = document.Content.ToString();

        // Get Word document's count statistics.
        int charactersCount = text.Replace(Environment.NewLine, string.Empty).Length;
        int wordsCount = document.Content.CountWords();
        int paragraphsCount = document.GetChildElements(true, ElementType.Paragraph).Count();
        int pageCount = document.GetPaginator().Pages.Count;

        // Display file's count statistics.
        Console.WriteLine($"Characters count: {charactersCount}");
        Console.WriteLine($" Words count: {wordsCount}");
        Console.WriteLine($"Paragraphs count: {paragraphsCount}");
        Console.WriteLine($" Pages count: {pageCount}");
        Console.WriteLine();

        // Display file's text content.
        Console.WriteLine(text);
    }
}
Imports System
Imports System.Linq
Imports GemBox.Document

Module Program

    Sub Main()

        ' If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY")

        ' Load Word document from file's path.
        Dim document = DocumentModel.Load("%InputFileName%")

        ' Get Word document's plain text.
        Dim text As String = document.Content.ToString()

        ' Get Word document's count statistics.
        Dim charactersCount As Integer = text.Replace(Environment.NewLine, String.Empty).Length
        Dim wordsCount As Integer = document.Content.CountWords()
        Dim paragraphsCount As Integer = document.GetChildElements(True, ElementType.Paragraph).Count()
        Dim pageCount As Integer = document.GetPaginator().Pages.Count

        ' Display file's count statistics.
        Console.WriteLine($"Characters count: {charactersCount}")
        Console.WriteLine($" Words count: {wordsCount}")
        Console.WriteLine($"Paragraphs count: {paragraphsCount}")
        Console.WriteLine($" Pages count: {pageCount}")
        Console.WriteLine()

        ' Display file's text content.
        Console.WriteLine(text)

    End Sub
End Module
Reading Word document's elements
Besides reading the text of the whole document, you can also read just some part of it, like a specific Section element or HeaderFooter element. Each element has a Content property with which you can extract its text via the Content.ToString method.
The following example shows how you can open a document, traverse through all Paragraph elements and their child Run elements, and read their text and formatting. To read more about the visual information of the content elements, see the Formattings and Styles help page.

using System;
using System.IO;
using System.Linq;
using GemBox.Document;

class Program
{
    static void Main()
    {
        // If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        var document = DocumentModel.Load("%InputFileName%");

        using (var writer = File.CreateText("Output.txt"))
        {
            // Iterate through all Paragraph elements in the Word document.
            foreach (Paragraph paragraph in document.GetChildElements(true, ElementType.Paragraph))
            {
                // Iterate through all Run elements in the Paragraph element.
                foreach (Run run in paragraph.GetChildElements(true, ElementType.Run))
                {
                    string text = run.Text;
                    CharacterFormat format = run.CharacterFormat;

                    // Replace text with bold formatting to 'Mathematical Bold Italic' Unicode characters.
                    // For example, "ABC" to "𝑨𝑩𝑪".
                    if (format.Bold)
                    {
                        text = string.Concat(text.Select(
                            c => c >= 'A' && c <= 'Z' ? char.ConvertFromUtf32(119847 + c) :
                                 c >= 'a' && c <= 'z' ? char.ConvertFromUtf32(119841 + c) :
                                 c.ToString()));
                    }

                    writer.Write(text);
                }

                writer.WriteLine();
            }
        }
    }
}
Imports System
Imports System.IO
Imports System.Linq
Imports GemBox.Document

Module Program

    Sub Main()

        ' If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY")

        Dim document = DocumentModel.Load("%InputFileName%")

        Using writer = File.CreateText("Output.txt")

            ' Iterate through all Paragraph elements in the Word document.
            For Each paragraph As Paragraph In document.GetChildElements(True, ElementType.Paragraph)

                ' Iterate through all Run elements in the Paragraph element.
                For Each run As Run In paragraph.GetChildElements(True, ElementType.Run)

                    Dim text As String = run.Text
                    Dim format As CharacterFormat = run.CharacterFormat

                    ' Replace text with bold formatting to 'Mathematical Bold Italic' Unicode characters.
                    ' For example, "ABC" to "𝑨𝑩𝑪".
                    If format.Bold Then
                        text = String.Concat(text.Select(
                            Function(c)
                                Return If(c >= "A"c AndAlso c <= "Z"c, Char.ConvertFromUtf32(119847 + AscW(c)),
                                       If(c >= "a"c AndAlso c <= "z"c, Char.ConvertFromUtf32(119841 + AscW(c)), c.ToString()))
                            End Function))
                    End If

                    writer.Write(text)
                Next

                writer.WriteLine()
            Next
        End Using

    End Sub
End Module
By combining these two examples you can achieve various tasks, like selecting only the Table elements and reading their text content, or selecting only the Picture elements and extracting their images, or reading the Run.Text property of only the highlighted elements (the ones that have CharacterFormat.HighlightColor set).
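Two of the tasks mentioned above could be sketched roughly as follows: printing the text of every Table element, and printing the text of only the highlighted Run elements. The file name "Input.docx" is a placeholder, and treating Color.Empty as the "no highlight" value is an assumption of this sketch rather than something the examples above confirm.

```csharp
using System;
using GemBox.Document;
using GemBox.Document.Tables;

class Program
{
    static void Main()
    {
        // If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Assumption: "Input.docx" is a placeholder path used only for this sketch.
        var document = DocumentModel.Load("Input.docx");

        // Select only the Table elements and read their text content.
        foreach (Table table in document.GetChildElements(true, ElementType.Table))
            Console.WriteLine(table.Content.ToString());

        // Read the text of only the highlighted Run elements.
        foreach (Run run in document.GetChildElements(true, ElementType.Run))
        {
            // Assumption: a run without highlighting reports Color.Empty.
            if (!run.CharacterFormat.HighlightColor.Equals(Color.Empty))
                Console.WriteLine(run.Text);
        }
    }
}
```

Both loops reuse the GetChildElements traversal and the Content.ToString and Run.Text members from the examples above; only the element type and the filter change.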
Reading Word document's pages
Word files (DOCX, DOC, RTF, HTML, etc.) don't have a page concept, which means they don't contain information about how many pages they occupy or which element is on which page.
They are of a flow document type and their content is written in a flowable manner. The page concept is specific to the Word application(s) that render or display the document.
On the other hand, files of fixed document type (PDF, XPS, etc.) do have a page concept. Their content is fixed: the exact page location at which each element is rendered is defined.
GemBox.Document uses its rendering engine to paginate and render the document's content when saving to PDF, XPS, or image format. So, the best and the easiest way to read the text content of some specific page is to convert a Word document to a PDF file (or save a specific Word page as a PDF) with GemBox.Document and then read the PDF page's text content with our other component, GemBox.Pdf.
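A rough sketch of that PDF route might look like the following. It assumes both the GemBox.Document and GemBox.Pdf packages are referenced, uses "Reading.pdf" as a placeholder name for the intermediate file, and reads each page's text through the PDF page's Content.ToString method.

```csharp
using System;
using GemBox.Document;
using GemBox.Pdf;

class Program
{
    static void Main()
    {
        // Both components define ComponentInfo, so the calls are fully qualified.
        // If using Professional versions, put your serial keys below.
        GemBox.Document.ComponentInfo.SetLicense("FREE-LIMITED-KEY");
        GemBox.Pdf.ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Convert the Word document to PDF; the format is taken from the file extension.
        // Assumption: "Reading.pdf" is a placeholder name for the intermediate file.
        var document = DocumentModel.Load("Reading.docx");
        document.Save("Reading.pdf");

        // Read the text content of each PDF page.
        using (var pdf = PdfDocument.Load("Reading.pdf"))
        {
            int pageNumber = 1;
            foreach (var page in pdf.Pages)
            {
                Console.WriteLine($"Page {pageNumber++}");
                Console.WriteLine(page.Content.ToString());
            }
        }
    }
}
```

Going through a file keeps the sketch short; the same conversion could presumably be done through a stream to avoid writing to disk.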
Nevertheless, the following example shows how you can use GemBox.Document's rendering engine to retrieve each document page as a FrameworkElement object from the WPF framework and then extract text from it using the provided FrameworkElement.ToText extension method.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Windows;
using System.Windows.Media;
using GemBox.Document;

class Program
{
    [STAThread]
    static void Main()
    {
        // If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        var document = DocumentModel.Load("Reading.docx");
        var pages = document.GetPaginator().Pages;

        for (int i = 0, count = pages.Count; i < count; ++i)
        {
            Console.WriteLine(new string('-', 50));
            Console.WriteLine($"Page {i + 1} of {count}");
            Console.WriteLine(new string('-', 50));

            // Get FrameworkElement object from Word document's page.
            DocumentModelPage page = pages[i];
            FrameworkElement pageContent = page.PageContent;

            // Extract text from FrameworkElement object.
            Console.WriteLine(pageContent.ToText());
        }
    }
}

/// <summary>
/// Contains methods that are used to extract text out of a FrameworkElement object.
/// </summary>
public static class GemBoxDocumentHelper
{
    public static string ToText(this FrameworkElement root)
    {
        var builder = new StringBuilder();

        foreach (var visual in root.GetSelfAndDescendants().OfType<DrawingVisual>())
        {
            GlyphRun previousRun = null;

            // Order runs first vertically (Y), then horizontally (X).
            foreach (var currentRun in visual.Drawing
                .GetSelfAndDescendants()
                .OfType<GlyphRunDrawing>()
                .Select(glyph => glyph.GlyphRun)
                .OrderBy(run => run.BaselineOrigin.Y)
                .ThenBy(run => run.BaselineOrigin.X))
            {
                if (previousRun != null)
                {
                    // If base-line of current text segment is left from base-line of previous text segment, then assume that it is new line.
                    if (currentRun.BaselineOrigin.X <= previousRun.BaselineOrigin.X)
                    {
                        builder.AppendLine();
                    }
                    else
                    {
                        Rect currentRect = currentRun.ComputeAlignmentBox();
                        Rect previousRect = previousRun.ComputeAlignmentBox();

                        double spaceWidth = currentRun.BaselineOrigin.X + currentRect.Left - previousRun.BaselineOrigin.X - previousRect.Right;
                        double spaceHeight = (currentRect.Height + previousRect.Height) / 2;

                        // If space between successive text segments has width greater than a sixth of its height, then assume that it is a word (add a space).
                        if (spaceWidth > spaceHeight / 6)
                            builder.Append(' ');
                    }
                }

                builder.Append(currentRun.Characters.ToArray());
                previousRun = currentRun;
            }
        }

        return builder.ToString();
    }

    private static IEnumerable<DependencyObject> GetSelfAndDescendants(this DependencyObject parent)
    {
        yield return parent;

        for (int i = 0, count = VisualTreeHelper.GetChildrenCount(parent); i < count; i++)
            foreach (var descendant in VisualTreeHelper.GetChild(parent, i).GetSelfAndDescendants())
                yield return descendant;
    }

    private static IEnumerable<Drawing> GetSelfAndDescendants(this DrawingGroup parent)
    {
        yield return parent;

        foreach (var child in parent.Children)
        {
            var drawings = child as DrawingGroup;
            if (drawings != null)
                foreach (var descendant in drawings.GetSelfAndDescendants())
                    yield return descendant;
            else
                yield return child;
        }
    }
}
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text
Imports System.Windows
Imports System.Windows.Media
Imports GemBox.Document

Module Program

    <STAThread>
    Sub Main()

        ' If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY")

        Dim document = DocumentModel.Load("Reading.docx")
        Dim pages = document.GetPaginator().Pages

        Dim count = pages.Count
        For i = 0 To count - 1

            Console.WriteLine(New String("-"c, 50))
            Console.WriteLine($"Page {i + 1} of {count}")
            Console.WriteLine(New String("-"c, 50))

            ' Get FrameworkElement object from Word document's page.
            Dim page As DocumentModelPage = pages(i)
            Dim pageContent As FrameworkElement = page.PageContent

            ' Extract text from FrameworkElement object.
            Console.WriteLine(pageContent.ToText())
        Next

    End Sub
End Module

''' <summary>
''' Contains methods that are used to extract text out of a FrameworkElement object.
''' </summary>
Module GemBoxDocumentHelper

    <Runtime.CompilerServices.Extension>
    Function ToText(ByVal root As FrameworkElement) As String

        Dim builder As New StringBuilder()

        For Each visual In root.GetSelfAndDescendants().OfType(Of DrawingVisual)()

            Dim previousRun As GlyphRun = Nothing

            ' Order runs first vertically (Y), then horizontally (X).
            For Each currentRun In visual.Drawing _
                .GetSelfAndDescendants() _
                .OfType(Of GlyphRunDrawing)() _
                .Select(Function(glyph) glyph.GlyphRun) _
                .OrderBy(Function(run) run.BaselineOrigin.Y) _
                .ThenBy(Function(run) run.BaselineOrigin.X)

                If previousRun IsNot Nothing Then
                    ' If base-line of current text segment is left from base-line of previous text segment, then assume that it is new line.
                    If currentRun.BaselineOrigin.X <= previousRun.BaselineOrigin.X Then
                        builder.AppendLine()
                    Else
                        Dim currentRect As Rect = currentRun.ComputeAlignmentBox()
                        Dim previousRect As Rect = previousRun.ComputeAlignmentBox()

                        Dim spaceWidth As Double = currentRun.BaselineOrigin.X + currentRect.Left - previousRun.BaselineOrigin.X - previousRect.Right
                        Dim spaceHeight As Double = (currentRect.Height + previousRect.Height) / 2

                        ' If space between successive text segments has width greater than a sixth of its height, then assume that it is a word (add a space).
                        If spaceWidth > spaceHeight / 6 Then builder.Append(" "c)
                    End If
                End If

                builder.Append(currentRun.Characters.ToArray())
                previousRun = currentRun
            Next
        Next

        Return builder.ToString()
    End Function

    <Runtime.CompilerServices.Extension>
    Private Iterator Function GetSelfAndDescendants(ByVal parent As DependencyObject) As IEnumerable(Of DependencyObject)

        Yield parent

        Dim count = VisualTreeHelper.GetChildrenCount(parent)
        For i = 0 To count - 1
            For Each descendant In VisualTreeHelper.GetChild(parent, i).GetSelfAndDescendants()
                Yield descendant
            Next
        Next
    End Function

    <Runtime.CompilerServices.Extension>
    Private Iterator Function GetSelfAndDescendants(ByVal parent As DrawingGroup) As IEnumerable(Of Drawing)

        Yield parent

        For Each child In parent.Children
            Dim drawings = TryCast(child, DrawingGroup)
            If drawings IsNot Nothing Then
                For Each descendant In drawings.GetSelfAndDescendants()
                    Yield descendant
                Next
            Else
                Yield child
            End If
        Next
    End Function

End Module
Published: September 14, 2018 | Modified: September 8, 2021 | Author: Mario Zorica
Source: https://main.gemboxsoftware.com/document/examples/c-sharp-vb-net-open-read-word-file/301