Read Text from Word Files in C# and VB.NET

With GemBox.Document you can open and read many Word file formats (like DOCX, DOC, RTF, and HTML) in the same manner. The documents can be loaded using one of the DocumentModel.Load methods from your C# and VB.NET application. These methods enable you to work with a physical file (when providing the file's path) or with an in-memory file (when providing the file's Stream).

You can specify the format of your Word file by providing an object from the LoadOptions derived class (like DocxLoadOptions, DocLoadOptions, RtfLoadOptions, and HtmlLoadOptions). Or you can let GemBox.Document choose the appropriate options for you when opening the file by omitting the LoadOptions.

The following example shows the easiest way to read the document's text from a Word file.
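As a rough sketch of the loading styles described above, the snippet below loads the same file from a path with explicit options, from a stream, and from a path with automatic detection. The file name "Input.docx" is a placeholder, and choosing the predefined LoadOptions.DocxDefault instance is an assumption that the input really is a DOCX file.

```csharp
using System;
using System.IO;
using GemBox.Document;

class Program
{
    static void Main()
    {
        // If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Load from a physical file and state the format explicitly.
        // "Input.docx" is a placeholder path, not part of the original example.
        var fromPath = DocumentModel.Load("Input.docx", LoadOptions.DocxDefault);

        // Load from an in-memory or file stream; the options describe the format.
        using (var stream = File.OpenRead("Input.docx"))
        {
            var fromStream = DocumentModel.Load(stream, LoadOptions.DocxDefault);
            Console.WriteLine(fromStream.Content.ToString());
        }

        // Omit the options and let GemBox.Document choose them for the file.
        var detected = DocumentModel.Load("Input.docx");
        Console.WriteLine(detected.Content.ToString() == fromPath.Content.ToString());
    }
}
```

Passing the options explicitly is mostly useful when the file extension is missing or misleading, or when the data arrives as a stream.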

Opening and reading Word document's text in C# and VB.NET
Screenshot of read text from input Word document
using System;
using System.Linq;
using GemBox.Document;

class Program
{
    static void Main()
    {
        // If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Load Word document from file's path.
        var document = DocumentModel.Load("%InputFileName%");

        // Get Word document's plain text.
        string text = document.Content.ToString();

        // Get Word document's count statistics.
        int charactersCount = text.Replace(Environment.NewLine, string.Empty).Length;
        int wordsCount = document.Content.CountWords();
        int paragraphsCount = document.GetChildElements(true, ElementType.Paragraph).Count();
        int pageCount = document.GetPaginator().Pages.Count;

        // Display file's count statistics.
        Console.WriteLine($"Characters count: {charactersCount}");
        Console.WriteLine($"     Words count: {wordsCount}");
        Console.WriteLine($"Paragraphs count: {paragraphsCount}");
        Console.WriteLine($"     Pages count: {pageCount}");
        Console.WriteLine();

        // Display file's text content.
        Console.WriteLine(text);
    }
}

Imports System
Imports System.Linq
Imports GemBox.Document

Module Program

    Sub Main()

        ' If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY")

        ' Load Word document from file's path.
        Dim document = DocumentModel.Load("%InputFileName%")

        ' Get Word document's plain text.
        Dim text As String = document.Content.ToString()

        ' Get Word document's count statistics.
        Dim charactersCount As Integer = text.Replace(Environment.NewLine, String.Empty).Length
        Dim wordsCount As Integer = document.Content.CountWords()
        Dim paragraphsCount As Integer = document.GetChildElements(True, ElementType.Paragraph).Count()
        Dim pageCount As Integer = document.GetPaginator().Pages.Count

        ' Display file's count statistics.
        Console.WriteLine($"Characters count: {charactersCount}")
        Console.WriteLine($"     Words count: {wordsCount}")
        Console.WriteLine($"Paragraphs count: {paragraphsCount}")
        Console.WriteLine($"     Pages count: {pageCount}")
        Console.WriteLine()

        ' Display file's text content.
        Console.WriteLine(text)

    End Sub
End Module

Reading Word document's elements

Besides reading the text of the whole document, you can also read just some part of it, like a specific Section element or HeaderFooter element. Each element has a Content property with which you can extract its text via the Content.ToString method.

The following example shows how you can open a document and traverse through all Paragraph elements and their child Run elements, and read their text and formatting. To read more about the visual information of the content elements, see the Formattings and Styles help page.
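To illustrate reading only part of a document, here is a minimal sketch that prints the text of the first section and of its default header. The file name is a placeholder, and the sketch assumes the document has at least one section; the header lookup is expected to yield null when the section has no default header, so it is checked before use.

```csharp
using System;
using GemBox.Document;

class Program
{
    static void Main()
    {
        // If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // "Input.docx" is a placeholder path, not part of the original example.
        var document = DocumentModel.Load("Input.docx");

        // Text of the first section only (assumes at least one section exists).
        Section section = document.Sections[0];
        Console.WriteLine(section.Content.ToString());

        // Text of that section's default header, if it has one.
        HeaderFooter header = section.HeadersFooters[HeaderFooterType.HeaderDefault];
        if (header != null)
            Console.WriteLine(header.Content.ToString());
    }
}
```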

Opening and reading Word document's text and formatting in C# and VB.NET
Screenshot of read elements from input Word document
using System;
using System.IO;
using System.Linq;
using GemBox.Document;

class Program
{
    static void Main()
    {
        // If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        var document = DocumentModel.Load("%InputFileName%");
        using (var writer = File.CreateText("Output.txt"))
        {
            // Iterate through all Paragraph elements in the Word document.
            foreach (Paragraph paragraph in document.GetChildElements(true, ElementType.Paragraph))
            {
                // Iterate through all Run elements in the Paragraph element.
                foreach (Run run in paragraph.GetChildElements(true, ElementType.Run))
                {
                    string text = run.Text;
                    CharacterFormat format = run.CharacterFormat;

                    // Replace text with bold formatting to 'Mathematical Bold Italic' Unicode characters.
                    // For example, "ABC" to "𝑨𝑩𝑪".
                    if (format.Bold)
                    {
                        text = string.Concat(text.Select(
                            c => c >= 'A' && c <= 'Z' ? char.ConvertFromUtf32(119847 + c) :
                                 c >= 'a' && c <= 'z' ? char.ConvertFromUtf32(119841 + c) :
                                 c.ToString()));
                    }

                    writer.Write(text);
                }

                writer.WriteLine();
            }
        }
    }
}

Imports System
Imports System.IO
Imports System.Linq
Imports GemBox.Document

Module Program

    Sub Main()

        ' If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY")

        Dim document = DocumentModel.Load("%InputFileName%")
        Using writer = File.CreateText("Output.txt")

            ' Iterate through all Paragraph elements in the Word document.
            For Each paragraph As Paragraph In document.GetChildElements(True, ElementType.Paragraph)

                ' Iterate through all Run elements in the Paragraph element.
                For Each run As Run In paragraph.GetChildElements(True, ElementType.Run)

                    Dim text As String = run.Text
                    Dim format As CharacterFormat = run.CharacterFormat

                    ' Replace text with bold formatting to 'Mathematical Bold Italic' Unicode characters.
                    ' For example, "ABC" to "𝑨𝑩𝑪".
                    If format.Bold Then
                        text = String.Concat(text.Select(
                            Function(c)
                                Return If(c >= "A"c AndAlso c <= "Z"c, Char.ConvertFromUtf32(119847 + AscW(c)),
                                       If(c >= "a"c AndAlso c <= "z"c, Char.ConvertFromUtf32(119841 + AscW(c)),
                                       c.ToString()))
                            End Function))
                    End If

                    writer.Write(text)
                Next

                writer.WriteLine()
            Next
        End Using

    End Sub
End Module
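The two offsets used for bold text in the example can look arbitrary, so here is the same mapping isolated from GemBox.Document as a small sketch. The class and method names are hypothetical. Adding 119847 to 'A' (code 65) gives 119912, which is U+1D468, the first capital letter of the Mathematical Bold Italic range; adding 119841 to 'a' (code 97) gives 119938, which is U+1D482, the first small letter.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper that isolates the character mapping from the example above.
static class BoldItalicMapper
{
    public static string ToMathBoldItalic(string text)
    {
        // Letters A-Z and a-z are shifted into the Mathematical Bold Italic range;
        // every other character is passed through unchanged.
        return string.Concat(text.Select(c =>
            c >= 'A' && c <= 'Z' ? char.ConvertFromUtf32(119847 + c) :
            c >= 'a' && c <= 'z' ? char.ConvertFromUtf32(119841 + c) :
            c.ToString()));
    }
}

class Program
{
    static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;

        string mapped = BoldItalicMapper.ToMathBoldItalic("ABC abc 123");
        Console.WriteLine(mapped);

        // Each mapped letter lies outside the Basic Multilingual Plane, so it takes
        // two UTF-16 code units: 11 input characters become 17 code units.
        Console.WriteLine(mapped.Length); // prints 17
    }
}
```

Because the mapped letters are surrogate pairs, code that later counts characters with string.Length will see two units per bold letter.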

By combining these two examples you can achieve various tasks, like selecting only the Table elements and reading their text content, or selecting only the Picture elements and extracting their images, or reading the Run.Text property of only the highlighted elements (the ones that have CharacterFormat.HighlightColor).
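As one possible illustration of the first task, the sketch below selects only the Table elements and prints the text of each one, reusing the GetChildElements and Content.ToString calls from the examples above. The file name is a placeholder.

```csharp
using System;
using GemBox.Document;
using GemBox.Document.Tables;

class Program
{
    static void Main()
    {
        // If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // "Input.docx" is a placeholder path, not part of the original example.
        var document = DocumentModel.Load("Input.docx");

        int index = 0;

        // Select only the Table elements, at any depth in the document.
        foreach (Table table in document.GetChildElements(true, ElementType.Table))
        {
            Console.WriteLine($"Table {++index}:");

            // Read the table's text content the same way as the whole document's.
            Console.WriteLine(table.Content.ToString());
        }
    }
}
```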

Reading Word document's pages

Word files (DOCX, DOC, RTF, HTML, etc.) don't have a page concept, which means they don't contain information about how many pages they occupy nor which element is on which page.

They are of a flow document type and their content is written in a flowable fashion. The page concept is specific to the Word application(s) that renders or displays the document.

On the other hand, files of fixed document type (PDF, XPS, etc.) do have a page concept. Their content is fixed: it's defined on which exact page location the elements are rendered.

GemBox.Document uses its rendering engine to paginate and render the document's content when saving to PDF, XPS, or image format. So, the best and the easiest way to read the text content of some specific page is to convert a Word document to a PDF file (or save a specific Word page as a PDF) with GemBox.Document and then read the PDF page's text content with our other component, GemBox.Pdf.

Nevertheless, the following example shows how you can use GemBox.Document's rendering engine to retrieve each document page as a FrameworkElement object from a WPF framework and then extract text from it using the provided FrameworkElement.ToText extension method.
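The PDF route described above might look roughly like the following sketch. It assumes both GemBox.Document and GemBox.Pdf are referenced, that the placeholder file names suit your project, and that GemBox.Pdf exposes page text through PdfDocument.Load, Pages, and Content.ToString; check these calls against the GemBox.Pdf documentation before relying on them.

```csharp
using System;
using GemBox.Document;
using GemBox.Pdf;

class Program
{
    static void Main()
    {
        // Both components define a ComponentInfo class, so the calls are fully qualified.
        GemBox.Document.ComponentInfo.SetLicense("FREE-LIMITED-KEY");
        GemBox.Pdf.ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        // Convert the Word document to PDF; the rendering engine paginates it here.
        // "Input.docx" and "Converted.pdf" are placeholder file names.
        DocumentModel.Load("Input.docx").Save("Converted.pdf");

        // Read the text of each page from the fixed-layout PDF file.
        using (var pdf = PdfDocument.Load("Converted.pdf"))
        {
            int pageNumber = 0;
            foreach (var page in pdf.Pages)
            {
                Console.WriteLine($"Page {++pageNumber}");
                Console.WriteLine(page.Content.ToString());
            }
        }
    }
}
```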

Opening and reading Word document's page in C# and VB.NET
Screenshot of read page from input Word document
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Windows;
using System.Windows.Media;
using GemBox.Document;

class Program
{
    [STAThread]
    static void Main()
    {
        // If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY");

        var document = DocumentModel.Load("Reading.docx");
        var pages = document.GetPaginator().Pages;

        for (int i = 0, count = pages.Count; i < count; ++i)
        {
            Console.WriteLine(new string('-', 50));
            Console.WriteLine($"Page {i + 1} of {count}");
            Console.WriteLine(new string('-', 50));

            // Get FrameworkElement object from Word document's page.
            DocumentModelPage page = pages[i];
            FrameworkElement pageContent = page.PageContent;

            // Extract text from FrameworkElement object.
            Console.WriteLine(pageContent.ToText());
        }
    }
}

/// <summary>
/// Contains methods that are used to extract text out of a FrameworkElement object.
/// </summary>
public static class GemBoxDocumentHelper
{
    public static string ToText(this FrameworkElement root)
    {
        var builder = new StringBuilder();

        foreach (var visual in root.GetSelfAndDescendants().OfType<DrawingVisual>())
        {
            GlyphRun previousRun = null;

            // Order runs first vertically (Y), then horizontally (X).
            foreach (var currentRun in visual.Drawing
                .GetSelfAndDescendants()
                .OfType<GlyphRunDrawing>()
                .Select(glyph => glyph.GlyphRun)
                .OrderBy(run => run.BaselineOrigin.Y)
                .ThenBy(run => run.BaselineOrigin.X))
            {
                if (previousRun != null)
                {
                    // If base-line of current text segment is left from base-line of previous text segment, then assume that it is new line.
                    if (currentRun.BaselineOrigin.X <= previousRun.BaselineOrigin.X)
                    {
                        builder.AppendLine();
                    }
                    else
                    {
                        Rect currentRect = currentRun.ComputeAlignmentBox();
                        Rect previousRect = previousRun.ComputeAlignmentBox();

                        double spaceWidth = currentRun.BaselineOrigin.X + currentRect.Left - previousRun.BaselineOrigin.X - previousRect.Right;
                        double spaceHeight = (currentRect.Height + previousRect.Height) / 2;

                        // If space between successive text segments has width greater than a sixth of its height, then assume that it is a word (add a space).
                        if (spaceWidth > spaceHeight / 6)
                            builder.Append(' ');
                    }
                }

                builder.Append(currentRun.Characters.ToArray());
                previousRun = currentRun;
            }
        }

        return builder.ToString();
    }

    private static IEnumerable<DependencyObject> GetSelfAndDescendants(this DependencyObject parent)
    {
        yield return parent;

        for (int i = 0, count = VisualTreeHelper.GetChildrenCount(parent); i < count; i++)
            foreach (var descendant in VisualTreeHelper.GetChild(parent, i).GetSelfAndDescendants())
                yield return descendant;
    }

    private static IEnumerable<Drawing> GetSelfAndDescendants(this DrawingGroup parent)
    {
        yield return parent;

        foreach (var child in parent.Children)
        {
            var drawings = child as DrawingGroup;
            if (drawings != null)
                foreach (var descendant in drawings.GetSelfAndDescendants())
                    yield return descendant;
            else
                yield return child;
        }
    }
}

Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text
Imports System.Windows
Imports System.Windows.Media
Imports GemBox.Document

Module Program

    <STAThread>
    Sub Main()

        ' If using Professional version, put your serial key below.
        ComponentInfo.SetLicense("FREE-LIMITED-KEY")

        Dim document = DocumentModel.Load("Reading.docx")
        Dim pages = document.GetPaginator().Pages
        Dim count = pages.Count

        For i = 0 To count - 1
            Console.WriteLine(New String("-"c, 50))
            Console.WriteLine($"Page {i + 1} of {count}")
            Console.WriteLine(New String("-"c, 50))

            ' Get FrameworkElement object from Word document's page.
            Dim page As DocumentModelPage = pages(i)
            Dim pageContent As FrameworkElement = page.PageContent

            ' Extract text from FrameworkElement object.
            Console.WriteLine(pageContent.ToText())
        Next

    End Sub
End Module

''' <summary>
''' Contains methods that are used to extract text out of a FrameworkElement object.
''' </summary>
Module GemBoxDocumentHelper
    <Runtime.CompilerServices.Extension>
    Function ToText(ByVal root As FrameworkElement) As String
        Dim builder As New StringBuilder()

        For Each visual In root.GetSelfAndDescendants().OfType(Of DrawingVisual)()
            Dim previousRun As GlyphRun = Nothing

            ' Order runs first vertically (Y), then horizontally (X).
            For Each currentRun In visual.Drawing _
                .GetSelfAndDescendants() _
                .OfType(Of GlyphRunDrawing)() _
                .Select(Function(glyph) glyph.GlyphRun) _
                .OrderBy(Function(run) run.BaselineOrigin.Y) _
                .ThenBy(Function(run) run.BaselineOrigin.X)

                If previousRun IsNot Nothing Then
                    ' If base-line of current text segment is left from base-line of previous text segment, then assume that it is new line.
                    If currentRun.BaselineOrigin.X <= previousRun.BaselineOrigin.X Then
                        builder.AppendLine()
                    Else
                        Dim currentRect As Rect = currentRun.ComputeAlignmentBox()
                        Dim previousRect As Rect = previousRun.ComputeAlignmentBox()

                        Dim spaceWidth As Double = currentRun.BaselineOrigin.X + currentRect.Left - previousRun.BaselineOrigin.X - previousRect.Right
                        Dim spaceHeight As Double = (currentRect.Height + previousRect.Height) / 2

                        ' If space between successive text segments has width greater than a sixth of its height, then assume that it is a word (add a space).
                        If spaceWidth > spaceHeight / 6 Then builder.Append(" "c)
                    End If
                End If

                builder.Append(currentRun.Characters.ToArray())
                previousRun = currentRun
            Next
        Next

        Return builder.ToString()
    End Function

    <Runtime.CompilerServices.Extension>
    Private Iterator Function GetSelfAndDescendants(ByVal parent As DependencyObject) As IEnumerable(Of DependencyObject)
        Yield parent

        Dim count = VisualTreeHelper.GetChildrenCount(parent)
        For i = 0 To count - 1
            For Each descendant In VisualTreeHelper.GetChild(parent, i).GetSelfAndDescendants()
                Yield descendant
            Next
        Next
    End Function

    <Runtime.CompilerServices.Extension>
    Private Iterator Function GetSelfAndDescendants(ByVal parent As DrawingGroup) As IEnumerable(Of Drawing)
        Yield parent

        For Each child In parent.Children
            Dim drawings = TryCast(child, DrawingGroup)
            If drawings IsNot Nothing Then
                For Each descendant In drawings.GetSelfAndDescendants()
                    Yield descendant
                Next
            Else
                Yield child
            End If
        Next
    End Function
End Module
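The line-break and word-gap rules inside ToText can be tried out without WPF. The sketch below restates them over a hypothetical Segment type that stands in for the GlyphRun values the example reads (baseline X, alignment box edges, and height); the type, the helper name, and the sample coordinates are all made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for the GlyphRun data used by the ToText example.
class Segment
{
    public string Text;
    public double X;      // baseline origin X
    public double Left;   // alignment box left edge, relative to X
    public double Right;  // alignment box right edge, relative to X
    public double Height; // alignment box height
}

static class SegmentJoiner
{
    // Expects segments already ordered by Y, then by X, as in the example.
    public static string Join(IEnumerable<Segment> segments)
    {
        var builder = new StringBuilder();
        Segment previous = null;

        foreach (var current in segments)
        {
            if (previous != null)
            {
                if (current.X <= previous.X)
                {
                    // The segment starts at or left of the previous one: treat it as a new line.
                    builder.Append('\n');
                }
                else
                {
                    // Horizontal gap between the previous box's right edge and this box's left edge.
                    double gap = current.X + current.Left - previous.X - previous.Right;
                    double averageHeight = (current.Height + previous.Height) / 2;

                    // A gap wider than a sixth of the average height counts as a word break.
                    if (gap > averageHeight / 6)
                        builder.Append(' ');
                }
            }

            builder.Append(current.Text);
            previous = current;
        }

        return builder.ToString();
    }
}

class Program
{
    static void Main()
    {
        var segments = new[]
        {
            new Segment { Text = "Hel",   X = 0,  Left = 0, Right = 30, Height = 12 },
            new Segment { Text = "lo",    X = 30, Left = 0, Right = 20, Height = 12 }, // gap 0: same word
            new Segment { Text = "world", X = 55, Left = 0, Right = 50, Height = 12 }, // gap 5 > 2: new word
            new Segment { Text = "Next",  X = 0,  Left = 0, Right = 40, Height = 12 }  // X resets: new line
        };

        Console.WriteLine(SegmentJoiner.Join(segments));
        // Prints:
        // Hello world
        // Next
    }
}
```

The one-sixth threshold is a heuristic taken from the example; documents with tight or unusually wide letter spacing may need a different ratio.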


If you want to try GemBox.Document yourself, you can download the free version. It delivers the same performance and set of features as the professional version, but with some operations limited. To remove the limitation, you need to purchase a license.


Source: https://main.gemboxsoftware.com/document/examples/c-sharp-vb-net-open-read-word-file/301
