I want to find paragraphs and titles in Word documents.
I use Apache POI to do this.
Here is an example of what I use:

    fs = new POIFSFileSystem(new FileInputStream(filesname));
    HWPFDocument doc = new HWPFDocument(fs);
    WordExtractor we = new WordExtractor(doc);
    ArrayList<String> titles = new ArrayList<String>();
    try {
        for (int i = 0; i < we.getText().length() - 1; i++) {
            int startIndex = i;
            int endIndex = i + 1;
            Range range = new Range(startIndex, endIndex, doc);
            CharacterRun cr = range.getCharacterRun(0);
            if (cr.isBold() || cr.isItalic() || cr.getUnderlineCode() != 0) {
                // Keep extending while the text stays bold, italic or underlined
                while (cr.isBold() || cr.isItalic() || cr.getUnderlineCode() != 0) {
                    i++;
                    endIndex += 1;
                    range = new Range(endIndex, endIndex + 1, doc);
                    cr = range.getCharacterRun(0);
                }
                range = new Range(startIndex, endIndex - 1, doc);
                titles.add(range.text());
            }
        }
    } catch (IndexOutOfBoundsException iobe) {
        // sometimes this happens; I still have to find out why
    }

This works for all text that is bold, italic or underlined.
But what I want is to find the font that is used most often, and then to find the text whose font style differs from it.
Any ideas?
Well, here are a few ideas; you could try some of the following:
- `cr.getFontSize()` can be used as a first parameter to see whether the font size changes within the range. It would be a good identifier in combination with bold, italic or underline.
- `cr.getFontName()` can also be used to determine when and where the font changes within a specified range.
- `cr.getColor()` is another possibility, to help identify whether the user is using different colors for the font.

I think I would iterate over the range run by run, since the CharacterRun item changes each time the text attributes change. Then evaluate each item based on its position in the paragraph as well as all of the previously mentioned attributes (size, color, name, bold, italic, etc.). Probably build some kind of weighted scale based on the most common values. It may also be worthwhile to create a Title object that stores the values of each attribute, which could later speed up searches through the character runs of the same document.
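To make the "most common values" idea a bit more concrete, here is a rough sketch in plain Java, so it runs without POI on the classpath. `StyledRun` and `TitleFinder` are hypothetical names of my own, not POI classes, and weighting each style by the number of characters it covers is just one assumption about how the scale could work. The fields are meant to be filled from the CharacterRun getters mentioned above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical holder for the text and attributes of one character run. */
final class StyledRun {
    final String text;
    final String fontName;  // e.g. from cr.getFontName()
    final int fontSize;     // e.g. from cr.getFontSize()
    final boolean bold;     // e.g. from cr.isBold()
    final boolean italic;   // e.g. from cr.isItalic()

    StyledRun(String text, String fontName, int fontSize, boolean bold, boolean italic) {
        this.text = text;
        this.fontName = fontName;
        this.fontSize = fontSize;
        this.bold = bold;
        this.italic = italic;
    }

    /** Key that identifies the style and ignores the text itself. */
    String styleKey() {
        return fontName + "|" + fontSize + "|" + bold + "|" + italic;
    }
}

/** Hypothetical helper: finds the dominant style and the runs that deviate from it. */
final class TitleFinder {

    /** Returns the style key covering the most characters, or null for an empty list. */
    static String dominantStyle(List<StyledRun> runs) {
        Map<String, Integer> charCounts = new HashMap<>();
        for (StyledRun run : runs) {
            // Assumption: weight each style by how many characters use it
            charCounts.merge(run.styleKey(), run.text.length(), Integer::sum);
        }
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> entry : charCounts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    /** Returns the trimmed text of every non-blank run whose style is not the dominant one. */
    static List<String> titleCandidates(List<StyledRun> runs) {
        String dominant = dominantStyle(runs);
        List<String> candidates = new ArrayList<>();
        for (StyledRun run : runs) {
            String trimmed = run.text.trim();
            if (!trimmed.isEmpty() && !run.styleKey().equals(dominant)) {
                candidates.add(trimmed);
            }
        }
        return candidates;
    }
}
```

With POI, one way to build the list would be to loop over `doc.getRange()` using `numCharacterRuns()` and `getCharacterRun(i)`, creating one `StyledRun` per run. Color or underline could be appended to `styleKey()` in the same way if they turn out to be useful identifiers.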