{"id":315,"date":"2010-07-21T09:11:41","date_gmt":"2010-07-21T13:11:41","guid":{"rendered":"http:\/\/www.blogseye.com\/"},"modified":"2010-07-21T09:13:52","modified_gmt":"2010-07-21T13:13:52","slug":"php-rtf-to-text-converter","status":"publish","type":"page","link":"http:\/\/blogseye\/php-rtf-to-text-converter.html","title":{"rendered":"PHP RTF to Text Converter"},"content":{"rendered":"

I write short stories and have a website that includes some writing utilities. I wanted to include a “Weasel Word” analyzer that highlights words that should not be used in writing. It occurred to me that it would be nice if the user could upload an RTF file so that I could convert it to text, search for the words, and display the results. The RTF converters I found all seem to cost money or don’t work. As usual, I had to write one of my own.<\/p>\n

I’ve tested this with RTF files from Word 2008 and from Google Docs. I don’t know if it will work on other flavors of RTF. The code below is my test code. I will be making it into a function, but I wanted to share this for those trying to convert RTF to TXT on the fly.<\/p>\n


\n
$text = file_get_contents('testfile.rtf');\r\nif ($text===false || !strlen($text)) {\r\n echo \"bad file\";\r\n exit();\r\n\r\n}\r\n\/\/ we'll try to fix up the parts of the rtf as best we can\r\n\/\/ clean up the file a little to simplify parsing\r\n$text=str_replace(\"\\r\",' ',$text); \/\/ returns\r\n$text=str_replace(\"\\n\",' ',$text); \/\/ new lines\r\n$text=str_replace('  ',' ',$text); \/\/ double spaces\r\n$text=str_replace('  ',' ',$text); \/\/ double spaces\r\n$text=str_replace('  ',' ',$text); \/\/ double spaces\r\n$text=str_replace('  ',' ',$text); \/\/ double spaces\r\n$text=str_replace('} {','}{',$text); \/\/ embedded spaces\r\n\/\/ skip over the heading stuff\r\n$j=strpos($text,'{',1); \/\/ skip ahead to the first part of the header\r\nif ($j===false) {\r\n \/\/ no nested group found, so this is not an rtf file we can parse\r\n echo \"bad file\";\r\n exit();\r\n}\r\n\r\n$loc=1;\r\n$t=\"\";\r\n\r\n$ansa=\"\";\r\n$len=strlen($text);\r\ngetpgraph(); \/\/ skip by the first paragraph\r\n\r\nwhile($j<$len) {\r\n $c=substr($text,$j,1);\r\n if ($c==\"\\\\\") {\r\n \/\/ have a tag\r\n $tag=gettag();\r\n if (strlen($tag)>0) {\r\n \/\/ process known tags\r\n switch ($tag) {\r\n case 'par':\r\n $ansa.=\"\\r\\n\";\r\n break;\r\n \/\/ add a list of common tags\r\n \/\/ parameter tags\r\n case 'spriority1':\r\n case 'fprq2':\r\n case 'author':\r\n case 'operator':\r\n case 'sqformat':\r\n case 'company':\r\n case 'xmlns1':\r\n case 'wgrffmtfilter':\r\n case 'pnhang':\r\n case 'themedata':\r\n case 'colorschememapping':\r\n $tt=gettag();\r\n break;\r\n case '*':\r\n case 'info':\r\n case 'stylesheet':\r\n \/\/ gets to end of paragraph\r\n $j--;\r\n getpgraph();\r\n break;\r\n default:\r\n \/\/ ignore the tag\r\n }\r\n }\r\n } else {\r\n $ansa.=$c;\r\n }\r\n $j++;\r\n}\r\n$ansa=str_replace('{','',$ansa);\r\n$ansa=str_replace('}','',$ansa);\r\necho \"<pre>$ansa<\/pre>\";\r\n\r\nfunction getpgraph() {\r\n \/\/ if the first char after a tag is { then throw out the entire paragraph\r\n \/\/ this has to be nested\r\n global $text;\r\n global $j;\r\n global $len;\r\n $nest=0;\r\n while(true) {\r\n $j++;\r\n if ($j>=$len) break;\r\n if (substr($text,$j,1)=='}') {\r\n if ($nest==0) return;\r\n $nest--;\r\n }\r\n if (substr($text,$j,1)=='{') {\r\n $nest++;\r\n }\r\n }\r\n return;\r\n}\r\n\r\nfunction gettag() {\r\n \/\/ gets the text following the \\ character, or gets the param if it is there\r\n global $text;\r\n global $j;\r\n global $len;\r\n $tag='';\r\n while(true) {\r\n $j++;\r\n if ($j>=$len) break;\r\n $c=substr($text,$j,1);\r\n if ($c==' ') break;\r\n if ($c==';') break;\r\n if ($c=='}') break;\r\n if ($c==\"\\\\\") {\r\n $j--;\r\n break;\r\n }\r\n if ($c==\"{\") {\r\n \/\/getpgraph();\r\n break;\r\n }\r\n if ((($c>='0')&&($c<='9'))||(($c>='a')&&($c<='z'))||(($c>='A')&&($c<='Z'))||$c==\"'\"||$c==\"-\"||$c==\"*\" ){\r\n $tag=$tag.$c;\r\n } else {\r\n \/\/ end of tag\r\n $j--;\r\n break;\r\n }\r\n }\r\n return $tag;\r\n\r\n}\r\n<\/pre>\n
\n","protected":false},"excerpt":{"rendered":"

I am involved in writing short stories and have a website that includes some writing utilities. I wanted to include a “Weasel Word” analyzer that highlights words that should not be used in writing. It occurred to me that it would be nice if the user could upload an RTF file so that I could […]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"open","ping_status":"closed","template":"","meta":[],"_links":{"self":[{"href":"http:\/\/blogseye\/wp-json\/wp\/v2\/pages\/315"}],"collection":[{"href":"http:\/\/blogseye\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"http:\/\/blogseye\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"http:\/\/blogseye\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/blogseye\/wp-json\/wp\/v2\/comments?post=315"}],"version-history":[{"count":0,"href":"http:\/\/blogseye\/wp-json\/wp\/v2\/pages\/315\/revisions"}],"wp:attachment":[{"href":"http:\/\/blogseye\/wp-json\/wp\/v2\/media?parent=315"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}