PHP RTF to Text Converter
I write short stories and have a website that includes some writing utilities. I wanted to include a “Weasel Word” analyzer that highlights words that should not be used in writing. It occurred to me that it would be nice if the user could upload an RTF file so that I could convert it to text, search for the words and display the results. The RTF converters I found all seem to cost money or don’t work. As usual, I had to write one of my own.
I’ve tested this with RTF files from Word 2008 and from Google Docs. I don’t know if it will work on other flavors of RTF. The code below is my test code. I will be making it into a function, but I wanted to share this for those trying to convert RTF to TXT on the fly.
    $text = file_get_contents('testfile.rtf');
    if (!strlen($text)) {
        echo "bad file";
        exit();
    }

    // we'll try to fix up the parts of the rtf as best we can
    // clean up the file a little to simplify parsing
    $text=str_replace("\r",' ',$text);   // returns
    $text=str_replace("\n",' ',$text);   // new lines
    $text=str_replace('  ',' ',$text);   // double spaces
    $text=str_replace('  ',' ',$text);   // double spaces
    $text=str_replace('  ',' ',$text);   // double spaces
    $text=str_replace('  ',' ',$text);   // double spaces
    $text=str_replace('} {','}{',$text); // embedded spaces

    // skip over the heading stuff
    $j=strpos($text,'{',1); // skip ahead to the first part of the header
    $loc=1;
    $t="";
    $ansa="";
    $len=strlen($text);
    if ($j===false) {
        $j=0;        // no header group found: parse from the start
    } else {
        getpgraph(); // skip by the first paragraph
    }

    while($j<$len) {
        $c=substr($text,$j,1);
        if ($c=="\\") {
            // have a tag
            $tag=gettag();
            if (strlen($tag)>0) {
                // process known tags
                switch ($tag) {
                    case 'par':
                        $ansa.="\r\n";
                        break;
                    // add a list of common tags
                    // parameter tags
                    case 'spriority1':
                    case 'fprq2':
                    case 'author':
                    case 'operator':
                    case 'sqformat':
                    case 'company':
                    case 'xmlns1':
                    case 'wgrffmtfilter':
                    case 'pnhang':
                    case 'themedata':
                    case 'colorschememapping':
                        $tt=gettag();
                        break;
                    case '*':
                    case 'info':
                    case 'stylesheet':
                        // gets to end of paragraph
                        $j--;
                        getpgraph();
                        break;
                    default:
                        // ignore the tag
                }
            }
        } else {
            $ansa.=$c;
        }
        $j++;
    }
    $ansa=str_replace('{','',$ansa);
    $ansa=str_replace('}','',$ansa);
    echo "<pre>$ansa</pre>";

    function getpgraph() {
        // if the first char after a tag is { then throw out the entire paragraph
        // this has to be nested
        global $text;
        global $j;
        global $len;
        $nest=0;
        while(true) {
            $j++;
            if ($j>=$len) break;
            if (substr($text,$j,1)=='}') {
                if ($nest==0) return;
                $nest--;
            }
            if (substr($text,$j,1)=='{') {
                $nest++;
            }
        }
        return;
    }

    function gettag() {
        // gets the text following the \ character, or gets the param if it is there
        global $text;
        global $j;
        global $len;
        $tag='';
        while(true) {
            $j++;
            if ($j>=$len) break;
            $c=substr($text,$j,1);
            if ($c==' ') break;
            if ($c==';') break;
            if ($c=='}') break;
            if ($c=="\\") {
                $j--;
                break;
            }
            if ($c=="{") {
                //getpgraph();
                break;
            }
            if ((($c>='0')&&($c<='9'))||(($c>='a')&&($c<='z'))||(($c>='A')&&($c<='Z'))||$c=="'"||$c=="-"||$c=="*") {
                $tag=$tag.$c;
            } else {
                // end of tag
                $j--;
                break;
            }
        }
        return $tag;
    }
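For anyone who wants the function form now, here is a minimal sketch of how the test code could be wrapped up. The name rtf_to_text and the two closures are placeholders I made up for illustration, not a finished version. It keeps the same tag handling and the same limitations, with two small differences: it collapses runs of spaces with one regular expression instead of four passes, and it starts parsing at the beginning when the file has no header group.

```php
<?php
// Sketch only: rtf_to_text() is a hypothetical name, not part of any library.
function rtf_to_text(string $rtf): string
{
    // Flatten line breaks and collapse runs of spaces to simplify parsing.
    $text = str_replace(["\r", "\n"], ' ', $rtf);
    $text = preg_replace('/ {2,}/', ' ', $text);
    $text = str_replace('} {', '}{', $text);
    $len  = strlen($text);
    if ($len === 0) {
        return '';
    }

    // Position of the first header group, if there is one.
    $start = strpos($text, '{', 1);
    $j = ($start === false) ? 0 : $start;

    // Skip forward to the brace that closes the current group (handles nesting).
    $skipGroup = function () use ($text, $len, &$j): void {
        $nest = 0;
        while (++$j < $len) {
            if ($text[$j] === '}') {
                if ($nest === 0) {
                    return;
                }
                $nest--;
            } elseif ($text[$j] === '{') {
                $nest++;
            }
        }
    };

    // Read the control word (or parameter) that follows the current position.
    $readTag = function () use ($text, $len, &$j): string {
        $tag = '';
        while (++$j < $len) {
            $c = $text[$j];
            if ($c === ' ' || $c === ';' || $c === '}' || $c === '{') {
                break;
            }
            if ($c === '\\' || !preg_match('/^[0-9A-Za-z\'*-]$/', $c)) {
                $j--; // let the main loop see this character again
                break;
            }
            $tag .= $c;
        }
        return $tag;
    };

    // Tags followed by a parameter to discard, and tags whose whole group is dropped.
    $paramTags = ['spriority1', 'fprq2', 'author', 'operator', 'sqformat', 'company',
                  'xmlns1', 'wgrffmtfilter', 'pnhang', 'themedata', 'colorschememapping'];
    $groupTags = ['*', 'info', 'stylesheet'];

    if ($start !== false) {
        $skipGroup(); // skip the first header group
    }

    $out = '';
    while ($j < $len) {
        $c = $text[$j];
        if ($c === '\\') {
            $tag = $readTag();
            if ($tag === 'par') {
                $out .= "\r\n";
            } elseif (in_array($tag, $paramTags, true)) {
                $readTag();
            } elseif (in_array($tag, $groupTags, true)) {
                $j--;
                $skipGroup();
            }
            // any other tag is ignored
        } else {
            $out .= $c;
        }
        $j++;
    }
    return str_replace(['{', '}'], '', $out);
}
```

It would be called with the file contents, for example echo rtf_to_text(file_get_contents('testfile.rtf')); and because it takes a string rather than a file name, it should also work on RTF pulled from a database field.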
Ha this actually works, thanks mate. I’ve searched for this for 2 days and tried php classes written with more than 1000 lines and this tiny little script less than 120 lines works.
By the way I am a PHP developer and currently working on a web portal that gets its information from a soap api powered by C#. Some text is saved as rtf inside a SQL database and I could not find a proper way to convert it just back to plain text without any nonsense.
Look at the https://sourceforge.net/projects/phprtf project. It has a full API for creating RTF files.
As already stated, being able to convert from RTF/DOC to text (WordPress) would make a great plugin. I believe there is more need for it than you think!
-Skip
I am writing a Submission/Slush management plugin for magazines that use WordPress. I think that the RTF to Post functionality may be part of it eventually.
This is a complicated plugin and it will take a while to finish.
It would actually be amazing to have a WordPress RTF/DOC to WordPress post converter 🙂
Thanks for this code.
Okay, thanks!
I need to convert RTFs, so yeah >.>
I have no plans for it right now. The code is right here and I have no thoughts of enforcing copyright on it. It is free for anyone to use. I was going to use it in a document system, but I did not finish that part of it. I could make an RTF uploader for creating WordPress posts, but I don’t think there is much demand.
Mostly it is just interesting code and the kind of thing that when you need it, it is nice to have.
Keith
I don’t suppose you’re planning to GPL this and/or make it a WordPress plugin? >.>b Because I could really use something like this, just with a few modifications.