So it seems that a lot of what I have done around here is writing up the daily kanji posts (roughly 66% of my posts). Boy, oh boy, do I love writing those posts. However, the thought of opening the edict file in a text editor and using the ol' Ctrl-F left me feeling nauseous, with cold sweats and ants in my pants, and even caused my lawn to turn a shade directly between green and dead. To get over this horrible feeling I immediately opened my nearest text editor and started checking what was new on Slashdot. Once I had that out of my system, I came up with a little script to do all that searching for me.

After adding the ability to remember which words had already been shown, I had a way to pull words from edict that contain only the kanji I have posted so far, carry the (P) flag (which marks priority words to learn), and have not been shown yet. So when we did 二, it showed a bunch of words based on that kanji. When I showed 十, it showed 二十, because 二 was a kanji we had done before. I also made it accept any hiragana or katakana in a word.

Note: I saved the edict file in UTF-8 encoding. The script will not work correctly with the EUC-JP (JIS-based) encoding the file ships in.
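If you would rather do that conversion in PHP than in a text editor, here is a rough sketch. It assumes the downloaded file is EUC-JP and that the mbstring extension is available; the function name convert_to_utf8 and the file names are made up for illustration, so adjust them (and the source encoding) to match your copy.

```php
<?php
// Sketch: convert a dictionary file to UTF-8, one line at a time.
// Assumption: the source is EUC-JP; pass a different $from if yours is not.
function convert_to_utf8(string $src, string $dst, string $from = 'EUC-JP'): int
{
    $in = fopen($src, 'r');
    $out = fopen($dst, 'w');
    if ($in === false || $out === false) {
        throw new RuntimeException("Could not open $src or $dst");
    }
    $lines = 0;
    // fgets returns false at end of file, which ends the loop.
    while (($line = fgets($in)) !== false) {
        fwrite($out, mb_convert_encoding($line, 'UTF-8', $from));
        $lines++;
    }
    fclose($in);
    fclose($out);
    return $lines; // number of lines written
}

// Hypothetical file names: "edict" as downloaded, "edict.txt" as the script expects.
if (file_exists('edict')) {
    echo convert_to_utf8('edict', 'edict.txt') . " lines converted\n";
}
```

Converting line by line keeps memory use low, which matters a little since the dictionary is a fairly large file.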

You can get the edict file here.

<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /></head>
<body>
<?php

//This code is not under any license and can be considered public domain.

setlocale( LC_ALL, 'en_US.UTF-8' );

//This section grabs all the words that have already been posted.
$shownalreadywords=array();
if(file_exists("showalready.txt"))
    $shownalready = fopen("showalready.txt","r");
else
    $shownalready = fopen("showalready.txt","x+");
//fgets returns false at end of file, which ends the loop cleanly.
while (($line = fgets($shownalready)) !== false) {
    $shownalreadywords[trim($line)]=true;
}
fclose($shownalready);
//end section

//reopen the file for appending new words found
$shownalready = fopen("showalready.txt","a+");

$handle = fopen("edict.txt","r");

//This regex will get run on each line as we read through the file.
//To get the new words, simply add the new kanji into the regex.
$preg='/^([\p{Katakana}\p{Hiragana}一二三四五六七八九十右左百上下千中人日円金見天気水木火]*) \[.*\(P\).*/u';

//This loop grabs each line of the edict file and runs the regex
//against it.  When it finds a match, it checks whether that word
//has been posted yet.  If it has not, it echoes it to the browser and
//appends it to the showalready.txt file.
while (($buffer = fgets($handle)) !== false) {
    if(preg_match($preg,$buffer,$matches))
    {
        if(!isset($shownalreadywords[$matches[1]]))
        {
            //write out that we have shown this word.
            fputs($shownalready,$matches[1]."\n");
            //There seem to be multiple listings for the same word.
            //This will make sure we only get it once.
            $shownalreadywords[$matches[1]]=true;
            echo $matches[1]."<BR>";
        }
    }
}
fclose($handle);
fclose($shownalready);

?>
</body>
</html>
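
Since the only thing that changes from post to post is the list of kanji inside the regex, here is a small sketch of building the pattern from a plain string instead of editing it by hand. The helper name build_pattern is hypothetical, and the sample lines only imitate the edict layout; they are not copied from the real file. It uses + rather than * so that an empty word can never match.

```php
<?php
// Hypothetical helper: build the matching regex from a string of studied kanji.
function build_pattern(string $kanji): string
{
    // preg_quote is a precaution in case a regex metacharacter sneaks in.
    $class = '\p{Katakana}\p{Hiragana}' . preg_quote($kanji, '/');
    // The word must consist only of kana and studied kanji, be followed
    // by a bracketed reading, and carry the (P) flag later on the line.
    return '/^([' . $class . ']+) \[.*\(P\)/u';
}

$pattern = build_pattern('一二十');

// Illustrative lines in the edict layout (not real dictionary entries).
$samples = [
    "二十 [にじゅう] /(n) twenty/(P)/",      // studied kanji only, has (P)
    "二百 [にひゃく] /(n) two hundred/(P)/", // 百 is not in the list yet
    "十二 [じゅうに] /(n) twelve/",          // no (P) flag
];

foreach ($samples as $line) {
    if (preg_match($pattern, $line, $m)) {
        echo $m[1], "\n"; // only 二十 passes both checks
    }
}
```

With something like this, adding the next kanji means appending one character to the string passed to build_pattern.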
