[Windows Mobile .NET CF] Chinese-English Dictionary – Day9

2024-03-20 08:02:28

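I've written a number of network-related applications; with a network connection, the phone becomes a powerful front end.

Next, though, I want to write some Windows Mobile applications that work without a network, for example a Chinese-English dictionary.

Dictionary software like this isn't hard to write; the real barrier is the dictionary data files…

Free dictionary files can be found online, for example pydict.

After studying its dictionary files, I found these characteristics:
1. English entries are stored in separate files by first letter, a ~ z
2. Already sorted (a-z)
3. Encoded in Big5.
4. Fields are separated by '=', in the order English=Chinese=phonetic symbols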
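
The record layout in point 4 can be sketched as follows. This is only an illustration: `RecordLineDemo`, `SplitRecord`, and the sample line are hypothetical names and data invented here, not taken from pydict, and real records also carry control codes for parts of speech.

```csharp
using System;

public static class RecordLineDemo
{
    // Splits one "English=Chinese=phonetic" line into exactly three fields.
    // Missing fields come back as empty strings, so callers never see null.
    public static string[] SplitRecord(string line)
    {
        string[] parts = (line ?? string.Empty).Split('=');
        string[] fields = { string.Empty, string.Empty, string.Empty };
        for (int i = 0; i < fields.Length && i < parts.Length; i++)
            fields[i] = parts[i];
        return fields;
    }
}
```

For the made-up line `"apple=苹果=aepl"`, `SplitRecord` yields the three fields `apple`, `苹果`, and `aepl`; a line with no `=` yields the whole line as the English field and two empty strings.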

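Based on pydict's source code, I wrote a very basic class that represents one dictionary record
and also converts the phonetic symbols and parts of speech. The code: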

/// <summary>
/// One record of the Chinese-English dictionary.
/// </summary>
public class dictdata
{
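    /// <summary>
    /// File position; for internal use.
    /// </summary>
    public long filepos { get; set; }

    /// <summary>
    /// English headword.
    /// </summary>
    public string eng { get; set; }

    /// <summary>
    /// Chinese definition (raw, still containing part-of-speech codes).
    /// </summary>
    public string chinese { get; set; }

    /// <summary>
    /// Phonetic symbols (raw).
    /// </summary>
    public string soundmark { get; set; }


    /// <summary>
    /// Part-of-speech table, indexed by character code.
    /// </summary>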
    private static string[] prop = new string[] { 
        " ", " ", " ", "<<形容词>>", "<<副词>>", "art. ", 
        "<<连接词>>", "int.  ", "<<名词>>", " ", " ", "num. ", 
        "prep. ", " ", "pron.  ", "<<动词>>", "<<助动词>>", 
        "<<非及物动词>>", "<<及物动词>>", "vbl. ", " ", "st. ", 
        "pr. ", "<<过去分词>>", "<<复数>>", "ing. ", " ", "<<形容词>>", 
        "<<副词>>", "pla. ", "pn. ", " " };

    /// <summary>
    /// Chinese text prepared for display.
    /// </summary>
    public string displaychinese
    {
        get
        {
            // Replace part-of-speech codes using the table above.
            StringBuilder result = new StringBuilder();
            for (int i = 0; i < chinese.Length; i++)
            {
                int idx = Convert.ToInt32(chinese[i]);
                if (idx < prop.Length)
                {
                    if (result.Length != 0)
                    {
                        result.Append("\r\n");
                    }
                    result.Append(prop[idx]);
                }
                else
                    result.Append(chinese[i]);
            }
            return result.ToString();
        }
    }

    /// <summary>
    /// Phonetic symbol table.
    /// </summary>
    private static Dictionary<int, string> soundmarktable = new Dictionary<int, string>() {
    {0x01,"I"},
    {0x02,"E"},
    {0x03,"ae"},
    {0x04,"a"},
    {0x06,"c"},
    {0x0b,"8"},
    {0x0e, "U"},
    {0x0f, "^"},
    {0x10, "2"},
    {0x11, "2*"},
    {0x13, "2~"},
    {0x17, "l."},
    {0x19, "n_"},
    {0x1c, "&"},
    {0x1d, "S"},
    {0x1e, "3"},
    };

    public string displaysoundmark
    {
        get
        {
            // Map each character through the phonetic symbol table.
            StringBuilder result = new StringBuilder();
            for (int i = 0; i < soundmark.Length; i++)
            {
                int idx = Convert.ToInt32(soundmark[i]);
                if ((i == 0) && (idx == 0x65))
                    continue;
                string founddisplay;
                if (soundmarktable.TryGetValue(idx, out founddisplay) == true)
                    result.Append(founddisplay);
                else
                    result.Append(soundmark[i]);
            }
            return result.ToString();
        }
    }

    public string displaysoundmarkandchinese
    {
        get
        {
            return "音标:" + displaysoundmark + "\r\n" + displaychinese;
        }
    }


    private static char[] splitchar = new char[] { '=' };

    public dictdata(string data, long pos)
    {
        filepos = pos;
        if (string.IsNullOrEmpty(data) == false)
        {
            string[] parts = data.Split(splitchar);
            eng = (parts.Length >= 1) ? parts[0] : string.Empty;
            chinese = (parts.Length >= 2) ? parts[1] : string.Empty;
            soundmark = (parts.Length >= 3) ? parts[2] : string.Empty;
        }
        else
        {
            eng = string.Empty;
            chinese = string.Empty;
            soundmark = string.Empty;
        }
    }

    public override string ToString()
    {
        return eng;
    }
}

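I'm going to show two ways to query dictionary files like these:
one searches directly in the file, the other loads everything into memory and searches there.
Loading everything into memory is of course much simpler, since we can search with the .NET Framework's container classes.
But keep in mind that it is simpler only because the .NET Framework has already done much of the work for us.

Saving the sweeter part for last, I'll cover searching directly in the file first.

Because the data is already sorted by English word, we should take advantage of a sorted search.
For sorted data, the search that is both fast and easy to write is binary search.

The .NET Framework does have BinarySearch, but a file is addressed in bytes
while the data is organized in lines, so the built-in BinarySearch can't be used.
We not only have to write our own, we also have to map byte offsets to whole lines as we compute.
In other words, this is a binary search with a small twist ^_^
(Search is more or less my day job, so this had better be written well or I'll lose face!)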
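
To make the byte-offset-to-line idea concrete, here is a minimal iterative sketch. It is not the routine used in this post: `SortedLineSearch`, `LowerBound`, `LineStart`, and `ReadKey` are hypothetical names, and it assumes the keys before `=` are ASCII, lines end with `\n`, and the file is sorted in ordinal order.

```csharp
using System;
using System.IO;
using System.Text;

public static class SortedLineSearch
{
    // Returns the byte offset of the first line whose key (text before '=')
    // is >= target, or s.Length when every key is smaller.
    public static long LowerBound(Stream s, string target)
    {
        long lo = 0, hi = s.Length;   // both always sit on a line start (or EOF)
        while (lo < hi)
        {
            // Pick a byte in the middle, then back up to the start of its line.
            long mid = LineStart(s, (lo + hi) / 2);
            long next;
            string key = ReadKey(s, mid, out next);
            if (string.CompareOrdinal(key, target) < 0)
                lo = next;            // this line and everything before it is too small
            else
                hi = mid;             // this line is still a candidate
        }
        return lo;
    }

    // Walks backward from pos until the previous byte is '\n' or pos is 0.
    private static long LineStart(Stream s, long pos)
    {
        while (pos > 0)
        {
            s.Position = pos - 1;
            if (s.ReadByte() == '\n') break;
            pos--;
        }
        return pos;
    }

    // Reads the line starting at 'start'; returns its key and the offset of the next line.
    private static string ReadKey(Stream s, long start, out long next)
    {
        s.Position = start;
        StringBuilder key = new StringBuilder();
        bool inKey = true;
        int b;
        while ((b = s.ReadByte()) != -1 && b != '\n')
        {
            if (b == '=') inKey = false;
            if (inKey) key.Append((char)b);
        }
        next = s.Position;
        return key.ToString();
    }
}
```

On a stream holding `"apple=1\nbanana=2\ncherry=3\n"`, `LowerBound(stream, "banana")` returns 8, the offset where the `banana` line starts. Reading byte by byte keeps the sketch short; a real implementation would likely want buffering.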

The main BinarySearch function:

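/// <summary>
/// Binary-search for the English word index; if it is not found, return the closest record.
/// </summary>
/// <param name="r">Stream over one dictionary file.</param>
/// <param name="index">English word to look up.</param>
/// <param name="zonebegin">Start byte offset of the search zone.</param>
/// <param name="zoneend">End byte offset of the search zone.</param>
/// <param name="encode">Text encoding of the file.</param>
/// <returns>The matching record, or the closest one.</returns>
public dictdata SeekToLine(Stream r, string index, long zonebegin, long zoneend, Encoding encode)
{
    // By default, take the middle record with ReadPreLine.
    r.Position = (zonebegin + zoneend) / 2;
    dictdata middledata = ReadPreLine(r, encode);

    // Found the exact English word.
    if (string.Compare(middledata.eng, index, true) == 0)
        return middledata;

    // ReadPreLine did not find it; the middle may need ReadNextLine instead...
    if (middledata.filepos == zonebegin)
    {
        r.Position = (zonebegin + zoneend) / 2;
        middledata = ReadNextLine(r, encode);

        // Found the exact English word.
        if (string.Compare(middledata.eng, index, true) == 0)
            return middledata;

        // The exact English word does not exist; return the closest record.
        if (middledata.filepos == zoneend)
            return middledata;
    }

    string middleindexlow = middledata.eng.ToLower();
    int cmp = index.CompareTo(middleindexlow);
    if (cmp < 0)
    {
        // Search the left half.
        return SeekToLine(r, index, zonebegin, middledata.filepos, encode);
    }
    else
    {
        // Search the right half.
        return SeekToLine(r, index, middledata.filepos, zoneend, encode);
    }
}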

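Of course, it needs companion functions that turn a byte offset into line-based data:

/// <summary>
/// Scan backward until endtag is found; the returned position points to the byte after endtag.
/// </summary>
/// <param name="r">Stream to scan.</param>
/// <param name="endtag">Byte value that ends a line.</param>
/// <returns>Position of the byte after endtag, or 0 if none is found.</returns>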
private long seekback(Stream r, int endtag)
{
    long currentpos = r.Position;
    while (currentpos > 0)
    {
        currentpos--;
        r.Position = currentpos;
        if (r.ReadByte() == endtag)
            return currentpos+1;            
    }
    r.Position = 0;
    return 0;
}

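/// <summary>
/// Scan forward until begintag is found; the returned position points to the byte after begintag.
/// </summary>
/// <param name="r">Stream to scan.</param>
/// <param name="begintag">Byte value that precedes the next line.</param>
/// <returns>Position of the byte after begintag, or the stream length if none is found.</returns>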
private long seeknext(Stream r, int begintag)
{
    long currentpos = r.Position;
    long finalpos = r.Length;
    while (currentpos < finalpos)
    {
        currentpos++;
        r.Position = currentpos;
        if (r.ReadByte() == begintag)
            return currentpos + 1;
    }
    r.Position = finalpos;
    return finalpos;
}

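/// <summary>
/// Read the line just before the current position of stream r.
/// </summary>
/// <param name="r">Stream over one dictionary file.</param>
/// <param name="encode">Text encoding of the file.</param>
/// <returns>The record on that line.</returns>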
public dictdata ReadPreLine(Stream r, Encoding encode)
{
    long pos = seekback(r, 0x0a);
    StreamReader endReader = new StreamReader(r, encode);
    try
    {
        return new dictdata(endReader.ReadLine(), pos);
    }
    finally
    {
        endReader.DiscardBufferedData();
    }
}

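/// <summary>
/// Read the line just after the current position of stream r.
/// </summary>
/// <param name="r">Stream over one dictionary file.</param>
/// <param name="encode">Text encoding of the file.</param>
/// <returns>The record on that line.</returns>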
public dictdata ReadNextLine(Stream r, Encoding encode)
{
    long pos = seeknext(r, 0x0a);
    StreamReader sr = new StreamReader(r, encode);
    try
    {
        return new dictdata(sr.ReadLine(), pos);
    }
    finally
    {
        sr.DiscardBufferedData();
    }
}

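With that, we can write the English-to-Chinese lookup:

/// <summary>
/// Find the records closest to index; return at most maxcount of them.
/// </summary>
/// <param name="r">Stream over one dictionary file.</param>
/// <param name="index">English word to look up.</param>
/// <param name="encode">Text encoding of the file.</param>
/// <param name="maxcount">Maximum number of records to return.</param>
/// <returns>Records starting at the closest match.</returns>
private List<dictdata> SeekData(Stream r, string index, Encoding encode, int maxcount)
{
    dictdata firstdata = SeekToLine(r, index, 0, r.Length, encode);
    return ReadData(r, firstdata.filepos, encode, maxcount);
}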

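/// <summary>
/// Read records sequentially.
/// </summary>
/// <param name="r">Stream over one dictionary file.</param>
/// <param name="beginpos">Byte offset of the first line to read.</param>
/// <param name="encode">Text encoding of the file.</param>
/// <param name="maxcount">Maximum number of records to read.</param>
/// <returns>The records read, at most maxcount.</returns>
private List<dictdata> ReadData(Stream r, long beginpos, Encoding encode, int maxcount)
{
    List<dictdata> result = new List<dictdata>();
    r.Position = beginpos;
    StreamReader sr = new StreamReader(r, encode);
    try
    {
        string line;
        // Stop at end of file instead of adding empty records.
        while ((maxcount-- > 0) && ((line = sr.ReadLine()) != null))
            result.Add(new dictdata(line, 0));
    }
    finally
    {
        sr.DiscardBufferedData();
    }
    return result;
}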

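/// <summary>
/// Look up an English word; return at most maxcount dictionary records.
/// </summary>
/// <param name="english">English word to look up.</param>
/// <param name="maxcount">Maximum number of records to return.</param>
/// <returns>Records starting at the closest match.</returns>
public List<dictdata> EnglishToChinese(string english, int maxcount)
{
    if (english.Length == 0)
        return new List<dictdata>();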

    string filename = Path.Combine(libpath, english[0] + ".lib");
    if ((filename != lastopenfile) || (lastopenfs == null))
    {
        if (lastopenfs != null)
        {
            lastopenfs.Dispose();
            lastopenfs = null;
        }
        if (File.Exists(filename))
        {
            lastopenfile = filename;
            lastopenfs = File.OpenRead(filename);
        }
    }

    if (lastopenfs != null)
    {
        // When the English term is one character long we can take a shortcut:
        // the first line of the file is usually that very entry.
        if (english.Length == 1)
        {
            lastopenfs.Position = 0;
            StreamReader sr = new StreamReader(lastopenfs, encode);
            try
            {
                dictdata firstitem = new dictdata(sr.ReadLine(), 0);
                if (firstitem.eng == english.ToLower())
                {
                    // Yes, the first line is the one we want.
                    return ReadData(lastopenfs, 0, encode, maxcount);
                }
            }
            finally
            {
                sr.DiscardBufferedData();
            }
        }

        return SeekData(lastopenfs, english.ToLower(), encode, maxcount);
    }
    else
        return new List<dictdata>();
}

We can of course also support Chinese-to-English; it's simple and brute-force:

public List<dictdata> ChineseToEnglish(string chinese)
{
    var result = new List<dictdata>();
    List<string> libfiles = new List<string>(Directory.GetFiles(libpath, "*.lib"));
    libfiles.Sort();
    foreach (string libfile in libfiles)
    {
        using (FileStream fs = File.OpenRead(libfile))
        using (StreamReader sr = new StreamReader(fs, encode))
        {
            string linedata;
            while ((linedata = sr.ReadLine()) != null)
            {
                var dictitem = new dictdata(linedata, 0);
                if (dictitem.chinese.IndexOf(chinese) >= 0)
                {
                    // found.
                    result.Add(dictitem);
                }
            }
        }
    }
    return result;
}

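If there is enough memory (loading the entire dictionary takes roughly 10MB),
searching directly in memory makes the whole program much simpler…
So we can design a common interface that lets callers switch easily between in-memory search
and file search.

/// <summary>
/// Dictionary interface.
/// </summary>
public interface IDict : IDisposable
{
    List<dictdata> EnglishToChinese(string english, int maxcount);
    List<dictdata> ChineseToEnglish(string chinese);
}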

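The file-search implementation then looks like this:

/// <summary>
/// Uses pydict's dictionary files without loading them into memory; searches directly in the files.
/// </summary>
public class dict : IDict
{
    /// <summary>
    /// dict lib path
    /// </summary>
    private string libpath;

    /// <summary>
    /// Name of the last opened file, kept for speed.
    /// </summary>
    private string lastopenfile;

    /// <summary>
    /// The last opened FileStream, kept for speed.
    /// </summary>
    private FileStream lastopenfs;

    /// <summary>
    /// Text encoding.
    /// </summary>
    private static Encoding encode = Encoding.GetEncoding("Big5");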

    public dict()
    {
        libpath =
            Path.Combine(
            System.IO.Path.GetDirectoryName(System.Reflection.Assembly.GetExecutingAssembly().GetName().CodeBase),
            "lib");
    }

    // Omitted... the body is the code shown above...

    #region IDisposable members
    public void Dispose()
    {
        if (lastopenfs != null)
        {
            lastopenfs.Dispose();
            lastopenfile = null;
            lastopenfs = null;

        }
    }
    #endregion

}

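And here is the whole in-memory dictionary (isn't it much simpler than searching directly in the file!):

/// <summary>
/// Loads pydict's dictionary files into memory and searches there.
/// </summary>
public class memdict : IDict
{
    /// <summary>
    /// dict lib path
    /// </summary>
    private string libpath;

    /// <summary>
    /// The full, sorted dictionary contents.
    /// </summary>
    private List<dictdata> allsorteddict;

    private static Encoding encode = Encoding.GetEncoding("Big5");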

    public memdict()
    {
        libpath =
            Path.Combine(
            System.IO.Path.GetDirectoryName(System.Reflection.Assembly.GetExecutingAssembly().GetName().CodeBase),
            "lib");

        List<string> libfiles = new List<string>(Directory.GetFiles(libpath, "*.lib"));
        libfiles.Sort();
        allsorteddict = new List<dictdata>();
        foreach (string libfile in libfiles)
        {
            using (FileStream fs = File.OpenRead(libfile))
            using (StreamReader sr = new StreamReader(fs, encode))
            {
                string linedata;
                while ((linedata = sr.ReadLine()) != null)
                {
                    allsorteddict.Add(new dictdata(linedata, 0));
                }
            }
        }
    }

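    /// <summary>
    /// Look up an English word; return at most maxcount dictionary records.
    /// </summary>
    /// <param name="english">English word to look up.</param>
    /// <param name="maxcount">Maximum number of records to return.</param>
    /// <returns>Records starting at the closest match.</returns>
    public List<dictdata> EnglishToChinese(string english, int maxcount)
    {
        if (english.Length == 0)
            return new List<dictdata>();

        int idx = allsorteddict.BinarySearch(new dictdata(english.ToLower() + "=", 0),
            new ComparisonComparer<dictdata>((x,y) => x.eng.CompareTo(y.eng)));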

        bool isfound = (idx >= 0);

        if (isfound == false)
            idx = ~idx;

        var result = new List<dictdata>();
        for (int i = idx; (i < allsorteddict.Count) && (i < (idx + maxcount)); i++)
        {
            result.Add(allsorteddict[i]);
        }
        return result;
    }

    public List<dictdata> ChineseToEnglish(string chinese)
    {
        var result = new List<dictdata>();
        foreach (var dictitem in allsorteddict)
        {
            if (dictitem.chinese.IndexOf(chinese) >= 0)
            {
                // found.
                result.Add(dictitem);
            }
        }
        return result;
    }

    #region IDisposable members

    public void Dispose()
    {
    }

    #endregion
}

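Oh, and we also need a small helper that wraps a Comparison delegate in an object implementing IComparer:

/// <summary>
/// Wraps a Comparison delegate in an object that implements IComparer.
/// </summary>
/// <typeparam name="T">Type of the items being compared.</typeparam>
public sealed class ComparisonComparer<T> : IComparer<T>
{
    private readonly Comparison<T> comparison;

    public ComparisonComparer(Comparison<T> comparison)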
    {
        this.comparison = comparison;
    }

    public int Compare(T x, T y)
    {
        return comparison(x, y);
    }
}
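
The `idx = ~idx` step in `memdict.EnglishToChinese` relies on a documented property of `List<T>.BinarySearch`: on a miss it returns a negative number that is the bitwise complement of the index of the next larger element (or of `Count` if there is none). A minimal sketch of that behavior on plain strings follows; `NearestMatchDemo` and `From` are hypothetical names, and `StringComparer.Ordinal` stands in for the `ComparisonComparer` used in this post.

```csharp
using System;
using System.Collections.Generic;

public static class NearestMatchDemo
{
    // Returns up to maxCount items starting at key, or at the next larger
    // item when key itself is not in the (already sorted) list.
    public static List<string> From(List<string> sorted, string key, int maxCount)
    {
        int idx = sorted.BinarySearch(key, StringComparer.Ordinal);
        if (idx < 0)
            idx = ~idx;   // complement gives the insertion point on a miss

        List<string> result = new List<string>();
        for (int i = idx; i < sorted.Count && result.Count < maxCount; i++)
            result.Add(sorted[i]);
        return result;
    }
}
```

With the sorted list `apple, banana, cherry`, asking for `"b"` with `maxCount` 2 returns `banana, cherry`, which is the "closest entries" behavior the instant search depends on.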

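Yes, as usual, once the functional code is written,
it's time to drag and drop the UI.

As you can see, the design is very simple. You type into the TextBox at the top.
For English-to-Chinese, binary search is fast (for both file and in-memory search),
so we can search as you type, which is hard to do over a network.
After switching to Chinese-to-English, the lookup is brute-force and takes a while, so searching as you type is not possible
and you have to use the search button on the right.

The English-to-Chinese search lives in the TextChanged handler of the top TextBox: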

private void textBox1_TextChanged(object sender, EventArgs e)
{
    // Chinese-to-English cannot search as you type.
    if (IsEnglishToChinese == false)
        return;

    int maxcount = 20;
    var dicitems = dic.EnglishToChinese(textBox1.Text, maxcount);
    updatelist(dicitems,
        (dicitems.Count > 0) ?
        (String.Compare(dicitems[0].eng, textBox1.Text, true) == 0) : false);
}

private void updatelist(List<dictdata> dictdatas, bool shoulddisplayfirst)
{
    listBox1.BeginUpdate();
    try
    {
        listBox1.Items.Clear();

        foreach (var ditem in dictdatas)
            listBox1.Items.Add(ditem);

        if ((dictdatas.Count > 0) && (shoulddisplayfirst == true))
        {
            listBox1.SelectedIndex = 0;
            textBox2.Text = dictdatas[0].displaysoundmarkandchinese;
        }
        else
            textBox2.Text = string.Empty;
    }
    finally
    {
        listBox1.EndUpdate();
    }
}

Then, when the user picks an entry from the candidate list on the left, we show the Chinese content on the right:

private void listBox1_SelectedIndexChanged(object sender, EventArgs e)
{
    dictdata dict = listBox1.SelectedItem as dictdata;
    if (dict != null)
        textBox2.Text = dict.displaysoundmarkandchinese;
    else
        textBox2.Text = string.Empty;
}

The Chinese lookup runs when the Search button is pressed, and the UI also needs to show a waiting state:

private void button1_Click(object sender, EventArgs e)
{
    // English-to-Chinese already searches as you type; no need to search again.
    if (IsEnglishToChinese == true)
        return;

    Cursor.Current = Cursors.WaitCursor;
    try
    {
        var result = dic.ChineseToEnglish(textBox1.Text);
        updatelist(result, true);
    }
    finally
    {
        Cursor.Current = Cursors.Default;
    }
}

Finally, the code that switches between Chinese-to-English and English-to-Chinese:

private void menuItem4_Click(object sender, EventArgs e)
{
    updateEnglishToChineseStatus(false);
}

private void updateEnglishToChineseStatus(bool isengtochinese)
{
    IsEnglishToChinese = isengtochinese;
    menuItem4.Checked = !IsEnglishToChinese;
    menuItem5.Checked = IsEnglishToChinese;
    this.Text = IsEnglishToChinese ? "英翻中" : "中翻英";
}

private void menuItem5_Click(object sender, EventArgs e)
{
    updateEnglishToChineseStatus(true);
}

Because both classes implement the same interface, switching to in-memory search is easy:

private void menuItem3_Click(object sender, EventArgs e)
{
    if (dic is memdict)
        return; // already loaded into memory
    dic.Dispose();

    Cursor.Current = Cursors.WaitCursor;
    try
    {
        dic = new memdict();
    }
    finally
    {
        Cursor.Current = Cursors.Default;
    }
}
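
The swap above can be sketched on its own, independent of the dictionary classes. All names here (`IWordLookup`, `FileLookup`, `MemoryLookup`, `LookupHost`) are hypothetical stand-ins for `IDict`, `dict`, `memdict`, and the form. One deliberate difference from the handler above, offered as a design option rather than a correction: the sketch builds the replacement first, so a failed load would leave the old instance usable.

```csharp
using System;

public interface IWordLookup : IDisposable
{
    string Name { get; }
}

public sealed class FileLookup : IWordLookup
{
    public bool Disposed { get; private set; }
    public string Name { get { return "file"; } }
    public void Dispose() { Disposed = true; }
}

public sealed class MemoryLookup : IWordLookup
{
    public string Name { get { return "memory"; } }
    public void Dispose() { }
}

public sealed class LookupHost
{
    public IWordLookup Current { get; private set; }

    public LookupHost()
    {
        Current = new FileLookup();   // start with the low-memory implementation
    }

    public void SwitchToMemory()
    {
        if (Current is MemoryLookup)
            return;                   // already in memory, nothing to do

        // Build the replacement before disposing the old one.
        IWordLookup fresh = new MemoryLookup();
        Current.Dispose();
        Current = fresh;
    }
}
```

Callers only ever see `Current` through the interface, so nothing else in the UI has to change when the implementation is swapped.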

Sample screenshots:

…

Yes, the plan is to keep heading toward examples~~~

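The source archive can't be uploaded if it includes all the dictionary files..
I kept only the dictionary file for a; anyone interested can add the rest themselves: wm6dict.zip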

Original post: 大专栏 [Windows Mobile .NET CF] 中英字典 – Day9
