DICT.TW Online Dictionary

DICT@FreeBSD: Compiling Dictionary Databases


Chienwen, DICT.TW

2006.02.07
Updated: 2007.04.04

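    Prerequisites:

  1. You must know how to use a computer, including how to turn it on and off.

  2. You have already installed FreeBSD, and the ports tree is up to date.

  3. You have already installed the Apache HTTP server, and it can run CGI programs.

  4. You have already installed the DICT server and client. (See: DICT@FreeBSD Setup Procedure)

  5. You must be familiar with the Perl programming language (or another language with equivalent capabilities) in order to convert file formats.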


    Installation:

  1. Install dictfmt, the dictionary database compiler, from ports:

    # cd /usr/ports/textproc/dictfmt && make install
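
A quick way to confirm that the installation worked is to check that the commands used later in this guide can be found on PATH. The following is a minimal sketch; `missing_tools` is a hypothetical helper name, not something the ports provide.

```shell
#!/bin/sh
# Sketch: confirm the tools used later in this guide are on PATH.
# missing_tools is a hypothetical helper name.
missing_tools() {
    missing=""
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
    done
    # Print the missing names (if any) without the leading space.
    printf '%s\n' "${missing# }"
    [ -z "$missing" ]
}

missing_tools dictfmt dictzip perl || echo "install the tools listed above first"
```

The function returns success only when every named command is found, so it can also guard the start of a build script.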


    File format concepts:

  1. dictfmt accepts the following seven predefined input formats:
    FORMATTING OPTIONS

    -c5 FILE is formatted with headwords preceded by 5 or more underscore characters (_) and a blank line. All text until the next headword is considered the definition. Any leading `@' characters are stripped out, but the file is otherwise unchanged. This option was written to format the CIA WORLD FACTBOOK 1995.
     
    -t -c5, --without-info and --without-headword options are implied. Use this option, if an input database comes from dictunformat utility.
     
    -e FILE is in html format, with the headword tagged as bold. (<B>headword - </B>)
    This option was written to format EASTON'S 1897 BIBLE DICTIONARY. A typical entry from Easton is:

    <A NAME="T0000005">
    <B>Abagtha - </B>
    one of the seven eunuchs in Ahasuerus's court (Esther 1:10; 2:21).

    This is converted to:
    Abagtha
       one of the seven eunuchs in Ahasuerus's court (Esther 1:10; 2:21).

    The heading "<A NAME="T0000005">" is omitted, and the headword `Abagtha' is indexed.

    NOTE: This option should be used with caution. It removes several html tags (enough to format Easton properly), but not all. The Makefile that was originally written to format dict-easton uses sed scripts to modify certain cross reference tags. It may be necessary to pipe the input file through a sed script, or hack the source of dictfmt in order to properly format other html databases.
     
    -f FILE is formatted with the headwords starting in column 0, with the definition indented at least one space (or tab character) on subsequent lines. The third line starting in column 0 is taken as the first headword, and the first two lines starting in column 0 are treated as part of the 00-database-info header. This option was written to format the F.O.L.D.O.C.
     
    -h FILE is formatted with the headwords starting in column 0, followed by a comma, with the definition continuing on the same line. All text before the first single character line is included in 00-database-info header, and lines with only one character are omitted from the .dict file. The first headword is on the line following the first single character line. The headword is indexed; the text of the file is not changed. This option was written to format HITCHCOCK'S BIBLE NAMES DICTIONARY.
     
    -j FILE is formatted with headwords starting in col 0, enclosed in colons, followed by the definition. The colons surrounding the headword are removed, and the headword is indexed. Lines beginning with '*', '=', or '-' are also removed. All text before the first headword is included in the headers. This option was written to format the JARGON FILE.

    NOTE: Some recent versions of the JARGON FILE had three blanks inserted before the first colon at each headword. These must be removed before processing with dictfmt. (sed scripts have been used for this purpose. ed, awk, or perl scripts are also possible.)
     
    -p FILE is formatted with `%h' in column 0, followed by a blank, followed by the headword, optionally followed by a line containing `%d' in column 0. The definition starts on the following line. The first line beginning '%h' and any lines beginning '%d' are stripped from the .dict file, and '%h ' is stripped from in front of the headword. All text before the first headword is included in the headers. The second line beginning '%h' is taken as the first headword. This option was written to format Jay Kominek's elements database.
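
To make the -p layout concrete, here is a minimal sketch that turns tab-separated `headword<TAB>definition` lines into `%h`/`%d` records. The helper name `tsv_to_p` and the tab-separated input are assumptions made for illustration; the first `%h` record carries the header text, in the same way as the Perl script later in this guide.

```shell
#!/bin/sh
# Sketch: emit dictfmt -p input from "headword<TAB>definition" lines.
# tsv_to_p is a hypothetical helper name; the TSV layout is an assumption.
tsv_to_p() {
    # The first %h record holds the header text, not a real headword.
    printf '%%h 00-database-info\n%%d\n%s\n' "$1"
    # Every later record: "%h headword", a "%d" line, then the definition.
    awk -F '\t' 'NF >= 2 { printf "%%h %s\n%%d\n%s\n", $1, $2 }'
}

printf 'apple\ta fruit\nbook\tbound pages\n' | tsv_to_p 'Sample database'
```

The output can be redirected to a file and passed to dictfmt with the -p option.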
     
    For more information about dictfmt, see man dictfmt.
    If the source data is in a different format, convert it with Perl.
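
Small clean-ups do not always need a full Perl program. The NOTE under -j above mentions three blanks inserted before the first colon of each headword. A sed filter along the following lines could remove them. This is only a sketch: `strip_jargon_indent` is a hypothetical name, and it assumes that no definition line begins with exactly three blanks followed by a colon.

```shell
#!/bin/sh
# Sketch: drop three leading blanks in front of a ":headword:" line.
# strip_jargon_indent is a hypothetical helper name.
strip_jargon_indent() {
    # Only lines that begin with three blanks and a colon are changed.
    sed 's/^   :/:/'
}

printf '   :foo: n. A sample headword.\n   continuation text\n' | strip_jargon_indent
```

The filtered text can then be piped straight into dictfmt -j.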


    Perl example:

  1. Taking the 天主教英漢袖珍辭典 (Catholic English-Chinese Pocket Dictionary) as an example, we can use Perl to convert the HTML data into dictfmt (-p) format.

  2. Purpose of the program:

    Convert this HTML:
    <p class="style9"><span class="style11">AAS </span>:教廷公報;宗座公報。全名是 Acta Apostolicae Sedis 。 </p>
    into dictfmt format:
    %h AAS
    %d
    <b>AAS</b>
    教廷公報; 宗座公報。 全名是 Acta Apostolicae Sedis 。

  3. Source code: (file name ./catholic.pl)

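    #!/usr/bin/perl
    # 2006.10.14
    # Converts the pages in ./download into dictfmt -p input (catholic.txt).
    # The data is handled as Big5 bytes (make.sh passes --locale zh_TW.Big5),
    # so this script, with its full-width punctuation literals, is assumed
    # to be saved as Big5 too.

    use strict ;

    &journal() ;

    ####

    sub journal {

        my @all_option ;
        my @file_array = qw /a b c d e f g h i j k l m n o p q r s t u v w xyz/ ;

        # Collect the lines of every downloaded page.
        foreach my $file (@file_array) {
            open ( my $in, '<', "download/$file.htm" ) or die "Cannot open download/$file.htm: $!" ;
            push @all_option, <$in> ;
            close ($in) ;
        }

        open ( my $txt, '>', "catholic.txt" ) or die "Cannot open catholic.txt: $!" ;

        # The first %h record supplies the 00-database-info header.
        # The heredoc is single-quoted so that nothing in it is interpolated.
    print {$txt} <<'__END' ;
    %h 00-database-info
    %d
    天主教英漢袖珍辭典 - 由主徒會恒毅月刊社出版, 2001 年元旦。
    來源: http://stteresa.catholic.org.hk/website/catechumenate/dictionary/
    __END

        foreach my $data (@all_option) {

            # An entry looks like:
            # <p class="..."><span class="...">HEADWORD </span>:DEFINITION </p>
            if ( $data =~ /<p class=".*?"><span class=".*?">(.*?)\s*<\/span>\s*:(.*?)\s*<\/p>/ )
            {
                my $head = $1 ;
                my $def = $2 ;

                # Replace full-width punctuation with ASCII plus a space.
                $def =~ s/;/; /g ;
                $def =~ s/:/: /g ;
                $def =~ s/,/, /g ;
                $def =~ s/(/ (/g ;
                # Full-width ")" is bytes A1 5E in Big5; 0x5E is "^", a regex
                # metacharacter, so the bytes are spelled out.
                $def =~ s/\xA1\x5E/) /g ;
                $def =~ s/?/? /g ;
                $def =~ s/!/! /g ;
                $def =~ s/。/。 /g ;

                $def = "<b>" . $head . "</b>\n" . $def ;

                print {$txt} "%h $head\n%d\n$def\n" ;
            }
        }
        close ($txt) ;
    }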
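
Before compiling, it may be worth checking that the generated file has the shape the -p format expects. The script above writes one `%d` line after every `%h` line, so the two counts should match. The sketch below reports both counts; `check_p_file` is a hypothetical helper and not part of dictfmt. For illustration it runs on a small temporary file rather than on catholic.txt.

```shell
#!/bin/sh
# Sketch: report how many %h and %d lines a dictfmt -p input file has.
# check_p_file is a hypothetical helper, not part of dictfmt.
check_p_file() {
    awk '
        /^%h / { h++ }
        /^%d$/ { d++ }
        END {
            printf "%d headword lines, %d definition markers\n", h, d
            # Succeed only if there is at least one record and counts match.
            exit !(h > 0 && h == d)
        }
    ' "$1"
}

# Demonstration on a small temporary file.
tmp=$(mktemp)
printf '%%h 00-database-info\n%%d\ninfo\n%%h AAS\n%%d\ntext\n' > "$tmp"
check_p_file "$tmp"
rm -f "$tmp"
```

Against the real file the call would be `check_p_file catholic.txt`.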


    Compiling the data:

  1. Create the shell script: (file name ./make.sh)

    #!/bin/sh
    # 2006.10.14

    # Stop at the first failing command.
    set -e

    perl catholic.pl

    dictfmt --locale zh_TW.Big5 --allchars -p -u http://jesus.tw \
        --columns 80 --without-headword \
        -s "Catholic DICT" \
        catholic < catholic.txt

    dictzip catholic.dict

  2. Download the web pages of the 天主教英漢袖珍辭典 ( A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, XYZ ) into the ./download directory.

  3. Run:

    # sh make.sh
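
After the run, the working directory should contain catholic.index and catholic.dict.dz, the two files copied in the next section. The following sketch checks that both exist and are not empty; `check_outputs` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: confirm that the compile step left non-empty output files.
# check_outputs is a hypothetical helper; pass the database base name.
check_outputs() {
    base=$1
    status=0
    for f in "$base.index" "$base.dict.dz"; do
        if [ -s "$f" ]; then
            echo "ok: $f"
        else
            echo "missing or empty: $f"
            status=1
        fi
    done
    return "$status"
}

check_outputs catholic || echo "re-run make.sh and read its error output"
```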


    Installing the database:

  1. Run:

    # cp catholic.dict.dz /usr/local/lib/dict/
    # cp catholic.index /usr/local/lib/dict/

  2. Edit /usr/local/etc/dictd.conf and add these settings:

    database catholic  { data "/usr/local/lib/dict/catholic.dict.dz"
                         index "/usr/local/lib/dict/catholic.index" }

  3. Restart dictd:

    # /usr/local/etc/rc.d/dictd.sh restart
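
The dictd.conf entry above follows a fixed pattern, so when further dictionaries are added it could be generated rather than typed by hand. This is a sketch only: `dictd_entry` is a hypothetical helper, and its default directory is simply the one used in this guide.

```shell
#!/bin/sh
# Sketch: print a dictd.conf database entry for a compiled dictionary.
# dictd_entry is a hypothetical helper; the default directory is the one
# used in this guide.
dictd_entry() {
    name=$1
    dir=${2:-/usr/local/lib/dict}
    printf 'database %s { data "%s/%s.dict.dz"\n' "$name" "$dir" "$name"
    printf '  index "%s/%s.index" }\n' "$dir" "$name"
}

dictd_entry catholic
```

Once dictd has restarted, a lookup such as `dict -h localhost -d catholic AAS` should return the new entry, assuming the server runs on the same machine.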


