[Libreoffice] Script to find german comments in the code
Miklos Vajna
vmiklos at frugalware.org
Sun Oct 24 05:25:26 PDT 2010
Hi,
Last night, before we went home, Jonas and I wrote a small script that
tries to find German comments in the LibreOffice source code (tracked
files with an hxx or cxx extension).
I'm attaching it as a patch; I think it may make sense to have it in the
scratch/ directory of build.git.
What do you think about it?
A few more details:
- It's a Python script (for quick prototyping), and it uses the original
text_cat Perl script to guess the language.
- It does not use OOo's bundled libtextcat, as we only want to choose
between English and German here, not among several languages.
- It may still have bugs, though I ran it on the starmath module and
manually checked the result. I also ran it on the sw module and read
the output at random places, and it seems to produce no false
positives at the moment.
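To illustrate the points above, here is a much simplified sketch of two of
the steps: grouping consecutive //-only lines into one comment, and skipping
comments that are too short to classify. It is written for Python 3, whereas
the attached script uses Python 2 syntax. The 32-character and 4-word
thresholds are the ones used by is_german() in the patch; the names
line_comments and long_enough are made up for this example, and the call to
text_cat is left out entirely.

```python
import re

# Thresholds taken from is_german() in the attached patch.
MIN_CHARS = 32
MIN_WORDS = 4


def line_comments(source):
    """Yield (first_line_number, text) for each run of consecutive //-only lines.

    Simplified on purpose: comments trailing after code and /* */ blocks
    are ignored here, unlike in the real script.
    """
    buf, start = [], None
    # The extra empty line flushes a comment group that ends the file.
    for num, line in enumerate(source.splitlines() + [""], 1):
        m = re.match(r"\s*//\s?(.*)", line)
        if m:
            if start is None:
                start = num
            buf.append(m.group(1).strip())
        elif buf:
            yield start, " ".join(buf)
            buf, start = [], None


def long_enough(text):
    """Return False for comments too short to be classified reliably."""
    return len(text) >= MIN_CHARS and len(text.split()) >= MIN_WORDS
```

Only the comments for which long_enough() returns True would then be handed
to the language guesser.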
A possible future use is to run the script from cron periodically and
publish the results on a web page, so that translators don't have to
run it themselves.
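For the cron idea, a minimal sketch of the publishing step could look like
the following (Python 3). It assumes the script's output has been captured
as a string and relies on the "path:line: comment" format that check_file()
prints, with continuation lines of a multi-line comment starting with a
space. The render_report name and the HTML layout are only illustrative
assumptions and are not part of the patch.

```python
import html
import re

# Output format taken from check_file() in the patch: "path:line: comment".
HIT = re.compile(r"^(?P<path>[^:]+):(?P<line>\d+): (?P<text>.*)$")
CLOSE = "</li>"


def render_report(output):
    """Turn the script's stdout into a minimal HTML list (hypothetical layout)."""
    items = []
    for row in output.splitlines():
        m = HIT.match(row)
        if m:
            items.append("<li><code>%s:%s</code> %s%s" % (
                html.escape(m.group("path")), m.group("line"),
                html.escape(m.group("text")), CLOSE))
        elif items and row.strip():
            # Continuation line of a multi-line comment: append to the last item.
            items[-1] = (items[-1][:-len(CLOSE)] + " "
                         + html.escape(row.strip()) + CLOSE)
    return "<ul>\n%s\n</ul>" % "\n".join(items)
```

A cron job could then write the returned string to a file served by the web
server.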
Patch attached - is it OK to push it? :)
Thanks.
-------------- next part --------------
From c4d279ffd089842a728e6d8cbde9f8936d37d4dc Mon Sep 17 00:00:00 2001
From: Miklos Vajna <vmiklos at frugalware.org>
Date: Sat, 23 Oct 2010 18:13:00 +0200
Subject: [PATCH] find-german-comments: simple hack to find german comments in the source code
---
scratch/german-comments/find-german-comments.py | 162 ++++++++
scratch/german-comments/t/test.cxx | 59 +++
scratch/german-comments/text_cat/COPYING | 504 +++++++++++++++++++++++
scratch/german-comments/text_cat/Copyright | 21 +
scratch/german-comments/text_cat/LM/english.lm | 400 ++++++++++++++++++
scratch/german-comments/text_cat/LM/german.lm | 400 ++++++++++++++++++
scratch/german-comments/text_cat/text_cat | 229 ++++++++++
scratch/german-comments/text_cat/version | 2 +
8 files changed, 1777 insertions(+), 0 deletions(-)
create mode 100755 scratch/german-comments/find-german-comments.py
create mode 100644 scratch/german-comments/t/bogus.fxx
create mode 100644 scratch/german-comments/t/test.cxx
create mode 100644 scratch/german-comments/text_cat/COPYING
create mode 100644 scratch/german-comments/text_cat/Copyright
create mode 100644 scratch/german-comments/text_cat/LM/english.lm
create mode 100644 scratch/german-comments/text_cat/LM/german.lm
create mode 100755 scratch/german-comments/text_cat/text_cat
create mode 100644 scratch/german-comments/text_cat/version
diff --git a/scratch/german-comments/find-german-comments.py b/scratch/german-comments/find-german-comments.py
new file mode 100755
index 0000000..1538c6d
--- /dev/null
+++ b/scratch/german-comments/find-german-comments.py
@@ -0,0 +1,162 @@
+#!/usr/bin/env python
+########################################################################
+#
+# Copyright (c) 2010 Jonas Jensen, Miklos Vajna
+#
+# Permission is hereby granted, free of charge, to any person
+# obtaining a copy of this software and associated documentation
+# files (the "Software"), to deal in the Software without
+# restriction, including without limitation the rights to use,
+# copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the
+# Software is furnished to do so, subject to the following
+# conditions:
+#
+# The above copyright notice and this permission notice shall be
+# included in all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
+# OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
+# HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
+# WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+# OTHER DEALINGS IN THE SOFTWARE.
+#
+########################################################################
+
+
+import sys, re, subprocess, os, optparse, string
+
+class Parser:
+    """
+    This parser extracts comments from source files, tries to guess
+    their language and then prints out the german ones.
+    """
+    def __init__(self):
+        self.strip = string.punctuation + " \n"
+        op = optparse.OptionParser()
+        op.set_usage("%prog [options] <rootdir>\n\n" +
+            "Searches for german comments in cxx/hxx source files inside a given root\n" +
+            "directory recursively.")
+        op.add_option("-v", "--verbose", action="store_true", dest="verbose", default=False,
+            help="Turn on verbose mode (print progress to stderr)")
+        self.options, args = op.parse_args()
+        try:
+            dir = args[0]
+        except IndexError:
+            dir = "."
+        self.check_source_files(dir)
+
+    def get_comments(self, filename):
+        """
+        Extracts the source code comments.
+        """
+        linenum = 0
+        if self.options.verbose:
+            sys.stderr.write("processing file '%s'...\n" % filename)
+        sock = open(filename)
+        # add an empty line to trigger the output of collected oneliner
+        # comment group
+        lines = sock.readlines() + ["\n"]
+        sock.close()
+
+        in_comment = False
+        buf = []
+        count = 1
+        for i in lines:
+            if "//" in i and not in_comment:
+                # if we find a new //-style comment, then we
+                # just append it to a previous one if: there is
+                # only whitespace before the // mark that is
+                # necessary to make comments longer, giving
+                # more reliable output
+                if not len(re.sub("(.*)//.*", r"\1", i).strip(self.strip)):
+                    s = re.sub(".*// ?", "", i).strip(self.strip)
+                    if len(s):
+                        buf.append(s)
+                else:
+                    # otherwise it's an independent //-style comment in the next line
+                    yield (count, "\n ".join(buf))
+                    buf = [re.sub(".*// ?", "", i.strip(self.strip))]
+            elif "//" not in i and not in_comment and len(buf) > 0:
+                # first normal line after a // block
+                yield (count, "\n ".join(buf))
+                buf = []
+            elif "/*" in i and "*/" not in i and not in_comment:
+                # start of a real multiline comment
+                in_comment = True
+                linenum = count
+                s = re.sub(".*/\*+", "", i.strip(self.strip))
+                if len(s):
+                    buf.append(s.strip(self.strip))
+            elif in_comment and not "*/" in i:
+                # in multiline comment
+                s = re.sub("^( |\|)*\*?", "", i)
+                if len(s.strip(self.strip)):
+                    buf.append(s.strip(self.strip))
+            elif "*/" in i and in_comment:
+                # end of multiline comment
+                in_comment = False
+                s = re.sub(r"\*+/.*", "", i.strip(self.strip))
+                if len(s):
+                    buf.append(s)
+                yield (count, "\n ".join(buf))
+                buf = []
+            elif "/*" in i and "*/" in i:
+                # c-style oneliner comment
+                yield (count, re.sub(".*/\*(.*)\*/.*", r"\1", i).strip(self.strip))
+            count += 1
+
+    def get_lang(self, s):
+        """ the output is 'german' or 'english' or 'german or english'. when
+        unsure, just don't warn, there are strings where you just can't
+        determine the results reliably, like '#110680#' """
+        cwd = os.getcwd()
+        # change to our directory
+        os.chdir(os.path.split(os.path.abspath(sys.argv[0]))[0])
+        sock = subprocess.Popen(["text_cat/text_cat", "-d", "text_cat/LM"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
+        sock.stdin.write(s)
+        sock.stdin.close()
+        lang = sock.stdout.read().strip()
+        sock.stdout.close()
+        os.chdir(cwd)
+        return lang
+
+    def is_german(self, s):
+        """
+        determines if a string is german or not
+        """
+        # for short strings we can't do reliable recognition, so skip
+        # short strings and less than 4 words
+        s = s.replace('\n', ' ')
+        if len(s) < 32 or len(s.split()) < 4:
+            return False
+        return "german" == self.get_lang(s)
+
+    def check_file(self, path):
+        """
+        checks each comment in a file
+        """
+        for linenum, s in self.get_comments(path):
+            if self.is_german(s):
+                print "%s:%s: %s" % (path, linenum, s)
+
+    def check_source_files(self, dir):
+        """
+        checks each _tracked_ file in a directory recursively
+        """
+        sock = os.popen(r"git ls-files '%s' |egrep '\.(c|h)xx$'" % dir)
+        lines = sock.readlines()
+        sock.close()
+        for path in lines:
+            self.check_file(path.strip())
+
+try:
+    Parser()
+except KeyboardInterrupt:
+    print "Interrupted!"
+    sys.exit(0)
+
+# vim:set shiftwidth=4 softtabstop=4 expandtab:
diff --git a/scratch/german-comments/t/bogus.fxx b/scratch/german-comments/t/bogus.fxx
new file mode 100644
index 0000000..e69de29
diff --git a/scratch/german-comments/t/test.cxx b/scratch/german-comments/t/test.cxx
new file mode 100644
index 0000000..9f0b4eb
--- /dev/null
+++ b/scratch/german-comments/t/test.cxx
@@ -0,0 +1,59 @@
+before_comment();
+foo(); // single line comment
+//single line comment 2
+after_comment();
+
+/*
+ * If there was some unconverted bytes on the last cycle then they
+ */
+
+not_a_comment();
+/*
+ * dann mal einen harten Seitenumbruch einfuegen
+ */
+
+not_a_comment();
+
+/**************************************************************************/
+
+// #110680#
+
+ DEFAULTFONT_SERIF, // FNT_VARIABLE
+
+// Set base URI
+
+/*************************************************************************
+|*
+|* Deinitialisierung
+|*
+\************************************************************************/
+
+/*************************************************************************
+|*
+|* Deinitialising
+|*
+\************************************************************************/
+
+/* dann mal einen harten Seitenumbruch einfuegen */
+
+ // used to convert the 4 special ExtraProg/UINames for
+ // RES_POOLCOLL_LABEL_DRAWING, RES_POOLCOLL_LABEL_ABB,
+ // RES_POOLCOLL_LABEL_TABLE, RES_POOLCOLL_LABEL_FRAME
+ // forth and back.
+ // Non-matching names remain unchanged.
+
+ bCntntCheck( FALSE ), // --> FME 2005-05-13 #i43742# <--
+ bInFrontOfLabel( FALSE ), // #i27615#
+ bInNumPortion(FALSE), // #i23726#
+ nInNumPostionOffset(0) // #i23726#
+
+////////////////////////////////////////////////////////////
+//
+// SmFontPickListBox
+//
+
+// --------------------
+// SwGrfNode
+// --------------------
+
+//#define WID_??? 1024
diff --git a/scratch/german-comments/text_cat/COPYING b/scratch/german-comments/text_cat/COPYING
new file mode 100644
index 0000000..5ab7695
--- /dev/null
+++ b/scratch/german-comments/text_cat/COPYING
@@ -0,0 +1,504 @@
+ GNU LESSER GENERAL PUBLIC LICENSE
+ Version 2.1, February 1999
+
+ Copyright (C) 1991, 1999 Free Software Foundation, Inc.
+ 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+[This is the first released version of the Lesser GPL. It also counts
+ as the successor of the GNU Library Public License, version 2, hence
+ the version number 2.1.]
+
+ Preamble
+
+ The licenses for most software are designed to take away your
+freedom to share and change it. By contrast, the GNU General Public
+Licenses are intended to guarantee your freedom to share and change
+free software--to make sure the software is free for all its users.
+
+ This license, the Lesser General Public License, applies to some
+specially designated software packages--typically libraries--of the
+Free Software Foundation and other authors who decide to use it. You
+can use it too, but we suggest you first think carefully about whether
+this license or the ordinary General Public License is the better
+strategy to use in any particular case, based on the explanations below.
+
+ When we speak of free software, we are referring to freedom of use,
+not price. Our General Public Licenses are designed to make sure that
+you have the freedom to distribute copies of free software (and charge
+for this service if you wish); that you receive source code or can get
+it if you want it; that you can change the software and use pieces of
+it in new free programs; and that you are informed that you can do
+these things.
+
+ To protect your rights, we need to make restrictions that forbid
+distributors to deny you these rights or to ask you to surrender these
+rights. These restrictions translate to certain responsibilities for
+you if you distribute copies of the library or if you modify it.
+
+ For example, if you distribute copies of the library, whether gratis
+or for a fee, you must give the recipients all the rights that we gave
+you. You must make sure that they, too, receive or can get the source
+code. If you link other code with the library, you must provide
+complete object files to the recipients, so that they can relink them
+with the library after making changes to the library and recompiling
+it. And you must show them these terms so they know their rights.
+
+ We protect your rights with a two-step method: (1) we copyright the
+library, and (2) we offer you this license, which gives you legal
+permission to copy, distribute and/or modify the library.
+
+ To protect each distributor, we want to make it very clear that
+there is no warranty for the free library. Also, if the library is
+modified by someone else and passed on, the recipients should know
+that what they have is not the original version, so that the original
+author's reputation will not be affected by problems that might be
+introduced by others.
+
+ Finally, software patents pose a constant threat to the existence of
+any free program. We wish to make sure that a company cannot
+effectively restrict the users of a free program by obtaining a
+restrictive license from a patent holder. Therefore, we insist that
+any patent license obtained for a version of the library must be
+consistent with the full freedom of use specified in this license.
+
+ Most GNU software, including some libraries, is covered by the
+ordinary GNU General Public License. This license, the GNU Lesser
+General Public License, applies to certain designated libraries, and
+is quite different from the ordinary General Public License. We use
+this license for certain libraries in order to permit linking those
+libraries into non-free programs.
+
+ When a program is linked with a library, whether statically or using
+a shared library, the combination of the two is legally speaking a
+combined work, a derivative of the original library. The ordinary
+General Public License therefore permits such linking only if the
+entire combination fits its criteria of freedom. The Lesser General
+Public License permits more lax criteria for linking other code with
+the library.
+
+ We call this license the "Lesser" General Public License because it
+does Less to protect the user's freedom than the ordinary General
+Public License. It also provides other free software developers Less
+of an advantage over competing non-free programs. These disadvantages
+are the reason we use the ordinary General Public License for many
+libraries. However, the Lesser license provides advantages in certain
+special circumstances.
+
+ For example, on rare occasions, there may be a special need to
+encourage the widest possible use of a certain library, so that it becomes
+a de-facto standard. To achieve this, non-free programs must be
+allowed to use the library. A more frequent case is that a free
+library does the same job as widely used non-free libraries. In this
+case, there is little to gain by limiting the free library to free
+software only, so we use the Lesser General Public License.
+
+ In other cases, permission to use a particular library in non-free
+programs enables a greater number of people to use a large body of
+free software. For example, permission to use the GNU C Library in
+non-free programs enables many more people to use the whole GNU
+operating system, as well as its variant, the GNU/Linux operating
+system.
+
+ Although the Lesser General Public License is Less protective of the
+users' freedom, it does ensure that the user of a program that is
+linked with the Library has the freedom and the wherewithal to run
+that program using a modified version of the Library.
+
+ The precise terms and conditions for copying, distribution and
+modification follow. Pay close attention to the difference between a
+"work based on the library" and a "work that uses the library". The
+former contains code derived from the library, whereas the latter must
+be combined with the library in order to run.
+
+ GNU LESSER GENERAL PUBLIC LICENSE
+ TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
+
+ 0. This License Agreement applies to any software library or other
+program which contains a notice placed by the copyright holder or
+other authorized party saying it may be distributed under the terms of
+this Lesser General Public License (also called "this License").
+Each licensee is addressed as "you".
+
+ A "library" means a collection of software functions and/or data
+prepared so as to be conveniently linked with application programs
+(which use some of those functions and data) to form executables.
+
+ The "Library", below, refers to any such software library or work
+which has been distributed under these terms. A "work based on the
+Library" means either the Library or any derivative work under
+copyright law: that is to say, a work containing the Library or a
+portion of it, either verbatim or with modifications and/or translated
+straightforwardly into another language. (Hereinafter, translation is
+included without limitation in the term "modification".)
+
+ "Source code" for a work means the preferred form of the work for
+making modifications to it. For a library, complete source code means
+all the source code for all modules it contains, plus any associated
+interface definition files, plus the scripts used to control compilation
+and installation of the library.
+
+ Activities other than copying, distribution and modification are not
+covered by this License; they are outside its scope. The act of
+running a program using the Library is not restricted, and output from
+such a program is covered only if its contents constitute a work based
+on the Library (independent of the use of the Library in a tool for
+writing it). Whether that is true depends on what the Library does
+and what the program that uses the Library does.
+
+ 1. You may copy and distribute verbatim copies of the Library's
+complete source code as you receive it, in any medium, provided that
+you conspicuously and appropriately publish on each copy an
+appropriate copyright notice and disclaimer of warranty; keep intact
+all the notices that refer to this License and to the absence of any
+warranty; and distribute a copy of this License along with the
+Library.
+
+ You may charge a fee for the physical act of transferring a copy,
+and you may at your option offer warranty protection in exchange for a
+fee.
+
+ 2. You may modify your copy or copies of the Library or any portion
+of it, thus forming a work based on the Library, and copy and
+distribute such modifications or work under the terms of Section 1
+above, provided that you also meet all of these conditions:
+
+ a) The modified work must itself be a software library.
+
+ b) You must cause the files modified to carry prominent notices
+ stating that you changed the files and the date of any change.
+
+ c) You must cause the whole of the work to be licensed at no
+ charge to all third parties under the terms of this License.
+
+ d) If a facility in the modified Library refers to a function or a
+ table of data to be supplied by an application program that uses
+ the facility, other than as an argument passed when the facility
+ is invoked, then you must make a good faith effort to ensure that,
+ in the event an application does not supply such function or
+ table, the facility still operates, and performs whatever part of
+ its purpose remains meaningful.
+
+ (For example, a function in a library to compute square roots has
+ a purpose that is entirely well-defined independent of the
+ application. Therefore, Subsection 2d requires that any
+ application-supplied function or table used by this function must
+ be optional: if the application does not supply it, the square
+ root function must still compute square roots.)
+
+These requirements apply to the modified work as a whole. If
+identifiable sections of that work are not derived from the Library,
+and can be reasonably considered independent and separate works in
+themselves, then this License, and its terms, do not apply to those
+sections when you distribute them as separate works. But when you
+distribute the same sections as part of a whole which is a work based
+on the Library, the distribution of the whole must be on the terms of
+this License, whose permissions for other licensees extend to the
+entire whole, and thus to each and every part regardless of who wrote
+it.
+
+Thus, it is not the intent of this section to claim rights or contest
+your rights to work written entirely by you; rather, the intent is to
+exercise the right to control the distribution of derivative or
+collective works based on the Library.
+
+In addition, mere aggregation of another work not based on the Library
+with the Library (or with a work based on the Library) on a volume of
+a storage or distribution medium does not bring the other work under
+the scope of this License.
+
+ 3. You may opt to apply the terms of the ordinary GNU General Public
+License instead of this License to a given copy of the Library. To do
+this, you must alter all the notices that refer to this License, so
+that they refer to the ordinary GNU General Public License, version 2,
+instead of to this License. (If a newer version than version 2 of the
+ordinary GNU General Public License has appeared, then you can specify
+that version instead if you wish.) Do not make any other change in
+these notices.
+
+ Once this change is made in a given copy, it is irreversible for
+that copy, so the ordinary GNU General Public License applies to all
+subsequent copies and derivative works made from that copy.
+
+ This option is useful when you wish to copy part of the code of
+the Library into a program that is not a library.
+
+ 4. You may copy and distribute the Library (or a portion or
+derivative of it, under Section 2) in object code or executable form
+under the terms of Sections 1 and 2 above provided that you accompany
+it with the complete corresponding machine-readable source code, which
+must be distributed under the terms of Sections 1 and 2 above on a
+medium customarily used for software interchange.
+
+ If distribution of object code is made by offering access to copy
+from a designated place, then offering equivalent access to copy the
+source code from the same place satisfies the requirement to
+distribute the source code, even though third parties are not
+compelled to copy the source along with the object code.
+
+ 5. A program that contains no derivative of any portion of the
+Library, but is designed to work with the Library by being compiled or
+linked with it, is called a "work that uses the Library". Such a
+work, in isolation, is not a derivative work of the Library, and
+therefore falls outside the scope of this License.
+
+ However, linking a "work that uses the Library" with the Library
+creates an executable that is a derivative of the Library (because it
+contains portions of the Library), rather than a "work that uses the
+library". The executable is therefore covered by this License.
+Section 6 states terms for distribution of such executables.
+
+ When a "work that uses the Library" uses material from a header file
+that is part of the Library, the object code for the work may be a
+derivative work of the Library even though the source code is not.
+Whether this is true is especially significant if the work can be
+linked without the Library, or if the work is itself a library. The
+threshold for this to be true is not precisely defined by law.
+
+ If such an object file uses only numerical parameters, data
+structure layouts and accessors, and small macros and small inline
+functions (ten lines or less in length), then the use of the object
+file is unrestricted, regardless of whether it is legally a derivative
+work. (Executables containing this object code plus portions of the
+Library will still fall under Section 6.)
+
+ Otherwise, if the work is a derivative of the Library, you may
+distribute the object code for the work under the terms of Section 6.
+Any executables containing that work also fall under Section 6,
+whether or not they are linked directly with the Library itself.
+
+ 6. As an exception to the Sections above, you may also combine or
+link a "work that uses the Library" with the Library to produce a
+work containing portions of the Library, and distribute that work
+under terms of your choice, provided that the terms permit
+modification of the work for the customer's own use and reverse
+engineering for debugging such modifications.
+
+ You must give prominent notice with each copy of the work that the
+Library is used in it and that the Library and its use are covered by
+this License. You must supply a copy of this License. If the work
+during execution displays copyright notices, you must include the
+copyright notice for the Library among them, as well as a reference
+directing the user to the copy of this License. Also, you must do one
+of these things:
+
+ a) Accompany the work with the complete corresponding
+ machine-readable source code for the Library including whatever
+ changes were used in the work (which must be distributed under
+ Sections 1 and 2 above); and, if the work is an executable linked
+ with the Library, with the complete machine-readable "work that
+ uses the Library", as object code and/or source code, so that the
+ user can modify the Library and then relink to produce a modified
+ executable containing the modified Library. (It is understood
+ that the user who changes the contents of definitions files in the
+ Library will not necessarily be able to recompile the application
+ to use the modified definitions.)
+
+ b) Use a suitable shared library mechanism for linking with the
+ Library. A suitable mechanism is one that (1) uses at run time a
+ copy of the library already present on the user's computer system,
+ rather than copying library functions into the executable, and (2)
+ will operate properly with a modified version of the library, if
+ the user installs one, as long as the modified version is
+ interface-compatible with the version that the work was made with.
+
+ c) Accompany the work with a written offer, valid for at
+ least three years, to give the same user the materials
+ specified in Subsection 6a, above, for a charge no more
+ than the cost of performing this distribution.
+
+ d) If distribution of the work is made by offering access to copy
+ from a designated place, offer equivalent access to copy the above
+ specified materials from the same place.
+
+ e) Verify that the user has already received a copy of these
+ materials or that you have already sent this user a copy.
+
+ For an executable, the required form of the "work that uses the
+Library" must include any data and utility programs needed for
+reproducing the executable from it. However, as a special exception,
+the materials to be distributed need not include anything that is
+normally distributed (in either source or binary form) with the major
+components (compiler, kernel, and so on) of the operating system on
+which the executable runs, unless that component itself accompanies
+the executable.
+
+ It may happen that this requirement contradicts the license
+restrictions of other proprietary libraries that do not normally
+accompany the operating system. Such a contradiction means you cannot
+use both them and the Library together in an executable that you
+distribute.
+
+ 7. You may place library facilities that are a work based on the
+Library side-by-side in a single library together with other library
+facilities not covered by this License, and distribute such a combined
+library, provided that the separate distribution of the work based on
+the Library and of the other library facilities is otherwise
+permitted, and provided that you do these two things:
+
+ a) Accompany the combined library with a copy of the same work
+ based on the Library, uncombined with any other library
+ facilities. This must be distributed under the terms of the
+ Sections above.
+
+ b) Give prominent notice with the combined library of the fact
+ that part of it is a work based on the Library, and explaining
+ where to find the accompanying uncombined form of the same work.
+
+ 8. You may not copy, modify, sublicense, link with, or distribute
+the Library except as expressly provided under this License. Any
+attempt otherwise to copy, modify, sublicense, link with, or
+distribute the Library is void, and will automatically terminate your
+rights under this License. However, parties who have received copies,
+or rights, from you under this License will not have their licenses
+terminated so long as such parties remain in full compliance.
+
+ 9. You are not required to accept this License, since you have not
+signed it. However, nothing else grants you permission to modify or
+distribute the Library or its derivative works. These actions are
+prohibited by law if you do not accept this License. Therefore, by
+modifying or distributing the Library (or any work based on the
+Library), you indicate your acceptance of this License to do so, and
+all its terms and conditions for copying, distributing or modifying
+the Library or works based on it.
+
+ 10. Each time you redistribute the Library (or any work based on the
+Library), the recipient automatically receives a license from the
+original licensor to copy, distribute, link with or modify the Library
+subject to these terms and conditions. You may not impose any further
+restrictions on the recipients' exercise of the rights granted herein.
+You are not responsible for enforcing compliance by third parties with
+this License.
+
+ 11. If, as a consequence of a court judgment or allegation of patent
+infringement or for any other reason (not limited to patent issues),
+conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License. If you cannot
+distribute so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you
+may not distribute the Library at all. For example, if a patent
+license would not permit royalty-free redistribution of the Library by
+all those who receive copies directly or indirectly through you, then
+the only way you could satisfy both it and this License would be to
+refrain entirely from distribution of the Library.
+
+If any portion of this section is held invalid or unenforceable under any
+particular circumstance, the balance of the section is intended to apply,
+and the section as a whole is intended to apply in other circumstances.
+
+It is not the purpose of this section to induce you to infringe any
+patents or other property right claims or to contest validity of any
+such claims; this section has the sole purpose of protecting the
+integrity of the free software distribution system which is
+implemented by public license practices. Many people have made
+generous contributions to the wide range of software distributed
+through that system in reliance on consistent application of that
+system; it is up to the author/donor to decide if he or she is willing
+to distribute software through any other system and a licensee cannot
+impose that choice.
+
+This section is intended to make thoroughly clear what is believed to
+be a consequence of the rest of this License.
+
+ 12. If the distribution and/or use of the Library is restricted in
+certain countries either by patents or by copyrighted interfaces, the
+original copyright holder who places the Library under this License may add
+an explicit geographical distribution limitation excluding those countries,
+so that distribution is permitted only in or among countries not thus
+excluded. In such case, this License incorporates the limitation as if
+written in the body of this License.
+
+ 13. The Free Software Foundation may publish revised and/or new
+versions of the Lesser General Public License from time to time.
+Such new versions will be similar in spirit to the present version,
+but may differ in detail to address new problems or concerns.
+
+Each version is given a distinguishing version number. If the Library
+specifies a version number of this License which applies to it and
+"any later version", you have the option of following the terms and
+conditions either of that version or of any later version published by
+the Free Software Foundation. If the Library does not specify a
+license version number, you may choose any version ever published by
+the Free Software Foundation.
+
+ 14. If you wish to incorporate parts of the Library into other free
+programs whose distribution conditions are incompatible with these,
+write to the author to ask for permission. For software which is
+copyrighted by the Free Software Foundation, write to the Free
+Software Foundation; we sometimes make exceptions for this. Our
+decision will be guided by the two goals of preserving the free status
+of all derivatives of our free software and of promoting the sharing
+and reuse of software generally.
+
+ NO WARRANTY
+
+ 15. BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO
+WARRANTY FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW.
+EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR
+OTHER PARTIES PROVIDE THE LIBRARY "AS IS" WITHOUT WARRANTY OF ANY
+KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE
+LIBRARY IS WITH YOU. SHOULD THE LIBRARY PROVE DEFECTIVE, YOU ASSUME
+THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
+
+ 16. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN
+WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY
+AND/OR REDISTRIBUTE THE LIBRARY AS PERMITTED ABOVE, BE LIABLE TO YOU
+FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR
+CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE
+LIBRARY (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING
+RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A
+FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF
+SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH
+DAMAGES.
+
+ END OF TERMS AND CONDITIONS
+
+ How to Apply These Terms to Your New Libraries
+
+ If you develop a new library, and you want it to be of the greatest
+possible use to the public, we recommend making it free software that
+everyone can redistribute and change. You can do so by permitting
+redistribution under these terms (or, alternatively, under the terms of the
+ordinary General Public License).
+
+ To apply these terms, attach the following notices to the library. It is
+safest to attach them to the start of each source file to most effectively
+convey the exclusion of warranty; and each file should have at least the
+"copyright" line and a pointer to where the full notice is found.
+
+ <one line to give the library's name and a brief idea of what it does.>
+ Copyright (C) <year> <name of author>
+
+ This library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ This library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with this library; if not, write to the Free Software
+ Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+
+Also add information on how to contact you by electronic and paper mail.
+
+You should also get your employer (if you work as a programmer) or your
+school, if any, to sign a "copyright disclaimer" for the library, if
+necessary. Here is a sample; alter the names:
+
+ Yoyodyne, Inc., hereby disclaims all copyright interest in the
+ library `Frob' (a library for tweaking knobs) written by James Random Hacker.
+
+ <signature of Ty Coon>, 1 April 1990
+ Ty Coon, President of Vice
+
+That's all there is to it!
+
+
diff --git a/scratch/german-comments/text_cat/Copyright b/scratch/german-comments/text_cat/Copyright
new file mode 100644
index 0000000..c1e75d3
--- /dev/null
+++ b/scratch/german-comments/text_cat/Copyright
@@ -0,0 +1,21 @@
+Copyright (c) 1994, 1995, 1996, 1997 by Gertjan van Noord.
+
+ This library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ This library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with this library; if not, write to the
+ Free Software Foundation, Inc.,
+ 51 Franklin Street, Fifth Floor, Boston,
+ MA 02110-1301 USA
+
+cf. the file COPYING
+
+
diff --git a/scratch/german-comments/text_cat/LM/english.lm b/scratch/german-comments/text_cat/LM/english.lm
new file mode 100644
index 0000000..ab71632
--- /dev/null
+++ b/scratch/german-comments/text_cat/LM/english.lm
@@ -0,0 +1,400 @@
+_ 20326
+e 6617
+t 4843
+o 3834
+n 3653
+i 3602
+a 3433
+s 2945
+r 2921
+h 2507
+e_ 2000
+d 1816
+_t 1785
+c 1639
+l 1635
+th 1535
+he 1351
+_th 1333
+u 1309
+f 1253
+m 1175
+p 1151
+_a 1145
+the 1142
+_the 1060
+s_ 978
+er 968
+_o 967
+he_ 928
+d_ 888
+t_ 885
+the_ 844
+_the_ 843
+on 842
+in 817
+y 783
+n_ 773
+b 761
+re 754
+, 734
+,_ 732
+an 732
+g 728
+w 718
+_i 707
+en 676
+f_ 599
+y_ 595
+of 594
+_of 592
+es 589
+ti 587
+v 580
+_of_ 575
+of_ 575
+nd 568
+at 549
+r_ 540
+_w 534
+it 522
+ed 496
+_p 494
+nt 485
+_c 462
+o_ 457
+io 450
+_an 439
+te 432
+or 425
+_b 418
+nd_ 407
+to 406
+st 402
+is 401
+_s 396
+_in 389
+ion 385
+and 385
+de 384
+ve 382
+ha 375
+ar 366
+_m 361
+and_ 360
+_and 360
+_and_ 358
+se 353
+_to 347
+me 346
+to_ 344
+ed_ 339
+. 330
+be 329
+_f 329
+._ 329
+_to_ 320
+co 317
+ic 316
+ns 308
+al 307
+le 304
+ou 304
+ce 293
+ent 279
+l_ 278
+_co 277
+tio 275
+on_ 274
+_d 274
+tion 268
+ri 266
+_e 264
+ng 253
+hi 251
+er_ 249
+ea 246
+as 245
+_be 242
+pe 242
+h_ 234
+_r 232
+ec 227
+ch 223
+ro 222
+ct 220
+_h 219
+pr 217
+in_ 217
+ne 214
+ll 214
+rt 213
+s,_ 210
+s, 210
+li 209
+ra 208
+T 207
+wh 204
+a_ 203
+ac 201
+_wh 199
+_n 196
+ts 196
+di 196
+es_ 195
+si 194
+re_ 193
+at_ 192
+nc 192
+ie 190
+_a_ 188
+_in_ 185
+ing 184
+us 182
+_re 182
+g_ 179
+ng_ 178
+op 178
+con 177
+tha 175
+_l 174
+_tha 174
+ver 173
+ma 173
+ion_ 171
+_con 171
+ci 170
+ons 170
+_it 170
+po 169
+ere 168
+is_ 167
+ta 167
+la 166
+_pr 165
+fo 164
+ho 164
+ir 162
+ss 161
+men 160
+be_ 160
+un 159
+ty 159
+_be_ 158
+ing_ 157
+om 156
+ot 156
+hat 155
+ly 155
+_g 155
+em 153
+_T 151
+rs 150
+mo 148
+ch_ 148
+wi 147
+we 147
+ad 147
+ts_ 145
+res 143
+_wi 143
+I 143
+hat_ 142
+ei 141
+ly_ 141
+ni 140
+os 140
+ca 139
+ur 139
+A 138
+ut 138
+that 138
+_that 137
+ati 137
+_fo 137
+st_ 137
+il 136
+or_ 136
+for 136
+pa 136
+ul 135
+ate 135
+ter 134
+it_ 134
+nt_ 133
+that_ 132
+_ha 129
+al_ 128
+el 128
+as_ 127
+ll_ 127
+_ma 125
+no 124
+ment 124
+an_ 124
+tion_ 122
+su 122
+bl 122
+_de 122
+nce 120
+pl 120
+fe 119
+tr 118
+so 118
+int 115
+ov 114
+e, 114
+e,_ 114
+_u 113
+ent_ 113
+Th 113
+her 113
+j 112
+atio 112
+ation 112
+_Th 111
+le_ 110
+ai 110
+_it_ 110
+_on 110
+_for 109
+ect 109
+k 109
+hic 108
+est 108
+der 107
+tu 107
+na 106
+_by_ 106
+by_ 106
+E 106
+by 106
+_by 106
+ve_ 106
+_di 106
+en_ 104
+vi 104
+m_ 103
+_whi 102
+iv 102
+whi 102
+ns_ 102
+_A 101
+ich 100
+ge 100
+pro 99
+ess 99
+_whic 99
+ers 99
+hich 99
+ce_ 99
+which 99
+whic 99
+all 98
+ove 98
+_is 98
+ich_ 97
+ee 97
+hich_ 97
+n,_ 96
+n, 96
+im 95
+ir_ 94
+hei 94
+ions 94
+sti 94
+se_ 94
+per 93
+The 93
+_pa 93
+heir 93
+id 93
+eir 93
+eir_ 93
+ig 93
+heir_ 93
+_no 93
+ev 93
+era 92
+_int 92
+ted 91
+_The 91
+ies 91
+art 91
+thei 90
+_ar 90
+_thei 90
+their 90
+_pro 90
+et 89
+_pe 88
+_mo 88
+ther 88
+x 87
+gh 87
+S 87
+_is_ 87
+ol 87
+ty_ 87
+_I 86
+nde 86
+am 86
+rn 86
+nte 86
+mp 85
+_su 84
+_we 84
+par 84
+_v 84
+pu 82
+his 82
+ow 82
+mi 82
+go 81
+N 81
+ue 81
+ple 81
+ep 80
+ab 80
+;_ 80
+; 80
+ex 80
+ain 80
+over 80
+_un 79
+q 79
+qu 79
+pp 79
+ith 79
+ry 79
+_as 79
+ber 79
+ub 78
+av 78
+uc 78
+s._ 77
+s. 77
+enc 77
+are 77
+iti 77
+gr 76
+his_ 76
+ua 76
+part 76
+ff 75
+eve 75
+O 75
+rea 74
+ous 74
+ia 74
+The_ 73
+ag 73
+mb 73
+_go 73
+fa 72
+on,_ 72
+ern 72
+t,_ 72
+on, 72
+t, 72
+_me 71
diff --git a/scratch/german-comments/text_cat/LM/german.lm b/scratch/german-comments/text_cat/LM/german.lm
new file mode 100644
index 0000000..6f14f51
--- /dev/null
+++ b/scratch/german-comments/text_cat/LM/german.lm
@@ -0,0 +1,400 @@
+_ 31586
+e 15008
+n 9058
+i 7299
+r 6830
+t 5662
+s 5348
+a 4618
+h 4176
+d 4011
+er 3415
+en 3412
+u 3341
+l 3266
+n_ 2848
+c 2636
+ch 2460
+g 2407
+o 2376
+e_ 2208
+r_ 2128
+m 2077
+_d 1948
+de 1831
+en_ 1786
+ei 1718
+er_ 1570
+in 1568
+te 1505
+ie 1505
+b 1458
+t_ 1425
+f 1306
+k 1176
+ge 1144
+s_ 1137
+un 1113
+, 1104
+,_ 1099
+w 1099
+z 1060
+nd 1039
+he 1004
+st 989
+_s 952
+_de 949
+. 909
+_e 906
+ne 906
+der 880
+._ 847
+be 841
+es 829
+ic 796
+_a 791
+ie_ 779
+is 769
+ich 763
+an 755
+re 749
+di 732
+ein 730
+se 730
+" 720
+ng 709
+_i 706
+sc 683
+sch 681
+it 673
+der_ 652
+h_ 651
+ch_ 642
+S 630
+le 609
+p 609
+? 607
+? 603
+au 603
+v 602
+che 599
+_w 596
+d_ 585
+die 576
+_di 572
+m_ 562
+_die 559
+el 548
+_S 540
+_der 529
+li 527
+_der_ 523
+si 515
+al 514
+ns 507
+on 501
+or 495
+ti 490
+ten 487
+ht 486
+die_ 485
+_die_ 483
+D 479
+rt 478
+nd_ 476
+_u 470
+nt 468
+A 466
+in_ 464
+den 461
+cht 447
+und 443
+me 440
+_z 429
+ung 426
+ll 423
+_un 421
+_ei 419
+_n 415
+hr 412
+ine 412
+_A 408
+_ein 405
+ar 404
+ra 403
+_v 400
+_g 400
+as 395
+zu 392
+et 389
+em 385
+_D 380
+eine 376
+gen 376
+g_ 376
+da 368
+we 366
+K 365
+lt 360
+B 354
+_" 353
+nde 349
+ni 347
+und_ 345
+E 345
+ur 345
+_m 342
+ri 341
+ha 340
+eh 339
+ten_ 338
+es_ 336
+_K 336
+_und 335
+ig 335
+_b 335
+hen 334
+_und_ 332
+_au 329
+_B 327
+_da 325
+_zu 324
+_in 322
+at 321
+us 318
+wi 307
+n, 305
+n,_ 304
+nn 304
+te_ 301
+eit 301
+_h 300
+ter 299
+M 298
+n. 295
+? 294
+ng_ 289
+sche 289
+- 283
+rs 282
+den_ 282
+_si 280
+G 280
+im 278
+_ge 277
+chen 276
+rd 273
+_E 273
+n._ 270
+icht 270
+rn 268
+uf 267
+isch 264
+isc 264
+nen 263
+_in_ 262
+_M 260
+_er 257
+ich_ 255
+ac 253
+lic 252
+_G 252
+ber 252
+la 251
+vo 251
+eb 250
+ke 249
+F 248
+as_ 248
+hen_ 248
+ach 245
+en, 244
+ung_ 243
+lich 243
+ste 243
+en,_ 243
+_k 241
+ben 241
+_f 241
+en. 241
+_be 239
+it_ 239
+L 238
+_se 237
+mi 236
+ve 236
+na 236
+on_ 236
+P 235
+ss 234
+ist 234
+? 234
+ht_ 233
+ru 233
+st_ 229
+_F 229
+ts 227
+ab 226
+W 226
+ol 225
+_eine 225
+hi 225
+so 224
+em_ 223
+"_ 223
+ren 222
+en._ 221
+chen_ 221
+R 221
+ta 221
+ere 220
+ische 219
+ers 218
+ert 217
+_P 217
+tr 217
+ed 215
+ze 215
+eg 215
+ens 215
+?r 213
+ah 212
+_vo 212
+ne_ 211
+cht_ 210
+uc 209
+_wi 209
+nge 208
+lle 208
+fe 207
+_L 207
+ver 206
+hl 205
+V 204
+ma 203
+wa 203
+auf 201
+H 198
+_W 195
+T 195
+nte 193
+uch 193
+l_ 192
+sei 192
+nen_ 190
+u_ 189
+_den 189
+_al 189
+_V 188
+t. 188
+lte 187
+ut 186
+ent 184
+sich 183
+sic 183
+il 183
+ier 182
+am 181
+gen_ 180
+sen 179
+f? 178
+um 178
+t._ 177
+f_ 174
+he_ 174
+ner 174
+nst 174
+ls 174
+_sei 173
+ro 173
+ir 173
+ebe 173
+mm 173
+ag 172
+ern 169
+t,_ 169
+t, 169
+eu 169
+ft 168
+icht_ 167
+hre 167
+Be 166
+nz 165
+nder 165
+_T 164
+_den_ 164
+iche 163
+tt 163
+zu_ 162
+and 162
+J 161
+rde 160
+rei 160
+_we 159
+_H 159
+ige 159
+_Be 158
+rte 157
+hei 156
+das 155
+aus 155
+che_ 154
+_das 154
+_zu_ 154
+tz 154
+_ni 153
+das_ 153
+_R 153
+N 153
+des 153
+_ve 153
+_J 152
+I 152
+_das_ 152
+men 151
+_so 151
+_ver 151
+_auf 150
+ine_ 150
+_ha 150
+rg 149
+ind 148
+eben 148
+kt 147
+mit 147
+_an 147
+her 146
+Ge 146
+Sc 145
+_sich 145
+U 145
+Sch 145
+_sic 145
+end 145
+Di 144
+abe 143
+ck 143
+sse 142
+?r_ 142
+ell 142
+ik 141
+o_ 141
+nic 141
+nich 141
+sa 141
+_f? 140
+hn 140
+zi 140
+no 140
+nicht 140
+im_ 139
+von_ 139
+von 139
+_nic 139
+_nich 139
+eine_ 139
+oc 138
+wei 138
+io 138
+schen 138
+gt 138
diff --git a/scratch/german-comments/text_cat/text_cat b/scratch/german-comments/text_cat/text_cat
new file mode 100755
index 0000000..6c6b0d1
--- /dev/null
+++ b/scratch/german-comments/text_cat/text_cat
@@ -0,0 +1,229 @@
+#!/usr/bin/perl -w
+# © Gertjan van Noord, 1997.
+# mailto:vannoord at let.rug.nl
+
+use strict;
+use vars qw($opt_d $opt_f $opt_h $opt_i $opt_l $opt_n $opt_s $opt_t $opt_v $opt_u $opt_a);
+use Getopt::Std;
+use Benchmark;
+
+my $non_word_characters='0-9\s';
+
+# OPTIONS
+getopts('a:d:f:hi:lnst:u:v');
+
+# defaults: set $opt_X unless already defined (Perl Cookbook p. 6):
+$opt_a ||= 10;
+$opt_d ||= '/users1/vannoord/Perl/TextCat/LM';
+$opt_f ||= 0;
+$opt_t ||= 400;
+$opt_u ||= 1.05;
+
+sub help {
+ print <<HELP
+Text Categorization. Typically used to determine the language of a
+given document.
+
+Usage
+-----
+
+* print help message:
+
+$0 -h
+
+* for guessing:
+
+$0 [-a Int] [-d Dir] [-f Int] [-i N] [-l] [-s] [-t Int] [-u Float] [-v]
+
+ -a the program returns the best-scoring language together
+ with all languages which are $opt_u times worse (cf option -u).
+ If the number of languages to be printed is larger than the value
+ of this option (default: $opt_a) then no language is returned, but
+ instead a message that the input is of an unknown language is
+ printed. Default: $opt_a.
+ -d indicates in which directory the language models are
+ located (files ending in .lm). Currently only a single
+ directory is supported. Default: $opt_d.
+ -f Before sorting is performed the Ngrams which occur this number
+ of times or less are removed. This can be used to speed up
+ the program for longer inputs. For short inputs you should use
+ -f 0.
+ Default: $opt_f.
+ -i N only read first N lines
+ -l indicates that input is given as an argument on the command line,
+ e.g. text_cat -l "this is english text"
+ Cannot be used in combination with -n.
+ -s Determine language of each line of input. Not very efficient yet,
+ because language models are re-loaded after each line.
+ -t indicates the topmost number of ngrams that should be used.
+ If used in combination with -n this determines the size of the
+ output. If used with categorization this determines
+ the number of ngrams that are compared with each of the language
+ models (but each of those models is used completely).
+ -u determines how much worse a result must be in order not to be
+ mentioned as an alternative. Typical value: 1.05 or 1.1.
+ Default: $opt_u.
+ -v verbose. Continuation messages are written to standard error.
+
+* for creating new language model, based on text read from standard input:
+
+$0 -n [-v]
+
+ -v verbose. Continuation messages are written to standard error.
+
+
+HELP
+}
+
+if ($opt_h) { help(); exit 0; };
+
+if ($opt_n) {
+ my %ngram=();
+ my @result = create_lm(input(),\%ngram);
+ print join("\n",map { "$_\t $ngram{$_}" ; } @result),"\n";
+} elsif ($opt_l) {
+ classify($ARGV[0]);
+} elsif ($opt_s) {
+ while (<>) {
+ chomp;
+ classify($_);
+ }
+} else {
+ classify(input());
+}
+
+# CLASSIFICATION
+sub classify {
+ my ($input)=@_;
+ my %results=();
+ my $maxp = $opt_t;
+ # open directory to find which languages are supported
+ opendir DIR, "$opt_d" or die "directory $opt_d: $!\n";
+ my @languages = sort(grep { s/\.lm// && -r "$opt_d/$_.lm" } readdir(DIR));
+ closedir DIR;
+ @languages or die "sorry, can't read any language models from $opt_d\n" .
+ "language models must reside in files with .lm ending\n";
+
+
+ # create ngrams for input. Note that hash %unknown is not used;
+ # it contains the actual counts which are only used under -n: creating
+ # new language model (and even then they are not really required).
+ my @unknown=create_lm($input);
+ # load model and count for each language.
+ my $language;
+ my $t1 = new Benchmark;
+ foreach $language (@languages) {
+ # loads the language model into hash %$language.
+ my %ngram=();
+ my $rang=1;
+ open(LM,"$opt_d/$language.lm") || die "cannot open $language.lm: $!\n";
+ while (<LM>) {
+ chomp;
+ # only use lines starting with appropriate character. Others are
+ # ignored.
+ if (/^[^$non_word_characters]+/o) {
+ $ngram{$&} = $rang++;
+ }
+ }
+ close(LM);
+ #print STDERR "loaded language model $language\n" if $opt_v;
+
+ # compares the language model with input ngrams list
+ my ($i,$p)=(0,0);
+ while ($i < @unknown) {
+ if ($ngram{$unknown[$i]}) {
+ $p=$p+abs($ngram{$unknown[$i]}-$i);
+ } else {
+ $p=$p+$maxp;
+ }
+ ++$i;
+ }
+ #print STDERR "$language: $p\n" if $opt_v;
+
+ $results{$language} = $p;
+ }
+ print STDERR "read language models done (" .
+ timestr(timediff(new Benchmark, $t1)) .
+ ").\n" if $opt_v;
+ my @results = sort { $results{$a} <=> $results{$b} } keys %results;
+
+ print join("\n",map { "$_\t $results{$_}"; } @results),"\n" if $opt_v;
+ my $a = $results{$results[0]};
+
+ my @answers=(shift(@results));
+ while (@results && $results{$results[0]} < ($opt_u *$a)) {
+ @answers=(@answers,shift(@results));
+ }
+ if (@answers > $opt_a) {
+ print "I don't know; " .
+ "Perhaps this is a language I haven't seen before?\n";
+ } else {
+ print join(" or ", @answers), "\n";
+ }
+}
+
+# reads the text to classify and returns it as one string:
+# only the first $opt_i lines if -i was given, otherwise
+# all of the input (dies if the input is empty).
+sub input {
+ my $read="";
+ if ($opt_i) {
+ while(<>) {
+ if ($. == $opt_i) {
+ return $read . $_;
+ }
+ $read = $read . $_;
+ }
+ return $read;
+ } else {
+ local $/; # so it doesn't affect $/ elsewhere
+ undef $/;
+ $read = <>; # swallow input.
+ $read || die "determining the language of an empty file is hard...\n";
+ return $read;
+ }
+}
+
+
+sub create_lm {
+ my $t1 = new Benchmark;
+ my $ngram;
+ ($_,$ngram) = @_; #$ngram contains reference to the hash we build
+ # then add the ngrams found in each word in the hash
+ my $word;
+ foreach $word (split("[$non_word_characters]+")) {
+ $word = "_" . $word . "_";
+ my $len = length($word);
+ my $flen=$len;
+ my $i;
+ for ($i=0;$i<$flen;$i++) {
+ $$ngram{substr($word,$i,5)}++ if $len > 4;
+ $$ngram{substr($word,$i,4)}++ if $len > 3;
+ $$ngram{substr($word,$i,3)}++ if $len > 2;
+ $$ngram{substr($word,$i,2)}++ if $len > 1;
+ $$ngram{substr($word,$i,1)}++;
+ $len--;
+ }
+ }
+ ###print "@{[%$ngram]}";
+ my $t2 = new Benchmark;
+ print STDERR "count_ngrams done (".
+ timestr(timediff($t2, $t1)) .").\n" if $opt_v;
+
+ # as suggested by Karel P. de Vos, k.vos at elsevier.nl, we speed up
+ # sorting by removing singletons
+ map { my $key=$_; if ($$ngram{$key} <= $opt_f)
+ { delete $$ngram{$key}; }; } keys %$ngram;
+ #however I have very bad results for short inputs, this way
+
+
+ # sort the ngrams, and spit out the $opt_t frequent ones.
+ # adding `or $a cmp $b' in the sort block makes sorting five
+ # times slower..., although it would be somewhat nicer (unique result)
+ my @sorted = sort { $$ngram{$b} <=> $$ngram{$a} } keys %$ngram;
+ splice(@sorted,$opt_t) if (@sorted > $opt_t);
+ print STDERR "sorting done (" .
+ timestr(timediff(new Benchmark, $t2)) .
+ ").\n" if $opt_v;
+ return @sorted;
+}
diff --git a/scratch/german-comments/text_cat/version b/scratch/german-comments/text_cat/version
new file mode 100644
index 0000000..e6ba9d5
--- /dev/null
+++ b/scratch/german-comments/text_cat/version
@@ -0,0 +1,2 @@
+1.10
+
--
1.7.3.1
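
For anyone who would rather not read the Perl, here is a rough Python
sketch of what create_lm and classify in text_cat appear to do: build a
ranked list of 1- to 5-grams for the input, then sum up how far each
n-gram sits from its rank in every language model. The function names
(profile, ranks_from_lines, out_of_place) and the toy sample sentences
are made up for illustration and are not part of the patch. The sketch
also skips empty tokens and breaks ties alphabetically, which the Perl
script does not do, so treat it as an approximation only.

```python
import re
from collections import Counter

# Same separator class as $non_word_characters in text_cat: digits and whitespace.
NON_WORD = re.compile(r"[0-9\s]+")
LEADING_NGRAM = re.compile(r"[^0-9\s]+")


def profile(text, top=400):
    """Return the `top` most frequent 1..5-grams of `text`, most frequent first."""
    counts = Counter()
    for word in NON_WORD.split(text):
        if not word:
            continue  # simplification: the Perl script also counts empty fields
        padded = "_" + word + "_"
        for i in range(len(padded)):
            for n in range(1, 6):
                if i + n <= len(padded):
                    counts[padded[i:i + n]] += 1
    # Ties are broken alphabetically here to get a deterministic order.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return ranked[:top]


def ranks_from_lines(lines):
    """Map each n-gram of an .lm-style listing to its 1-based rank."""
    ranks = {}
    rank = 1
    for line in lines:
        match = LEADING_NGRAM.match(line)
        if match:  # lines without a leading n-gram are ignored
            ranks[match.group(0)] = rank
            rank += 1
    return ranks


def out_of_place(unknown, model_ranks, max_penalty=400):
    """Sum of rank distances; n-grams missing from the model cost max_penalty."""
    score = 0
    for i, gram in enumerate(unknown):
        rank = model_ranks.get(gram)
        score += abs(rank - i) if rank is not None else max_penalty
    return score


def classify(text, models, ratio=1.05, max_answers=10):
    """Return the best language plus all whose score is below ratio * best."""
    unknown = profile(text)
    scores = {lang: out_of_place(unknown, ranks) for lang, ranks in models.items()}
    ordered = sorted(scores, key=scores.get)
    best = scores[ordered[0]]
    answers = [ordered[0]]
    answers += [lang for lang in ordered[1:] if scores[lang] < ratio * best]
    # An empty list stands in for the "I don't know" message of the script.
    return answers if len(answers) <= max_answers else []


if __name__ == "__main__":
    # Toy models built from one sentence each (hypothetical stand-ins for the .lm files).
    samples = {
        "english": "the quick brown fox jumps over the lazy dog and the cat",
        "german": "der schnelle braune fuchs jagt den faulen hund und die katze",
    }
    models = {lang: ranks_from_lines(profile(text)) for lang, text in samples.items()}
    print(classify("the dog and the fox", models))
    print(classify("der hund und die katze", models))
```

With only english.lm and german.lm in the LM directory, as in this
patch, the answer is effectively a two-way choice, which is the reason
given above for not using the bundled libtextcat.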