Help

Upload

You can upload either a single zip file or one or more text files, but not both. At present, TANIT (Text ANalysIs Tools) can analyse only Hungarian texts.

Zip

Every file in the zip should use the same encoding. The encoding must be set correctly; otherwise, errors may occur while the input files are processed. If the encoding is left blank, UTF-8 will be used.

The supported encodings are: utf-8, utf-16, utf-16-be, utf-16-le, us-ascii and iso-8859-1.

Only the files in the root of the zip will be processed. If the zip file includes a directory that contains further text files, the directory and its contents will be skipped during the analysis. The files must have the ".txt" extension; otherwise they will be skipped.
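The selection rule above can be illustrated with a short Python sketch. This is not TANIT's own code: the function name select_root_txt_files is hypothetical, and the sketch assumes that the extension check is case-sensitive, which the text above does not specify.

```python
import io
import zipfile


def select_root_txt_files(zip_bytes: bytes) -> list[str]:
    """Return the names of .txt files stored directly in the archive root."""
    selected = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # a directory entry is never analysed itself
            if "/" in info.filename:
                continue  # the file lives inside a subdirectory
            if not info.filename.endswith(".txt"):
                continue  # wrong extension (assumed case-sensitive)
            selected.append(info.filename)
    return selected
```

Running a check like this on a zip before uploading shows which files would be analysed and which would be silently skipped.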

Texts

The number of uploadable text files is limited to 50. If the limit is reached, no further upload option will appear.

The encoding of each file must be set correctly. If the encoding is left blank, UTF-8 will be used.
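The encoding rule shared by both upload modes (a blank value falls back to UTF-8, and only the listed encodings are accepted) can be sketched in Python as follows. The function name decode_upload and the choice to raise an error for an unsupported name are assumptions made for illustration; how TANIT reports such errors is not described here.

```python
# Encoding names accepted by the upload form, as listed in the Zip section.
SUPPORTED_ENCODINGS = {
    "utf-8", "utf-16", "utf-16-be", "utf-16-le", "us-ascii", "iso-8859-1",
}


def decode_upload(raw: bytes, encoding: str = "") -> str:
    """Decode an uploaded file, using UTF-8 when no encoding is given."""
    name = encoding.strip().lower() or "utf-8"
    if name not in SUPPORTED_ENCODINGS:
        raise ValueError(f"unsupported encoding: {encoding!r}")
    # A wrong (but supported) encoding may raise UnicodeDecodeError here,
    # or may silently produce garbled text.
    return raw.decode(name)
```

The last comment is the practical reason for setting the encoding carefully: a mismatch does not always produce an error, and garbled characters can distort every later count.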

Stopwords

If the "Enable stopwords" checkbox is selected, stopword filtering is applied before any distribution is computed.

The user can upload a stopword list. If this is left blank, but the checkbox is selected, the default stopword list will be used.
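As a minimal sketch of this behaviour, the Python below picks the stopword list according to the checkbox and the optional upload, then filters before counting. The names resolve_stopwords and count_tokens are hypothetical, DEFAULT_STOPWORDS is a tiny illustrative stand-in for the real default list, and exact, case-sensitive matching is an assumption.

```python
from collections import Counter

# Tiny illustrative stand-in; the real default list is not reproduced here.
DEFAULT_STOPWORDS = frozenset({"a", "az", "és"})


def resolve_stopwords(enabled, uploaded=None):
    """Choose the stopword set from the checkbox state and optional upload."""
    if not enabled:
        return frozenset()  # checkbox not selected: nothing is filtered
    if uploaded:
        return frozenset(uploaded)  # the user's list replaces the default
    return DEFAULT_STOPWORDS  # checkbox selected, upload left blank


def count_tokens(tokens, stopwords):
    """Count tokens after removing stopwords (assumed exact match)."""
    return Counter(t for t in tokens if t not in stopwords)
```

For example, count_tokens(tokens, resolve_stopwords(True)) counts with the default list, while passing an uploaded list as the second argument of resolve_stopwords uses that list instead.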

During topic modeling, the system always uses its own stopword list, whether or not the user uploaded one, and even if stopword filtering was not enabled. The reason for this is that the model was trained with the default stopwords.

Miscellaneous

The whole corpus can be converted to lowercase.

Every word containing only digits can be replaced with "NUM". For example, "100" will be replaced by "NUM", while "hundred" remains "hundred".

If selected, every word matching a regular expression designed to identify URLs will be replaced by the term "URL".
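The three options in this section can be sketched in Python as below. This is only an illustration: the regular expression TANIT actually uses for URLs is not given, so URL_PATTERN is a simplified assumption, as are the function name normalise, the restriction to ASCII digits, and the choice to leave the "NUM" and "URL" placeholders in upper case when lowercasing is enabled.

```python
import re

# Simplified, assumed URL pattern; the service's real expression is not documented here.
URL_PATTERN = re.compile(r"^(?:https?://|www\.)\S+$", re.IGNORECASE)
DIGITS_ONLY = re.compile(r"^[0-9]+$")


def normalise(tokens, lowercase=False, replace_numbers=False, replace_urls=False):
    """Apply the optional token-level transformations to a list of words."""
    result = []
    for token in tokens:
        if replace_urls and URL_PATTERN.match(token):
            token = "URL"
        elif replace_numbers and DIGITS_ONLY.match(token):
            token = "NUM"
        elif lowercase:
            token = token.lower()
        result.append(token)
    return result
```

With all three options enabled, a digits-only word such as "100" becomes "NUM", a word such as "hundred" is left unchanged, and an address starting with "http://" becomes "URL".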

Distributions

If selected, a part-of-speech (POS) distribution will be computed, listing every POS found in the corpus together with its number of occurrences. A distribution of punctuation marks will also be computed if this feature is selected.

The lemma distribution is computed in the same way, except that occurrences of lemmas are counted instead of parts of speech.
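Conceptually, both distributions are frequency counts over the analysed tokens. The Python sketch below assumes a hypothetical Token record with form, lemma and pos fields; the tag names used in the usage note ("DET", "NOUN", "VERB") are illustrative and are not necessarily the tags that TANIT reports.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    """Hypothetical record for one analysed token."""
    form: str
    lemma: str
    pos: str


def pos_distribution(tokens):
    """Count how often each part of speech occurs."""
    return Counter(token.pos for token in tokens)


def lemma_distribution(tokens):
    """Count how often each lemma occurs."""
    return Counter(token.lemma for token in tokens)
```

For the tokens Token("A", "a", "DET"), Token("kutyák", "kutya", "NOUN") and Token("ugatnak", "ugat", "VERB"), each POS tag and each lemma is counted once.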

Lexical richness

If any of the checkboxes is selected, the corresponding lexical richness metric is computed for every uploaded document.

Lexical richness metrics require POS distribution, so if any is selected, POS distribution will be computed whether it was requested or not.

Topic modeling

If selected, the LDA topic modeling algorithm will be executed on the uploaded corpus. The results will be displayed on two separate tabs: one contains the top 20 words of each topic, and the other shows the probability of each topic for every uploaded document. It is also possible to apply an LDA topic model based on the literature subcorpus of the Hungarian National Corpus.
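To show how a "top 20 words" view relates to a trained topic model, the Python sketch below ranks the words of each topic by weight. It does not run LDA itself. The input format (one word-to-weight dictionary per topic), the function name top_words and the alphabetical tie-breaking are all assumptions made for illustration.

```python
def top_words(topic_word_weights, n=20):
    """Return the n highest-weighted words of each topic, best first."""
    ranked = []
    for weights in topic_word_weights:
        # Sort by descending weight; break ties alphabetically for stable output.
        ordered = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
        ranked.append([word for word, _ in ordered[:n]])
    return ranked
```

The second tab is the complementary view: instead of words per topic, it lists one probability per topic for each document.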

Download

The results of the analysis can be downloaded as an xlsx document. The documents are stored on the server for 24 hours, but once you navigate away from the result page, the download link cannot be recovered.

The result of the morphological analysis can also be downloaded as a zip file. It contains two files for each input file: one lists the lemmas (one per line), and the other stores the full output of Magyarlanc (one token per line, fields separated by semicolons).
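The per-token output file can be read with a few lines of Python. The sketch below relies only on what is stated above (one token per line, semicolon-separated fields). The meaning and order of the fields are not documented here, so they are kept as plain lists; skipping blank lines and the function name parse_analysis are assumptions.

```python
def parse_analysis(text):
    """Split the analysis output into one list of fields per token line."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # assumed: blank lines carry no token
        rows.append(line.split(";"))
    return rows
```

Note that this naive split would misread a line whose token is itself a semicolon, so real files should be checked for that case.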

API

TANIT is built on the JAX-WS server-client architecture and uses Remote Procedure Calls (RPC) to perform computationally intensive tasks. Developers can access the service directly in two different ways, which are described below.

Use wsimport

If you want to use the service from Java code, the easiest way is to use wsimport. This command-line tool parses the WSDL file of the service and generates the bytecode (and, optionally, the source code) that you need in order to use the service.

The wsdl file is located at http://rgai2.inf.u-szeged.hu/ml-webserv/mlws?wsdl

For further information, see wsimport help.

Use it in a restful way

It is possible, though not trivial, to perform RPC from other languages as well.

JAX-WS uses SOAP (Simple Object Access Protocol) for communication between the client and the service. Some languages have support for sending SOAP envelopes, but languages without such support can be used as well. We provide a JavaScript implementation that can be used in web applications. It can be a good starting point if you want to implement the same functionality in another language.

You can find the implementation here. Please read the linked file for further documentation.
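For readers working in Python, building a SOAP envelope by hand might look like the sketch below. It is not a port of the JavaScript implementation. It assumes SOAP 1.1 and unqualified parameter elements, and the operation name, target namespace and parameter names must be taken from the WSDL. The function name build_envelope and any values passed to it are placeholders.

```python
import xml.etree.ElementTree as ET

# Standard SOAP 1.1 envelope namespace (assumed version for this service).
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"


def build_envelope(service_ns, operation, arguments):
    """Build a SOAP request as bytes for one operation call.

    service_ns, operation and the argument names are placeholders here;
    the real values are defined by the service's WSDL.
    """
    ET.register_namespace("soapenv", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    ET.SubElement(envelope, f"{{{SOAP_NS}}}Header")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{service_ns}}}{operation}")
    for name, value in arguments.items():
        # Parameters are written as unqualified child elements (an assumption).
        ET.SubElement(call, name).text = value
    return ET.tostring(envelope, encoding="utf-8", xml_declaration=True)
```

The resulting bytes would then be sent in an HTTP POST to the service endpoint with a "text/xml; charset=utf-8" content type, for example with urllib.request; the endpoint address should also be read from the WSDL rather than guessed.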