Apache Tika is an open source library written in Java for detecting file types. A common use is validating files you accept from users, such as uploads through a web interface. This matters for security, since any file extension can be assigned to any file. Only by inspecting the content can you be confident of a file’s real type.
What This Tutorial Covers
- Installing Apache Tika
- Importing & Using Apache Tika
What You Need For This Tutorial
Java 8
Installing Apache Tika
To install Tika, add the following line to your Gradle build:
compile('org.apache.tika:tika-core:1.17')
Other install instructions can be found at Apache Tika Getting Started.
How Apache Tika Works
The first thing Tika does is check for magic numbers, which are bytes at the beginning of a file that indicate its file type. For example, .xlsx files always start with the bytes 50 4B (in hexadecimal), because an .xlsx file is a ZIP archive and 50 4B is the start of the ZIP signature.
A list of magic numbers can be found at: File Signatures
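To make the idea concrete, here is a minimal sketch of a magic number check in plain Java 8. It is not how Tika is implemented internally; the class name MagicNumberCheck and its methods are hypothetical and exist only for illustration. It compares the first bytes of a file against the 50 4B signature mentioned above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only; not part of Apache Tika.
final class MagicNumberCheck {

    // First two bytes of the ZIP signature ("PK"), shared by .xlsx files.
    static final byte[] ZIP_MAGIC = {0x50, 0x4B};

    // Returns true if data begins with every byte of magic, in order.
    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads only as many bytes as the signature needs, then compares them.
    static boolean fileStartsWith(Path file, byte[] magic) throws IOException {
        byte[] header = new byte[magic.length];
        int read = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (read < header.length) {
                int n = in.read(header, read, header.length - read);
                if (n < 0) {
                    break; // file is shorter than the signature
                }
                read += n;
            }
        }
        return read == magic.length && startsWith(header, magic);
    }
}
```

Note that 50 4B matches every ZIP-based file, not just .xlsx, so a check this simple can only narrow the type down to "some kind of ZIP container". Telling the formats apart requires looking further into the file.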
Some files don’t have magic numbers, though. In that case, Tika will attempt to determine whether the file is a text file by trying to detect its encoding. For example, UTF-8 encoded text files have to follow a certain byte pattern. Here is a quick explanation:
# The leading bits of the first byte indicate how many bytes a UTF-8 character will have (one to four). Every following byte starts with 10:
0xxx xxxx
110x xxxx 10xx xxxx
1110 xxxx 10xx xxxx 10xx xxxx
1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
# The x bits are filled in with the character's Unicode code point, which is just a binary number. This way we can save space using…