Code emulation is a technique for detecting changes in executable code that are attributable to computer viruses. It is used in antivirus software and in intrusion detection systems. The technique compares an executable file or memory dump from one system with an archival copy of the same file or dump saved on a remote system and looks for differences, ranging from virus-specific code changes to subtle changes in the binary’s machine code caused by compiler optimization.
If this method detects significant differences, such as functionality in the executable that was not present before, the file can be treated as likely infected without further analysis of whether the new code is actually malicious. Code emulation is used by many major antivirus programs, and its use continues to increase.
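The comparison against an archival copy can be illustrated with a minimal sketch. It assumes that the archive stores a SHA-256 digest of the reference binary rather than the binary itself; the function names `digest` and `matches_snapshot` are hypothetical and chosen only for illustration.

```python
import hashlib
import hmac


def digest(image: bytes) -> str:
    """Return the SHA-256 hex digest of a binary image."""
    return hashlib.sha256(image).hexdigest()


def matches_snapshot(current: bytes, snapshot_digest: str) -> bool:
    """Check whether the current image still matches the archived digest.

    Assumption: the archive holds a digest produced by digest() above.
    """
    # compare_digest performs a constant-time comparison of the two strings.
    return hmac.compare_digest(digest(current), snapshot_digest)


if __name__ == "__main__":
    reference = b"\x55\x89\xe5\x5d\xc3"      # illustrative bytes only
    stored = digest(reference)               # what the archive would keep
    modified = reference + b"\x90"           # one byte appended
    print(matches_snapshot(reference, stored))
    print(matches_snapshot(modified, stored))
```

A digest only answers whether anything changed; locating the change requires keeping the reference image itself or finer-grained digests.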
Code Emulation Technique:
The technique does not detect executable code that lies outside the range covered by the scanning program; it detects only code pages within a file that have changed as a result of compiler optimization or virus modification, including cases where a virus has been altered to use new instructions from non-standard compilers or to exploit new software vulnerabilities. Compiler optimization may also change other pages in memory, but changes that are irrelevant to virus detection are not reported by this check.
The techniques used in code emulation are very similar to those used in dynamic binary instrumentation. The main difference is that instrumentation techniques look at the binary’s assembly code directly, whereas code emulation works on machine-language binaries.
Compilers and virus writers may introduce implementation-specific optimizations into virus code to improve performance in specific circumstances. These may include optimizations of basic instructions (such as XOR), which will not be detected if only the differences between two copies of an executable are considered. The exact instructions that are optimized can vary from port to port and even from one compiler release to the next, and what counts as an “optimization” can likewise vary between systems.
One of the main problems in detecting code changes caused by optimization or other factors is that fundamental virus-detection techniques, such as comparing an executable binary with a reference binary (a “snapshot”), are not always reliable. In particular, compilers sometimes generate very short, or even unrepresentable, byte sequences during optimization. This can happen because compilers often remove unused bytes from certain instruction sequences (such as shuffles) without reserving space for those bytes in their output. If such a sequence remains in the binary but does not use the space reserved for it, the comparison flags it as a virus.
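Page-level change detection of this kind can be sketched as follows. This is a minimal illustration, not a description of any particular scanner: the 4096-byte page size and the helper names `page_digests` and `changed_pages` are assumptions made for the example.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size; real systems may use other sizes


def page_digests(image: bytes, page_size: int = PAGE_SIZE) -> list[bytes]:
    """Split an image into fixed-size pages and hash each page."""
    return [
        hashlib.sha256(image[offset:offset + page_size]).digest()
        for offset in range(0, len(image), page_size)
    ]


def changed_pages(reference: bytes, current: bytes,
                  page_size: int = PAGE_SIZE) -> list[int]:
    """Return the indices of pages that differ between the two images."""
    ref = page_digests(reference, page_size)
    cur = page_digests(current, page_size)
    changed = [i for i, (a, b) in enumerate(zip(ref, cur)) if a != b]
    # Pages that exist in only one of the images are reported as changed too.
    changed.extend(range(min(len(ref), len(cur)), max(len(ref), len(cur))))
    return changed


if __name__ == "__main__":
    reference = bytes(3 * PAGE_SIZE)   # three zero-filled pages
    current = bytearray(reference)
    current[5000] = 0x90               # modify one byte inside page 1
    print(changed_pages(reference, bytes(current)))
```

The check reports which pages changed but not why; a page altered by a recompile and a page altered by a virus look the same at this level.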
Significance:
By examining the differences between two machine-code images of an executable file, it is possible to detect a virus that has inserted its own code into the file and modified it, either by adding new instructions or by skipping existing ones. This allows viruses to be detected and removed even if they have been removed from the master copy of their code.
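One way to distinguish inserted bytes from removed ones is a sequence diff over the two images. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in; it is not the method of any specific antivirus product, and the helper name `classify_changes` is hypothetical. It is practical only for small images, because the matcher can be slow on large inputs.

```python
from difflib import SequenceMatcher


def classify_changes(reference: bytes,
                     current: bytes) -> list[tuple[str, int, int, int, int]]:
    """List the non-matching regions between two images.

    Each entry is (tag, ref_start, ref_end, cur_start, cur_end), where tag
    is 'insert', 'delete' or 'replace' as reported by SequenceMatcher.
    """
    # autojunk=False keeps frequently occurring byte values from being ignored.
    matcher = SequenceMatcher(None, reference, current, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


if __name__ == "__main__":
    reference = b"\x55\x89\xe5\x5d\xc3"                       # illustrative bytes
    patched = reference[:3] + b"\x90\x90" + reference[3:]     # two bytes inserted
    for tag, r1, r2, c1, c2 in classify_changes(reference, patched):
        print(tag, f"reference[{r1}:{r2}]", f"current[{c1}:{c2}]")
```

An `insert` entry corresponds to code added to the file, and a `delete` entry to instructions that are present in the reference but missing from the current image.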
Limitations:
Code emulation flags modifications to programs, but legitimate software may make the same kinds of modifications. This occurs because:
- changes due to compiler optimization or other factors are frequent;
- not all changes can be detected reliably, and no requirements exist regarding how software publishers should distribute or package program binaries.
- Code emulation techniques are used by many major antivirus and intrusion detection tools.
- The technique is similar in concept to dynamic binary instrumentation (DBI). DBI uses dynamic tables in the operating system, or kernel-mode hooks for drivers, to intercept system calls as they are made so that events can be logged. In Linux and Windows systems, it is possible to use virtual machine facilities exposed through the Linux kernel’s KDB device driver or Microsoft’s Native Win32 Subsystem V2.0 (NSSV2) device driver, respectively.
- The main difference between code emulation and dynamic binary instrumentation is that instrumentation techniques look at the binary’s assembly code directly, whereas code emulation works on machine-language binaries. Code emulation is useful only for detecting malware written to exploit the vulnerabilities of a particular operating system. For example, it is not useful for determining whether a specific version of the malware could exploit an arbitrary Windows vulnerability, or how the virus’s operation depends on the version of Windows it was written to exploit.
- Another major problem with DBI techniques is that they are not very consistent across different platforms and their performance can vary significantly from platform to platform.
Conclusion:
Code emulation techniques are not very effective unless the code being emulated is known to have been compiled specifically for the target platform. For example, emulation for a modern Linux-based system is wholly irrelevant if the binary was written specifically for an older Windows operating system. It is also difficult to determine whether more than one copy of a computer virus has existed at any given time. In the case of, say, an MS-DOS infection spread by floppy disks and infected applications, this issue may be addressed by using dynamic binary instrumentation or version merging to emulate as many versions of the virus variants as possible; in particular, all binaries generated during the relevant time period should be available in the same configuration.