This article focuses on analyzing unstructured data and making sense of it for investigation and forensic purposes. A snapshot of the analysis can also serve as a recovery point for redundancy. This space is growing rapidly thanks to the prominence of Facebook, Twitter, YouTube and other social networking sites. The data growth is huge, and we need a holistic strategy to address it.
A cyber crime unit needs a holistic view of data in both structured and unstructured forms. Structured forms include documents, presentations and databases; unstructured forms include social networks, emails, instant message conversations, voice and video. The key challenges in retaining and discovering the artifacts include:
1. Providing a 360-degree view of artifacts
2. Organizing, classifying and searching the artifacts
3. Visualizing the artifacts in various forms, e.g. by timeline, sender or receiver
4. Preserving, securing and recovering them based on policies
First, the unstructured data needs a robust platform on which to store and organize it. Hadoop would be a good choice for storing the data for persistence and analysis. The data needs to be converted into a star schema based structure, which organizes it into a semi-structured format. Hadoop also helps meet redundancy and high availability requirements.
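As a rough illustration of the star schema idea, the sketch below turns individual messages into one fact row plus entries in two dimension tables. It is an in-memory Python toy, not Hadoop code, and the class name `StarSchema`, its fields and the choice of dimensions (people and channels) are assumptions made for this example only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class StarSchema:
    """Tiny in-memory star schema: one fact table plus two dimension tables."""
    people: Dict[str, int] = field(default_factory=dict)    # dimension: person -> key
    channels: Dict[str, int] = field(default_factory=dict)  # dimension: channel -> key
    # Fact rows: (sender_key, receiver_key, channel_key, sent_at, body)
    facts: List[Tuple[int, int, int, str, str]] = field(default_factory=list)

    @staticmethod
    def _key(dimension: Dict[str, int], value: str) -> int:
        # Reuse the surrogate key if the value is known, otherwise assign the next one.
        return dimension.setdefault(value, len(dimension) + 1)

    def add_message(self, sender: str, receiver: str, channel: str,
                    sent_at: str, body: str) -> None:
        # Each message becomes one fact row that points at the dimension tables.
        self.facts.append((
            self._key(self.people, sender),
            self._key(self.people, receiver),
            self._key(self.channels, channel),
            sent_at,
            body,
        ))


schema = StarSchema()
schema.add_message("alice", "bob", "email", "2012-01-05T10:00:00", "hi")
schema.add_message("bob", "alice", "chat", "2012-01-05T10:05:00", "yo")
```

Keeping senders, receivers and channels in their own tables is what later makes it cheap to slice the same messages by timeline, sender or receiver.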
The next step is to make sense of the data through an analytical process. This can be powered by modern techniques such as text analytics, search and classification. The final step is to link the source data with the analyzed data through metadata management, e-discovery and linking.
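To make the search and classification step more concrete, here is a minimal sketch in Python. It builds a simple inverted index for keyword search and tags text with categories by keyword match. The categories in `CATEGORY_KEYWORDS` and all function names are hypothetical; a real deployment would more likely use a dedicated search engine and a trained classifier.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical categories and keywords, chosen only for illustration.
CATEGORY_KEYWORDS: Dict[str, Set[str]] = {
    "finance": {"invoice", "payment", "transfer"},
    "travel": {"flight", "hotel", "ticket"},
}


def tokenize(text: str) -> List[str]:
    # Lowercase the text and split it into alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    # Inverted index: token -> ids of the documents that contain it.
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    # Return the documents that contain every token of the query.
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


def classify(text: str) -> List[str]:
    # A category applies when the text shares at least one keyword with it.
    tokens = set(tokenize(text))
    return sorted(c for c, kws in CATEGORY_KEYWORDS.items() if tokens & kws)


docs = {
    "m1": "Please send the payment for the invoice",
    "m2": "Flight ticket booked, payment pending",
}
index = build_index(docs)
matches = search(index, "payment invoice")
labels = classify(docs["m2"])
```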
Source data can be managed through retention policies, data leak prevention and e-discovery modules that enforce governance through run-time triggers. The artifact life cycle is then managed according to how the analytical data is used at run time. This guarantees a link between the source and the places where it is used for intelligence purposes. The analytics also act as a derived copy of the source data and as a mechanism for tracing back to it without exposing the source data itself. The source data is pulled only on demand.
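One way to picture this linking is to let each analysis record carry only a reference to its source, with the source itself fetched on demand. The sketch below uses a SHA-256 content hash as that reference. The names `SourceStore`, `AnalysisRecord` and `analyze`, the word-count "analysis" and the boolean `authorized` flag are all assumptions for illustration, standing in for real policy and e-discovery modules.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class AnalysisRecord:
    source_ref: str  # content hash that links back to the source artifact
    summary: str     # derived data that is safe to share with analysts


class SourceStore:
    """Holds the raw artifacts; analysts only ever see references to them."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        # The SHA-256 digest of the content doubles as its reference.
        ref = hashlib.sha256(content).hexdigest()
        self._blobs[ref] = content
        return ref

    def fetch(self, ref: str, authorized: bool) -> bytes:
        # Placeholder policy check: the source is pulled only on demand
        # and only for an authorized request.
        if not authorized:
            raise PermissionError("source access requires authorization")
        return self._blobs[ref]


def analyze(store: SourceStore, content: bytes) -> AnalysisRecord:
    # Store the source, then keep only the reference and a derived summary.
    ref = store.put(content)
    summary = f"{len(content.split())} words"
    return AnalysisRecord(source_ref=ref, summary=summary)


store = SourceStore()
record = analyze(store, b"meet me at noon")
original = store.fetch(record.source_ref, authorized=True)
```

Because the reference is derived from the content, it can also be used to check that a fetched artifact has not changed since it was analyzed.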
A holistic view of structured and unstructured data can be achieved through semantic analysis and a context-sensitive view of the data. For example, enterprise systems can provide the details of a particular person based on SSN, and the same person can be looked up on Facebook to analyze their interaction patterns and how their personal information or activity has changed in relation to the enterprise. To be specific, whether a person parties all night and then drives while drunk can be assessed from the photos, activity and statuses posted on Facebook. A visualization can be created from this data and stored for investigation purposes. It serves as a snapshot of the data at a point in time, offering redundancy.
In summary, unstructured data intelligence aids cyber crime investigation because it operates on real-time data and can also carry a snapshot for redundancy. This intelligence can be leveraged for investigation purposes, as it saves a great deal of the time that would otherwise be spent doing the same work after the fact. Hadoop as a platform offers high availability and scales massively with the data volume.