TextParser Library
Exporting Extraction Results to a Class/CSV

TextParser provides several techniques for extracting text from an input source and generating output as a JSON string. However, consider a scenario where a user wants to extract text from plain text and HTML documents using several extractors, and then store the results in a structured, reusable form. This walkthrough explains how to retrieve the extracted text into a custom user-defined class. It also demonstrates how to export the extraction result to a CSV file.

After completing this walkthrough, you will be able to:

  1. Extract text using the Template-Based extractor
  2. Retrieve extraction results in a custom class
  3. Export extraction results to CSV

As an example, consider a scenario where the user wants to extract all the ‘ERROR’ entries from a server log file (‘input.txt’). The input source is shown below.

2012-11-11 00:51:25,676 INFO - Starting Backup Manager 5.0.0 build 18536
2012-11-11 00:51:25,789 WARN - Generating Self-Signed SSL Certificate (alias = cdp)
2012-11-11 00:51:26,566 WARN - Saved SSL Certificate (alias = cdp) to Key Store /usr/sbin/r1soft/conf/comkeystore
2012-11-11 00:51:26,789 INFO - Operating System: Linux
2012-11-11 00:51:27,234 INFO - Architecture: amd64
2012-11-11 00:51:27,986 INFO - OS Version: 2.6.32-279.11.1.e16.x86_64
2012-11-11 00:51:28,123 INFO - Processors Detected: 1
2012-11-11 00:51:28,954 INFO - Max Configured Heap Memory: 989.9 MB
2012-11-11 00:51:29,276 ERROR - Unsuccessful: create index stateIndex on RecoveryPoint (state)
2012-11-11 00:51:29,980 ERROR - Index 'STATEINDEX' already exists in Schema 'R1DERBYUSER'.
2012-11-11 00:51:30,213 WARN - Invalid feature (0xECEBE6F7).
2012-11-11 00:51:30,736 INFO - Tomcat Wrapper starting
2012-11-11 00:51:30,800 INFO - Tomcat Wrapper started

Extracting information from this input file can help in troubleshooting errors quickly. In the input file above, each log entry follows a fixed structure that consists of four elements: the date, the time (to millisecond precision), the log type, and the description of the log. Because the structure is fixed, the Template-Based extractor is a good fit for extracting the desired text from the input file.
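
To make the four-part structure concrete, the sketch below splits one log line into date, time, log type, and description using the standard .NET Regex class rather than TextParser. The class name LogLineSketch, the Split method, and the pattern itself are illustrative assumptions and are not part of the library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LogLineSketch
{
    // Hypothetical pattern: date, time with milliseconds, log type, description.
    private static readonly Regex LinePattern = new Regex(
        @"^(?<date>\d{4}-\d{2}-\d{2}) (?<time>\d{2}:\d{2}:\d{2},\d{3}) (?<type>[A-Z]+) - (?<description>.*)$");

    // Returns the four elements of a single log line (passed without its line terminator),
    // or null when the line does not follow the expected structure.
    public static string[] Split(string line)
    {
        Match m = LinePattern.Match(line);
        if (!m.Success) return null;
        return new[]
        {
            m.Groups["date"].Value,
            m.Groups["time"].Value,
            m.Groups["type"].Value,
            m.Groups["description"].Value
        };
    }
}
```

The template defined in Step 1 describes the same structure declaratively, with the date and time broken down further into their numeric parts.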

Step 1: Extract text using Template-Based extractor

  1. Create a new application (any target that supports .NET Standard 2.0).
  2. Create a sample input text file named “input.txt” by copying and pasting the contents shown above, and place the file in the project’s root directory.
  3. Install the ‘C1.TextParser’ NuGet package in your application. For more information, refer to Adding NuGet Packages to your app.
  4. To create a template that defines the structure of a log entry (the text to be extracted from the input file), add a new XML file to your project. Name it ‘template.xml’ and add the following code to it.
    Note: For more information on defining a template, refer to ‘Defining the Nested Template’.

    <template rootElement="errorLog">

      <element name="date" childrenSeparatorRegex="-" childrenOrderMatter="true">
        <element name="year" extractFormat="int"/>
        <element name="month" extractFormat="int"/>
        <element name="day" extractFormat="int"/>
      </element>

      <element name="timeHMS" childrenSeparatorRegex=":" childrenOrderMatter="true">
        <element name="hour" extractFormat="int"/>
        <element name="minute" extractFormat="int"/>
        <element name="second" extractFormat="int"/>
      </element>

      <element name="time" childrenSeparatorRegex="," childrenOrderMatter="true">
        <element template="timeHMS"/>
        <element name="millisecond" extractFormat="int"/>
      </element>

      <element name="errorLog" childrenOrderMatter="true">
        <element template="date"/>
        <element template="time"/>
        <element extractFormat="regex:ERROR"/>
        <element name="description" startingRegex="-" extractFormat="regex:(.)+(?=(\r\n))"/>
      </element>

    </template>
    
  5. To extract the desired text from the input stream based on the above template, add the following lines of code to Program.cs. The code initializes and configures the TemplateBasedExtractor class, performs the text extraction, and displays the extracted result in JSON format on the console. The extraction result is returned as a value of type IExtractionResult.
    // Add these using directives at the top of Program.cs
    // (the C1.TextParser namespace is assumed to match the NuGet package name).
    using System;
    using System.IO;
    using C1.TextParser;

    // Open the stream that contains the user-defined XML template.
    FileStream templateStream = File.Open(@"template.xml", FileMode.Open);
    // Open the stream from which the data is extracted.
    FileStream inputStream = File.Open(@"input.txt", FileMode.Open);

    // Initialize the TemplateBasedExtractor class to parse input data that matches the template format.
    TemplateBasedExtractor templateBasedExtractor = new TemplateBasedExtractor(templateStream);

    // Extract the desired text from the input stream, then close the input and template streams.
    IExtractionResult extractedResult = templateBasedExtractor.Extract(inputStream);
    inputStream.Close();
    templateStream.Close();

    // Write the parsed result (in JSON format) to the console window.
    Console.WriteLine(extractedResult.ToJsonString());
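
The description element in the template ends with the lookahead (?=(\r\n)), so it only matches text that is followed by a Windows-style line ending. The sketch below illustrates this with the standard .NET Regex class. It assumes the template pattern follows .NET regular-expression semantics, and the LineEndingSketch class with its Extract and NormalizeToCrLf helpers is hypothetical, not part of TextParser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineEndingSketch
{
    // Same pattern as the "description" element in template.xml.
    private static readonly Regex Description = new Regex(@"(.)+(?=(\r\n))");

    // Returns the matched text, or null when no "\r\n" follows the text.
    public static string Extract(string text)
    {
        Match m = Description.Match(text);
        return m.Success ? m.Value : null;
    }

    // Hypothetical helper: rewrites bare "\n" line endings as "\r\n".
    public static string NormalizeToCrLf(string text)
    {
        return text.Replace("\r\n", "\n").Replace("\n", "\r\n");
    }
}
```

If input.txt was saved with Unix-style line endings and no descriptions are extracted, normalizing the line endings before extraction may help. Whether this is needed depends on how the file was created.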
    

Step 2: Retrieve extraction results in a custom class

  1. Define the following classes to map the extraction results to a custom class. Note that each property carries a DataMember attribute (from the System.Runtime.Serialization namespace) whose ‘Name’ value corresponds to the “name” attribute of the template element to which it should be mapped.
    public class TimeHMS
    {
        [DataMember(Name = "hour")]
        public int Hour { get; set; }
    
        [DataMember(Name = "minute")]
        public int Minute { get; set; }
    
        [DataMember(Name = "second")]
        public int Second { get; set; }
    }
    
    public class Time
    {
        [DataMember(Name = "timeHMS")]
        public TimeHMS TimeHMS { get; set; }
    
        [DataMember(Name = "millisecond")]
        public int MilliSecond { get; set; }
    }
    
    public class Log
    {
        [DataMember(Name = "description")]
        public String Description { get; set; }
    
        [DataMember(Name = "time")]
        public Time Time { get; set; }
    }
    
    public class Logs
    {
        [DataMember(Name = "errorLog")]
        public List<Log> ErrorLogs { get; set; }
    }
    
  2. Retrieve the extraction result into the custom class using the Get method of the IExtractionResult interface as shown:
    // Map the extraction result to the user-defined class "Logs".
    Logs logs = extractedResult.Get<Logs>();
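
Once the results are mapped, the nested objects can be used like any other .NET objects. The following sketch shows one possible way to collapse the nested Time object into a TimeSpan and print a compact line per entry. The classes are plain copies of the ones above (attributes omitted so the sketch has no dependencies), and LogFormatter, ToTimeSpan, and Describe are hypothetical names introduced only for illustration.

```csharp
using System;

// Plain copies of the tutorial's classes, without the DataMember attributes.
public class TimeHMS
{
    public int Hour { get; set; }
    public int Minute { get; set; }
    public int Second { get; set; }
}

public class Time
{
    public TimeHMS TimeHMS { get; set; }
    public int MilliSecond { get; set; }
}

public class Log
{
    public string Description { get; set; }
    public Time Time { get; set; }
}

public static class LogFormatter
{
    // Hypothetical helper: converts the nested time parts into a single TimeSpan.
    public static TimeSpan ToTimeSpan(Time time)
    {
        return new TimeSpan(0, time.TimeHMS.Hour, time.TimeHMS.Minute, time.TimeHMS.Second, time.MilliSecond);
    }

    // Hypothetical helper: formats one entry as "hh:mm:ss.fff description".
    public static string Describe(Log log)
    {
        return ToTimeSpan(log.Time).ToString(@"hh\:mm\:ss\.fff") + " " + log.Description;
    }
}
```

In the walkthrough, such a helper would be called for each item in logs.ErrorLogs, for example inside a foreach loop that writes LogFormatter.Describe(log) to the console.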
    

Step 3: Export extraction results to CSV

The extracted text can also be written to a CSV file. The steps below describe how:
  1. Add a new class file to the project and name it ‘CsvExportHelper.cs’. This class converts the IEnumerable collection containing the extraction results into a string in CSV format. Add the following code to the ‘CsvExportHelper.cs’ file:
    using System;
    using System.Collections;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;

    public static class CsvExportHelper
    {
        public static StringBuilder ExportList<T>(IEnumerable<T> list)
        {
            var stringBuilder = new StringBuilder();
            var properties = typeof(T).GetProperties();

            // Build the header row from the property names.
            stringBuilder.AppendLine(string.Join(",", properties.Select(p => p.Name)));

            if (list == null) return stringBuilder;

            // Build one row per item, with the same number of columns as the header.
            foreach (var item in list)
            {
                var values = properties.Select(p => Quote(p.GetValue(item).ToCustomString()));
                stringBuilder.AppendLine(string.Join(",", values));
            }
            return stringBuilder;
        }

        // Wraps a value in double quotes and doubles any embedded quotes.
        private static string Quote(string value)
        {
            return "\"" + value.Replace("\"", "\"\"") + "\"";
        }
    }

    public static class Extension
    {
        public static string ToCustomString(this object obj)
        {
            // A null value becomes an empty cell.
            if (obj == null) return string.Empty;

            Type objType = obj.GetType();
            if (objType.IsPrimitive || objType == typeof(string))
            {
                return obj.ToString();
            }

            StringBuilder sb = new StringBuilder();

            // Lists are flattened into a numbered, space-separated sequence.
            if (obj is IList)
            {
                int i = 1;
                foreach (object child in (IList)obj)
                {
                    sb.Append(i);
                    sb.Append(' ');
                    sb.Append(child.ToCustomString());
                    sb.Append(' ');
                    i++;
                }
                return sb.ToString();
            }

            // Any other object is flattened into "Name : value" pairs.
            var objProperties = objType.GetProperties();
            for (int i = 0; i < objProperties.Length; i++)
            {
                var prop = objProperties[i];
                sb.Append(prop.Name);
                sb.Append(" : ");
                sb.Append(prop.GetValue(obj).ToCustomString());
                if (i < objProperties.Length - 1)
                    sb.Append(' ');
            }
            return sb.ToString();
        }
    }
    
  2. Invoke the ExportList method of the CsvExportHelper class to convert the IEnumerable collection containing the extraction results into a string formatted in CSV format.
    // Convert the extraction results to CSV-formatted text.
    StringBuilder sb = CsvExportHelper.ExportList(logs.ErrorLogs);
    
  3. Finally, write the string content to a CSV file as shown:
    string str = sb.ToString();
    File.WriteAllText("ExtractErrorLogs.csv", str);
    
  4. Run the application. The extraction results are exported to "ExtractErrorLogs.csv", as shown in the image below:

        [Image: Extraction Result]