[Python] Keyword extraction with the rake-nltk module
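Before building a service, I have been testing a feature that extracts the main keywords from a text. I used the RAKE (Rapid Automatic Keyword Extraction) algorithm, which is easy to implement in Python because ready-made modules exist. The module I used is rake-nltk. There is also a python-rake module, but I chose rake-nltk because it builds on nltk and so seemed likely to be the better option. PHP runs the .py file, then receives the result and prints it. The code below was written for feature testing, so it may not be suitable for production use as-is.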
<?php
$content = "Samsung Galaxy S8 64GB Unlocked Phone - International Version (Midnight Black) The revolutionary design of the Galaxy S8 begins from the inside out. We rethought every part of the phone's layout to break through the confines of the smartphone screen. So all you see is pure content and no bezel. It's the biggest, most immersive screen on a Galaxy smartphone of this size. And it's easy to hold in one hand. Meet the Galaxy S8 - Infinitely Brilliant. The Galaxy S8 has the world's first Infinity Screen, A screen without limits. The expansive display stretches from edge to edge, giving you the most amount of screen in the least amount of space. The revolutionary design of the Galaxy S8 begins from the inside out. We rethought every part of the phone's layout to break through the confines of the smartphone screen. So all you see is pure content and no bezel. It's the biggest, most immersive screen on a Galaxy smartphone of this size. And it's easy to hold in one hand. The Infinity Display has an incredible end-to-end screen that spills over the phone’s sides, forming a completely smooth, continuous surface with no bumps or angles. It’s pure, pristine, uninterrupted glass. And it takes up the entire front of the phone, flowing seamlessly into the aluminum shell. The result is a beautifully curved, perfectly symmetrical, singular object. Capture life as it happens with the Galaxy S8 cameras. The 12MP rear camera and the 8MP front camera are so accurate and fast that you won't miss a moment, day or night. Prying eyes are not a problem when you have iris scanning on the Galaxy S8. No two irises have the same pattern, not even yours, and they're nearly impossible to replicate. That means with iris scanning, your phone and its contents open to your eyes only. And when you need to unlock really fast, face recognition is a handy option. You never really stop using your phone. That's why Galaxy S8 is driven by the world's first 10nm processor. 
It's fast and powerful and increases battery efficiency. Plus, there's the ability to expand storage, and to work through rain and dust with IP68-rated performance. The Infinity Display sets a new standard for uninterrupted, immersive experiences. It enables an expanded screen size without necessitating a larger phone. So while the view is grander, Galaxy S8 feels small in your hand, making them easy to hold and use. You'll immediately notice the comfortable grip of the smooth curves, which allow you to hold on easily while you watch movies on the larger screen. And important short cuts are a swipe away, as the edge screen is available on Galaxy S8. Complimenting the uniform front is the equally smooth silhouette on the back. The rear camera sits perfectly flushed with the surface for a visually tranquil profile.";
$content = preg_replace('#[^a-z0-9\'\\s\.\?\,\!\*\&\$]#i', '', $content);
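// A 'cd' run through exec() only affects its own shell, so use chdir() to change the working directory.
chdir('/home/keyword');
// escapeshellarg() quotes the text so characters such as $ or * are not interpreted by the shell.
exec('python3 keywordExtract.py ' . escapeshellarg($content), $output);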
foreach($output as $out) {
print_r(json_decode($out));
}
?>
The code above is keywordExtract.php. Below is keywordExtract.py.
# -*- coding: utf-8 -*-
from rake_nltk import Rake
import json
import sys
r = Rake()
words = []
# The text to analyze arrives as the first command-line argument.
if len(sys.argv) > 1:
    content = sys.argv[1]
    r.extract_keywords_from_text(content)
    # words = r.get_ranked_phrases_with_scores()
    words = r.get_ranked_phrases()
# Print the ranked phrases as a JSON array so PHP can decode them.
print(json.dumps(words))
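For context on what the Rake calls above compute, here is a simplified sketch of RAKE-style scoring in plain Python. It is not rake-nltk's actual implementation: the stopword list is a tiny illustrative assumption, and the function names `candidate_phrases` and `rank_phrases` are hypothetical. The idea is that text is cut into candidate phrases at stopwords and punctuation, each word gets a score of degree divided by frequency, and a phrase's score is the sum of its word scores.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption for this sketch only).
STOPWORDS = {"a", "an", "and", "are", "has", "in", "is", "it",
             "of", "on", "so", "that", "the", "to", "with", "you"}


def candidate_phrases(text):
    """Split text into runs of consecutive non-stopwords, breaking at punctuation."""
    phrases = []
    # Punctuation always ends a phrase, so split into fragments first.
    for fragment in re.split(r"[^a-z0-9'\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                # A stopword closes the phrase collected so far.
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rank_phrases(text):
    """Return (score, phrase) pairs sorted from highest to lowest score."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)    # number of occurrences of each word
    degree = defaultdict(int)  # summed length of the phrases each word appears in
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    # Phrase score = sum of degree/frequency over its words.
    scored = {
        " ".join(phrase): sum(degree[w] / freq[w] for w in phrase)
        for phrase in set(phrases)
    }
    return sorted(((s, p) for p, s in scored.items()), reverse=True)


if __name__ == "__main__":
    sample = ("The 12MP rear camera and the 8MP front camera are fast. "
              "The rear camera sits flush.")
    for score, phrase in rank_phrases(sample):
        print(f"{score:.2f}  {phrase}")
```

Because a word's degree grows with the length of the phrases it appears in, long phrases tend to score highest, which is consistent with long phrases appearing at the top of a ranked list.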
PHP's exec function runs keywordExtract.py, and the script's result is captured and printed to the screen. The output looks like this.
Array
(
[0] => samsung galaxy s8 64gb unlocked phone international version midnight black
[1] => rear camera sits perfectly flushed
[2] => expanded screen size without necessitating
[3] => galaxy s8 infinitely brilliant
[4] => galaxy s8 feels small
[5] => never really stop using
[6] => galaxy s8 cameras
[7] => galaxy s8 begins
[8] => 12mp rear camera
[9] => screen without limits
[10] => visually tranquil profile
[11] => rethought every part
[12] => increases battery efficiency
[13] => important short cuts
[14] => first 10nm processor
[15] => 8mp front camera
[16] => expansive display stretches
[17] => incredible endtoend screen
[18] => unlock really fast
...............
)
The actual output contains more entries than shown above. The directory path in the PHP file's exec-related code should be adjusted to your environment.
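The first phrase in the output above is ten words long, which reads more like a product title than a keyword. One possible post-processing step, sketched below, keeps only short phrases and stops after a fixed number of them. The function `select_keywords` and its `max_words` and `top_n` parameters are hypothetical names for this illustration and are not part of rake-nltk; the input is assumed to be already ranked, best first.

```python
def select_keywords(phrases, max_words=4, top_n=10):
    """Keep the first top_n phrases that have at most max_words words.

    `phrases` is assumed to be ranked best first, like the list
    printed by keywordExtract.py.
    """
    selected = []
    for phrase in phrases:
        if len(selected) >= top_n:
            break
        # Count whitespace-separated words to filter out overly long phrases.
        if len(phrase.split()) <= max_words:
            selected.append(phrase)
    return selected


if __name__ == "__main__":
    ranked = [
        "samsung galaxy s8 64gb unlocked phone international version midnight black",
        "rear camera sits perfectly flushed",
        "expanded screen size without necessitating",
        "galaxy s8 infinitely brilliant",
        "galaxy s8 feels small",
        "never really stop using",
        "galaxy s8 cameras",
    ]
    print(select_keywords(ranked, max_words=4, top_n=3))
```

The thresholds are arbitrary starting points; suitable values depend on the kind of text being processed.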